
Why I love databases - strzalek
https://medium.com/@jeeyoungk/why-i-love-databases-1d4cc433685f
======
jandrewrogers
Nice article. I love databases too for similar reasons but, as someone who
designs database engines, some of the technical points are off the mark. I
never really stop learning in this area; the technical range is incredibly
deep and nuanced.

Some of the points that caught my eye as being quite off:

\- Contrary to footnote 2, modern database designs bypass the OS file system
cache and schedule their own I/O. This has an enormous performance impact
versus the OS cache (2-3x is pretty typical) and is a good litmus test for the
technical sophistication of the database implementation. It is the primary
reason many open source database engines, even "write-oriented" ones, have
relatively poor write performance to disk.
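
For what it's worth, here is a minimal, Linux-only sketch of the direct I/O
piece in Python (the file name and block size are made up; a real engine
layers its own buffer pool and I/O scheduler on top of this):

    import os
    import mmap

    # O_DIRECT bypasses the OS page cache, but requires the buffer, offset,
    # and length to be aligned to the device block size.
    BLOCK = 4096
    fd = os.open("data.bin", os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o644)

    buf = mmap.mmap(-1, BLOCK)      # anonymous mmap memory is page-aligned
    buf[:] = b"x" * BLOCK
    os.pwrite(fd, buf, 0)           # write goes straight to the device

    out = mmap.mmap(-1, BLOCK)      # aligned destination buffer for the read
    os.preadv(fd, [out], 0)         # read back, again bypassing the cache
    os.close(fd)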

\- The three data retrieval models enumerated are textbook but if you were
designing a new engine today you probably would not use any of them for a
general purpose design. Modern spatial access methods (ex: Hyperdex,
SpaceCurve, Laminar) are superior in almost every way though the literature is
much sparser. Also, a number of real-time analytical databases use bitmap-
structured databases (ex: ParStream), which are incredibly fast for some types
of query workloads.
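
For what it's worth, the bitmap idea is easy to sketch. A toy Python
illustration (no relation to ParStream's actual implementation): keep one
bitset per distinct column value, and a conjunctive filter becomes a single
bitwise AND over all rows at once.

    # rows of a tiny table
    rows = [
        {"country": "US", "plan": "pro"},
        {"country": "DE", "plan": "free"},
        {"country": "US", "plan": "free"},
        {"country": "US", "plan": "pro"},
    ]

    # (column, value) -> bitmap of matching row ids, stored as a Python int
    index = {}
    for i, row in enumerate(rows):
        for col, val in row.items():
            index[(col, val)] = index.get((col, val), 0) | (1 << i)

    # country = 'US' AND plan = 'pro' is one bitwise AND
    hits = index[("country", "US")] & index[("plan", "pro")]
    print([i for i in range(len(rows)) if hits & (1 << i)])   # [0, 3]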

\- Many distributed database challenges are a side effect of "one server, one
shard" type models. It is not necessary to do things this way, it is just
simpler to implement; some distributed database systems have thousands of
shards per server. The latter model is operationally more robust and better
behaved under failure and load skew.
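
A toy sketch of the "many shards per server" layout (server names and shard
counts are made up): when a server dies, its thousands of small shards can be
scattered across the survivors instead of dumping one huge shard onto a
single machine.

    import hashlib

    SERVERS = ["db1", "db2", "db3"]
    SHARDS_PER_SERVER = 1024
    NUM_SHARDS = len(SERVERS) * SHARDS_PER_SERVER

    # shard id -> owning server; rebalancing just reassigns entries here
    shard_map = {s: SERVERS[s % len(SERVERS)] for s in range(NUM_SHARDS)}

    def shard_for(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

    def server_for(key: str) -> str:
        return shard_map[shard_for(key)]

    print(server_for("user:42"))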

\- Tombstones are usually trivial if the database engine is properly designed.
Complications are a side effect of poor architecture. The big challenge for
tombstones is deciding when and how tombstoned records are garbage collected.
It is outside the scope of the normal execution pathways but you also don't
want a Java-like GC thread in the background.
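
A minimal sketch of the idea, assuming an eventually consistent key-value
store (nothing engine-specific): deletes write tombstones, and a separate
compaction pass collects them once it is safe to do so.

    import time

    TOMBSTONE = object()     # sentinel meaning "this key was deleted"
    store = {}               # key -> (write_timestamp, value)

    def delete(key):
        # A delete is just another write; keeping the tombstone around stops
        # a stale replica from resurrecting the key during anti-entropy.
        store[key] = (time.time(), TOMBSTONE)

    def compact(grace_seconds):
        # The hard part mentioned above: only drop tombstones old enough that
        # every replica must have seen them (a grace period stands in for
        # that check here).
        cutoff = time.time() - grace_seconds
        for key, (ts, value) in list(store.items()):
            if value is TOMBSTONE and ts < cutoff:
                del store[key]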

Of course, any of these is a long blog post in itself. :-)

~~~
BadassFractal
Doesn't Postgres rely on OS caching? Would you say it's missing out on large
performance gains based on that?

~~~
justinsb
I believe that active data is stored twice (once in the Postgres buffer pool,
and once in the FS cache). This is not ideal, because it is not making optimal
use of RAM. To minimize this effect, PG recommends a relatively small buffer
pool, which is not great if you believe the DB can do a better job than a
generic OS.

I think this is quite fixable as well (I think I've even fixed it myself once)
- just use O_DIRECT. Any PG maintainers able to tell me if PG still "double-
buffers" and whether this patch would be useful if I could recreate it?

~~~
jeffdavis
It _can_ be stored twice, but I don't think that's the ordinary case. Pages
that are hot in PG's buffer cache are likely to stay there, making the same
page in the OS buffer cache cold (because there aren't many requests for it).
That's not always true, because writes to hot pages will end up going through
the OS buffer cache maybe a couple times per checkpoint cycle, but it's still
not (on average) stored twice. I don't have empirical numbers here,
unfortunately, so someone else can correct me and fill in details.

At least from the discussions I've seen, there isn't a lot of interest in
using O_DIRECT or otherwise taking on the I/O scheduling problem in
postgres. It's not particularly exciting to me, because

* Takes on the I/O scheduling problem, rather than the simpler method now of just handing pages to the kernel for lazy writing, and fsync'ing at checkpoint time.

* Requires a lot of new code, tuning, configuration, etc. with a high maintenance cost.

* Not portable, so it's easy to make a tweak that helps on kernel and HW configuration X but hurts on kernel and HW configuration Y.

* Not very strategic: it helps some workloads with a lot of real I/O by some small constant factor, which doesn't necessarily open up new use cases or market opportunities

In my opinion, it's much better to focus on innovative features, or more low-
hanging performance gains (which exist in postgres), or scale-out features.
All of those will be slowed down if the code becomes bulkier (making correct
patches harder to write, especially for new developers) and the maintainers
become distracted by I/O scheduling issues.

It seems more like something to do when innovation slows enough, otherwise it
doesn't seem worth it.

~~~
justinsb
Fair enough. Thank you for the very thorough answer - and for saving me the
effort of putting together the patch!

------
pavlov
I hate databases. People tend to have way too much faith in them (or their
surrounding marketing), and thus make poor database choices that don't
actually fit the shape of their data. Persistence is fundamentally the
programmer's responsibility; a magic box behind a socket can't design it for
you.

Most applications I've seen wouldn't even need a database, but apparently a
lot of programmers are conditioned into believing that writing anything to
disk must involve building a database query string and transmitting it over a
socket to another process which parses the string, executes it on an
interpreter and stuffs the extracted data into a generic 1970s data model that
finally gets written to disk in an opaque format from where it can only be
retrieved by sending more strings over sockets. This stuff made sense when 1MB
was a huge amount of memory, but today it's just not necessary.

~~~
bbcbasic
Hmm. You are not thinking in business terms.

You run a software house: do you want your developers reinventing data storage
for each application, or using a fairly decent, ready-made data store, i.e. an
RDBMS?

Most of the time an RDBMS is a very good choice. Think about the tooling,
support, hiring knowledgeable people, etc. Let's face it, RDBMSes are good at
everything from the very small (a single table replacing a text file) up to
the very large.

In the RDBMS/SQL world you have everything from SQLite, to save to a single
file, up to Oracle for your enterprise app.

~~~
sklogic
Most of the time a plain text file is good enough. Databases are overrated.

~~~
ams6110
This is simply not true, otherwise we would all be using text files. Text
files can work in some situations. I disagree that this is "most of the time."

~~~
sklogic
We _are_ using text files. How many programs in a typical Unix installation
need to communicate to a database? Next to none. How many are communicating
via text files? Almost all of them.

~~~
SamReidHughes
How do you randomly access variable length fields in a text file and modify
them and add new ones?

~~~
sklogic
Why would I want to randomly access variable-length fields in a file, to start
with? It's a very niche thing to do. A filesystem driver may want to do
something like this, or maybe some large-scale caching proxy. In most cases
it's just an optimisation (like Berkeley DB records backing text files such as
/etc/passwd).

~~~
SamReidHughes
> Why would I want to randomly access variable length fields in a file, to
> start with? It's a very niche thing to do.

If by "niche" you mean every interactive website does it.

------
jeeyoungk
Thank you everybody for your valuable comments. Like I've said in the post,
this was my first attempt to explain my passion. I read every one of your
comments, and spent the next few days fully understanding their implications.

This post is a bit biased towards what I've interacted with. I think I do have
a good sample of various workflows, but there are obviously a large number of
databases and use cases that I'm missing, and my view of what is "modern" may
not incorporate a lot of bleeding-edge theory and technologies.

Operational complexity is definitely my #1 concern. The service I originally
maintained is an OLTP system, located at the top of the service & data
dependency graph. Availability is the top concern.

The current system I'm writing is a metrics database. The operational burden
is much lighter. It is almost a leaf node in the service & data dependency
graph, and I can take downtime to restart the cluster. A very different
workflow, indeed.

Thanks!

------
gfodor
"Designing Data-Intensive Applications" is shaping up to be an excellent
treatement of modern databases and their underpinnings. It's at an excellent
level of abstraction, deep enough to convey database internals while high
level enough (so far at least) to be able to cover a wide variety of database
systems. It also has its feet firmly planted in database history, and is
NoSQL-koolaid free. Highly recommended.

[http://shop.oreilly.com/product/0636920032175.do](http://shop.oreilly.com/product/0636920032175.do)

~~~
T-R
When you say "NoSQL-Koolaid free", do you mean "focuses on Relational" or
"treats the topic objectively/academically"?

~~~
andrioni
Not OP, but the latter. The author talks both about data models (relational,
document, graph-based, with some nice historical tidbits about IMS and
CODASYL) and storage models (from B-trees to SSTables, passing through bitmap
encoding and everything else).

It is an amazing book so far, and I'm pretty excited for the rest of it to be
written. It's the best text I've read so far on databases: it gives you all
the references you need if you want to go deeper, but still lets you get a
very good higher-level understanding without them.

~~~
T-R
That sounds fantastic. I'll give it a look.

------
Slackwise
I've always loved databases, but after having discovered write-only
timestamped databases like Datomic, I can't imagine going backwards. It's a
real shame that Datomic isn't fully open source.

(Aren't BigTable and Spanner also write-only and timestamped?)

~~~
collyw
Ok, assuming write-only means what I think it does, what is the point of a
write-only database? /dev/null is write-only.

~~~
michaelochurch
Append-only means that every fact is timestamped and nothing is ever thrown
away. (Actually, Datomic provides excision, because some regulatory use cases
require it, but it's a "black magic" feature.) The idea is that you should
never use UPDATE or DELETE, unless it's a hard requirement or a well-studied
optimization. The fact that it's 58 degrees today doesn't invalidate the fact
that it was 45 degrees yesterday. New facts may replace old ones relative to
time-sensitive queries ("what is the _most recent_ temperature?") but you
never get rid of the old ones.

This means that queries often have an additional parameter, which is the time.
So a type signature for a query type (in a Haskell-like language) would be:

    
    
        type Query a = ([Fact], Time) -> a
    

where the typical use case would use all stored facts and the current time.
(Obviously, the DB itself isn't going to do a full-table scan for typical
queries. There are optimizations involved. But, conceptually, a query is a
function over _all_ facts.)
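
A toy version of the same idea in Python (nothing Datomic-specific): facts are
only ever appended, and "the current value" is just a query over all facts up
to a point in time.

    from datetime import date

    # (entity, attribute, value, timestamp) - never updated, never deleted
    facts = [
        ("nyc", "temperature", 45, date(2015, 3, 1)),
        ("nyc", "temperature", 58, date(2015, 3, 2)),
    ]

    def value_as_of(facts, entity, attribute, as_of):
        relevant = [f for f in facts
                    if f[0] == entity and f[1] == attribute and f[3] <= as_of]
        return max(relevant, key=lambda f: f[3])[2] if relevant else None

    print(value_as_of(facts, "nyc", "temperature", date(2015, 3, 2)))  # 58
    print(value_as_of(facts, "nyc", "temperature", date(2015, 3, 1)))  # 45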

~~~
dllthomas
I've actually recently been thinking of extending this logic to my Anki
geography deck. Populations change, but population at some specific point in
time shouldn't.

------
quonn
"The study of databases intersects almost every topic in computer science" \-
I've heard this before, especially for Compilers. But it has been false for a
long time, CS is far more diverse now. For example, how do Databases intersect
AI/Machine Learning/Computer Vision, Computer Graphics, Numerics/Simulation,
Robotics, Bioinformatics, Computer Architecture or Cryptography?

~~~
JadeNB
> how do Databases intersect AI/Machine Learning

I am not in any way a computer-science expert, but surely these topics have
substantial overlap? (For example, a machine that is learning must store its
accumulated knowledge somewhere.)

~~~
quonn
There is data mining, which is like machine learning + databases. But for
many, many machine learning problems, the learning occurs offline and the data
is simply stored in files; then the learning algorithm is run. The result is a
model, which in many cases has a small constant-size representation (common
exceptions: kNN and SVM). There is no database involved. Now you can of course
try to learn from an existing database (data mining) or from huge amounts of
data, probably stored in a distributed database (big data). But many systems
won't, and even then the database part is usually not of much concern. So I
still don't see much overlap.

~~~
siddboots
The overlap goes well beyond the question of whether a literal database
software system is used: knowledge representation (as in modern AI), feature
representation (as in ML), and relational data modelling are essentially all
different branches of the same domain of study.

Consider, for example, the connection with logic resolution languages like
Prolog.

------
Pxtl
I love databases, but I loathe SQL. And no, I don't mean NoSQL is better -
that's throwing out the baby with the bathwater.

To me, SQL is the Common Lisp of relational languages - a brilliant invention
of its time that has long since overstayed its welcome and should be replaced
by a modern rethinking of the problem it solves. The difference is that there
are a million rethinks and descendants and redesigns of Lisp out there that
happily threw out the mistakes and made great strides in the language. You
could argue that _every_ modern programming language is a descendant of Lisp
thanks to the prevalence of great concepts like lexical closures. SQL, on the
other hand, has a teeny tiny few spiritual fringe descendants, like the
various attempts at Date and Darwen's "Tutorial D".

I love the relational model, but who says the only way to manage the
relational model is this hoary old thing? It's immensely frustrating that
every implementation of SQL bolts on a tacky and half-assed procedural
language, but doesn't solve simple underlying frustrations.

Simply accessing related objects is immensely wordy for a "relational"
language. In an Algol-derived language, I can say
Group.Manager.Person.Address.PostalCode to walk the graph. In SQL, I have to
deal with a zillion joins.

Yes, some SQL variants let you join by the foreign key name to make the join a
little more terse, but it's still hairy compared to every modern functional or
procedural language.

And the APIs - maybe the reason so many sites have SQL injection problems is
the hideous APIs. Ever tried to build a WHERE IN (id1, id2, id3... idN)
statement with properly parametrized queries? Holy crap, what an icky mass of
boilerplate. I mean, it's not a _hard_ problem, but how many times have you
solved it, and how many times have you found a tedious bug in your solution?
Just give me a proper way to concatenate the parametrization _inline_ with the
query FFS.

    
    
      db.RunQuery("""
        SELECT * 
        FROM MYTABLE 
        WHERE ID IN + " + db.SomeParameterListFunc(a, b, c) + " 
        ORDER BY HOLYCRAP_WAS_THAT_SO_HARD"
      """);
    

The above syntax would be trivial in any language with operator overloading on
the "+" sign, on the off chance that your SQL dialect is so messy that it's
impossible to safely build a properly-escaped initializer for the list
containing a, b, and c in text form.
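
For reference, this is roughly what the boilerplate looks like today with a
placeholder-based API (Python's sqlite3 used as a stand-in; the table and
values are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c")])

    ids = [1, 3]
    placeholders = ", ".join("?" for _ in ids)   # "?, ?"
    sql = f"SELECT * FROM mytable WHERE id IN ({placeholders}) ORDER BY id"
    print(conn.execute(sql, ids).fetchall())     # [(1, 'a'), (3, 'c')]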

And that's not even getting into real actual first-class language support like
ORMs give you.

And speaking of APIs, the fact that a single "SELECT" is the baseline
operation... that you work on one resultset at a time. I don't want a single
pile of rows. This is not an Excel spreadsheet, it's a relational database,
and that means I want a _graph_ of data. I don't want to write three queries
to get my Customers, their Personnel, and their Addresses, nor do I want a
single row of CustomerPersonnelAddresses. Once and Only Once is good for the
data, why the heck isn't it good for result sets?

Where's the code reuse? Why can't I have a pile of SELECTs and a pile of
WHEREs and combine them however I see fit? Oh right, I can use a VIEW... but
see the previous point, a VIEW is a single glorified Excel spreadsheet, not a
proper graph of data. If I want to bundle a bunch of SELECTs together, I have
to write a stored procedure, but then I can't use the proc with a JOIN
statement against my VIEW that provides a custom WHERE clause. You could do
something monstrous with table-valued parameters, I guess, but those aren't
generally well-supported at the API level. This is not a hard problem in any
modern language (except Go, of course - yes, you _do_ freaking need
map/reduce/filter).

Namespaces. Real, actual, organizational tools for your giant list of 9000
tables and their related objects. No, schemas don't freaking count: you can't
nest them, and they're overly tied to the security model - using schemas for
organization instead of security leads to madness.

And of course, so many common problems simply _aren't_ nice to work with
using the relational model. How do I make a nice audited row where I have the
full history of all the row's changes? Well, I can insert it every time, but
that's a lot of wasted space. Yes, again, there are ways to do this, but it's
something I'd expect to come out of the box since it's such a common problem.
Common problems should be solved by the standard library. But SQL can't solve
things like this with the standard library, because SQL standard libraries
are limited to crude things like functions and views and procs, not actual
large-scale reusable constructs. It's like a programming language where they
gave you a bunch of general tools for manipulating Unicode code points and
dynamically sized arrays but no coherent "string" object.

Or _trees_. Holy crap, you have a "relational" database where a relationship
like a "tree" is a nightmare to actually query out! I know that's not what
"relational" means, but still - this isn't exactly a rare edge case, y'know?
But it's not in the standard library because the standard library is limited
to crude objects like data types, functions, procedures, etc. that work below
the row level. Any concept of reusable _schema_ concepts is completely left
off the table.

/rant

~~~
boyter
The language issue you raise is in part solved by C# and Entity Framework.

Because of the lazy way it works, you can define a single select over a table,
then apply extension methods, joins, etc. over the return of this select, and
you end up with a highly reusable, clean, easily tested data layer. Done
right, you get code like the following:

    
    
      var url = _urlRepo.GetUrl().OrderByCreateDate().ByCreatedUser("boyter");
      var url2 = _urlRepo.GetUrl().OrderByCreatedUser().ByDomain("geocities");
    

Or with joins,

    
    
      var users = _userRepo.GetUser();
      var locations = _locationRepo.GetLocation();
    
      var result = from user in users
                   join location in locations on user.locationid equals location.id
                   where location.name == "Parramatta"
                   select user;
    

I wrote about this a bit here [http://www.boyter.org/2013/07/clean-repository-
data-access-c...](http://www.boyter.org/2013/07/clean-repository-data-
access-c/) and it is to date the cleanest way I have come across in any
language to deal with SQL. Compile time checked, reusable and testable.

It's the best ORM I have found so far across any language. I would dearly love
to have it ported to other languages and frameworks.

~~~
NDizzle
It's funny you mention entity framework. That's what blows up tempdb on one of
my many environments with some of the stupidest queries I've ever seen hit the
server.

Maybe your experience with entity is better than mine, but I personally think
it and most other ORMs are complete garbage.

~~~
dasboth
I agree with this (sort of). I find Entity Framework very usable and it does
help you write descriptive and maintainable code, but the SQL queries it ends
up writing make it a performance nightmare. YMMV, of course, but for
applications where performance is important I haven't found EF to perform
adequately. As ORMs go, all I really want is an abstraction layer and the
mapping from query results to POCO classes without the performance hit.
Currently using Dapper.

------
read
> You cannot give up partition tolerance.

It'd be more accurate to say you don't want to give it up all the time. You
don't want CAP; you want PACELC.

[http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-
an...](http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-
little.html)

~~~
jeeyoungk
This is an interesting categorization - somewhat like adding an axis to the
political compass. Thanks.

------
rattray
I wonder why RethinkDB hasn't gotten a mention here. Have folks here used it?
Thoughts?

------
cfolgar
Since we're discussing databases... is there any "gold standard" learning
resource/introduction to PostgreSQL? As a college student, I do not have much
experience with databases yet, but I am aware of how important it is to be
comfortable with them in day-to-day development. Something tells me that my
usual approach of diving in and tinkering by building out an idea wouldn't
serve me as well for DBs; it just seems to me that there are some fundamental
database concepts that I would be missing if I went down that path.

Any advice as to where to start a structured approach to learning about
databases would be highly appreciated :)

~~~
wowmsi
I would recommend "Architecture of a Database System" by Joe Hellerstein,
Michael Stonebraker, and James Hamilton [0] as a start if you want a formal
perspective without getting too deep into theory.

[0]
[http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf](http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf)

~~~
jbergens
As a sidenote, Michael Stonebraker himself is nowadays building VoltDB, an
in-memory DB that he thinks is much more evolved than traditional relational
systems. I haven't used it, but it looks interesting. I did try OrigoDB, which
is an open-source .NET-based in-memory DB, and it worked great for my needs;
the queries are really fast.

~~~
bballard
(I work at VoltDB)

VoltDB is easy to try. There is a free 30-day trial available on voltdb.com,
and a docker-based demo ([http://voltdb.com/voltdb-
demo](http://voltdb.com/voltdb-demo)).

------
Roboprog
I have lived a very sheltered existence. I have never worked on an application
which had a database cluster or sharding, rather than just running on a single
server.

Of course, the servers are a little bigger now than they were back around say
1990.

~~~
ams6110
These things are necessary only at internet scale. Your typical business
application does not need a sharded, distributed database, even in a large
enterprise system.

------
fogleman
I love databases for the data they contain. And for the ability to make sense
of that data more easily when it's in a nice, structured format.

I don't care as much for the dev-ops side of it.

------
pm90
What I find both fascinating and scary about databases is having to choose
among the wide variety of databases without understanding exactly how they
work. And it doesn't help that there are new databases springing up all the
time.

Is there a way for application developers to understand these databases
quickly, without spending weeks working with them?

~~~
bbcbasic
Use a "NoSQL" database when you have very large amounts of data (so you need
1000s of separate disks) or large volumes of queries (so you need a 1000s of
separate servers). Otherwise MySQL, SQL Server, Oracle etc. is probably fine.

NoSQL has an learning and 'new technology' overhead that isn't worth paying
unless the pain of using traditional databases is too high.

Don't forget SQLLite too - nice for the very small applications.

~~~
nickik
There are many other reasons to use "NoSQL", not just big data. Sure, MySQL
would probably be fine, but somebody might just think that RandomDB might be
nicer to use, easier to set up, closer to their data model, or any number of
other things.

> NoSQL has a learning and 'new technology' overhead that isn't worth paying
> unless the pain of using traditional databases is too high.

You say that as if everybody is born with knowledge of SQL. I would of course
still teach SQL to every new programmer, but I'm just pointing out that this
is not the best argument.

~~~
bbcbasic
Tooling support, and the ease of hiring both programmers who are familiar with
querying and DBAs, are much better on the relational side.

The pain caused by relational being a 'bad fit' needs to be pretty high to
consider NoSQL, in my opinion - except for hobby projects (I did some CouchDB
stuff in my spare time for the thrill, but MySQL would have been adequate).

True, no one is born with SQL knowledge (learning to breathe is the first
priority). However, stick an ad up for an experienced Cassandra DBA and then
one for an experienced Oracle DBA and see how you do.

------
tkyjonathan
I like SQL. It makes sense to me, personally. Even when a query gets really
unwieldy, I can always take a step back and break it down into parts. If I'm
allowed to use temp tables, I can do almost anything in it, and usually a lot
faster than developers can in their own language.

------
metaphorm
this was a great article. extremely helpful to me, coming from a position of
being a server-side application developer often tasked with getting several
disparate data-stores talking to each other.

the author does a great job of showing the full depth of the field while
providing useful hooks and links for further study. this one made my bookmarks
folder. i'll surely be going back to check it out again.

------
smartpants
This is what I needed. Great read.

------
digital-rubber
In my opinion, something more important: does the database love your data?
Only then can you truly love your database.

------
marknadal
I'm glad he loves databases; databases have been the bane of my existence.

However, the torment they have given me has also led to a similar fascination
- and now I'm writing my own database! So I've become very familiar with the
topics he writes about, and they are very good points for anybody interested
in the subject.

Why would I write my own database? Because databases are hard, and I am
determined to make them easy (even if that means sacrificing years of my life
doing all this crazy research). Check out
[http://github.com/amark/gun](http://github.com/amark/gun) !

\- CAP Theorem: he is correct, P cannot be sacrificed. GUN is AP with eventual
consistency. The beauty of this, though, is that you can always build strong
consistency out of eventual consistency (it just requires knowing X number of
peers in advance, and doing a trivial lock until you've heard back from all of
them - in fact, I do this in one of the example apps), but you can never go
from strong consistency down to an eventually consistent system.
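
A toy sketch of that "hear back from all known peers" idea (the peer names and
the send() helper are hypothetical, and this is not GUN's API): a write is
only acknowledged once every peer has confirmed it.

    PEERS = ["peer-a", "peer-b", "peer-c"]   # must be known in advance

    def send(peer, key, value):
        # stand-in for a network call; assume it returns True on ack
        return True

    def consistent_put(key, value):
        acks = sum(1 for peer in PEERS if send(peer, key, value))
        if acks < len(PEERS):
            raise RuntimeError("write not confirmed by all peers")
        return True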

\- Distributed Systems: this is incredibly, incredibly important. I cannot
repeat this enough times, there should be no "master" or "single source of
truth" in any database. If there is, you're going to have a nightmare of a
time being woken up at 3am to fix it when it crashes (my personal experience
with other databases). Why? Because single points of failure will fail, and
centralized systems suck. Solution: Distribute and Decentralize! We make this
easy for you.

\- Correctness vs Efficiency: as he says, Paxos is difficult - all of them,
Raft, Quorum, leader election, etc. DO NOT USE THEM unless you are Google,
Amazon, Walmart, or whatnot. Even then, do not use them. Instead, I've solved
this challenging problem by developing a new conflict resolution system that
(very poorly) can be summarized as Vector Clocks + Timestamps; you get the
advantages of both without either of their weaknesses. What this means is that
data integrity is guaranteed because every machine is using a deterministic
algorithm, without any extra gossip between machines. Let me repeat: you'll
get the same result on every machine, eventually consistent, without any
multi-machine coordination. This means every peer is a master, and that is
awesome, even if you are running it on an ephemeral server/cloud - completely
resilient to failures, terminations, restarts, and reboots.
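
A toy sketch of what a deterministic resolver looks like in general (this is
not GUN's actual algorithm): every machine compares the same fields in the
same order, so they all pick the same winner without talking to each other.

    def resolve(current, incoming):
        # current / incoming are (state_counter, timestamp, value) tuples
        if incoming[0] != current[0]:
            return incoming if incoming[0] > current[0] else current
        if incoming[1] != current[1]:
            return incoming if incoming[1] > current[1] else current
        # final deterministic tie-break on the value itself
        return incoming if str(incoming[2]) > str(current[2]) else current

    a = (3, 1500000000.0, "red")
    b = (3, 1500000000.0, "blue")
    assert resolve(a, b) == resolve(b, a)   # same answer on every machine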

\- Empowering the App. Yes. Databases should serve you, not the other way
around. Answers to his questions about abstractions are at
[http://github.com/amark/gun](http://github.com/amark/gun) .

\- Operational Challenges. This is where I diverge from him. If something
seems wrong, like things suddenly becoming slow, you can easily just restart
it without any damage/harm/failure occurring. And then you can look through
your logs, taking your time, to see what went wrong.

\- Basic Building Blocks. Because GUN is a graph database, you get key-value
like access, as well as documents and relational styles. That is because
mathematically graphs are the superset of relational algebra and hierarchy
trees.
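
A toy illustration of that claim (nothing to do with GUN's actual API): a
graph stored as nodes plus edges can be read as plain key-value pairs, as
nested documents, or as relations by following edges.

    graph = {
        "person/1": {"name": "Ada", "manager": "person/2"},  # edge to another node
        "person/2": {"name": "Grace"},
        "config":   {"theme": "dark"},                       # plain key-value
    }

    print(graph["config"]["theme"])                      # key-value access
    print(graph[graph["person/1"]["manager"]]["name"])   # follow the edge: Grace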

Happy to answer any questions!

~~~
kapv89
Hey, can you suggest some good reads for understanding databases enough to
build a simple relational database ...

~~~
jerrysievert
Coming soon! (I also write my own databases, because it's fun.)

~~~
marknadal
Jerry, would love to see your list. Have any links to your database work?

------
MrBra
Well, the intro was promising and full of energy. Like "I am really going to
transmit to you some good love for databases and explain why it's so
stimulating to deal with them!", but then right after this sparkling start
it's all just about the same old redundancy, consistency, scaling...

