
How does a relational database work? - mgachka
http://coding-geek.com/how-databases-work/
======
why-el
Good write up. Another excellent resource straight out of the UC Berkeley
Database Group that I keep close by is "Architecture of a Database System"[1]
by three researchers in the field. It is very readable.

[1]
[http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf](http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf)

~~~
mikethedyke
how do you find these articles?

~~~
why-el
Honestly for this one I do not remember since it was two years ago, but I'd
start over at [http://arxiv.org/](http://arxiv.org/), which is a treasure.
Granted most are research papers, but even those will link to overview papers,
of which the one I link to is an example.

------
jandrewrogers
Good overview of traditional OLTP architectures. As complicated as they look
from the article, it is just scratching the surface of a sophisticated
implementation. There are many internals common to more advanced designs that
are not even mentioned, and the article is already quite long!

The thing I love most about database engines is that there is probably more
hardcore computer science per line of code than any other software system of
similar scope. It is a very rich ecosystem for an algorithms and data
structures geek.

------
faragon
Be careful with theoretical asymptotic complexity (big O) related to execution
time. E.g. if your algorithm time complexity is O(1), but internally calls a
higher complexity function, e.g. malloc(), implemented with higher complexity,
e.g. O(log n), your algorithm time complexity would be O(log n) and not O(1).
It could be even worse: _on average_ or _typical_ constant time algorithm
could be in reality an O(n) one: e.g. case of hash table reindexation (that's
the reason of why many big data structures, including most SQL databases,
requiring real time behavior, are implemented as trees, tree
hierarchies/division/clustering, instead of big hash tables).

~~~
duaneb
I don't think that the `n` in the case of malloc would always be relevant to
the semantics of the query. In that case, it would still be appropriate to
refer to it as constant time.

For instance, you don't typically look at the size of the literals in the
query when evaluating query complexity. If it's really unbounded, you probably
shouldn't use a relational database.

~~~
faragon
Not necessarily, but could be the case. Typical malloc implementations have
different management for at least small and big memory requests, using
different pools, in order to reduce fragmentation. Also, because operations
involving virtual address remap are expensive (realloc on a small block is
faster with a full memcopy to adifferent location, rather than doing stuff
involving the OS kernel doing virtual address remap).

The "problem" of malloc() function (or any other equivalent allocation stuff)
is that internally manages free blocks (it is a middleman between the OS and
user process -the purpose is to reduce OS calls-), if you have lots of them,
dynamic memory could take time. For "malloc" I meant malloc/realloc/free, the
whole kit. Those operations are not free (in most cases you're not going to
have millions of allocations in one process, that was just an example of
hidden things that could make your algorithm not behave like expected).

------
brudgers
Because I am interested in databases, I found the Se-radio's 2013 interview
with Michael Stonebreaker [1] interesting, particularly in regard to
traditional database design and more recent ideas:

[http://www.se-radio.net/2013/12/episode-199-michael-
stonebra...](http://www.se-radio.net/2013/12/episode-199-michael-stonebraker/)

[1]:
[http://www.theregister.co.uk/2015/03/25/mike_stonebraker_win...](http://www.theregister.co.uk/2015/03/25/mike_stonebraker_wins_turing_award/)

------
otis_inf
> When it comes to relational databases, I can’t help thinking that something
> is missing. They’re used everywhere. There are many different databases:
> from the small and useful SQLite to the powerful Teradata. But, there are
> only a few articles that explain how a database works.

That's because the inner workings are really old, as in: emerged before
blogging etc. was popular, hell before the internet was invented.

In the 'before/early internet days', we read books like 'An introduction to
Database Systems' by C.J. Date. (I had to blow the dust off my copy to read
the exact title ;)), which are more in depth than this article, but I like the
article better, because it's more to the point and easier to understand. Well
done!

~~~
jeffdavis
"An Introduction to Database Systems" is more about the relational model,
whereas this article is more about the implementations.

~~~
otis_inf
Only chapter 5 and 6, the rest isn't. Are we talking about the same book?

------
mgrennan
Not that Cassandra and Hadoop don't have a place. But because NO-SQL is hot I
see lots of young coders (I'm and old DBA) try to turn document store systems
into relational databases. They should all be made to read this post.

~~~
mgachka
(I'm the author of the article) I'm 28 and I’m currently a Big Data developer
(I use Hadoop, HBase, Hive …) and I don’t understand the buzz surrounding Big
Data and NoSQL.

With a relational database the complexity is hidden (more or less…) whereas
with Big Data and NoSQL the developer needs to deal with this complexity
himself/herself. As a result, most of the Big Data applications I’ve seen
don’t work well.

A really like Big Data because it’s more complex but to be honest, most of the
time my work does not required the “Big Data scale”.

~~~
kodablah
"With a relational database the complexity is hidden"

That is my main issue. I use Cassandra over relational firstly for its linear
scalability and multi-master-esque HA. But even ignoring those, I understand
exactly what is being scanned and what is not, I don't have to fight with an
optimizer at runtime based on several parameters.

~~~
mgachka
I understand your point (and it’s a good one) and here is mine: Unless you're
working in a team with a lot of good IT guys, you're likely to end up with
worse performances and problems.

For example, when I started in Big Data, in less than 3 weeks I was able to
optimize some batches just because I read the documentation of the framework
used (PIG in this case) and read a small part of the source code to dig
deeper. And it was not some touchy optimizations: I used in-memory joins and
reduced the number relations in the scripts to reduce the generation of Hadoop
jobs (which led to batchs 4 times faster).

There are often problems with our HBase database because it’s often overloaded
(I’m not an IT operator so I can’t give more details) and no one really
masters this database whereas it’s in production since 2014.

I do understand that in some cases a NoSQL database is mandatory and like you
I like to understand what I’m doing. But:

\- I’m not working in Silicon Valley

\- Most of my co-worker are not geeks (and I respect that)

\- It's VERY hard to find guys with real Big Data or NoSQL skills (this comes
from a French technical recruiter)

So, if the geek part of me loves Big Data and NoSQL, the rational part prefers
using well known technologies. If NoSQL and Big Data becomes mainstream and
more known then the rational part will love them too.

~~~
kodablah
While I agree with a couple of your premises, I don't think they all apply to
Cassandra as much and is too broad of a brush to use.

I don't believe NoSQL means no validation. In fact, I've found things like
Cassandra CQL actively prevent me from running expensive queries unless I opt
in (e.g. ALLOW FILTERING). Validation is DB specific, but I don't think it's
fair to say it's a footgun in NoSQL any more than in SQL databases.

As for choosing what is known by the employee market, I personally don't
choose technologies that way (but I do choose based on maturity of course). I
rarely look for skills as much as the ability to learn new ones, but I
understand it can be a pipe dream when in the market for juniors.

------
n0us
Actually this is the best post I have ever seen on this website.

------
jlees
This is also a pretty accessible quick intro to complexity and data
structures, nicely done. Definitely the sort of thing I would include as
further reading in a beginner course -- some beginners love to understand
"why" and this post answers pretty much all the "why" possible.

------
codezero
I decided to spend some time digging into SQLite. I highly recommend the
overviews of their architecture and the details about each part of the puzzle.

It's really understandable, very straight forward, even if a lot of it refers
to SQLite v2, it still seems very relevant.

[http://www.sqlite.org/arch.html](http://www.sqlite.org/arch.html)

------
emehrkay
I'm I crazy for wanting to write a database after reading this? Noting too
serious, just to flex that dev muscle

~~~
ddorian43
Try this:

Use lmdb for db library.

Use redis for the protocol.

Use twitter.gizzard for replication+sharding.

Boom! Your own webscale nosql!

~~~
collyw
All NoSQL is webscale, that's why it was invented after all....

------
aikah
Great Article , I wish a book was written where a simple database with a query
language was implemented from start to finish , even a nosql one, I always
wanted to implement my own.

------
buckbova
I've read sql server internals cover to cover and in many respects this is a
much better read. Thank you.

------
Cakez0r
> Nowadays, many developers don’t care about time complexity … and they’re
> right!

That's a pretty bold statement...

Very thorough explanations though!

~~~
michaelmior
I think the important thing is knowing when not to care. Unfortunately, a lot
of developers don't care because time complexity isn't even on their radar. So
in the times when it does matter, they get burned.

------
njharman
I believe that to be the best technical document I've ever read. Surely biased
as I learned so much.

~~~
mgachka
Hi, I'm glad to read this comment.

If you liked this article, maybe you'll like my article on Shazam. I used the
same pattern: I start from the basics of sound processing and computer science
and finish with an in-depth explanation of Shazam.

------
0xCMP
Wow, I wanted this recently. Anyone know stuff related to graph databases and
how those work?

~~~
vkat
Take a look at Cayley -
[https://github.com/google/cayley](https://github.com/google/cayley)

~~~
balquhidder
Is there a good way of actually inserting data using it?

------
mgrennan
Good read. How long do your keep your transactions logs and how often do you
make backups?

------
0xCMP
Anyone know of any rust based databases being worked on? Relational or
otherwise...

~~~
jamii
The database part of [http://witheve.com/](http://witheve.com/) is written in
Rust. It's not very technically interesting yet (eg no query optimiser) but
the basics all work.

------
beenpoor
Great article!

