
Readings in Databases - luu
http://rxin.github.io/db-readings/
======
tokenrove
Great list. Almost anything Jim Gray wrote, or any of Michael Stonebraker's
C-store stuff, is worth reading.

Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom seems
to be one of the few full books with decent coverage of implementing a
database, even though anyone wanting to implement a modern database will have
to go a long way from what it covers.

There are a ton of great papers out there easily accessible, though, at least
in part thanks to the VLDB Endowment. The people at MS Research working on
databases publish a lot of great papers, too.

~~~
binarymax
Absolutely agree. Also for those interested in query processing and
optimization I highly recommend some of the fundamental stuff by Goetz Graefe.
[http://dblife.cs.wisc.edu/person/Goetz_Graefe](http://dblife.cs.wisc.edu/person/Goetz_Graefe)

------
rxin
Author of the GitHub repo here.

I wanted a list of papers that would be essential in building database
systems. It was little bit sad to see many members of the community re-
discovering and re-inventing the wheels from 30+ years of relational database
and systems research. This list of papers should help build a solid
foundation.

~~~
tinco
It's a nice list but I have the feeling you can read each of these articles
and still not be able do write a halfway decent database because there's
nothing on how to actually write data to the disk efficiently on any operating
system.

I for one would be very interested in that, although perhaps the deprecation
of spinning disks has reduced the complexity of the issue.

~~~
batbomb
"halfway decent database" is a very vague notion. What would that even be?
Something similar to SQLite, MySQL, Cassandra, VoltDB? Are you talking about
write ahead logging of flushing dirty pages from the buffer (aka sequential or
random writes)?

All of these database systems have extremely different host requirements and
write characteristics. Even SQLite and MySQL can be pretty different depending
on the storage layer (SQLite can use BerkeleyDB)

The future isn't MySQL and spinning disks, the future is a tens of different
databases that do things and handle reliability in a variety of ways. I know
that complicates the issue even more, but my point would be that writing a
halfway decent database is possible (or impossible, depending on your
perspective) within a lot of this literature.

( Also, spinning disks even have abstractions since mostly you are using a
RAID controller with a decent size of cache to alleviate random writes. )

~~~
tinco
I completely agree with you with regards to the future of databases, it will
definitely be running many databases.

What I meant with halfway decent database is one that can actually guarantee
the data it committed is actually on the hard drive in a readable state, and
that can sustain random writes at a rate close to the theoretical limit of the
persistent storage device. I'm not judging any databases in particular. I'm
asking how can _I_ make a database that's as awesome as MySQL or Cassandra.

All these databases have already taught us a lot of what's in these articles,
but one thing that isn't written about a lot is how they actually achieve high
performance, not just in a theoretical way, but in a practical way. What
system calls do they make, what schemes do they use to minimize seeks, how is
the memory laid out. None of that high level "do we use Paxos of Raft", that's
kids stuff.

~~~
batbomb
You should check out Database Systems: The Complete Book.

Those papers have a lot of that information, but it's usually a bit buried.

> What system calls do they make

I think that's actually kind of out of scope for a paper, because already you
are assuming an OS and a language. Many papers/articles assume neither.

> what schemes do they use to minimize seeks

Look for information on buffer management/buffer managers

> how is the memory laid out

Almost always in pages that correspond to disk blocks. Also it's dependent on
lots of things (transactional model, index types, column vs row orientation,
etc..) This is heavily tied into the buffer management stuff too.

------
etrain
See also: [http://www.cs286.net/home/reading-
list](http://www.cs286.net/home/reading-list) \- Joe Hellerstein's graduate
database class.

~~~
bjg
You should send a pull request, to add that as one of the "External Reading
Lists"

~~~
etrain
Done: [https://github.com/rxin/db-readings/pull/2](https://github.com/rxin/db-
readings/pull/2)

------
walterbell
These lists (of lists ..) are reinventing early Yahoo/DMOZ/webrings. If lists
were available in a parseable format with accurate meta _data_ (title, author,
date, publisher), one could monitor github RSS and generate a local master
list for analysis.

Related: [https://github.com/inukshuk/jekyll-
scholar](https://github.com/inukshuk/jekyll-scholar)

~~~
mtdewcmu
Assuming you're right that this is a trend, I don't think it should be too
hard to heuristically pick out likely examples. Assuming most of them are good
quality, there should not be so many that it's impossible to pick through the
results and do the final formatting manually. Waiting for a standard format to
come along could take a while (seen the semantic web yet?).

~~~
walterbell
The jekyll-scholar format already exists and is more "semantic" than Markdown.
After people read these database papers, perhaps one will understand the value
of indexing structured data and convert/fork the list repo to structure the
source format. This is easy at the birth of a collaboratively edited project
and impossible later.

> Assuming you're right that this is a trend

Here is one list of lists,
[https://github.com/sindresorhus/awesome](https://github.com/sindresorhus/awesome)

~~~
mtdewcmu
My experience has been that expecting people to do more work, even a little
bit more, doesn't often pan out.

~~~
walterbell
At least MS/Google will thank humans for enriching their proprietary knowledge
graphs / semantic webs. It may take a few more years for people to realize why
they need "personal algos" as much as "personal computing devices". Humans can
then feed delectable metadata to their private algos, in addition to feeding
large public spiders.

------
AlisdairO
Wonderful list! The C-Store papers are particularly fine reading.

For people who looked at this wanting to get going with using database systems
(as opposed to creating them), I'd recommend:

    
    
        Learning SQL by Alan Beaulieu
        SQL Performance Explained by Markus Winand
        SQL Cookbook by Anthony Molinaro
    

The three of these will take you up to a pretty useful level of SQL
knowledge/understanding. The first two are fairly easy reading to boot.

~~~
walterbell
See also associative databases:
[http://www.associativemodel.com/associative_intro.htm](http://www.associativemodel.com/associative_intro.htm)
& [http://www.bloorresearch.com/analysis/the-return-of-the-
asso...](http://www.bloorresearch.com/analysis/the-return-of-the-associative-
database/)

------
St-Clock
I really loved reading "CAP Twelve Years Later: How the "Rules" Have Changed".

This sentence nailed what I thought was wrong with some early decisions in
NoSQL systems: "because partitions are rare, there is little reason to forfeit
C or A when the system is not partitioned."

~~~
michaelt
If you duplicate your data in locations a few hundred milliseconds apart (e.g.
America, Europe and Asia [1]) you either need to tolerate inconsistency or
have every write take a few hundred milliseconds.

If your system requires response times faster than that, the data centres you
don't have time to contact behave sort of like they are partitioned all the
time.

Me, I tolerate writes that take a bit longer to keep things simple :)

[1] [http://www.nsmith.net/articles/2011-08/latency-between-
amazo...](http://www.nsmith.net/articles/2011-08/latency-between-amazon-web-
services-regions/)

~~~
St-Clock
"Me, I tolerate writes that take a bit longer to keep things simple :)"

It's not just to keep things simple, it's to keep things consistent :-) Some
NoSQL systems (e.g., MongoDB) introduced a new P in the equation (Performance)
and used this to justify a lot of questionable decisions that affected
consistency (and durability!). They have come a long way since these early
decisions though.

If a system considers moderate latency as a partition, it would need to
perform partition recovery on every write. That's why I'm not sure considering
latency as partition is in line with the CAP theorem.

Maybe it is more related to availability though (not sure).

------
ahmett
Paxos paper is in "Basics" category just like it is meant to be a joke. Even
the "Paxos made simple" paper is not easily understood by graduate students as
many studies have shown. (see Raft paper for a study on this.)

~~~
marcosdumay
A "basic" concept is one that other concepts improve upon. The popular
association with "easy" is not always true.

~~~
twic
That sounds more like a fundamental concept than a basic one.

------
handojin
Chris Date, Database in Depth

No idea if legal...

[http://uk-corp.org/Books/Database/Database_in_Depth.pdf](http://uk-
corp.org/Books/Database/Database_in_Depth.pdf)

~~~
Nzen
Nope, this is an illicit file sharing site. O'Reilly publishes Date's book.
uk-corp's whois record has a privacy stand in. These could be legit copies or
packaged with malware.

------
checker659
If you'd like to learn about databases (at least RMDBs anyways), there's a
paid course at Harvard's Extension School. It's offered online for graduate
credit at $2k a pop.

Link : [http://www.extension.harvard.edu/courses/database-
systems](http://www.extension.harvard.edu/courses/database-systems)

~~~
zerr
Link to downloadable video lectures of this would be useful.

~~~
checker659
It's a paid course. You probably didn't read my comment in its entirety.

~~~
zerr
I've read it. Actually, there are no video lectures on this topic at all (i.e.
Database implementation in general, not necessary the paid course you
mentioned).

------
jmcatani
Is there anything like this list for Operating Systems?

~~~
mtdewcmu
I'm going to speculate that it's less necessary for operating systems, because
the equivalent information for OS has found its way into textbooks.

------
justinmk
link to "Anatomy of a Database System: Joe Hellerstein"[1] appears to be
broken.

[1]
[http://mitpress.mit.edu/books/chapters/0262693143chapm2.pdf](http://mitpress.mit.edu/books/chapters/0262693143chapm2.pdf)

~~~
etrain
[http://db.cs.berkeley.edu/cs286/papers/anatomy-
redbook2005.p...](http://db.cs.berkeley.edu/cs286/papers/anatomy-
redbook2005.pdf)

------
sean_the_geek
Nice list, bookmarked!I am going through the SHARK paper and would say that
its definately worth a read.

------
ExpiredLink
Readings in database _research_.

~~~
lvh
Many of the papers linked are case studies based on production systems. Do
they count as research, too?

------
jedanbik
Really looking forward to reading these. Thanks for contributing.

