
Ask YC: How to write a database - tumba
Most algorithm and data structure texts deal with in-memory algorithms and I'm having difficulty making the leap to secondary-storage equivalents. I know that one should think twice before writing one's own database, but I want to learn how they work. What books and/or source code would you recommend I read?
======
stonemetal
_Most algorithm and data structure texts deal with in-memory algorithms_

Nope, they deal with Algorithms in general without regard to storage. Usually
when an algorithm has been developed with some sort of memory in mind it is
titled as such. As an example cache oblivious algorithms are algorithms which
are cache aware but oblivious to cache size (aka supposed to be good no matter
the cache size).

Sqlite is the only database I can think of that is small enough to be read. If
all you are interested in is algorithms applied to disk storage I don't think
I would recommend writing your own database since that is going to involve
writing a large chunk of code for SQL parsing, query planing, etc. Why not
just write some data processing tools for data mining.

~~~
tumba
I agree and should have been more precise. Most algorithm and data structure
texts fail to deal with the issues surrounding secondary-storage application,
including optimizing disk accesses, buffering, etc.

~~~
fleitz
Optimization is a black art. It generally only applies to very specific cases.
This is why most RDBM systems expose a lot of knobs to tune. The DB must be
tuned both to hardware and work load, what is a good idea for one workload may
be a disaster for another.

Here is an example, you can usually increase DB performance by about 15% by
not allocating the first 8 MB of your hard drive. (Hint: the partition table
causes misaligned reads/writes on raid systems if your partition starts at
sector 1)

If you want to see the tricks of the trade, go read the TPC results, find the
auditor's report and look through the source, the hardware configuration, and
the software configuration.

------
fleitz
Theoretically, B-trees/ISAM. That's pretty much all you need to know. Most
algorithms aren't optimized for their storage layer(s), they exist primarily
in big-O notation that ignores latencies, and especially non-uniform
latencies.

On most systems RAM is the 5th level of storage (registers,l1 cache,l2
cache,l3 cache, ram). Just because L1-3 cache isn't addressable doesn't mean
it won't have a serious effect on the performance of your code.

------
bockris
Check out the source of SQLite.

~~~
gdl
I'll second this. I haven't looked through it myself, but have heard it
praised many times as an example of exceptionally clean and efficient code.

------
1331
If you would like to learn database theory, then I recommend the following
book:

_An Introduction to Database Systems_ (8th Ed.) By C. J. Date
[http://www.amazon.com/Introduction-Database-
Systems-8th/dp/0...](http://www.amazon.com/Introduction-Database-
Systems-8th/dp/0321197844/)

