The “Build Your Own Database” book is finished (build-your-own.org)
657 points by tim_sw 39 days ago | 99 comments

While I’m not in the target audience for this, I skimmed through the first chapter and this seems really cool - great job! I will definitely keep this one in mind if it ever becomes more relevant to me.

One small comment: the /about page only has the books listed and nothing else. Some people will probably be interested in knowing a thing or two about the author before getting the books (and there are probably a million people with the same name, so difficult to Google)

+1 re: the author point. I would like to know whose book I am getting.

I understand that sometimes people want to keep their identity private but for an author it's tough.

This is like wanting someone authoring a PR on GitHub to have their about section filled out. Maybe it's nice for the reviewer to have, but some authors prefer to let the content speak for itself. I think that's fair.

For someone trying to learn new information (and thus lacking the ability to discern), it's not possible to let the content speak for itself.

Past HN posts for author's other book, "Build Your Own Redis":

https://news.ycombinator.com/item?id=34557389 (Show HN - 5 comments)

https://news.ycombinator.com/item?id=34572263 (129 comments)

https://news.ycombinator.com/item?id=35212660 (65 comments)

build-your-own is actually good, unlike that link, though!

The book recommends that website: https://build-your-own.org/database/

Why do you say the codecrafters stuff is lower quality?

Can you provide some context? Curious what informs your thoughts on codecrafters' quality...

Well, one's a decently comprehensive book and the other's a SaaS aimed at embezzling your employer's learning budget: wider language support but lower quality.

Maybe the problem is the format...

haha yeah nicely put

For a supplementary angle I’d highly recommend having a nose in both the Postgres docs and source. They’re both very readable.

I think the section on internals is a good jumping off point that leads to a lot of deeper content.


We have so many database products out there. But if this book leads to more of them, I'm all for it. I think databases, while old as IT itself, still has room to evolve, especially in the distributed realm, and especially in the multi-master configuration where I do not see many products being offered, as compared to one-writer configurations.

The book suggests building a database on top of a KV store. This is precisely what I did with my project. I was initially building a product that I hoped would be able to replace existing file systems and I needed an architecture that made it easy to create meta-data tags for every file and then find every file that had certain tags very quickly. I implemented the tags using a set of novel Key-Value stores that I invented.

Once I had it working, I realized the KV stores I used for tags were just like columns in a relational table built within a columnar database. Querying for files based off their tags was very much just like SQL queries for table rows. So I tried using them to create relational tables.

They turned out to be incredibly fast at a variety of queries (the bread and butter of databases) without needing to create separate indexes in order to get optimal performance. I thought database experts would be intrigued when I showed how much faster my system was than other conventional RDBMS setups on the same hardware. I guess I was surprised when almost no one was even curious how it did it.

Here is a simple, short video comparing it to SQLite: https://www.youtube.com/watch?v=Va5ZqfwQXWI

I think the problem is that every relational query can be implemented on a KV store, but usually the trade-offs and bugs in your nascent query engine are the boring part, and we can already use a bunch of hella-fast KV stores out there if you don't care about ACID, or you're willing to give up the decades of implementation work in your SQL engine of choice.

I believe the aim of the book is more to promote a basic understanding of how relational databases work internally, by way of implementing a simple one oneself, an understanding which is generally helpful when using databases, and not so much to cause new database products to be created.

As someone who lacks a formal CS education and wants to know more about how databases work, I have been eagerly awaiting this book. I also want some practical golang projects to work on so this is perfect! I'm so excited!

It's kind of frustrating how this description of the book doesn't even mention what language is used in the examples.

I agree it could have been mentioned. Section 0.2, however, part of the short introduction page [0], provides the information:

The book uses Golang for sample code, but the topics are language agnostic. Readers are advised to code their own version of a database rather than just read the text.

[0] https://build-your-own.org/database/00a_overview

Built a college newspaper website back in '99, got tired of maintaining it by hand. Discovered PHP when it was new. Wanted to build a content management system. SQL sounded hard so I wrote my own database. Worked great for the years I was at the college.

Is the final solution available somewhere that I can build and run?

I see the code is sparse, but I wanted to get the feel end to end.

It's a very short book, takes half an hour at most to read through if you skip implementation.

Woah great! I’d love an Andy Pavlo review of it ;)

What language is the tutorial in?

the book uses golang

Indeed. The Redis book was C/C++, so I was hopeful that this one would be as well. Given that database essentials are so closely tied to system calls, I would have hoped for a language that doesn't abstract them away. At the first fsync() call, the author does explicitly mention that `fp.Sync()` ends up invoking the `fsync` system call, but as someone who has no intention of returning to Golang, I'd rather not add complexity by building and maintaining a mental map of Golang calls to syscalls (the worst kind of abstraction layer IMHO: leaky and unnecessary).

It’s interesting that it doesn’t seem to actually say that anywhere.

Right, in the introduction it says: https://build-your-own.org/database/00a_overview

Or if you see any chapter you can literally see Go code.

You're right, I missed the one place in the introduction where it mentions Golang.

Other than the "if err != nil" I wouldn't have recognized the code samples. Go's error handling is a big reason I've never taken a closer look.

Never understood the issue with Go error handling. As soon as you drop exceptions as a valid error-forwarding mechanism (which seems like a totally acceptable design decision), you end up having to manually check every call you make that can raise an error, on each line.

I don't really see any alternative. It also makes you carefully think about how you plan on managing errors in your codebase, which also seems like a very sane thing to enforce.

You don't have to trap exceptions separately on every line. You trap a whole block of code and match on the exception type to determine how to handle the problem. It isn't perfect, nothing is, but I prefer systems languages where typical failures cannot easily be ignored and yet you are not burdened with constantly thinking about it.

> I prefer systems languages where typical failures cannot easily be ignored and yet you are not burdened with constantly thinking about it

This is self-contradictory. In particular, the only way you can have reliable error handling is if you are forced to think about each possible failure.

I assume by "cannot easily be ignored" you mean the way exceptions blow up at runtime? I don't find that an acceptable default for any non-scripting language.

People who aren't used to actually handling their errors get annoyed at typing return err all the time.

They'd prefer an easier way to not bother dealing with them, without outright ignoring them via _.

I'd get annoyed too, because I know there are much simpler alternatives. Language design should encourage doing things right by making it more convenient than the shortcuts.

In Rust that entire check can be a single "?" symbol. How much syntactic sugar is too much is a matter of preference, but I personally think that properly handling all errors without syntactic sugar turns into an unreadable mess because there's just a lot of things which could go wrong.

Almost 50% of the time I want to add some debugging info to the error before forwarding it one layer above, to provide more context.

I think just forwarding all low-level errors is a really bad habit, and go forces you to at least think about this.

> I think just forwarding all low-level errors is a really bad habit

Why exactly is that a bad habit? In almost all situations where I return an error I already have enough context, I'm just wondering what else I'd add to that.

> go forces you to at least think about this

Boilerplate code definitely doesn't incentivize thinking.

Most errors you encounter are with I/O and are stupid: "can't read", "can't write", or "can't serialize".

In a network environment (which is originally what go was made for) you often need to add tracing information, business-level identifiers or processing information related to your state etc.

I'm currently writing a fairly complex api in go, and to be honest this really hasn't bothered me once.

Not to say it doesn't exist, but with time I've become very suspicious of people's complaints about Go. Most of the time those complaints come from people who didn't realize they missed an opportunity to write a much, much more elegant solution to their problem.

or usually just try again with backoff.

just returning the error isn't handling it.

I never said it is, but to handle an error you have to pass it to the caller that can actually do something with it, whether it's trying alternatives, repeating the step, just logging it, or whatever else - a lot of functions just need to pass it on a couple of times and making this verbose in every single case doesn't make sense to me.

My favorite language is Erlang, which is about as far from Go as it gets from an error handling perspective.

Interesting choice. I'd think that as the context is "As many of today’s (2023+) coders do not have a formal CS/SE education" and the goal is education, a more popular language like Javascript or Python (or, heck, even PHP[1] /s) would be used, instead.

I've taken a look at Go, and while it does seem pretty approachable, it's definitely not nearly as common as Python/JS, and it's always significantly harder for me to learn a new concept when the examples are also in a language I'm unfamiliar with. Maybe that's just me, though.

[1] https://survey.stackoverflow.co/2022/#most-popular-technolog...

I think Go is a pretty good choice for the purpose. It balances high-level ease of use and learning curve with decent access to the system-y parts of coding that are so important to databases. What you learn to do in Go will translate reasonably well to a true systems language if the user wanted to take database engine design to the next level.

Languages like Python or Javascript are so far removed from the system-y side of programming that the way you would implement the concepts in those languages would not translate to the way you would actually build a "real" database which is I think the purpose of the book. I think the objective isn't to teach the abstract concepts but how those concepts are expressed in real systems.

I'm a data engineer and I only know Python. It appears Golang hits a sweet spot on many metrics such as performance, parallelism, and ease of use, and since 2016 there have been a lot of new data products and tools written in Golang. So it makes sense to me that the book would use a popular language for the domain.

Databases really do push runtimes in such a way that I think it makes sense to urge folks to use a system level language, or something close to it. In particular it'd be hard to cover concurrency (and parallelism) properly using vanilla CPython or JS, and I think that would impinge on the lessons learned.

That said, it'd be an interesting read on how to make a DB in pure Python.

I don't normally think of Node and CPython as in the same performance bracket.

Regardless, you can do some crazy things in Node. See these notes about Node and MySQL:


In this context I meant that the parallelism primitives are different than what you'll get in a systems level language, which might make teaching those parts unnecessarily awkward and probably the concepts less translatable to other runtimes.

For sure they've got different performance profiles.

Very impressive what your group was able to do with Node, and continuing on with Zig. It's got me interested in learning more.

you may be interested in this https://github.com/avinassh/py-caskdb

I could be wrong, but my perception is that Go is so opinionated that you'll either write idiomatic Go or use another language. So, from that perspective there's some goodness in using Go as a learner's systems language.

It's opinionated, but it's not _that_ opinionated IMO. You can write awkward Java, shiny C, or whatever idiomatic Go is. Most people I worked with were from .net/Java worlds, and learned just about enough Go to be able to subject others to their coffee bean ideologies.

There are some things the compiler will fail on, like unused variables and the like, but for the most part you need added static analysis and style checking, some of which ships with the Go toolchain.

This is something I’m going to be buying for sure. I saw the site also has a “Build your own Redis”.

I’ve always wanted to understand how databases work so I can build my own.

Excited about this!

I'd be curious, if somebody bought the book, to know what kind of concurrency it covers: just two-phase locking, or does it talk about MVCC?

If you want some sample code to implement MVCC, I implemented MVCC in multithreaded Java as a toy example


First read TransactionC.java, then read MVCC.java (or follow the methods that TransactionC calls).

Here's the relevant headers in the ToC.

11. Atomic Transactions

11.1 KV Transaction Interfaces

11.2 DB Transaction Interfaces

11.3 Implementing the KV Transaction

12. Concurrent Readers and Writers

12.1 The Readers-Writer Problem

12.2 Analysing the Implementation

12.3 Concurrent Transactions

  Part 1: Modify the KV type

  Part 2: Add the Read-Only Transaction Type

  Part 3: Add the Read-Write Transaction Type

12.4 The Free List

12.5 Closing Remarks

Good question. I've no idea, but if I were the author I wouldn't bother with concurrency; I'd assume a single user, only and always. A database without concurrency is still a perfectly good database, so it's hardly making the book title a lie.

Have to disagree, concurrent transaction processing is super interesting.

Sure it's interesting! Maybe a bit much to load on beginners though? Maybe the author can say.

Can we get a "build your own spreadsheet" next?

"A spreadsheet in fewer than 30 lines of JavaScript, no library used"


Not a book, but. :-)

Turbo Pascal came with a public-domain sample spreadsheet implementation (CALC.PAS aka MicroCalc) since version 1.0 (from 1983, 40 years ago!). Here is the version from Turbo Pascal 3 on GitHub: https://github.com/hindermath/MicroCalc

Spreadsheet Implementation Technology exists, although I cannot give a very strong recommendation

Thank you!

Here’s one in scheme: http://people.eecs.berkeley.edu/~bh/ssch24/spread.html

It’s ... part of a book! :)

Spreadsheets have been teaching the fundamentals of functional programming to the unwashed masses for decades.

Couldn't be a better story.

And now it’s come full circle with scheme in excel https://www.codemag.com/Article/2207071/The-Excellent-Scheme... ... it’s a commercial plugin, though

There are many kinds of databases, I've used similar designs to [0] (Lisp warning, ymmv) successfully in several projects.

[0] https://github.com/codr7/whirlog

Maybe I'm missing something, but after a cursory glance, I can't find any place where that library does an fsync() call. How does it handle durability?

your comment made me laugh. I never thought about entering a codebase by searching for fsync, but in the case of a DB that's probably the best place to start :))

True, it's missing a call to (finish-output), I'll take a closer look later.

It costs $10 extra for the code alongside the ebook? Is that a standard practice?

Not sure if I would call it "standard", but it's a widespread practice among project-based books.

Interesting. I wonder if the knowledge one could acquire in the book would easily translate to mainstream databases.

For relational databases it certainly does.

How long is the ebook version?

Oh, I just remembered MIT 6.830: use Java to build a mini database, for free.

No coverage of distributed databases?


1. Memory map a file

2. Manage access to it through a server

There you go, you now have a database.

1. Draw some circles

2. Draw the rest of the owl


Memory mapping is a poor way to make a database[0], not all storage can be memory mapped, and this doesn't provide any of the functionality you expect of a database like indexing and concurrency control.

[0] https://db.cs.cmu.edu/mmap-cidr2022/

I saw a headline about the book on Reddit and made a joke to a friend of mine: "How much you wanna bet it uses mmap?" Well...

It uses mmap.

The concurrency control is managed by the server that manages access.

All databases that are any different from these two points are just bad.

> Manage access to it through a server

This has real "draw the rest of the owl" energy.

How to make a DBMS? 1. Get a file 2. Make a DBMS

> 1. Memory map a file

Complex when you want to

> 2. Manage access to it through a server

A database is only "easy" IF:

- Append only

- No real "deletes" or "updates", just to reiterate the above.

- Only sequential scans

- Only needs a simple iterator-per-row

- No maintaining secondary structures like indexes, so no need to coordinate changes

- No concurrency

- Fits in RAM, and I mean in a few MB

- No need to deal with SQL; use your own DSL (SQL is so bad! so much weird stuff! but it's ok to have something SQL-ish like LINQ)

- No need to deal with recursive data types, only scalars

- Is only embedded

- No auth or security validations

Ok, after making this list, I surely forgot some other tips that make this easy!

There is no requirement to have the file fit in RAM to mmap it, that's the point.

Obviously you'd have a file per column (or index), and use directories to represent tables.

This is the correct way of doing it and yet so few databases do it.

"correct" is not an objective measure, it's a function of use case

mmap is not a panacea, it improves specific access patterns by incurring specific costs, it's definitely not true that mmap is the right choice for all databases

Yes, I understood, but "mmap a file" as the basis of a database is not "easy", because now you have to deal with segfaults and other weird things that are outside the normal playbook.

Of course, depending on the other items on the list, this is or isn't a major issue. It's more about how combining several ideas leads to an easy or a complex implementation.

I worked on a commercial product that did just this, except it also had a "transaction log" component. All access through the server was logged, and the log could be used for replay/recovery. While it worked for them for thousands of simultaneous clients, it was not a general purpose solution. There were no real indexes: the "indexes" (hash tables, really) were built in memory at startup.

I don't understand what you mean by no real indexes.

Why does generating indexes at start-up not count as having indexes?

(Asking because I do this all the time.)

I suppose I should've said "persistent" indexes. There was also only a single type supported: hash table. And it was hard coded to 2 fields. You had to change the code (which was C) to change the database.

This system was rarely restarted, so in practice it didn't matter what it did at startup, as long as it didn't take more than a few minutes, but it did place some limitations on data size. (This was a 32-bit system and everything was memory mapped.)

1. Build your own database

There you go!

You seem like a real go-getter who can cut through red tape and find simple solutions to complex problems :D

The power of abstraction!

Why is this so heavily down-voted?

It feels correct, even though it's not a complete guide.

(Reliably persisting changes to disk is a big part of what DBs do, but it's missing here.)

Good to know.
