
How Does Sqlite Work? (2014) - _tw9j
https://jvns.ca/blog/2014/09/27/how-does-sqlite-work-part-1-pages/
======
gtrubetskoy
The details of how the data is stored are described in this 1979 paper by
Douglas Comer: "The Ubiquitous B-Tree" [1]. Since it was invented, much hasn't
really changed: practically all relational databases use some variation of
this structure. I wrote about it in [2], which might help explain some aspects
of how it works.

[1]
[https://dl.acm.org/doi/10.1145/356770.356776](https://dl.acm.org/doi/10.1145/356770.356776)

[2] [https://grisha.org/blog/2013/05/11/relational-database-on-
to...](https://grisha.org/blog/2013/05/11/relational-database-on-top-of-key-
value-store-explained/)

Edit: also [3] is about how to store SQLite in a Redis hash (rather than on-
disk file).

[3] [https://grisha.org/blog/2013/05/29/sqlite-db-stored-in-a-
red...](https://grisha.org/blog/2013/05/29/sqlite-db-stored-in-a-redis-hash/)

~~~
nn3
A lot of modern databases don't use Btrees anymore. With a LSM structure you
can use other suitable data structures because they can be write-only. For
example leveldb uses a simple binary search.

~~~
tannhaeuser
[https://symas.com/lmdb/](https://symas.com/lmdb/) begs to differ

~~~
craftyguy
They didn't say 'all', they said 'many'. So your one example doesn't 'beg to
differ' much in the way of OP's point.

------
zoomablemind
The SQLite code is nicely split into modules, and builds together reatively
quickly. So the easier way to explore the codebase is to load into editor the
individual modules at src/ rather than the aggregated sqlite3.c file which is
too big and unmanageable in the editor.

IIRC, the heart of the db engine appears to be some sort of vm that processes
the parsed out SQL. Stepping through in debugger will inevitably capture you
in the loop which executes the statements.

Of course, there're plentiful design docs, all on
[https://sqlite.org](https://sqlite.org).

~~~
bntyhntr
My c is pretty limited and I was able to hack in support for table prefixed
columns in result sets over a night or two from never seeing the code base
before. It was fun to work with!

------
twic
Anyone looked at DuckDB [1]? Like SQLite, it's an embedded relational database
with a C API. In fact, it has a wrapper which presents the same API as SQLite
[2]. But it's optimised for analytical rather than transactional work - a few
huge queries rather than lots of little reads and writes.

[1] [https://duckdb.org/](https://duckdb.org/)

[2]
[https://github.com/cwida/duckdb/tree/master/tools/sqlite3_ap...](https://github.com/cwida/duckdb/tree/master/tools/sqlite3_api_wrapper)

~~~
st1ck
For analytical workloads, DuckDB may be better than SQLite, but it's just
incomparably slower than e.g. ClickHouse (which is admittedly not embedded).

~~~
mytherin
The main reason for that is that DuckDB is currently single-threaded, we are
currently actively working on parallelism for DuckDB which will resolve this
:)

~~~
st1ck
I think there are also tons of other optimizations currently missing in
DuckDB, vectorization probably the most important.

~~~
mytherin
Very true, DuckDB is very young and we have so far focused on correctness over
performance. As a result we are still missing several optimizations that more
mature systems do have, for example, proper cardinality estimation. We are
working on adding those as well.

------
justinclift
Part 2: [https://jvns.ca/blog/2014/10/02/how-does-sqlite-work-
part-2-...](https://jvns.ca/blog/2014/10/02/how-does-sqlite-work-
part-2-btrees/)

------
Animats
Lookup is easy. Update is hard. Sqlite does quite well at update, given the
limitations it works under. Multiple instances can update the same database
while maintaining strong consistency, yet there's no common server process
that can see all the transactions. This costs you some speed in updating
(don't do your web service this way) but it's pretty good for most low-write-
traffic use cases.

------
oscargrouch
> I’m still not super clear on how exactly the interior pages are structured
> and how the pointers to their child pages work. It’s time to sleep now, but
> perhaps that will happen another day!

Luckily. The SQLite source code is very well documented, and the 'btreeInt.h'
has something about how the btree pointers are laid out

I happen to use the SQLite btree directly without the SQL layer (the VDBE) as
a key-value store on top of a userspace filesystem (chrome cachefs) and even
with this configuration it works wonderfully.

    
    
       ** The basic idea is that each page of the file contains N database
       ** entries and N+1 pointers to subpages.
       **
       **   ---------------------------------------------------   -------------
       **   |  Ptr(0) | Key(0) | Ptr(1) | Key(1) | ... | Key(N-1) | Ptr(N) |
       **   ----------------------------------------------------------------
       **
       ** All of the keys on the page that Ptr(0) points to have values less
       ** than Key(0).  All of the keys on page Ptr(1) and its subpages have
       ** values greater than Key(0) and less than Key(1).  All of the keys
       ** on Ptr(N) and its subpages have values greater than Key(N-1).  And
       ** so forth.
       **
       ** Finding a particular key requires reading O(log(M)) pages from the 
       ** disk where M is the number of entries in the tree.
       **
       ** In this implementation, a single file can hold one or more separate 
       ** BTrees.  Each BTree is identified by the index of its root page.  The
       ** key and data for any entry are combined to form the "payload".  A
       ** fixed amount of payload can be carried directly on the database
       ** page.  If the payload is larger than the preset amount then surplus
       ** bytes are stored on overflow pages.  The payload for an entry
       ** and the preceding pointer are combined to form a "Cell".  Each 
       ** page has a small header which contains the Ptr(N) pointer and other
       ** information such as the size of key and data.
    

I guess if she go straight to use the btree layer, bypassing the SQL layer, it
will be easier to figure it out how the pages and pointers works and relate to
each other.

After this, try it out with the SQL and see how it works.

\---

Edit: just to add to this, this line here

> Each BTree is identified by the index of its root page

When you call 'sqliteBtreeCreateTable()' it will return to you the root page
of the table it creates. In my experience it starts from 2, and my guess is
that 1 is probably some master root page (i bet Julia will dig into this
later).

This is how i've manage to have "keyspaces" here. i use the first created
table as a meta-table, keeping the index of keyspace names with their root
page number counterparts, and on opening the db, is just a matter of
recovering this info and put it in a hash table (keyspace => page number) for
later use.

~~~
dunham
Yeah, if I remember right the first block is the root of the master table,
listing all the other tables. (Name, schema, root block id.)

------
ramoz
I encourage anyone interested in this to also explore embedded dbs (leveldb,
badger, lmdb, rocks, etc)

~~~
refset
Embedded KV stores are as comparable to SQLite as Cassandra is to Postgres
(i.e. lacking joins or other data modelling essentials), but I agree that
embedding a database directly inside your app is interesting and opens up a
world of possibilities for efficiently blending N+1 code and queries. It gets
even more interesting to take it a step further and embed a networked-replica
of a database à la Datomic Peers.

~~~
dfox
It was probably meant as a learning exercise. And for me it is that
implementation details of the KV/heap layer with some meaningful transactional
properties seem more interesting than how to build SQL/whatever database on
top of that, which in comparison looks like mostly obvious but tedious SMOP.

~~~
refset
True, I was being somewhat pedantic. Whenever I see the word "db" I can't help
but think "DBMS". It's frustrating that "DBMS" is such a mouthful that nobody
bothers using it anymore, and that KV store projects choose to (ab)use the DB
suffix.

> implementation details of the KV/heap layer with some meaningful
> transactional properties seem more interesting

There is undoubtedly a real beauty in LMDB's data structures, and it is
impressive to see how Rockset have re-engineered RocksDB to become cloud-
native …but my own feeling is that the SQL/whatever, distributed consistency
and federation layers of a DBMS encompass some seriously hard and fascinating
problems. What if the web could behave like a cohesive offline-first database
for all of human knowledge after all?

Now I think about it, "interesting" has a pretty loose definition too!

------
unnouinceput
SQLite true value is in its cross-platform usability. If you want to have same
codebase for your application on all current major operating systems
(Win/Droid/Lin/Mac/iOS) for a local database you go with SQLite.

SQLite is not as good at keeping big records on server side, for that you go
with PostgreSQL. And if you're Windows only, then Access is faster for a local
database. But if you want a good enough local database, that performs the same
on all of them, then your only choice is SQLite.

~~~
GuiA
There should be some sort of UNESCO World Heritage Open Source Software
selection, and SQLite should definitely be in it.

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=8378894](https://news.ycombinator.com/item?id=8378894)

------
ed25519FUUU
> _May you do good and not evil._

> _May you find forgiveness for yourself and forgive others._

> _May you share freely, never taking more than you give._

If only more software engineers started out their projects with this mindset,
the world would be a very different place.

~~~
mosselman
Why is this down voted?

> If only more software engineers started out their projects with this
> mindset, the world would be a very different place.

Apparently this is a horrible thing to say?

~~~
edoceo
Not sure about the comment you replied to (is that called GP¿) but I think
you've missed this guidelines:

"Please don't comment about the voting on comments. It never does any good,
and it makes boring reading."

------
blackrock
Is SQLite able to do multiple simultaneous writes to it at the same time?

Or is only one agent at a time, allowed to write to it?

~~~
pdimitar
You can technically have many writer agents but they will all compete for an
internal sqlite mutex so it's best you serialize write access in your own
code.

------
tekstar
Years ago I spent an evening at Recurse Center (at the time Hacker School) in
NYC. My employer sent me with some recruiters to chat with participants about
hiring possibilities. It was just so cool.

That night I met Julia (she and a friend were, if I remember correctly, trying
to boot a handwritten kernel in a VM and figured it out around 11pm that only
the first 300 bytes were getting loaded into memory).

I met Andrew Kelley, who had recently written an NES emulator that translated
the code to LLVM (if I remember correctly).

I briefly met Filippo Valsorda. I think he was the person who, after I said I
wanted to make a bitcoin arbitrage bot, immediately came up with putting the
prices into a matrix and then computing optimal paths.

The place just seemed.. magical. I have a family and kids in Canada but i
thought a long time about trying to reorient my life to NYC for a bit. It was
just so cool.

It reminded me of the joy of learning for the sake of learning, hacking stuff
because it feels like a magic superpower to bring ideas to life, and the joy
of building things you have no intention on trying to make money off of.

I'm certain none of those people remember me but I remember the evening fondly
whenever I see any of them show up on HN. Cheers :)

~~~
forgotmypw17
I have a lot of respect for Recurse Center, and am a fan of the work that
comes out of it. However, I feel that it is a very exclusive place, and if you
are not good enough, you're just not welcome.

I applied for an internship there a couple years ago. After writing an essay
on why I want to do RC, and an interview where, IIRC, I shared my camera but
my interviewers did not, I was rejected with a boilerplate "not for us"
explanation, and that was that.

I would've felt much better about it if they had at least invited me to visit,
but this experience gave it a very corporate and impersonal feel.

~~~
4gotunameagain
To be honest, elitism in such places keeps them awesome. Not everything is for
everyone, despite how much modern movements want to believe so.

~~~
forgotmypw17
I think the strategy you describe works to an extent, but eventually leads to
echo chamber effect.

I think, perhaps arrogantly, that RC would have benefited from my pollination.

We'll both be fine on our own, too.

------
mkchoi212
Wow, I've been using sqlite all this time without knowing how the internals of
it actual work. Would love more insight on its memory usage and how it handles
memory pressure, etc

------
jokoon
Btree are not trivial data structures, and you generally don't have access to
a btree software without a database engine.

~~~
alexhutcheson
If you use C++ then you can use the Abseil B-tree containers:
[https://abseil.io/blog/20190812-btree](https://abseil.io/blog/20190812-btree)

------
seemslegit
Quite well actually

