Hacker News new | past | comments | ask | show | jobs | submit login

For context, SQLite4 explored reimplementing SQLite using a key-value store on log-structured merge trees, like RocksDB and Cassandra.

I'd be interested to hear why they stopped. Presumably reimplementing SQL on a KV store was seen as not worth it, when applications that are satisfied with an embedded KV store backend (which is much faster and simpler to write!) already have many options.

I've had the chance to hear Richard Hipp talk about SQLite yesterday! He mentioned that the LSM tree storage engine is available as an extension to sqlite3. More specifically, he mentioned that he didn't really get the performance improvements he had hoped for, for insertion-heavy use cases.

I think part of this is because of a fundamental limitation of sqlite that it's an embedded database that has to persist data on disk at all times: The design of LSM trees works well with databases with a resident in-memory component because it's an approximation of just dumping every new thing you see at the end of an unordered in-memory array. This is as opposed to a data structure like a b-tree where you have to /find/ exactly where to put the data first, and then put it there. This finding bit means you're doing a lot of random access in memory, which is thrashing all of your caches (CPU / disk etc). LSM trees avoid this thrashing by just dumping stuff at the end of an array. However this means you have to scan that array to do lookups (as opposed to something easier like binary search). Then as your array gets big, you merge and flush it down to a lower "layer" of the lsm tree which is slightly bigger and sorted. And when that one fills, you flush further. And these merge-flushes are nice big sequential writes so that's nice too.

Anyway, with SQLite, the highest layer of your LSM tree would probably (this is conjecture) have to be on disk because of the way that there is no server component, versus in an in-memory system it'd probably be in your L2/L3 cache or at least your main memory. So this could be one reason why that model didn't work out as well for them.

I'm jealous you got to hear Dr. Hipp, that sounds cool. Would love to hear more about the circumstances :)

Regarding the LSM engine, you can find all the relevant implementation details here: https://sqlite.org/src4/doc/trunk/www/lsm.wiki#summary

> The in-memory tree is an append-only red-black tree structure used to stage user data that has not yet flushed into the database file by the system. Under normal circumstances, the in-memory tree is not allowed to grow very large.

I stumbled on this video presentation by Dr Hipp recently (may have been from another HN comment). I really enjoyed it. Probably because he seems so enthusiastic and passionate.

SQLite: The Database at the Edge of the Network with Dr. Richard Hipp


I got lucky! It was a class event, and he's known to speak at databases / data systems classes. He's a very fun (and opinionated!) speaker.

> The in-memory tree is an append-only red-black tree structure used to stage user data that has not yet flushed into the database file by the system.

Hmm, ok, so this contradicts my assumption. Actually, now that I think about it, other LSMs like rocksdb / leveldb work like this too (library-like model with some in-memory component when you "open" the database).

Anyway, without diving into the details of the code, there's other technical decisions that would affect this stuff.

One thing is how big your in-memory structure is (in relation to available memory and insertion workload) and how often you flush to disk is a key thing. Another thing is what your LSM tree looks like - aside from the data structure, how many tiers/levels you have is a big thing. I assume some of these are configurable parameters. E.g. rocksdb has an enormous set of parameters that handles this stuff. It's also annoying to tune.

I found this benchmark here that is illustrative: https://sqlite.org/src4/doc/trunk/www/lsmperf.wiki

The first graph is underwhelming, but when you adjust the buffers (look at the last graph) ~250k writes / second constant regardless of database size (this is why you want an LSM tree) is darned good! And this is on a spinny drive, not an SSD. Their "large" buffer sizes aren't even that large IMHO.

So maybe his mention that the LSM storage was underwhelming was overblown :-) I don't know.

Another difference is with other LSM-based systems that aren't just key-value, it's usually in the context of column stores: you keep a separate LSM for each column family (could be 1-n columns). But I can't think off the top of my head how this would cause a difference. Perhaps in how reads happen - the query engines work quite differently.

Anyway, my talk is cheap, I'm just guessing here, actually doing the analysis is the hard work :-) Also, I'm something of an amateur currently, so take my words with a grain of salt. Anyone else have any ideas re: this?

This is the other perf graph: https://sqlite.org/src4/wiki?name=LsmPerformance

What would be performance for SQLite3 in comparable scenarios? I don't see anything comparative on that benchmarks page.

I've gotten it to do around 40-50k inserts / sec but that's a different scenario - nfs drive, different table and indexes, different queries, different configuration, etc etc. Also I didn't know if he meant that the inserts were disappointing or the overall results were (e.g. a suite of tests including reads / writes of all sorts).

I just want to hijack this thread to say hipps other creation Fossil SCM is a great SCM. Better than git imo. Everyone should check it out.

We've been using it for the last 3/4 years. It's great and really user friendly, with integrated help and a web interface - everything in a single binary. The trouble with it is that it gets slower once you have a lot of history and many files. I don't know if you can use it for huge things like the Linux kernel or the FreeBSD ports tree. I once tried to import the ports tree into fossil and gave up after 2G and an hour. It will import anything that can do git fast-export. Now it also imports svn dumps as well. Fossil is a very good replacement for SVN. You can set up a central repo where everyone syncs on commit and update.

To be fair, Fossil's intended use case is the exact opposite of the Linux kernel. See 3.3 and 3.4 of the fossil vs git page[0].

[0] https://www.fossil-scm.org/xfer/doc/trunk/www/fossil-v-git.w...

I find most of the differences listed there contrived.

One big difference is that fossil includes wiki and ticketing.

Philosophy differs: Fossil intentionally limits "distributedness". For example, fossil push/pull always transfers all branches with their name. Private branches are considered an anti-feature.

Minor differences are the licence (GPL vs BSD) and the data store (files vs sqlite). Under some circumstances these details matter, but not for the majority of developers.

The rest is not significant, imho. For example, "Lots of little tools vs stand-alone executable". Who cares? In both cases you type "$VCS $SUBCOMMAND $PARAMETERS" commands.

You're right, philosophy differs. I generally dislike private branches. It goes back to the origin of git -- intended for the linux kernel, one of the most widely used open source projects with tens of thousands of contributors. Linus doesn't want or need to see a million private branches. None of the projects I work on are of that scale. When your team is under a dozen people, being able to see what your coworkers are playing with in real time (autosync) is actually incredibly useful.

Stand-alone executable is pretty significant. Git is available on most servers -- fossil is not. If it's packaged in your OS, it's often outdated. Stand-alone kinda makes up for this as you can easily get the latest version with a wget & chmod on any computer, on all 3 platforms.

As for sqlite, it is an astoundingly solid rdbms that is well battle-tested. I consider that a big difference.

Why does everybody persist in calling the great man Dr.?

He refers to himself as D. Richard Hipp.

D is an initial but Dr. D. Richard Hipp has a PhD from Duke - graduated in 1992 (his thesis can be found here - its worth a read: http://bit.ly/2ygiDWx ) hence the Dr.

Thank you. Always thought it was a weird comprehension error from the slightly unusual character-combo.

The weird error may have been mine.


I did something like that - LSM backend for OO-relational DBMS (inhouse).

I also hoped for big win on the insertion-heavy loads, and I also haven't succeed in that. The problem is that every insert statement must read back something from DB to verify DB state against schema for correctness. As reads in LSM are slower, the net win is either absent or negligible. I have to say I wrote "must" in sentence above because you sometimes can get away without reading back, but not always. In the end, worst case scenario is always "read and write", not just "write".


I devised a scheme to lay out layers' data so that they are as contiguous as they can be. Or get a very good approximation to that contiguousness, basically (O(1) "pages" per level). Thus contiguous reads got very high performance and beat old storage on read scheme, despite the need of level merging, etc.

Perhaps I'm misunderstanding you, but sqlite supports ramdisks/"In-memory databases"


Perhaps the key (ha!, pun) is that you're talking about using RAM _and_ disk with the RAM being for caching/fast access that eventually hits the disk. Whereas, I think, in this case sqlite is either on the disk, or in RAM. There is no multiple tiers.

Correct me if I'm off here. Thanks.

> Perhaps the key (ha!, pun) is that you're talking about using RAM _and_ disk with the RAM being for caching/fast access that eventually hits the disk.

This is closer to what I mean but not quite. Specifically, in-memory (or main-memory) is a technical term that talks about a specific type of database that focuses on doing operations /mostly/ in memory, with some spillover to disk as necessary, but only as an edge case. You usually then handle persistence with a sequential transaction log, perhaps in NVDIMM storage if you have some. This is in contrast to other systems where the memory is a buffer, but the "actual stuff" happens on disk. There is of course many hybrid schemes - it's really more of a spectrum. Some examples: SAP HANA, ArangoDB, MapD, MemSQL, MS SQL Server Hekaton.

There's also a lot of techniques that come with this style of thinking - often these are column stores, and often since these databases don't really hit disk, the bottleneck is moving data in and out of memory to the CPU caches, and sometimes just CPU speed. Often these use lightweight compression and SIMD instructions to tackle those problems.

So SQLite's "in-memory" database scheme doesn't quite count as something like this, it's more of a disposable database. But that's ok - it's not bad, it's just a different thing.

> Whereas, I think, in this case sqlite is either on the disk, or in RAM. There is no multiple tiers.

See the thread above - I kind of assumed this was the case, but like you mentioned, it looks like with the LSM implementation it does kind of sort of have a memory component - it lasts until you close the database "connection", and gets flushed often.

Actually in WAL mode with `PRAGMA synchronous = NORMAL`, it only flushes to disk periodically, not after every transaction.

According to a presentation Richard Hipp did at Ohio State, they just didn't get the same level of performance from a log-structured merge vs. the b-tree in SQLite 3. If I remember correctly, while log structured merges where able to accumulate in memory actual to consolidate writes there was still a number of reads required to support joins etc. So they just determined based upon their testing that the b-tree based code was just faster. I don't recall the exact benchmarks, it is probably on the web somewhere.

Everything I hear about SQLite3 always suggests that, essentially, it is considered "done". It does what it is supposed to do with great performance. There is nothing major left to do. If it doesn't meet your needs, pick a different SQL database.

Which, while a totally alien concept in the modern software world, is actually a pretty cool thought.

(I'm sure under the hood bugs are getting fixed and all)

Check out the changelog...they've added a ton of new stuff. Native json support, completely new full-text search extension, lsm key/value extension, performance improvements. I think they're looking at some changes to locking in the near future as well. Lots of stuff to find if you look.

Done? Almost every .0 version I checked in https://www.sqlite.org/chronology.html saw some new features.

Well, obviously they should rewrite it in Rust.

Actually having an alternative implemention would be terrific.

Look how Open Office development charged after the introduction of Go OO, or MySQL after MariaDB, or more generally the entire smartphone market after the introduction of the iPhone. Competition is good.

That said, Dr. Hipp is known to be passionate about SQLite. In my previous examples the software was pretty much neglected before the competition came, which is certainly not the case with SQLite.

From the canonical source (as in, [Richard Hipp](https://en.wikipedia.org/wiki/D._Richard_Hipp)):

Rewriting SQLite in Rust, or some other trendy “safe” language, would not help. In fact it might hurt.

Prof. Regehr did not find problems with SQLite. He found constructs in the SQLite source code which under a strict reading of the C standards have “undefined behaviour”, which means that the compiler can generate whatever machine code it wants without it being called a compiler bug. That’s an important finding. But as it happens, no modern compilers that we know of actually interpret any of the SQLite source code in an unexpected or harmful way. We know this, because we have tested the SQLite machine code – every single instruction – using many different compilers, on many different CPU architectures and operating systems and with many different compile-time options. So there is nothing wrong with the sqlite3.so or sqlite3.dylib or winsqlite3.dll library that is happily running on your computer. Those files contain no source code, and hence no UB.

The point of Prof. Regehr’s post (as I understand it) is the the C programming language as evolved to contain such byzantine rules that even experts find it difficult to write complex programs that do not contain UB.

The rules of rust are less byzantine (so far – give it time :-)) and so in theory it should be easier to write programs in rust that do not contain UB. That’s all well and good. But it does not relieve the programmer of the responsibility of testing the machine code to make sure it really does work as intended. The rust compiler contains bugs. (I don’t know what they are but I feel sure there must be some.) Some well-formed rust programs will generate machine code that behaves differently from what the programmer expected. In the case of rust we get to call these “compiler bugs” whereas in the C-language world such occurrences are more often labeled “undefined behavior”. But whatever you call it, the outcome is the same: the program does not work. And the only way to find these problems is to thoroughly test the actual machine code.

And that is where rust falls down. Because it is a newer language, it does not have (afaik) tools like gcov that are so helpful for doing machine-code testing. Nor are there multiple independently-developed rust compilers for diversity testing. Perhaps that situation will change as rust becomes more popular, but that is the situation for now.


Even if Rust is at parity with C in terms of tooling and ecosystem (which it probably will be in a few short years), SQLite is probably not high on the list of software to rewrite in Rust, given its fairly high quality. Let's rewrite the vulnerability-ridden ones first.

> it does not have (afaik) tools like gcov

You can just use C tooling with Rust, generally; I know kcov has worked for a long time, but it looks like easy gcov support does exist: https://github.com/kennytm/cov

Rewriting in Rust would not make any sense, sure. But rewriting in a faster and safer language, like Pony or ATS might, esp. for lots of CPU's.

Although i get the joke, it would be a fun and interesting experiment wouldn't it ? Espacially since the sqlite test suite is so exhaustive.

There is a version rewritten in C#, back in the day - not sure how up to date is:


Because we’ve missed the Rust evangelism strike force so very much.

I honestly hope against hope that you’re being sarcastic.

SQLite is one of my favorite pieces of software for this exact reason, that it is actually a more or less "finished" program instead of a mire of shifting design requirements and constant security updates, a comfortably static and unchanging object against the chaotic backdrop of modern software development.

There is also a project called UnQLite for UnQL (unstructured query language) designed to be an embedded document store, announced in 2011 [1]. Looks like it's now maintained by another company. [2]

1. https://www.infoq.com/news/2011/08/UnQL

2. https://unqlite.org/

Would have been a neat way to experiment around with putting a SQL front end on various KV interfaces. Redis or etcd, for example.

CockroachDB has a good blog post[0] that describes how they implemented SQL. (CockroachDB is a key-value store.)

[0] https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mappin...

>(CockroachDB is a key-value store.)

Correction: CockroachDB is based on a KV store, it's a full SQL RDBMS on top of this KV store.

Yes, that note was misleading. Thanks for clarifying :)

Presto has connectors for various types of non-SQL databases including Redis: https://prestodb.io/docs/current/connector/redis.html

Presto is a distributed SQL query engine for big data, so basically the complete opposite of SQLite, though it often gets used in federation scenarios, as does SQLite.

An interesting anecdote is that the team working on what would become osquery (https://osquery.io/) asked if they could reuse the SQL parser from Presto. We get that question a lot, and after explaining that the parser is the easy part (semantic analysis and execution is the real work), I determined that what they really wanted were SQLite virtual tables: https://sqlite.org/vtab.html (and those worked out great for them)

Fairly straightforward to make a mysql storage engine for it, e.g. https://github.com/AALEKH/ReEngine

Scaevolus, I recognise that name. #0x10c-dev on freenode? rmmh? skybot?

I created a logsql.py plugin for skybot, it just logs to the DB instead of to text files. Is that something you'd be interested in merging back in?

Developer of some optimization mods for Minecraft, one of which became OptiFine.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact