I'd be interested to hear why they stopped. Presumably reimplementing SQL on a KV store was seen as not worth it, when applications that are satisfied with an embedded KV store backend (which is much faster and simpler to write!) already have many options.
I think part of this is because of a fundamental limitation of sqlite that it's an embedded database that has to persist data on disk at all times: The design of LSM trees works well with databases with a resident in-memory component because it's an approximation of just dumping every new thing you see at the end of an unordered in-memory array. This is as opposed to a data structure like a b-tree where you have to /find/ exactly where to put the data first, and then put it there. This finding bit means you're doing a lot of random access in memory, which is thrashing all of your caches (CPU / disk etc). LSM trees avoid this thrashing by just dumping stuff at the end of an array. However this means you have to scan that array to do lookups (as opposed to something easier like binary search). Then as your array gets big, you merge and flush it down to a lower "layer" of the lsm tree which is slightly bigger and sorted. And when that one fills, you flush further. And these merge-flushes are nice big sequential writes so that's nice too.
Anyway, with SQLite, the highest layer of your LSM tree would probably (this is conjecture) have to be on disk because of the way that there is no server component, versus in an in-memory system it'd probably be in your L2/L3 cache or at least your main memory. So this could be one reason why that model didn't work out as well for them.
Regarding the LSM engine, you can find all the relevant implementation details here: https://sqlite.org/src4/doc/trunk/www/lsm.wiki#summary
> The in-memory tree is an append-only red-black tree structure used to stage user data that has not yet flushed into the database file by the system. Under normal circumstances, the in-memory tree is not allowed to grow very large.
SQLite: The Database at the Edge of the Network with Dr. Richard Hipp
> The in-memory tree is an append-only red-black tree structure used to stage user data that has not yet flushed into the database file by the system.
Hmm, ok, so this contradicts my assumption. Actually, now that I think about it, other LSMs like rocksdb / leveldb work like this too (library-like model with some in-memory component when you "open" the database).
Anyway, without diving into the details of the code, there's other technical decisions that would affect this stuff.
One thing is how big your in-memory structure is (in relation to available memory and insertion workload) and how often you flush to disk is a key thing. Another thing is what your LSM tree looks like - aside from the data structure, how many tiers/levels you have is a big thing. I assume some of these are configurable parameters. E.g. rocksdb has an enormous set of parameters that handles this stuff. It's also annoying to tune.
I found this benchmark here that is illustrative: https://sqlite.org/src4/doc/trunk/www/lsmperf.wiki
The first graph is underwhelming, but when you adjust the buffers (look at the last graph) ~250k writes / second constant regardless of database size (this is why you want an LSM tree) is darned good! And this is on a spinny drive, not an SSD. Their "large" buffer sizes aren't even that large IMHO.
So maybe his mention that the LSM storage was underwhelming was overblown :-) I don't know.
Another difference is with other LSM-based systems that aren't just key-value, it's usually in the context of column stores: you keep a separate LSM for each column family (could be 1-n columns). But I can't think off the top of my head how this would cause a difference. Perhaps in how reads happen - the query engines work quite differently.
Anyway, my talk is cheap, I'm just guessing here, actually doing the analysis is the hard work :-) Also, I'm something of an amateur currently, so take my words with a grain of salt. Anyone else have any ideas re: this?
One big difference is that fossil includes wiki and ticketing.
Philosophy differs: Fossil intentionally limits "distributedness". For example, fossil push/pull always transfers all branches with their name. Private branches are considered an anti-feature.
Minor differences are the licence (GPL vs BSD) and the data store (files vs sqlite). Under some circumstances these details matter, but not for the majority of developers.
The rest is not significant, imho. For example, "Lots of little tools vs stand-alone executable". Who cares? In both cases you type "$VCS $SUBCOMMAND $PARAMETERS" commands.
Stand-alone executable is pretty significant. Git is available on most servers -- fossil is not. If it's packaged in your OS, it's often outdated. Stand-alone kinda makes up for this as you can easily get the latest version with a wget & chmod on any computer, on all 3 platforms.
As for sqlite, it is an astoundingly solid rdbms that is well battle-tested. I consider that a big difference.
He refers to himself as D. Richard Hipp.
The weird error may have been mine.
I also hoped for big win on the insertion-heavy loads, and I also haven't succeed in that. The problem is that every insert statement must read back something from DB to verify DB state against schema for correctness. As reads in LSM are slower, the net win is either absent or negligible. I have to say I wrote "must" in sentence above because you sometimes can get away without reading back, but not always. In the end, worst case scenario is always "read and write", not just "write".
I devised a scheme to lay out layers' data so that they are as contiguous as they can be. Or get a very good approximation to that contiguousness, basically (O(1) "pages" per level). Thus contiguous reads got very high performance and beat old storage on read scheme, despite the need of level merging, etc.
Perhaps the key (ha!, pun) is that you're talking about using RAM _and_ disk with the RAM being for caching/fast access that eventually hits the disk. Whereas, I think, in this case sqlite is either on the disk, or in RAM. There is no multiple tiers.
Correct me if I'm off here. Thanks.
This is closer to what I mean but not quite. Specifically, in-memory (or main-memory) is a technical term that talks about a specific type of database that focuses on doing operations /mostly/ in memory, with some spillover to disk as necessary, but only as an edge case. You usually then handle persistence with a sequential transaction log, perhaps in NVDIMM storage if you have some. This is in contrast to other systems where the memory is a buffer, but the "actual stuff" happens on disk. There is of course many hybrid schemes - it's really more of a spectrum. Some examples: SAP HANA, ArangoDB, MapD, MemSQL, MS SQL Server Hekaton.
There's also a lot of techniques that come with this style of thinking - often these are column stores, and often since these databases don't really hit disk, the bottleneck is moving data in and out of memory to the CPU caches, and sometimes just CPU speed. Often these use lightweight compression and SIMD instructions to tackle those problems.
So SQLite's "in-memory" database scheme doesn't quite count as something like this, it's more of a disposable database. But that's ok - it's not bad, it's just a different thing.
> Whereas, I think, in this case sqlite is either on the disk, or in RAM. There is no multiple tiers.
See the thread above - I kind of assumed this was the case, but like you mentioned, it looks like with the LSM implementation it does kind of sort of have a memory component - it lasts until you close the database "connection", and gets flushed often.
Which, while a totally alien concept in the modern software world, is actually a pretty cool thought.
(I'm sure under the hood bugs are getting fixed and all)
Look how Open Office development charged after the introduction of Go OO, or MySQL after MariaDB, or more generally the entire smartphone market after the introduction of the iPhone. Competition is good.
That said, Dr. Hipp is known to be passionate about SQLite. In my previous examples the software was pretty much neglected before the competition came, which is certainly not the case with SQLite.
Rewriting SQLite in Rust, or some other trendy “safe” language, would not help. In fact it might hurt.
Prof. Regehr did not find problems with SQLite. He found constructs in the SQLite source code which under a strict reading of the C standards have “undefined behaviour”, which means that the compiler can generate whatever machine code it wants without it being called a compiler bug. That’s an important finding. But as it happens, no modern compilers that we know of actually interpret any of the SQLite source code in an unexpected or harmful way. We know this, because we have tested the SQLite machine code – every single instruction – using many different compilers, on many different CPU architectures and operating systems and with many different compile-time options. So there is nothing wrong with the sqlite3.so or sqlite3.dylib or winsqlite3.dll library that is happily running on your computer. Those files contain no source code, and hence no UB.
The point of Prof. Regehr’s post (as I understand it) is the the C programming language as evolved to contain such byzantine rules that even experts find it difficult to write complex programs that do not contain UB.
The rules of rust are less byzantine (so far – give it time :-)) and so in theory it should be easier to write programs in rust that do not contain UB. That’s all well and good. But it does not relieve the programmer of the responsibility of testing the machine code to make sure it really does work as intended. The rust compiler contains bugs. (I don’t know what they are but I feel sure there must be some.) Some well-formed rust programs will generate machine code that behaves differently from what the programmer expected. In the case of rust we get to call these “compiler bugs” whereas in the C-language world such occurrences are more often labeled “undefined behavior”. But whatever you call it, the outcome is the same: the program does not work. And the only way to find these problems is to thoroughly test the actual machine code.
And that is where rust falls down. Because it is a newer language, it does not have (afaik) tools like gcov that are so helpful for doing machine-code testing. Nor are there multiple independently-developed rust compilers for diversity testing. Perhaps that situation will change as rust becomes more popular, but that is the situation for now.
You can just use C tooling with Rust, generally; I know kcov has worked for a long time, but it looks like easy gcov support does exist: https://github.com/kennytm/cov
I honestly hope against hope that you’re being sarcastic.
Correction: CockroachDB is based on a KV store, it's a full SQL RDBMS on top of this KV store.
Presto is a distributed SQL query engine for big data, so basically the complete opposite of SQLite, though it often gets used in federation scenarios, as does SQLite.
An interesting anecdote is that the team working on what would become osquery (https://osquery.io/) asked if they could reuse the SQL parser from Presto. We get that question a lot, and after explaining that the parser is the easy part (semantic analysis and execution is the real work), I determined that what they really wanted were SQLite virtual tables: https://sqlite.org/vtab.html (and those worked out great for them)
I created a logsql.py plugin for skybot, it just logs to the DB instead of to text files. Is that something you'd be interested in merging back in?