Besides that, the comprehensive testing and evaluation of SQLite never ceases to amaze me. I'm usually hesitant to call software development "engineering", but SQLite is definitely well-engineered.
The SQLite project has a mind-boggling 711 times more test code than SQLite itself. Put another way, only 0.1% of the project's code is SQLite itself. The other 99.9% consists of tests for that 0.1%.
1/711 = 0.00140646976
0.00140646976 ~= 0.14%, not 0.1%.
I'll go put on my "pedant" hat now.
Although it was more a typo than a math error (as the 99.9% figure was correct). I'll chalk it up to posting when I should be sleeping (4:30am local time).
A lot of material I've seen has been like the classic image of "How to draw an owl. First draw two circles, then draw the rest of the owl", where they tell you the super basic stuff, then assume you know everything.
Having an understanding of relational algebra (not the symbols, but the concept; math is always about the concept) generally helps a lot as well; it's the reason why compsci database lectures often start with relational algebra.
> We would like to find the total weight of cats grouped by age. But only return those groups with a total weight larger than 12.
The total weight of cats grouped by age.
SELECT sum(weight), age
FROM cats
GROUP BY age

SELECT sum(weight), age
FROM cats
GROUP BY age
HAVING sum(weight) > 12

SELECT sum(weight), age
FROM cats
GROUP BY age
HAVING sum(weight) > 12
ORDER BY age

SELECT sum(weight) AS total_weight, age
FROM cats
GROUP BY age
HAVING sum(weight) > 12
ORDER BY age
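For a concrete check, here's the final query run against a tiny made-up cats table (using Python's built-in sqlite3; all names, ages, and weights are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cats (name TEXT, age INTEGER, weight REAL)")
db.executemany("INSERT INTO cats VALUES (?, ?, ?)", [
    ("Mia", 1, 4.0), ("Felix", 1, 9.0),   # age 1: total 13.0 -> kept
    ("Tom", 2, 5.0),                      # age 2: total 5.0  -> filtered out by HAVING
    ("Bella", 3, 7.0), ("Max", 3, 8.0),   # age 3: total 15.0 -> kept
])
rows = db.execute("""
    SELECT sum(weight) AS total_weight, age
    FROM cats
    GROUP BY age
    HAVING sum(weight) > 12
    ORDER BY age
""").fetchall()
print(rows)  # [(13.0, 1), (15.0, 3)]
```

Note that HAVING filters whole groups after aggregation, which is why the age-2 group disappears entirely.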
I can also recommend Learning SQL by Alan Beaulieu, which was recommended in some HN comment. After looking into various books, some of them Postgres specific, I found this one to be the best for learning from scratch.
I created an issue for this here: https://github.com/ibis-project/ibis/issues/1597
Considering it was written for the U.S. Navy to run software on guided missile destroyers, that makes sense.
That said, I don't really understand this one: "The IN-early-out optimization: When doing a look-up on a multi-column index and an IN operator is used on a column other than the left-most column, then if no rows match against the first IN value, check to make sure there exist rows that match the columns to the right before continuing with the next IN value." I would think there's no need to check the right column(s) if the leftmost one has no match...
SELECT * FROM tab WHERE key1=1 AND key2 IN (2,3,4,5);
The insight here is that a multi-column key value can be resolved using a single binary search. It is not a sequence thing where we first look for the key1=1 and then do a separate lookup in a subtree for key2. Both key1 and key2 are resolved in the same binary search.
The insight is very interesting. I always thought there would be 2 separate binary searches...
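A toy Python sketch of that single-binary-search idea (the index contents here are made up): composite keys compare lexicographically like tuples, so one bisect positions the cursor on both columns at once.

```python
import bisect

# Hypothetical multi-column index over (key1, key2), stored sorted.
# Tuple comparison means one binary search resolves both columns together.
index = sorted([(1, 2), (1, 5), (1, 9), (2, 3), (3, 1)])

def seek(key1, key2):
    """Position on the first entry >= (key1, key2) with a single binary search."""
    pos = bisect.bisect_left(index, (key1, key2))
    return index[pos] if pos < len(index) else None

# key1=1 AND key2 IN (2,3,4,5): one seek per IN value, but each seek
# constrains both columns simultaneously rather than in two passes.
print(seek(1, 2))  # (1, 2) -- exact match
print(seek(1, 3))  # (1, 5) -- no exact match; cursor lands just past it
```

This is only a sketch of the comparison trick, not SQLite's actual B-tree code.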
Unsurprisingly, they were wrong. It was small, but sqlite couldn't handle multiple users. I believe this was before sqlite had WAL support, so reading would lock the DB. The 'solution' was to split the sqlite database into many smaller DBs that would allow users to use the site at the same time as long as they were in different areas. This greatly added to the complexity of the application. Some reports needed to access multiple databases to get what they needed, so they would still lock people out. Complexity was much higher than having all the data in a single postgresql/mysql database. All the users hated the system and often ran into DB lock issues.
> sqlite explicitly says that it is not meant for client-server configuration
They are right and their advice should be heeded.
Basically a Redis module that embeds SQLite. I offer replication in the PRO version; if you want to try it out, you can download the trial version for free.
If the trial version is not enough, send me an email and we could work something out ;)
name, ...., version = 1
name, ...., version = 2
and sync manually. Now I use a log/event-sourcing-like setup, where every write to the DB is also stored in a log table, and I sync from there. Much better!
...which is itself a special case of an audit trail or log: https://en.wikipedia.org/wiki/Audit_trail
The issue with any application-based approach is of course that it can't provide total access mediation.
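A minimal sketch of such a log-table setup using SQLite triggers (all table and column names here are invented; a full version would also mirror UPDATE and DELETE with their own triggers):

```python
import sqlite3

# Every write to the main table is mirrored into an append-only change_log
# by a trigger; a sync process then reads forward from the last seq it shipped.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE change_log (
        seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic sync cursor
        op      TEXT NOT NULL,                      -- 'insert' (update/delete omitted here)
        item_id INTEGER,
        name    TEXT
    );
    CREATE TRIGGER items_ai AFTER INSERT ON items BEGIN
        INSERT INTO change_log(op, item_id, name)
        VALUES ('insert', NEW.id, NEW.name);
    END;
""")
db.execute("INSERT INTO items(name) VALUES ('cat food')")
rows = db.execute("SELECT seq, op, item_id, name FROM change_log").fetchall()
print(rows)  # [(1, 'insert', 1, 'cat food')]
```

The monotonically increasing `seq` is what makes resumable, incremental sync straightforward.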
I love relational databases, but you're almost forced into a NoSQL approach when developing a SPA since the client (browser) only supports simple key -> value storage. It would be a dream to use LINQ-to-SQL, or similar type safe query DSLs like Slick or Quill (Scala), or Esqueleto (Haskell) in the browser.
Combine that with a single language driving the backend and frontend and voila, no duplication of model, validation, etc. layers on server and client.
One can dream I guess, but the reality is NoSQL fits the modern web app like a glove, for better or worse.
That almost happened. There was a thing called WebSQL that the W3C was working on to add SQL to the browser. Everyone who implemented it used SQLite.
Apparently, that disqualified it from standardization. To move ahead, they wanted to see independent implementations of the standard.
No browser makers stepped up to reduce the quality of their implementation by replacing some of the best designed, best written, best tested code on the planet with some other SQL back end to satisfy the committee, and so Mozilla was able to push IndexedDB as the standard browser DB interface.
They had to, since what was standardized was specifically the SQL dialect of SQLite v3.6.19.
> No browser makers stepped up to reduce the quality of their implementation by replacing some of the best designed, best written, best tested code on the planet with some other SQL back end to satisfy the committee
There were only two implementations at all: WebKit and Opera. Mozilla and Microsoft weren't going to implement it without a spec decoupled from a particular backend.
The W3C dropped it in November 2010 because they were uncomfortable supporting a web standard which was entirely tied to a single implementation of SQL (SQLite). I adore SQLite but I think the W3C made the right call there - web standards need more than one compatible implementation.
could you elaborate on that? I get the impression that now, eight years later, we're much worse off. I don't really see the upside of not going for what is, AFAIK, a relatively vanilla implementation of SQL.
I don't have a particularly strong practical argument here: for me this is more of a fundamental principle for how the web should work.
SQLite is open source that would have been compiled into the browser by the browser vendor. A typical user would not have the means to update it separately from the rest of the browser.
So why couldn't W3C pick a particular release of SQLite, list a specific subset of its features, and declare that this, as described in the documentation for that specific version, is the standard?
Seems like a similar situation to WebGL. I think they would have looked past implementation count if the feature was really needed.
I have used this and it is slow. But it was interesting!
I really prefer that syntax and I’m always a little sad when I have to copy-paste windows across average, total, count, standard deviation, max, min… I fully admit that it’s syntactic sugar but it’s the elegant kind.
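If this refers to the named-WINDOW syntax (available in SQLite 3.25+), defining the window once avoids that copy-paste across aggregates; a quick sketch via Python's sqlite3:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Define the window once as "w" and reuse it for every aggregate,
# instead of repeating the OVER (...) clause per function.
rows = db.execute("""
    WITH t(x) AS (VALUES (1), (2), (3))
    SELECT sum(x) OVER w, avg(x) OVER w, count(x) OVER w
    FROM t
    WINDOW w AS (ORDER BY x)
""").fetchall()
print(rows)  # [(1, 1.0, 1), (3, 1.5, 2), (6, 2.0, 3)]
```

The default frame for an ORDER BY window is a running prefix, which is why the sums accumulate row by row.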
Two things holding this back: 1. code being inside or outside sqlite would not be much different. 2. the amount of additional test code would be humongous for an operation of that level of complexity.
So, just do it yourself. It's not that hard.
Good database migration tools (e.g. Alembic) do this automatically.
When you wrote "backfill the remaining rows incrementally", did you mean having some background process or thread that reads a batch of rows from the old table (for example a few thousand rows), then inserts them into the new table, then commits, and keeps doing this until all rows are copied? This way, other writers will be blocked only for the duration of each incremental transaction and will have an opportunity to lock the database for themselves between two batches?
Do you have experience doing this on production servers?
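That batched backfill could be sketched roughly like this (a Python sketch with invented table and column names; a real migration would also have to handle rows written to the old table while the copy is in flight):

```python
import sqlite3

BATCH = 1000  # rows copied per transaction

def backfill(db):
    """Copy old_table into new_table in short transactions, so other
    writers can grab the database lock between batches."""
    last_id = 0
    while True:
        with db:  # commits at the end of each batch
            rows = db.execute(
                "SELECT id, payload FROM old_table"
                " WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, BATCH),
            ).fetchall()
            if not rows:
                break
            db.executemany(
                "INSERT INTO new_table(id, payload) VALUES (?, ?)", rows)
            last_id = rows[-1][0]  # resume point for the next batch

# Demo setup with made-up data:
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE old_table (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE new_table (id INTEGER PRIMARY KEY, payload TEXT);
""")
db.executemany("INSERT INTO old_table(payload) VALUES (?)",
               [("row %d" % i,) for i in range(10)])
backfill(db)
```

Keying the resume point off the primary key (rather than OFFSET) keeps each batch a cheap index seek.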
Someone else already mentioned that the vast majority of SQLite's codebase is tests. Well, on top of that, of the real "working" codebase I'd say the majority of it is comments. It's incredible. The source is more book than code. If you have any when, why, or how question about SQLite, I guarantee it's answered in the code comments (or at least one of the hundreds of superb documents on their website).
Another surprise I discovered: SQLite has a virtual machine and its own bytecode. All queries you execute against a SQLite database are compiled into SQLite's own little bytecode and then executed on a VM designed for working with SQLite's database. Go ahead, start `sqlite3 yourdb.sqlite` and then run `explain select * from yourtable;`. It'll dump the bytecode for that statement; or any statement you put after `explain`. So cool!
In hindsight, it makes a lot of sense, and a well built VM can be nearly as efficient as any other alternative.
Fun bit of history. The VM used to be stack based, but now it's register based. I guess they learned the same lessons the rest of the industry learned over that time period :P
(N.B. the bytecode is for internal use only; it's not a public facing API. You should never, ever use bytecode directly yourself.)
There are some painful parts of the codebase though. These aren't "cons" per se. More like necessary evils.
1) It is filled to the brim with backwards compatibility hacks that make the code more complex than it strictly needs to be. (Most of these are the result of various users of the library misusing the API. The SQLite devs are generous enough to grandfather in the bugs that made those applications work. That's excellent, but it definitely makes the code more "crusty".)
2) One of SQLite's big features is its flexible memory subsystem. It handles OOM, and provides an API for completely customizing the memory subsystem. But given that this is C and memory allocation and interaction is pervasive, the code ends up littered with function calls and clauses. Handling OOM is no small task, and often how to handle the OOM is different in different places. So you can imagine the complexity that adds to the codebase.
Again, those are necessary evils, so it's not something I'm "complaining" about. But I thought they were worth mentioning for fellow adventurers like me who decide to dive in (which I highly recommend).
So, thanks to how well designed SQLite is overall, and their great documentation, I was able to write a parser in Rust for the SQLite file format in a handful of hours (https://sqlite.org/fileformat2.html). The file format is surprisingly simple. I'm now writing a Cursor to walk the tables, which is a fun exercise of classic B-Tree algorithms.
They eventually abandoned sqlite4 as an architectural misadventure, but perhaps they should try again with the current codebase/features but with all of the backwards-compatibility cruft jettisoned.
Does this make SQLite Turing-complete, or has it been Turing-complete before?
Does anyone know how to upgrade python3's sqlite module to the latest version?
It even supports user-defined window functions using the new SQLite APIs.
Or if you're feeling lazy (like I was), there's a fork of his library at https://github.com/karlb/pysqlite3 which compiles the 3.25.0 amalgamation by default. This worked for me:
$ python3 -mvirtualenv venv
$ source venv/bin/activate
$ pip install git+git://github.com/karlb/pysqlite3
Installing collected packages: pysqlite3
Successfully installed pysqlite3-0.2.0
Python 3.6.5 (default, Mar 30 2018, 06:41:53)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysqlite3
>>> pysqlite3.connect(":memory:").execute("select sqlite_version()").fetchall()
Or just a benefit?
I've had Python libraries break with updates.
And it works:
sqlite> with t(x) as (values (1), (2)) select sum(x) over (order by x) from t;
Sqlite doesn't need much more memory than what its binary needs, whereas with postgresql you need all sorts of bells and whistles just to get the database system to boot.
Anyway, not trying to make the case that postgresql is small compared to sqlite, it obviously isn't, just wanted to point out that it's not _that_ big either.
Is it possible to dynamically link the system openssl to reduce size?
It is dynamically linked to openssl, but you can configure out the internal code to support the functionality.
"strip" however meant the unix strip command to remove debug symbols.
I assumed the "Lite" meant offline storage.
Still a huge fan; I went from noob programmer to knowing databases because of how quickly I could make and play with databases in sqlite.
- it only has b*-tree indexes
- it only has one index per-table source
I don't know for sure what this means, but it sounds like it is incorrect.
SQLite tries to use indexes only in situations where they help the query run faster. SQLite is not limited in its use of indexes; it's just that using multiple indexes for a single FROM-clause term is rarely helpful.
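EXPLAIN QUERY PLAN makes that behaviour visible: even when two usable indexes exist, a single table source normally ends up using just one (a sketch; the schema and index names are made up).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER);
    CREATE INDEX idx_a ON t(a);
    CREATE INDEX idx_b ON t(b);
""")
# Two equality constraints, two candidate indexes -- the planner picks one
# and applies the other constraint as a filter on the matching rows.
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1 AND b = 2"
).fetchall()
for row in plan:
    print(row[-1])  # e.g. "SEARCH t USING INDEX idx_a (a=?)"
```

The exact wording of the plan text varies by SQLite version, but there is one SEARCH line per table source.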
SQLite has had user-defined functions since forever. https://sqlite.org/c3ref/create_function.html
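For example, through Python's sqlite3 bindings (which wrap sqlite3_create_function), registering a scalar function is a one-liner; the `reverse` function here is my own example, not a SQLite built-in:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Register a user-defined scalar function: name, argument count, callable.
db.create_function("reverse", 1, lambda s: s[::-1])
print(db.execute("SELECT reverse('sqlite')").fetchone()[0])  # 'etilqs'
```

Once registered, the function is usable anywhere an expression is allowed, including WHERE clauses and indexes on expressions.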
Another underappreciated feature (while we're at it) would be WAL instead of undo-journaling (https://www.sqlite.org/wal.html), which enables concurrent reading and writing of SQLite databases. Has been available for some ten years or so, but is off by default.
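Enabling it is a single pragma, and the setting persists in the database file (a sketch via Python's sqlite3; WAL requires a file-backed database, so a temp file is used rather than :memory:):

```python
import os
import sqlite3
import tempfile

# WAL is a property of the database file, not of the connection, so once
# set, later connections to the same file also get WAL mode.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
db = sqlite3.connect(path)
mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # 'wal'
```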
Functions are available through the Python bindings (which are not maintained by the SQLite project); virtual tables, I think, aren't. Alternate bindings (https://rogerbinns.github.io/apsw/) claim to achieve "Everything you can do from the SQLite C API" interop, which would include virtual tables.
Another good example is joins -- they aren't pushed down, and SQLite can only do joins as nested loops. All you can do is stream raw data into the SQLite engine and allow it to perform the join, even if you can do it more efficiently yourself.
In my case I was passing queries to another database engine, so every single thing that could be pushed down to the virtual table provider, I would have been able to take advantage of. But not enough stuff is.
The parent made this statement:
> ... for instance there is no way for `SELECT COUNT(*) FROM table` to do anything but retrieve every field of every row and return them to the SQLite engine.

The only response here is, "Of course! That's what you told it to do! Why would it do anything else, and why would you ever tell it to do that?!"

Edit: To be clear, and not to sound as though I'm just being dismissive: you can of course use `COUNT(*)`, but the parent made it sound like this was something non-trivial. There are better ways to write such a query when you care about the exact count of rows and, assuming you can't depend on grabbing the max primary key due to deletions, you're complaining about the database wasting cycles pulling back irrelevant info. Get to know your database, as it often has features to query such metadata in a much smarter and more performant way.
Now, where SQLite starts to fall down in analytics workloads is that it’s row-oriented rather than column oriented. Performance could be better. Still, even for analytic workloads SQLite can be good enough for medium-sized data!
(Edit to clarify: I think it's awesome for embedded apps. But it never struck me as a good choice for data science.)
For that, SQLite isn’t necessarily bad.
Clever support for multilevel indexes would be at the top of my wishlist.