If all this doesn't convince you to use SQLite, it's also one of the most well documented large C codebases that is fantastic to learn from.
I'd go as far as to say that many modern startups could get pretty far with SQLite + aggressive caching before even bringing in a big database like Postgres (though with recent deployment/ops advancements it's easier than ever to run postgres).
>>> import sqlite3
>>> con = sqlite3.connect(':memory:')
>>> cur = con.cursor()
>>> q = '''select x, sum(a * power(x, e))
... from (select -2 as x union select -1 union select 0 union select 1 union select 2) xt,
... (select 1 as a, 2 as e union select 4, 1 union select 2, 0) et
... group by x'''
>>> cur.execute(q)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
sqlite3.OperationalError: no such function: power
>>> con.create_function('power', 2, lambda base, power: base**power)
>>> cur.execute(q).fetchall()
[(-2, -2), (-1, -1), (0, 2), (1, 7), (2, 14)]
Btw, user-defined scalar functions aren’t remotely as sexy as user-defined window functions.
As you mentioned, it ships in every major browser, most OSes, and a number of extremely popular instances of consumer software. If that's still a "toy" I don't know what isn't…
It's fast, simple, and easy. Perhaps my favorite feature is that backups are just a copy (in fact, when I'm debugging an app, I often copy the database into a temp directory and go to town).
I totally agree that SQLite is a great first option for startups. If your database will fit on your filesystem (gigabytes, not terabytes), it'll probably do really well. It's really easy to switch to something like Postgres later, and it's really nice to not have to worry about the complexities of a database server until you actually need them (e.g. you're considering hiring dedicated IT/DBA staff instead of just developers). And there's a good chance you'll never need anything other than SQLite.
Which is largely why I created RediSQL: to allow more widespread use of SQLite in "modern"/"cloud native" applications where you want to just connect to a database over the network, ideally through a common API like the Redis one.
It has the great benefit that an in-memory database is crazy fast, while retaining the persistence properties of Redis, which are good enough in most cases.
Sorry for the annoyance!
Modern startups need their services to be available, and this is difficult to achieve with SQLite because you can't run long database migrations without blocking writes during the migration. Personally, I'd use SQLite more often if there was a good solution to this (and also DROP COLUMN).
- I'm not sure I agree that the delay of the blocking write from a migration is so long as to cause significant (if any) damage to the business.
- If SQLite makes development and deployment easier (I think it does), this becomes a tradeoff between a deployment downtime/partial operation window likely on the order of milliseconds and hours of time spent setting up/managing the alternatives. With really robust systems like Postgres out there this is a pretty weak argument on my part, but there is at least something there.
- As I've also mentioned elsewhere, a high read-to-write workload would be less affected by this problem. If your startup produces aggregated business analytics dashboards for manufacturing plant stakeholders for example, your data set could be completely static.
- WAL mode improves SQLite's resiliency to concurrent readers/writers.
- In the extreme case, you could treat SQLite as a hybrid row & JSON document store, and manage your own schemas (I do not endorse this approach), but you'd still be saving some time because you can do stuff like make use of SQLite's FTS search features, and you'd have access to other relational features that you might not get with other document stores.
All this said, SQLite is not built for lots of concurrent writers (and they say so themselves), but I think it can absolutely be done. More to the point, I think SQLite can help you get something out and easily deployable, and it's not hard to move off of later.
I think there's room for a lot fewer 9s for most businesses generally, provided you have enough isolation that one thing being down doesn't also take down everything else. The exceptions are mainly ones that are trying to be someone else's infrastructure.
For lots of businesses, especially early on, maintenance windows are entirely fine and perfect bigco rolling deploys to giant HA clusters and such are way overkill that introduce all kinds of costs and complexity (=risk) that could be avoided by just accepting that every now and then some part of your site/service will be down. Just have backups, know how to restore from them quickly, and make sure you have scripted from-scratch deployments that actually work (having developers build them daily or weekly for their own local work does a pretty decent job of sussing out brokenness) and you're probably fine.
[EDIT] but of course that's the opposite of Résumé Driven Development and is very Not Cool and likely to be unpopular with clueless managers.
manager: "Why's our site down?"
you: "a deployment broke, it's fine, we can restore from scratch in ten minutes flat if we have to."
versus, also you: "as you know we follow Google's best practices and use Kubernetes and blah blah blah and you see [translated from Bullshit Speak] we don't actually understand it very well and it's super complex and it shit the bed for some reason but we're fixing it, and as you know this is all best practices and Google like and such as"
Unfortunately the latter is often "safer" than the former... for one's career, not for the product or service you're providing.
And the admin overhead can be reduced by using a managed service (Amazon RDS, Google Cloud SQL, Digital Ocean Managed Databases, etc.). With a managed service, we can even go serverless for the rest of the app. With SQLite, we have to closely manage and backup our server since the data are stored on it.
> I'm not sure I agree that the delay of the blocking write from a migration is so long as to cause significant (if any) damage to the business.
I agree that most migrations, in the early stage of a project, run for a few seconds to a few minutes maximum. But in my experience that's usually enough to be noticeable by our users.
> If SQLite makes development and deployment easier (I think it does)
I agree. That's why I use it sometimes :)
On the same topic, about the “server-process-edition” branch of SQLite: https://news.ycombinator.com/item?id=17766799
By using SQLite as the SQL frontend, it retains features that other NewSQL engines struggle with (recursive CTEs, deep query optimization, compatibility with existing queries, etc.)
Looks interesting. Any experience using it in real projects?
I haven't tested geo-replication yet, but it looks pretty straightforward - configure the maximum clock drift and all that.
There are also previous threads discussing them but here's one.
I know different databases support different features (e.g., atomic transactions, ACID compliance, etc.), but it seems like the big differentiator is size, namely:
1. Can your database live within a single computer's RAM?
2. Can your database live within a single computer?
3. Can your database live across a cluster of computers in a single data center?
4. Can your database live across a cluster of computers spread across the planet?
Once you pick your category from the above, there are still choices to be made, but the "best" options narrow down significantly.
I agree that size is possibly the second biggest differentiator (if not the first), though -- despite the number of choices out there, once you start answering those questions, many options fall away very quickly.
I'd also add on that the only real way to scale a database is to shard. There are a bunch of tricks you can do to improve throughput and those are always welcome, and sometimes I'll see articles that proclaim "we just changed to a system with 256GB of RAM to fix our scaling problem!" -- but these improvements don't get past the basic premise that once your data does not fit on one computer/harddrive you need to start figuring out a way for multiple to coordinate. You need to shard, and you need to organize your data in such a way that it can shard, then you need to start worrying about data access patterns. There's a lot of space before something like this is necessary (you really do have to have a lot of data for it to not fit on one machine these days), but I think people don't see this chain coming fast/often enough.
Handling atomic transactions or ACID on a single computer is a completely different problem from dealing with them across multiple, and still a different problem from dealing with them across datacenters. I'd argue that the size of your data is the first question you need to ask, only then you can figure out if you need/desire these features.
> once your data does not fit on one computer/harddrive you need to start figuring out a way for multiple to coordinate
Going from one computer to two generally more than doubles the complexity of the system. But when done right, going from two computers to two hundred can be easy. However, going from two hundred (in a single datacenter) to ~two thousand (across multiple datacenters) brings its own unique set of challenges.
Picking the database for an app is one of the biggest tech decisions any team can make, certainly more important than the underlying OS, and potentially even more important than choosing the predominant programming language. Even when starting a tiny startup with no initial customers, considering the absolute maximum you think your data can grow to, and picking a database that can grow with you, can save you huge headaches down the road.
Case Study: I started a company on Python 2.7 (fine, but not enlightened decision), a fork of Django that supported NoSQL (bad decision), and an underlying NoSQL database that scaled with no configuration (great decision). In the first year, the database had around a gigabyte of data, so NoSQL was definitely overkill. But now, it's over 100 Terabytes, with auto replication making it much larger than that. Changing the Django middleware or even the Python language to 3.x will be difficult, but not impossible. Moving the database is far more terrifying, a non-starter.
I used Google's NoSQL datastore database. It's generally worked well for me, and has scaled from 1GB to 100TB incredibly well, it feels like magic. But I made the decision to go with it years ago. If I were starting from scratch today, I'd want to do a lot more research before endorsing it.
Postgres can run in a 256MB VPS (if one can find one these days) and has more features that you might need later.
Installing is as easy as 'apt-get install postgresql', create user, create database, done.
From their site: https://www.sqlite.org/whentouse.html
>SQLite does not compete with client/server databases. SQLite competes with fopen().
With Postgres, you run pg_dump on the database (which is a shell one-liner, only with slightly more arguments than cp) and that's even better for putting into git because it's a text file that git can diff natively and store efficiently with delta compression.
Postgres is a fantastic database, but most of the time getting started with and using SQLite will be easier. For example -- you'd never have to secure firewall rules to a SQLite instance. Bad defaults (unlikely but possible) or some bad post-install configuration and you've got an exposed server.
I want to emphasize that my point about SQLite being viable for startups assumed a low-write, high-read workload -- most of the time queries won't even hit the DB but will instead be served by some caching layer. It's also really easy to scale read-only versions of your application horizontally if all you've got to do is ship/swap out DB files (though at that point it might make more sense to look into a bigger DB and just scale vertically).
The primary purpose is to replace custom (and fragile) single-file databases with something that is far more resilient and provides uniformity in implementation. The fact that it is a surprisingly good SQL database (to the point where folks are actively arguing that much more fully-fledged databases are better in some aspects) is incidental.
Agreed that testing against copies of production or staging instances that are as close as possible to production is important, but what I'm suggesting is that you could ride SQLite all the way into production if you wanted to.
I'd love to hear more about the ignoring column types/size limit/other issues that you've run into though, have you written about them anywhere or is there an SO post? I haven't run into anything super egregious though I've had some problems with FKs before.
If you can add "PRAGMA foreign_keys = ON;" to your testing script SQL calls - somewhere soon after connecting to SQLite - that enables it.
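A sketch of that in Python, showing the pragma taking effect (the table names are invented):

```python
import sqlite3

# Sketch: foreign-key enforcement is off by default in SQLite and must be
# enabled per connection; the tables here are invented.
con = sqlite3.connect(':memory:')
con.execute('PRAGMA foreign_keys = ON;')
con.executescript("""
    CREATE TABLE parent(id INTEGER PRIMARY KEY);
    CREATE TABLE child(pid INTEGER REFERENCES parent(id));
""")
err = None
try:
    con.execute('INSERT INTO child VALUES (42)')   # no parent row with id 42
except sqlite3.IntegrityError as e:
    err = str(e)   # "FOREIGN KEY constraint failed"
```

Without the pragma, that INSERT would silently succeed.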
>Many modern startups could get pretty far with SQLite + aggressive caching
Not only did we get "pretty far", we got to what is effectively a final destination for this generation of our product. We don't even need "aggressive caching" beyond what the filesystem provides to us by way of accessing the SQLite db file.
Today, we have a single SQLite database persisting stateful data for an application that is used concurrently by 100+ users. I have benchmarked 512KB state update writes in a multithreaded scenario at >500 per second. 64KB updates exceed 1000 per second. It took us a lot of time and understanding to get to this point. At the end of the day with SQLite, less is always more. The first attempts we did involved connection pools, queues and all kinds of other fancy over-engineering in an attempt to extract more performance. We finally ended up with just the following:
- Enable Write Ahead Logging
- Set Synchronous to Normal
- Use a single SQLiteConnection
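For reference, the same three settings sketched in Python's sqlite3 (the .NET setup is analogous; the file name is a placeholder):

```python
import sqlite3

# Sketch of the three settings above; 'state.db' is a placeholder name.
con = sqlite3.connect('state.db', check_same_thread=False)  # one shared connection
mode = con.execute('PRAGMA journal_mode=WAL;').fetchone()[0]  # readers don't block the writer
con.execute('PRAGMA synchronous=NORMAL;')  # fsync at checkpoints, not every commit
# mode == 'wal'
```

Note that journal_mode=WAL is persistent (stored in the database file), while synchronous must be set on every new connection.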
Justification for a single connection: We are using .NET with application-level locking and an RWLS where appropriate. This is in contrast to relying on the database to push these transactional scopes to disk. For us, losing an in-flight transaction tracked in memory is not catastrophic so we can tolerate this 1-100ms danger zone for the huge upside we get in throughput. This is a great example of something you can only do with SQLite. No hosted database solution offers this degree of exclusivity over your data store from an application perspective.
One other application note to keep in mind is that SQLite (by default) operates in serialized mode, so the above approach is consistent regardless of the fact that we are hitting a single connection from multiple threads at once. The usages of RWLS are required in our case simply because we need to do multiple dependent sequential operations (select+modify+update) or insertion of new items (insert+last insert row id from connection). It is hypothetically possible to avoid locking altogether (aside from the implicit SQLite serialization) if you can structure your database such that you perform all operations in a single command - E.g.: Use application-generated keys upon insertion rather than waiting for the DB to generate one.
Also, backups and restores are super trivial as mentioned elsewhere in this discussion. I have considered writing a backup agent directly into our software that is transactionally-aware. E.g.: During a backup request, I can take out a write lock on the RWLS for the SQLiteConnection while copying the database to a backup destination. Alternatively, since the DB lives on the same machine as the application, you can just take a single VM snapshot throughout the day, but this won't be as consistent a backup (tolerable for our use case though).
The biggest advantage we see with using SQLite as our principal data store is that it has allowed us to completely avoid operational annoyances regarding how we actually ship our stack to various environments. .NET Self-Contained Deployments + SQLite = one bin folder that "just works". We were able to build custom deployment tooling because of the simplicity of this model, rather than relying on off-the-shelf tech that doesn't quite fit 100% into our use case.
While I've not had time to research it yet, I'd be curious on your (or anyones) knowledge / experience in this context.
My plan, if SQLite didn't offer concurrent access safety, was simply to put concurrent primitives around SQLite to ensure safety. However I'm not doing anything until I can make heads or tails of that original comment.
- Concurrent reading has always been possible.
- Reading while writing is possible when the relatively new WAL mode is enabled, at no significant cost.
- Writing still results in a lock, and other writers can decide how long they’re willing to wait on it.
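On that last point, the wait is the busy timeout; a sketch in Python (the 5-second figure is arbitrary):

```python
import sqlite3

# Sketch: how a writer decides how long to wait on a locked database.
# The 5-second figure is arbitrary.
con = sqlite3.connect(':memory:', timeout=5.0)  # Python maps this to a busy handler
con.execute('PRAGMA busy_timeout = 5000;')      # same thing, set explicitly in ms
wait_ms = con.execute('PRAGMA busy_timeout;').fetchone()[0]
# wait_ms == 5000
```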
God forbid I ask for experiences from people who seem seasoned in this area on Hacker News of all places. The gall I must have, right?
Our CTO made a cluster-wide distributed configuration filesystem, started about 9 years ago. It required a local backing store, and a small DB seemed to fit the bill: it should mean worrying less about data safety and about when data is really written out, and it could be a general interface that could be swapped out rather easily if required one day.
He checked out lots of the then-available options; sorry, I only remember BerkeleyDB and SQLite, but there were more.
Anyway, only SQLite never crashed or corrupted data when we did weird things with it, and it was still very fast. Just great.
Almost 10 years later we still use it, never a single issue with SQLite as the cause - really impressive IMO.
A lot of people in the HN community like types. Me included. I really don't want my database to ignore type errors. I basically want Postgres but embedded like a library. I cannot think of any reason why that couldn't work. Perhaps just no one has built it yet.
I see that even with "INTEGER" affinity, I can insert and read back text to columns :/
CREATE TABLE test(
x INTEGER, y TEXT);
INSERT INTO test VALUES ('not a number', 'abc');
SELECT x, typeof(x) FROM test; -- 'not a number', text
In other words, typechecking is also a cost for the language designer.
Even SQL itself doesn't have a particularly strong type system. Perhaps you're looking for something else entirely.
If you read back a set of values and one of the fields in a row doesn't match the data type given by the column, the libraries will error out.
One library that works correctly is github.com/gwenn/gosqlite, which checks the data type of each field of the result set (per row), so handles things gracefully.
Unless you have really strict requirements (generally not the case), SQLite's storage classes (NULL, INTEGER, REAL, TEXT, BLOB) are sufficient. If it really bothers you, you can use extensions to provide whatever abstractions you like. If you're mostly doing CRUD, you really shouldn't have any issues.
If you have only one thread/process, use it as is; if you need multiple readers and one writer (who do not block each other), use WAL mode; if you need simultaneous multiple readers and writers, use Postgres.
If you are using SQLite, be aware that "copy the file to move it to a different system" comes with a caveat - a regular Unix "cp" may capture the file in an inconsistent state if it was modified while copying, or in the middle of a transaction.
IIRC "rsync", if used to copy, will verify that the file was not modified while copying and restart the copy if it was (a number of times, and then error out). But that's still not enough - you may also need to copy the shm file (in WAL mode) and the journal file - and you need to do that atomically, which is harder still.
The right way to copy an SQLite file from place to place is:
Any way you like, if it was properly committed and closed the last time it was used, and is not currently open by any other process;
Or by creating a backup (e.g. with the sqlite3 command), which is a nonblocking operation in WAL mode - and then you can ship the backup whichever way you want.
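The backup route can also be driven from code; a sketch using Python's Connection.backup (available since Python 3.7; file names are placeholders):

```python
import sqlite3

# Sketch: an online, consistent copy via SQLite's backup API.
# File names are placeholders.
src = sqlite3.connect('app.db')
dst = sqlite3.connect('app-backup.db')
with dst:
    src.backup(dst)      # pages are copied under SQLite's own locking
dst.close()
src.close()
```

Unlike cp, this produces a consistent snapshot even if other connections are writing at the time.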
With SQLite, it's as simple as dumping the .db file from one machine to another and running your analysis tools over it - aka executing simple SQL queries from various reporting views. We also see really good compression on this type of structured data (we just compress the entire .db file on disk).
Our analysis tools are typically just DB Browser for SQLite - A tool I would say is 10x better & faster than SSMS: https://sqlitebrowser.org
Recent versions of SQLite's CLI also have a "sqlar" mode [1] that lets you treat SQLite database files as archive files. Unlike with some other archive formats, adding and removing files in an existing sqlar file is a first-class operation, since it's just a database file. It's basically what I'm doing for my "ton of tiny files" case, just with a documented schema.
1 = https://www.sqlite.org/sqlar.html
- Lost decimal places on a simple database round-trip.
- Issues with accessing the DB from multiple threads.
This seems like a misunderstanding of which types SQLite can store natively and which one was right for your use case. While SQLite certainly has fewer type options out of the box than a larger database, the same thing could happen in another database with a wrongly chosen type (e.g. integer vs. bigint in Postgres).
> - Issues with accessing the DB from multiple threads.
How SQLite acts under multiple thread access is well documented. You can even get closer to bigger systems by using WAL mode. The fact that SQLite isn't the best for concurrent writes is also discussed in the page on when to use SQLite:
> Many concurrent writers? → choose client/server
> If many threads and/or processes need to write the database at the same instant (and they cannot queue up and take turns) then it is best to select a database engine that supports that capability, which always means a client/server database engine.
> SQLite only supports one writer at a time per database file. But in most cases, a write transaction only takes milliseconds and so multiple writers can simply take turns. SQLite will handle more write concurrency than many people suspect. Nevertheless, client/server database systems, because they have a long-running server process at hand to coordinate access, can usually handle far more write concurrency than SQLite ever will.
A bit of a reach but I'm also willing to bet that the architectural pattern that you were using that required multiple threads to write to the DB at the same time might benefit from passing that responsibility to a single thread which could possibly employ some batching, which is normally one of the low hanging fruits of DB write performance tuning.
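That single-writer-with-batching pattern might look roughly like this in Python (the schema, the queue, and the 100-item batch cap are all invented):

```python
import queue, sqlite3, threading

# Hypothetical sketch: funnel every write through one thread and batch queued
# items into a single transaction. Schema and names are invented.
writes = queue.Queue()
_STOP = object()

def writer_loop(path):
    con = sqlite3.connect(path)
    con.execute('CREATE TABLE IF NOT EXISTS events(payload TEXT)')
    while True:
        item = writes.get()
        if item is _STOP:
            break
        batch = [item]
        while len(batch) < 100 and not writes.empty():
            nxt = writes.get_nowait()
            if nxt is _STOP:
                writes.put(_STOP)   # keep the sentinel for the outer loop
                break
            batch.append(nxt)
        with con:                   # one transaction per batch, not per insert
            con.executemany('INSERT INTO events(payload) VALUES (?)',
                            [(p,) for p in batch])
    con.close()

# Producers just enqueue; only the writer thread touches the connection.
writer = threading.Thread(target=writer_loop, args=('events.db',))
writer.start()
for i in range(10):
    writes.put('payload-%d' % i)
writes.put(_STOP)
writer.join()
```

Grouping inserts into one transaction amortizes the per-commit fsync, which is usually the dominant cost of small writes.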
I figure you could just throw hardware at it like you mentioned. Move them to nvme backed S3 if needed.
And our use case is only ever load, do a read or write, and then save. So the DBs aren't open for very long.
And with S3 compression you could save on download time but pay a decompress cost.
This approach has its downsides, don't get me wrong, but it scales nicely. Just forget about running aggregates across the databases, at least not for a real-time result.
Sqlite3 was my saviour. It was the work of minutes to have the data in a db and be chopping out the bits I needed.
This was probably the right decision 5 years ago, but I'm not sure SQLite is the proper solution for "ad hoc" analysis now. Excel now has plugins to handle very large data sources. And of course, the company leading this charge is Tableau, with many others doing something similar.
SQLite can still do all the manipulation these others can, but the others can also be used by non-developers and come with pretty graphs and pictures. For large CSV analysis on an ad hoc basis, SQLite doesn't stack up well against the competition.
It's pretty standard Excel stuff: summing columns, averaging, functions combining multiple columns, pivot tables. The significant difference is that I routinely deal with 20k+ rows, and occasionally 100k+ rows. For simple stuff, standard Excel starts breaking down with 10k+ rows, and it even struggles with 1k rows with more complicated manipulation.
Great to see FILTER on aggregates, solves a small pain I have with my home automation stats :)
Specifically I’m now wondering if a case-insensitive replace() string function is already a feature request or perhaps even implemented in a recent version.
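Until then, one can approximate it with a user-defined function; a sketch in Python (the name `ireplace` and the regex approach are my own):

```python
import re, sqlite3

# Sketch: a case-insensitive replace() as a user-defined function.
# The name 'ireplace' and the regex approach are my own invention.
con = sqlite3.connect(':memory:')
con.create_function('ireplace', 3,
    lambda s, old, new: re.sub(re.escape(old), new, s, flags=re.IGNORECASE))
row = con.execute("SELECT ireplace('Hello HELLO hello', 'hello', 'hi')").fetchone()
# row == ('hi hi hi',)
```

re.escape makes the old string match literally, so this behaves like replace() except for the case folding.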
I've used SQLite and like it a lot, but I've recently started using these two for some pet projects and found them easier to work with; in theory they should also scale more easily, if I understand their strengths correctly.
Although I don't know the underlying cause so it may be infeasible.
> SQLite4 was an experimental rewrite of SQLite that was active from 2012 through 2014. All development work on SQLite4 has ended. Lessons learned from SQLite4 have been folded into the main SQLite3 product. SQLite4 was never released. There are no plans to revive it. You should be using SQLite3.