I built a rules engine around SQLite queries for the last product I was working on. This article was the reason.
You can run an unbelievable # of select statements per unit time against a SQLite database. I think of it like a direct method invocation to members of some List<T> within the application itself.
Developers who still protest SQLite are really sleeping on the latency advantages. 2-3 orders of magnitude are really hard to fight against. This opens up entirely new use cases that hosted solutions cannot consider.
Databases that are built into the application by default are the future. Our computers are definitely big enough. There are many options for log replication (sync & async), as well as the obvious hypervisor snapshot approach (WAL can make this much less dangerous).
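To make the List<T> comparison concrete, here's a minimal sketch (the rules table and its columns are invented for illustration, not my actual schema) of issuing a thousand point lookups in a tight loop:

    import sqlite3, time

    conn = sqlite3.connect("rules.db")
    conn.execute("CREATE TABLE IF NOT EXISTS rules (id INTEGER PRIMARY KEY, threshold REAL)")
    conn.executemany("INSERT INTO rules (threshold) VALUES (?)", [(i * 0.1,) for i in range(1000)])
    conn.commit()

    start = time.perf_counter()
    for rule_id in range(1, 1001):
        # Each lookup is an indexed primary-key probe, closer to a method call
        # than to a network round trip.
        conn.execute("SELECT threshold FROM rules WHERE id = ?", (rule_id,)).fetchone()
    print(f"1000 point lookups in {(time.perf_counter() - start) * 1000:.2f} ms")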
> Databases that are built into the application by default are the future. Our computers are definitely big enough.
Interesting point. It's easy to forget that in the past, local storage was small and expensive enough that it was necessary to have a separate database machine.
It wasn't just storage. In the 90s, most servers were single core. Databases already had network interfaces so they could be shared by non-web applications running on desktops, and so were the natural place to split a website over multiple machines. This turned into the received wisdom of always putting your database on a separate machine, despite processing power and storage latency improving so much since the 486 and Pentium days that network overhead now dominates response times.
I'm still not understanding this push toward using SQLite as a production backend database. It's great for what it is, a tiny embeddable client-side application database. Like an address book on your phone. But even the developers themselves have steadfastly refused to allow it to expand beyond that scope. For instance, they won't add native types for any useful things like dates/times, or uuids. Because that would bloat the code and the size of the embedded object. So you're stuck with "everything is a string". Ref integrity can be enabled, but even those constraint options are very limited.
Not sure why people are still trying to shoe-horn it into a role that it's not meant to be in, and not even really supported to be.
Maybe it's not for you, but the "everything is a string" thing is just the default. SQLite has STRICT table option since 2021 that people really should be using if possible: https://www.sqlite.org/stricttables.html
This brings strict types that people expect from the other server-based databases.
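A minimal sketch of what that buys you, assuming the underlying SQLite library is 3.37 or newer:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # STRICT tables (SQLite 3.37+) reject values that don't match the declared type.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER) STRICT")
    conn.execute("INSERT INTO users (id, age) VALUES (1, 42)")         # ok
    try:
        conn.execute("INSERT INTO users (id, age) VALUES (2, 'old')")  # constraint error
    except sqlite3.Error as e:
        print("rejected:", e)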
Just to clear up the error in the parent post: SQLite has native blobs, floats, and integers, not just strings. It doesn't have a bunch of other types for things like dates and JSON - you just represent those things using the native types of integer, float, string or blob. But it is not limited to only strings. This has been true for 20 years.
But it has functions for operating on dates and times though, right? date(), strftime() etc. Those just silently fail with a NULL if the value in the tuple isn't something that it recognizes.
So it seems like it _tries_ to do the things that people want, but only half-heartedly, and just fails silently if it can't. Was the column actually NULL? Or just a wrong string format? Who knows?
The point is, SQLite is very "flimsy". And that's perfectly fine for what it is (or what it intends to be). But if I try to insert or retrieve a date() or do strftime() on a string that isn't a valid date/time, don't just return a NULL. That's where the "flimsiness" makes it a deal-breaker for anything where integrity is important and the database itself is supposed to be the authority.
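A quick illustration of the silent-NULL behaviour, plus the usual workaround of a CHECK constraint (just a sketch of the pattern, not a defence of it):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Malformed input doesn't raise; it just yields NULL.
    print(conn.execute("SELECT date('not-a-date')").fetchone())  # (None,)

    # A CHECK constraint can at least refuse to store undateable strings.
    conn.execute("""
        CREATE TABLE events (
            id INTEGER PRIMARY KEY,
            created_at TEXT CHECK (date(created_at) IS NOT NULL)
        )
    """)
    conn.execute("INSERT INTO events (created_at) VALUES ('2024-01-31')")   # ok
    try:
        conn.execute("INSERT INTO events (created_at) VALUES ('garbage')")  # constraint fails
    except sqlite3.IntegrityError as e:
        print("rejected:", e)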
It's absolutely supported for that use case if you can get away with single-writer multi-reader architecture, which IMHO most medium sized applications can. [1]
> It's great for what it is, a tiny embeddable client-side application database. Like an address book on your phone
Size is no real concern, if the user of a client side application has many gigabytes of data a sqlite database is still well suited for the role. There's no shoehorning, it just works.
Just saying that it's "worth using it" gives the impression that it's a good choice for all levels of traffic, all security requirements, all isolation requirements (yeah it's serialisable, but this comes back to levels of traffic), the kind of use case where you need SKIP LOCKED instead of just a single writer, etc.
> Databases that are built into the application by default are the future.
No they're not, because for web servers, you have many (tens? hundreds? thousands?) of web servers that all need to talk to a single database. And you sure don't want to replicate and sync a gigantic database across each web server -- that would be a disaster.
While for local apps on your phone or computer, usage of SQLite is already widespread -- it's not the future, it's here. And cloud-connected apps that work offline already do some sort of sync for that.
Most people don't have tens or thousands of webservers and could be running with a more efficient in-process database that syncs to another networked read-only replica for a very long time. I'm very surprised that MySQL or Postgres still don't have an in-process mode, because it's such an obvious win, and on a technical coding level it should (naively) be very easy to switch from network calls to direct calls.
And syncing databases is just waaay harder than having a dedicated one. It really seriously is a pain to configure and debug and manage storage and performance requirements around and dealing with scaling and managing failure modes and so forth. If you can even find someone who has the skills to do it properly so you never lose data.
It's far, far easier to just write normal SQL queries that don't have to make hundreds of requests, than to deal with syncing a database across all your webservers.
And because designing for 2 is really different from designing for 1, but not much harder if you do it from the beginning. But expanding from 1 to 2 when you unexpectedly need to because of sudden traffic, and having to re-architect it, can be really hard.
I'm not talking about hobby sites here, I'm talking about anything designed for a business, where downtime of a day is unacceptable.
But doesn't the same logic apply to the database? Is the webserver / application layer really so much more CPU intensive (and/or prone to failure) than the database in sites "designed for business", such that you can't get away with not designing the webservers for load balancing, but you can get away with not designing the databases for sharding?
> for web servers, you have many (tens? hundreds? thousands?) of web servers that all need to talk to a single database
Maybe if you're in the top 1000 or so largest websites.
Back when the Alexa 10k was a thing, $work was on it - and we were serving that level of traffic with a Rails app running on 30 CPU cores. It would fit _easily_ onto a single machine.
Single machines aren't fault-tolerant. If your machine goes down, your site goes down. And god help you when you have to scale to a second machine and you haven't architected anything for that, and it's a project that takes months.
There are a lot of good reasons that we use a fleet of cheap scalable web servers in front of a beefy database server that has a live failover.
> There are a lot of good reasons that we use a fleet of cheap scalable web servers in front of a beefy database server that has a live failover.
"You can't fit on one big machine" is virtually never one of them - and at that point you're firmly in the realm of 'months of bespoke engineering to handle the load'.
Similarly, "beefy database server that has a live failover" is not inherently better than "beefy web+database server that has a live failover".
I agree there are a lot of good reasons. There used to be a _lot more_ good reasons, and - while I haven't actually tried it yet - I suspect that it's far more tractable than it used to be.
If I _were_ to try it, my design would be multiple failover targets (at least 3, geographically dispersed, probably 4), plus a separate control plane to configure BGP.
> "You can't fit on one big machine" is virtually never one of them - and at that point you're firmly in the realm of 'months of bespoke engineering to handle the load'.
That's just not true. There are a lot of web servers where their operations are CPU- or memory-intensive, depending on what they do. Are they working with images? Mapping? Video? Large data throughput? Especially when so much server code is written in slow interpreted scripted languages.
Well-architected databases often can fit on one machine, unless you're at Facebook scale. Webservers however -- absolutely not. The front-end for a simple CRUD interface, or for a blog with static pages, sure. But for a lot of interactive sites, definitely not.
> That's just not true. There are a lot of web servers where their operations are CPU- or memory-intensive, depending on what they do. Are they working with images? Mapping? Video? Large data throughput? Especially when so much server code is written in slow interpreted scripted languages.
So the "traditional scalable" way of doing things right now assumes your "app" layer (sorry if I'm not using the right term -- the thing between the HTTP request and the database) is inherently going to be CPU intensive in a way that the database is not; and so assumes you're going to need to make a distributed system with load balancing, network connection to a database, etc.
But it seems to me that making that distributed system 1) requires a huge amount of "complexity budget" 2) actually increases aggregate CPU utilization, since one way or another everything has to be marshaled across the network, usually involving interpretation, copies, verification, maybe encryption, and so on.
So the question is, could you instead spend your "complexity budget" on actually optimizing the "app" layer? Use a performant compiled language, like Go or Rust, use profiling and optimization, etc. Sure, you'll have to spend more of your development time optimizing inner loops, but you also spend less time designing and debugging distributed systems.
I remember hearing somewhere that the entirety of StackOverflow ran on two physical boxes. Given how much traffic they must have gotten, I'm betting that I can do the same for the website I'm currently developing.
And connected to SQLite: Combine the fact that databases are just files, and that you can easily run "ATTACH <file>" to connect multiple database files on the same "connection", you should in many cases be able to design your overall structure to effectively "pre-shard" data. Right now I've got a separate .sqlite file for the per-user data, but all users share common site data; so if I ever reach the point where I really can't optimize further, I should be able to break my app into multiple servers without having to do further work sharding the database.
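Roughly what I mean, as a sketch (file and table names here are made up, not my real schema):

    import sqlite3

    # One shared "site" database plus one small database per user.
    conn = sqlite3.connect("site.db")
    conn.execute("ATTACH DATABASE 'user_42.sqlite' AS userdata")

    # Queries can join across the attached files as if they were one database.
    conn.execute("CREATE TABLE IF NOT EXISTS articles (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS userdata.bookmarks (article_id INTEGER)")
    rows = conn.execute("""
        SELECT a.title
        FROM userdata.bookmarks b
        JOIN articles a ON a.id = b.article_id
    """).fetchall()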
If everything goes up in smoke, I'll write a blog post admitting I was wrong and DM you. :-D
> So the question is, could you instead spend your "complexity budget" on actually optimizing the "app" layer?
You definitely could, and I know people who have ported things from e.g. PHP or Python or Java to Go, so that a service that required 100 servers then only required 1. But it's just expensive -- the time it takes, and hiring people who can do that and maintain that.
And I only know of that having been done for services that were already in an extremely stable state, and not expected to change -- API endpoints, not HTML generators.
Generally speaking, the "app" layer is just changing a lot as features get tested and added and changed and swapped out, and dev cost is the priority, not server cost. And load balancing web servers is very easy.
> And connected to SQLite: Combine the fact that databases are just files, and that you can easily run "ATTACH <file>" to connect multiple database files on the same "connection", you should in many cases be able to design your overall structure to effectively "pre-shard" data.
Distributing a database that is read-only is easy, sure. It's the writes that are hard, and when you shard it's the joins between shards that give you headaches. If you work on a site where data from one user never has to be joined to data from another, then lucky you. ;)
We have a giant Postgres DB, but instead of having every machine connect directly to it, we have a job that creates a smaller SQLite cache of relevant data and that’s pushed out to the machines who then reload that on the fly.
All this depends on your data being somewhat shardable of course.
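Roughly the shape of that job, as a sketch (psycopg2 assumed for the Postgres side; table and column names invented):

    import sqlite3
    import psycopg2  # assuming psycopg2 for the Postgres side

    # Pull the relevant data out of Postgres...
    pg = psycopg2.connect("dbname=main")
    with pg.cursor() as cur:
        cur.execute("SELECT id, sku, price FROM products WHERE active")
        rows = cur.fetchall()

    # ...and bake it into a throwaway SQLite file that gets pushed to the app machines.
    cache = sqlite3.connect("products_cache.sqlite")
    cache.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT, price REAL)")
    cache.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    cache.commit()
    cache.close()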
You can’t really sync the data in real time though. Transactions still need to hit your Postgres, and then you reload. You might as well do a write-through cache.
Yeah, this sounds pretty not great if you need ACID consensus or locking between multiple machines. If what you need is to read a somewhat static state, fine, great.
> Databases that are built into the application by default are the future. Our computers are definitely big enough.
That assumes local applications themselves are the future, and that assumption has grown ever weaker and weaker with everyone and their dog going cloud-only (or starting as a SaaS in the first place) to grab all the sweet sweet recurring subscription revenue.
> Databases that are built into the application by default are the future.
I'm not sold on client side, but a lot of great work is being done on putting the db and application on the same server, between SQLite replication and other approaches like SpacetimeDB. I'm interested to see where it goes.
SQLite underpins nearly every on-disk storage mechanism at Apple and Google at the OS level. You may not be sold, but systems programmers elsewhere are.
SQLite isn't something that works only for Apple or Google. The point is that it's sufficient for them, simple, and more than sufficient for you. He's pointing out that this is something that has already happened, in case you weren't aware.
I use SQLite all the time, I said in another comment it's my go-to for server-side green field projects. But there's a lot of push to create data solutions that are fully bidirectional between client and server, or to even make them client-driven in an awkward peer-to-peer model, and I'm not sold on either pattern.
SQLite is at its best in "pure client" situations where there is no server and there are no P2P things going on. Just the application, the user, and their data on their own drive with SQLite.
I often find myself thinking about what a database like SQLite might look like if it had a native API instead of a query language. I guess it wouldn’t be so different from a dataframe API, but with persistence, locking, and relational features, and I guess an mmap under the hood.
> Databases that are built into the application by default are the future.
That’s a bit overzealous. SQLite is great, but it’s not a replacement for a hosted database. Not all data can live on the client. Use SQLite when you have the right use case for it (offline desktop app, etc.)
If you have a database server and an application server, then move your application onto the same machine as your database. That's basically the use case of SQLite server-side. You can easily fit 250 TB in a rack (or 64 TB on an EC2 instance). That's a lot for non-blob storage: roughly 750 KB per US citizen.
Who has "an" application server though? Those of us who came up through PHP/Rails/Flask and even much of the modern microservices world were taught to do stateless, probably load-balanced application tiers with shared DB, maybe cache, maybe message queue. Both for scalability and redundancy.
Redundant SQLite is not impossible [0] but it is definitely less mature and well-known than replicated MySQL or Postgres.
I think the "hosted flavors of SQLite" would be doing the type of trick Litestream does: streaming the WAL to other copies of the database. The copy you interact with is still local.
Wouldn't a hosted database really only "make sense" if you don't want sharding? It's a lot easier to shard at the front end before it makes it to any business logic than to do that and also shard the database. It's when you don't want to deal with database sharding at all that you'd want all application instances to hit the same hosted database.
Sorry I forgot that people do "front end" these days. I was thinking of the old-school server-side-rendered web apps that load almost instantly, not the modern spinner-fest ones.
An interesting result of this is that SQLite is a really good database for implementing GraphQL.
The biggest problem with GraphQL is how easy it becomes to accidentally trigger N+1 queries. As this article explains, if you're using SQLite you don't need to worry about pages that accidentally run 100s of queries, provided those queries are fast - and it's not too hard to build a GraphQL API where all of the edge resolutions are indexed lookups with a limit to the number of rows they'll consider.
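A sketch of what one of those edge resolvers can look like (resolver signature and schema are invented; the point is the indexed, bounded per-edge query):

    import sqlite3

    conn = sqlite3.connect("blog.db")
    conn.execute("CREATE TABLE IF NOT EXISTS comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_comments_post ON comments (post_id)")

    def resolve_comments(post, args):
        # One small indexed lookup per edge, with a hard cap on the rows returned.
        limit = min(args.get("first", 20), 100)
        return conn.execute(
            "SELECT id, body FROM comments WHERE post_id = ? ORDER BY id LIMIT ?",
            (post["id"], limit),
        ).fetchall()

    # Resolving the comments edge for every post on a page is just n cheap lookups.
    posts = [{"id": 1}, {"id": 2}, {"id": 3}]
    page = {p["id"]: resolve_comments(p, {"first": 10}) for p in posts}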
It's true that RPC latency doesn't exist, and that opens up other possibilities that are worth considering.
But if you do a ton of small queries instead of one big one, you could be depriving the database of the opportunity to choose an efficient execution plan.
For example, if you do a join by querying one table for a bunch of IDs and then looking up each key with an individual select, you're forcing the database into doing a nested loop join. Maybe a merge join or hash join would have been faster.
Or maybe not. Sometimes the way you write the queries corresponds to what the database would have done anyway. Just not necessarily, so it's something to keep in mind.
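Concretely, the two shapes look something like this (invented schema, just to show the difference):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)")

    # Hand-rolled 1+n: one indexed probe per order - a nested loop you wrote yourself.
    orders = conn.execute("SELECT id, customer_id FROM orders WHERE status = 'open'").fetchall()
    names = [
        conn.execute("SELECT name FROM customers WHERE id = ?", (cust_id,)).fetchone()
        for _, cust_id in orders
    ]

    # Single join: the planner picks the join order and index usage itself,
    # and EXPLAIN QUERY PLAN shows what it chose.
    for row in conn.execute("""
        EXPLAIN QUERY PLAN
        SELECT c.name FROM orders o JOIN customers c ON c.id = o.customer_id
        WHERE o.status = 'open'
    """):
        print(row)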
I swear "n+1" only popped up in the past 5 or so years with this meaning. I learned it over a decade ago as "1+n".
Also "n+1" looks more like a reference to concurrency/synchronization throughput - like, n concurrent threads or requests where the results are collected and used together. I was really confused the first few times I saw "n+1" because of that.
I have thought of that as ‘n’ being the manageable threshold and the (uncontrolled) ‘+1’ being the overflow that creates the problems. Typically in terms of additional layers or iterations. But I like your point, and perhaps ‘n+1’ and ‘1+n’ mean different problem shapes.
I could see that making sense in some contexts, like the "straw that breaks the camel's back". However that's not what's usually being referred to in this database query problem.
Usually it's doing one query that returns n results, then doing one more query for each result. Therefore, you end up having done 1+n queries. If you'd used a join you could potentially have done only 1 query.
It's interesting to consider the question of what this would look like if you put Postgres, rather than SQLite, in process? With PGlite we can actually look at it.
I'm hesitant to post a link to benchmarks after the last 24 hours on Twitter... but I have some basic micro-benchmarks comparing WASM SQLite to PGlite (WASM Postgres that runs in-process): https://pglite.dev/benchmarks#round-trip-time-benchmarks
It's very much in the same ballpark. Postgres is a heavier database and is understandably slower, but not by much. There is a lot of nuance to these benchmarks though, as the underlying VFSs are a little different from each other, and PGlite uses a WAL whereas the SQLite build is in its rollback journal mode (I believe this is why PGlite is faster for some inserts/updates).
But essentially I think Postgres when in-process would be able to perform similarly to SQLite with many small queries and embracing n+1. But having said that, I think the other comments here about query planning are important to consider: if you can minimise your queries, you minimise the scans of indexes and tables, which is surely better.
What this does show, particularly when you look at the "native" comparison at the end, is that removing the network (or at least a local socket) from the stack brings the two closer together.
The article doesn't explain why this is a bad idea for database servers hosted on a remote machine. The first reason is obvious... Network connections take memory and processing power on both client and server. Each additional query causes more resource usage. It is unnecessary overhead which is why things like multiple active result sets were created for SQL Server.
The network round trip time can also add up if you run into resource constraints doing this.
On a remote database, you also have to contend with multiple users and so complicated locking techniques can come into play depending upon the complexity of all database activity.
Many databases have options to return multiple result sets from one connection which helps control the overhead caused by this usage pattern.
EDIT: This also brings back horrible memories where developers would do this in a db client server architecture. Then they would often not close the DB connections when done. So you could have thousands of active connections basically doing nothing. Luckily, this problem was solved with better database connection handling.
These days there are other tricks you can use to turn several SQL queries into a single round-trip too, with things like JSON aggregates.
Here's an example PostgreSQL query that returns 10 rows from one table and 20 rows from another table in a single network round-trip, using JSON serialization to return the different shaped rows in one go: https://simonwillison.net/dashboard/union-json-demo/
the problem is that it only works with Postgres, not mysql or sqlite or pretty much anything else (at least not as conveniently) and the bigger problem is that the queries become more complex
It works with SQLite too - the functions have different names (json_group_array() and json_object() rather than PostgreSQL's json_agg() and json_build_object()) but they work pretty much exactly the same.
The one JSON PostgreSQL feature that SQLite doesn't have yet and I miss is row_to_json(record) which serializes an entire row without you needing to specifically list the column names.
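A sketch of that single-round-trip pattern in SQLite, using those functions (table names invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # invented tables, just to show the shape
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    # Two differently-shaped result sets, each collapsed to one JSON value,
    # fetched in a single statement (so a single round trip even over a network).
    recent_posts, recent_users = conn.execute("""
        SELECT
          (SELECT json_group_array(json_object('id', id, 'title', title))
             FROM (SELECT id, title FROM posts ORDER BY id DESC LIMIT 10)),
          (SELECT json_group_array(json_object('id', id, 'name', name))
             FROM (SELECT id, name FROM users ORDER BY id DESC LIMIT 20))
    """).fetchone()
    # recent_posts and recent_users are JSON strings; parse them with json.loads().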
Yes, unfortunately only Postgres has that convenient thing where you don't have to go out of your way to specify all the column names. SQL Server has it too under a different name, but MySQL doesn't.
Yes, you get some performance improvements, but I think that comes at the price of security isolation. Think about it, your application probably requires tons of libraries and evolves really quickly compared to the database itself. Additionally, having Admin/root permissions on the server hosting the DB is generally a much bigger deal than granting such permissions on an application server that talks to that DB.
If none of this makes sense, don't worry, that just means you don't work in enterprise...
Use a single SqliteConnection instance for all access - most builds of the SQLite provider serialize threads inside. Opening multiple connections will incur unnecessary F/S operations.
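That's the .NET provider, but the same pattern translates elsewhere; a rough Python analogue, serialising access with a lock, might be:

    import sqlite3
    import threading

    # One connection for the whole process; a lock serialises access across threads.
    _conn = sqlite3.connect("app.db", check_same_thread=False)
    _lock = threading.Lock()

    def query(sql, params=()):
        with _lock:
            return _conn.execute(sql, params).fetchall()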
> How do they manage durable data store? Since sqlite doesn't have a replication mechanism, how do people handle losing transactions?
I've never run one of my SQLite-embedded services on a machine that wasn't virtualized and also known as "production". The simplest recovery strategy is to snapshot the VM. There are other libraries that augment SQLite and provide [a]sync replication/clustering/etc.
#1. They are pretty much my default settings now. But how do you serialize concurrent writes with multiple processes? Let's say you run your API service in a separate stack from your webapp service. You can't really do that. I know some folks just make sure the transactions are so fast that collisions will be rare. But why take the chance?
#2. Snapshotting is a backup strategy. AFAIK, from reading some Reddit comments by Ben (the Litestream author, I think), it's not a replication strategy.
I've used SQLite in ETL pipelines very successfully, as well as for read-only caching. I just can't figure out how people use it as a server without doing a whole bunch of hacks to deal with its limitations.
I would never try to have 2 different processes (such as a web server) talking to the same SQLite database. That is a recipe for performance disaster. All of the projects I build this way are single process designs. It is not really a good fit for legacy solutions.
If you absolutely require transactional integrity against the same datastore with multiple processes, you could consider selecting one process as the DB owner and then IPC/RPC the SQL commands from the other processes to that one. Named pipes (shared memory) are nearly free in many ecosystems. You will find tradeoffs with a proxied approach, such as all application defined functions need to be defined on the db owning process.
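A sketch of that proxied approach using a local socket (named pipes work the same way, just a different address; the address, auth key, and wire protocol here are all illustrative):

    import sqlite3
    import sys
    from multiprocessing.connection import Listener, Client

    ADDRESS, AUTHKEY = ("localhost", 6000), b"change-me"

    def serve():
        # The one process that owns the database file.
        conn = sqlite3.connect("app.db")
        with Listener(ADDRESS, authkey=AUTHKEY) as listener:
            while True:
                with listener.accept() as channel:
                    sql, params = channel.recv()
                    channel.send(conn.execute(sql, params).fetchall())
                    conn.commit()

    def query(sql, params=()):
        # Called from any other process; ships the SQL to the owner.
        with Client(ADDRESS, authkey=AUTHKEY) as channel:
            channel.send((sql, params))
            return channel.recv()

    if __name__ == "__main__":
        serve() if "--serve" in sys.argv else print(query("SELECT 1"))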
> that it's not a replication strategy
Correct, but many scenarios can survive with 15-60 minutes (whatever the snapshot interval is) of data loss. The complexity trade off is often worth it if the business can tolerate. If you do require replication, then a 3rd party library or something application-level is what you need.
I think my point is still valid. How are people using sqlite3 in server context? Single process in modern webapp and 15-60 minutes of data loss might be acceptable but why? To save yourself a postgresql install which can be on the same server?
One thing I don’t see mentioned yet is that efficiency isn’t the only reason multiple queries can be problematic. There’s also consistency. This might not matter for many SQLite use cases, or in general for blocking single-reader calls. But if your database might be handling concurrent writes that could occur between any two reads, reducing the overhead to zero still might not be worth the tradeoff.
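In SQLite terms, the usual answer is to wrap the related reads in one transaction so they see a single snapshot. A sketch, assuming WAL mode and invented tables:

    import sqlite3

    conn = sqlite3.connect("app.db", isolation_level=None)  # manage transactions explicitly
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS transfers (id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL)")

    # Both reads run inside one read transaction, so they see the same snapshot
    # even if another connection commits a write between them.
    conn.execute("BEGIN")
    balance = conn.execute("SELECT balance FROM accounts WHERE id = ?", (1,)).fetchone()
    history = conn.execute("SELECT amount FROM transfers WHERE account_id = ?", (1,)).fetchall()
    conn.execute("COMMIT")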
Effectively (not completely but in general real world use) solving the N+1 queries problem is one of many reasons why SQLite is my first choice for greenfield projects.
The only concern I have is it isn't portable. Move to Postgres and your app will choke. This can be acceptable while scaling because SQLite can go much farther than people realize. But it is good to keep in mind.
The issue isn't that "postgres is actually slower" but the network round trips:
> Such criticism would be well-founded for a traditional client/server database engine, such as MySQL, PostgreSQL, or SQL Server. In a client/server database, each SQL statement requires a message round-trip from the application to the database server and back to the application. Doing over 200 round-trip messages, sequentially, can be a serious performance drag. This is sometimes called the "N+1 Query Problem" or the "N+1 Select Problem" and it is an anti-pattern.
I think that's why the GP said "on the same machine" — I read the article as comparing to the typical DB that is located on a separate machine in the data center, accessed over a network, as is currently the common setup.
But if we're considering running SQLite, the apt comparison would be against other DBs running on the same machine, because we've already decided local data storage is acceptable.
I assume Postgres/MySQL/etc would have higher latency than SQLite due to IPC overhead — but how significant is it?
I ran a quick, non-scientific Python script on my Macbook M2 using local Postgres and SQLite installs with no tuning.
It does 200 simple selects, after inserting random data, to match the article's "200 queries to build one web page".
(We're mostly worrying about real-world latency here, so I don't think the exact details of the simple queries matter. I ran it a few times to check that the values don't change drastically between runs.)
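Something along these lines reproduces the shape of the test for the SQLite half (a sketch only; the schema and row counts are invented stand-ins, not the exact script):

    import sqlite3, random, statistics, time

    conn = sqlite3.connect("bench.db")
    conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO items (val) VALUES (?)",
                     [(str(random.random()),) for _ in range(10_000)])
    conn.commit()

    times = []
    for _ in range(200):
        item_id = random.randint(1, 10_000)
        start = time.perf_counter()
        conn.execute("SELECT val FROM items WHERE id = ?", (item_id,)).fetchone()
        times.append((time.perf_counter() - start) * 1000)

    print(f"Min: {min(times):.3f} ms  Max: {max(times):.3f} ms  "
          f"Mean: {statistics.mean(times):.3f} ms  Median: {statistics.median(times):.3f} ms  "
          f"Total: {sum(times):.3f} ms")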
SQLite:
Query Times:
Min: 0.007 ms
Max: 0.031 ms
Mean: 0.007 ms
Median: 0.007 ms
Total for 200 queries: 1.126 ms
PostgreSQL:
Query Times:
Min: 0.023 ms
Max: 0.170 ms
Mean: 0.028 ms
Median: 0.026 ms
Total for 200 queries: 4.361 ms
(Again, this is very quick and non-optimised, so I wouldn't take the measured differences too seriously)
I've seen typical individual query latency of 4ms in the past when running DBs on separate machines, hence 200 queries would add almost an entire second to the page load.
But this is 4.3ms total, which sounds reasonable enough for most systems I've built.
A single non-trivial query required for page load could add more time than this entire latency overhead. So I'd probably pick DBs based on factors other than latency in most cases.
Oh, I've definitely seen N+1 problems with a SQLite database before. It's more resilient than a traditional DB server, but it can still run into performance issues when doing too many select queries.