Anyone who is looking at using SQLite + Rails should check out the work done by Oldmoe (X/GitHub) on his Litestack project: https://github.com/oldmoe/litestack
Here's the intro paragraph: "Litestack is a Ruby gem that provides both Ruby and Ruby on Rails applications an all-in-one solution for web application data infrastructure. It exploits the power and embeddedness of SQLite to deliver a full-fledged SQL database, a fast cache, a robust job queue, a reliable message broker, a full text search engine and a metrics platform all in a single package."
I'm currently using it on a project and can't say enough good things about it!
I’m making a FOSS analytics system, and ease-of-installation is important. I want to send event data to a separate SQLite database, to keep analytics data separate from the main app’s data.
I’m concerned about scaling, since even a modestly busy website could have 1000+ events per second.
My thought is to store events in memory on the server and then make one batched write every second.
Does this seem like a reasonable way to get around the SQLite limitation where it struggles with lots of DB writes? Any better ideas?
I'd say that's a very reasonable and very good idea. I've implemented similar schemes in many other systems. Batching in general always reduces the per-entry overhead, and it's easy to prove this with a benchmark. You can even put all the batched items in a single transaction. And since you are batching, you'll essentially always have one thread that takes the batch and actually does the write, which matches SQLite's limitation of one concurrent writer perfectly.
What you need to worry about is slightly higher complexity:
(1) What happens when a single batched write doesn't complete within one second?
(2) How large is the in-memory queue you store events in, and is it bounded or unbounded?
(3) If it is unbounded, are you confident that overloading the server won't get it killed by the OOM killer? (Queueing theory says that when the arrival rate is too high the queue size grows without bound, so there must be another mechanism to push back.) And if it is bounded, are you comfortable with dropping entries?
(4) If you do decide to drop entries from a bounded queue, which entries do you drop?
(5) For a bounded queue, what is its limit?
These questions arise in almost every system that needs queueing. Thinking about them helps not only in this instance, but also in many other scenarios you may encounter.
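To make that concrete, here is a minimal sketch in Ruby of the batch-per-second pattern with a bounded queue and a single writer thread. The `events` table, column names, and drop policy are all made up for illustration; this is one way to structure it, not a prescription:

```ruby
require "sqlite3"

# A minimal sketch, not production code: a bounded in-memory queue plus a
# single writer thread that flushes one batch per second in one transaction.
MAX_QUEUE = 10_000
EVENTS = SizedQueue.new(MAX_QUEUE)

def record_event(event)
  EVENTS.push(event, true) # non-blocking push: raises ThreadError when full
rescue ThreadError
  # Queue is full: drop the newest event (choose your own drop policy).
end

Thread.new do
  db = SQLite3::Database.new("analytics.sqlite3")
  db.execute("PRAGMA journal_mode = WAL")
  db.execute(<<~SQL)
    CREATE TABLE IF NOT EXISTS events (name TEXT, payload TEXT, created_at TEXT)
  SQL
  insert = db.prepare("INSERT INTO events (name, payload, created_at) VALUES (?, ?, ?)")

  loop do
    sleep 1
    batch = []
    batch << EVENTS.pop until EVENTS.empty?
    next if batch.empty?

    # One transaction per batch keeps per-event overhead low.
    db.transaction do
      batch.each { |e| insert.execute(e[:name], e[:payload], e[:created_at]) }
    end
  end
end
```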
SQLite doesn't struggle with writes. But it only supports a single write transaction at a time; if you're worried about contention on that single write lock, you can serialize all your writes onto a specific thread/process.
Since analytics data is generally write-heavy, I would recommend using ClickHouse. You can use the async-insert[0] feature of ClickHouse, so you don't need to worry about batching events on your side. If you are looking for an embedded solution, you can use chDB, which is built on top of ClickHouse.
Batching writes is probably a good idea, but by far the best way to do something like this with SQLite is to use WAL and have a single designated writer (and as many readers as you want), probably fed by something like a queue. As long as you do that, I find the performance is often really amazing.
I do something similar for an audit trail at work. I work with the type of data where we may need to know who looked at what and when. All those records are stored in a separate SQLite DB (the main DB is Postgres), and I cycle it out once per calendar year. That makes archival trivial, and should a compliance person need to look at it, a simple desktop app can open the file easily.
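A rough sketch of what that per-year rotation can look like; the schema and file naming here are hypothetical, not the actual setup described above:

```ruby
require "sqlite3"
require "time"

# Sketch: one audit database per calendar year, picked by filename.
def audit_db(year = Time.now.year)
  db = SQLite3::Database.new("audit_#{year}.sqlite3")
  db.execute(<<~SQL)
    CREATE TABLE IF NOT EXISTS access_log (
      user_id     INTEGER,
      resource    TEXT,
      accessed_at TEXT
    )
  SQL
  db
end

db = audit_db
db.execute("INSERT INTO access_log VALUES (?, ?, ?)",
           [42, "record/1001", Time.now.utc.iso8601])
# Archiving a year is just copying or moving audit_2024.sqlite3 somewhere safe.
```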
You can't beat SQLite for ease of use. I'd try it out and simulate some load to see if SQLite can keep up; if you keep your inserts simple, I bet it can.
You can write large SQLite databases at disk speed with the right data model and schema.
What most people do, as mentioned, is either batch writes or send them over some kind of channel (an MPSC queue) to a single thread that is the designated writer; that queue effectively acts as a serialization barrier.
Either can work depending on your latency/durability requirements.
You also absolutely need to use the WAL journaling mode, which allows concurrent reads and writes (N readers but only 1 writer), and you probably want to take a hard look at relaxing or disabling synchronous mode, which is what forces SQLite to fsync everywhere all the time. That sounds bad, but consider your example: if you make one batched write every second, then there is always a 1-second window where data can be lost anyway. There's always a window where uncommitted data can be lost; it's mostly a matter of how small that window is, and whether the internal consistency of the system is preserved in the face of that failed write.
In your case, turning off synchronous mode wouldn't really be that bad, because your typical loss window (one second of batched events) is already greater than the one it introduces. At the same time, turning off synchronous mode can give you an order of magnitude performance increase. So it's very well worth thinking about.
TL;DR use a single thread to serialize writes (or batch writes), enable WAL mode, and think about synchronous mode. If you do these things you can hit extremely fast write speeds quite easily.
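For reference, roughly how those settings can be applied from the sqlite3 gem; the specific values (NORMAL rather than OFF, a 5-second timeout) are illustrative, not recommendations:

```ruby
require "sqlite3"

db = SQLite3::Database.new("app.sqlite3")

# WAL allows many concurrent readers alongside the single writer.
db.execute("PRAGMA journal_mode = WAL")

# NORMAL fsyncs less often than the default FULL; OFF removes fsyncs entirely,
# trading durability (and, on power loss, possibly integrity) for speed.
db.execute("PRAGMA synchronous = NORMAL")

# Wait up to 5 seconds for the write lock instead of failing immediately.
db.busy_timeout = 5000
```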
I'm still not understanding this push toward using SQLite as a production backend database. It's great for what it is, a tiny embeddable client-side application database. Like an address book on your phone. But even the developers themselves have steadfastly refused to allow it to expand beyond that scope. For instance, they won't add native types for any useful things like dates/times, or uuids. Because that would bloat the code and the size of the embedded object. So you're stuck with "everything is a string". Ref integrity can be enabled, but even those constraint options are very limited.
Not sure why people are still trying to shoe-horn it into a role that it's not meant to be in, and not even really supported to be.
First off, I don't know that Richard Hipp agrees with you about what roles SQLite is "meant" to be in.
Second: the reasons are straightforward:
* For read-heavy access patterns, SQLite is crazy fast.
* It's fast enough that you can often simplify your database access code; for instance, N+1 queries are often just not a problem in practice.
* SQLite removes a whole tier from the N-tier architecture, which in turn removes a whole set of things that can go wrong (and if you've ever managed your own Postgres or MySQL: things do go wrong).
It's not a perfect fit for every application, or even the majority of applications, but the push you're seeing is a correction against the pretty clearly false idea that SQLite is well suited only for "tiny embedded client-side application databases".
> I don't know that Richard Hipp agrees with you about what roles SQLite is "meant" to be in.
If Hipp thought that SQLite was suitable for backend applications where the database is the authority then he would allow real types and the associated constraints. But he won't do that because it complicates the code and bloats the embedded object size.
SQLite is great for what it is. But it's not a real concurrent backend database. It's a client-side database. That's all the SQLite developers will ever allow it to be.
We can try to layer on a bunch of stuff like Litestream or whatever, and sharding. But the fact is that the core database itself is not, and will never be, suitable for backend applications.
You can accidentally write a string to an int column. Will SQLite say no? No. SQLite doesn't care. It returns everything is A-OK!
You can query an ISO-8601 string column with date_trunc() and strftime() and it just returns NULL whether there was a value or not, or maybe just because it didn't recognize the string in that column (LOL).
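For anyone who hasn't seen this behaviour, here's roughly what it looks like with the sqlite3 gem (table and values are made up):

```ruby
require "sqlite3"

db = SQLite3::Database.new(":memory:")
db.execute("CREATE TABLE users (id INTEGER, signup_count INTEGER)")

# Type affinity lets a string land in an INTEGER column without complaint:
db.execute("INSERT INTO users VALUES (1, 'not a number')")
p db.execute("SELECT signup_count FROM users")        # => [["not a number"]]

# strftime() quietly returns NULL for a string it can't parse as a date:
p db.execute("SELECT strftime('%Y', 'garbage-date')") # => [[nil]]
```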
SQLite is fine. But it's not a real backend database. It's not a replacement for PG.
What's a "real" "concurrent" "backend" database? SQLite is absolutely not just a "client-side" database: plenty of people use it in backends, and some of them are posting about it in this thread.
The correctness arguments apply just as much, if not more so, to MySQL and to document/schemaless databases. Lots of people don't like those databases, but nobody claims they're not "real backend databases".
You seem hung up on the idea that "backend" means "n-tier", with a segregated compute/storage tier for the database with networked connectivity to the app server. That architecture is something SQLite will never support, but that is not the only backend architecture.
I think there's a new generation of developers that don't want to use "NoSQL" databases anymore (MongoDB, etc.). I think that's why we're starting to see a surge in people wanting to run SQLite as their backend database. It's similarly simple to start out with, and also similarly flimsy when dealing with actual data integrity: very limited types, limited/uncomplicated isolation options, referential integrity disabled by default (MongoDB also disables things by default that would hurt their benchmarks).
No, you're seeing a surge in interest for SQLite because people like relational databases, but the n-tier architecture is sometimes not the right solution for the problems people have. And again: many of your arguments have been applied to MySQL, but nobody can with a straight face say it's not a "real" backend database.
(To a first approximation ~nobody is interested in SQLite because it lacks correctness or rigid typing features; what's interesting about SQLite is not what was interesting about schemaless databases, but rather the ability to ship backend apps without a separate database tier.)
Again: I think you need to snap out of the idea that n-tier architectures are axiomatically optimal for all backend applications. They often are! But not all the time.
I think that most applications are written for their database. Their database defines their application.
If you write your application on a flimsy database then your application becomes equally flimsy. All of your business constraints become flimsy because your source-of-truth (the database) is flimsy.
This was the kind of thing people used to say about MySQL before Meta made those arguments look silly, and so they've moved to SQLite as a new target. I like Postgres fine, but it's just a tool, like many others.
Of course, once you come to that realization, then you realize that it is all one and the same and that there isn't any magic going on, which in turn makes clear that business constraints can go anywhere in your application and be written by anyone.
I suspect what you are really trying to say is that you trust Hipp more than you trust yourself to get the constraints right. Indeed, if you screw it up you're in for a world of hurt, so you are right to be cautious. But, if you have more trust in a random stranger who has no care for your data than you do yourself to implement it for you, perhaps you shouldn't be writing any code at all? Software development certainly isn't for everyone.
Au contraire, SQLite makes it very easy to write extensive automated testing for your application, since you can spin up in-memory DBs per test with minimal overhead. This makes your application much more robust.
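A sketch of that pattern with the sqlite3 gem and Minitest; the table and assertions are illustrative:

```ruby
require "sqlite3"
require "minitest/autorun"

class UserStoreTest < Minitest::Test
  def setup
    # A fresh, isolated in-memory database for every test: no files,
    # no shared state, negligible setup cost.
    @db = SQLite3::Database.new(":memory:")
    @db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
  end

  def test_rejects_duplicate_emails
    @db.execute("INSERT INTO users (email) VALUES (?)", ["a@example.com"])
    assert_raises(SQLite3::ConstraintException) do
      @db.execute("INSERT INTO users (email) VALUES (?)", ["a@example.com"])
    end
  end
end
```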
> But it's not a real concurrent backend database. It's a client-side database
People are successfully using it server-side, in specific situations it appears to be a good fit.
> You can accidentally write a string to an int column
Yes, you need more validation logic client-side in exchange for the performance gain.
It's a trade-off, not a black/white distinction. A strongly typed language can help here.
> If Hipp thought that SQLite was suitable for backend applications where the database is the authority then he would allow real types and the associated constraints. But he won't do that because it complicates the code and bloats the embedded object size.
Totally baseless claim. Advances to the query optimizer complicate code and bloat the binary far more than adding DECIMAL, DATETIME or UUID as types would.
The reason types don't change is forward and backward compatibility, and the promise of supporting the current file format and APIs for interacting with it for at least another 25 years.
We eventually realized that the "API server" was actually just a DBMS all along. And once that was realized, it became clear that a DBMS sitting beside another DBMS that deals with the exact same data is rather silly, which is now leading to:
1. Let the clients connect to Postgres directly.
- or -
2. Cut Postgres out of the picture and double down on the homegrown DBMS.
Some are going the #1 route, others #2. Where #2 is opted for, SQLite is a convenient engine to build upon. It may not be perfect, but it is what we have. Keep in mind that this realization on a grand scale (I'm sure some noticed many years ago) is fairly recent, so there is a lot of experimenting going on to figure out what works and what doesn't.
It's the natural cycle of computing. What is old is new again.
(Replace Postgres with MySQL, MSSQL, Oracle, or other DBMS as you see fit.)
Why not do UUIDs as a string or blob? And dates as strings, or integer / real timestamps?
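Concretely, something like this (a sketch; the column choices are just one common convention):

```ruby
require "sqlite3"
require "securerandom"

db = SQLite3::Database.new(":memory:")

# UUIDs as TEXT (or pack them into 16-byte BLOBs to save space),
# timestamps as integer Unix epochs.
db.execute(<<~SQL)
  CREATE TABLE events (
    id         TEXT PRIMARY KEY,  -- UUID stored as a string
    created_at INTEGER            -- Unix timestamp
  )
SQL

db.execute("INSERT INTO events VALUES (?, ?)", [SecureRandom.uuid, Time.now.to_i])

# SQLite's date functions understand Unix epochs directly:
p db.execute("SELECT datetime(created_at, 'unixepoch') FROM events")
```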
It conceptually simplifies things in so many ways that benefit the app developer, not just the sqlite devs and low-spec hardware. Simpler documentation, shorter learning curve, smaller surface area for bugs, smaller binary size, etc.
There's a trend to add bloat and complexity to everything in software these days, but I'm so glad that a few projects like SQLite are pushing against that.
This is an excellent article! I wonder if there is any equivalent for Django?
ArchiveBox uses SQLite via Django, and I've run into exactly the issue the author describes in Rails fairly often. It would be awesome to have a SQLite-layer solution that doesn't require serializing all my writes through some other channel in the app.
I think @flexterra (aka gcollazo) should also add `"check_same_thread": False` to the recommended OPTIONS, right? Unless they left it out intentionally?
> By design, the sqlite3-ruby gem does not release the GVL when calling SQLite. For the most part, this is a reasonable decision [...]
Following the issue comment link https://github.com/sparklemotion/sqlite3-ruby/issues/287#iss..., it sounds like they had a suspicion about a significant cost of reacquiring the lock but didn't validate it. Sounds iffy, especially given all this workaround effort.
I feel like in, e.g., Python extension culture this would have been designed the other way (maybe someone knows how it's done there?).
edit: also, there's this other comment in the linked issue:
> The extralite gem is an alternative SQLite client which releases the GVL during blocking, see note on concurrency here: https://github.com/digital-fabric/extralite?tab=readme-ov-fi.... It is both significantly faster than this gem in general and doesn't have concurrency issues.
You can read more discussion here: https://github.com/sparklemotion/sqlite3-ruby/pull/528 and here: https://github.com/digital-fabric/extralite/pull/46 to see how it was validated that simply releasing the GVL for every `step` in the SQLite VM majorly hurts single-threaded performance. Finding a middle ground for both single-threaded and multi-threaded performance is tricky. In Rails, we know it is multi-threaded because of the connection pool. But the lower-level gem is used in many other libraries and tools where it runs in a single-threaded environment.
It's not too dissimilar; that said, SQLite is much more performant than Jet (MS Access) for mostly-read scenarios. Not to mention that computers and disk speeds are much better today than decades ago, when Access use was more common. You can pretty easily hit tens of thousands of requests per second with SQLite, and probably even Jet, for mostly-read usage.
Most applications don't have even hundreds of thousands of simultaneous users, so SQLite can be a great fit. SQLite also shines in that it has clients for just about every platform/language you're likely to want to use.
Archival/backup/portability are also very nice use cases for SQLite. I've worked on projects with specific, time-boxed data inputs, actively pushed for using one SQLite database per box, and still feel it would have been better than a very complex schema with export/archive functionality written as custom code. My approach would have allowed simply copying a file as the archive/backup, and schema changes over time would not necessarily have needed to be accounted for as deeply.
YMMV, but it's definitely a decent solution for many problems, much as using PostgreSQL or another RDBMS is often a better solution than a more scalable NoSQL option for most applications. There has been a tendency to over-engineer things, and we're approaching a level of compute/IO that is less and less likely to justify those efforts.
I would love to have something like MS Access but for web apps. I’ve tried various website builder products but nothing seems as straightforward as Access was.
> It’d analyze the feed, see which jobs were remote, then normalize the data and then push it into a simple SQLite database (yes, I’m not using JSON text files as a database anymore, thank you :P).
> This includes all data used by the app. I actually shy away from using real database systems in the 12 startups. Instead I use JSON text files.
I was hoping to find data on Pieter's current projects using SQLite and their loads. (For all I know, RemoteOK still does, but I can't find more recent posts about it.)
Very informative article and well written! I wonder why the default `busy_timeout` method has this exponential delay that punishes older queries. Why does this make sense as the default?
Awesome, I'm always glad to see when someone figures out integration problems and helps the rest of us. I hope he manages to get these fixes into the default Rails config.
I run a Rails app; I switched to Postgres years ago and never looked back. Postgres is awesome. Still, it's great to have alternatives available, and I use sqlite for other tasks, so I know it has good capabilities too.
I’m very excited that yes indeed we have the four major pillars in Rails 8, which is releasing soon, but can be used now via the main branch. The default, out-of-the-box experience with Rails 8 will go all in on SQLite, a database will be the only dependency, and you will have a production-ready app with jobs, cache, web sockets, and a primary database ready from `rails new`. I’ll be talking more about this all at Rails World in a couple weeks and that talk will be on YouTube sometime after that. But exciting times are ahead for Rails, for sure.
No, you can set up replication with, e.g., LiteFS, where you have one writer and multiple read replicas.
That said, then you have operational overhead that defeats part of the purpose of SQLite. In practice, you can get very far with a single machine and many CPUs (Postgres is ironically a good example of this). In, e.g., Go you can easily parallelize most workloads. In Rails, I don't know if that's possible; a quick search suggests there's a GIL, which can be limiting.
This is my perspective as well. You certainly can horizontally scale with SQLite, but I strongly recommend that you scale vertically until you hit an actual limit there. If you know you will absolutely need multiple app nodes on day 1 or day 10, I think you will probably be better served by choosing a client/server database like MySQL or PG instead.
So, you aren’t limited to a single machine, but you should stay on a single machine as long as possible and extract as much value from that operational simplicity before you trade it for some kind of horizontal scale.
> If you know you will absolutely need multiple app nodes on day 1 or day 10
Yes, but there are a few options even then. First, you can of course tune http caching etc, find traditional bottlenecks.
Second, you can also break the business logic into a separate API endpoint that runs only SQLite + business logic + API responses. Then you can add more frontend nodes in case rendering and other things are most expensive.
The main downside is all logic practically has to be written in the same language as a monolith.
Very true. There isn’t an actual limit. You can horizontally scale with SQLite if you want to or need to. I just think it is worth pushing vertical scaling as far as possible as long as possible. And I don’t actually believe that SQLite is the right tool for every problem or web app. Some apps absolutely should use managed PG/MySQL or serverless PG/MySQL. I think they are the statistical exception and 80% of web apps would be well served with SQLite. But for the other 20%, probably simpler to just start with PlanetScale
I've found that with a traditional RDBMS, schema and query structure often count for as much as or more than specific tweaks. I have pushed for using a single-node CockroachDB configuration for local/development instances and then using hosted (cloud-provided) PostgreSQL for production. The use of CDB is to allow for the potential of future scaling as needed, while PG is more widely available as an externally supported option from the start, with Cockroach Labs (cloud) as an optional step as well as self-managed.
Just my own $.02: you can definitely tweak an RDBMS, but it's going to vary by use case, and more work can be needed (indexing in particular is a bit of a dark art).
As I explain in the post, if you have multiple connections and a consistent write load, the timeout will penalize older queries and noticeably harm your long-tail latency.
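For anyone curious what the alternative looks like: the sqlite3 gem lets you install your own busy handler in place of the built-in timeout, for example one that retries at a constant short interval. This is a sketch with made-up numbers, not necessarily exactly what the post implements:

```ruby
require "sqlite3"

db = SQLite3::Database.new("app.sqlite3")

TIMEOUT_SECONDS = 5.0
RETRY_INTERVAL  = 0.001

# The handler receives the number of times it has been invoked for the
# current lock attempt. Returning a truthy value tells SQLite to retry;
# returning false gives up and surfaces SQLITE_BUSY.
db.busy_handler do |count|
  if count * RETRY_INTERVAL < TIMEOUT_SECONDS
    sleep RETRY_INTERVAL
    true
  else
    false
  end
end
```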
I thought it was a pretty good list of common Rails-application-specific and sqlite3-specific knobs to turn, for newcomers to performance tuning. (Really just a guided tour though -- turn this knob to enable this particular tool for dealing with concurrency problems...)
I'm not using Rails, but I now have several sites using my own little thing that is a single docker container where all state + content is in a single sqlite file, and it's very nice to be able to just move that single file around. I love postgres, but doing the equivalent of that with Postgres is a lot more hassle.
While I'm fine with using SQLite for these things, I would counter that a docker-compose file makes using a db with your app roughly as easy as a SQLite file, only in that you'd have a data directory as a volume mount for the db. PostgreSQL and MySQL/MariaDB in particular are pretty easy to launch with a configured user/pass for the db/app specifically with docker/compose.
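For illustration, roughly what that looks like; the image tag, credentials, and paths below are placeholders:

```yaml
# docker-compose.yml: Postgres with its data directory as the only volume mount.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: change-me
      POSTGRES_DB: app_development
    ports:
      - "5432:5432"
    volumes:
      - ./pgdata:/var/lib/postgresql/data
```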
If you're comparing to SQLite, sure... if you're comparing to installing and configuring an RDBMS server on a host OS for development work, I'm going to hard disagree. Most services already have a hosted docker container configured, usually by the developers of said service. Getting that running is often as simple as googling "service-name docker-compose example" and "docker compose up".
And once you do understand docker-compose, it becomes second nature. I'd be willing to state that dealing with a merge conflict with source control is more difficult than docker-compose.
It is until you realize that using SQLite means you don't have to worry about N+1 queries, which actually does make a pretty big difference in Rails code.
Not sure I understand this point, how does SQLite fix the N+1 query problem? Just by having the data co-located with the app and avoiding the round-trip latency hit?
If so, I'd argue you still have N+1 problems, you just won't notice them until N gets a bit larger.
For others, the short-ish answer is that doing hundreds of SQL queries in response to a request (loading nested timeline elements in their case) in SQLite is fine because of the lack of networking/IPC overhead. The nature of N+1 queries is unchanged.
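A toy way to see this with the sqlite3 gem directly (schema and counts are made up; on an in-process database each of the extra queries costs microseconds):

```ruby
require "sqlite3"
require "benchmark"

db = SQLite3::Database.new(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER)")
db.execute("CREATE INDEX idx_comments_post ON comments(post_id)")

db.transaction do
  100.times do |i|
    db.execute("INSERT INTO posts (id) VALUES (?)", [i])
    10.times { db.execute("INSERT INTO comments (post_id) VALUES (?)", [i]) }
  end
end

# The classic N+1 pattern: one query for the posts, then one per post.
puts Benchmark.realtime {
  post_ids = db.execute("SELECT id FROM posts").flatten
  post_ids.each { |id| db.execute("SELECT * FROM comments WHERE post_id = ?", [id]) }
}
```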
The other half of it, which the SQLite page doesn't mention, is that SQLite lacks query-engine optimizations that would make one large query faster than many smaller queries. If you had a hypothetical in-process version of Postgres with no IPC or networking overhead, you'd still see benefits from performing fewer, larger queries, because the query planner adds some overhead to small queries but makes complex queries faster.
I'm wondering how it would perform if we could compile https://pglite.dev/ as a native library and use it as an in-process Postgres... I know Node folks already use it as a Wasm module, so it shouldn’t be too tricky?
I would suggest that sometimes you want N+1 with a collapsed data set (JSON column data) if you have limited request size, separate stores/service and/or otherwise have clear primary key lookups for the secondary data. I've seen these types of queries run faster with separate lookups, especially depending on caching in environments where the lookup would mean a DBMS connecting to another DBMS for the extra data much more slowly.
I kind of have to agree. I recently thought I'd use SQLite in Rails for my new project to keep things simple, but then realized it's actually more annoying for my use case. I'd need a persistent volume with the right permissions, and I can't just connect to my PG instance running on the server from my local machine to run some queries.
I'm sure it makes things easier for some use cases but it's not a given.