Consider SQLite (wesleyac.com)
362 points by wesleyac on Dec 29, 2021 | 267 comments



We've been using SQLite in production as our exclusive means for getting bytes to/from disk for going on 6 years now. To this day, not one production incident can be attributed to our choice of database or how we use it.

We aren't using SQLite exactly as intended, either. We have databases in the 100-1000 gigabyte range that are used concurrently by potentially hundreds or thousands of users. Performance is hardly a concern when you have reasonable hardware (NVMe/SSD) and an appropriate configuration (PRAGMA journal_mode=WAL).

In our testing, our usage of SQLite vastly outperformed an identical schema on top of SQL Server. It is my understanding that something about not having to take a network hop and being able to directly invoke the database methods makes a huge difference. Are you able to execute queries and reliably receive results within microseconds with your current database setup?

Sure, there is no way we are going to be able to distribute/cluster our product by way of our database provider alone, but this is a constraint we decided was worth it, especially considering all of the other reductions in complexity you get with single-machine business systems. I am aware of things like dqlite/rqlite et al., but we simply don't have a business case that demands that level of resilience (and complexity) yet.

Some other tricks we employ: we do not use 1 gigantic SQLite database for the entire product. It's more like a collection of microservices that live inside 1 executable, each owning its own independent SQLite database. So, we would have databases like Users.db, UserSessions.db, Settings.db, etc. We don't have any use cases that would require us to write some complex reporting query across multiple databases.
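
The per-service connection setup is roughly this (a minimal sketch using Python's sqlite3; the file name and the synchronous pragma are illustrative, not our exact code):

    import sqlite3

    def open_service_db(path: str) -> sqlite3.Connection:
        # Each microservice opens only its own database file, e.g. Users.db.
        conn = sqlite3.connect(path, timeout=30)
        conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the single writer
        conn.execute("PRAGMA synchronous=NORMAL")  # common pairing with WAL; an assumption here
        return conn

    users_db = open_service_db("Users.db")
    users_db.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    users_db.commit()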


> It is my understanding that something about not having to take a network hop and being able to directly invoke the database methods makes a huge difference. Are you able to execute queries and reliably receive results within microseconds with your current database setup?

With SQL Server you can get a very fast local connection by specifying "server=(local)" in the connection string - this uses the shared memory protocol, bypassing the network stack.
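
For example, from Python with pyodbc (the driver name and database are placeholders, and the explicit lpc: prefix to force shared memory is my reading of the client protocol docs, so treat it as an assumption):

    import pyodbc

    # "(local)" resolves to the local instance; with the shared memory protocol
    # enabled, the connection bypasses TCP entirely. "lpc:" asks for it explicitly.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=lpc:(local);DATABASE=MyAppDb;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT @@SERVERNAME")
    print(cursor.fetchone())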


This is true - The counterpoint is that now you have this leviathan that is the SQL Server process(es) running on the same machine as the application.

If I am constraining my SQL Server usage to fit on 1 box, I might as well use SQLite (assuming no future plans for horizontal scaling).


How does this handle privilege separation? Typically you'd have all accesses as root in SQLite.


> Typically you'd have all accesses as root in SQLite.

So? It's the backend talking to the db.


Systems of least privilege prevent access exploitation (from either a vuln or a misconfiguration) from going further than what was risk-assessed. Or from "accidents" like a little utility script nuking the database with root privileges. Or having multiple entry points with their own tables of data. That's the difference between "hackers stole X but not Y and Z" and "hackers dumped everything".

Your "backend" should not be assumed to be the infallible security. Outside of limiting DB access procledges, we have proper selinux and other configs to lock down the software abilities so even in the case of Arbitrary code execution caused by a common exploit it may be very difficult to use. This is important when you're running a box with multiple other applications with their own DB's as well.

In fact, these reasons are why the "backend" software and DB are often hardware-isolated, or at least VM-isolated, so that those DB access limitations aren't bypassed.


I haven't done anything like this before, but simple FS level file permissions should resolve the issue as long as the data with separate priv requirements is put into different sqlite databases.


Oracle clients can do the same by setting the ORACLE_SID and ORACLE_HOME environment variables.

For remote databases, setting the TWO_TASK environment variable to the server's TNS descriptor is one way to force a network login.


How is this setup fault tolerant? What happens if there is a hardware failure? How do you partition access in a way that means an extremely active user doesn't impact availability?


> How is this setup fault tolerant?

It is not.

> What happens if there is a hardware failure?

The product would suffer a total outage until manual intervention takes place. A restore of the VM from snapshot would be carried out by the customer. Some loss of the most recent business data would occur (i.e. between latest snapshot and time of failure). All of this is understood and agreed to by our customers.

> How do you partition access in a way that means an extremely active user doesn't impact availability?

Partitioning is not applicable. Our entire product fits on 1 machine and dividing the I/O along any dimension does not add much value to the performance equation.


Wait a minute....are you implying that I don't need Azure and FB level infrastructure for my local recipe application?

/s


I'm pretty sure you do need FB and Azure level infra for your recipe app. I've read a couple blog posts, watched a video on YouTube and copied the code from Stack Overflow, so I'm pretty much an expert on this, so trust me.


My recipe application is going to be written in a purely functional language with CQRS and a message bus that provides "guaranteed once" determinism. I should be able to spin up no more than a half dozen containers on two VMs to handle this architecture. The database will probably be redundant MSSQL and Oracle (just in case one technology is fundamentally broken). Database writes will be proxied by a Redis instance. I will use 3x 3080 GPUs to train an ML model used for scaling the recipe depending on the number of people you wish to serve. Hmmm...I might need a third VM.


You can't do any serious neural stuff on 8GB of VRAM. I'd say save yourself some pains and buy the $3000 professional models with ECC so you can train transformers.


It's never gonna scale. Get on the NoSQL train now to future proof and woo the VCs.


Yes. Make it web scale.

Also, think early on about your compensation packages. You don't want to lose a 10x engineer to a FAANG, do you?


> Make it web scale.

Oh god, whenever I read/hear that it reminds me of the ridiculously funny video about nodejs and Apache servers. [1]

[1]. https://youtu.be/bzkRVzciAZg


Glad you liked it! The reference to that series of videos was of course intended.

The specific video I had in mind was actually https://youtu.be/b2F-DItXtZs

Enjoy!


Oh yes! I’ve seen this before as well. Should’ve guessed it’s this considering we were talking about sql.


This video is just gold.

asyncio is probably one of the worst things that happened to programming in the last 2 decades. It's essentially a step backwards, to the technology of the 90s (cooperative multitasking), when a badly written text editor could cause an entire OS to freeze.


Curious why you think that. We have an app that does computation using thread pools, and moving from manually-created threads to asyncio has made it much more efficient and robust.

Some advantages: cancellation and error handling are trivial, there's no need for error-prone thread coordination, threads don't have to waste time waiting on locks (because we use asyncio sync primitives instead, so we don't submit work to a thread pool unless it's doable right now with no wait).

Of course, it depends on the tasks being written correctly. But "don't block the event loop" is a much easier rule to follow than "don't deadlock these 10 different threads that share queues and locks".

We didn't write an entire OS with it, but I don't think that was ever the point of asyncio, was it?
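
A toy version of the shape I mean (hypothetical names; real blocking work would go through asyncio.to_thread or run_in_executor rather than sleep):

    import asyncio

    async def work(i: int) -> int:
        # Stand-in for a unit of work; blocking library calls would be wrapped
        # in asyncio.to_thread() so they run in the default thread pool.
        await asyncio.sleep(0.1)
        return i * i

    async def main() -> None:
        try:
            # Cancellation and error handling fall out naturally: if this times
            # out, every pending task in the gather() is cancelled for us.
            results = await asyncio.wait_for(
                asyncio.gather(*(work(i) for i in range(5))), timeout=1.0
            )
            print(results)
        except asyncio.TimeoutError:
            print("batch cancelled")

    asyncio.run(main())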


> The database will probably be redundant MSSQL and Oracle (just in case one technology is fundamentally broken)

Only one technology?


Interesting. For an extremely specific use case and with users who understand and accept the caveats of this approach I'm sure it would work well enough. The most confusing thing to me is that there is apparently an intersection of users who are ok with an outage and data loss with users who want a product which can

> execute queries and reliably receive results within microseconds

What is your product? Who are these users?


I can think of plenty of services for which an occasional (once a year? less?) outage is okay. Heck, anything relying on AWS us-east-1 is going to have outages that frequently, based on the last few months. Meanwhile, almost any service is better off when its response times are cut drastically. I’ve seen many instances where a service’s response times are more than half waiting for a db to respond.


It's not the threat of an outage with data loss that is concerning to me - I just want to understand the use case that needs fractions of a second shaved off query times by using SQLite in this way while also being ok with the possibility of data loss.


Interactive interfaces. There's a huge difference between moving a slider and seeing the reaction in real time and moving the slider and seeing the reaction a second later. If you define "real time" as 30 fps, you have 33ms to process each query and show the result. That could involve multiple database queries with computation in between, if your business logic isn't easily expressible in SQL.

Come to think of it, that covers most apps with a UI. Apps where you are exploring data are definitely more impacted, however.


I'd rather have a service that is down one day per year than suffer the death by a thousand papercuts of a slow UI every day.

Imagine some frustratingly slow app like Jira, for example; I would be much happier with a snappy UI there, where one could theoretically still work through a downtime of a week or more without updating your task status.

If we are talking about shaving the last milliseconds, it could be something very interactive like autocompletion. Which is something with a graceful fallback of typing in the whole thing manually.


If my webpage loading makes ten database calls, and each database call takes 45 milliseconds, my webpage takes 300 milliseconds longer to load than if each call only took 15 milliseconds. Saving 0.3 seconds on each page load, especially if the rest of the load takes less than a second, can significantly improve the usability of a website and increase user retention. This applies to any user-facing website, whether it's a social media platform, an online shop, or just a blog.

Obviously there are other improvements that can be made too, like batching db requests or in-memory caching, but this is one option.


All users understand and empathise when you say "Sorry, the system is down right now" once or twice a year.

None of them display any understanding or empathy whatsoever when you say "Your trade will always be executed after a 1 second delay, even if the price has moved"

Users find occasional downtime awful, but they find consistent lethargy worse.


No customer in fintech is going to accept "we lost some transactions" and buy the software, so your use case is covered in that they are up front with the customer that, if the server goes down, any transaction in progress will not complete.


Financial transactions, especially payments, work under the assumption that any node in the path from "payment attempt made" to "Payment attempt successful" can go down and not have an incorrect record of payment.

Authority? Me - I worked in the payment space for EMV for a number of years. There are (maybe) a minimum of five nodes in the path from swiping a card/making a payment to a successful payment, and each of these computers/systems are owned by different companies, all of which don't trust the other not to be a MitM, and all of which are designed to not lose transactions if any of the systems involved suddenly power off.

It's almost a myth at this point that a system doing financial transactions needs 100% uptime or it will lose transactions.

Trust me, even a system with only 80% uptime is never going to mislay transactions. The worst that happens is that transactions won't be processed during the downtime, payment or transfers are declined and the entity making payment/transfer can use the decline message to try again later.

If you're quadrupling your development and infrastructure cost in an attempt to reduce your downtime from twice a year to once every decade you're simply throwing money away[1].

[1] If you're doing that, it's also possible that you don't know of the many fail-safes built into the protocols for message exchange. Remember that these systems worked fine prior to HA/Always-Available cloud systems. Adding HA cloud tech is not going to make them more resilient than they are now, it just introduces additional points of failure.


There is a difference between losing data, and losing transactions. Transactions can be replayed from the message log (depending on the implementation) and reprocessed if needed.

I have seen plenty of recovery operations in finance, sometimes cases where the entire database server was lost, and even though the transactions were not saved, no actual financial data was ever lost, as this data could always be reloaded.

Of course, certain systems exist that do store data in a way where loss is not accepted, but even in finance these are somewhat rare compared to the vast number of other systems.


> users who are ok with an outage and data loss

It depends on how you couch it. Working with large fleets of cloud servers has jaded us (well, me) into forgetting just how high you can get the uptime of a single server or well-managed VM if you're careful.

Sure, power cords get tripped over and hardware fails sometimes, but I'd imagine a lot of users who do care about reliability are making totally reasonable judgements that single points of failure in their systems are acceptable if the SPOFs are hardened and the oh-shit plan (restoring from a VM snapshot in this example) is something that they have confidence in.

Heck, given the availability issues that arise from the complexity of redundant/HA systems, I think there are probably plenty of seasoned enterprise customers who prefer hardened SPOFs managed by experts over distributed systems, especially once the dimension of cost enters into the discussion.


For starters, how likely is a data loss? SQLite may lose a transaction if the server shuts down or the application crashes, but this doesn't mean the db gets corrupted.

Secondly, even if that happens, the fact that the database is a simple file on the production server makes it easy to back up and restore.

Thirdly, a "traditional" db is no proof against loss either...sure, the production server with the SQLite can go down...so can the container running Postgre.

And lastly, actually most services are doing perfectly fine when a data loss occurs. Take a message board for example. Worst case scenario: A post isn't written and someone has to type a few lines and emojis again.


I agree that 'reliably' would be a poor choice of diction when considered in your context.

The user group of our product is effectively only professional bankers and their managers. No one in the general public has any direct access.


For the extremely specific use case of "almost everything" and the users "most everyone, maybe at the right price point", yeah.


[flagged]


No need to be condescending when I'm trying to learn. The premise of using SQLite because it was easier and to save fractions of a second didn't make sense to me as a tradeoff for potentially losing data


As long as you're estimating your future costs correctly then you're golden. If you aren't and your application becomes complex through growth or you need resiliency then you'll need to pay that cost and that cost can be big.


Interesting. So I would ordinarily want to put a foreign key constraint on the user_id column of a UserSessions table (or similar). In general, presumably you have relationships across the tables that are in those discrete databases. Do you just enforce these constraints/do joins in code? It seems like splitting related tables across multiple databases loses some (possibly a lot?) of the benefits of relational DBs, so I'm curious how you handle/manage that.

That said, I love the idea of this architecture. Might use it for whatever next dumb little web service I cook up! I love how this simplifies a lot of dev/deployment ops - perfect for a side project.


You'd have to do the join on the client. This is a problem in general for scaling SQL - it's not impossible, but with sharded SQL servers you have to be very deliberate in how you do joins so that they coincide as much as possible with your sharding strategy.

It would be similar here. According to the author, each SQLite DB belongs to a single microservice, which will naturally group together the most common joins. Anything else will indeed have to be joined manually.

Part of the reason nosql became popular for a bit. That's reversed, and sharded SQL is pretty common now, but it definitely adds more cognitive load to schema design.


SQLite has ATTACH (https://sqlite.org/lang_attach.html) for cross database operations. I've never tried it for foreign key constraints across databases, but I think it would work?
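
A cross-database join via ATTACH looks roughly like this (a sketch using Python's sqlite3; the table and column names are made up). The join part definitely works; the foreign-key-across-databases part is what I'd test first:

    import sqlite3

    conn = sqlite3.connect("Users.db")
    conn.execute("ATTACH DATABASE 'UserSessions.db' AS sessions")

    # Tables in the attached database are addressed with the schema prefix.
    rows = conn.execute(
        """
        SELECT u.name, s.last_seen
        FROM users AS u
        JOIN sessions.user_sessions AS s ON s.user_id = u.id
        """
    ).fetchall()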


> So, we would have databases like Users.db, UserSessions.db, Settings.db, etc.

How do you do joins?


Not the OP, but you can attach and join tables from multiple sqlite databases in the same process.


Really??? How? Because I use sqlite, and the query functions that I know take a single DB as an argument.



I would guess they do parallelized calls to the microservices to collect the data.


Are you capable of achieving zero-downtime deployments? I mean, on the product I currently work on, we have one Mongo database and a cluster of 4 pods on which our backend is deployed. When we want to deploy some new feature without any downtime, one of the pods is shut down, our product still works with the 3 remaining pods, and we start a new pod with the new code, and do this for all 4 pods. But with SQLite, if I understand correctly, you have one machine or one VM that is both running your backend and storing your SQLite.db file. If you want to deploy some new features to your backend, can you achieve zero-downtime deployment?


This is really interesting. Would you mind sharing a bit about how you keep the data that is shared across services in sync? Or is there a hard separation of concerns, so that services only communicate with the service that owns that data to obtain it?


To my knowledge, WAL mode still needs to serialize writes for each database file. I'm assuming this is not a setup where there are too many concurrent writers?


Correct - we are using the typical SQLite build, which serializes all writes by default, and we have made no effort to undo this behavior. We actually rely on this to ensure things happen in the right order, and we take advantage of it for performance reasons.

Because SQLite is effectively serializing all the writes for us, we have zero locking in our code. We used to have to lock when inserting new items (to get the LastInsertRowId), but the newer version of SQLite supports the RETURNING keyword, so we don't even have to lock on inserts now.

Also, the fact that we have the databases divided by function helps free up some of the lock contention in the provider. We don't really have any mixed-workload databases - it's either "slow" gigantic operations (JSON blob access) or super quick tiny things (updating session timestamps). So, there is some minor isolation of contention issues on a subsystem basis.
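
The RETURNING part looks roughly like this from application code (a sketch with Python's sqlite3; it needs SQLite 3.35+, and the table is hypothetical):

    import sqlite3

    conn = sqlite3.connect("Users.db")
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")

    # The insert and the new id come back in one statement, so no lock is
    # needed around insert + last_insert_rowid().
    with conn:
        (new_id,) = conn.execute(
            "INSERT INTO users (name) VALUES (?) RETURNING id", ("alice",)
        ).fetchone()
    print(new_id)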


Testing with JMeter, I got up to 10,000 concurrent writes on SQLite with WAL mode enabled.


Is this a situation where multiple web servers in a farm are accessing the SQLite databases on a file server?


This is a situation where a single process owns all of its SQLite databases and handles everything out of 1 machine.


Is the process multithreaded? I guess WAL takes care of that.


How do you join?



I believe SQLite is about to explode in usage into areas it’s not been used before.

SQL.js[0] and the incredible “Absurd SQL”[1] are making it possible to build PWAs and hybrid mobile apps with a local SQL db. Absurd SQL uses IndexedDB as a block store fs for SQLite so you don’t have to load the whole db into memory and get atomic writes.

Also I recently discovered the Session Extension[2] which would potentially enable offline distributed updates with eventual consistency!

I can imagine building a SaaS app where each customer has a “workspace”, each being a single SQLite db, and a hybrid/PWA app which either uses a local copy of the SQLite db synced with the session extension, or uses a serverless backend (like Cloudflare Workers) where a lightweight function performs the db operations. I haven’t yet found a nice way to run SQLite on Cloudflare Workers - it needs some sort of block storage - but it can’t be far off.

0: https://sql.js.org/

1: https://github.com/jlongster/absurd-sql

2: https://www.sqlite.org/sessionintro.html


I evaluated SQLite for a web extension but ultimately decided it wasn't worth it. There is no easy way to save the data directly to the file system. And saving the data in other ways meant I was probably better off with IndexedDB instead. Still, it is a tempting option and one that seems to work well for separate tenancy.


> There is no easy way to save the data directly to the file system.

That's what absurd SQL is for (link in the parent comment).


I read that one and agree it feels absurd. Not something I want to depend on.


I believe the absurd things are:
- the idea (i.e. storing a database in a database)
- the fact that the resulting storage is faster than vanilla IndexedDB

The repo's readme is a bit misleading and it does make it look like a mad scientist experiment but you can read more about it here (really interesting stuff): https://jlongster.com/future-sql-web

Obviously this project is still young, but I find it convincing (disclaimer: I don't use it yet).


I’m 100% sure this is (very nearly) suitable for production. It works and works well.

It is only a matter of time before browsers implement the proposed sandboxed virtual file system apis with block level access and this would be superseded.


Do you know of any OSS projects using Absurd? Or maybe an ORM built on top of it? It doesn't seem like there's any ecosystem built around it yet, and their example project is too trivial to be useful. I'd have to build even the most basic CRUD functionality from scratch.


It’s really not that much trouble. Most of the ORM-like things over IndexedDB introduce extra bugs and slowness for you to deal with anyway - SQLite is millions of times easier to use than IndexedDB, so you’re getting a good deal out of considering AbsurdSQL “just” an ORM for IndexedDB… as long as you don’t need to support Safari.


Yeah but you don't need to go through a web worker to use any other database product. Seems like with absurd you'd have to create a separate worker for each query and write your CRUD stuff in a potentially clunky, asynchronous style by passing messages to/from those workers. Sounds painful to me.


When they tried to put SQLite into browsers I was excited but I had no project that needed it. It took years to finally get there, and wouldn’t you know it, they had just marked it as deprecated. KV stores are just not the same.


You can run it on AWS Lambda and store the SQLite file(s) on EBS [1].

[1] https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aw...


That's EFS, not EBS. EFS is file-level storage (i.e., NFS); EBS is block-level storage (SAN). AWS Lambda only supports EFS.


Correct, I misspelled it. The link points to the correct service.


I wouldn’t use a SQLite file over NFS. It has caused me locking problems when used by multiple clients.


NFS does have a relatively low limit in terms of lock. For some use cases it can handle it fine.

Having clients running on read-only mode, when possible, can help.

Also partitioning the database can help, but then it's too much trouble already.


> I can imagine building a SAAS app where each customer has a “workspace” each as a single SQLite db

I did just that at my (now defunct) startup a few years ago. We were building a collaborative database query tool. The software we built used sqlite to keep a local db for storing user credentials, caching query results, etc. Bonus, we were able to have the local database file be encrypted protecting the data at rest.


> we were able to have the local database file be encrypted protecting the data at rest

How do you store the password locally? Or do you request the passwords from a remote server?


The golang equivalent of native SQLite can be found here - https://pkg.go.dev/modernc.org/sqlite

Very useful for pure go applications!


Wouldn't that be leaving one huge advantage of SQLite behind (namely its insane test regimen and battle-tested history)?


It’s cross compiled, so all the tests are brought along and test fine. And god, there are so many tests.


How would we deal with conflicts (e.g. syncing back several conflicting offline clients for the same user) with something based on the Session Extension?


You have to handle the merge conflicts yourself, see https://www.sqlite.org/session/sqlite3changeset_apply.html

So you need to be careful how you design your schema, but it's very possible.

One option is to use something like Yjs and a JSON column to get proper CRDTs for merging.

https://github.com/yjs/yjs


How does one deal with fault tolerance?


Using the transaction API for modifications, and ordinary fault tolerance measures for the hosting application. It is no more difficult than any other app-foundation technical feature.


I've always thought it interesting that there was a time when large(ish) websites were hosted using servers that would struggle to outperform a modern smart toaster or wristwatch, and yet modern web applications tend to demand a dramatic distributed architecture. I like the examples in this article showing what a single modern server can do when you're not scaling to Google's level.

As an aside, what about distributed derivatives of sqlite, like rqlite, as a response to the criticism that sqlite requires your database server to also be your web server. Could something like rqlite also provide a way for an sqlite database to grow into a distributed cluster at a later point?

https://github.com/rqlite/rqlite


> and yet modern web applications tend to demand a dramatic distributed architecture

The dirty secret is: most of them don't really need that architecture. Most, and I mean 95%+, of websites would run just fine on a single box running the web server, db, and whatever script interpreter the backend runs on.

Sure, it may be a bit slower if it's not hooked up to a global CDN, and if the box goes down the user gets a 404. But it's also a lot easier for an admin to simply give the one box a good wallop and put it back online, since it's so simple.

https://idlewords.com/talks/website_obesity.htm#heavyclouds


I don't disagree with you, a single server can go a really, really long way in scale before you run into problems. I know because I've done it a few times.

The problem to me isn't ability to scale on one server, it's the single point of failure. My biggest site is a wordpress box with one instance on a pretty big VPS. In the last year I've had several outages big enough to require a post-mortem (not complete outages, but periods with high levels of error rates/failures), and every time it has been because of internal data center networking issues at my preferred cloud provider (and thankfully their customer service is amazing and they will tell me honestly what the problem was instead of leaving me to wonder and guess). So the main incentive for me to achieve horizontal scalability in that app is not scaling, it's high availability so I can survive temporary outages because of hardware or networking, and other stuff outside of my control.


I don't mean to be too glib, but to me this reads like: "I am concerned about SPOF so I outsourced that."

Except now you are not in control of plugging the cable back in, and pay (less?) for that convenience.


I'm not sure I understand your point?

Even if I ran it on my own hardware, where I was responsible for plugging the cable in, it would still have the same single point of failure problem in the event that my ISP started suffering Network issues. I've never had an ISP that didn't have some sort of issues at some point.

Alternatively, if you think I am blaming my cloud provider for being the single point of failure, please read again. I specifically mentioned that I want to fix up the application so that it is highly available. I entirely and completely blame myself for the current situation. I didn't build it (otherwise it would be 12 factor[1] from the start and horizontally scalable), but I maintain it now.

[1]: https://www.youtube.com/watch?v=REbM4BDeua0


Wouldn't a snapshot mechanism plus load-balancer handle this case and still keep the architecture simple?


The biggest problem is that the snapshot mechanism would need to traverse multiple data centers. The current arch uses the disk for storage in several ways which aren't easy to fix, and because of the way several of the plugins were built (with lots of the nastiest custom code I've ever seen) each time I start experimenting I run into weird breakage and debugging hell. Enough so that I decided it's not worth it for HA at this point. Will be rebuilding much of it in Elixir/Phoenix though, and that will be 12 factor/HA[1] from the start.

[1]: Me explaining 12 factor for anyone not familiar already (and I've actually had people who were well experienced with 12 factor tell me they got a lot out of this video): https://www.youtube.com/watch?v=REbM4BDeua0


> and thankfully their customer service is amazing and they will tell me honestly what the problem was instead of leaving me to wonder and guess

Which cloud provider is this, if you don’t mind sharing?


Sure! I wasn't going to say because I didn't want to people to judge them for the network issues, but it is Linode. I've become a huge fan of them and use them for almost everything now. The Newark data center has been struggling a bit with networking issues but they are working hard on the problem.

I really can't say enough good things about the Linode customer service. They respond timely and with honest info, which is invaluable to me as a head of engineering trying to make important platform and infrastructure decisions. After a terrible experience with Digital Ocean, before moving over to Linode (so I still had a pretty small account, less than $10/month) I actually opened some mostly stupid support tickets just to see how they would respond. They were professional, courteous, and helpful. For the record I have no affiliation with Linode beyond being a happy customer.


Consider, though, that in the days past (when your server would be probably on an equivalent of dual-socket Pentium 166-MMX), most clients would be coming from slow links like 33.6-56 kbps dialup, and it wouldn't be a problem to serve them at all. Links were slow, users were patient, timeouts were high, webpages were sort of slim. Although if you ask me, they always have been heavy, just within the constraints of their time.

Then, of course, there was ISDN and xDSL, which would give you a true-to-god whopping 128 kbit/s for a while. 64 kbps if you were cheap. It took a while to get to affordable multiples of Mbits per second.

Now that there's at least a 10 Mbps uplink from each residential subscriber, it doesn't take long to DoS even a beefy server.

And I'd say that server-side, things improved vastly with advent of FastCGI and its equivalents. Back in that heyday of your P166-MMX server, it was CGI with Perl, spawning a process for each incoming request, or "blazing-fast" Apache's server-side includes, or other things like that. Maybe mod_perl with its caveats on memory sharing.

Anyway, you're right in that whenever you show them a wider pipe, they will find more stuff to congest it with.


Not even a decade ago we were hosting our web-facing services and our postgres DB on the same server. A lot of smaller projects had their own db and credentials on the same instance. The idea of having to spin up independent DB servers for each one seemed wasteful to us, since back then each box would cost at least £10/mo or more on Linode or Rackspace and it meant more secrets to keep, more stuff to secure, more stuff to maintain.

It was only with the advent of heroku and the sudden shift to the cloud (AWS) that the DB stopped being software you ran on your box as a daemon, and became essentially a dedicated resource. Even for services running at a completely trivial scale.


It's mostly because there is a demand for HA which requires multiple replicas. The moment you start going down that route you increase complexity.

Whether most things actually require HA is debatable, but a lot of businesses make it a core requirement and so it gets baked into the architecture. Personally I feel like most stuff would be better suited to having fast fail-over and recovery early on, but my advice rarely gets taken. Instead you end up with complicated HA architectures that nobody totally understands, which then (inevitably still) fall over and take hours to recover.


HA is generally needed because users of a SaaS can crash the instance especially in products that offer customisation abilities.


Sure, but that becomes a requirement at that particular design junction. A lot of stuff is built with HA that isn't even close to that complex.


Edge networking should work for SaaS, especially if your tenants tend to be geographically clustered. Stand up stateless front end web servers in half a dozen regions, and have them talk back to a more central region, with caches and circuit breakers to control the amount of cross-region traffic. That geographical separation will also make diurnal traffic patterns more obvious because you need more servers later in the day in PST than in EST.

You don’t have to distribute everything to scale pretty high.


There are some important things that SQLite does not do.

It is not client/server; a process must be able to fopen() the database file. NFS and SMB are options that can convey access to remote systems, but performance will not likely be good.

Only a single process can write to the database at any time; it does not support concurrent writers.

The backup tools do not support point-in-time recovery to a specific past time.

If your application can live with these limitations, then it does have some wonderful features.


> NFS and SMB are options that can convey access to remote systems, but performance will not likely be good.

And the safety of your data depends on the quality of your network filesystem's locking implementation. It's not too difficult to design a locking method that works most of the time, but it's a lot harder to build something that guarantees mutual exclusion over an imperfect network.

On a single machine, file locking is generally reliable (mediated by the kernel). Multiple clients can access the same database simultaneously, with readers and writers taking turns, and all that happens is you pay the performance penalty of serialized access.

But if your client thinks it has a lock on a remotely-stored database file, but the server-side lock has actually expired and another client is writing to the database, your data is likely to get hosed.

https://www.sqlite.org/howtocorrupt.html


Databases on remote filesystems should be limited to SQLITE3_OPEN_READONLY access, agreed.


I'm experimenting with using SQLite to store users' history in fish shell, but the remote filesystems problem seems likely to be a showstopper. What can be done about it?


I'm just reading this, and learning a few new things:

https://www.sqlite.org/howtocorrupt.html

SQLite has an alternate lock mode with dotfiles that seems to prevent database corruption over NFS. It is important that all SQLite accesses from all connected processes use the same lock mode.

"2.3. Two processes using different locking protocols

"The default locking mechanism used by SQLite on unix platforms is POSIX advisory locking, but there are other options. By selecting an alternative sqlite3_vfs using the sqlite3_open_v2() interface, an application can make use of other locking protocols that might be more appropriate to certain filesystems. For example, dot-file locking might be select for use in an application that has to run on an NFS filesystem that does not support POSIX advisory locking.

"It is important that all connections to the same database file use the same locking protocol. If one application is using POSIX advisory locks and another application is using dot-file locking, then the two applications will not see each other's locks and will not be able to coordinate database access, possibly leading to database corruption."


The problems with dotfile locking are:

1. A crash risks leaving the lock file in place; it must be manually removed or else you hang forever.

2. The same home directory may be local and remote on different machines, meaning different locking protocols.

I am considering a mode where the database is loaded fully into memory, and writes are atomically saved via the classic adjacent-file-renamed mechanism. You risk losing data due to races, but it can't corrupt the database.
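
Roughly the shape I have in mind, sketched in Python rather than fish's actual C++ (Connection.serialize/deserialize need Python 3.11+; the fsync-then-rename dance is the part that matters):

    import os
    import sqlite3
    import tempfile

    def load(path: str) -> sqlite3.Connection:
        conn = sqlite3.connect(":memory:")
        if os.path.exists(path):
            with open(path, "rb") as f:
                conn.deserialize(f.read())      # load the whole database into memory
        return conn

    def save(conn: sqlite3.Connection, path: str) -> None:
        data = conn.serialize()                 # byte-for-byte image of the in-memory db
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)               # atomic rename: readers see old or new, never half
        except BaseException:
            os.unlink(tmp)
            raise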


Talk to a server on the remote "storage" where the history lives, rather than directly to the SQLite file? Queue up changes and make them serial in the "service".


None of those matter to me, but I'm being forced to switch from SQLite simply because of its lack of uint64 support.


Being forced as in you can't make things work without that support? If so, is parsing/processing blobs infeasible for some reason?


I want to use arithmetic functions like sum().

Sure, I could just sum them myself, but not only would that still leave me SOL with ad-hoc, interactive queries, but actually be more work than just firing up postgres.


In WAL mode, a write can mean writing some data in memory on the local machine. With Postgres/MySQL/etc, it has to go over the network.

I can't parse [0] right now about the overhead of TCP and round trips with the database server, but it's basically a question of whether one SQLite write is 10x, 100x, 1000x, or more faster than a database server write. That should make a lot of difference.

[0] https://gist.github.com/jboner/2841832


If you have PostgreSQL or MySQL on the same machine as your application, you can use UNIX sockets; they won't have much latency at all (I think Linux also optimizes TCP on localhost to skip all the pointless protocol overhead).

In terms of latency it'll still be difficult to beat a database that lives in the same process as your application, but it won't be as bad as going over the network might be.
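
For example, with psycopg2 you point host at the socket directory instead of a hostname (the directory below is a common Debian/Ubuntu default, so treat it as an assumption):

    import psycopg2

    # A host value starting with "/" is treated by libpq as a UNIX-domain socket
    # directory, so no TCP is involved at all.
    conn = psycopg2.connect(dbname="app", user="app", host="/var/run/postgresql")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())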


> If you have PostgreSQL or MySQL on the same machine as your application, you can use UNIX sockets; they won't have much latency at all (I think Linux also optimizes TCP on localhost to skip all the pointless protocol overhead).

Sounds like a nice topic for a study; I'd expect the latency using local sockets to connect to a DB server to be about 3x as much, minimum, as the latency to simply write to a file.

Writing to a file: a single context-switch from process to kernel (CS). Writing to a socket: CS on write to socket, CS when receiving process writes the data to file, CS when receiving process sends ACK to sender via a socket.

The worst thing you can do in your program is to cause a context switch by passing data to the kernel or receiving data from the kernel.

That being said, a trial of the two approaches is not a bad idea for an easily-publishable paper :-)


The context switch buys you safety from many cases of corruption due to uninitialized memory mistakes etc. though, if using an unmanaged language. In C-type code with sqlite embedded, you might write garbage over its internal datastructures.

A program error might still make you issue the wrong command or write corrupt data, but you shouldn't corrupt the internals with process isolation.


Well, for "local" TCP, it's pretty straightforward to see how the kernel can be smart and just copy the sent data into the receiving process' buffer directly and skip ACKs and other overhead if no-one's looking. Neither the sender nor the receiver actually care whether network traffic occurs as long as the semantics of the operation stay the same.


> Well, for "local" TCP, it's pretty straightforward to see how the kernel can be smart and just copy the sent data into the receiving process' buffer directly and skip ACKs and other overhead if no-one's looking.

I didn't mean the TCP ACK packet, I meant the application layer return saying "Thanks, I got your query".

It is never a good idea to blast some data off into a transmission channel without knowing if the receiver ever got it, hence the receiver always sends something back to say that they received it.

> Neither the sender nor the receiver actually care whether network traffic occurs as long as the semantics of the operation stay the same.

Whether you are sending over the network or over a local UNIX socket is irrelevant, the context switch is going take place for each `write()` and for each `read()`.


I don't know what backup tools you have in mind... But since a SQLite database is a single file (modulo write ahead journal and whatnot), making whatever you need is trivial.


> modulo write ahead journal and whatnot

That's like saying immortality is trivial modulo death and whatnot. If you don't integrate with sqlite's locking system, you can easily "backup" a file mid-write, corrupted and unusable.

That's why sqlite has a built-in backup command.


Litestream does streaming backups using a separate process: https://litestream.io/


I guess my secret is having almost no writes (the couple things I deployed with SQLite would see a dozen writes per month).


Be careful! If you copy a database file while it is being modified, you may end up with an inconsistent view of the database depending on the order in which different parts of the file are modified.

There are ways around this (filesystem snapshots and potentially file locking) but it's not trivial and failing to handle this correctly is a very good way to build a system which passes tests under low load (when the file-copying operation will win all the relevant races) but fail catastrophically under heavy load.


The sqlite3 utility can be used to create a transaction-consistent backup of a live database. This interface is the only way that a backup should be taken, either from the utility or the C API.

    .backup ?DB? FILE        Backup DB (default "main") to FILE
https://sqlite.org/cli.html
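
Most bindings expose the same online-backup API, so you don't have to shell out to the CLI; e.g. from Python's sqlite3 (file names made up):

    import sqlite3

    src = sqlite3.connect("app.db")
    dst = sqlite3.connect("app-backup.db")
    with dst:
        src.backup(dst)     # transaction-consistent copy of a live database
    dst.close()
    src.close()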


The state of the art is that you click a few buttons in the UI of the cloud provider of your choice. That takes a few seconds.

What you are proposing may be trivial, but in comparison it might just as well be rocket science. Doing more work to use a simpler database that is harder to set up and has fewer features is not exactly compelling.


In Oracle, I can do this:

    RECOVER DATABASE UNTIL TIME '2021-10-01 02:00:00' USING BACKUP CONTROLFILE;
SQLite does not implement such a feature.


There is a diff tool [0], and an online backup tool [1]. That kinda thing should be pretty easy to cobble together. You'll just need ~3x disk space to store 2 backups at an interval, diff them and discard the older backup, then apply backward diffs on your most recent backup to achieve a specific backup time.

EDIT: or maybe the Session Extension [2]

[0]: https://www.sqlite.org/sqldiff.html

[1]: https://www.sqlite.org/backup.html

[2]: https://www.sqlite.org/sessionintro.html


Litestream may be able to do that on top of SQLite: https://litestream.io

See the -timestamp option here: https://litestream.io/reference/restore/


> Only a single process can write to the database at any time; it does not support concurrent writers.

I think you can do concurrent writing now with Write-Ahead Logging (WAL): https://www.sqlite.org/wal.html

I've never tried it though, so I don't know how suitable it is for web apps that might potentially have multiple processes trying to write to the DB at the same time.


Nope. In the default rollback-journal mode, SQLite supports either multiple readers or a single writer at a time. WAL mode improves this to multiple readers running concurrently with at most one writer.

But transactions in SQLite are often fast enough that this could still be acceptable for a lot of purposes, especially on an SSD.


There is no concurrent writing with WAL mode


> It is not client/server; a process must be able to fopen() the database file.

The one time I actually wanted to do that, I wrote the server that `accept`ed incoming connections and used the single `fopen`ed SQLite DB.

It can be very flexible that way, TBH, but if you really need that type of thing a more traditional DB is better.


> Only a single process can write to the database at any time; it does not support concurrent writers.

Potential plus in cloud/container deployments.

> It is not client/server; a process must be able to fopen() the database file. NFS and SMB are options that can convey access to remote systems, but performance will not likely be good.

There are other possibilities given a bit of additional wrapping (over the db process) and federating nodes with some (consensus) protocol. It may actually be a good fit as the 'node-unit' for a very large scale distributed metadata store that would benefit from relational DB semantics instead of basic K/V.


Note -- allowing only a single write at any one time is a killer for most web apps, where for example within SaaS you have many users doing things at the same time.


Even in write-heavy environments, I've used SQLite effectively with multiple thousands of concurrent (as in logged in and changing things at the speed of human site interactions, not as in simultaneous web requests at the same instant) users. In this context, SQLite is effectively a write linearizer.

With appropriate pragmas and database handles stored persistently between requests in server processes, the rate (in bytes of data affected by INSERT/UPDATE/DELETE) at which users could make DB changes got pretty darn close to the rate at which a file could be sequentially written on the filesystem, which is more than enough aggregate throughput for 1000s of concurrent non-bulk changes in most web applications, I'd imagine.


Can you elaborate? I’m imagining that the DB web server process would have no problem opening a thread per user and multiple threads can write to a SQLite store. Or, optionally you could buffer multi-threaded requests into a single queue for DB I/O. I’m not seeing why the single process constraint is a major impediment, or maybe I guess I’m not sure I understand why multiple separate processes might be required. Am I misunderstanding your comment?


Most writes only take tens to hundreds of microseconds and you can perform those writes from different processes serially. As long as you set WAL mode & a busy timeout then you should be fine:

PRAGMA journal_mode = WAL;

PRAGMA busy_timeout = 30000;
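
In application code that amounts to each process opening its own connection and letting SQLite arbitrate; a sketch (the table name and the 30 second value are arbitrary):

    import sqlite3

    def open_conn(path: str = "app.db") -> sqlite3.Connection:
        conn = sqlite3.connect(path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA busy_timeout=30000")   # wait up to 30s instead of failing with SQLITE_BUSY
        return conn

    # Any number of processes can run this; writes are applied one at a time.
    conn = open_conn()
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        conn.execute("INSERT INTO events (payload) VALUES (?)", ("hello",))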


It's not really an issue if you have 1 db per customer


If a customer has 1000 employees all using your app, it is.


Not always. In fact, I'd strengthen that and say that this is not a problem in general/in most cases with thousands of concurrent users. See my other comment one level up.


If the app is designed correctly, then the thousand employees would write to their own temporary databases, and a background job would pull their changes into the main database sequentially.

If the app is not specifically designed to do this, then SQLite would not be an option.


Serious question, is this just a "how do I get SQLite to work in this scenario?" thing, or is there actually some other benefit to having this sort of data architecture?


This can actually relate to SMTP servers using mbox or maildir formats. Maildir is more resistant to corruption, and doesn't rely (as much) on file locks.

https://serverfault.com/questions/564456/what-are-the-practi...


What's the benefit over just using the main database?


Does an event sourcing architecture help with this? I'm thinking of a bunch of clients piping events into a queue, with a single writer working through that queue. Might be stretching the eventual part of eventual consistency if the latency gets too bad, but I don't think SQLite is top of mind for most SaaS solutions anyway, so this is more of a thought exercise.


Sure, though that's if they're all doing CRUD at the same time. I maintain our company blog on a self-hosted install of Ghost backed by sqlite, and it's a been great (since the only inserting or updating is likely to be one person editing a post, and the frontend is mostly cached).


I don't doubt the power of SQLite, but it's difficult to see why it's worth using over Postgres anyway.

This is what it takes to run a basic postgres database on my own PC (in a docker compose file):

  postgres: 
     image: postgres:12.7
     container_name: postgres
     environment:
       - PGDATA=/var/lib/postgresql/data/pgdata
       - POSTGRES_PASSWORD=<pw>
     volumes:
       - ./volumes/postgres/:/var/lib/postgresql/data/

For someone who's completely allergic to SSH and linux, a managed Postgres service will take care of all that too.

SQLite seems simple in that it's "just a file". But it's not. You can't pretend a backup is just copying the file while the DB is operating and expect it to be consistent. You can't put the file on NFS with multiple writers and expect it to work. You can't use complex datatypes or have the database catch simple type errors for you. It's "simple" in precisely the wrong way - it looks simple, but actually using it well is not simple. It doesn't truly reduce operational burden, it only hides it until you find that it matters.

Similarly postgres is not automatically complex simply because it _can_ scale. It really is a good technology that can be simple at small scale yet complex if you need it.


I'm not really sure why this post has downvotes. docker-compose dramatically lowers the barrier for setting up a single machine with multiple services (your service, db, etc). For a similar experience you can use AWS RDS or equivalent. Performance will be better and worse in various situations, but if your software still fits in one machine you're largely going to be "ok." Backups, restore, monitoring, etc. are all important for running software, and that's something an SQLite file doesn't really offer the best solutions for. It works great for some things (I've used it many many times) but it's not perfect for everything.


I call this trend "tech hipster"-ism. Part of the motivation is just to do something different just for the sake of being different. Maybe part of it is a perception that Postgres or Linux are oh-so-scary and difficult things or that using the same technology that Amazon uses makes you evil. ¯\_(ツ)_/¯.


It's unfortunate you are getting downvoted. Nobody in this thread has given clear solutions for addressing issues like fault-tolerance, failover, backup and all the managed goodness you can get from choosing Postgres.


I believe they addressed backups in the article with Litestream.


I use SQLite exclusively on a high performance crypto sniper project - https://bsctrader.app and I could not be happier with it.

Performs much better than Postgres in terms of query latency, which is ultra important for the domain we operate in.

I take machine-level backups every 2 hours, so in the event of an outage, I just boot the disk image on a new VM and off it goes.

I would never do this on my professional job due to the stigma, but for this side project, it has been incredible


> I would never do this on my professional job due to the stigma, but for this side project, it has been incredible

I love controversy so I was able to push SQLite through as the sole persistence mechanism for our product. Virtually every constraint lined up with our business perfectly.

We sell a B2B product that gets installed by our customers, so they love it too. No more sweaty-hands conversations about why their database server can't talk to their application server over their network while we are caught in the middle trying to avoid taking ownership of their internal IT problems.


I think that's one of the inspirations behind sqlite.

> They’d double click on the thing and a dialog box would pop that says, “Can’t connect to database server,” and it wasn’t our fault. We didn’t have any control over the database server, but what do you do if you can’t connect to the server, so we got the blame all the same because we were painting the dialog box.

https://corecursive.com/066-sqlite-with-richard-hipp/


> I would never do this on my professional job due to the stigma, but for this side project, it has been incredible

Using the right tool for the job does indeed carry a lot of stigma in many professional environments. Instead you use the tool that some VP has been sold by some salesman.


The stigma of using the most popular database in existence?


For the domain of webapps, where multiple concurrent writers are often expected, yes, I would say it's a stigma.


Are they expected, or are they required?

Because serializing db access through a single process only becomes a problem when the number of reads/writes gets so large that the process becomes a bottleneck.

And judging by the test the author of the linked article did, that would have to be a HUGE number.


That's only a theoretical limitation. 99% of all your typical insert / update / delete operations finish in the single digits of milliseconds, making the serial nature of SQLite writes a problem when you get north of 5000+ requests per second.


> machine level backups

I presume that's block level backups? Or some snapshotting?

As far as I know, block-level filesystem copies can get inconsistent (which is why we have journalling file systems). But assuming it works well, like a filesystem-aware snapshot, can SQLite deal with files snapshotted in the middle of an operation?


That's a really interesting question; I have not even considered that. I'm just using my provider's automated backup system. Never had to restore yet.


They say a backup becomes a backup after it's successfully restored :)


Very interesting. Can you speak more about your architecture?


Am I the only one who thinks SQLite is still too complicated for many programs? Maybe it's just the particular type of software I normally work on, which tends towards small, self-hosted networking services[0] that would often have a single user, or maybe federated with <100 users. These programs need a small amount of state for things like tokens, users accounts, and maybe a bit of domain-specific things. This can all live in memory, but needs to be persisted to disk on writes. I've reached for SQLite several times, and always come back to just keeping a struct of hashmaps[1] in memory and dumping JSON to disk. It's worked great for my needs.

Now obviously if I wanted to scale up, at some point you would have too many users to fit in memory. But do programs at that scale actually need to exist? Why can't everyone be on a federated server with state that fits in memory/JSON? I guess that's more of a philosophical question about big tech. But I think it's interesting that most of our tech stack choices are driven by projects designed to work at a scale most of us will never need, and maybe nobody needs.

As an aside, is there something like SQLite but closer to my use cases? So I guess like the nosql version of SQLite.

[0]: https://boringproxy.io/

[1]: https://github.com/boringproxy/boringproxy/blob/master/datab...


SQLite hides a ton of complexity that lives in the filesystem. It’s incredibly hard to do robust IO correctly with the APIs we have.

I almost always choose SQLite for persisting to disk over JSON files. It essentially removes a large class of bugs and is robust enough that I’m not worried about introducing new problems.


In addition to hiding a ton of complexity, SQLite is actually faster than the filesystem in a surprising range of cases.


I never questioned whether it was faster or had more features. My whole point is maybe a lot of programs don't need what it offers.


Most people don't need everything SQLite offers, but almost all programs need some of its features. And even if you literally do nothing more than store rows of text, the API SQLite offers is still more convenient than most filesystem APIs once you go beyond the bare minimum.

That is to say that there are remarkably few usecases where SQLite isn't better than plain file access.


As I said before, my use cases, while producing useful software, have yet to require more than is offered by a filesystem.


Software that isn’t robust, while still useful, can ruin some days.


Do you have an example of a problem I'm likely to encounter at the scale I described in my comment?


If you are doing IO in a non-atomic manner (seems probable), it’s a matter of time before something happens and a file gets messed up. The frequency of that occurring is certainly a function of scale, but I would argue it’s always a bad idea to deal with your persistent state non-atomically.

Additionally, having a schema (even if it’s trivial) protects you from other software problems. It’s much easier to reason about upgrades. And more importantly, if your application complexity suddenly breaches the data complexity threshold where JSON files become a lot more difficult, you already have all the tools you could need at your disposal with SQLite.


> If you are doing IO in a non-atomic manner (seems probable)

No. I typically use a single mutex for the entire database, and as I said haven't had any performance issues. I'm confident for many classes of software this can scale to at least 100 users (maybe with some slight performance tweaks), which is sufficient for many federated services.


A mutex doesn’t give you atomic or durable IO. I’m referring to proper use of fsync and friends.
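
For what it's worth, a minimal sketch of the write-temp-then-rename pattern being referred to, in Python (file names are illustrative; this assumes a POSIX filesystem):

    import json, os

    def save_atomically(state, path="db.json"):
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # push the bytes to disk, not just the OS cache
        os.replace(tmp, path)          # atomic rename: readers see old or new, never half
        # fsync the directory so the rename itself survives a crash
        dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)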


SQLite hides a ton of complexity that lives in the filesystem

Since they are using Go, couldn't you say the same thing about the Golang std library? As long as they know how to use a local file as a database (do the swap, flush, etc...) I don't see the problem.


What makes you think SQLite uses go? (It's written in C[0])

[0]: https://www.sqlite.org/index.html


Was saying that Go's std lib also "hides a ton of complexity that lives in the filesystem" for you.


No because it doesn’t give you IO abstractions as robust as SQLite.


Your solution works to a point, but it will not be as robust as SQLite. Its ACID powers are incredibly valuable in the real world where things fail or are unreliable. Also the ability to do complicated queries comes in handy sooner than you would think.

Of course you have to then mess around with SQL but you only have to write it once and encapsulate it somewhere and you're done.


I never questioned whether it was faster or had more features. My whole point is maybe a lot of programs don't need what it offers.


> This can all live in memory, but needs to be persisted to disk on writes. I've reached for SQLite several times, and always come back to just keeping a struct of hashmaps[1] in memory and dumping JSON to disk. It's worked great for my needs.

I do a very similar thing for many of my at-home projects. They generally have a single user, me, and it works great.


Look up "Prevayler" and the "system prevalence" pattern. It introduces occasional data journaling, and so it exists somewhere in between "serialize and persist the whole thing at once" and a full blown DBMS. But I've experimented with system prevalence in the past and decided that just "serializing and persisting the whole thing at once" like you've mentioned was simply a better solution.


For a recent project I picked a similar approach, but a directory full of JSON files, where the filename is the key and the value is the contents of the file. This gets me two improvements over the single massive object persisted to disk in a single file: ACIDish for free from the filesystem, and a large reduction in bytes (potentially) written to disk since you don't have to rewrite a large file every time a single key changes, and this service potentially has frequent writes to a subset of keys.

This latter point matters a lot to me since the disk in question is a Raspberry Pi's SD card, which I've heard have a tendency to wear out quickly, so I wanted to avoid writing to it unnecessarily. And yeah, this system is fragile, not actually ACID, etc etc… but I'm the only one who relies on it (it's not even accessible on the public internet, only over my Wireguard VPN), so it doesn't matter if it breaks!

I will probably rewrite this service later, backed by either SQLite or Postgres, but writing it this way saved me a fair bit of thinking and lines of code, which was great for getting it off the ground and usable to me in like two hours.


Most relational databases offer ACID, for one thing.

And also a query language, which is easy to write and modify queries for, presenting a great benefit over writing large amounts of boilerplate code for looping over records and accessing the right indices every time (keep abstracting that and you'll end up with your own relational database system).

I often import data into SQLite just to work with it, without necessarily even an application in between. Depending on the nature of the data it's either that or Matlab.


For Vanilla Forums we have a file-based configuration interface with a bunch of utility methods for writing/reading to it. This is the default and what you really want in a single environment, either hosted or in development.

In our cloud infrastructure though we have 10k+ distributed sites so the interface reads/writes to an API backed by a database.

A major benefit of the config file is that you can just open it in a text editor and muck around during development.


There is a key difference. With the JSON version you are constantly serialising and parsing the entire file, even if you need to write just a single byte.

With SQLite, you can read or write a few bytes without having to process the others.

Also, SQLite would allow observing the current state by querying the file, with the JSON version you would need to keep dumping the state every few seconds.

But I agree with you that most languages have solid support for JSON and it prevents you having to break down those tree structures into rows and then back again. For typed languages the struct becomes the schema.

SQLite indexes and joins also would not be a benefit if your state is small enough that queries are just filter functions with “full table scans”.


> There is a key difference. With the JSON version you are constantly serialising and parsing the entire file

Only serializing. The db is kept in memory, and only deserialized on startup.

But why is this a problem at the scale I'm working with?


It's no problem, just a different way of doing the same thing really.


You seem to be describing leveldb: https://github.com/google/leveldb


BerkeleyDB maybe? DBM files used to be pretty commonly used on Unix boxes. Sendmail, for example, uses DBM files.

Or LMDB, LevelDB, RocksDB, etc, if performance is important.


If you care about data normalization and data integrity then SQLite is going to be a much better choice.


At the scale I described in my comment, I do not care about normalization. Can you give an example of where I'm likely to lose data integrity?


Would Redis be a valid solution for your case?


This adds another service you now need to deploy and maintain. For small personal projects, I definitely would like to avoid that if I can!


I agree.


I have tried adopting SQLite in my side projects. The problem I encountered is that using managed PostgreSQL/MySQL is still more convenient and more reliable than using SQLite on a bare metal VPS. I like to use Heroku or the Digital Ocean App platform because I want to spend time creating and not managing the infrastructure (ci/cd, ssl certs, reverse proxy, db backup, scaling, container management and what not). I tried looking for a managed SQLite but could not find one. On an unrelated note, I found Redis to be a good lightweight alternative to the classical psql/MySQL. Although still multi-tier and more difficult to model data in, it's initially cheaper and easier to manage than its relational counterparts. Has anyone had a similar setup/preference?


What would a managed sqlite even look like? I can't tell if this is a real response or not...


When hankchinaski says 'managed' I think they really mean that there's some capital-A App dashboard somewhere, on Digital Ocean or wherever, and they log in and click 'new database' and that's it. No ssh-ing to a VPS, and choosing the file location where the sqlite file will sit, figuring out backups and so on. But as you say, while you can wrap postgres or redis in that sort of 'just take care of it for me' approach, given the simplicity of sqlite it doesn't fit that paradigm, so perhaps hankchinaski is just misunderstanding what sqlite fundamentally is and how it works.


Exactly. I was expressing the limitation of that single-tier paradigm viewed from the angle of someone who doesn't want to ssh into machines to configure and deploy code... :)


I clearly did not express myself correctly. Since all these managed solutions (Heroku, DO App Platform) run on an ephemeral fs, it is not possible to use the disk as long-term storage the way SQLite requires. So I'm basically forced to manage an attached disk for my container, which needs to be backed up, monitored and so on.


Maybe an NFS mount or something that handles back ups automatically? Scripting to handle an automatic restore of the database?

Maybe a heroku that knows about your database file and automatically loads the latest version for you?

I kind of feel like GP is a troll comment, as there's no real value add for a managed SQLite.


Or just use litestream, it's perfect for this and the closest you can get to managed, by replicating to S3 or Cloud Storage or the like.


Beware that Litestream on a PaaS has safety concerns unless you can get your PaaS platform to guarantee that the active app instance is terminated before a new instance is booted. Litestream doesn't turn SQLite into a multi-master distributed system.

If two copies of the DB accept writes at the same time, Litestream will just send backups to two different backup generations, and when you restore you'll only pull down one of those generations and won't receive the writes that landed in the other generations. All of your writes will still be in S3, they'll just be peppered across distinct backup snapshots and you can't get all the data back unless you separately restore all relevant generations and manually merge the restored databases.


In the past I had a website with hundreds of gigs of data that needed updating regularly, but could be read-only from the web server perspective.

I used sqlite for that, and had a mysql server for the user data and stuff that needed to be written to. Performance was fantastic, users were happy, data updates were instantaneous; copy the new data to the server then repoint a symlink.

Most of my work is modeling and simulation. Sqlite is almost always my output format; one case per database is really natural and convenient, both for analysis and run management.

Anyway. Sqlite is amazing.


I'm exactly at a point where I'm considering SQLite for its single file db advantage, but I'm struggling to find solutions for my use case.

I need to import some 30k JSONs of external monitor data from Lunar (https://lunar.fyi) into a normalized form so that everyone can query it.

I'd love to get this into a single SQLite file that can be served and cached through CDN and local browser cache.

But is there something akin to Metabase that could be used to query the db file after it was downloaded?

I know I could have a Metabase server that could query the SQLite DB on my server, but I'd like the db and the queries to run locally for faster iteration and less load on my server.

Besides, I'm reluctant to run a public Metabase instance given the log4j vulnerabilities that keep coming.


Check out datasette: https://datasette.io/


You could do incremental updates of the local databases using the session extension:

https://sqlite.org/sessionintro.html


I've had great success using SQLite as both a desktop application file format and web server database. I'll mention just one thing I like about it in the desktop application realm: undo/redo is implemented entirely within SQLite using in-memory tables and triggers following this as a starting point: https://www.sqlite.org/undoredo.html

It's not perfect, but it fills the niche nicely.
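
Roughly, the pattern on that page boils down to having triggers log inverse SQL into an in-memory table. A simplified sketch in Python (table and column names are made up; the real recipe also handles redo and undo "barriers"):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS notes(id INTEGER PRIMARY KEY, body TEXT);
        CREATE TEMP TABLE undolog(seq INTEGER PRIMARY KEY, sql TEXT);
        CREATE TEMP TRIGGER notes_it AFTER INSERT ON notes BEGIN
          INSERT INTO undolog(sql)
          VALUES('DELETE FROM notes WHERE id=' || new.id);
        END;
        CREATE TEMP TRIGGER notes_ut AFTER UPDATE ON notes BEGIN
          INSERT INTO undolog(sql)
          VALUES('UPDATE notes SET body=' || quote(old.body) ||
                 ' WHERE id=' || old.id);
        END;
    """)

    def undo_last():
        row = conn.execute(
            "SELECT seq, sql FROM undolog ORDER BY seq DESC LIMIT 1").fetchone()
        if row:
            conn.execute(row[1])                               # replay the inverse statement
            conn.execute("DELETE FROM undolog WHERE seq=?", (row[0],))
            conn.commit()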


I love to see that more projects are using SQLite as their main database.

One thing that I always wondered though: does anyone know a big project/service that uses Golang and is backed by SQLite? This is because SQLite would require CGO, and CGO generally adds extra complexity and performance costs. I wonder how big Golang applications fare with this.


Not a "big project/service" but a Go project that uses Sqlite is one of my own, Timeliner[1] and its successor, Timelinize[2] (still in development). Yeah the cgo dependency kinda sucks but you don't feel it in code, just compilation. And it easily manages Timeline databases of a million and more entries just fine.

[1]: https://github.com/mholt/timeliner

[2]: https://twitter.com/timelinize


Interesting project! It seems to be perfect for SQLite, considering it seems to be mostly reads instead of writes. I wonder if write-heavy applications are a bit of a problem in Golang because of the goroutines vs. C threads model (which I believe SQLite might use?).


arp242 has an excellent post[0] about statically compiling sqlite for Go programs, which may be useful.

Isn't there some issue where SQLite basically has to be single-threaded in Golang programs, at least if you use the stdlib SQL library?

[0]: https://www.arp242.net/static-go.html


> Isn't there some issue where SQLite basically has to be single-threaded in Golang programs, at least if you use the stdlib SQL library?

I guess not exactly, but iirc there were some caveats/performance issues around C threads x Go routines. A bit is touched in this post: https://www.cockroachlabs.com/blog/the-cost-and-complexity-o...

But well, it seems arp242 is GoatCounter's developer. Based on the post content, it is powered by SQLite and this is probably a good write-heavy application example. I wonder if there are any blog posts about the SQLite performance for this service.


https://github.com/gravitational/teleport/ has the option to use it, but it only uses it as a key value store.

CGO isn't too big a problem, and if it is a real dealbreaker, something like https://pkg.go.dev/modernc.org/sqlite will work, as it transpiles the C into Go and passes the sqlite test suite. I think there is some performance degradation with writes but reads are still pretty quick.


You can use pure Golang SQLite (without requiring CGO) -

https://pkg.go.dev/modernc.org/sqlite

It works well, but the performance is worse than C version. Not a big deal for what I used it for, though. It was approx. 6x worse at inserts.


Using https://github.com/mattn/go-sqlite3 for a pet project right now. So far not a single issue with it, I never had to think about any CGO specifics.


As the author of a library that got rewritten by another team in Go with a similar rationale, I must say that the Go ecosystem's unwillingness (or inability) to interface seamlessly with other languages seems like its greatest drawback.


More info about WAL mode concurrency [0]

No reader-writer lock. Still only one concurrent writer, but writes via appends to the WAL file are cheaper. You can adjust read vs. write performance by syncing the WAL file more or less often, and you can also increase performance (with lower durability) by not syncing the WAL file to disk as often.

https://www.sqlite.org/wal.html
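
For anyone wanting to see those knobs spelled out, a hedged example using Python's stdlib sqlite3 (values are just common suggestions, not gospel):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode=WAL")         # readers no longer block the single writer
    conn.execute("PRAGMA synchronous=NORMAL")       # fsync the WAL only at checkpoints:
                                                    # faster writes, slightly weaker durability
    conn.execute("PRAGMA wal_autocheckpoint=1000")  # checkpoint roughly every 1000 WAL pages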


SQLite is great, but it's not the simple drop-in replacement for DB servers that HN often suggests it is.

My team at work has adopted it and generally likes it, but the biggest hurdle we've found is that it's not easy to inspect or fix data in production the way we would with postgres.


>it's not easy to inspect or fix data in production the way we would with postgres.

I assume because you're using a remote socket connection from the client?

I haven't tried it in a serious setting yet, but I did play around with dqlite and was impressed. Canonical uses it as the backing data store for lxd. Basically sqlite with raft clustering and the ability for clients to connect remotely via a wire protocol. https://dqlite.io/


> I assume because you're using a remote socket connection from the client?

Yeah, it's common for all developers to connect to and query against prod postgres DBs via DataGrip or similar.

dqlite definitely looks interesting, but I worry it's a bit heavy given that our only use case for remote access is prod troubleshooting. I think I saw something recently where you could spin up a server on top of a sqlite file temporarily - that might be ideal for us.


Can't you run your queries in a copy of the data (eg.: a backup)? I think that'd be advisable even if you were running postgresql.


> Can't you run your queries in a copy of the data (eg.: a backup)?

That's essentially what we do: copy the file locally if we need to inspect it. It's slightly more cumbersome though.

> I think that'd be advisable even if you were running postgresql.

Connecting to live prod servers is definitely not a 10/10 on the "best practices" scale, but it works well for our business (trading), where there are small developer teams that also operate, no PII in the database, and critical realtime functionality isn't directly involved with the database anyway.


With postgres you could set up a read-only replica or something. At least for me, the small effort is well worth not having to worry about accidentally deleting production data.

I feel like having an easy mechanism to clone the production database somewhere you can play with is well worth the effort. You can even use those clones to run backtests and other integration/regression tests against, which is also a very nice to have.


> With postgres you could set up a read-only replica or something. At least for me, the small effort is well worth not having to worry about accidentally deleting production data.

We do do this, and probably should be more disciplined about connecting to it when only reading. Of course that doesn't help if we need to run an update in production, but that isn't that often.

> I feel like having an easy mechanism to clone the production database somewhere you can play with is well worth the effort. You can even use those clones to run backtests and other integration/regression tests against, which is also a very nice to have.

We do actually do this as well (nightly), and it is a huge productivity boost for testing and development. I would recommend to anyone that writes software dealing with persisted data to invest in an easy mechanism to clone from production.


One can create read-only users in Postgres.


I'd say the team is using it wrong. SQLite is really intended for embedded use, not a Postgres replacement. The two shouldn't even be mentioned in the same sentence. SQLite is weakly typed, performing autoconversion from ints to strings.

The value in SQLite is its light weight, not its SQL side. If you're building a mobile app and you're loading a lot of local data, it might be the right choice.


You may have misunderstood, we're not using it as a postgres replacement. I agree with this take, hence my original assertion that it isn't a drop in replacement for a DB server.

We are using it as a replacement for RocksDB - we need a richer way to store data than a simple key value store. It still runs on a server though, and therefore it would be useful to be able to read data remotely, even if that isn't the primary purpose.


My mistake then. I read it as "we tried it as a Postgres replacement, even though many here suggest it was going to work".

I've toyed with SQLite as a replacement for a client-server database for personal projects. While I stand by my overall dim assessment of SQLite, with a statically typed language and a diligently maintained data access layer (i.e. a one-man project), I would endorse its use on the server.


I believe you mean that you can't easily do a "psql ..." or connect using DataGrip and similar tools, right?

Does this mean that devs need to copy the production database file locally to then inspect it? Or are there tools to connect/bridge to a remote sqlite file?


I think the usual approach would be to SSH into the server and run sqlite3 there. This issue [0] mentions some workarounds for connecting from DBeaver, which I assume would work for other graphical client software. I haven't tried those approaches, they seem pretty hacky and I imagine performance isn't great, but I guess that's to be expected given that SQLite isn't designed for that type of access.

[0] https://github.com/dbeaver/dbeaver/issues/6876


> I believe you mean that you can't easily do a "psql ..." or connect using DataGrip and similar tools, right?

Yeah.

> Does this mean that devs need to copy the production database file locally to then inspect it? Or are there tools to connect/bridge to a remote sqlite file?

We use "kubectl copy" currently when we want to inspect it, and we haven't actually had to write back to a production file yet. We've explored the "remote" option, but since it's just a file, everything seems to boil back down to "copy locally" then "copy back to remote prod".

It's only a small part of our stack at the moment, so we haven't invested in tooling very much - but I'd be curious if others have solved similar problems.


Ah I see! Yeah, I expected your team would have to copy it locally. I wonder though, in times of data leaks and whatnot, couldn't it be dangerous to have lots of PII (personally identifiable information) copied around on many dev laptops?

I mean, devs can do the same with Postgres, but it is more for backups instead of purely querying.


That's true - our software does not deal with PII, which makes many of our DB practices sound strange to outsiders.


Did you try SSH and using SQLite command line tool?


That ought to work, but the experience we are trying to emulate is that of using DataGrip, rather than psql.


Nim forum has used SQLite as its db since 2012 and it fits the article's use case perfectly. The code is available and it can be used to run a Discourse-inspired forum (although much less featured).

https://github.com/nim-lang/nimforum


Shout out to litestream[0] for backups

[0] https://github.com/benbjohnson/litestream


My 2 cents on sqlite:

https://corecursive.com/066-sqlite-with-richard-hipp/

An interview with the creator, Richard Hipp, for a better and deeper understanding of the pitch they took, the industries they were in, and their approach to the database world they saw in front of them. See the obstacles and the solutions, and how that underestimated 'sqlite' came to power a good chunk of all the mobile actions triggered by your apps. But just read the interview; I cannot reproduce the drama here in my own words (underestimated).


I used it extensively for ETL processes, which is great. I still don't know how people use it for concurrent writes, like in a simple ToDo webapp?


With WAL, writes for something like a ToDo app finish in a small fraction of a millisecond so unless your todo webapp is writing to the DB at a rate exceeding 20k writes per second, the fact that writes are not concurrent becomes largely irrelevant.


This can't be right. As far as I can tell, WAL allows concurrent READS and WRITE, not concurrent WRITES. Am I doing this wrong all these years?


That's correct, it doesn't allow concurrent writes, but if the writes finish fast enough, that's somewhat academic.


You still have to serialize the writes. If you have a lot of users on a simple ToDo app, you can't do concurrent writes. I tried several times to run a simple webapp with Sqlite3 that had concurrent write requirements, and the workaround was too painful. I had to push all writes to an in-memory queue and have a single process pick them off the queue. That process is outside of the web framework.
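
A minimal in-process version of that single-writer pattern, for illustration (Python; the schema and names are made up):

    import queue, sqlite3, threading

    write_q = queue.Queue()

    def writer_loop(db_path="todo.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("PRAGMA journal_mode=WAL")
        while True:
            sql, params = write_q.get()
            try:
                conn.execute(sql, params)
                conn.commit()
            finally:
                write_q.task_done()

    threading.Thread(target=writer_loop, daemon=True).start()

    # request handlers enqueue writes instead of opening their own write connection
    write_q.put(("INSERT INTO todos(title) VALUES (?)", ("buy milk",)))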


Why couldn't you do it in-process with a mutex? IIRC sqlite3 is thread-safe


It's still a hassle, right? Especially in a WSGI environment for Python applications.


I've recently used SQLite for my personal project rigfoot.com

It's a "read only" and small website (at least for now), with just a bunch of daily visitors, a perfect use case for SQLite.

Funny thing is that in my case the database is so small that it's pushed directly to the repo.

Especially for startups and little projects, SQLite is your best friend.


Don’t consider SQLite for cloud based webservers.

The supposed 10-50x improvement in query latency is a scant upside, likely to be worth little. In the extreme this saves low single-digit milliseconds, which will be dwarfed by network hops.

In return for the above, you've coupled your request handler and its state, so you won't be able to treat services as ephemeral. Docker and Kubernetes, for instance, become difficult. You now require a heap of error-prone gymnastics to manage your service.

If the query latency really matters, use an in memory db such as Redis.

SQLite is great for embedded systems where you’re particularly hardware constrained and know your machine in advance. It could also be a reasonable option for small webservers running locally. For anything remote, or with the slightest ambition of scale, using SQLite is very likely a bad trade off.


This excellent article doesn't even mention rqlite, which will synchronize an arbitrary number of SQLite instances using the Raft protocol.

There must be some scaling limits to encounter using this combination, but wouldn't you love to have that problem?


I wrote a script to find all csv files in a directory and figure out the best way to load them into SQLite.

It gives me a handy way to run queries on data.

I tried to make it super smart too and really make educated guesses on what data types to use, and even link foreign keys.


A language like R (https://www.r-project.org/) helps more for CSV data-science. https://www.tutorialspoint.com/r/r_csv_files.htm


You can use Python with Pandas as an easy and powerful way to import from CSV files and then directly write or append the tables to an sqlite db. The whole thing can be done in just a couple lines of code.
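
Something along these lines, assuming one table per CSV file (illustrative, not the parent's actual script):

    import glob
    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("data.db")
    for path in glob.glob("*.csv"):
        table = path.rsplit(".", 1)[0]
        df = pd.read_csv(path)                                   # pandas infers column dtypes
        df.to_sql(table, conn, if_exists="append", index=False)  # create/append the table
    conn.close()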


Postgres is 9.5× slower when running on the same machine as the one doing the query

I'm surprised by this; sure, in-process is always going to be faster, but I still find it hard to believe that sqlite can beat postgres on a single machine.


Just a note that there are significant features of SQLAlchemy that don’t work with SQLite such as ARRAY columns, UUID primary keys and certain types of foreign key constraints.


Well, no major database other than PostgreSQL has native support for UUID or ARRAY; you can certainly use string-based types for these things with other databases. The DB-agnostic UUID recipe is at https://docs.sqlalchemy.org/en/14/core/custom_types.html?hig... and for ARRAY it's likely most convenient to use the SQLite JSON datatype, which is also supported directly.


I've been doing some ETL exploration with opening a sqlite :memory: connection, ingesting small-to-medium data, and then doing

"VACUUM INTO somefile.sqlite;"

to dump the RAM copy to disk.

What a great tool.
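
The shape of that workflow, for anyone who hasn't seen VACUUM INTO (Python; requires SQLite 3.27+, table name is illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events(id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO events(payload) VALUES (?)",
                     [("a",), ("b",), ("c",)])
    conn.commit()
    # write a compact, defragmented snapshot of the in-memory db to disk
    conn.execute("VACUUM INTO 'somefile.sqlite'")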


What does that do? How is VACUUM different from BACKUP in this context?


"The VACUUM command with an INTO clause is an alternative to the backup API for generating backup copies of a live database. The advantage of using VACUUM INTO is that the resulting backup database is minimal in size and hence the amount of filesystem I/O may be reduced. Also, all deleted content is purged from the backup, leaving behind no forensic traces. On the other hand, the backup API uses fewer CPU cycles and can be executed incrementally."

https://sqlite.org/lang_vacuum.html
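
For comparison, the backup API is exposed directly in Python's sqlite3 module (3.7+) and can copy incrementally while the source stays live:

    import sqlite3

    src = sqlite3.connect("live.db")
    dst = sqlite3.connect("backup.db")
    with dst:
        # copy 100 pages per step, sleeping briefly between steps,
        # so writers on the live database are barely disturbed
        src.backup(dst, pages=100, sleep=0.25)
    src.close()
    dst.close()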


i have so much love for SQLite. Consider me, getting my first internship at a startup. They have a bunch of contractors doing work for them as part of their service. The whitelabel application they got developed would export data in CSV. My job was to take that data and get some meaning from it. data included availability, locations, etc (imagine data about delivery drivers). I had no idea what to do but realized I could definitely parse this CSV through Python.

Once I had this data in Python I needed a way to analyze it. I never worked with Databases but decided to install a local copy of SQLite. The rest is history. I feel like I learned how to use databases in an organic way: by looking for a solution from raw data. A couple of queries later and python was exporting excel sheets with color coded boxes that indicated something based on the analysis I did.

Of course this could be done with any database application but the low weight nature of sqlite allowed me to prototype a solution so easily. We just backed up that native sqlite dump with the cloud and had an easy (super easy) solution to analyze raw data.


I love this approach. I've run sqlite in production for a variety of products over the years and for the most part it's great. One "backup" solution I did was dumping tables to text and committing the diff in git. Taking file copies while transactions are in progress can lead to corrupted db's.


Before you start throwing your opinions into HN about sqlite, please read: https://corecursive.com/066-sqlite-with-richard-hipp/?utm_so...


Also, consider just using the filesystem with binary encoding. Serialize / deserialize directly from / to data structures. It is faster and simpler than any database you'll ever use, assuming you don't need the functionality a database provides.
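
A minimal sketch of that approach, with Python's pickle standing in for "binary encoding" (any schema'd format such as msgpack or protobuf works the same way; combine with an atomic rename, as discussed upthread, if you care about crash safety):

    import pickle

    state = {"users": {}, "sessions": {}}   # whatever in-memory structures you keep

    def save(path="state.bin"):
        with open(path, "wb") as f:
            pickle.dump(state, f)           # binary-serialize the structures directly

    def load(path="state.bin"):
        with open(path, "rb") as f:
            return pickle.load(f)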


Is anyone here running production workloads which perform read and write operations on a remote SQLite database?

Currently using Postgres and I'm open to switching but I haven't seen any libraries or implementations of SQLite being used as a client/server database.


Why switch? Typically the work involved would need to be justified by some benefit, right?


I happily use SQLite via `sqflite`, a Flutter library, to store and retrieve data for an offline-first mobile app. This is my first time really using it, and I’m quite pleased with the experience and the familiar feel of using SQL.


Is SQLite suitable for a small-to-medium CMS (Content Management System), or a blog platform e.g. WordPress (MySQL) or Ghost (MySQL)?


Sure. Considering many CMSes are successfully doing "flat file" instead of a database, Sqlite can for sure do that as well or better.


sqlite as the data store for a system package manager is possibly the most practical single-host example


This "little thingie" is so powerful, that words are missing to tell the story.


A SQLite database can be stored in git, which seems like a great benefit. But I wonder: would it also be possible to have different "branches" of the database and then merge them at some point?


It's not integrated with git the way you're perhaps imagining, but SQLite sessions[0] is adjacent to what you're imagining.

[0] https://www.sqlite.org/sessionintro.html



What a cruel world- sqlite to the rescue!


Wow. So much tech - sqlite this time, and so many opinions to discuss with words / characters. Maybe i could create my own alphabet to have some peace of mind in the end? I don't think so - somebody would make a story of it - wait ... happened!

Read the creator of sqlite for more information on all running topics that make you go 'Uh' up to this point of time: https://corecursive.com/066-sqlite-with-richard-hipp/



