What is not so obvious to me: when should WAL be used and when not? Does it even make a difference? It says that WAL might be 1-2% slower on read-heavy applications, but how much faster is it for write-heavy applications? I'm trying to think of a use case, because most (web) applications are read-heavy.
And another question: are both WAL and the rollback journal safe under multithreaded and multiprocess access, when the database is used by multiple applications on a single server? I've always asked myself this but somehow never found a clear yes/no answer.
Not 100% certain about this, but I can give my understanding and some links.
Have a look at https://www.sqlite.org/faq.html#q5. Multiple processes can have the database open. Multiple processes can read the database simultaneously. Only one process can write to the database at any moment.
For the above to be true, WAL has to be enabled. Otherwise a write in one process will block reads from other processes. You can test this with two simple scripts that write and read from the database at uneven intervals (I have an example of that under [0]; it's in German, but with code).
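In the same spirit, a minimal version of that experiment with Python's stdlib sqlite3 (the filenames, sleep time, and the BEGIN EXCLUSIVE trick are my choices, not from [0]):

    # writer.py -- in rollback-journal mode, BEGIN EXCLUSIVE locks readers out
    # for the whole sleep; after PRAGMA journal_mode=WAL it behaves like BEGIN
    # IMMEDIATE and readers keep working.
    import sqlite3, time

    conn = sqlite3.connect("test.db", isolation_level=None)  # manage transactions manually
    conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    conn.execute("BEGIN EXCLUSIVE")
    conn.execute("INSERT INTO t VALUES (1)")
    time.sleep(10)
    conn.execute("COMMIT")

    # reader.py -- run in a second terminal while writer.py sleeps; without WAL
    # this raises "database is locked" after the 1 s timeout.
    import sqlite3

    conn = sqlite3.connect("test.db", timeout=1.0)
    print(conn.execute("SELECT count(*) FROM t").fetchall())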
I think the idea is that you can also have multiple processes that write; they just have to set a timeout and retry. That's not the default configuration, at least not with the SQLite libs I know. But if writing processes just retry, that should also work. PRAGMA busy_timeout is the keyword here, or you handle it manually by catching the "database is locked" error and retrying.
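For illustration, a sketch of the timeout-and-retry idea in Python's stdlib sqlite3 (the pragma values, backoff, and attempt count are assumptions, not gospel):

    import random
    import sqlite3
    import time

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode=WAL")   # readers no longer wait on the writer
    conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5000 ms before "database is locked"
    conn.execute("CREATE TABLE IF NOT EXISTS log (msg TEXT)")

    def write_with_retry(sql, params, attempts=5):
        for attempt in range(attempts):
            try:
                with conn:  # commits on success, rolls back on exception
                    conn.execute(sql, params)
                return
            except sqlite3.OperationalError:  # raised once busy_timeout expires
                time.sleep(random.uniform(0.05, 0.2) * (attempt + 1))  # jittered backoff
        raise RuntimeError("write still locked out after %d attempts" % attempts)

    write_with_retry("INSERT INTO log (msg) VALUES (?)", ("hello",))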
I usually try to avoid that scenario regardless, one process that writes and many that read worked for me for everything so far.
The article discusses EC2, so I can tell you at least this much.
The flush FS commands get translated into storage-system commands that ensure blocks are written through to the disk. When using network-attached storage like EBS, the minimum latency will likely be determined by the network round trip involved in sending these commands and waiting for the ack (the bulk of all the latency will likely be network). If using WAL halves the number of times this needs to occur, you should see double the number of serial transactions per second.
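To make that concrete with purely illustrative numbers (the 0.5 ms figure is an assumption, not a measurement): at ~0.5 ms per acknowledged write to EBS, a rollback-journal commit needing ~2 synced round trips tops out around 1,000 serial commits/s, while a WAL commit needing ~1 tops out around 2,000.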
1-2% slower on reads is likely as insignificant as it sounds, particularly if your write load has you interested in doubling your write qps, possibly without changing a single schema or line of code!
Edit: This is generalized information though. I would like to see some actual SQLite numbers and I disagree with the article's generalization that SQLite is best suited for applications that "Are read-heavy but not write-heavy".
"Benchmarks on the smallest EC2 instance, with no provisioned IOPS, put this little trooper at 400 write transactions per second, and thousands of reads" -> I think 90% of SaaS would be super fine with this write limits.
One thing to note: the bottleneck demonstrated here is in write transactions per second, not in GB (or rows) inserted per second or anything.
The vast majority of businesses will never hit 400 qps of writes anyway (and many of the ones that would could throw a sharding strategy at the problem), but just having a lot of data being written isn't sufficient to make SQLite an untenable solution; problems only arise when the insertion of that data needs to be extremely fine-grained.
Can anyone share their experiences using SQLite over NFS? It seems like SQLite has major caveats here (no WAL, dot-file locking); how much of a headache is it?
For context, fish shell is considering using SQLite for history, but some users have NFS home directories.
A big headache in my experience. Different NFS servers lie about whether or not they really fsync writes when the clients request it, so you can end up with clobbered writes and corrupted databases in the worst case, and lots of "database is locked" errors that persist for longer than they should in the best case.
Sorry, wasn't meaning to be rude but the title of the post implied there was some insights being shared. Instead, it was telling us that SQLite has WAL mode, and that it should be used.
There was nothing there that actually talked about issues or lessons from working with it.
I'm copying out 2 points I think are most relevant here:
- Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community.
- Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.
I asked for counter-proof of my dismissal and reasoning, and all you offer up is copy pasted points that aren't applicable.
Yes, my dismissal was shallow, as was the article in question. Here's a more verbose version that should satisfy you?
The article speaks of using SQLite in production but only mentions the existence of WAL-mode and suggests its use. That is general (not production specific) advice that is well known in the community. While the promotion of SQLite is a good thing, I found this article wanting in delivering on what the title suggested.
Well, first you can tune SQLite to get a 1000x speedup over 400 w/s without compromising safety too much (rough sketch of the usual knobs at the end of this comment). Second, the use case is small and local; I use it as an fprintf replacement.
Third, 400 concurrent users is a lot? If I had ~400 concurrent users, each with a 10 minute engagement time spread over 4 hours a day, I'd have ~10,000 daily users (10,000 users × 10 minutes over a 240 minute window ≈ 417 concurrent). If they paid a monthly SaaS fee of $5/user, I'd be making $600k/year.
So… considering the implementation simplicity, that seems like a good trade off?
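On the first point: the tuning usually starts with a couple of pragmas. A minimal sketch (the specific values are my assumptions, and the really big multipliers come from batching many writes per transaction, not from pragmas alone):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode=WAL")    # append to a log instead of rewriting pages
    conn.execute("PRAGMA synchronous=NORMAL")  # sync at checkpoints, not on every commit;
                                               # a power cut can lose recent commits but
                                               # should not corrupt the database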
And to be explicit: It looks like InnoDB is ACID compliant and crash consistent, so indeed this criticism appears out of date. Thanks for correcting me.
Yep, and even before 5.5, InnoDB was present (since the early 2000s IIRC), and it's what most people chose for any production-level deployment (MyISAM never even supported transactions, so people had to use something different when they got serious). It just wasn't the default engine until that point, so you had to specify it in your CREATE TABLE statement or use a config setting to change the default.
MySQL 3.x was mostly just okay for simple websites that didn't do much. (MyISAM is good for workloads that are either extremely read-heavy with few writes or vice versa; it was very performant in those cases for the time.)
MySQL 4.x added subquery support and InnoDB as a standard component, even if not the default.
MySQL 5.0 added cursors, triggers, stored procedures, views.
I think from about this point MySQL was fairly usable in place of some enterprise DBs, depending on the use case; most of the essential features were there, even if not always the default.
MySQL started as a super easy to use and accessible DB, and slowly accreted enough enterprise level features to be able to compete in that field, but it's been in use successfully there for quite a while.
In comparison, my (super limited and possibly incorrect) understanding of Postgres's history is that they focused on stability and enterprise-level features first, and later added some conveniences. I think they aren't super different in capabilities now, at least for the core functionality most people expect.
When there are enough failings to generalize and the product is owned by a company that I feel is evil incarnate, I'm happy to dismiss them and then proceed to not pay attention for a very long time, yes. In this case, it appears that this particular failing may indeed have been fixed.
With no claims to suitability. Just back of envelope based on 400 write transactions per second.
-- Assume an employee filling out a form on a web site that takes one minute to complete.
-- Assume submits are randomly sent. (A bad assumption)
Per minute that's (60 seconds × 400 transactions/second) 24,000 people submitting a form, or (× 60 again) over 1.4 million submissions per hour. Nothing Google scale, but enough for most any internal business application.
These are also likely just serialized commits per second. Parallelize non-blocking writes and/or use explicit micro-batching (SQLite supports batch inserts) and you'll be blowing most gov sites out of the water.
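For example, a micro-batching sketch with the stdlib sqlite3 module (the table, batch size, and numbers are made up):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, payload TEXT)")

    rows = [(i * 0.001, "event-%d" % i) for i in range(10000)]

    # One transaction per batch: 10,000 rows cost one commit (one sync), not 10,000.
    with conn:
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)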