Hacker News new | past | comments | ask | show | jobs | submit login
How to Corrupt a SQLite Database File (sqlite.org)
372 points by pcr910303 28 days ago | hide | past | web | favorite | 148 comments

In another life (2007) I worked for a company that does digital signage. We had about 8000 systems deployed across the US. The software on them relied heavily on SQLite. At some point we started seeing random corrupted SQLite databases.

Trying to figure out what was causing this we realized our pc manufacturer was using the same sku but hardware inside could vary... and in particular the hard drive. Every system with a corrupt DB had a particular brand hard drive (WD), the ones that didn't fail was another (Seagate).

We never really spent much more time on this other than to lock down the hardware config and replace all the hard drives in the failed systems with the "good" brand.

Super interesting to see this. I'm assuming it was: Disk drives that do not honor sync requests

We were of course using consumer grade equipment, because the company was pretty cheap, but also we had so many to deploy. I wonder how much this cost the company to fully resolve. Between systems not playing, having to send technicians, shipping systems overnight, time troubleshooting, etc.

Reminds me of the Maxtor DiamondMax Plus 8 disk firmware problem. This was a model of hard drive back in 2004-2005 that had a known issue where it would corrupt it's own firmware. The disk would literally overwrite and corrupt portions of firmware code which would render the disk unusable. Heat seemed to exacerbate the problem, and the problem caused the disks to generate a lot of heat so it was self-perpetuating.

I worked at a hospital at the time, and we were upgrading from Win2k to WinXP, and replacing a lot of old HP Celeron 733 MHz systems with newer HP dc5000 AMD desktops. We deployed roughly 1,000 desktops over the course of a year and a half, and fully 25% of the systems that had a DiamondMax Plus 8 had disk failures within the first 6 months of deployment. Symptoms were always the same. System would get slower and slower until one day it would hard reboot itself and then claim there was no disk drive installed. BIOS would show the disk drive was present but that it had no capacity or name. Built-in BIOS self test would report a code 7 error.

It got bad enough that we refused to deploy systems that shipped with those drives. Our manager had to fight with the sales rep for about a month before they relented. We were ready to go to Dell just to get away from those stupid disks.

You can still find disk recovery sites talking about the issue: https://vitaldata.ca/article/n40p-data-recovery.html

Seagate had their problems too around that time.

The Barracuda 7200.10 was infamous not just for its bug but also for Seagate's response.

From what I recall, the SMART log wrapped around periodically - if the drive was reset when the pointer was at a certain value, it failed to boot with an assertion failure. 1 in 128 chance of a failure on a given power-up.

Thankfully someone found out that you could send a couple of debug commands to it with an FTDI cable and get your data back.

With how widespread both SQLite and WD are wouldn't half the world have apps crashing left and right, if just the hard drive brand was the cause of the problem?

The case of the disks not honoring fsync is famous. I learned about it from Brad Fitzpatrick, author of LiveJournal (later on the Go team at Google) writing about it in 2005: https://brad.livejournal.com/2116715.html

One of the first optimizations at basically every level is to not honor fsync requests. Network file systems lie. Operating systems lie about local volumes. RAID controllers lie. On-disk controllers lie. That's because these lies are caches, and cache buffers are one of the easiest ways to boost performance.

Apps are in fact crashing left and right.

My Materialistic app crashed about 30 seconds prior to opening this thread. I probably average about 5 "noisy" app crashes a week on my phone. No idea how many may be failing silently.

I had to give up on Materialistic because of the frequency of crashes. It's a shame because the hn website is really difficult to navigate on mobile (collapsing threads with that tiny + is a nightmare). Do you know of any better alternatives that let you log in and post from the app?

I have no issues with this guy’s client and have been using the free version for 3+ years. https://github.com/premii

Hmm, I checked that one out and it is indeed pretty nice but I don't think I can log in to my account with it. Just lets me view the public interfaces.

I login on the iOS app every day. There's a login option in the settings menu near the bottom.

Ah, I was using the webapp which doesn't have that functionality. Hmm, it appears the Android app is no longer in the play store. There's a dead link and a GH issue with no comments for two years.

I'll try to build the APK myself when I get some spare time.

Curious what kind of phone do you have? Its not very common for Materialistic to crash on me. It does happen but not that often.

Materialistic doesn't crash very often. When I say I have an app crash 5 times a week, I mean any one of my installed apps. If I ever come across a particular app that is crashing regularly, I'll look for an alternative and/or report the bug if the developer makes it easy to do so.

I have a one plus 6 and materialistic crashes almost every time I open it from a previous session. It also often crashes when reading articles even on a fresh instance

a couple of years prior to that era i worked for a Giant Electronics manufacturer who used WD desktop class drives in RAID and CCTV/DVR systems. we had insane corruption issues, burnout issues and so on with a large number of our devices. at the time i was running an rma shift for the product line so this was notable to me. we also had devices come in with Seagate drives, but they had been in operation much longer before they showed any issue and were from a few previous model releases. at some point the drives were "value engineered" and WD was selected. in the end they had to suck it up and put in a server class drive (something with a high duty cycle rating). once they did the fails slowed dramatically. when we went over the data it looked like all most of the failing devices were installed in 24/7 video monitoring applications, and several were installed in poorly (or not at all) ventilated spaces. after a while we put temperature tags on the inside of the chassis to get a sense of the max environmental temp (yeah, didn't work if it had been on a truck for week in transit, but at least a field tech could check).

if i had to guess, i'd say that the digital signage company was "value engineering" the system components ;) i was in industrial controls and never had a sense that the signage folks were laughing on the way to the bank...

It's somewhat implied it was one specific WD consumer drive model.

I used WD RE3s back then, and never had problems.

I had consistent problems with WD in that same era. Switched to Seagate as well, and my problems disappeared. Just another datapoint.

GP mentioned digital signage, so probably some displays at a business. Unlike a computer which would mostly either be left on (asleep) or shut down properly, I'm sure users had no qualms about killing power to these displays as they closed up every night. Write flushing problems would be exacerbated in that environment.

I doubt most consumer applications write to the database nearly as often as we were. Also very likely a particular model of the drive and also sheer luck that one worked and one did not.

See also: counterfeit SD cards.

Yes, I think it's possible (likely, even) that there was at least one more factor.

> because the company was pretty cheap, but also we had so many to deploy

Isn't "deploy many systems in hard to reach places" the prime target for enterprise-grade hardware? Fixing problems on one bespoke system is a lot easier than fixing the same problem 8000 times, and of course with a large sample size the number of problems experienced goes up.

I always thought "people willing to pay five to ten times market rate per gigabyte" was the target of enterprise-grade storage...

Yep. This was a big problem there. Before they were delivering content to PC's they distributed content via DVD once per month. DVD player failure rate was pretty massive, especially since they just started recycling refurbed players.

Contrary to popular belief, even consumer drives will honor flush commands. The problem is operating systems generally do not send flush commands, even when you might expect it (e.g. last I checked fsync() on Linux does not send a flush).

Disabling write caching resolves these problems. Unfortunately SATA drives typically ship with write caching enabled by default.

Of course fsync() sends a flush to the disk (unless you use non-standard mount options or "eatmydata"). They may not honor the FUA bit, though, but hopefully nobody relies on this if they expect it to work with consumer disks (See e.g. here https://devblogs.microsoft.com/oldnewthing/20170510-00/?p=95... ).

What I have heard is that some disks have problems if they are on a error recovery path. E.g. they have a corrupt sector or a sector that needs reallocation (pending sector), and then they have a power failure while the sector is reallocated.

Some cheaper HDDs and third-rate drives won't honor flushes. Same for SSDs, a few very cheap ones ignore flushes in favor of cache performance.


Not quite the same thing but it turns out there are some other gotchas with fsync on Linux https://lwn.net/Articles/752063/

The implementation of fsync is FS dependent. On the default ext4 filesystem with the default options (barrier) fsync does send flush.

In my other life (2000-2008 ish I think) I've also dealt with the large (tens of thousands) deployments all consumer grade as in your case. No SQLite though. Obviously we've had some hard drive failures. What is very interesting is that the results were exactly opposite. WD drives lived and Seagate drives failed. I am wondering if such discrepancy was caused by the usage patter.

The drives weren't actually failing, the setup failed due to how the WD drives work. If you reformatted them they'd be fine.

Yep, exactly this. The drives were fine, only the database was getting corrupted.

Oops, sorry was not careful reading your message

WD: this comment Seagate: ST3000DM001

Another reason to stick to Toshiba or ex-HGST(if and when available)

I have found that corruption of SQLite files in practice is a very difficult thing to achieve. On Windows, the worst I've ever seen is a wall of "Database File is Locked" which was resolved with a rewrite of the DB access logic. After much experimentation, I have found that the happy path for SQLite is as follows:

1. Use a SQLite distribution that is compiled or otherwise configured for serialized threading mode (this is the default in most cases).

2. Use a single connection from a single process to access the SQLite database file. This connection can be safely shared between multiple threads with optional locking as required (due to the guarantees offered by #1 above).

3. Transactions are handled with locking in the application code (i.e. on the SQLiteConnection instance), not in the database. This is predominantly required for inserts with PK generation, but also applicable in cases where you don't want 2 different threads to interleave updates during a certain multi-step operation (e.g. debiting one account and crediting another).

So far, we have found this to be the most performant and simple way to get business data in and out of a SQLite database file. The "one connection per operation" approach is NOT ideal for SQLite because there is an actual database file that needs a handle created/closed each time. One connection per makes sense when you are using a hosted database designed for this access model.

In WAL mode you can write from multiple threads without any problem, but you need to keep serializability in mind. Specifically, no concurrent write transactions are allowed. In other words, you can't have a transaction start at X, have another transaction start at X+1, issue DML and commit, and expect your original transaction from X being able to commit - it will never be able to.

Some SQLite wrappers (notably, Python's) mess with the start and rollbacks of transactions and make this appear buggy.

> Some SQLite wrappers (notably, Python's) mess with the start and rollbacks of transactions and make this appear buggy.

@coleifer (author of peewee) has a pretty informative blog post exploring — in part — transactional semantics, which might be of value to people who want to know more about this.


+1 for @coleifer, peewee is an awesome ORM, he knows his DB stuff very well.

Python's wrapper is one of the worst. It attempts to parse the SQL you are trying to execute and automatically send BEGIN or COMMIT to the database. Setting the "isolation_level" property (badly named) to None makes things better.

Even in WAL mode, you can still get a SQLITE_BUSY error even when just trying to open a new connection. The application I worked on could not tolerate these busy exceptions so we had to acquire an exclusive global mutex before attempting to open new connections to sqlite and make sure to release it at soon as the operation was done.

WAL mode doesn't really have "read-only" connection (if you try to open a read-only connection by marking as READ_ONLY with sqlite_open_v2, it will error out for WAL mode). A "supposedly" read-only connection can end up doing writes for example, consolidating WAL log into the main DB file. That can cause SQLITE_BUSY errors. You pretty much have to set the sqlite_busy_timeout in any cases.

They are working on it; please check for 'begin-concurrent' and 'wal2' branch names here https://www.sqlite.org/cgi/src/brlist

Posted in another thread... but I discovered how the FS in Docker for (Windows/Mac) worked the hard way. If you use a mounted volume in Win/Mac it does a syncing, not a direct mount/share. This lags and isn't a great option for live databases and if you're accessing the same SQLite db outside the container and inside, you will see corruption happen.

In any case, it was definitely an interesting discovery... since then, I only use shared volumes with the host as backup/restore targets, never for shared anything.

It's also worth noting, that using a DB in a shared volume, even if you don't touch it outside the container, is MUCH slower... I was doing this for mssql containers and when I stopped, my spinup for a fresh instance went from nearly 90 seconds to about 15.

I run Plex on my Windows desktop as a container, with everything I want to persist mounted in (this includes the database). If I don't stop the container before hibernating my PC, I'm pretty much guaranteed to have some kind of database corruption on the next start.

I'd consider a volume container for your database, the content is probably safe enough, since it's not a lot of heavy write. Then have a timed event that copies the DB out of the volume container for backups regularly.

I've found WAL mode with one writer thread and as many reader threads as you want works pretty well.

Yes, absolutely use WAL mode whenever feasible. Does wonders for performance.

> I have found that corruption of SQLite files in practice is a very difficult thing to achieve.

Not in the slightest. Stick your Plex DB on NFS and it’ll get messed up in very short order.

Wonder if it changing some of the NFS mount options (caching, sync, etc) would make it better?

And/or options known to cause corruption for this Plex DB usage pattern. eg: "Don't use options foo, bar, bazz with PlexDB over NFS, as corruption will occur".

People who know what NFS is will be on Emby or similar before I finish writing this comment.

If you have to handle (write code for) 2 and 3, why isn’t something like Postgres better?

I can answer this. Adding a queue thread to your executable is infinitely easier to redistribute and run than something that requires the user to install a persistent postgres daemon. There are premade libraries for several languages that provide thread safe query queueing for sqlite transparently as drop in imports so you don't even really need to think about it. See e.g. https://github.com/dashawn888/sqlite3worker (not mine)

I'd love to use Postgres with PhotoStructure, but it would triple the installation size (!!), make my code more complicated (I'd basically need to automate common dba work), and Windows support is shaky (last time I checked, at least).

I picked SQLite to reduce moving parts.

Something like Firebird Embedded would be better than Sqlite if you have concurrency and performance requirements alongside an embedded packaging constraint.

Sqlite is more like a structured persistent file format with a SQL interface than a database.

It's a good question. For me, lack of yet-another-service.

This is my primary answer as well. I have no problem with Postgres as a solution, but I will only invite additional complexity into my life when absolutely required by the business constraints.

SQLite is dead simple compared to Postgres (or any other hosted solution). I can copy a SQLite .db file from a production machine to my local box and be reviewing rows/running queries within seconds. Also, sharing SQLite databases with other users & developers or even using them as a contract between systems is a very compelling set of use cases. Finally, going from a blank windows machine to a working development (or production) environment is very quick if your application exclusively uses SQLite to store its data. You simply stop thinking about databases after a while because everything just works.

For me, the single file is the nr 1 reason. It sluist works, the file is the data.. very clear.

> Systems that run automatic backups in the background might try to make a backup copy of an SQLite database file while it is in the middle of a transaction. The backup copy then might contain some old and some new content, and thus be corrupt.

I am surprised by this! Is it really not entirely safe to take a backup while some software is running, has a SQLite database open, and might be writing to it?

That limitation is fine for a big database that has someone knowledgeable administering it who will take care to follow any required backup procedures. But SQLite is often included in software that nontechnical end users will use without even knowing that there's a SQLite file. For example, Firefox and Google Chrome use it.

This seems to mean that if a regular end user is browsing the web while backups run on their computer, it's possible they could be unlucky and get a bad backup that would corrupt the database if they had to restore it.

Am I interpreting that right? If so, it suggests that many/most software projects that include SQLite are doing it in a slightly unsafe way.

(It seems possible that the -journal and -wal files could protect against that, but this SQLite documentation only mentions that they protect you when transactions fail, which is a different case.)

Yes. However, keep in mind it's not just SQLite, but pretty much every program that has any kind of state.

For instance, a program may save its state into a file every time it changes by truncating the file and then writing new contents. If an user were to copy the file after truncation, but before write, the user would end up with an useless backup.

That said, this is rather unlikely to happen in practice. Programs usually do stuff (including saving files) when they are interacted with, so a backup copied with File Explorer has a very high probability of working, even when the application is running.

Programs can deal with the issue by performing atomic full writes by performing renames, but most programs don't bother to. SQLite cannot do that, because it cannot afford doing full writes as databases tend to store a lot of data.

The better way of ‘overwriting’ is to write the new file under a new name, and then atomically move it to where the old file was. If you have to overwrite (e.g. because of lack of disk space), the program that rewrites that file should open it in exclusive mode.

A back-up tool should open files to backup in a mode that disallows others from writing to it, and retry that a few times, each time after waiting a few seconds.

If that keeps failing, there’s a problem. Do you skip backing up the file or retry With a “it’s fine if others are reading” mode, with the risk of getting a hybrid of multiple versions? There is no fully correct answer there.

And it gets worse. Not all programs write files in exclusive mode. Some might even open a file multiple times while rewriting it. Also, consistency may work cross-files, for example with Mac OS packages.

And then, of course, there’s anti-virus tools. They often work below the file system, reading and writing files without the OS being aware of it.

The reason that backup tools work relatively reliable and speedy at all mostly is that most programs that write to files will either regularly close them, or just append to them.

Databases are the exception. They tend to keep files permanently open and regularly write to them. If you use SQLite in a ’normal’ application for user data, you should flush data after (about) every write (if you use it for caching, you can be a bit more loose, as its relatively fine to discard the database when it gets corrupted.

> A back-up tool should open files to backup in a mode that disallows others from writing to it, and retry that a few times, each time after waiting a few seconds.

Generally the preferred approach for backup programs is to trigger a file system snapshot - after quiescing the disks - and back that up. Sometimes (eg databases), the backup program is able to run application specific logic that works in with this (eg flush all database buffers before the snapshot).

For the filesystems that don't offer snapshots, some still have the ability to freeze the file system to allow a consistent backup.

As I said in a comment you responded to.

> Programs can deal with the issue by performing atomic full writes by performing renames, but most programs don't bother to. SQLite cannot do that, because it cannot afford doing full writes as databases tend to store a lot of data.

I'm aware you can do that, but most programs aren't. If you pick any random program, chances are it won't do that.

Writing a copy and renaming works, but is it the only way, or is it just an approach that a lot of software uses because it's simple and it works? In other words, are there ways that might be more work to implement but also more efficient that would be suitable for a database?

I agree the risk is pretty small in practice. Though copying with File Explorer isn't the use case I had in mind. It was automatic scheduled backups. Both Windows and macOS make it pretty easy to set these up with an external hard drive. And those could start running while you're using the computer and/or you could start using the computer while backups are running.

An append-only format should work fine with most backup software that copies files. A vacuum phase would still have to do an atomic rename.

Some backup software gets the modification time of the file, does the copy, then checks the modification time again. If it changed then it recopies the file. If it exceeds some number of attempts it gives up and reports an error for the file.

Back-up software can use NTFS/ReFS Shadow Copy in Windows. This shouldn't cause issues with SQLite.

Copying files is only an issue for manual back-ups done without special software.

That depends on your backup solution. Naively copying or rsyncing files can lead to corruption for a large number of applications, including anything using SQLite. With OS or file system support you can make a snapshot of a point in time and copy that. Windows backup for example uses VSS (Volume Shadow Copies [1])) which should not suffer from this problem. On linux you can make ZFS clones or BTRFS snapshots, backup those and then delete them locally.

1: https://en.wikipedia.org/wiki/Shadow_Copy

>> Systems that run automatic backups in the background might try to make a backup copy of an SQLite database file while it is in the middle of a transaction. The backup copy then might contain some old and some new content, and thus be corrupt.

> I am surprised by this! Is it really not entirely safe to take a backup while some software is running, has a SQLite database open, and might be writing to it?

I've written backup software: You only know if another process has an open file handle if you write your own driver. (Or integrate with 3rd party drivers, like ELDOS.)

In most cases, writing a driver is overkill, so you have to accept that you might read a file mid-write. In general, the other process should lock the file, but there's no way to force every programmer in the world to write their software to be convenient for you.

Yes it is unsafe yet everybody does it. As long as you accept that some backups may be corrupted, it's fine. You'll still have older backups to go to. Of course, you have to notice the corruption in the first place!

On a snapshotting system such as ZFS it should be fine because there won't be a race between updates and copying the WAL file and the DB file. (Haven't investigated it though.)

Let's turn this around. How would you avoid this?

Suppose that you are copying the file front to back. While you are doing this, an update comes in that changes something in the front and back which refer to each other, while your copy is somewhere in the middle.

The result is that you get a backup with a data structure in the back of the file referring to stuff in the front of the file that isn't there. Your backup copy is therefore corrupted.

This problem is almost impossible to avoid. You just do the best that you can.

I wonder if dqlite (https://github.com/canonical/dqlite) or rqlite (https://github.com/rqlite/rqlite) are enough to help prevent this.

No, but they should be able to recover from it, which is almost as good.

I must be getting old - an article with such a title would used to also be appended with, "For Fun, and Profit!"

I would have expected that if this were posted on milw0rm or a personal blog, but not on the actual SQLite info page.

I honestly feel it's more a sea change type of thing,


These days that's not enough; most come with 6 Reasons, and often one of them is particularly Unbelievable.

Using a mounted volume in docker for windows/mac desktop and inside container to edit the same database.

Don't do this btw, the FS sync in docker for windows/mac is really slow and prone to causing corruption and errors if you do.

This section was particularly interesting (I add my own thoughts below it):

>Unfortunately, most consumer-grade mass storage devices lie about syncing. Disk drives will report that content is safely on persistent media as soon as it reaches the track buffer and before actually being written to oxide. This makes the disk drives seem to operate faster (which is vitally important to the manufacturer so that they can show good benchmark numbers in trade magazines). And in fairness, the lie normally causes no harm, as long as there is no power loss or hard reset prior to the track buffer actually being written to oxide. But if a power loss or hard reset does occur, and if that results in content that was written after a sync reaching oxide while content written before the sync is still in a track buffer, then database corruption can occur.

>USB flash memory sticks seem to be especially pernicious liars regarding sync requests. One can easily see this by committing a large transaction to an SQLite database on a USB memory stick. The COMMIT command will return relatively quickly, indicating that the memory stick has told the operating system and the operating system has told SQLite that all content is safely in persistent storage, and yet the LED on the end of the memory stick will continue flashing for several more seconds. Pulling out the memory stick while the LED is still flashing will frequently result in database corruption.


If you were a human rather than sqlite you would stop believing that a sync has occurred immediately, trust it a little bit after a few seconds, and if it's been a minute since you yourself last touched the file, you would think "surely it has been written by now".

So it seems by tracking wall time, the library could do better than have to rely on the drive, and could be more on guard against lying drives.

The bit about disk drive lying about syncing writes is a persistent myth. There have probably been some HDDs at some point which had notable bugs in their flush command, but in general drives do actually write data to disk before reporting complete status for a flush.

I spent several years working for a major HDD manufacturer and their firmware QA process included a test rig which would repeatedly pull power to the drive while sending write and flush commands. The drive would then be powered up and the data on the drive examined to verify that all data which should have been flushed actually made it to disk.

I think this myth persists mostly because it can be surprising difficult to coax an OS into sending a flush command for much the same reason disk drives get accused of lying about it: performance.

These days the most reliably solution is to reconfigure the drive with write caching disabled. This way the drive will ensure every write hits the disk before reporting completion to the host. Write caching remains enabled by default on SATA drives mostly for historical reasons. Back before command queuing was commonly supported write caching gave a large speedup. These days everything supports NCQ so disabling write caching only has a small performance impact.

This is absolutely not a myth. There have been drives that behaved this way, including from "name brand" manufacturers. Maybe those were firmware bugs, maybe they stopped doing it, but it is not something made up.

Of course there have also been controller bugs, file system bugs, and so on over the years. A decade ago the behavior of most file systems was quite poor in the face of underlying hardware failures but a lot of work has gone into improving the situation.

SSDs are slowing improving too. Most of them have claimed to go into read-only mode when enough blocks wear out but in reality most of them would crash, fail to respond, or just lose data. It turns out manufacturers tend to be a bit loose in their claims until independent third-parties put their claims to the test. More recent models seem to be a bit better behaved in failure states.

I'd also wonder about how old this part is and whether it's partially based on Windows. USB behavior on Windows used to be optimized for speed and would do caching (hence the need to eject flash drives before removal) but in one of the Win10 updates it was changed to safe removal which basically disables write caching on removable devices. So, at least on Windows, there was absolutely a time where the OS would report completion before things were actually written.

> So it seems by tracking wall time, the library could do better than have to rely on the drive, and could be more on guard against lying drives.

Can you elaborate about tracking wall time --- how would you determine the real sync timing? Picking an arbitrary large delay (a minute) would trash the write performance

Why would it trash the write performance to pick an arbitrarily large delay, such as one minute? Can you give an example? I don't know sqlite internals but this document is about "corrupting" a database (not just losing some data.)

So I imagine it could be set up so that any power loss that keeps data from being written within a specified arbitrarily large delay (such as one minute) won't actually corrupt the database - just lose data.

Wouldn't you need to wait the 1 minute before the next write can happen?

I don't think so. I think you could treat it as a buffer that might still be lost on power failure. Doesn't mean you can't add more to that buffer or have to wait. It's not like that buffer has a limit.

Is open() on something other than /dev/stderr returning 2 a thing? Or is there something screwy that was causing it to get associated with the open database file?

The kernel gives you the lowest numbered fd available.

If you close stderr, 2 becomes available and can be handed out.

Also it's not exactly guaranteed that 0, 1 and 2 are the fd numbers for stdin, out and err on program start, but they are hard coded and standard.

E: It's implementation defined, ie not guaranteed. Your platform may make guarantees about it, but it is not an actual standard.

> Also it's not exactly guaranteed that 0, 1 and 2 are the fd numbers for stdin, out and err on program start

Yes, it is.


You’re right, but it’s more than that: stderr is defined to be fd 2. Whatever is assigned to that fd, is the stderr. If you close it and open something else, that something becomes your stderr.

You must differentiate between a guarantee by an actual standard and platform specific documentation.

That is a guarantee by an actual standard. More specifically, it's one of the things that POSIX.1-2017 guarantees but the ISO C standard does not. In practical terms this means that all Unix-like systems use the traditional fd numbers for stdin, stdout and stderr but it's possible that non-Unix systems might not.

(Standard C doesn't have file descriptors at all, it has FILE * file handles).

Can you tell me what the "std" in "IEEE Std 1003.1-2017" stands for?

He linked an actual standard.


> On program startup, the integer file descriptors associated with the streams stdin, stdout, and stderr are 0, 1, and 2, respectively.

> On program startup

Then, after program startup, if it is closed, then fd 2 could be assigned to something else.

You must differentiate between a guarantee by an actual standard and platform specific documentation.


> The following symbolic values in <unistd.h> define the file descriptors that shall be associated with the C-language stdin, stdout, and stderr when the application is started:

> STDIN_FILENO Standard input value, stdin. Its value is 0.

> STDOUT_FILENO Standard output value, stdout. Its value is 1.

> STDERR_FILENO Standard error value, stderr. Its value is 2.

It is not only guaranteed by the POSIX standard, as pointed out by the siblings, it is necessary for standard shell scripting idioms to work, e.g. "command1 2>&1 | command2".

Of course it is. That's how people did redirection of stderr before threads and before dup2.

It doesn't get more than a passing mention, but section 2.1 is worth noting. Don't try to use sqlite files on an NFS mounted drive.

> an SQLite

Does this settle it, then? This implies it's pronounced ess-queue-lite. I personally prefer the sequel-lite pronunciation, but if the developers use the former, then I guess I'll do that.

I thought it was ess-que-ell-ite

yea, seems so https://english.stackexchange.com/questions/431329/what-is-t...

Are you from the 70s and still use SEQUEL? Then you pronounce it sequel.

If instead you are a modern hu-man who uses SQL, you pronounce it ess-qu-ell.

> Backup or restore while a transaction is active

I wonder does that apply to Windows?

Well-designed Windows backup software uses shadow copy NTFS feature. Shadow copy creates a snapshot of the complete volume when backup starts. All writes happening after that won’t be included in the shadow copy, so the backup software should never read some old and some new content.

Does it mean when using VSS on Windows, SQLite databases will be backed up just fine? Or am I missing something?

You're probably right. Taking a VSS snapshot is identical to pulling the power plug and imaging the drive. If SQLite hand handle the latter it should be able to handle the former.

Another thing that should work well is an AWS EC2 EBS snapshot as Amazon promises a consistent point in time copy.

Somewhat unrelated, I feel like the stars aligned to prevent my company from using SQLite.

We use a lot of CSV files in our embedded devices. I personally hate CSV and advocated for SQLite. However, our CSVs are 1 writer/many readers, and the only way to do that in SQLite is to change the mode to WAL. However, apparently WAL doesn't work on JFFS2 filesystems :(...

I'm going to give it another shot on our next-gen product which uses ubifs

Can they be applied to other DBMSs? I mean, I don't think other DBMSs treat database files too differently from SQLite, don't they?

Most DBMS don't let you at the files directly - you speak to the DB over a socket. Because SQLite is in-process, bugs in the process that access random file descriptors can write over the SQLite files.

(This is more ammunition for the idea that the real software isolation boundary on a desktop computer should not be "user", but "software author"!)

Most of them would be possible in some variation, although typically with different results.

For instance typically databases are accessed through a socket interface which uses the same pool of file descriptors as open(). So it's probably possible for a database connection to end up with fd 2 and then the write(2, ...) would also send garbage to the database.

Although in this case it probably won't do much unless you're very unlucky because the message will almost certainly not end up being valid for the DB so it'll either ignore it, return an error or maybe drop the connection without corruption. So it would still be a nasty thing to debug but not quite as critical.

More generally there's typically a much better isolation between database and process on traditional processes which makes it difficult for a misbehaving program to mess directly with the DB. And having a server program processing the transactions means that you don't have to rely so heavily on things like filesystem locking to achieve atomicity for instance. That being said buggy or limited filesystems can be a problem for any database, for instance to guarantee that a transaction has been really commited to disk in case of a power outage etc... See this for instance:



SQLite is serverless; server-based DBMSs are free to use other locking mechanisms, other than relying on filesystem primitives, to achieve concurrency.

The most common by far.

I/O error while obtaining a lock leads to corruption. Probably speaks to my drive/os choices.

Thanks for the post, We have been using SQLite with Django for the past one year and I'm so happy that it exists. I even store it in the git repo and and it makes backups so easy.

> Continuing to use a file descriptor after it has been closed

It would be nice if file descriptors on 64-bit platforms used tagging so the full value could never be reused.

E.g. lower 32 bits is the normal fd and the 32-bit is a serial number (or random value). When using the fd the kernel can use the lower 32-bit as index into the table and check the upper 32 bits are correct.

Sadly that isn't allowed by POSIX


> 2.14. File Descriptor Allocation

> All functions that open one or more file descriptors shall, unless specified otherwise, atomically allocate the lowest numbered available file descriptor

Would badness ensue if we allowed ourselves to define all previously used fds as “unavailable”, and thus conforming to the rules as stated when we increment our serial number before handing out the next fd? If yes, what sorts of badness and why?

It strikes me that you'd use up STDIN, STDOUT, and STDERR pretty fast that way.

One could make an unprincipled exception for them, of course. And SQLITE would still need the special-casing to prevent using low-numbered fds. But this would prevent other problems and is worth considering.

Didn't see your reply before commenting, but it gave me a thought: We really only need a "local" solution. If there were a way of requesting fd's that weren't required to be the lowest available ones (perhaps even 64-bit ones with a process unique uppper part), that'd probably be good enough here. Of course we're still sharing address space with buggy code, so maybe it's not worth it.

I’ll reply here since then both you and parent can see it and comment further.

In the case that you linked to in the sibling comment that you wrote in reply to me, the dup2 syscall is used.

So the question then that I have is, in the case of

  foo 2&>1
Does bash then:

a) First close fd 2, and then call dup(1), or does it

b) Call dup2(1, 2), or

c) Do something else entirely, or something more complex or convoluted?

In case the answer is b) then it seems to me, though I may be overlooking something, that we could allow calls like b) and inside of the dup2 syscall keep the id of the fd that we are using as newfd, since it is unambiguously requested that this is intended, while at the same never reusing fd id in the dup syscall.

In this comment I am referring to the dup and dup2 syscalls as defined in the Linux man pages:

dup() uses the lowest-numbered unused descriptor for the new descriptor.

dup2() makes newfd be the copy of oldfd, closing newfd first if necessary, but note the following: [...]


With the modification that “lowest-numbered unused descriptor” would be unique as previously stated in the comment that the two of you are replying to.

More generally, the question is, does any wide-spread piece of code manually close an fd and then call the dup syscall, and expect the fd id to be exactly the same as the one they themselves closed, instead of using dup2?

And furthermore, likewise, does any widespread piece of code use the dup2 syscall to specify the newfd where actually it should not have cared about the id of the newfd and should have used dup instead and let the system decide?

I don't think it matters what bash actually does, the issue (as I see it) is that when a program calls write(N, ...) it sometimes wants to refer to the "stale" file descriptor N. Consider this contrived example:

    write(STDERR_FILENO,"test",4); // OK, write to normal stderr
    open(....); // Open some log file
    // later
    write(STDERR_FILENO,"x",1); // OK, write to log file
Even though fd 2 refers to different files I think the above is required to work for POSIX compliance.

That case is covered too as long as the program makes use of the dup2 syscall to make the switch instead of close followed by open. But it is probable that some pieces of code that are currently in widespread use do it by close and expecting a call to open to give them fd id for stderr like you say indeed I guess.

So the conclusion is that the answer to my original question is probably like you say, we can’t just do that.

But it would be interesting to find out how many wide-spread pieces of open source code do it that way, and for all code that does one could submit patches to change them to using the explicit dup2.

64 bit file descriptors doesn't work unless you create a whole swag of new system calls, because file descriptors are type `int` and most 64-bit POSIX implementations are I32LP64.

I think quite a bit of software actually relies on closing and re-opening file descriptors, especially stdin/stdout/stderr just before/after forking. This was the first example I found via a quick web search: https://unix.stackexchange.com/a/148410 but it suggest that e.g. bash would break :)

It might be possible with some hacks/heuristics to catch many errors (perhaps one could create a valgrind tool/santizier), but I suspect it's not possible in general since there's no way of knowing if the call to write(2,....) meant the old stderr or the new one.

What would it take for a valid program to depend on this behavior?

How weird would it look? How likely is it to arise in practice?

I'm a believer in standards in general, but there's a time and place to move beyond them too.

File descriptors are held in type int in most APIs, which is only 32 bits wide even in 64-bit systems.

Related question - does SQLite checksum blocks (pages) of data? Of course the filesystem usually does, and the physical device itself does, but in the scenario where the OS somehow delivered corrupted data would SQLite at a minimum detect it?

> Of course the filesystem usually does

Does it really? Certainly a few filesystems like ZFS have features like that, but AFAIK none of the "mainstream" ones do (the things actually running on billions of devices, like ext4, NTFS, etc)

> does SQLite checksum blocks (pages) of data?

No. Postgres has an off-by-default option for this. The journal does, though.

> Of course the filesystem usually does

Nope. Data checksumming is usually only employed by CoW file systems, because it requires data journaling for a regular file system, which generally isn't used or not supported at all.

2009 (at least) + previous discussions: https://hn.algolia.com/?q=How+to+Corrupt+an+SQLite+Database+...

> Backup or restore while a transaction is active

Is there any information whether taking a consistent snapshot (that contains both the DB file and journals) is safe?

It's recommended to use the backup api vs a filesystem copy: https://www.sqlite.org/backup.html

Seems like that would have the same problems if you do it while transactions are ongoing.

I believe they are subtly different.

If you copy while transactions are ongoing, you're reading different parts of a file at different points in time. For example, the beginning of the sqlite3 has already been copied, but by the time you get to the log file, blocks have been committed back into the beginning of the sqlite file.

If you take a consistent snapshot of the filesystem first, I would think that would be the same as a crash in the middle of a transaction. Things should be in a consistent state that the recovery code at startup can handle.

Given that the DB is designed to survive power outages (even during transactions), I don't really understand how that would be the case.

If you are using WAL, I feel volume-level snapshots are an acceptable approach.

Had a team mate use “dos2unix” on everything in a directory and that defo corrupted the database.

Very interesting read! Although some points might sound too strange to include in this list (like "memory corruption" - in case of faulty memory, everything is in danger, not only database) - some worth keeping in mind (multiple hardlinks to the file, moving a single DB file around without accompanying *-wal file, etc)

They are not describing cases of faulty memory but explicitly, and very aptly, warn you that since SQLite is a C library that runs in the same address space as the application code, if your application code corrupts memory (by way of buffer overrun, heap corruption, etc) it can impact SQLite's internals and in turn corrupts the SQLite database file.

> strange to include in this list (like "memory corruption" - in case of faulty memory, everything is in danger, not only database)

But theoretically one could build something that handles up to X random bit flips, so it's maybe worth mentioning that sqlite doesn't handle this (even though probably no one else does as well)

Could you? You would also have to take into account that the instruction that make the program could also be randomly flipped

In most cases the size of the code is much smaller than the data handled, so for random errors (perfectly working DRAM has a bit error rate somewhere in the general vicinity of 10^-14/bit*hr) the distribution of errors is in accordance to that proportion.

Any good introduction to sqlite3 for a Python user?

Read the sqlite SQL documentation. Then read the built-in Python library documentation. Or read the SQLAlchemy documentation, for a different way to interact with it. Build something.




Applications are open for YC Summer 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact