While trying to figure out what was causing this, we realized our PC manufacturer was using the same SKU but the hardware inside could vary... and in particular the hard drive.
Every system with a corrupt DB had one particular brand of hard drive (WD); the ones that didn't fail had another (Seagate).
We never really spent much more time on this other than to lock down the hardware config and replace all the hard drives in the failed systems with the "good" brand.
Super interesting to see this. I'm assuming it was: "Disk drives that do not honor sync requests."
We were, of course, using consumer-grade equipment, partly because the company was pretty cheap, but also because we had so many to deploy. I wonder how much this cost the company to fully resolve, between systems not playing, having to send technicians, shipping systems overnight, time spent troubleshooting, etc.
I worked at a hospital at the time, and we were upgrading from Win2k to WinXP, and replacing a lot of old HP Celeron 733 MHz systems with newer HP dc5000 AMD desktops. We deployed roughly 1,000 desktops over the course of a year and a half, and fully 25% of the systems that had a DiamondMax Plus 8 had disk failures within the first 6 months of deployment. Symptoms were always the same. System would get slower and slower until one day it would hard reboot itself and then claim there was no disk drive installed. BIOS would show the disk drive was present but that it had no capacity or name. Built-in BIOS self test would report a code 7 error.
It got bad enough that we refused to deploy systems that shipped with those drives. Our manager had to fight with the sales rep for about a month before they relented. We were ready to go to Dell just to get away from those stupid disks.
You can still find disk recovery sites talking about the issue: https://vitaldata.ca/article/n40p-data-recovery.html
The Barracuda 7200.11 was infamous not just for its bug but also for Seagate's response.
From what I recall, the SMART log wrapped around periodically - if the drive was reset when the pointer was at a certain value, it failed to boot with an assertion failure. 1 in 128 chance of a failure on a given power-up.
Thankfully someone found out that you could send a couple of debug commands to it with an FTDI cable and get your data back.
If I had to guess, I'd say that the digital signage company was "value engineering" the system components ;) I was in industrial controls and never had a sense that the signage folks were laughing all the way to the bank...
Isn't "deploy many systems in hard to reach places" the prime target for enterprise-grade hardware? Fixing problems on one bespoke system is a lot easier than fixing the same problem 8000 times, and of course with a large sample size the number of problems experienced goes up.
Disabling write caching resolves these problems. Unfortunately SATA drives typically ship with write caching enabled by default.
What I have heard is that some disks have problems if they are on an error recovery path. E.g. they have a corrupt sector or a sector that needs reallocation (a pending sector), and then there is a power failure while the sector is being reallocated.
Another reason to stick to Toshiba or ex-HGST (if and when available).
1. Use a SQLite distribution that is compiled or otherwise configured for serialized threading mode (this is the default in most cases).
2. Use a single connection from a single process to access the SQLite database file. This connection can be safely shared between multiple threads with optional locking as required (due to the guarantees offered by #1 above).
3. Transactions are handled with locking in the application code (i.e. on the SQLiteConnection instance), not in the database. This is predominantly required for inserts with PK generation, but also applicable in cases where you don't want 2 different threads to interleave updates during a certain multi-step operation (e.g. debiting one account and crediting another).
So far, we have found this to be the simplest and most performant way to get business data in and out of a SQLite database file. The "one connection per operation" approach is NOT ideal for SQLite because there is an actual database file that needs a handle created/closed each time. One connection per operation makes sense when you are using a hosted database designed for that access model.
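A minimal sketch of that pattern using the SQLite C API (the file name, schema, and transfer operation are illustrative, not from the post; SQLITE_CONFIG_SERIALIZED is already the default in most builds):

    #include <sqlite3.h>
    #include <pthread.h>

    static sqlite3 *db;                   /* one connection shared by all threads */
    static pthread_mutex_t txn_lock = PTHREAD_MUTEX_INITIALIZER;

    void db_init(void) {
        sqlite3_config(SQLITE_CONFIG_SERIALIZED);   /* #1: serialized threading mode */
        sqlite3_open("business.db", &db);           /* #2: single shared connection  */
    }

    /* #3: multi-step operations are serialized by the application, not the DB */
    void transfer(long from_id, long to_id, long cents) {
        pthread_mutex_lock(&txn_lock);
        sqlite3_exec(db, "BEGIN IMMEDIATE;", 0, 0, 0);
        /* ... debit from_id and credit to_id via prepared statements ... */
        sqlite3_exec(db, "COMMIT;", 0, 0, 0);
        pthread_mutex_unlock(&txn_lock);
    }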
Some SQLite wrappers (notably Python's) interfere with how transactions are started and rolled back, which can make this approach appear buggy.
@coleifer (author of peewee) has a pretty informative blog post exploring — in part — transactional semantics, which might be of value to people who want to know more about this.
In any case, it was definitely an interesting discovery... since then, I only use shared volumes with the host as backup/restore targets, never for shared anything.
Not in the slightest. Stick your Plex DB on NFS and it’ll get messed up in very short order.
And/or document the options known to cause corruption for this Plex DB usage pattern, e.g.: "Don't use options foo, bar, baz with the Plex DB over NFS, as corruption will occur."
I picked SQLite to reduce moving parts.
SQLite is more like a structured persistent file format with a SQL interface than a database.
SQLite is dead simple compared to Postgres (or any other hosted solution). I can copy a SQLite .db file from a production machine to my local box and be reviewing rows/running queries within seconds. Also, sharing SQLite databases with other users & developers or even using them as a contract between systems is a very compelling set of use cases. Finally, going from a blank Windows machine to a working development (or production) environment is very quick if your application exclusively uses SQLite to store its data. You simply stop thinking about databases after a while because everything just works.
I am surprised by this! Is it really not entirely safe to take a backup while some software is running, has a SQLite database open, and might be writing to it?
That limitation is fine for a big database that has someone knowledgeable administering it who will take care to follow any required backup procedures. But SQLite is often included in software that nontechnical end users will use without even knowing that there's a SQLite file. For example, Firefox and Google Chrome use it.
This seems to mean that if a regular end user is browsing the web while backups run on their computer, it's possible they could be unlucky and get a bad backup that would corrupt the database if they had to restore it.
Am I interpreting that right? If so, it suggests that many/most software projects that include SQLite are doing it in a slightly unsafe way.
(It seems possible that the -journal and -wal files could protect against that, but this SQLite documentation only mentions that they protect you when transactions fail, which is a different case.)
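For what it's worth, SQLite does ship an online backup API (sqlite3_backup_init/step/finish) designed to copy a live database coherently even while other connections are writing; a minimal sketch with error handling trimmed:

    #include <sqlite3.h>

    /* Copy a live database; readers and writers may proceed while this runs. */
    int backup_db(const char *src_path, const char *dst_path) {
        sqlite3 *src = 0, *dst = 0;
        int rc = sqlite3_open(src_path, &src);
        if (rc == SQLITE_OK) rc = sqlite3_open(dst_path, &dst);
        if (rc == SQLITE_OK) {
            sqlite3_backup *b = sqlite3_backup_init(dst, "main", src, "main");
            if (b) {
                sqlite3_backup_step(b, -1);          /* copy every page */
                sqlite3_backup_finish(b);
            }
            rc = sqlite3_errcode(dst);
        }
        sqlite3_close(dst);
        sqlite3_close(src);
        return rc;
    }

A generic file-level backup tool, of course, knows nothing about this API, which is the crux of the concern above.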
For instance, a program may save its state into a file every time it changes by truncating the file and then writing the new contents. If a user were to copy the file after the truncation but before the write, the user would end up with a useless backup.
That said, this is rather unlikely to happen in practice. Programs usually do stuff (including saving files) when they are interacted with, so a backup copied with File Explorer has a very high probability of working, even when the application is running.
Programs can deal with the issue by writing the full new contents to a temporary file and then renaming it into place, but most programs don't bother to. SQLite cannot do that, because it cannot afford full rewrites, as databases tend to store a lot of data.
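A minimal sketch of that write-then-rename idiom (file names are illustrative; the directory fsync needed to make the rename itself durable is omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write the complete new state to a temp file, then atomically swap it in. */
    int save_state(const char *buf, size_t len) {
        int fd = open("state.json.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) != 0) { close(fd); return -1; }  /* contents durable first  */
        close(fd);
        return rename("state.json.tmp", "state.json"); /* atomic replace (POSIX) */
    }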
A backup tool should open the files it backs up in a mode that disallows others from writing to them, and retry that a few times, waiting a few seconds between attempts.
If that keeps failing, there's a problem. Do you skip backing up the file, or retry with an "it's fine if others are reading" mode, at the risk of getting a hybrid of multiple versions? There is no fully correct answer there.
And it gets worse. Not all programs write files in exclusive mode. Some might even open a file multiple times while rewriting it. Also, consistency may span multiple files, for example with macOS packages.
And then, of course, there’s anti-virus tools. They often work below the file system, reading and writing files without the OS being aware of it.
The main reason that backup tools work relatively reliably and speedily at all is that most programs that write to files either close them regularly or just append to them.
Databases are the exception. They tend to keep files permanently open and regularly write to them. If you use SQLite in a 'normal' application for user data, you should flush data after (roughly) every write. (If you use it for caching, you can be a bit more relaxed, as it's relatively fine to discard the database when it gets corrupted.)
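One hedged interpretation of "flushing" in SQLite terms, assuming WAL mode (whether this actually keeps a given backup tool happy depends on timing):

    #include <sqlite3.h>

    /* Push all pending WAL content back into the main .db file; until the
       next write, a plain copy of the .db file is then internally coherent. */
    void flush_to_main_file(sqlite3 *db) {
        sqlite3_exec(db, "PRAGMA wal_checkpoint(TRUNCATE);", 0, 0, 0);
    }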
Generally the preferred approach for backup programs is to trigger a file system snapshot (after quiescing the disks) and back that up. Sometimes (e.g. for databases), the backup program is able to run application-specific logic that cooperates with this (e.g. flushing all database buffers before the snapshot).
For the filesystems that don't offer snapshots, some still have the ability to freeze the file system to allow a consistent backup.
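A rough sketch of that freeze/thaw approach on Linux (requires CAP_SYS_ADMIN; these are the same ioctls the fsfreeze utility uses):

    #include <fcntl.h>
    #include <linux/fs.h>      /* FIFREEZE, FITHAW */
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Freeze a mounted filesystem so a block-level copy sees a single point
       in time, then thaw it to resume writes. */
    int copy_while_frozen(const char *mountpoint) {
        int fd = open(mountpoint, O_RDONLY);
        if (fd < 0) return -1;
        if (ioctl(fd, FIFREEZE, 0) != 0) { close(fd); return -1; }
        /* ... take the snapshot or image the underlying device here ... */
        ioctl(fd, FITHAW, 0);
        close(fd);
        return 0;
    }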
> Programs can deal with the issue by writing the full new contents to a temporary file and then renaming it into place, but most programs don't bother to. SQLite cannot do that, because it cannot afford full rewrites, as databases tend to store a lot of data.
I'm aware you can do that, but most programs don't. If you pick a random program, chances are it won't do that.
I agree the risk is pretty small in practice. Though copying with File Explorer isn't the use case I had in mind. It was automatic scheduled backups. Both Windows and macOS make it pretty easy to set these up with an external hard drive. And those could start running while you're using the computer and/or you could start using the computer while backups are running.
Copying files is only an issue for manual backups done without special software.
> I am surprised by this! Is it really not entirely safe to take a backup while some software is running, has a SQLite database open, and might be writing to it?
I've written backup software: You only know if another process has an open file handle if you write your own driver. (Or integrate with 3rd party drivers, like ELDOS.)
In most cases, writing a driver is overkill, so you have to accept that you might read a file mid-write. In general, the other process should lock the file, but there's no way to force every programmer in the world to write their software to be convenient for you.
On a snapshotting system such as ZFS it should be fine because there won't be a race between updates and copying the WAL file and the DB file. (Haven't investigated it though.)
Suppose that you are copying the file front to back. While you are doing this, an update comes in that changes something in the front and back which refer to each other, while your copy is somewhere in the middle.
The result is that you get a backup with a data structure in the back of the file referring to stuff in the front of the file that isn't there. Your backup copy is therefore corrupted.
This problem is almost impossible to avoid. You just do the best that you can.
Don't do this, by the way; the FS sync in Docker for Windows/Mac is really slow and prone to causing corruption and errors if you do.
>Unfortunately, most consumer-grade mass storage devices lie about syncing. Disk drives will report that content is safely on persistent media as soon as it reaches the track buffer and before actually being written to oxide. This makes the disk drives seem to operate faster (which is vitally important to the manufacturer so that they can show good benchmark numbers in trade magazines). And in fairness, the lie normally causes no harm, as long as there is no power loss or hard reset prior to the track buffer actually being written to oxide. But if a power loss or hard reset does occur, and if that results in content that was written after a sync reaching oxide while content written before the sync is still in a track buffer, then database corruption can occur.
>USB flash memory sticks seem to be especially pernicious liars regarding sync requests. One can easily see this by committing a large transaction to an SQLite database on a USB memory stick. The COMMIT command will return relatively quickly, indicating that the memory stick has told the operating system and the operating system has told SQLite that all content is safely in persistent storage, and yet the LED on the end of the memory stick will continue flashing for several more seconds. Pulling out the memory stick while the LED is still flashing will frequently result in database corruption.
If you were a human rather than sqlite you would stop believing that a sync has occurred immediately, trust it a little bit after a few seconds, and if it's been a minute since you yourself last touched the file, you would think "surely it has been written by now".
So it seems by tracking wall time, the library could do better than have to rely on the drive, and could be more on guard against lying drives.
I spent several years working for a major HDD manufacturer and their firmware QA process included a test rig which would repeatedly pull power to the drive while sending write and flush commands. The drive would then be powered up and the data on the drive examined to verify that all data which should have been flushed actually made it to disk.
I think this myth persists mostly because it can be surprisingly difficult to coax an OS into sending a flush command, for much the same reason disk drives get accused of lying about it: performance.
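For illustration, even asking for a real flush is platform-specific; a sketch assuming POSIX plus the macOS-specific F_FULLFSYNC fcntl:

    #include <fcntl.h>
    #include <unistd.h>

    /* Ask the OS to push data all the way to stable storage. */
    int flush_all_the_way(int fd) {
    #ifdef __APPLE__
        /* On macOS, plain fsync() only pushes data to the drive, not through
           its cache; F_FULLFSYNC asks the drive to flush its cache too. */
        return fcntl(fd, F_FULLFSYNC);
    #else
        /* On Linux, fsync() is documented to include a drive cache flush,
           provided the file system and device honor it. */
        return fsync(fd);
    #endif
    }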
These days the most reliable solution is to reconfigure the drive with write caching disabled. This way the drive will ensure every write hits the disk before reporting completion to the host. Write caching remains enabled by default on SATA drives mostly for historical reasons: back before command queuing was commonly supported, write caching gave a large speedup. These days everything supports NCQ, so disabling write caching has only a small performance impact.
Of course there have also been controller bugs, file system bugs, and so on over the years. A decade ago the behavior of most file systems was quite poor in the face of underlying hardware failures but a lot of work has gone into improving the situation.
SSDs are slowly improving too. Most of them have claimed to go into read-only mode when enough blocks wear out, but in reality most of them would crash, fail to respond, or just lose data. It turns out manufacturers tend to be a bit loose in their claims until independent third parties put those claims to the test. More recent models seem to be a bit better behaved in failure states.
Can you elaborate on tracking wall time: how would you determine the real sync timing? Picking an arbitrarily large delay (a minute) would trash write performance.
So I imagine it could be set up so that any power loss that keeps data from being written within a specified arbitrarily large delay (such as one minute) won't actually corrupt the database - just lose data.
If you close stderr, 2 becomes available and can be handed out.
Also, it's not exactly guaranteed that 0, 1, and 2 are the fd numbers for stdin, stdout, and stderr on program start, but they are hard-coded and standard.
E: It's implementation-defined, i.e. not guaranteed. Your platform may make guarantees about it, but it is not an actual standard.
Yes, it is.
> On program startup, the integer file descriptors associated with the streams stdin, stdout, and stderr are 0, 1, and 2, respectively.
Then, after program startup, if it is closed, then fd 2 could be assigned to something else.
> The following symbolic values in <unistd.h> define the file descriptors that shall be associated with the C-language stdin, stdout, and stderr when the application is started:
> STDIN_FILENO Standard input value, stdin. Its value is 0.
> STDOUT_FILENO Standard output value, stdout. Its value is 1.
> STDERR_FILENO Standard error value, stderr. Its value is 2.
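A tiny demo of the resulting hazard (the file name data.db is illustrative): once fd 2 is closed, the very next open() receives descriptor 2, so anything later written to "stderr" lands in that file:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        close(2);                           /* stderr goes away               */
        int fd = open("data.db", O_RDWR);   /* lowest free fd is 2            */
        write(2, "some log line\n", 14);    /* "logging" scribbles on data.db */
        (void)fd;
        return 0;
    }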
Does this settle it, then? This implies it's pronounced ess-queue-lite. I personally prefer the sequel-lite pronunciation, but if the developers use the former, then I guess I'll do that.
Yeah, seems so: https://english.stackexchange.com/questions/431329/what-is-t...
If instead you are a modern hu-man who uses SQL, you pronounce it ess-qu-ell.
I wonder does that apply to Windows?
Well-designed Windows backup software uses the NTFS shadow copy (VSS) feature. A shadow copy is a snapshot of the complete volume taken when the backup starts. All writes happening after that won't be included in the shadow copy, so the backup software should never read a mix of old and new content.
Does it mean when using VSS on Windows, SQLite databases will be backed up just fine? Or am I missing something?
Another thing that should work well is an AWS EC2 EBS snapshot as Amazon promises a consistent point in time copy.
We use a lot of CSV files in our embedded devices. I personally hate CSV and advocated for SQLite. However, our CSVs are one writer/many readers, and the only way to do that in SQLite is to switch to WAL mode. Unfortunately, WAL apparently doesn't work on JFFS2 filesystems :(...
I'm going to give it another shot on our next-gen product, which uses ubifs.
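For reference, a minimal sketch of requesting WAL mode and checking whether it actually took; the pragma returns the journal mode now in effect, so on an unsupported filesystem you'd see something other than "wal" (the file name sensors.db is illustrative):

    #include <sqlite3.h>
    #include <stdio.h>

    static int print_mode(void *arg, int ncols, char **vals, char **cols) {
        (void)arg; (void)ncols; (void)cols;
        printf("journal_mode is now: %s\n", vals[0]);   /* "wal" on success */
        return 0;
    }

    int main(void) {
        sqlite3 *db;
        if (sqlite3_open("sensors.db", &db) != SQLITE_OK) return 1;
        sqlite3_exec(db, "PRAGMA journal_mode=WAL;", print_mode, 0, 0);
        sqlite3_close(db);
        return 0;
    }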
(This is more ammunition for the idea that the real software isolation boundary on a desktop computer should not be "user", but "software author"!)
For instance, databases are typically accessed through a socket interface, which uses the same pool of file descriptors as open(). So it's probably possible for a database connection to end up with fd 2, and then the write(2, ...) would also send garbage to the database.
Although in this case it probably won't do much unless you're very unlucky, because the message will almost certainly not be valid input for the DB, so it'll either ignore it, return an error, or maybe drop the connection without corruption. It would still be a nasty thing to debug, but not quite as critical.
More generally, there's typically much better isolation between database and process with traditional client/server databases, which makes it difficult for a misbehaving program to mess directly with the DB. And having a server program processing the transactions means that you don't have to rely so heavily on things like filesystem locking to achieve atomicity, for instance. That being said, buggy or limited filesystems can be a problem for any database, for instance when guaranteeing that a transaction has really been committed to disk in case of a power outage. See this, for instance:
An I/O error while obtaining a lock leads to corruption. Probably speaks to my drive/OS choices.
It would be nice if file descriptors on 64-bit platforms used tagging so the full value could never be reused.
E.g. the lower 32 bits are the normal fd and the upper 32 bits are a serial number (or random value). When the fd is used, the kernel can use the lower 32 bits as an index into the table and check that the upper 32 bits are correct.
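A sketch of that scheme (purely hypothetical, not a real kernel API):

    #include <stdint.h>

    /* Hypothetical: descriptors are 64-bit, low half = table index,
       high half = a per-slot generation bumped on every close(). */
    typedef uint64_t tagged_fd;

    static tagged_fd make_fd(uint32_t index, uint32_t gen) {
        return ((tagged_fd)gen << 32) | index;
    }

    /* The kernel would validate the generation before trusting the index,
       so a descriptor held across a close()/reopen cycle fails cleanly. */
    static int fd_valid(tagged_fd fd, const uint32_t *gen_table) {
        uint32_t index = (uint32_t)fd;
        uint32_t gen   = (uint32_t)(fd >> 32);
        return gen_table[index] == gen;
    }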
> 2.14. File Descriptor Allocation
> All functions that open one or more file descriptors shall, unless specified otherwise, atomically allocate the lowest numbered available file descriptor
One could make an unprincipled exception for them, of course. And SQLite would still need the special-casing to prevent using low-numbered fds. But this would prevent other problems and is worth considering.
In the case that you linked to in the sibling comment that you wrote in reply to me, the dup2 syscall is used.
So the question I then have is: in the case you linked, does the code
a) first close fd 2 and then call dup(1), or does it
b) call dup2(1, 2), or does it
c) do something else entirely, or something more complex or convoluted?
In case the answer is b), then it seems to me (though I may be overlooking something) that we could allow calls like b) and, inside of the dup2 syscall, keep the id of the fd that we are using as newfd, since it is unambiguously requested that this is intended, while at the same time never reusing fd ids in the dup syscall.
In this comment I am referring to the dup and dup2 syscalls as defined in the Linux man pages:
dup() uses the lowest-numbered unused descriptor for the new descriptor.
dup2() makes newfd be the copy of oldfd, closing newfd first if necessary, but note the following: [...]
With the modification that “lowest-numbered unused descriptor” would be unique as previously stated in the comment that the two of you are replying to.
More generally, the question is: does any widespread piece of code manually close an fd and then call the dup syscall, expecting the fd id to be exactly the same as the one it closed, instead of using dup2?
And, likewise, does any widespread piece of code use the dup2 syscall to specify the newfd where it actually should not have cared about the id of the newfd, and should have used dup instead and let the system decide?
write(STDERR_FILENO,"test",4); // OK, write to normal stderr
close(STDERR_FILENO);          // deliberately free fd 2
open(....);                    // Open some log file; it gets the lowest free fd, i.e. 2
write(STDERR_FILENO,"x",1);    // OK, write to log file
So the conclusion is that the answer to my original question is probably, like you say, that we can't just do that.
But it would be interesting to find out how many widespread pieces of open source code do it that way; for any code that does, one could submit patches to change it to use the explicit dup2.
It might be possible with some hacks/heuristics to catch many errors (perhaps one could create a Valgrind tool/sanitizer), but I suspect it's not possible in general, since there's no way of knowing whether the call to write(2, ...) meant the old stderr or the new one.
How weird would it look? How likely is it to arise in practice?
I'm a believer in standards in general, but there's a time and place to move beyond them too.
Does it really? Certainly a few filesystems like ZFS have features like that, but AFAIK none of the "mainstream" ones do (the things actually running on billions of devices, like ext4, NTFS, etc)
No. Postgres has an off-by-default option for this (data checksums). The write-ahead log is checksummed, though.
> Of course the filesystem usually does
Nope. Data checksumming is usually only employed by CoW file systems, because on a regular file system it requires data journaling, which generally isn't used, or isn't supported at all.
Is there any information whether taking a consistent snapshot (that contains both the DB file and journals) is safe?
If you copy while transactions are ongoing, you're reading different parts of a file at different points in time. For example, the beginning of the sqlite3 file has already been copied, but by the time you get to the log file, blocks have been committed back into the beginning of the sqlite file.
If you take a consistent snapshot of the filesystem first, I would think that would be the same as a crash in the middle of a transaction. Things should be in a consistent state that the recovery code at startup can handle.
But theoretically one could build something that handles up to X random bit flips, so it's maybe worth mentioning that SQLite doesn't handle this (even though probably no one else does either).