The only way a distributed file system such as the one I work on can provide sane behavior and decent performance to our users is to use local file systems only for coarse-grained space allocation and caching. Sometimes those magic incantations from ten-year-old LKML posts don't really work, because they were never really tested for more than a couple of simple cases. Other times they have unexpected impacts on performance or space consumption. Usually it's easier and/or safer just to do as much as possible ourselves. Databases - both local and distributed - are in pretty much the same boat.
Some day, I hope, all of this scattered and repeated effort will be combined into a common library that Does All The Right Things (which change over time) and adds features with a common API. It's not quite as good as if the stuff in the kernel had been done right, but I think it's the best we can hope for at this point.
1. File systems (single machine)
2. Distributed file systems
It looks like one of the authors cited in this article has written a paper analysing ZFS - though they admittedly don't test its behaviour on crashes. Citation here, in PDF form:
(edited to add: This only deals with the second part of this article. The first part would still be important even on ZFS)
ZFS also includes features around checksumming of the metadata. "Silent" write errors become loud the next time the data is accessed and the checksums don't match. This can't prevent all errors, but it has some very nice data integrity properties - combined with its RAID format, you can likely recover from most any failure, and with RAIDZ2 you can recover from scattered failures on all drives even if one drive has completely died. This is actually fairly common - modern drives are very large, and rust is more susceptible to 'cosmic rays' than one might think.
The same logic also applies to directories, although you will have to use links or symlinks to get something truly atomic.
It may not work on strangely configured systems, e.g. if your files are spread over different devices across the network (or maybe with NFS). But in those cases you will be able to detect it if you catch the errors from rename() and co (and you should catch them, of course). So no silver bullet here, but still a good shot.
Note that in the general case, you should fsync() the temporary file before you rename() it over the original - but ext3 and ext4 in writeback mode added a heuristic to do that automatically, because ext3 in the default ordered mode would effectively do that and many applications came to assume it.
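For reference, the full ritual looks something like this in C - a minimal sketch with abbreviated error handling, where the paths are placeholders:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Atomically replace final_path with new contents via a temp file. */
    int atomic_replace(const char *dir_path, const char *final_path,
                       const char *tmp_path, const void *buf, size_t len)
    {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }

        /* 1. Make the new contents durable before the rename. */
        if (fsync(fd) < 0) { close(fd); return -1; }
        if (close(fd) < 0) return -1;

        /* 2. Atomically swap the new file into place. */
        if (rename(tmp_path, final_path) < 0) return -1;

        /* 3. Make the rename itself durable by fsyncing the directory. */
        int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return -1;
        if (fsync(dfd) < 0) { close(dfd); return -1; }
        return close(dfd);
    }

On filesystems with the heuristic mentioned above, step 1 happens implicitly - but relying on that ties your correctness to the mount options.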
I was saddened when I learned this. I used to use this old trick for a lot of my systems. I learned it from reading the code of old well-written unix systems programs.
Made me chuckle. I've been told off by a former Googler colleague enough times now to have learned that Gmail is more complex than anyone imagines on a first guess, precisely in order to be "reliable".
So to anyone relying on gmail for safe keeping: Don't.
Please write to support and ask them to investigate. I used to work on the Gmail backend team, so I know bug reports regularly make it through support to an engineer. I also know they take data integrity quite seriously and have a variety of tools to investigate (potential) problems, both proactively and in response to user complaints such as yours. They also keep redundant copies of everything.
Do you feel you are competent enough (in e.g. SMTP and MIME) to distinguish between MIME-encoded base64 inline images (super rare) and references to images on external web sites (a lot more common)? The latter, in the case of old web sites, are quite likely to stop working if you revisit old mails.
Did you make the effort to verify that it was the first type when you witnessed this?
Also - your suggestion would have been a lot less douchey without that "laff".
It was inlined by means of a data URL in an img tag, in case you want the details ;)
And no offense taken. It's a fair question.
Keeping your "deleted" stuff hidden wouldn't hurt them.
Breaking an mbox is an extremely simple thing, as the format leaves no possibility of error checking, parallel writing, recovering lost data, or anything else.
Outlook's mail folders are marginally better, allowing for error detection, but really, that's a lame first paragraph for introducing a great article.
I was shocked to discover that mke2fs doesn't check the return value of its final fsync call. This is compounded by the fact that pwrite calls don't fail across NBD (the writes are cached, so the caller's stack is long gone by the time they get flushed across the network and fail...)
As a test, I created an nbdkit plugin which simply throws away every write. Guess what? mke2fs will happily create a file system on such a block device and not report failure. You only discover a problem when you try to mount the file system.
(I like to test my storage code by running it on top of a network block device and then simulating power failure by cutting the connection randomly. This is one problem I found while doing that.)
I'd expect more "smartness" from Linux - like flushing earlier as soon as there is no "write pressure".
No, because correct programs use sync() and/or fsync() to force timely flushes.
A good database should not reply successfully to a write request until the write has been fully flushed to disk, so that an "acknowledged write" can never be lost. Also, it should perform write and sync operations in such a sequence that it cannot be left in a state where it is unable to recover -- that is, if a power outage happens during the transaction, then on recovery the database is either able to complete the transaction or undo it.
The basic way to accomplish this is to use a journal: each transaction is first appended to the journal and the journal synced to disk. Once the transaction is fully on-disk in the journal, then the database knows that it cannot "forget" the transaction, so it can reply successfully and work on updating the "real" data at its leisure.
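As a minimal sketch in C (the journal fd is assumed to be opened with O_APPEND; the flat record format is just for illustration):

    #include <unistd.h>

    /* Append one transaction record to the journal; only acknowledge
       the client after fsync() succeeds. */
    int journal_commit(int jfd, const void *record, size_t len)
    {
        if (write(jfd, record, len) != (ssize_t)len) return -1;

        /* After this fsync the transaction cannot be "forgotten";
           the real data structures can be updated lazily later. */
        return fsync(jfd);
    }

A real journal would also prefix each record with a length and checksum, so a torn final record can be detected and discarded on recovery.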
Of course, if you're working with something that is not a database, then who knows whether it syncs correctly. (For that matter, even if you're working with a database, many have been known to get it wrong, sometimes intentionally in the name of performance. Be sure to read the docs.)
For traditional desktop apps that load and save whole individual files at a time, the "write to temporary then rename" approach should generally get the job done (technically you're supposed to fsync() between writing and renaming, but many filesystems now do this implicitly). For anything more complicated, use sqlite or a full database.
> I'd expect more "smartness" from Linux - like flushing earlier as soon as there is no "write pressure".
Well, this would only mask bugs, not fix them -- it would narrow the window during which a failure causes loss. Meanwhile it would really harm performance in a few ways.
When writing a large file to disk sequentially, the filesystem often doesn't know in advance how much you're going to write, but it cannot make a good decision on where to put the file until it knows how big it will be. So filesystems implement "delayed allocation", where they don't actually decide where to put the file until they are forced to flush it. The longer the flush time, the better. If we're talking about a large file transfer, the file is probably useless if it isn't fully downloaded yet, so flushing it proactively would be pointless.
Also flushing small writes rather than batching might mean continuously rewriting the same sector (terrible for SSDs!) or consuming bandwidth to a network drive that is shared with other clients. Etc.
If I get a corruption once in 100 outages instead of on every one, I'm satisfied. That it "masks" anything is not an argument at all.
The writes happen in bursts. The behavior of the bursts won't change if, once a burst is over, one more write is flushed (for example) a second later instead of after waiting 30.
The "delayed allocation" is a red herring: in the optimal case, the software can instruct the filesystem to preallocate the whole file size without having to actually fill the content. If it's not common by some specific applications on Linux, that's the place to fix it.
Modern drives are ECC'd to hell and back; on enterprise systems (i.e. ones with ECC RAM and ECC'd buses), a sector that comes back bad is likely the result of a software/firmware bug somewhere, and in many cases was written bad (or simply not written) in the first place.
Put another way, if you read a sector off a disk and conclude that it was bad, and a reread returns the same data, it was probably written bad. The fun part is then taking the resulting "bad" data and looking at it.
Reminds me of something early in my career: a Linux machine we were running a CVS server on reported corrupted CVS files once or twice a year, and when looking at them, I often found data from other files stuck in the middle, often in 4k-sized regions.
On Linux you have to drop_caches and then re-read the file to get its checksum to be sure. As far as I know, a per-buffer or per-file drop_cache isn't available, and if you're doing a system-wide drop_caches you are invalidating the good entries along with the bad ones.
And what if the device maintains its own cache as well, in addition to the buffer cache?
Can someone clarify?
There's always going to be a place where errors can creep in. There are no absolute guarantees; it's a numbers game. We've got more and more data, so the chance of corruption increases even if the per-bit probability stays the same. Checksumming reduces the per-bit probability across a whole layer of the stack - the layer where the data lives longest. That's the win.
I was asking this thinking of open(<file>, O_DIRECT|O_RDONLY);
that bypasses the buffer cache and reads directly from the disk, which at least solves the buffer cache part, I guess. The disk cache is another thing, i.e. if we disable it we are good, at the cost of performance.
I was pointing out that tests can do these kinds of things.
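For what it's worth, such a test read would look roughly like this on Linux (a sketch; O_DIRECT requires the buffer, offset, and length to be aligned, and the 4096 here is an assumption standing in for the device's logical block size):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read the first 4 KiB of a file while bypassing the page cache.
       This sidesteps the buffer cache, but not the disk's own cache. */
    ssize_t read_uncached(const char *path, void **out)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return -1;

        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { free(buf); return -1; }

        ssize_t n = read(fd, buf, 4096);
        close(fd);
        *out = buf;  /* caller frees */
        return n;
    }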
Well, I'd need to be sure that "someone else" does things properly. My experience with various "someone elses" so far hasn't been stellar — most services I've tried were just a fancy sticker placed onto the same thing that I'm doing myself.
Still agree that we need checksumming filesystems though. That, and ECC RAM, to make the data written more trustworthy.
I wouldn't assume that it was a cable error. The SATA interface has a CRC check. So the odds are quite high that a single error would simply result in a retransmission.
Of course, a plethora of detected SATA CRC errors and resulting retransmissions means that an undetected error could readily slip thru. There should be error logs reporting on the occurrence of retransmission, but I'm not enough of a software person to know how possible / easy it is to get that information from a drive or operating system.
OTOH, as you mention later in your post, a single bit error in non-ECC RAM could easily result in a single block having a checksum error. Exactly what you saw!
Simply observing that an error only occurred once does almost nothing to narrow down the possible causes. You have to also be keeping track of all the error reporting facilities (SMART, PCIe AER, ECC RAM, etc.).
I also have high power costs, so running something like a Synology DS415 would cost $50 a year in power while barely being used - although that's better than older models.
But the exact nature of the problem is a distinction without a huge amount of difference to me. If I was copying those files, the copies would be silently corrupt. If I was transcoding or playing videos, the output would have glitches. Etc.
With this many HDDs, there are necessarily more components in the setup, and more things that can go wrong. Meanwhile, I'm not a business customer with profitable clients I can sell extra reliability to, so it's not the most expensive kit I could buy. I went as far as getting WD Red drives, and even then they were misconfigured by default, with an overly aggressive idle timer (8 seconds!) that needed tweaking.
The main thing is: more and bigger drives means increased probability of corruption.
But that means his pool performance is now shit.
> The only way to really be sure is to check your files after writing them that they match.

You generally can't make that assumption about a black box, so reading back your writes guarantees nothing. Unless you're intimately familiar with your underlying block device you really can't guarantee anything about writes going to physical hardware. All you can do is read its documentation and hope for the best.

If you need a general hack that's pretty much guaranteed to flush your writes to a physical disk, it would be something like: after your write, append X random bytes to a file, where X is greater than your block device's advertised internal memory, then sync - the device's cache can't hold all of it, so your original writes have to land on physical media.
"that when they came up with threads, locks, and conditional variables at PARC, they thought that they were creating a programming model that anyone could use, but that there’s now decades of evidence that they were wrong. We’ve accumulated a lot of evidence that humans are very bad at reasoning at these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems."
There were quite a few ways of organizing systems, including objects and functions, that the old guard came up with. UNIX's popularity, and the organization style of some others, pushed the file-oriented approach from mainstream into outright dominance. However, many of us long argued it was a bad idea, and alternatives exist that just need work on the implementation side. We actually saw some of those old ideas put into practice in data warehousing, NoSQL, "the cloud," and so on. We just gotta do more of it, as there's no technical reason for dealing with the nonsense in the write-up - only our avoidance of getting our hands dirty with the replacements.
> Like that my long-time favorite, XFS, totally kicks ass in the
It's written in 2004 so I don't know how current it is, but it makes the point that XFS makes certain performance & safety guarantees essentially assuming that you're running on hardware that has a UPS with the ability to interrupt the OS saying "oops, we're out of power".
XFS is becoming the sane default filesystem for servers, as it allocates inodes more consistently than the other current mainstream Linux options on multi-disk systems. Basically, small servers now have more disk space and performance than the large systems of 2004. So XFS stayed put where it starts to make sense, and systems grew into its sweet spot much more often.
All files are periodically (default: daily) written to a block-coalescing WORM drive, and you can rewind the state of the file system to any date on a per-process basis - handy for diffing your codebase, etc.
For a while, removing the "rm" command was considered, to underline the permanence of files, but keeping it for the removal of temporary data during the daytime hours turned out to be more pragmatic.
Most users of this pattern (concurrent updates to log files) get away with it because their updates are smaller than 4k, but if you're trying to write something as big as an e-mail with this pattern, you can trivially get interleaved writes resulting in corruption.
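The usual way to stay on the safe side of that limit is to open with O_APPEND and emit each record with a single write() call, e.g. (a minimal sketch):

    #include <string.h>
    #include <unistd.h>

    /* fd is assumed to be opened with O_WRONLY | O_CREAT | O_APPEND.
       One write() per record means concurrent appenders can't
       interleave inside a record - but only while records stay
       below the filesystem's atomic-append limit. */
    int log_append(int fd, const char *record)
    {
        size_t len = strlen(record);
        return write(fd, record, len) == (ssize_t)len ? 0 : -1;
    }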
Our hypothesis in beginning this research was simply that the complex optimizations applied in current file system technology doesn't carry over to SSDs given such dramatic changes in performance characteristics. To explore this hypothesis, we created a very simple file system research vehicle, IotaFS, based on the incredibly simple and small Minix file system, and found that with a few modifications we were able to achieve comparable performance to modern file systems including Ext3 and ReiserFS, without being nearly as complex.
1. Write the changed data into a temporary file in the same directory (don't touch the original file)
2. Move the new file over the old file
Does this lead to a simpler strategy that is easier to reason about, where it is less likely for programmer to get it wrong? At least I see this strategy being applied more often than the "single file + log" approach.
The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place.
This is the fundamental problem: when you allow the OS (or the compiler, or the CPU) to re-order operations in the name of efficiency, you lose control over intermediate states, and so you cannot guarantee that these intermediate states are consistent with respect to the semantics that you care about.

And this is the even more fundamental problem: our entire programming methodology has revolved around describing what we want the computer to do rather than what we want it to achieve. Optimizations then have to reverse-engineer our instructions and make their best guesses as to what we really meant (e.g. "This is dead code. It cannot possibly affect the end result. Therefore it can be safely eliminated."). Sometimes (often?) those guesses are wrong. When they are, we typically only find out about it after the mismatch between our expectations and the computer's has manifested itself in some undesirable (and often unrecoverable) way.
No, it can work, provided that the temporary file is fsynced before being renamed, the parent directory is fsynced after renaming the file, and that the application only considers the rename to have taken place after the parent directory is fsynced (not after the rename call itself).
 Here's follow(1) in my homemade PL:
    (when (not-eq? (length script-args) 1)
      (write (sprintln "usage: " script-name " <path>") stderr)
    (let path (car script-args)
         f (open path)
      (let stat-now (stat f)
        (seq (when (and? stat-then
                         (lt? (find 'size stat-now)
                              (find 'size stat-then)))
               # The file was truncated. Read from the start.
               (seek s-set 0 f))
             (let bytes (read default-chunk f)
               (seq (when bytes (print bytes))
if you have long lived sessions tail'ing logs that get rotated (or truncated), forgetting -F is a good way to eventually end up confused, ctrl-c'ing, and finally cursing.
* Single file + log: fine grained locking in a shared C data structure. Yuck!
* Write new then move: transactional reference to a shared storage location, something like Clojure's refs. Easy enough.
The latter clearly provides the properties we'd like; the former may, but it's a lot more complicated to verify and there are tons of corner cases. So I think "move new file over old file" is the simpler strategy and way easier to reason about.
Clojure's approach again provides an interesting solution to saving space. Taking the idea of splitting data into multiple files to the logical conclusion, you end up with the structure sharing used in efficient immutable data structures.
The linked paper mentions that many popular programs forget about (1).
Mail servers and clients (MTAs and MUAs) have been using Maildir format to escape this problem since 1996.
Filesystems have evolved. ZFS has been available for ten years.
If you take your data seriously, you do need to make informed choices. But this article isn't targeted at people who won't know about Maildir and ZFS.
The section about relative data loss failures between various applications is great. Again, careful choices: SQLite and Postgres.
The qmail spec describes queuing and delivery in a filesystem-safe manner.
Perhaps you should examine the section discussing the various problems with file systems. ZFS is hardly immune from these problems.
Origins: one starts out with proprietary developed functionality and with tons of money in a skunkworks project, using a consolidated team of developers mainly within one company through the fs development and stabilization; the other starts with an idea in a research paper (not called btrfs then because it wasn't an implementation), slow recruitment that'd eventually involve dozens of major companies, and is on a split stabilize and feature development track.
And then if you go back and look at the features (commands in particular) in more detail, there are some big differences that may or may not affect specific workflows. E.g. Btrfs snapshots are unique and independent file trees; in effect there's no distinction between the original and the snapshot, both are read-writable, modifying one doesn't affect the other, and either can be deleted without affecting the other (ZFS has a distinct parent-child relationship between filesystems, clones, and snapshots). Growing Btrfs volumes by adding devices (then at some point balancing, which may be optional depending on the utilization and layout) is a lot easier and faster than in ZFS. ZFS offers RAID-Z3 (3 parity drives); Btrfs does not. Btrfs offers online defragmentation; ZFS does not.

ZFS is more stable, for $reasons. Really, even ext4 and XFS are considered more stable still, although I hesitate to say Btrfs is not stable or unstable. I'd say Btrfs is earning trust, where the others have earned it (with very edge-case exceptions, and there simply are more such edge cases for Btrfs still being found and fixed).
So they're actually quite different.
I've been a heavy mail user for years... I've never encountered data loss due to a file system problem, and honestly I can't think of a time in the last decade when anyone I'm acquainted with has. (And I ran a very large mail system for a long time.)
Hell, I've been using OS X predominantly for years now, and that garbage file system hasn't eaten any data yet!
There are problems, even fundamental problems, but if someone is literally unable to use any mail client, you need to look at the user before the file system.
20 years ago Mac OS crashed often, and had a file system not designed to account for that. OS X even shipped with non-journaled HFS+. It was only into the 3rd major release of OS X that journaling appeared. Data corruptions, I feel, dropped massively, because the OS didn't crash nearly as often, but did still crash. In the last 4-5 years I'd say I get maybe one or two kernel panics per year on OS X, which is a lot less than I get on Linux desktops. But even still on Linux desktops, I can't say when I've seen file system corruption not attributable to pre-existing hardware issues.
NTFS consistency checks and recovery are pretty good. But they won't bring your data back.
Microsoft's ReFS (Resilient File System) might give storage reliability to Windows one day. On Linux you can use ZFS or btrfs (with some reservations) today.
> ...work even if the power is cut?
If power is cut during NTFS logfile update, all bets are off. Hard disks can and will do weird things when losing power unexpectedly. That includes writing incorrect data, corrupting nearby blocks, corrupting any blocks, etc. That includes earlier logfile data, including checkpoints.
I wonder if the article and the papers it's based on account for how the VFS actually behaves, and whether someone wanting to do more research in this area could investigate this while accounting for the recent VFS changes. On Linux in particular I think this is a bigger issue, because there are so many file systems the user or distro could have picked, totally unbeknownst to and outside the control of the app developer.
We have libraries to implement "best practices" for network I/O, portable across systems that use poll or epoll or kqueues with best performance on each etc. What we need is the same thing for file I/O. However imperfect it might be, it would be better than where we are now.
I mean, I thought this was a major reason for VFS abstraction between the application and kernel anyway. It's also an example of the distinction between open source and free (libre). If as an application developer you have to know such esoterics to sanely optimize, you in fact aren't really free to do what you want. You have to go down a particular rabbit hole and optimize for that fs, at the expense of others. That's not fair choice to have to make.
The article doesn't say but I suspect most of the issues it mentions can be mitigated by mounting with the "sync" and "dirsync" options, but that absolutely kills performance.
The APIs involved could definitely be friendlier, but the app developer is using an API that's explicitly performance oriented by default at the cost of safety, and needs to opt-in to get safer writes. Whether the default should be the other way around is one matter, but ultimately someone has to pick which one they want and live with the consequences.
Unfortunately, this is not the case for SSDs.
It wouldn't solve problems with the block device layer itself, but it'd be interesting to see how robust and/or fast it was in real life.
The software product I develop stores its data in a local database using H2, stored in a single file. Users tend to have databases gigabytes in size.
After reading this article, I start to understand why my users occasionally get corrupted databases. A hard problem to solve.
If you start at the bottom of the stack you have all sorts of issues with drives, their firmware and controllers doing their own levels of caching & re-ordering. Over the years the upper layer developers (block layer / fs) had to deal with hardware that simply lies about what happened or is just plain buggy.
Does anyone know if any of this applies to Java's IO operations? I'm sure you can force some of this behaviour, but for instance: will the flush method on OutputStream ensure a proper sync, or is that again dependent on the OS and file system, as described in the article for syscalls?
It's only logical if you think about the bigger picture. Does Java have access to the underlying disk device, or does it work through the filesystem? And which component is responsible for the filesystem?
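To be concrete: OutputStream.flush() only pushes the stream's user-space buffer down to the OS, like fflush() in C; it does not sync. For durability you need FileDescriptor.sync() or FileChannel.force(), which map to fsync(). The two levels, in C terms:

    #include <stdio.h>
    #include <unistd.h>

    void flush_levels(FILE *f)
    {
        fflush(f);         /* user-space buffer -> kernel page cache
                              (this is all OutputStream.flush() does) */
        fsync(fileno(f));  /* kernel page cache -> storage device
                              (this is FileDescriptor.sync())         */
    }

So even after flush(), you're exposed to everything the article describes; and even after sync(), you're exposed to whatever the disk's own write cache does.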
Can't remember if we switched to ext3 in ordered or data-journaled mode but it made an immense difference...
If the block layer supported it, solving the problem of fencing an ATA device would be as simple as issuing a whole device flush instead of a SYNC_CACHE with range. Which for soft raid devices would make a huge impact because only the devices with dirty data need to be flushed.
Of course the excuse today, is that most scsi devices can't do range based sync either, and just fall back to whole device because no one uses the command. Chicken/egg.
This is especially strange as the author then mentions email there, without providing an actual email address:
| Please ping me on twitter if it looks like your email got eaten by my spam filter
Maybe the author expects a reader to brute-force lots of addresses "something [at] danluu.com"?
His email address is also given prominently on his linked resume: https://github.com/danluu/tex-resume
(periods are optional in Gmail addresses)
It gets more "fun" when you consider many (most? all?) hard disks can get corrupted without checksum failures.
Layer "violators" like ZFS and btrfs do have checksums.
Maybe conventional block / filesystem layering itself is faulty.
In particular: checksumming blocks alongside the blocks themselves (as some controllers and logical volume managers do) handles corruption within blocks, but it cannot catch dropped or misdirected writes. You need the checksum stored elsewhere, where you have some idea what the data _should_ be. Once we (as an industry) learned that those failure modes do happen and are important to address, the old layers no longer made sense. (The claim about ZFS is misleading: ZFS _is_ thoughtfully layered -- it's just that the layers are different, and more appropriate given the better understanding of filesystem failure modes that people had when it was developed.)
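A toy illustration of why the checksum's location matters (all names here are invented; ZFS's actual block pointers embed checksums in a similar spirit):

    #include <stddef.h>
    #include <stdint.h>

    /* Toy checksum (FNV-1a); ZFS uses e.g. fletcher4 or sha256. */
    static uint64_t csum(const uint8_t *p, size_t n)
    {
        uint64_t h = 1469598103934665603ull;
        while (n--) h = (h ^ *p++) * 1099511628211ull;
        return h;
    }

    /* The parent records what the child's checksum SHOULD be. */
    struct blkptr {
        const uint8_t *addr;     /* where the child block lives        */
        size_t         len;
        uint64_t       expected; /* checksum stored AWAY from the data */
    };

    /* A dropped or misdirected write leaves stale but internally
       consistent data at addr; a checksum stored inside the block
       itself would still validate, but the parent's copy exposes it. */
    int read_verified(const struct blkptr *bp)
    {
        return csum(bp->addr, bp->len) == bp->expected ? 0 : -1;
    }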
Fortunately, it's easy to detect programmatically. I have a little shell script which flags problems in my Markdown files: http://gwern.net/markdown-lint.sh
In this case, you can pipe Pandoc's output through elinks -dump to get a rendered text version, and then simply grep the plain text for various things like -e '(http' -e ')http' -e '[http' -e ']http'.
Files are of one of the following types: unit files, log files, temporary files, and managed files.
For a unit file, the unit of consistency is the entire file. Unit files can be created or replaced, but not modified. Opening a unit file for writing means creating a new file. When the new file is closed successfully, the new version replaces the old version atomically. If anything goes wrong, including a system crash, between create and successful close, including program abort, the old version remains and the new version is deleted. File systems are required to maintain that guarantee.
Opens for read while updating is in progress reference the old version. Thus, all readers always see a consistent version.
They're never modified in place once written. It's easy for a file system to implement unit file semantics. The file system can cache or defer writes. There's no need to journal. The main requirement is that the close operation must block until all writes have committed to disk, then return a success status only if nothing went wrong.
In practice, most files are unit files. Much effort goes into trying to get unit file semantics - ".part" files, elaborate file renaming rituals to try to get an atomic rename (different for each OS and file system), and such. It would be easier to just provide unit file semantics. That's usually what you want.
Log files are append-only. The unit of consistency is one write. The file system is required to guarantee that, after a program abort or crash, the file will end cleanly at the end of a write. A "fsync" type operation adds the guarantee that the file is consistent to the last write. A log file can be read while being written if opened read-only. Readers can seek, but writers cannot. Append is always at the end of the file, even if multiple processes are writing the same file.
This, of course, is what you want for log files.
Temporary files disappear in a crash. There's no journaling or recovery. Random read/write access is allowed. You're guaranteed that after a crash, they're gone.
Managed files are for databases and programs that care about exactly when data is committed. A "write" API is provided which returns a status when the write is accepted, and then makes an asynchronous callback when the write is committed and safely on disk. This allows the database program to know which operations the file system has completed, but doesn't impose an ordering restriction on the file system.
This is what a database implementor needs - solid info about if and when a write has committed. If writes have to be ordered, the database program can wait for the first write to be committed before starting the second one. If something goes wrong after a write request was submitted, the caller gets status info in the callback.
This would be much better than the present situation of trying to decide when a call to "fsync" is necessary. It's less restrictive in terms of synchronization - "fsync" waits for everything to commit, which is often more than is needed just to finish one operation.
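For concreteness, the managed-file API might have a shape like the following (every name below is hypothetical; nothing like it exists in POSIX today):

    #include <sys/types.h>

    /* Hypothetical: called by the file system once this specific
       write is durably on disk (or has failed), with its status. */
    typedef void (*commit_cb)(void *cookie, int status);

    /* Hypothetical: returns 0 once the write is accepted; the
       callback fires asynchronously when the data is committed. */
    int managed_write(int fd, const void *buf, size_t len, off_t off,
                      commit_cb cb, void *cookie);

A database would issue managed_write() for a journal record and acknowledge the client only from inside the callback; ordering becomes a wait on one specific callback rather than a whole-file fsync.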
This could be retrofitted to POSIX-type systems. If you start with "creat" and "O_CREAT" you get a unit file by default, unless you specify "O_APPEND", in which case you get a log file. Files in /tmp are temporary files. Managed files have to be created with some new flag. Only a small number of programs use managed files, and they usually know who they are.
This would solve most of the problems mentioned in the original post.
PG and company still think this is a great idea, because the software this forum runs on apparently also does the file system-as-database thing.
No wonder Viaweb was rewritten in C++.
EDIT: to those downvoting, my characterization of PG's response is correct. Its arrogance is undeniable, with its air of "look how smart I am; I see through the marketing hype everyone else falls for," as is its cluelessness, with PG advocating practices that will cause data loss. Viaweb was probably a buggy, unstable mess that needed a rewrite.
Well, FWIW, it's working. At this instant his total karma is +10, he only has 2 posts, and the 2nd post is slightly greyed. So that means that his original comment is now in the range of +10.
Sadly, the fact that I'm replying here probably means that I need to get a life!
WAFL uses copy on write trees uniformly for all data and metadata. No currently live block is ever modified in place. By default a checkpoint/snapshot is generated every 10 seconds. NFS operations since the last checkpoint are written into NVRAM as a logical recovery log. Checkpoints are advanced and reclaimed in a double buffering pattern, ensuring there's always a complete snapshot available for recovery. The filer hardware also has features dedicated to operating as a high availability pair with nearly instant failover.
The netapp appliances weren't/aren't perfect, but they are far better than you're assuming. They were designed to run 24/7/365 on workloads close to the hardware bandwidth limits. For most of the 2000's, buying a pair of netapps was a simple way to just not have issues hosting files reliably.
Perhaps you should take your own advice and dial back the arrogance a bit.
Or maybe this massive non-sequitur is reason enough for downvotes. Also note this from the link you provided:
> But when we were getting bought by Yahoo, we found that they also just stored everything in files
I think PG was pretty much right in his judgement. Any file system is going to be pretty good on a reliable block store, such as Netapp.
PG obviously thinks RDBMSs are just unnecessary middlemen between you and the same file system. He doesn't realize that even if they ultimately use the same file system you do, they likely don't use it the way you do. Maybe Viaweb used something like the last snippet of code in the article, but I doubt it.
Or you can read the patent itself: https://www.google.com/patents/US5819292
I was curious why pg did it that way. Here's a brief comment from him:
>What did Viaweb use?
>pg 3160 days ago
>Keep everything in memory in the usual sort of data structures (e.g. hash tables). Save changes to disk, but never read from disk except at startup.
So, similar to what Redis does today, but a decade before Redis, and likely faster than the databases of the day. That could have been important with loads of users editing their stores on slow hardware. Anyway, it worked: they beat their 20-odd competitors and got rich. I'm sceptical that it was a poor design choice.
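The pattern is simple enough to sketch (a toy version with invented names; real durability would want fsync and checksummed records):

    #include <stdio.h>

    /* All reads come from in-memory structures; the disk is just an
       append-only change log that is replayed once, at startup. */
    static void apply(const char *change) { (void)change; /* update hash tables etc. */ }

    void replay(FILE *log)  /* startup: the only time we read the disk */
    {
        char line[4096];
        while (fgets(line, sizeof line, log))
            apply(line);
    }

    void record(FILE *log, const char *change)
    {
        apply(change);                 /* memory first...              */
        fprintf(log, "%s\n", change);  /* ...then append the change    */
        fflush(log);                   /* fsync(fileno(log)) for real
                                          durability                   */
    }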
At the time, IMHO, PG was indeed smart to be one of the few using FreeBSD as opposed whatever the majority were using.
But he has admitted they struggled with setting up RAID. They were probably not too experienced with FreeBSD. I am sure they had their fair share of troubles.
PG's essays and his taste in software are great and the software he writes may be elegant, but that does not necessarily mean it is robust.
Best filesystem I have experienced on UNIX is tmpfs. Backing up to permanent storage is still error-prone, even in 2015.
Why was it a better OS choice at the time than, say, Solaris or IRIX or BSDI?
> I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox.
If I were to get so many emails that it was corrupt my mailbox, I'd first ask myself why, and how to stop that.
What I would ask is: is there a way I can solve this problem without having to totally rearrange how I use email? I'm curious whether the author looked at, say, running a personal MDA that served mail over IMAP, so that it could be interacted with via a desktop email client, without requiring that client to serve as the point of truth. Not to say that corruption couldn't still happen that way, but Thunderbird (for example) can be configured to store only a subset of messages locally, or none at all. With a reasonably fast connection to the MDA, this seems like a possibly workable solution.
Not necessarily. For example, if you have monitoring software sending you all email notifications, you could change that to just write records to a database instead.