Deferred logging is a huge deal! Appending a record to
an on-disk log takes half a rotation on average (and up
to a full rotation), and a rotation takes about the same
amount of time as a disk seek. At 7200 RPM, for example,
a rotation is roughly 8.3 milliseconds, so a DB that
synchronously appends to the on-disk log for each
transaction is limited to at most a few hundred
transactions per second.
The huge win is if you can append many transactions to
the log in each rotation. To do that you have to gather
up many updates per disk operation. So deferred logging
is critical.
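Here's a minimal sketch, in C, of what gathering up
updates per disk operation might look like. The file
name, record format, and fixed-size buffer are all
invented for illustration, and a real DB would have many
concurrent transactions sharing each flush; the point is
just that a single write() plus a single fsync() makes a
whole batch of transactions durable at once.

    /* Minimal deferred-logging sketch: buffer several transactions'
     * log records in memory, then push them all to disk with one
     * write() + one fsync(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define LOGBUF_SIZE (1 << 20)

    static char   logbuf[LOGBUF_SIZE];
    static size_t logused = 0;
    static int    logfd = -1;

    /* Stage one transaction's log record in memory; no disk I/O yet. */
    int log_append(const void *rec, size_t len)
    {
        if (logused + len > LOGBUF_SIZE)
            return -1;                  /* caller should flush first */
        memcpy(logbuf + logused, rec, len);
        logused += len;
        return 0;
    }

    /* Push everything buffered so far to the disk in one operation.
     * Every transaction appended since the last flush becomes
     * durable here. */
    int log_flush(void)
    {
        if (logused == 0)
            return 0;
        if (write(logfd, logbuf, logused) != (ssize_t)logused)
            return -1;
        if (fsync(logfd) < 0)
            return -1;
        logused = 0;
        return 0;
    }

    int main(void)
    {
        logfd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (logfd < 0) { perror("open"); return 1; }

        /* Three "transactions" pay for a single write + fsync. */
        log_append("begin t1; put(a,1); commit t1\n", 30);
        log_append("begin t2; put(b,2); commit t2\n", 30);
        log_append("begin t3; put(c,3); commit t3\n", 30);
        log_flush();

        close(logfd);
        return 0;
    }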
I suspect the reason PostgreSQL doesn't really support
delayed log flush is that they are thinking about ACID
transactions, where you really need the data to be on
disk immediately. A more technical issue is that the
log data must be on disk before the corresponding
permanent data (otherwise crash recovery will break),
and I suspect postgresql.conf's "fsync" option
has the effect of not fsync()ing the log at all, which
indeed would cause permanent corruption after a crash.
Indeed, fsync = off just means that the WAL isn't fsync'd at all, which can cause permanent corruption after a crash.
PostgreSQL does support a "deferred logging" mode, in which one or more transactions can avoid fsync'ing the WAL without risking data corruption -- the only risk is that those particular transactions might not be durable if the system crashes before the next fsync. This allows you to mix must-be-durable transactions with more transient ones, which is a nice feature.
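For what it's worth, here's a rough libpq sketch of
mixing the two kinds of transactions. I'm assuming the
feature being described is the asynchronous-commit
setting (synchronous_commit) in reasonably recent
PostgreSQL releases; the connection string and table
names are invented.

    /* Sketch: the first transaction commits asynchronously (COMMIT
     * returns without waiting for the WAL fsync; a crash may lose it
     * but never corrupts anything), the second commits durably. */
    #include <stdio.h>
    #include <libpq-fe.h>

    static void run(PGconn *conn, const char *sql)
    {
        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
    }

    int main(void)
    {
        PGconn *conn = PQconnectdb("dbname=test");   /* invented database */
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* Transient transaction: no WAL fsync at commit time. */
        run(conn, "BEGIN");
        run(conn, "SET LOCAL synchronous_commit = off");
        run(conn, "INSERT INTO clicklog VALUES (1, now())");
        run(conn, "COMMIT");

        /* Must-be-durable transaction: default synchronous_commit = on,
         * so COMMIT waits for the WAL flush. */
        run(conn, "BEGIN");
        run(conn, "INSERT INTO payments VALUES (42, 99.95)");
        run(conn, "COMMIT");

        PQfinish(conn);
        return 0;
    }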
The drives whose documentation I've
read say they may not copy the write-cache to the surface
during a power failure. I don't know about other drives, or
about why.
Such a feature would in any case be hard or impossible to
use as part of a design that gets both fast writes and
crash recovery.
Crash recovery usually depends on constraints on the
order writes were applied to the disk surface -- for
example that all the log blocks were on the surface
before any of the B-Tree blocks. Or (for FFS) that an
i-node initialization goes to the surface before the
new directory entry during a creat(). Drives that just
provide write caching don't guarantee any ordering
(much of the point of write-caching is to change
the order of writes), and don't tell the o/s
which writes have actually completed. So the
write-order invariants that crash recovery depends on
won't hold with write-caching. That's why tagged command
queuing is popular in high-end systems: TCQ lets the
drive re-order concurrent writes, but tells the o/s when
each completes, so for example a DB can wait for the
log writes to reach the surface before starting the
B-Tree writes.
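Here's a sketch, using plain POSIX calls, of the ordering
invariant I mean: the DB forces the log record to the
surface with fsync() and only then writes the B-Tree
page, so recovery can always redo the page from the log.
File names and the page layout are made up, and with a
volatile drive write-cache the fsync() barrier is exactly
the guarantee we lose.

    /* Write-ahead ordering: the log record reaches stable storage
     * before the corresponding B-Tree page is written. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGESZ 4096

    int main(void)
    {
        int logfd  = open("wal.log",  O_WRONLY | O_CREAT | O_APPEND, 0644);
        int treefd = open("btree.db", O_RDWR   | O_CREAT, 0644);
        if (logfd < 0 || treefd < 0) { perror("open"); return 1; }

        /* 1. Append the log record describing the update. */
        const char *rec = "update page 7: insert key 42\n";
        if (write(logfd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

        /* 2. Barrier: don't proceed until the log record is on the
         *    surface. */
        if (fsync(logfd) < 0) { perror("fsync"); return 1; }

        /* 3. Only now is it safe to write the B-Tree page itself; if
         *    we crash before or during this write, recovery redoes it
         *    from the log. */
        char page[PAGESZ];
        memset(page, 0, sizeof page);
        if (pwrite(treefd, page, PAGESZ, 7 * PAGESZ) != PAGESZ) {
            perror("pwrite");
            return 1;
        }
        fsync(treefd);

        close(logfd);
        close(treefd);
        return 0;
    }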
In our case, perhaps a pure log-structured DB could use
a disk write-cache. Crash recovery could scan the whole
disk (or some guess about the tail of the log) looking
for records that were written, and use the largest
complete prefix of the log. But we would not be able to
use the disk for anything with a more traditional crash
recovery design -- for example we probably could not
store our log in a file system! Perhaps we could tell the
disk to write-cache our data, but not the file system's
meta-data. On the other hand perhaps we'd want to write
the log to the raw disk anyway, since we don't want to be
slowed down by the file system adding block numbers to
the i-node whenever we append to the log.
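As a sketch of what that recovery scan might look like:
the record header and checksum below are invented (a real
design would use a proper CRC), but the idea is to walk
the log from the front and stop at the first record that
is missing or fails its checksum.

    /* Recovery: find the largest prefix of the log made entirely of
     * complete, well-checksummed records.  Everything past that point
     * was presumably still in the drive's write-cache at the crash. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct rec_hdr {
        uint32_t len;    /* length of the record body in bytes */
        uint32_t sum;    /* checksum of the body */
    };

    /* Toy checksum standing in for a real CRC. */
    static uint32_t checksum(const unsigned char *p, uint32_t len)
    {
        uint32_t s = 2166136261u;
        for (uint32_t i = 0; i < len; i++)
            s = (s ^ p[i]) * 16777619u;
        return s;
    }

    /* Return the offset of the end of the valid prefix. */
    static off_t recover(int fd)
    {
        off_t good = 0;
        struct rec_hdr h;

        for (;;) {
            if (pread(fd, &h, sizeof h, good) != (ssize_t)sizeof h)
                break;                      /* truncated header */
            if (h.len == 0 || h.len > (1u << 20))
                break;                      /* implausible: garbage */
            unsigned char *body = malloc(h.len);
            if (body == NULL)
                break;
            if (pread(fd, body, h.len, good + sizeof h) != (ssize_t)h.len ||
                checksum(body, h.len) != h.sum) {
                free(body);
                break;                      /* incomplete record */
            }
            free(body);
            good += sizeof h + h.len;       /* record is good; keep going */
        }
        return good;
    }

    int main(void)
    {
        int fd = open("raw.log", O_RDONLY);   /* invented log file/device */
        if (fd < 0) { perror("open"); return 1; }
        printf("log is valid up to byte %lld\n", (long long)recover(fd));
        close(fd);
        return 0;
    }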
You can configure a drive to delay writing to the disk
surface, and instead just write into its cache, until
some later point when it's convenient to write the
surface. But the reason a DB issues a write to the
disk is that the DB needs the data to be recoverable
after a crash before the DB can proceed. So DBs cannot
easily use the disk's write-to-cache feature; the
disk's cache is no more durable than main memory.
You might imagine that the disk would write-cache only
an amount of data that it could write to the surface
with the energy stored in its capacitors after it detected
a power failure. But this is not the way disks work.
Typical disk specs explicitly say that the contents
of the write-cache may be lost if the power fails.
You may be thinking of "tagged queuing", in which the o/s
can issue concurrent operations to the disk, and the disk
chooses the order in which to apply them to the surface,
and tells the o/s as each completes so the DB knows
which transaction can now continue.
That's a good idea if there are concurrent transactions
and the DB is basically doing writes to random disk
positions. In the log-append case we're talking about,
tagged queuing is only going to make a difference if we
hand lots of appends to the disk at the same time. In
that specialized situation it's somewhat faster to
issue a single big disk write. You need to defer
log flushes in either case to get good performance.
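To make the completion-notification idea concrete, here's
a rough userspace analog using POSIX AIO: issue several
writes at once, let the kernel (and a TCQ/NCQ drive)
order them, and learn as each one completes so the
corresponding transaction can continue. The file name and
offsets are invented; I've opened the file with O_DSYNC
so that, to the extent the drive honors it, a completion
means the data reached the surface.

    /* Concurrent writes with per-write completion notification. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NWRITES 3

    int main(void)
    {
        int fd = open("data.db", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        static char buf[NWRITES][512];
        struct aiocb cb[NWRITES];
        const struct aiocb *list[NWRITES];

        /* Hand all the writes to the kernel at once; it and the drive
         * may apply them to the surface in any order. */
        for (int i = 0; i < NWRITES; i++) {
            memset(buf[i], 'a' + i, sizeof buf[i]);
            memset(&cb[i], 0, sizeof cb[i]);
            cb[i].aio_fildes = fd;
            cb[i].aio_buf    = buf[i];
            cb[i].aio_nbytes = sizeof buf[i];
            cb[i].aio_offset = (off_t)i * 512;
            if (aio_write(&cb[i]) < 0) { perror("aio_write"); return 1; }
            list[i] = &cb[i];
        }

        /* Wait for completions one by one; a DB would release each
         * waiting transaction as soon as its own write is reported
         * done. */
        int done = 0;
        while (done < NWRITES) {
            aio_suspend(list, NWRITES, NULL);
            for (int i = 0; i < NWRITES; i++) {
                if (list[i] != NULL && aio_error(&cb[i]) != EINPROGRESS) {
                    printf("write %d done (%zd bytes)\n",
                           i, aio_return(&cb[i]));
                    list[i] = NULL;
                    done++;
                }
            }
        }

        close(fd);
        return 0;
    }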
You might imagine that the disk would write-cache only an amount of data that it could write to the surface with the energy stored in its capacitors after it detected a power failure.
That's exactly what I assumed, at least for high-end disks. Any idea why they don't do that? It seems like a pretty trivial hardware feature that would save an awful lot of software complexity.