Deferred logging is a huge deal! Appending a record to
an on-disk log takes half a rotation on average (and up
to a full rotation), and a rotation takes about the same
amount of time as a disk seek. At 7200 RPM, for example,
a rotation is roughly 8.3 milliseconds, so a DB that
synchronously appends to the on-disk log for each
transaction is limited to at most a few hundred
transactions per second.
The huge win is if you can append many transactions to
the log in each rotation. To do that you have to gather
up many updates per disk operation. So deferred logging
is critical.
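Here's a minimal sketch, in C, of what gathering up
updates per disk operation might look like. The file
name, record format, and fixed-size buffer are all
invented for illustration, and a real DB would have many
concurrent transactions sharing each flush; the point is
just that a single write() plus a single fsync() makes a
whole batch of transactions durable at once.

    /* Minimal deferred-logging sketch: buffer several transactions'
     * log records in memory, then push them all to disk with one
     * write() + one fsync(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define LOGBUF_SIZE (1 << 20)

    static char   logbuf[LOGBUF_SIZE];
    static size_t logused = 0;
    static int    logfd = -1;

    /* Stage one transaction's log record in memory; no disk I/O yet. */
    int log_append(const void *rec, size_t len)
    {
        if (logused + len > LOGBUF_SIZE)
            return -1;                  /* caller should flush first */
        memcpy(logbuf + logused, rec, len);
        logused += len;
        return 0;
    }

    /* Push everything buffered so far to the disk in one operation.
     * Every transaction appended since the last flush becomes
     * durable here. */
    int log_flush(void)
    {
        if (logused == 0)
            return 0;
        if (write(logfd, logbuf, logused) != (ssize_t)logused)
            return -1;
        if (fsync(logfd) < 0)
            return -1;
        logused = 0;
        return 0;
    }

    int main(void)
    {
        logfd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (logfd < 0) { perror("open"); return 1; }

        /* Three "transactions" pay for a single write + fsync. */
        log_append("begin t1; put(a,1); commit t1\n", 30);
        log_append("begin t2; put(b,2); commit t2\n", 30);
        log_append("begin t3; put(c,3); commit t3\n", 30);
        log_flush();

        close(logfd);
        return 0;
    }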
I suspect the reason PostgreSQL doesn't really support
delayed log flush is that they are thinking about ACID
transactions, where you really need the data to be on
disk immediately. A more technical issue is that the
log data must be on disk before the corresponding
permanent data (otherwise crash recovery will break),
and I suspect postgresql.conf's "fsync" option
has the effect of not fsync()ing the log at all, which
indeed would cause permanent corruption after a crash.
Indeed, fsync = off just means that the WAL isn't fsync'd at all, which can cause permanent corruption after a crash.
PostgreSQL does support a "deferred logging" mode, in which one or more transactions can avoid fsync'ing the WAL without risking data corruption -- the only risk is that those particular transactions might not be durable if the system crashes before the next fsync. This allows you to mix must-be-durable transactions with more transient ones, which is a nice feature.
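For what it's worth, here's a rough libpq sketch of
mixing the two kinds of transactions. I'm assuming the
feature being described is the asynchronous-commit
setting (synchronous_commit) in reasonably recent
PostgreSQL releases; the connection string and table
names are invented.

    /* Sketch: the first transaction commits asynchronously (COMMIT
     * returns without waiting for the WAL fsync; a crash may lose it
     * but never corrupts anything), the second commits durably. */
    #include <stdio.h>
    #include <libpq-fe.h>

    static void run(PGconn *conn, const char *sql)
    {
        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
    }

    int main(void)
    {
        PGconn *conn = PQconnectdb("dbname=test");   /* invented database */
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* Transient transaction: no WAL fsync at commit time. */
        run(conn, "BEGIN");
        run(conn, "SET LOCAL synchronous_commit = off");
        run(conn, "INSERT INTO clicklog VALUES (1, now())");
        run(conn, "COMMIT");

        /* Must-be-durable transaction: default synchronous_commit = on,
         * so COMMIT waits for the WAL flush. */
        run(conn, "BEGIN");
        run(conn, "INSERT INTO payments VALUES (42, 99.95)");
        run(conn, "COMMIT");

        PQfinish(conn);
        return 0;
    }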
The drives whose documentation I've
read say they may not copy the write-cache to the surface
during a power failure. I don't know about other drives, or
about why.
Such a feature would in any case be hard or impossible to
use as part of a design that gets both fast writes and
crash recovery.
Crash recovery usually depends on constraints on the
order writes were applied to the disk surface -- for
example that all the log blocks were on the surface
before any of the B-Tree blocks. Or (for FFS) that an
i-node initialization goes to the surface before the
new directory entry during a creat(). Drives that just
provide write caching don't guarantee any ordering
(much of the point of write-caching is to change
the order of writes), and don't tell the o/s
which writes have actually completed. So the
write-order invariants that crash recovery depends on
won't hold with write-caching. That's why tagged command
queuing is popular in high-end systems: TCQ lets the
drive re-order concurrent writes, but tells the o/s when
each completes, so for example a DB can wait for the
log writes to reach the surface before starting the
B-Tree writes.
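Here's a sketch, using plain POSIX calls, of the ordering
invariant I mean: the DB forces the log record to the
surface with fsync() and only then writes the B-Tree
page, so recovery can always redo the page from the log.
File names and the page layout are made up, and with a
volatile drive write-cache the fsync() barrier is exactly
the guarantee we lose.

    /* Write-ahead ordering: the log record reaches stable storage
     * before the corresponding B-Tree page is written. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGESZ 4096

    int main(void)
    {
        int logfd  = open("wal.log",  O_WRONLY | O_CREAT | O_APPEND, 0644);
        int treefd = open("btree.db", O_RDWR   | O_CREAT, 0644);
        if (logfd < 0 || treefd < 0) { perror("open"); return 1; }

        /* 1. Append the log record describing the update. */
        const char *rec = "update page 7: insert key 42\n";
        if (write(logfd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

        /* 2. Barrier: don't proceed until the log record is on the
         *    surface. */
        if (fsync(logfd) < 0) { perror("fsync"); return 1; }

        /* 3. Only now is it safe to write the B-Tree page itself; if
         *    we crash before or during this write, recovery redoes it
         *    from the log. */
        char page[PAGESZ];
        memset(page, 0, sizeof page);
        if (pwrite(treefd, page, PAGESZ, 7 * PAGESZ) != PAGESZ) {
            perror("pwrite");
            return 1;
        }
        fsync(treefd);

        close(logfd);
        close(treefd);
        return 0;
    }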
In our case, perhaps a pure log-structured DB could use
a disk write-cache. Crash recovery could scan the whole
disk (or some guess about the tail of the log) looking
for records that were written, and use the largest
complete prefix of the log. But we would not be able to
use the disk for anything with a more traditional crash
recovery design -- for example we probably could not
store our log in a file system! Perhaps we could tell the
disk to write-cache our data, but not the file system's
meta-data. On the other hand perhaps we'd want to write
the log to the raw disk anyway, since we don't want to be
slowed down by the file system adding block numbers to
the i-node whenever we append to the log.
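As a sketch of what that recovery scan might look like:
the record header and checksum below are invented (a real
design would use a proper CRC), but the idea is to walk
the log from the front and stop at the first record that
is missing or fails its checksum.

    /* Recovery: find the largest prefix of the log made entirely of
     * complete, well-checksummed records.  Everything past that point
     * was presumably still in the drive's write-cache at the crash. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct rec_hdr {
        uint32_t len;    /* length of the record body in bytes */
        uint32_t sum;    /* checksum of the body */
    };

    /* Toy checksum standing in for a real CRC. */
    static uint32_t checksum(const unsigned char *p, uint32_t len)
    {
        uint32_t s = 2166136261u;
        for (uint32_t i = 0; i < len; i++)
            s = (s ^ p[i]) * 16777619u;
        return s;
    }

    /* Return the offset of the end of the valid prefix. */
    static off_t recover(int fd)
    {
        off_t good = 0;
        struct rec_hdr h;

        for (;;) {
            if (pread(fd, &h, sizeof h, good) != (ssize_t)sizeof h)
                break;                      /* truncated header */
            if (h.len == 0 || h.len > (1u << 20))
                break;                      /* implausible: garbage */
            unsigned char *body = malloc(h.len);
            if (body == NULL)
                break;
            if (pread(fd, body, h.len, good + sizeof h) != (ssize_t)h.len ||
                checksum(body, h.len) != h.sum) {
                free(body);
                break;                      /* incomplete record */
            }
            free(body);
            good += sizeof h + h.len;       /* record is good; keep going */
        }
        return good;
    }

    int main(void)
    {
        int fd = open("raw.log", O_RDONLY);   /* invented log file/device */
        if (fd < 0) { perror("open"); return 1; }
        printf("log is valid up to byte %lld\n", (long long)recover(fd));
        close(fd);
        return 0;
    }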
You can configure a drive to delay writing to the disk
surface, and instead just write into its cache, until
some later point when it's convenient to write the
surface. But the reason a DB issues a write to the
disk is that the DB needs the data to be recoverable
after a crash before the DB can proceed. So DBs cannot
easily use the disk's write-to-cache feature; the
disk's cache is no more durable than main memory.
You might imagine that the disk would write-cache only
an amount of data that it could write to the surface
with the energy stored in its capacitors after it detected
a power failure. But this is not the way disks work.
Typical disk specs explicitly say that the contents
of the write-cache may be lost if the power fails.
You may be thinking of "tagged queuing", in which the o/s
can issue concurrent operations to the disk, and the disk
chooses the order in which to apply them to the surface,
and tells the o/s as each completes so the DB knows
which transaction can now continue.
That's a good idea if there are concurrent transactions
and the DB is basically doing writes to random disk
positions. In the log-append case we're talking about,
tagged queuing is only going to make a difference if we
hand lots of appends to the disk at the same time. In
that specialized situation it's somewhat faster to
issue a single big disk write. You need to defer
log flushes in either case to get good performance.
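To make the completion-notification idea concrete, here's
a rough userspace analog using POSIX AIO: issue several
writes at once, let the kernel (and a TCQ/NCQ drive)
order them, and learn as each one completes so the
corresponding transaction can continue. The file name and
offsets are invented; I've opened the file with O_DSYNC
so that, to the extent the drive honors it, a completion
means the data reached the surface.

    /* Concurrent writes with per-write completion notification. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NWRITES 3

    int main(void)
    {
        int fd = open("data.db", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        static char buf[NWRITES][512];
        struct aiocb cb[NWRITES];
        const struct aiocb *list[NWRITES];

        /* Hand all the writes to the kernel at once; it and the drive
         * may apply them to the surface in any order. */
        for (int i = 0; i < NWRITES; i++) {
            memset(buf[i], 'a' + i, sizeof buf[i]);
            memset(&cb[i], 0, sizeof cb[i]);
            cb[i].aio_fildes = fd;
            cb[i].aio_buf    = buf[i];
            cb[i].aio_nbytes = sizeof buf[i];
            cb[i].aio_offset = (off_t)i * 512;
            if (aio_write(&cb[i]) < 0) { perror("aio_write"); return 1; }
            list[i] = &cb[i];
        }

        /* Wait for completions one by one; a DB would release each
         * waiting transaction as soon as its own write is reported
         * done. */
        int done = 0;
        while (done < NWRITES) {
            aio_suspend(list, NWRITES, NULL);
            for (int i = 0; i < NWRITES; i++) {
                if (list[i] != NULL && aio_error(&cb[i]) != EINPROGRESS) {
                    printf("write %d done (%zd bytes)\n",
                           i, aio_return(&cb[i]));
                    list[i] = NULL;
                    done++;
                }
            }
        }

        close(fd);
        return 0;
    }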
You might imagine that the disk would write-cache only an amount of data that it could write to the surface with the energy stored in its capacitors after it detected a power failure.
That's exactly what I assumed, at least for high-end disks. Any idea why they don't do that? It seems like a pretty trivial hardware feature that would save an awful lot of software complexity.