If your entire database fits in memory, then any decent DBMS will cache all of it there, write changes back to disk at times chosen to minimize the performance impact, and all the while ensure atomicity even if someone kicks out the plug in the middle of a buffer flush. If you're not using a DBMS and you can still say all that about your own solution, you've probably put a lot of work into reinventing the wheel.
An ordinary DB writes each update to disk twice, once
in the log and once in the permanent DB. DBs do
defer updating the permanent DB. But they tend to be
eager about flushing the log in order to achieve durability. If you read the documentation
for PostgreSQL, for example, the only knob that seems
to allow you to defer log flushes also says it may
corrupt the internals of the DB if there's a crash. (Though
there's no fundamental reason PostgreSQL couldn't
defer log writes and have correct crash recovery,
sacrificing only durability of recent updates.)
That's a very strange knob. What's the point of doing any logging at all if it doesn't let you recover from a crash?
I don't imagine that deferred logging is a big deal, though, because log writes are by definition sequential. As Paul pointed out, sequential writes just aren't that slow. You can build an array of 62 commodity drives for maybe $4000.
Deferred logging is a huge deal! Appending a record to
a log takes one rotation (or half of one) on average. A rotation
takes about the same amount of time as a disk seek. So
a DB that synchronously appends to the on-disk log for
each transaction will be slow.
The huge win is if you can append many transactions to
the log in each rotation. To do that you have to gather
up many updates per disk operation. So deferred logging
is critical.
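The batching idea above can be sketched in a few lines. This is a toy illustration of group commit, not any real DB's implementation: records from many transactions accumulate in memory, and a single write-plus-fsync makes the whole batch durable at once, so the per-transaction cost shrinks as the batch grows.

```python
import os
import threading

class GroupCommitLog:
    """Toy group commit: many transactions share one fsync per flush."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.lock = threading.Lock()
        self.pending = []

    def append(self, record: bytes):
        # Buffer the record in memory; durability is deferred to flush().
        with self.lock:
            self.pending.append(record)

    def flush(self):
        # One write and one fsync cover every buffered transaction.
        with self.lock:
            batch, self.pending = self.pending, []
        if batch:
            os.write(self.fd, b"".join(batch))
            os.fsync(self.fd)

log = GroupCommitLog("/tmp/group_commit.log")
for i in range(1000):
    log.append(b"txn %d\n" % i)
log.flush()  # 1000 transactions made durable by a single fsync
```

A real system would flush on a timer or when the buffer fills, and would block each committing transaction until the flush that covers it completes.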
I suspect the reason PostgreSQL doesn't really support
delayed log flush is that they are thinking about ACID
transactions, where you really need the data to be on
disk immediately. A more technical issue is that the
log data must be on disk before the corresponding
permanent data (otherwise crash recovery will break),
and I suspect postgresql.conf's "fsync" option
has the effect of not fsync()ing the log at all, which
indeed would cause permanent corruption after a crash.
Indeed, fsync = off just means that the WAL isn't fsync'd at all, which can cause permanent corruption after a crash.
PostgreSQL does support a "deferred logging" mode, in which one or more transactions can avoid fsync'ing the WAL without risking data corruption -- the only risk is that those particular transactions might not be durable if the system crashes before the next fsync. This allows you to mix must-be-durable transactions with more transient ones, which is a nice feature.
The drives whose documentation I've
read say they may not copy the write-cache to the surface
during a power failure. I don't know about other drives, or
about why.
Such a feature would anyway be hard or impossible to use as
part of a design to get fast writes and crash recovery.
Crash recovery usually depends on constraints on the
order writes were applied to the disk surface -- for
example that all the log blocks were on the surface
before any of the B-Tree blocks. Or (for FFS) that an
i-node initialization goes to the surface before the
new directory entry during a creat(). Drives that just
provide write caching don't guarantee any ordering
(much of the point of write-caching is to change
the order of writes), and don't tell the o/s
which writes have actually completed. So the
write-order invariants that crash recovery depends on
won't hold with write-caching. That's why tagged command
queuing is popular in high-end systems: TCQ lets the
drive re-order concurrent writes, but tells the o/s when
each completes, so for example a DB can wait for the
log writes to reach the surface before starting the
B-Tree writes.
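The ordering invariant described above (log blocks on the surface before B-Tree blocks) can be expressed with an fsync used as a barrier. A minimal sketch; the file names and record formats are illustrative, not from any real DB:

```python
import os

def committed_update(log_fd, data_fd, log_record: bytes,
                     page_offset: int, page: bytes):
    # Invariant: the log record must be durable before the permanent
    # structure is touched, or crash recovery can't redo the update.
    os.write(log_fd, log_record)
    os.fsync(log_fd)                     # barrier: log reaches the surface first
    os.pwrite(data_fd, page, page_offset)
    # The data page may be fsync'd lazily; recovery replays it from the log.

log_fd = os.open("/tmp/wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
data_fd = os.open("/tmp/btree.pages", os.O_RDWR | os.O_CREAT, 0o644)
committed_update(log_fd, data_fd, b"put k=1 v=2\n", 0, b"page-bytes")
```

With a write-caching drive the fsync in the middle gives no real guarantee, which is exactly the problem the comment describes; with TCQ the DB can instead wait for the log write's completion notification before issuing the page write.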
In our case, perhaps a pure log-structured DB could use
a disk write-cache. Crash recovery could scan the whole
disk (or some guess about the tail of the log) looking
for records that were written, and use the largest
complete prefix of the log. But we would not be able to
use the disk for anything with a more traditional crash
recovery design -- for example we probably could not
store our log in a file system! Perhaps we could tell the
disk to write-cache our data, but not the file system's
meta-data. On the other hand perhaps we'd want to write
the log to the raw disk anyway, since we don't want to be
slowed down by the file system adding block numbers to
the i-node whenever we append the log.
You can configure a drive to delay writing to the disk
surface, and instead just write into its cache, until
some later point when it's convenient to write the
surface. But the reason a DB issues a write to the
disk is that the DB needs the data to be recoverable
after a crash before the DB can proceed. So DBs cannot
easily use the disk's write-to-cache feature; the
disk's cache is no more durable than main memory.
You might imagine that the disk would write-cache only
an amount of data that it could write to the surface
with the energy stored in its capacitors after it detected
a power failure. But this is not the way disks work.
Typical disk specs explicitly say that the contents
of the write-cache may be lost if the power fails.
You may be thinking of "tagged queuing", in which the o/s
can issue concurrent operations to the disk, and the disk
chooses the order in which to apply them to the surface,
and tells the o/s as each completes so the DB knows
which transaction can now continue.
That's a good idea if there are concurrent transactions
and the DB is basically doing writes to random disk
positions. In the log-append case we're talking about,
tagged queuing is only going to make a difference if we
hand lots of appends to the disk at the same time. In
that specialized situation it's somewhat faster to
issue a single big disk write. You need to defer
log flushes in either case to get good performance.
You might imagine that the disk would write-cache only an amount of data that it could write to the surface with the energy stored in its capacitors after it detected a power failure.
That's exactly what I assumed, at least for high-end disks. Any idea why they don't do that? It seems like a pretty trivial hardware feature that would save an awful lot of software complexity.
Databases SHOULD do these things, but I haven't seen any evidence that the popular ones do. Feel free to post actual measurements for updates per second from your db.
How are you handling atomicity? Are you doing full-out undo/redo logging, or just something involving writing to a separate file and then (atomically) retargeting a symlink?
I'm mostly an academic, so I can tell you a lot more about how things are supposed to work than about how they actually do. The only high-traffic DBMS with real users to which I can easily get access is MS-SQL, which doesn't inspire me to any leaps of faith. But once I finish building my new desktop system (with spiffy RAID array) I'll rig up some benchmarks on MySQL and PostgreSQL.
Just write each transaction as a single record to your transaction log file, ensure that only whole records are replayed (like with a checksum), and you're done. Since there are no b-trees or other complicated structures on disk, the problem gets a lot easier.
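A minimal sketch of that checksummed-record scheme (the framing format here is made up for illustration): each record is length + CRC32 + payload, and replay stops at the first record that fails to verify, so a torn final write is simply discarded.

```python
import os
import struct
import zlib

def append_record(fd, payload: bytes):
    # Frame: 4-byte length, 4-byte CRC32, then the payload.
    header = struct.pack("<II", len(payload), zlib.crc32(payload))
    os.write(fd, header + payload)
    os.fsync(fd)

def replay(path):
    records = []
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + 8 <= len(data):
        length, crc = struct.unpack_from("<II", data, pos)
        payload = data[pos + 8 : pos + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: keep only the complete prefix
        records.append(payload)
        pos += 8 + length
    return records

fd = os.open("/tmp/txn.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
append_record(fd, b"set a=1")
append_record(fd, b"set b=2")
os.close(fd)
print(replay("/tmp/txn.log"))  # [b'set a=1', b'set b=2']
```

Because there are no in-place page updates, recovery is just "replay every record that verifies," which is the simplification the comment is pointing at.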
It sounds like you're describing very-nearly a textbook implementation of redo logging. I think what's making things easy for you is not the fact that you're using flat files, but rather that the entire contents of a transaction fit in memory. Most of the complexity involved in a commercial DBMS's log implementation stems from dealing with how to handle a transaction that modifies 20GB of data when you have 8GB of memory.
Edit: and yes, the irony is noted that I'm telling a Googler that he's not working with enough data.
Yes, exactly, they are solving a number of much harder problems. However, these problems are irrelevant to 99% of web sites. How often does Twitter need to perform a 20GB transaction?
This is a very good point. Some types of applications will find this scheme a bit difficult, though I'm in the midst of hacking through some POC work around it. I've got a database in the 100GB range potentially so it's a bit of a different domain.
I really think this concept works though for 8/10 applications which are using databases because, well, you've always used databases. Memory, save changes to flat files, read it all in during startup, it's really an elegant way to go about it.
It's been a few years since I was a sysadmin, but I believe either his DNS system or qmail system will show it. When I first read them, I thought they were very annoying, because it's not one monolithic program, and there are a lot of one-line shell scripts that do sequential operations using exec.
The way he describes it, he's getting rid of a lot of parsing by using the directory tree instead of text files. But this use of directories can also be good for some of the more degenerate uses of databases.
Mostly I recall the "aha" moment of "So, this is the Unix philosophy in action." The design is very elegant.
Exactly. And it's mighty nice to have a declarative language (SQL) available to write your queries, and perform ad-hoc queries in a terminal, plus referential integrity, triggers, stored procedures etc.
On MySQL 4.1 / InnoDB / FreeBSD / FFS / SCSI, small inserts run at about 167 transactions per second, i.e. one per rotation.
Same setup but --innodb_flush_log_at_trx_commit=0,
a million transactions in 200 seconds, or 5000/second.
I don't know if InnoDB wrote its B-Tree to disk during the 200 seconds. Same performance even with InnoDB's buffer pool size set to 200 KB with --innodb_buffer_pool_size=200000.
The MySQL documentation claims that this configuration
does crash-recovery correctly, though you may lose
the last second's worth of transactions.
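A quick sanity check on the "one per rotation" reading of the 167 tps figure. The post doesn't name the drive, so the 10,000 rpm here is my assumption:

```python
# If the log is fsync'd once per commit, throughput is capped at roughly
# one commit per disk rotation. A 10,000 rpm drive (assumed; the post
# doesn't say) completes 10000/60 rotations per second.
rpm = 10_000
rotations_per_second = rpm / 60
print(round(rotations_per_second))  # 167
```

That matches the measured 167 transactions per second almost exactly, which is good evidence the synchronous log flush is the bottleneck in the default configuration.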
Keeping all data in RAM and just saving changes to a log file is a great architecture. It is a pity that the current "hip" tools/models for web development impose unnecessary cost for using it.
The problem is that the Ruby / Python / PHP stacks typically have one process per request, and they do not share memory. When you use memcached in such an environment, it is a separate process and communication with it is rather slow - it involves marshaling the data, which would not be necessary if all requests were just threads in a single address space. You don't have this problem if you use Java/C++.
I have used with some success a huge mmaped file as persistent memory to store my data. It's not exactly using the disk as a sequential device; that depends on the application and data locality. There are just a few tools available to support that scheme.
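The mmap approach can be sketched like this (a toy version, not the commenter's actual code): back your data structure with a file-backed mapping, let the OS write dirty pages back lazily, and call flush (msync) when you want the file brought up to date.

```python
import mmap
import os
import struct

# Toy mmap-as-persistent-memory: a single 64-bit counter stored at
# offset 0 of a file-backed mapping. Durability is only as strong as
# the msync/fsync calls you make.
path = "/tmp/state.mmap"
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC)
os.ftruncate(fd, 4096)          # extend with zeros to one page
mem = mmap.mmap(fd, 4096)

def bump():
    (n,) = struct.unpack_from("<Q", mem, 0)
    struct.pack_into("<Q", mem, 0, n + 1)

for _ in range(10):
    bump()
mem.flush()  # msync: push dirty pages back to the file
print(struct.unpack_from("<Q", mem, 0)[0])  # 10
```

Note the caveat from the comment: whether writes hit the disk sequentially depends entirely on which pages your workload dirties, not on anything mmap does for you.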
So, how do write-intensive sites like LiveJournal handle their updates? I was under the impression that LJ was all memcached + sharding + master/master pairs for the shards.
How do you handle features that rely on frequent UPDATE statements, like say a hitcounter? Is UPDATE LOW PRIORITY (MySQL) enough, or should you work out some application-specific caching/batching mechanism and perform all your updates at once?
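One application-specific batching scheme for the hitcounter case: accumulate increments in memory and issue one UPDATE per key per flush interval. A hedged sketch; `execute_sql`, the table name, and the schema are stand-ins for whatever client and schema you actually use:

```python
import threading
from collections import defaultdict

class HitCounter:
    """Batch hit-counter increments: one UPDATE per key per flush."""

    def __init__(self, execute_sql):
        self.execute_sql = execute_sql   # stand-in for your DB client
        self.lock = threading.Lock()
        self.pending = defaultdict(int)

    def hit(self, page_id):
        with self.lock:
            self.pending[page_id] += 1   # cheap in-memory increment

    def flush(self):
        with self.lock:
            batch, self.pending = dict(self.pending), defaultdict(int)
        for page_id, n in batch.items():
            # One statement per page instead of one per hit.
            self.execute_sql(
                "UPDATE hits SET count = count + %s WHERE page = %s",
                (n, page_id),
            )

statements = []
counter = HitCounter(lambda sql, args: statements.append((sql, args)))
for _ in range(500):
    counter.hit("front-page")
counter.flush()
print(len(statements))  # 1 -- 500 hits collapsed into a single UPDATE
```

The trade-off is the same as deferred logging: a crash between flushes loses the unflushed increments, which is usually acceptable for a hitcounter.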
RAM disks (like www.texmemsys.com) for the log device are a way of speeding up traditional RDBMSs. They cost a bit, but you get to keep all the 'cool' database features while minimizing the sync-log-write bottleneck.
Such architectures are also easier to explain to auditors if your app is of a financial nature.
I am not affiliated with this or any other storage vendor.
--Stuki
Really great writeup Paul. I've been preaching "memory is magic" for the last couple years now -- since the capacity:price ratio got so good. Now I have a nice non-fluffy hacker post to send people to.
If there's any silver bullet to scaling most web apps it's properly utilizing gobs of memory. Pushing bytes from memory to your NIC is a pretty damn efficient operation.
A challenge in a pure version of a log-structured
database is reclaiming space from the log when data
is no longer needed. Sometimes you don't need to,
as in Venti ( http://citeseer.ist.psu.edu/531029.html ).
I wonder what would happen if you use ZFS on top of a flash memory drive and replace/add a SQL layer next to the ZPL layer. In theory, you'd get transactions and checksums for free and possibly your sequential writes as well.