If your entire database fits in memory, then any decent DBMS will cache all of it there, write changes back to disk at times chosen to minimize the performance impact, and all the while ensure atomicity even if someone kicks out the plug in the middle of a buffer flush. If you're not using a DBMS and you can still say all that about your own solution, you've probably put a lot of work into reinventing the wheel.
An ordinary DB writes each update to disk twice, once
in the log and once in the permanent DB. DBs do
defer updating the permanent DB. But they tend to be
eager about flushing the log in order to achieve durability. If you read the documentation
for PostgreSQL, for example, the only knob that seems
to allow you to defer log flushes also says it may
corrupt the internals of the DB if there's a crash. (Though
there's no fundamental reason PostgreSQL couldn't
defer log writes and have correct crash recovery,
sacrificing only durability of recent updates.)
That's a very strange knob. What's the point of doing any logging at all if it doesn't let you recover from a crash?
I don't imagine that deferred logging is a big deal, though, because log writes are by definition sequential. As Paul pointed out, sequential writes just aren't that slow. You can build an array of 62 commodity drives for maybe $4000.
Deferred logging is a huge deal! Appending a record to
a log takes one rotation (or half of one) on average. A rotation
takes about the same amount of time as a disk seek. So
a DB that synchronously appends to the on-disk log for
each transaction will be slow.
The huge win is if you can append many transactions to
the log in each rotation. To do that you have to gather
up many updates per disk operation. So deferred logging
is critical.
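The batching idea above can be sketched in a few lines. This is a toy illustration of group commit, not any real DB's implementation: records from many transactions accumulate in memory, and a single write-plus-fsync makes the whole batch durable at once, so the per-transaction cost shrinks as the batch grows.

```python
import os
import threading

class GroupCommitLog:
    """Toy group commit: many transactions share one fsync per flush."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.lock = threading.Lock()
        self.pending = []

    def append(self, record: bytes):
        # Buffer the record in memory; durability is deferred to flush().
        with self.lock:
            self.pending.append(record)

    def flush(self):
        # One write and one fsync cover every buffered transaction.
        with self.lock:
            batch, self.pending = self.pending, []
        if batch:
            os.write(self.fd, b"".join(batch))
            os.fsync(self.fd)

log = GroupCommitLog("/tmp/group_commit.log")
for i in range(1000):
    log.append(b"txn %d\n" % i)
log.flush()  # 1000 transactions made durable by a single fsync
```

A real system would flush on a timer or when the buffer fills, and would block each committing transaction until the flush that covers it completes.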
I suspect the reason PostgreSQL doesn't really support
delayed log flush is that they are thinking about ACID
transactions, where you really need the data to be on
disk immediately. A more technical issue is that the
log data must be on disk before the corresponding
permanent data (otherwise crash recovery will break),
and I suspect postgresql.conf's "fsync" option
has the effect of not fsync()ing the log at all, which
indeed would cause permanent corruption after a crash.
Indeed, fsync = off just means that the WAL isn't fsync'd at all, which can cause permanent corruption after a crash.
PostgreSQL does support a "deferred logging" mode, in which one or more transactions can avoid fsync'ing the WAL without risking data corruption -- the only risk is that those particular transactions might not be durable if the system crashes before the next fsync. This allows you to mix must-be-durable transactions with more transient ones, which is a nice feature.
The drives whose documentation I've
read say they may not copy the write-cache to the surface
during a power failure. I don't know about other drives, or
about why.
Such a feature would anyway be hard or impossible to use as
part of a design to get fast writes and crash recovery.
Crash recovery usually depends on constraints on the
order writes were applied to the disk surface -- for
example that all the log blocks were on the surface
before any of the B-Tree blocks. Or (for FFS) that an
i-node initialization goes to the surface before the
new directory entry during a creat(). Drives that just
provide write caching don't guarantee any ordering
(much of the point of write-caching is to change
the order of writes), and don't tell the o/s
which writes have actually completed. So the
write-order invariants that crash recovery depends on
won't hold with write-caching. That's why tagged command
queuing is popular in high-end systems: TCQ lets the
drive re-order concurrent writes, but tells the o/s when
each completes, so for example a DB can wait for the
log writes to reach the surface before starting the
B-Tree writes.
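The ordering invariant described above (log blocks on the surface before B-Tree blocks) can be expressed with an fsync used as a barrier. A minimal sketch; the file names and record formats are illustrative, not from any real DB:

```python
import os

def committed_update(log_fd, data_fd, log_record: bytes,
                     page_offset: int, page: bytes):
    # Invariant: the log record must be durable before the permanent
    # structure is touched, or crash recovery can't redo the update.
    os.write(log_fd, log_record)
    os.fsync(log_fd)                     # barrier: log reaches the surface first
    os.pwrite(data_fd, page, page_offset)
    # The data page may be fsync'd lazily; recovery replays it from the log.

log_fd = os.open("/tmp/wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
data_fd = os.open("/tmp/btree.pages", os.O_RDWR | os.O_CREAT, 0o644)
committed_update(log_fd, data_fd, b"put k=1 v=2\n", 0, b"page-bytes")
```

With a write-caching drive the fsync in the middle gives no real guarantee, which is exactly the problem the comment describes; with TCQ the DB can instead wait for the log write's completion notification before issuing the page write.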
In our case, perhaps a pure log-structured DB could use
a disk write-cache. Crash recovery could scan the whole
disk (or some guess about the tail of the log) looking
for records that were written, and use the largest
complete prefix of the log. But we would not be able to
use the disk for anything with a more traditional crash
recovery design -- for example we probably could not
store our log in a file system! Perhaps we could tell the
disk to write-cache our data, but not the file system's
meta-data. On the other hand perhaps we'd want to write
the log to the raw disk anyway, since we don't want to be
slowed down by the file system adding block numbers to
the i-node whenever we append the log.
You can configure a drive to delay writing to the disk
surface, and instead just write into its cache, until
some later point when it's convenient to write the
surface. But the reason a DB issues a write to the
disk is that the DB needs the data to be recoverable
after a crash before the DB can proceed. So DBs cannot
easily use the disk's write-to-cache feature; the
disk's cache is no more durable than main memory.
You might imagine that the disk would write-cache only
an amount of data that it could write to the surface
with the energy stored in its capacitors after it detected
a power failure. But this is not the way disks work.
Typical disk specs explicitly say that the contents
of the write-cache may be lost if the power fails.
You may be thinking of "tagged queuing", in which the o/s
can issue concurrent operations to the disk, and the disk
chooses the order in which to apply them to the surface,
and tells the o/s as each completes so the DB knows
which transaction can now continue.
That's a good idea if there are concurrent transactions
and the DB is basically doing writes to random disk
positions. In the log-append case we're talking about,
tagged queuing is only going to make a difference if we
hand lots of appends to the disk at the same time. In
that specialized situation it's somewhat faster to
issue a single big disk write. You need to defer
log flushes in either case to get good performance.
You might imagine that the disk would write-cache only an amount of data that it could write to the surface with the energy stored in its capacitors after it detected a power failure.
That's exactly what I assumed, at least for high-end disks. Any idea why they don't do that? It seems like a pretty trivial hardware feature that would save an awful lot of software complexity.
Databases SHOULD do these things, but I haven't seen any evidence that the popular ones do. Feel free to post actual measurements for updates per second from your db.
How are you handling atomicity? Are you doing full-out undo/redo logging, or just something involving writing to a separate file and then (atomically) retargeting a symlink?
I'm mostly an academic, so I can tell you a lot more about how things are supposed to work than about how they actually do. The only high-traffic DBMS with real users to which I can easily get access is MS-SQL, which doesn't inspire me to any leaps of faith. But once I finish building my new desktop system (with spiffy RAID array) I'll rig up some benchmarks on MySQL and PostgreSQL.
Just write each transaction as a single record to your transaction log file, ensure that only whole records are replayed (like with a checksum), and you're done. Since there are no b-trees or other complicated structures on disk, the problem gets a lot easier.
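A minimal sketch of that checksummed-record scheme (the framing format here is made up for illustration): each record is length + CRC32 + payload, and replay stops at the first record that fails to verify, so a torn final write is simply discarded.

```python
import os
import struct
import zlib

def append_record(fd, payload: bytes):
    # Frame: 4-byte length, 4-byte CRC32, then the payload.
    header = struct.pack("<II", len(payload), zlib.crc32(payload))
    os.write(fd, header + payload)
    os.fsync(fd)

def replay(path):
    records = []
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + 8 <= len(data):
        length, crc = struct.unpack_from("<II", data, pos)
        payload = data[pos + 8 : pos + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: keep only the complete prefix
        records.append(payload)
        pos += 8 + length
    return records

fd = os.open("/tmp/txn.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
append_record(fd, b"set a=1")
append_record(fd, b"set b=2")
os.close(fd)
print(replay("/tmp/txn.log"))  # [b'set a=1', b'set b=2']
```

Because there are no in-place page updates, recovery is just "replay every record that verifies," which is the simplification the comment is pointing at.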
It sounds like you're describing very-nearly a textbook implementation of redo logging. I think what's making things easy for you is not the fact that you're using flat files, but rather that the entire contents of a transaction fit in memory. Most of the complexity involved in a commercial DBMS's log implementation stems from dealing with how to handle a transaction that modifies 20GB of data when you have 8GB of memory.
Edit: and yes, the irony is noted that I'm telling a Googler that he's not working with enough data.
Yes, exactly, they are solving a number of much harder problems. However, these problems are irrelevant to 99% of web sites. How often does Twitter need to perform a 20GB transaction?
This is a very good point. Some types of applications will find this scheme a bit difficult, though I'm in the midst of hacking through some POC work around it. I've got a database in the 100GB range potentially so it's a bit of a different domain.
I really think this concept works though for 8/10 applications which are using databases because, well, you've always used databases. Memory, save changes to flat files, read it all in during startup, it's really an elegant way to go about it.
It's been a few years since I was a sysadmin, but I believe either his DNS system or qmail system will show it. When I first read them, I thought they were very annoying, because it's not one monolithic program, and there are a lot of one-line shell scripts that do sequential operations using exec.
The way he describes it, he's getting rid of a lot of parsing by using the directory tree instead of text files. But this use of directories can also be good for some of the more degenerate uses of databases.
Mostly I recall the "aha" moment of "So, this is the Unix philosophy in action." The design is very elegant.
Exactly. And it's mighty nice to have a declarative language (SQL) available to write your queries, and perform ad-hoc queries in a terminal, plus referential integrity, triggers, stored procedures etc.
On MySQL 4.1 / InnoDB / FreeBSD / FFS / SCSI, small inserts run at about 167 transactions per second, i.e. one per rotation.
Same setup but --innodb_flush_log_at_trx_commit=0,
a million transactions in 200 seconds, or 5000/second.
I don't know if InnoDB wrote its B-Tree to disk during the 200 seconds. Same performance even with InnoDB's buffer pool size set to 200 KB with --innodb_buffer_pool_size=200000.
The MySQL documentation claims that this configuration
does crash-recovery correctly, though you may lose
the last second's worth of transactions.
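A quick sanity check on the "one per rotation" reading of the 167 tps figure. The post doesn't name the drive, so the 10,000 rpm here is my assumption:

```python
# If the log is fsync'd once per commit, throughput is capped at roughly
# one commit per disk rotation. A 10,000 rpm drive (assumed; the post
# doesn't say) completes 10000/60 rotations per second.
rpm = 10_000
rotations_per_second = rpm / 60
print(round(rotations_per_second))  # 167
```

That matches the measured 167 transactions per second almost exactly, which is good evidence the synchronous log flush is the bottleneck in the default configuration.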
Keeping all data in RAM and just saving changes to a log file is a great architecture. It is a pity that the current "hip" tools/models for web development impose unnecessary cost for using it.
The problem is that the Ruby / Python / PHP stacks typically have one process per request, and they do not share memory. When you use memcached in such an environment, it is a separate process and communication with it is rather slow - it involves marshaling the data, which would not be necessary if all requests were just threads in a single address space. You don't have this problem if you use Java/C++.
I have used with some success a huge mmaped file as persistent memory to store my data. It's not exactly using the disk as a sequential device; that depends on the application and data locality. There are just a few tools available to support that scheme.
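The mmap approach can be sketched like this (a toy version, not the commenter's actual code): back your data structure with a file-backed mapping, let the OS write dirty pages back lazily, and call flush (msync) when you want the file brought up to date.

```python
import mmap
import os
import struct

# Toy mmap-as-persistent-memory: a single 64-bit counter stored at
# offset 0 of a file-backed mapping. Durability is only as strong as
# the msync/fsync calls you make.
path = "/tmp/state.mmap"
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC)
os.ftruncate(fd, 4096)          # extend with zeros to one page
mem = mmap.mmap(fd, 4096)

def bump():
    (n,) = struct.unpack_from("<Q", mem, 0)
    struct.pack_into("<Q", mem, 0, n + 1)

for _ in range(10):
    bump()
mem.flush()  # msync: push dirty pages back to the file
print(struct.unpack_from("<Q", mem, 0)[0])  # 10
```

Note the caveat from the comment: whether writes hit the disk sequentially depends entirely on which pages your workload dirties, not on anything mmap does for you.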
So, how do write-intensive sites like LiveJournal handle their updates? I was under the impression that LJ was all memcached + sharding + master/master pairs for the shards.
How do you handle features that rely on frequent UPDATE statements, like say a hitcounter? Is UPDATE LOW PRIORITY (MySQL) enough, or should you work out some application-specific caching/batching mechanism and perform all your updates at once?
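One application-specific batching scheme for the hitcounter case: accumulate increments in memory and issue one UPDATE per key per flush interval. A hedged sketch; `execute_sql`, the table name, and the schema are stand-ins for whatever client and schema you actually use:

```python
import threading
from collections import defaultdict

class HitCounter:
    """Batch hit-counter increments: one UPDATE per key per flush."""

    def __init__(self, execute_sql):
        self.execute_sql = execute_sql   # stand-in for your DB client
        self.lock = threading.Lock()
        self.pending = defaultdict(int)

    def hit(self, page_id):
        with self.lock:
            self.pending[page_id] += 1   # cheap in-memory increment

    def flush(self):
        with self.lock:
            batch, self.pending = dict(self.pending), defaultdict(int)
        for page_id, n in batch.items():
            # One statement per page instead of one per hit.
            self.execute_sql(
                "UPDATE hits SET count = count + %s WHERE page = %s",
                (n, page_id),
            )

statements = []
counter = HitCounter(lambda sql, args: statements.append((sql, args)))
for _ in range(500):
    counter.hit("front-page")
counter.flush()
print(len(statements))  # 1 -- 500 hits collapsed into a single UPDATE
```

The trade-off is the same as deferred logging: a crash between flushes loses the unflushed increments, which is usually acceptable for a hitcounter.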
RAM disks (like www.texmemsys.com) for the log device are a way of speeding up traditional RDBMSs. They cost a bit, but you get to keep all the 'cool' database features while minimizing the sync-log-write bottleneck.
Such architectures are also easier to explain to auditors if your app is of a financial nature.
I am not affiliated with this or any other storage vendor.
--Stuki
Really great writeup Paul. I've been preaching "memory is magic" for the last couple years now -- since the capacity:price ratio got so good. Now I have a nice non-fluffy hacker post to send people to.
If there's any silver bullet to scaling most web apps it's properly utilizing gobs of memory. Pushing bytes from memory to your NIC is a pretty damn efficient operation.
A challenge in a pure version of a log-structured
database is reclaiming space from the log when data
is no longer needed. Sometimes you don't need to,
as in Venti ( http://citeseer.ist.psu.edu/531029.html ).
I wonder what would happen if you use ZFS on top of a flash memory drive and replace/add a SQL layer next to the ZPL layer. In theory, you'd get transactions and checksums for free and possibly your sequential writes as well.