
The problem with conventional databases - Sam_Odio
http://paulbuchheit.blogspot.com/2007/04/problem-with-conventional-databases.html
======
dfranke
If your entire database fits in memory, then any decent DBMS will cache it all
there, write changes back to disk at times chosen to minimize the impact on
performance, all the while ensuring atomicity if someone kicks out the plug
while it's in the middle of flushing the buffer. If you're not using a DBMS
and you can still say all that about your own solution, you've probably put a
lot of work into reinventing the wheel.

~~~
rtm
An ordinary DB writes each update to disk twice, once in the log and once in
the permanent DB. DBs do defer updating the permanent DB. But they tend to be
eager about flushing the log in order to achieve durability. If you read the
documentation for PostgreSQL, for example, the only knob that seems to allow
you to defer log flushes also says it may corrupt the internals of the DB if
there's a crash. (Though there's no fundamental reason PostgreSQL couldn't
defer log writes and have correct crash recovery, sacrificing only durability
of recent updates.)
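
To make the "write twice" pattern concrete, here's a toy sketch in Python
(file names and record format are made up): every update is fsync()'d to the
log immediately, while the permanent structure only reaches disk at an
occasional checkpoint.

    import os

    log = open("wal.log", "a")
    table = {}

    def update(key, value):
        log.write("%s\t%s\n" % (key, value))
        log.flush()
        os.fsync(log.fileno())  # eager: one synchronous log write per update
        table[key] = value      # the permanent structure changes in memory...

    def checkpoint():
        # ...and is written to its on-disk home only now and then.
        with open("table.db", "w") as f:
            for k, v in table.items():
                f.write("%s\t%s\n" % (k, v))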

~~~
dfranke
That's a very strange knob. What's the point of doing any logging at all if it
doesn't let you recover from a crash?

I don't imagine that deferred logging is a big deal, though, because log
writes are by definition sequential. As Paul pointed out, sequential writes
just aren't that slow. You can build an array of 62 commodity drives for maybe
$4000.

~~~
rtm
Deferred logging is a huge deal! Appending a record to the log takes one (or
half of one) disk rotation. A rotation takes about the same amount of time as
a disk seek. So a DB that synchronously appends to the on-disk log for each
transaction will be slow.

The huge win is if you can append many transactions to the log in each
rotation. To do that you have to gather up many updates per disk operation. So
deferred logging is critical.
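
This batching is usually called group commit. Here's a minimal sketch of the
idea (the class and its names are invented for illustration): committing
threads queue their log records and block, and a single writer thread covers
the whole batch with one fsync(). On a 7200 RPM disk that's at most ~120
fsync()s a second, so every extra record per batch is nearly free throughput.

    import os
    import threading

    class GroupCommitLog:
        def __init__(self, path):
            self.f = open(path, "ab")
            self.cv = threading.Condition()
            self.pending = []    # records not yet handed to the writer
            self.appended = 0    # records queued so far
            self.durable = 0     # records known to be on disk
            threading.Thread(target=self._writer, daemon=True).start()

        def commit(self, record):
            with self.cv:
                self.pending.append(record)
                self.appended += 1
                seq = self.appended
                self.cv.notify_all()       # wake the writer thread
                while self.durable < seq:  # block until my record is on disk
                    self.cv.wait()

        def _writer(self):
            while True:
                with self.cv:
                    while not self.pending:
                        self.cv.wait()
                    batch, self.pending = self.pending, []
                    seq = self.appended
                self.f.write(b"".join(batch))
                self.f.flush()
                os.fsync(self.f.fileno())  # one disk operation, many commits
                with self.cv:
                    self.durable = seq
                    self.cv.notify_all()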

I suspect the reason PostgreSQL doesn't really support delayed log flush is
that its developers are thinking about ACID transactions, where you really
need the data to be on disk immediately. A more technical issue is that the
log data _must_ be on disk before the corresponding permanent data (otherwise
crash recovery will break), and I suspect postgresql.conf's "fsync" option has
the effect of not fsync()ing the log at all, which indeed would cause
permanent corruption after a crash.

~~~
dfranke
Don't drives have hardware buffers designed so that you don't have to worry
about that?

~~~
rtm
You can configure a drive to delay writing to the disk surface, and instead
just write into its cache, until some later point when it's convenient to
write the surface. But the reason a DB issues a write to the disk is that the
DB needs the data to be recoverable after a crash before the DB can proceed.
So DBs cannot easily use the disk's write-to-cache feature; the disk's cache
is no more durable than main memory.
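
You can see the cost of waiting for the disk with a crude timing loop like
the one below (a sketch; the numbers vary enormously with hardware, and on
many drives fsync() itself only reaches the write cache unless that cache is
disabled):

    import os
    import time

    def bench(path, n, sync):
        with open(path, "ab", buffering=0) as f:
            start = time.time()
            for _ in range(n):
                f.write(b"x" * 100)
                if sync:
                    os.fsync(f.fileno())  # wait for the disk on every append
            return n / (time.time() - start)

    print("buffered: %.0f appends/sec" % bench("buf.log", 2000, False))
    print("fsync'd:  %.0f appends/sec" % bench("sync.log", 2000, True))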

You might imagine that the disk would write-cache only an amount of data that
it could write to the surface with the energy stored in its capacitors after
it detected a power failure. But this is not the way disks work. Typical disk
specs explicitly say that the contents of the write-cache may be lost if the
power fails.

You may be thinking of "tagged queuing", in which the o/s can issue concurrent
operations to the disk, and the disk chooses the order in which to apply them
to the surface, and tells the o/s as each completes so the DB knows which
transaction can now continue. That's a good idea if there are concurrent
transactions and the DB is basically doing writes to random disk positions. In
the log-append case we're talking about, tagged queuing is only going to make
a difference if we hand lots of appends to the disk at the same time. In that
specialized situation it's somewhat faster to issue a single big disk write.
You need to defer log flushes in either case to get good performance.

~~~
dfranke
_You might imagine that the disk would write-cache only an amount of data that
it could write to the surface with the energy stored in its capacitors after
it detected a power failure._

That's exactly what I assumed, at least for high-end disks. Any idea why they
don't do that? It seems like a pretty trivial hardware feature that would save
an awful lot of software complexity.

------
rtm
On MySQL 4.1 / InnoDB / FreeBSD / FFS / SCSI, small inserts run at about 167
transactions per second, i.e. one per rotation (a 10,000 RPM drive rotates
10000/60 ≈ 167 times per second).

Same setup but --innodb_flush_log_at_trx_commit=0, a million transactions in
200 seconds, or 5000/second.

I don't know if InnoDB wrote its B-Tree to disk during the 200 seconds. Same
performance even with InnoDB's buffer pool size set to 200 KB with
--innodb_buffer_pool_size=200000.

The MySQL documentation claims that this configuration does crash-recovery
correctly, though you may lose the last second's worth of transactions.
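
For anyone who wants to reproduce this, the measurement is just a
commit-per-insert loop. A rough sketch using Python's MySQLdb driver (the
database, table, and credentials here are hypothetical):

    import time
    import MySQLdb

    conn = MySQLdb.connect(db="bench", user="root")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS t (i INT) ENGINE=InnoDB")

    n = 10000
    start = time.time()
    for i in range(n):
        cur.execute("INSERT INTO t VALUES (%s)", (i,))
        conn.commit()  # with flush_log_at_trx_commit=1, one log fsync each
    print("%.0f transactions/sec" % (n / (time.time() - start)))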

------
ntoshev
Keeping all data in RAM and just saving changes to a log file is a great
architecture. It is a pity that the current "hip" tools/models for web
development impose unnecessary costs on using it.

The problem is that the Ruby / Python / PHP stacks typically have one process
per request, and they do not share memory. When you use memcached in such an
environment, it is a separate process, and communication with it is rather
slow - it involves marshaling the data, which would not be necessary if all
requests were just threads in a single address space. You don't have this
problem if you use Java/C++.

I have used, with some success, a huge mmapped file as persistent memory to
store my data. It's not exactly using the disk as a sequential device; that
depends on the application and data locality. There are just a few tools
available to support that scheme.
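
As a sketch of the mmap approach (file name and record layout invented for
illustration), Python's mmap module makes the file look like ordinary memory:

    import mmap
    import os
    import struct

    SIZE = 1 << 20                      # a fixed 1 MB region

    if not os.path.exists("data.bin"):  # create the backing file once
        with open("data.bin", "wb") as f:
            f.truncate(SIZE)

    f = open("data.bin", "r+b")
    mem = mmap.mmap(f.fileno(), SIZE)

    # Keep a counter at offset 0; updating it is a plain memory write, and
    # the OS pages dirty data back to the file on its own schedule.
    count = struct.unpack_from("<Q", mem, 0)[0]
    struct.pack_into("<Q", mem, 0, count + 1)
    mem.flush()                         # ask the OS to write dirty pages now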

------
nostrademons
So, how do write-intensive sites like LiveJournal handle their updates? I was
under the impression that LJ was all memcached + sharding + master/master
pairs for the shards.

How do you handle features that rely on frequent UPDATE statements, like,
say, a hit counter? Is UPDATE LOW_PRIORITY (MySQL) enough, or should you work
out some application-specific caching/batching mechanism and perform all your
updates at once?
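
One application-specific batching scheme, sketched below against a
hypothetical pages table: count hits in memory and flush them with one UPDATE
per dirty row every few seconds, instead of one UPDATE per request.

    import threading
    from collections import defaultdict

    hits = defaultdict(int)
    lock = threading.Lock()

    def record_hit(page_id):   # called on every request; no SQL here
        with lock:
            hits[page_id] += 1

    def flush(cursor):         # called from a timer every few seconds
        with lock:
            batch = dict(hits)
            hits.clear()
        for page_id, n in batch.items():
            cursor.execute("UPDATE pages SET hits = hits + %s WHERE id = %s",
                           (n, page_id))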

------
abstractbill
"For sequential access, DRAM is about 62x the speed of disk -- it's slower,
but only by a few orders of magnitude."

Shouldn't that be "it's _faster_ "?

Excellent post otherwise, btw.

~~~
paul
Fixed. Thanks

------
stuki
RAM disks (like www.texmemsys.com) for the log device are a way of speeding
up traditional RDBMSs. They cost a bit, but you get to keep all the 'cool'
database features while minimizing the synchronous-log-write bottleneck. Such
architectures are also easier to explain to auditors if your app is of a
financial nature. I am not affiliated with this or any other storage vendor.
--Stuki

------
staunch
Really great writeup Paul. I've been preaching "memory is magic" for the last
couple years now -- since the capacity:price ratio got so good. Now I have a
nice non-fluffy hacker post to send people to.

If there's any silver bullet to scaling most web apps it's properly utilizing
gobs of memory. Pushing bytes from memory to your NIC is a pretty damn
efficient operation.

------
rtm
A challenge in a pure version of a log-structured database is reclaiming space
from the log when data is no longer needed. Sometimes you don't need to, as in
Venti ( <http://citeseer.ist.psu.edu/531029.html> ).
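
The usual answer is compaction: periodically rewrite the log, keeping only
the live version of each record. A toy sketch, assuming a hypothetical
one-update-per-line "key<TAB>value" format where the last write to a key
wins:

    def compact(old_log, new_log):
        latest = {}
        with open(old_log) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                latest[key] = value    # later updates shadow earlier ones
        with open(new_log, "w") as f:  # dead versions are simply dropped
            for key, value in latest.items():
                f.write("%s\t%s\n" % (key, value))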

------
lupin_sansei
_8 GB of flash memory cost about $80_

It does? Where can I get that?

~~~
paul
$59.99:
<http://www.newegg.com/Product/Product.asp?Item=N82E16820151029&ATT=20-151-029&CMP=OTC-Froogle>

------
dyu
What about dirty reads, locking, lost updates, and all those problems that a
DBMS takes care of for you? Just wondering.

------
Readmore
So what we need is a DB that stores itself completely in memory and only
writes to disk every N minutes?

~~~
paul
No, it can write continually, but in a sequential fashion.
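
In other words, something like the sketch below (format hypothetical): all
reads are served from RAM, the only disk traffic is a continuous sequential
append, and recovery just replays the log.

    import os

    class MemDB:
        def __init__(self, path):
            self.data = {}
            if os.path.exists(path):
                with open(path) as f:  # recovery: replay the whole log
                    for line in f:
                        k, v = line.rstrip("\n").split("\t", 1)
                        self.data[k] = v
            self.log = open(path, "a")

        def put(self, k, v):
            self.log.write("%s\t%s\n" % (k, v))
            self.log.flush()           # sequential append, never a seek
            self.data[k] = v

        def get(self, k):
            return self.data.get(k)    # reads never touch the disk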

~~~
jmarinez
I wonder what would happen if you used ZFS on top of a flash memory drive and
replaced/added a SQL layer next to the ZPL layer. In theory, you'd get
transactions and checksums for free, and possibly your sequential writes as
well.

