
Why Is Twitter Using A Database In The First Place? - mattjaynes
http://mooseyard.com/Jens/2007/04/twitter-rails-hammers-and-11000-nails-per-second/
======
staunch
_"(Or if you're a crazed wunderkind like LiveJournal founder Brad Fitzpatrick,
you invent a memory-based distributed hashtable as a cache to put in front of
the database.)"_

The cool thing about Brad is that he released that creation as open source --
we can all benefit from his genius, like Facebook already has. Memcached is an
amazingly effective way of getting the benefits of SQL storage in a simple,
scalable, and reliable way. It's impossible to over-hype how much it kicks
ass.

~~~
danw
Using MogileFS (also by Brad) might be a solution.

~~~
ralph
MogileFS looks interesting.

If a class of files requests N>1 copies, at what point after the HTTP PUT can
the application be happy that N copies exist? It seems fine to think I've
three copies of that file but what if machine failure occurs before MogileFS
has created the other two?

Also, it's intended to operate on whole files at a time, although HTTP GET
might be usable to fetch a run of bytes. If two web servers both try and write
the same filename, doesn't the latest one win?

I can see it's great for certain things, e.g. storing the user's images, but
not for the stuff traditionally in the database. Or have I missed something?

Cheers, Ralph.

------
timg
SQL Databases are so astonishingly slow that I just switched my most intensive
app to only use the database for backing up the data to disk and reloading
from it on startup.

This arrangement is so much faster and easier to understand/measure/optimize
for me that I can't see myself going back.

~~~
staunch
Can you be more detailed about what you're doing and the application? Has your
solution opened your app to corruption or data loss in the event of app or
server crashes?

~~~
timg
I have a separate thread that writes changes back to disk at its leisure.
There is also a special shutdown/restart mode that I can trigger which stops
accepting input from the user and flushes everything to the db.

This can be implemented very simply and has lots of advantages all around.
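
Roughly, the shape is something like the sketch below. This is an illustration and not my actual code: the class name, the `kv` table, and the use of SQLite are all assumptions. Note that anything still in the queue is lost if the machine dies before the writer thread gets to it.

```python
import os
import queue
import sqlite3
import tempfile
import threading


class WriteBehindStore:
    """In-memory dict; a background thread persists changes to SQLite."""

    def __init__(self, path):
        self.data = {}
        self.lock = threading.Lock()
        self.pending = queue.Queue()
        self.accepting = True
        self.path = path
        self.writer = threading.Thread(target=self._drain, daemon=True)
        self.writer.start()

    def put(self, key, value):
        with self.lock:
            if not self.accepting:
                raise RuntimeError("store is shutting down")
            self.data[key] = value
            self.pending.put((key, value))  # persisted later by the writer

    def get(self, key):
        with self.lock:
            return self.data.get(key)

    def _drain(self):
        # The connection is created and used only inside the writer thread.
        db = sqlite3.connect(self.path)
        db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
        while True:
            item = self.pending.get()
            if item is None:  # sentinel queued by shutdown()
                break
            db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", item)
            db.commit()
        db.close()

    def shutdown(self):
        """Stop accepting input, then wait until everything is flushed."""
        with self.lock:
            self.accepting = False
            self.pending.put(None)
        self.writer.join()


# Usage: write twice, shut down, then read the result back from disk.
path = os.path.join(tempfile.mkdtemp(), "store.db")
store = WriteBehindStore(path)
store.put("a", "1")
store.put("a", "2")
store.shutdown()
rows = sqlite3.connect(path).execute("SELECT k, v FROM kv").fetchall()
```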

~~~
ralph
You're fortunate your data fits in RAM. And it seems you don't have to worry
about machine failure causing unwritten data to be lost despite having
accepted it from the user?

Cheers, Ralph.

~~~
timg
"your data fits in RAM"

Not so. It's not too hard to check if you have some data and then retrieve it
from the db as necessary.

~~~
ralph
Oh, you said earlier "I just switched my most intensive app to only use the
database for backing up the data to disk and reloading from [it] on startup"
so I thought you meant you only read data from the database on start-up, hence
all data fitting in core.

If you're reading data from it whenever you find the data isn't in core then
aren't you using the database?

As for the "thread that writes changes back to disk at its leisure", how do
you guard against machine failure after accepting data for writing but before
it's been written?

Cheers, Ralph.

------
mattculbreth
Remember Paul Buchheit's advice at Startup School. "Maybe consider not using a
database", or some similar statement.

~~~
ntoshev
"use in-memory hashmap for small data, Amazon S3 or filesystem for large data.
treat disk as sequential device."

~~~
mattculbreth
Yeah there you go.

------
mdakin
Twitter is using a database because the Twitter engineers chose not to
prematurely optimize their system. They now fully understand their problem
domain and thus now would be the appropriate time to make optimizations such
as replacing the SQL back-end with faster, less flexible solutions.

~~~
herdrick
The article does have a strong whiff of premature opt. However, using a SQL db
in the first place usually means extra work just to use something unhelpful.
Usually it's way less code to just skip it.

~~~
mdakin
A non-SQL solution surely can work well and be a fine solution to some
problems. Especially problems that are very well understood (like problems at
the optimization stage of the project).

But can a minimal non-SQL solution provide the basic features that people want
and expect from a persistent storage layer: transactions, allowing multiple
processes concurrent access to the data, relatively foolproof failure recovery
procedures, etc?

A decent RDBMS gives you those features out of the box in addition to allowing
you great data manipulation flexibility. It is this sweet spot of data
manipulation flexibility and fault tolerance that makes SQL/RDBMS such a
ubiquitous tool.

I suspect the code you're "skipping" is the code that would give you those
extra features that help with reliability and fault tolerance.

While it is possible to implement those features without using the RDBMS
crutch, it takes real code and real engineering effort to do it. If you are
writing "way less code" you are likely not providing replacement features.
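
To make "out of the box" concrete, here is a small sketch using SQLite through Python's built-in `sqlite3` module. The `accounts` table and the `transfer` function are invented for the example. The point is only that the all-or-nothing behavior costs a few lines when the database provides it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
db.commit()


def transfer(src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with db:  # commits on success, rolls back if an exception escapes
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Fails the CHECK constraint if src cannot cover the amount.
            db.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(db.execute("SELECT name, balance FROM accounts"))
```

A failed transfer leaves both balances untouched even though the credit to `dst` had already executed inside the transaction. Getting the same guarantee from flat files means writing the rollback logic yourself.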

------
bootload
_'... up to 11,000 requests per second'. Jesus Christ, that's a lot. Where does
that come from? ...'_

One of the bottlenecks is the continuous polling on the public timeline and
its RSS file. It makes me wonder why they don't charge for the privilege.

_'... polling for updates every 15 minutes, 24 hours a day, that's still only
about 1,000 hits/sec ...'_

Try every 0-10 seconds per client, per person using such clients. [0] If this
were happening on other sites it would be throttled.
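
A back-of-the-envelope check of how much the polling interval matters. This assumes every client polls at a fixed, evenly spread interval; the client counts below are derived from the quoted rates only and are not real Twitter numbers.

```python
def requests_per_second(clients, poll_interval_seconds):
    """Steady-state request rate if every client polls at a fixed interval."""
    return clients / poll_interval_seconds


def clients_needed(target_rps, poll_interval_seconds):
    """How many steadily polling clients it takes to produce a given rate."""
    return target_rps * poll_interval_seconds


# The article's figure: 1,000 hits/sec with a 15-minute polling interval.
clients_at_15_min = clients_needed(1_000, 15 * 60)

# The same 11,000 requests/sec with clients polling every 10 seconds.
clients_at_10_s = clients_needed(11_000, 10)

print(clients_at_15_min)  # 900000
print(clients_at_10_s)    # 110000
```

So under this model 1,000 hits/sec at 15-minute polling implies about 900,000 clients, while 11,000 requests/sec needs only about 110,000 clients if each one polls every 10 seconds.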

Reference

[0] google search, 'twitter updates timeline every seconds'

<http://www.google.com/search?q=twitter+updates+timeline+every+seconds>

~~~
jey
Why not simply render these things statically when something actually does
change... huhu.

------
ralph
Anyone know of detailed write-ups by people that didn't use a database? I too
dislike the overhead of SQL as an interface. How did they cope with the
problems that DBs solve? Did they ever have to scale to more than one "DB
server"?

Cheers, Ralph.

~~~
dhyasama
<http://radar.oreilly.com/archives/2006/04/database_war_stories_2_bloglin.html>

Check out the whole series.

~~~
ralph
Thanks, I've read that and will continue through the others.

One of the cases given could fit all the data in core; the other used
BerkeleyDB for its smallish "database" data, cutting out SQL, and a GFS-like
system for the large amount of BLOB archiving it had to do. The harder case
seems to be data that doesn't fit in core and is changing rather than being
archived: flat files look awkward there, since the DB server is a convenient
place for concurrency controls.

Cheers, Ralph.

------
jaggederest
As I posted over there:

Look at Prevayler and HAppS, two systems that don't use a database at all. In-
memory persistence with write-ahead logging, and they handle give-or-take 1000
hits/s on a stock Xeon server.
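
The general idea, sketched very loosely: keep the state in memory, append every change to a log and force it to disk before applying it, and rebuild the state by replaying the log on startup. This is not the Prevayler or HAppS API; the class name and the JSON-lines log format are made up for illustration, and the sketch skips snapshots and does not handle a half-written last record.

```python
import json
import os
import tempfile


class LoggedDict:
    """In-memory dict whose every change is logged to disk before it is applied."""

    def __init__(self, log_path):
        self.state = {}
        self.log_path = log_path
        if os.path.exists(log_path):
            self._replay()
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild the in-memory state by re-applying every logged command.
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                self._apply(json.loads(line))

    def _apply(self, cmd):
        if cmd["op"] == "set":
            self.state[cmd["key"]] = cmd["value"]
        elif cmd["op"] == "delete":
            self.state.pop(cmd["key"], None)

    def _execute(self, cmd):
        # Write-ahead: log first, force it to disk, then mutate memory.
        self.log.write(json.dumps(cmd) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self._apply(cmd)

    def set(self, key, value):
        self._execute({"op": "set", "key": key, "value": value})

    def delete(self, key):
        self._execute({"op": "delete", "key": key})

    def close(self):
        self.log.close()


# Usage: make some changes, then simulate a restart by reopening the log.
log_path = os.path.join(tempfile.mkdtemp(), "commands.log")
first = LoggedDict(log_path)
first.set("user:1", "jens")
first.set("user:2", "brad")
first.delete("user:1")
first.close()

recovered = LoggedDict(log_path)  # state is rebuilt purely from the log
```

The log is only ever appended to, so the disk is used as a sequential device, and a change is not acknowledged until it has been synced.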

~~~
ntoshev
Having looked into Prevayler, I think it buys you more problems than it solves.
Thanks for the HAppS reference.

Terracotta may be a good solution for Java.

~~~
jaggederest
Yeah, Prevayler is handicapped by the fact that it's Java. HAppS is really
cool; I haven't worked with Prevayler, but it's cited as an influence.

