
Scaling MySQL to 700 million inserts per day - bluesmoon
http://tech.bluesmoon.info/2009/09/scaling-writes-in-mysql.html
======
cperciva
This article was mis-titled: The title should be "how to get poor performance
by using a completely inappropriate tool".

Quoth the article: " _Several million records a day need to be written to a
table. These records are then read out once at the end of the day, summarised
and then very rarely touched again._ "

This isn't a job for a database. This is a job for a _log file_. You might
want to use a database to store the summarized data, of course; but bulk data
which is written, never updated, and quite likely only ever read once? That's
practically the definition of a log file.
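
cperciva's suggestion can be sketched in a few lines. This is a hedged illustration, not the article's code: the record shape (a URL and a response time) and the filename are invented for the example. The point is that an append is one sequential write with no index maintenance, and the end-of-day summary is one sequential read.

```python
# Log-file approach: append each record as a line, then summarize the whole
# file once at end of day. The (url, response_ms) record shape is assumed
# purely for illustration.
from collections import defaultdict

def log_record(path, url, response_ms):
    # Appending a line is a single sequential write -- no B-tree page
    # splits, no transaction overhead, no index maintenance.
    with open(path, "a") as f:
        f.write(f"{url}\t{response_ms}\n")

def summarize(path):
    # One sequential read at end of day, aggregated in memory.
    totals = defaultdict(lambda: [0, 0])  # url -> [count, sum_ms]
    with open(path) as f:
        for line in f:
            url, ms = line.rstrip("\n").split("\t")
            totals[url][0] += 1
            totals[url][1] += int(ms)
    # url -> (count, mean response time)
    return {u: (n, s / n) for u, (n, s) in totals.items()}
```

The summarized output is exactly the kind of small, relational result set that does belong in a database.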

~~~
bluesmoon
some hackers hack simply to find out what's on the other side

~~~
ntoshev
No, it's just another case of an RDBMS blinding people to all other possible
solutions.

~~~
silentbicycle
Have you ever noticed how when this happens, it's practically always MySQL?
(And when people complain about how badly relational databases perform and it
turns out they don't know about indexes, transactions, or stored procedures,
it's _always_ MySQL...)

~~~
donw
This.

The non-MySQL RDBMS people that I know, know when their RDBMS of choice isn't
the right tool for the job. Most of the MySQL people I know, seem to think
that you just shove everything into the DB, and let Eris sort it out.

Because no other member of the pantheon is crazy enough to even look at that
mess.

~~~
silentbicycle
I suspect a network effect is in place where the vast majority of people whose
entire understanding of databases came from "PHP + MySQL For Dummies"-caliber
books, blog posts, etc. use MySQL by default. A critical mass of them end up
perpetuating bad advice, griping about how since MySQL hasn't worked well for
them, all RDBMSs are junk, etc. Then, people looking for advice starting with
databases read _their_ tutorials. To some degree, MySQL probably brought this
upon themselves[1], but now there's a feedback loop in place[2][3][4].

    [1]: http://www.reddit.com/r/programming/comments/764fp/mysql_vs_postgresql/c05sayb
    [2]: http://sqlanywhere.blogspot.com/2008/03/unpublished-mysql-faq.html
    [3]: http://news.ycombinator.com/item?id=561277
    [4]: http://ask.metafilter.com/117908/Theres-got-to-be-a-faster-way-to-update

There seems to be a similar effect between Linux and BSD. I'm not going to
claim that BSD is objectively superior in every regard, but on average, the
BSD community seems to be quite a bit better informed than the Linux
community. It may just be because Linux is so much more visible. People using
BSD, Postgres, etc. probably already knew enough to evaluate their options,
while the path of least resistance stays overrun with clueless newbies.

------
MichaelGG
I wonder why they used a database at all: "These records are then read out
once at the end of the day, summarised and then very rarely touched again."

And later on they end up writing them to a text file and bulk inserting them
in 10K batches. Wouldn't it be easier to just write them all to a file then
summarize at the end of the day? (It also says they do a DROP TABLE on each
day's data.)
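
The 10K-batch step the article describes amounts to accumulating rows and flushing them in one multi-row statement per batch instead of one round trip per record. Here is a minimal sketch of that batching idea; sqlite3 stands in for MySQL only so the example is self-contained, and the table name and columns are assumptions:

```python
# Batch-insert sketch: flush accumulated rows every BATCH_SIZE inserts
# rather than issuing one statement per record.
import sqlite3

BATCH_SIZE = 10_000  # the batch size mentioned in the article

def bulk_insert(conn, rows):
    cur = conn.cursor()
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            cur.executemany(
                "INSERT INTO beacon (url, response_ms) VALUES (?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:  # flush the final partial batch
        cur.executemany(
            "INSERT INTO beacon (url, response_ms) VALUES (?, ?)", batch)
        conn.commit()
```

With a flat log file, of course, the "batching" is free: the OS buffers the appends for you.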

------
forkqueue
A couple of points for those thinking about doing something similar:

"Since InnoDB stores the table in the primary key, I decided that rather than
use an auto_increment column, I'd cover several columns with the primary key
to guarantee uniqueness. This had the added advantage that if the same record
was inserted more than once, it would not result in duplicates."

The 'correct' way to deal with this in MySQL is to use the
auto_increment_increment and auto_increment_offset system variables.

[http://dev.mysql.com/doc/refman/5.0/en/server-system-variabl...](http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html)
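
The arithmetic behind those variables is simple: each of N masters hands out ids congruent to its offset modulo the increment, so concurrent inserts on two masters can never collide. A pure-Python simulation of that scheme (not MySQL itself) for a two-master setup:

```python
# Simulate auto_increment_offset / auto_increment_increment: each server's
# id sequence starts at its offset and steps by the shared increment, so
# the sequences are disjoint.
def id_generator(offset, increment):
    value = offset
    while True:
        yield value
        value += increment

# Two masters, increment = number of servers.
master1 = id_generator(offset=1, increment=2)  # 1, 3, 5, 7, ...
master2 = id_generator(offset=2, increment=2)  # 2, 4, 6, 8, ...
```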

Of course, the real difficulty with multi-master setups split across data
centres isn't ensuring uniqueness of primary keys, it's ensuring data
integrity under a split-brain scenario, i.e. where one server can't reach the
other, but users can reach one or the other. UPDATEs and DELETEs to rows can
then become extremely difficult to merge back together. This wasn't a problem
for this application, but as others have commented, this use case probably
wasn't best suited for an RDBMS anyway.

~~~
bluesmoon
with an autoincrement id, duplicate rows may get inserted. having a primary
key derived from the data results in duplicate rows getting discarded (this is
important).

secondly, the autoincrement id adds 4 bytes to each row which are never used
for anything. only use an id if you need to reference a row from another
table.
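
The dedup property bluesmoon describes can be demonstrated in a few lines. This sketch uses sqlite3 as a stand-in for InnoDB, with invented column names; `INSERT OR IGNORE` is sqlite's analogue of MySQL's `INSERT IGNORE` (the article doesn't say exactly how the duplicate-key case was handled, so that part is an assumption):

```python
# A primary key built from the data itself makes re-inserted duplicates
# vanish, with no separate autoincrement id to store.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE beacon (
        ts    INTEGER,
        host  TEXT,
        url   TEXT,
        ms    INTEGER,
        PRIMARY KEY (ts, host, url)  -- natural key covering several columns
    )
""")

def record(row):
    # A duplicate (same ts, host, url) is silently discarded.
    conn.execute("INSERT OR IGNORE INTO beacon VALUES (?, ?, ?, ?)", row)
```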

~~~
silentbicycle
That's a good warning sign that the data isn't relational in the first place.
An RDBMS would probably be a good fit for the analyzed data (which is likely
to have explicit _relations_ ), though.

~~~
encoderer
No it's not. There are plenty of tables that use composite keys that also
contain foreign keys.

The use of ORMs like active record, many of which choke on natural keys, has
turned a lot of devs into automatons for artificial key creation. Natural keys
are often superior.

------
Paul_Morgan
Why no discussion of the disk drives supporting this database? The write
performance dropped to 150 when the disk starts getting pounded as, up to that
point, everything was in a memory buffer. Does the temporal partition spread
the file over multiple drives? What's the capacity and theoretical throughput
of the drives?

~~~
bluesmoon
RAID 10 7200rpm 6x1TB

------
danwolff2
Interesting use of tables, as previously commented, but we at room.ly are
surely appreciative of your post, as we are currently on a PHP/MySQL diet and
looking to reach critical mass soon after a few updates.

------
raghus
This seems to be at Yahoo!

