
Loading half a billion rows into MySQL - conesus
http://derwiki.tumblr.com/post/24490758395/loading-half-a-billion-rows-into-mysql
======
alexatkeplar
Assuming your event data is immutable (i.e. no UPDATEs, just INSERTs), you'd
probably have fewer headaches long-term if you just dumped the database to
flatfiles, stored in HDFS and queried using Hive (which has MySQLish query
syntax anyway). This architecture will take you to billions of rows quite
happily.

This is the architecture we use for eventstream analysis at SnowPlow
(<https://github.com/snowplow/snowplow>).

~~~
firemanx
How big are the flatfiles that you're storing in HDFS? I've looked at it
before for such a use, but for durability I want to write events in an
isolated manner, which means lots and lots of small writes, either to single
files or as a series of small files. I was under the impression that HDFS
doesn't perform well in a use case like this (due to its large write block
size), but would LOVE if I could use it like that!

~~~
MichaelSalib
We're using HDFS with a periodic merger process that occasionally merges small
files into larger files. Given the block size, HDFS really does want larger
files, but it can tolerate a decent number of small files. The bigger problem
with this approach is providing a consistent view of the dataset so that
already running programs don't have the world totally change out from under
them.

------
whalesalad
I want to give this 10 upvotes. This is one of those things you only see on HN
occasionally, and it's full of all kinds of awesome little nuggets. That
command to get the count from the information_schema in and of itself is gold
(I now know about the tee command).

~~~
jakejake
I ran this select from information_schema to get the count of a table with
approx 1.2 million rows. Each time I run the query, though, the number
returned is different. It seems to vary by about 200k either way. Does anybody
know why that would be happening?

~~~
kristjan
It's just an estimate for the query engine, not a true row count. But good
enough for a progress bar.

From <http://dev.mysql.com/doc/refman/5.0/en/tables-table.html>:

> For InnoDB tables, the row count is only a rough estimate used in SQL
> optimization.
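
To illustrate (database and table names here are hypothetical), a sketch of
the two options:

    -- Fast but approximate: InnoDB only keeps a rough row estimate here
    SELECT table_rows
      FROM information_schema.tables
     WHERE table_schema = 'mydb' AND table_name = 'actions';

    -- Exact but slow: has to scan an index across the whole table
    SELECT COUNT(*) FROM mydb.actions;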

~~~
fizx
If you run it a dozen times, it converges on the truth.

------
spudlyo
_I saw chunk load time increase from 1m40s to around an hour per million
inserts._

Your insert performance falls off a cliff when a majority of the index pages
for the table no longer fit into the innodb buffer pool. After that happens,
there is gonna be a bunch of random i/o. You can solve this problem by using
partitioning, that way only a single partition's worth of index pages need to
fit into the buffer pool to keep inserts into that partition fast. Of course
you have to size your partitions accordingly.

A few other tips: disable fsync() at commit entirely by setting
innodb_flush_log_at_trx_commit=0. If you crash during the data load, start
over. Set your transaction logs to be as large as possible, which is usually
4G.
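
A minimal sketch of those settings (the variable is a real MySQL one and is
dynamic; the log sizes go in my.cnf, need a restart, and are only illustrative):

    -- 1 = flush redo log to disk at every commit (default, safest)
    -- 2 = write at commit, flush roughly once per second
    -- 0 = write and flush roughly once per second, nothing at commit
    SET GLOBAL innodb_flush_log_at_trx_commit = 0;

    -- Redo log capacity is innodb_log_file_size * innodb_log_files_in_group.
    -- Not dynamic in 5.1/5.5; set it in my.cnf and restart, e.g.:
    --   innodb_log_file_size = 2G
    --   innodb_log_files_in_group = 2

    -- Once the load is done, restore durable commits
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;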

------
jtchang
Does it blow anyone else's mind that database servers are so powerful nowadays
that you can just load half a billion rows into a SINGLE TABLE and pretty much
expect things to work? Obviously there are caveats but all in all a nice chunk
of RAM backed by a large SSD goes a long way.

~~~
fleitz
I'm more impressed with how cheap it is.

An SSD is on par with a 15K disk, but delivers 10X the perf, throw 10 SSDs in
an array and you've got what used to cost $100,000 for about $5K

~~~
Andys
But SSDs that cost the same as a 15K disk do not have anywhere near the
lifetime of the spinning disk, due to write endurance limits.

So load half a billion rows but don't update it too often!

~~~
henrikschroder
SSDs live long enough. You get magnitudes more IOPS performance out of a
single SSD than a RAID of 15k disks, it consumes magnitudes less power, it
costs less, and it needs less maintenance. It's a complete no-brainer these
days to put an SSD in your database server.

------
juiceandjuice
I routinely load and reload ~7 billion rows into Oracle 11g, once every 5
months or so. It takes about 4 days, or 20 days if you do something stupid
like creating the indexes before loading, although I think Oracle can go quite
a bit faster; that 4 days is limited by processing and non-DB I/O.

We use Oracle because the partitioning options are better and it has bitmapped
indexes. (We wanted more partitions so we could use a hierarchical triangular
mesh for point-radius searches.)

~~~
reitzensteinm
To save everyone the math, that's ~20k inserts a second.

~~~
lukev
Oracle can do a hell of a lot more than that if you preload your tables as
transportable table spaces.

~~~
dkhenry
MySQL can do a whole lot more than that too. I routinely get 15k to 20k writes
per second while the server is under normal load. You can spike that a great
deal higher depending on what else you're doing and what kind of table you're
writing to.

------
cookiecaper
If this data is primarily archival and there are multiple backups of the same
dataset out there, why not use MyISAM? In my mind the only reasons to use
InnoDB are integrity-related, things like real foreign keys and a more ACIDy
commit method. If the dataset is read-only and copied in several places,
surely this stuff does not matter too much and MyISAM is much more performant.
Maybe I misread the use case?

~~~
crazygringo
I'm thinking if the server dies, MyISAM tables tend to corrupt more easily,
and repairing a table that large will take a long time.

But that's the only thing I can think of -- and since the speed/size of MyISAM
tables has a lot going for it, I'd be tempted to go with MyISAM. Anybody
else?

~~~
spudlyo
Ya, recovery time is a big deal. It's like the difference between ext2 and
ext3. With ext2 a crash is gonna require a lengthy and painful fsck,
especially if your table/filesystem is large. A journaled filesystem is a lot
like a database engine with transaction/redo logs. After you crash, you replay
the last 'n' uncommitted transactions (they're all idempotent so you can do
this multiple times) in the journal and you're good to go.

Also don't forget about MyISAM's lack of row level locks, and really nowadays
there are fewer and fewer cases where it's faster than InnoDB.

------
gcarre
"MySQL Partitioning. We decided against it because it seemed likely that it
wouldn’t be much faster than our current solution."

What if you partition by HASH(user_id) instead of partitioning by month
(<http://dev.mysql.com/doc/refman/5.5/en/partitioning-hash.html>)?

~~~
lann
(another Causes engineer here) HASH partitioning could make sense if our
queries were primarily filtering on e.g. user_id, but that isn't the case
here. Instead we are often querying against one or both of two columns:
user_id and action_id. Partitioning by timestamp allows us to ignore
partitions older than the user and/or action's creation dates.
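
For reference, a hypothetical month-partitioned table along those lines (table,
column names, and dates are all made up):

    CREATE TABLE actions (
      user_id    INT UNSIGNED NOT NULL,
      action_id  INT UNSIGNED NOT NULL,
      created_at DATETIME     NOT NULL,
      PRIMARY KEY (user_id, created_at, action_id)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
      PARTITION p201204 VALUES LESS THAN (TO_DAYS('2012-05-01')),
      PARTITION p201205 VALUES LESS THAN (TO_DAYS('2012-06-01')),
      PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

    -- A query bounded on created_at only has to read the matching partitions
    EXPLAIN PARTITIONS
    SELECT COUNT(*) FROM actions
     WHERE user_id = 42 AND created_at >= '2012-05-01';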

------
neilk
This struck me as odd:

"You could drop the indices completely and add them later, but with a table
size this big I didn’t think it would help much."

Surely this is one of the main reasons why the load speed went way down as the
database size increased. It has to be better to add an index later, possibly
_much_ better.

~~~
evanelias
I initially thought so too, but I found the same result -- loading indexes
later gave little or no improvement in import speed. Sometimes it can even be
a penalty. This actually makes sense though:

InnoDB secondary indexes are just additional B-trees, with the key storing the
indexed columns and the leaf nodes storing the full primary key (I think minus
any redundant columns with the index, but don't quote me on that).

When doing a huge import, you always want to do it in order, since InnoDB's
clustered indexes mean that the main table data is stored in primary key
order. So you don't get B-tree page splits in the primary data structure while
doing a sequential import.

Secondary indexes, however, won't have this benefit. Their insertion will
effectively be random I/O with lots of page splits and fragmentation. This
will be equally true whether you build them at import time or build them
later. In both cases, MySQL doesn't have the data ordered by the new secondary
index yet until it has built the index.

Building secondary indexes at import time also benefits from being able to do
a chunked concurrent import (multiple LOAD DATA INFILE queries at once). On my
hardware I've found this will build a secondary index on a standby slave much
faster than an ALTER TABLE will, since an ALTER TABLE rewrites the entire
table, in 5.1 anyway.

So actually when adding indexes to existing large tables, I sometimes dump out
all the data in parallel INTO OUTFILE queries, drop the table, recreate the
table with the new indexes in the DDL, and then re-import all the data with
parallel LOAD DATA INFILE queries. As a bonus this also defragments. That
said, a single ALTER TABLE is a lot less human work, if you haven't automated
your imports and exports.
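
A rough sketch of that dump-and-rebuild cycle (the paths, id ranges, and
actions table are hypothetical; each chunk would run from its own connection):

    -- Export in primary-key chunks, several connections at once
    SELECT * FROM actions WHERE id BETWEEN 1 AND 1000000
      INTO OUTFILE '/dump/actions.000.tsv';
    SELECT * FROM actions WHERE id BETWEEN 1000001 AND 2000000
      INTO OUTFILE '/dump/actions.001.tsv';
    -- ...and so on for the remaining chunks

    -- Drop and recreate the table with the new secondary index in the DDL,
    -- then re-import the chunks, again from parallel connections
    LOAD DATA INFILE '/dump/actions.000.tsv' INTO TABLE actions;
    LOAD DATA INFILE '/dump/actions.001.tsv' INTO TABLE actions;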

------
tantalor
I'm not familiar with Percona, what problem does it solve?

~~~
derwiki
Performance, for one:
<http://www.percona.com/software/percona-server/benchmarks/>

They also provide a lot of nice tools:
<http://www.percona.com/software/percona-toolkit/>

~~~
beersigns
Percona also absorbed Maatkit in their tools for those familiar with it. Very
handy stuff in there for automation of various processes.

Pretty happy with Percona as a drop in replacement for MySQL.

------
plasma
In my experience the built-in partitioning support of MySQL (using PARTITION
BY RANGE for example) is very good.

Inserts no longer drag due to huge indexes, deletes are instant (drop a
partition of stale data), and SELECTs can limit their lookups to specific
underlying files.

This is for a 500m row table of time series data.
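
For example (partition name hypothetical), dropping a month of stale data is a
quick metadata operation rather than a massive DELETE:

    -- Removes every row in that partition almost instantly
    ALTER TABLE events DROP PARTITION p201101;

    -- versus the row-by-row equivalent on an unpartitioned table:
    -- DELETE FROM events WHERE created_at < '2011-02-01';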

------
j-b
Issues that can also severely degrade a system's performance and may not come
into play until your data has been growing over extended periods of time (not
represented in the benchmark here):

1) Table statistics can grow stale and degrade the engine's ability to select
the correct access path. This can be corrected by running CREATE STATISTICS,
depending on your DB version.

2) Page fragmentation, not to be confused with disk fragmentation, when the
engine is selecting non-contiguous space. A rebuild of the db will be
necessary if your pages are extremely fragmented and/or using a non-optimized
allocation size.
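
In MySQL terms, the rough equivalents would be ANALYZE TABLE for stale
statistics and OPTIMIZE TABLE for a rebuild (table name hypothetical):

    -- Refresh the index statistics the optimizer uses to pick access paths
    ANALYZE TABLE actions;

    -- Rebuild the table and its indexes to defragment pages
    -- (for InnoDB this maps to recreating the table)
    OPTIMIZE TABLE actions;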

------
kodablah
"push as much complexity as possible to the database"

In general, this is usually bad practice and difficult to scale. I do agree w/
using a single table, but I disagree with the general premise of that
statement.

~~~
kristjan
Maybe this is more consumable as "remove as much complexity as possible from
the code". In this case, both the code and the DB become considerably simpler,
and it doesn't sound like there was anything fancy at all going on in the end
result. I think derwiki just meant the DB will handle this, so you can leave
the implementation there, not in custom sharding code.

------
tantalor

    SELCT table_rows

Should be,

    SELECT table_rows

~~~
derwiki
Thanks, fixed

~~~
autoreverse
The chown statement in the article uses a period instead of a colon.

Should it be "chown -R mysql:mysql"?

~~~
lost-theory
The period works with GNU coreutils chown (e.g. on Linux), but not other
versions.

~~~
staunch
I think it whines that it's deprecated though.

------
gtani
Good info. Did Percona fix the load-nulls ticket?

(LOAD DATA INFILE has this ticket from 2006 that, sometime during the 2000s,
was converted to a feature request:

<http://bugs.mysql.com/bug.php?id=23212>

I think that's why my load does

    FIELDS ESCAPED BY '\\'

but it's been so long I can't remember. At this point a lot of folks would
mention Postgres, but I'll refrain.)
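
For what it's worth, a sketch of the usual workaround: with the '\\' escape
character, LOAD DATA INFILE reads a bare \N field as SQL NULL (the file path
and table here are hypothetical):

    LOAD DATA INFILE '/dump/actions.tsv'
      INTO TABLE actions
      FIELDS TERMINATED BY '\t' ESCAPED BY '\\'
      LINES TERMINATED BY '\n';
    -- a column written as \N in the TSV is loaded as NULL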

------
Gonsalu
I fail to see why setting the transaction isolation level to read uncommitted
would make any difference to the data load process.
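
(For reference, the step in question is presumably something along these
lines, run in each session before the import:)

    SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;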

~~~
stonemetal
Without thinking about it too hard, my guess is there is some overhead in
tracking what is readable and what is not readable at the higher isolation
levels. Read uncommitted means it doesn't have to track anything.

~~~
Gonsalu
Writes still take locks. The isolation level is called _Read Uncommitted_ for
a reason.

------
tlrobinson
Good stuff. I remember trying to load a Wikipedia database dump into MySQL on
my laptop years ago and giving up.

------
dkhenry
I want to say this gentleman is doing it wrong. If his current set-up is what
I perceive from the description, he is not going to be gaining any speed or
scalability by moving to one giant table. In fact he might be losing some. It
sounds like he set up 12 monthly tables and is inserting data into each one by
month, then he is querying across all 12 tables and joining the results. You
can do that query in one fell swoop using UNION. If you have a significant
number of entries, the overhead of opening the 12 tables is negligible. If
you're looking up on a primary key then you will be more than fast enough.

Switch your tables to MyISAM and make a 13th table. Insert the current month
into the 13th table and query across all 13 tables when you need to get data
out. At the end of the month, move all records in the 13th table into its
proper month table. (Make the 13th table a memory table and sync it to disk
every few minutes, or put it on a really fast SSD.)
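
A sketch of the query shape being proposed (table names hypothetical; UNION
ALL skips the duplicate-elimination pass a plain UNION would do):

    SELECT user_id, action_id, created_at FROM actions_2012_01 WHERE user_id = 42
    UNION ALL
    SELECT user_id, action_id, created_at FROM actions_2012_02 WHERE user_id = 42
    UNION ALL
    -- ...one branch per monthly table, plus the current "13th" table...
    SELECT user_id, action_id, created_at FROM actions_current WHERE user_id = 42;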

~~~
eklitzke
Why do you think it's going to be faster to do a UNION query against 12 (or
13) tables than it will be to do a non-UNION query against a single table?

It's also unclear to me why you're suggesting the author switch to MyISAM.
Presumably they're using InnoDB because it's actually ACID compliant, and even
if they're not, according to MySQL's own benchmarks InnoDB is as fast or
faster than MyISAM for heavy read workloads, e.g. as described at
<http://www.oracle.com/partners/en/knowledge-zone/mysql-5-5-innodb-myisam-522945.pdf>

~~~
dkhenry
The main source for my reasoning is experience.

The UNION query will be faster because you will be doing a UNION against 12
MyISAM tables instead of doing a single query against an InnoDB table. MyISAM
is much faster in read-only situations. Once you start writing to it, it
becomes slower, but the 13th table keeps writes off the other 12; for that
13th table you might need to go with a slower engine like InnoDB. (Yeah, I
know about that benchmark; it's a flawed benchmark.)

------
amalag
If you are storing this data for archival purposes & querying, did you check
out Infobright? They have built it as a MySQL engine.

------
mcartyem
Why would anyone partition tables -- ever?

~~~
kristjan
Lots of reasons :)

~~~
derwiki
... says the engineer who may well be responsible for this Monthly sharding :)

------
evanelias
All great advice! As a fellow habitual migrator of billions of rows in MySQL,
a few things I'd add:

Do several chunked LOAD DATA INFILE queries in parallel. It's counter-
intuitive but you'll get a performance boost. I usually chunk large data sets
into 100+ pieces and do 40 concurrent import queries. Mileage may vary,
depends a lot on your hardware and data set's sweet spot.

Don't use mysqldump for the export. SELECT...INTO OUTFILE is faster,
especially since you can apply the exact same parallelized chunking technique.
Do this on a standby slave (ie, one not receiving reads) with replication
stopped, so that you get a consistent dump at a point in time.

The TSV dump doesn't need to be in your database directory. It can be anywhere
mysqld can access. It can be beneficial to put it on a separate volume
entirely, so that you're reading heavily from one volume while writing heavily
to another.

Be mindful of disk scheduler choices in Linux if you're exporting from a
faster disk to a slower disk. There's some definite edge cases that can lead
to starvation of core processes. Also, be mindful of disk scheduler choices in
Linux in general. Avoid use of CFQ on your database data volume.

Watch disk utilization with iostat or another tool of your choice. If you're
coming close to saturating one or more volumes, that probably means you're on
the right track :)

As other commenters have mentioned, the row count in information_schema (or
equivalently SHOW TABLE STATUS) is an estimate, due to how MVCC works. If you
need a more accurate count (though still not a perfect point-in-time count for
a growing data set), do it with parallel chunked SELECT COUNT(*) queries.
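
A sketch of the chunked counting, assuming an integer primary key (the ranges
are hypothetical; run the chunks from separate connections and sum the results
client-side):

    SELECT COUNT(*) FROM actions WHERE id BETWEEN 1         AND 100000000;
    SELECT COUNT(*) FROM actions WHERE id BETWEEN 100000001 AND 200000000;
    -- ...and so on; each chunk scans only its slice of the primary key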

Happy migrating!

~~~
dasil003
> _Do several chunked LOAD DATA INFILE queries in parallel. It's counter-
> intuitive but you'll get a performance boost. I usually chunk large data
> sets into 100+ pieces and do 40 concurrent import queries. Mileage may vary,
> depends a lot on your hardware and data set's sweet spot._

I'm super curious about this. Is it because loading data—even in raw CSV
rather than SQL format—is CPU-bound rather than IO bound?

~~~
evanelias
It's hard to say. A single-threaded import might not be saturating any
particular hardware component; there's other overhead in the work. And when
you account for multiple CPUs and multi-disk RAID arrays, the picture gets
more complicated.

InnoDB journals data: it writes the data sequentially first and later moves it
into its correct place, which will also be sequential by primary key. So if
you're loading in a bunch of sequential chunks in parallel, I imagine InnoDB
can probably order the final B-tree writes so that they also end up being
sequential, even if the initial writes to its logfile ended up being
interleaved.

------
advocate1
It's more work for the computer to process data stored as "rows" than data
stored as columns.

Think of how a disk is organized.

If you abandon MySQL and store data as columns you can load a trillion rows.

But forget I mentioned this.

~~~
dfragnito
We have been doing some tests with <http://www.monetdb.org/Home>. We saw
inserts per second in the range of 50k/sec on a laptop.

~~~
advocate1
mmap. +1. 10MB source tarball. -1. Looks like it's worth a test. Thanks for
the reference.

------
advocate1
This discussion reminds me of non-technical users. They will tolerate
incredible sluggishness. It is amazing what Oracle and the free alternatives
get away with. People, including engineers/developers, not only tolerate
sluggishness but when something actually works on the first try they think
it's amazing.

I've witnessed this tolerance for mediocrity for years and it still continues
to blow my mind.

------
dstorrs
FTA:

"...I decided to migrate the existing system into a single table setup."

"Alternative Proposed Solutions: MySQL Partitioning and Redis"

I'm surprised he didn't consider Mongo, Couch, etc.

~~~
derwiki
My coworkers and I have absolutely zero experience with Mongo, Couch, etc. We
do have experience with MySQL and Redis (see
<https://github.com/causes/mock_redis>), and I feel "safer" keeping important
data in something as mature as MySQL.

~~~
whalesalad
I've heard so many stories of people moving away from MongoDB once they reach
a scale like yours; I don't blame you.

