Hacker News new | past | comments | ask | show | jobs | submit login
Loading half a billion rows into MySQL (derwiki.tumblr.com)
295 points by conesus on June 5, 2012 | hide | past | web | favorite | 99 comments

Assuming your event data is immutable (i.e. no UPDATEs, just INSERTs), you'd probably have fewer headaches long-term if you just dumped the database to flatfiles, stored in HDFS and queried using Hive (which has MySQLish query syntax anyway). This architecture will take you to billions of rows quite happily.

This is the architecture we use for eventstream analysis at SnowPlow (https://github.com/snowplow/snowplow).

At my company, we experimented with Hive and found it to be too slow. Once you decide that every row of data will be materialized into a java hash table instance, there are real performance limitations that follow from that representation. We decided to use HDFS and Hadoop but built our own query language called Trecul; it uses LLVM to jit compile dataflow code into native code. Picking off a single field to filter on is a single machine instruction for us.

The code is open source (albeit with rough edges) at https://github.com/akamai-tech/trecul/

Thanks for sharing Michael - Trecul looks to be the first OSS release from Akamai (at least on GitHub)?

While we're on the subject of high-performance alternatives to Hive for event analysis, we've been watching the development of Ben Johnson's behavioural db, called Sky, with interest:


Thanks for the shout out, Alex.

Michael- Trecul looks cool. I'm doing something similar with Sky. I'm building an LLVM-based query language on it called EQL (Event Query Language) that's optimized for evented data. It does a lot of function rewriting to optimize loops and avoid heap allocations so it can crank through tens of millions of events per second. It's not finished yet but it should be done in the next couple weeks.

On the other end of the spectrum, for "small data" (i.e. millions of rows as opposed to tens of billons)... check out crush-tools. I love it, use it all the time.


Thanks for the tip - looks cool, like a mini-map-reduce for Bash!

How big are the flatfiles that you're storing in HDFS? I've looked at it before for such a use, but for durability I want to write events in an isolate manner, which means lots and lots of small writes, either to single files or as a series of small files. I was under the impression that HDFS doesn't perform well in a use case like this (due to the size of it's write block size), but would LOVE if I could use it like that!

We're using HDFS with a periodic merger process that occasionally merges small files into larger files. Given the block size, HDFS really does want larger files, but it can tolerate a decent number of small files. The bigger problem with this approach is providing a consistent view of the dataset so that already running programs don't have the world totally change out from under them.

You're right, there can be something of a "small files" issue with HDFS. This is a good article for strategies to get round it: http://www.cloudera.com/blog/2009/02/the-small-files-problem...

It's also 3-4 orders of magnitude more expensive to support: MySQL is an out of the box install of a single process which many people are familiar with and which has various well-known data warehousing techniques and many GUIs and other tools for casual analysis.

Hadoop + Hive is a beast which will require multiple high memory systems just to run without daemons crashing and mysteriously deadlocking the entire cluster (i.e. you'll be searching for non-obvious messages in log files, googling and reading the source until you learn that some Java developers are still struggling with 30 year-old memory management and error handling techniques). You then need to write custom data loaders, completely architect around the write-once file I/O model, and then learn Hive before you can get a single result.

If you actually really need the things which Hadoop can better, it's worth that investment - but the decision is akin to knowing that you need to haul a billion pounds of coal and thus building a railroad makes sense. If you don't know that - and if you don't already know your access patterns well enough to carve them into stone - the overhead cost dwarfs the benefit.

It sounds like a problem better solved by triggers, or a few materialized views.

Are there any good resources which cover HDFS+Hive? I'd really love to see performance measurements but the entire premise of "Let's analyze ALL the data every single query, we'll just use a zillion workers!" has always came across to me as incredibly inefficient and computationally expensive.

I'd wager they query their data frequently and in a predictable manner, at which point using some sort data structure actually designed for fast searching (such as a simple btree in most rdbms) makes perfect sense here.

If the data is purely for archive purposes and is rarely queried or is queried in very random patterns that can't really make adequate usage of indices, then I'd agree with your suggestion.

Sadly there's not a great lot of documentation about HDFS/Hive - we're learning a lot as we go with SnowPlow.

But I do agree with you - if the same queries keep coming up frequently, then it's worth putting some tech in place to save on ad hoc querying costs. But an RDBMS isn't a great fit for this - because the data is a) non-relational and b) immutable. Happily there's a whole class of database designed for this type of work - analytics databases such as Greenplum and Infobright (Infobright is a modded MySQL too).

In any case, if you have your event data stored in flat files, then as well as a "raw load" into your analytics database, you can also schedule regular map-reduce jobs to populate specific cubes into your analytics db. This is something we're working on for SnowPlow now as well.

I want to give this 10 upvotes. This is one of those things you only see on HN occasionally, and it's full of all kinds of awesome little nuggets. That command to get the count from the information_schema in and of itself is gold (I now know about the tee command)

I ran this select from information_schema to get the count of a table with aprox 1.2 million rows. each time I run the query, though, the number returned is different. it seems to vary by about 200k either way. does anybody know why that would be happening?

It's just an estimate for the query engine, not a true row count. But good enough for a progress bar.

From http://dev.mysql.com/doc/refman/5.0/en/tables-table.html:

> For InnoDB tables, the row count is only a rough estimate used in SQL optimization.

If you run it a dozen times, it converges on the truth.

aha, thanks

Also the while; do; loop could probably be replaced with 'watch'.

I tried 'watch' first (love that lil' command), but for some reason it parses incorrectly and doesn't work.

While it is possible to beat that command with backslashes until it works, I often just give up and make it a shell script, on the grounds that usually by the time I have something that complicated I want to "watch" I ought to be checking it into source control, or at least have it on disk for the next time I want it. For database stuff like that I've had nearly a 100% hit rate of either reusing such things, or at least consulting them later for the crafting of a related command.

Setting a really large $HISTSIZE can accomplish some of this.

Though I'll admit to occasional traumatic moments when my workstation crashes and bash history is polluted with multiple shells filling it and the stuff I've been ctrl-R invoking (recursive reverse search) for months evaporates.

You should set up a simple cron job to backup your history file (to somewhere else in your home directory) on some schedule (hourly?) and then when this happens just have an short alias that restores the last backup.


The problem is curation.

It's really handy to have stuff in history. But there's a ton of crap I don't need.

What I need to do is get better about scripting the stuff that I'm invoking frequently. Though zsh probably has an extension that does this for you, and scratches your back.

Mind: virtually all my shell scripts start off as one-liners that I've invoked and re-edited (M-x M-e FTFW) until I've realized I have something worth saving.

> Setting a really large $HISTSIZE can accomplish some of this.

Or making your history permanent. I have a reasonable sized history in my shell, but also dump out each line I typed to an eternal history file.

See if bash's ignoredups and erasedups helps.

iirc, watch calls /bin/sh to execute what you give it. Maybe there was a quoting issue?

I saw chunk load time increase from 1m40s to around an hour per million inserts.

Your insert performance falls off a cliff when a majority of the index pages for the table no longer fit into the innodb buffer pool. After that happens, there is gonna be a bunch of random i/o. You can solve this problem by using partitioning, that way only a single partition's worth of index pages need to fit into the buffer pool to keep inserts into that partition fast. Of course you have to size your partitions accordingly.

A few other tips. Disable fsync() at commit entirely. Set innodb_flush_log_at_trx_commit=0. If you crash curing the data load, start over. Set your transaction logs to be as large as possible, which is usually 4G.

Does it blow anyone else's mind that database servers are so powerful nowadays that you can just load half a billion rows into a SINGLE TABLE and pretty much expect things to work? Obviously there are caveats but all in all a nice chunk of RAM backed by a large SSD goes a long way.

As a long time skeptic of the NoSQL craze that seems to have (happily!) peaked, I have to say that no, I'm not very surprised by this.

Databases (real databases) are using indexing technology that's been proven over not just years but decades of use and refinement. log2(500000000) is still only 28. Double that and it's only 29. And these databases' indexes are typically btrees with much higher branching factors than 2.

One of my favorite articles on this is Dennis Forbes' "The Impact of SSDs on Database Performance and the Performance Paradox of Data Explodification" (http://www.yafla.com/dforbes/The_Impact_of_SSDs_on_Database_...).

I'm more impressed with how cheap it is.

An SSD is on par with a 15K disk, but delivers 10X the perf, throw 10 SSDs in an array and you've got what used to cost $100,000 for about $5K

But SSDs that cost the same as a 15K disk do not have anywhere near the life time of the spinning disk due to write endurance limits.

So load half a billion rows but don't update it too often!

SSDs live long enough. You get magnitudes more IOPS performance out of a single SSD than a RAID of 15k disks, it consumes magnitudes less power, it costs less, and it needs less maintenance. It's a complete no-brainer these days to put an SSD in your database server.

Given the cost of power and space and vs. the number of additional drives necessary I'm not sure that longevity is worth it.

I was impressed by stuff like this in the 90s ;-)

What's impressive now is not the software but that the hardware is so cheap.

My telemetry table is >2B rows and growing. 5B is starting to worry me; .5B, not so much. This is even backed by ( not much ) spinning rust.

Hopefully you don't have an auto-increment int (2^31) as a PK, or you'll be worrying sooner than you think.

That's one reason why switching auto-increments to bigint does help (a lot):


That is interesting to know, I have been setting my index's to bigint for a long time.

Ugh. No. This uses a multi-column natural key. I have a tendency to avoid synthetic keys when I can.

I routinely load and reload ~7 billion rows into oracle 11g, once every 5 months or so. It takes about 4 days, 20 days if you do something stupid like create the indexes before loading, although I think oracle can go quite a bit faster, and that 4 days is limited by processing and non-DB I/O.

We use oracle because the partitioning options are better and bitmapped indexes. (We wanted more partitions so we could use a hierarchical triangular mesh for point-radius searches)

To save everyone the math, that's ~20k inserts a second.

Oracle can do a hell of a lot more than that if you preload your tables as transportable table spaces.

MySQL can do a whole lot more then that too. I routinly get 15k to 20k writes per second while the server is under normal load. You can spike that a great deal higher depending on what else your doing and what kind of table your writing.

Just curious; what's the hardware like? Is it reading from spinning disks?

Dell X4240 with dual quad core opteron 2384 (2700Mhz), 64GB memory, 6TB disk space in RAID-5, I believe 10000RPM drives. Oracle partitions are set by week, then subpartitioned by hierarchical triangular mesh region (http://www.skyserver.org/htm/), and our data is inherently anisotropic. Block size is 4kb I believe.

If this data is primarily archival and there are multiple backups of the same dataset out there, why not use MyISAM? In my mind the only reasons to use InnoDB are integrity-related, things like real foreign keys and a more ACIDy commit method. If the dataset is read-only and copied in several places, surely this stuff does not matter too much and MyISAM is much more performant. Maybe I misread the use case?

I'm thinking if the server dies, MyISAM tables tend to corrupt more easily, and repairing a table that large will take a long time.

But that's the only thing I can think of -- and the speed/size of MyISAM tables have a lot going for them, I'd be tempted to go with MyISAM. Anybody else?

Ya, recovery time is a big deal. It's like the difference between ext2 and ext3. With ext2 a crash is gonna require a lengthy and painful fsck, especially if your table/filesystem is large. A journaled filesystem is a lot like a database engine with transaction/redo logs. After you crash, you replay the last 'n' uncommitted transactions (they're all idempotent so you can do this multiple times) in the journal and you're good to go.

Also don't forget about MyISAM's lack of row level locks, and really nowadays there are fewer and fewer cases where it's faster than InnoDB.

* Adding an index to myisam takes the same time as recreating the table (additionally taking as much space)

* Any operation on myisam table locks it from writing

"MySQL Partitioning. We decided against it because it seemed likely that it wouldn’t be much faster than our current solution."

What if you partition by HASH(user_id) instead of partitioning by month (http://dev.mysql.com/doc/refman/5.5/en/partitioning-hash.htm...)?

(another Causes engineer here) HASH partitioning could make sense if our queries were primarily filtering on e.g. user_id, but that isn't the case here. Instead we are often querying against one or both of two columns: user_id and action_id. Partitioning by timestamp allows us to ignore partitions older than the user and/or action's creation dates.

This struck me as odd:

"You could drop the indices completely and add them later, but with a table size this big I didn’t think it would help much."

Surely this is one of the main reasons why the load speed went way down as the database size increases. It has to be better to add an index later, possibly much better.

I initially thought so too, but I found the same result -- loading indexes later gave little or no improvement in import speed. Sometimes it can even be a penalty. This actually makes sense though:

InnoDB secondary indexes are just additional B-trees, with the key storing the indexed columns and the leaf nodes storing the full primary key (I think minus any redundant columns with the index, but don't quote me on that).

When doing a huge import, you always want to do it in order, since InnoDB's clustered indexes mean that the main table data is stored in primary key order. So you don't get B-tree page splits in the primary data structure while doing a sequential import.

Secondary indexes, however, won't have this benefit. Their insertion will effectively be random I/O with lots of page splits and fragmentation. This will be equally true whether you build them at import time or build them later. In both cases, MySQL doesn't have the data ordered by the new secondary index yet until it has built the index.

Building secondary indexes at import time also benefits from being able to do a chunked concurrent import (multiple LOAD DATA INFILE queries at once). On my hardware I've found this will build a secondary index on a standby slave much faster than an ALTER TABLE will, since an ALTER TABLE rewrites the entire table, in 5.1 anyway.

So actually when adding indexes to existing large tables, I sometimes dump out all the data in parallel INTO OUTFILE queries, drop the table, recreate the table with the new indexes in the DDL, and then re-import all the data with parallel LOAD DATA INFILE queries. As a bonus this also defragments. That said, a single ALTER TABLE is a lot less human work, if you haven't automated your imports and exports.

I'm not familiar with Percona, what problem does it solve?

It's a fork of MySQL that provides substantially better durability, admin tools, an easy hot-backup solution, better instrumentation of your DB, etc etc.

Deploying is it a question of "stop db, replace mysql server binary with percona server binary, start db".

My personal opinion is that unless you have a support contract with MySQL AB, there is very little reason not to use Percona.

Percona also absorbed Maatkit in their tools for those familiar with it. Very handy stuff in there for automation of various processes.

Pretty happy with Percona as a drop in replacement for MySQL.

Their Xtradb storage engine has very tangible performance benefits over InnoDB, especially on SSDs.

Beyond the other listed benefits: Percona also provides support contracts quite inexpensively, which can be worth their weight in diamonds. I bought a support contract after our Percona instance started to slow down, and the level of quality of the analysis and the results blew my mind. And I've been doing SQL for 25 years.

In my experience the built-in partitioning support of MySQL (using PARTITION BY RANGE for example) is very good.

Inserts no longer drag due to huge indexes, delete's are instant (drop a partition of stale data), and SELECT's can limit their lookups to specific underlying files.

This is for a 500m row table of time series data.

Issues that can also severely degrade a systems performance and may not come into play until your data is growing over extended periods of time (not represented in the benchmark here):

1) Table statistics can grow stale and degrade the engines ability to select the correct access path. This can be corrected by running CREATE STATISTICS, depending on your DB version.

2) Page fragmentation, not to be confused with disk fragmentation, when the engine is selecting non-contiguous space. A rebuild of the db will be necessary if your pages are extremely fragmented and/or using a non-optimized allocation size.

"push as much complexity as possible to the database"

In general, this is usually bad practice and difficult to scale. I do agree w/ using a single table, but I disagree with the general premise of that statement.

Maybe this is more consumable as "remove as much complexity as possible from the code". In this case, both the code and the DB become considerably simpler, and it doesn't sound like there was anything fancy at all going on in the end result. I think derwiki just meant the DB will handle this, so you can leave the implementation there, not in custom sharding code.

    SELCT table_rows
Should be,

    SELECT table_rows

Thanks, fixed

The chown statement in the article uses a period instead of a colon.

Should it be "chown -R mysql:mysql" ?

The period works with GNU coreutils chown (e.g. on Linux), but not other versions.

I think it whines that it's deprecated though.

Yes, it's a good habit to get into because . is allowed in user and group names whereas : can't be; it's the field separator in /etc/{passwd,group}.

Good info. Did percona fix the load nulls ticket?

(LOAD DATA INFILE has this ticket from 2006, that,sometime during the 2000's was converted to a feature request


I think that's why my load does

but it's been so long i can't remember (at this point a lot fo folks would mention Postgres, but i'll refrain

I fail to see why setting the transaction isolation level to read uncommitted would make any difference to the data load process.

Without thinking about it to hard my guess is there is some overhead in tracking what is readable and what is not readable at the higher read levels. Read uncommitted means it doesn't have to track anything.

Writes still take locks. The isolation level is called Read-uncommitted for a reason.

Good stuff. I remember trying to load a Wikipedia database dump into MySQL on my laptop years ago and giving up.

I want to say this gentleman is doing it wrong. If his current set-up is what I perceive from the description he is not going to be gaining any speed or scalability. by moving to one giant table. In fact he might be losing some. It sounds like he set up 12 monthly tables and is inserting data into each one by month then he is querying across all 12 tables and joining the results. You can do that query in one fell swoop using UNION, If you have a significant number of entries the overhead of opening the 12 tables is negligible. If your looking up on a primary key then you will be more then fast enough.

Switch your tables into MyISAM and make a 13th table. insert the current month into the 13 table and query across all 13 tables when you need to get data out. At the end of the month move all records in the 13th table into its proper month table. ( make the 13th table a memory table sync it to disk every few minutes or put it on a really fast SSD )

Why do you think it's going to be faster to do a UNION query against 12 (or 13) tables than it will be to do a non-UNION query against a single table?

It's also unclear to me why you're suggesting the author switch to MyISAM. Presumably they're using InnoDB because it's actually ACID compliant, and even if they're not, according to MySQL's own benchmarks InnoDB is as fast or faster than MyISAM for heavy read workloads, e.g. as described at http://www.oracle.com/partners/en/knowledge-zone/mysql-5-5-i...

The main source for my reasoning is experience.

The UNION query will be faster because you will be doing a UNION against 12 MyISAM tables instead of doing a single query against a InnoDB table. MyISAM is much faster in read only situations. Once you start writing to it it becomes slower, but the 13th table prevents writes, for that 13th table you might need to go with a slow table like InnoDB. ( yeah I know about that benchmark its a flawed benchmark )

I didn't downvotes but I can see that while he gist of your comment may be reasonable, the first sentence is harsh and not necessarily correct with our limited info.

If you are storing this data for archival purposes & querying, did you check out Infobright? They have built it as a MySQL engine.

Why would anyone partition tables -- ever?

Lots of reasons :)

... says the engineer who may well be responsible for this Monthly sharding :)

All great advice! As a fellow habitual migrator of billions of rows in MySQL, a few things I'd add:

Do several chunked LOAD DATA INFILE queries in parallel. It's counter-intuitive but you'll get a performance boost. I usually chunk large data sets into 100+ pieces and do 40 concurrent import queries. Mileage may vary, depends a lot on your hardware and data set's sweet spot.

Don't use mysqldump for the export. SELECT...INTO OUTFILE is faster, especially since you can apply the exact same parallelized chunking technique. Do this on a standby slave (ie, one not receiving reads) with replication stopped, so that you get a consistent dump at a point in time.

The TSV dump doesn't need to be in your database directory. It can be anywhere mysqld can access. It can be beneficial to put it on a separate volume entirely, so that you're reading heavily from one volume while writing heavily to another.

Be mindful of disk scheduler choices in Linux if you're exporting from a faster disk to a slower disk. There's some definite edge cases that can lead to starvation of core processes. Also, be mindful of disk scheduler choices in Linux in general. Avoid use of CFQ on your database data volume.

Watch disk utilization with iostat or another tool of your choice. If you're coming close to saturating one or more volumes, that probably means you're on the right track :)

As other commenters have mentioned, the row count in information_schema (or equivalently SHOW TABLE STATUS) is an estimate, due to how MVCC works. If you need a more accurate count (but not quite perfect count for a specific moment-in-time for a growing data set), do that in parallel chunked SELECT COUNT(*) queries.

Happy migrating!

> Do several chunked LOAD DATA INFILE queries in parallel. It's counter-intuitive but you'll get a performance boost. I usually chunk large data sets into 100+ pieces and do 40 concurrent import queries. Mileage may vary, depends a lot on your hardware and data set's sweet spot.

I'm super curious about this. Is it because loading data—even in raw CSV rather than SQL format—is CPU-bound rather than IO bound?

It's hard to say. A single-threaded import might not be saturating any particular hardware component; there's other overhead in the work. And when you account for multiple CPUs and multi-disk RAID arrays, the picture gets more complicated.

InnoDB journals data. It's writing the data sequentially and then later arranging it to the correct place, which will also be sequential by primary key. So if you're loading in a bunch of sequential chunks in parallel, I imagine InnoDB can probably order the final B-Tree writes so that they also end up being sequential, even if the initial writes to its logfile ended up being interleaved.

It's more work for the computer to process data stored as "rows" than data stored as columns.

Think of how a disk is organized.

If you abandon MySQL and store data as columns you can load a trillion rows.

But forget I mentioned this.

We have been doing some tests with http://www.monetdb.org/Home. We saw inserts per second in the range of 50k/sec on a laptop.

mmap. +1. 10MB source tarball. -1. Looks like it's worth a test. Thanks for the reference.

what is the technical difference? I would have thought the dbms takes care of a performant storage/organization system for me.

This discussion reminds me of non-technical users. They will tolerate incredible sluggishness. It is amazing what Oracle and the free alternatives get away with. People, including engineers/developers, not only tolerate sluggishness but when something actually works on the first try they think it's amazing.

I've witnessed this tolerance for mediocrity for years and it still continues to blow my mind.


"...I decided to migrate the existing system into a single table setup."

"Alternative Proposed Solutions: MySQL Partitioning and Redis"

I'm surprised he didn't consider at Mongo, Couch, etc.

My coworkers and I have absolutely zero experience with Mongo, Couch, etc. We do have experience with MySQL and Redis (see https://github.com/causes/mock_redis), and I feel "safer" keeping important data in something as mature as MySQL.

I've heard so many stories of people moving away from MongoDB once they reach scale like yours, I don't blame you.

Is there a pattern for putting multiple Redis "collectors" in front of a single RDBMS aggregate / history store?

One of the shockers I came across with MongoDB is that each instance of a key takes up memory. There is no form of a symbol table for the keys, so this means a huge amount of data overhead if each 'row' of data uses keys at all, which they likely do. No one just uses arrays.

Not really sure what you mean about the rows given that it is JSON but anyway.

I think what you are referring to is the tokenization of field names: https://jira.mongodb.org/browse/SERVER-863

Right, what I really meant was a document. I was trying to relate it to SQL, but I may have made it more confusing than anything.

That issue is what I was referring to, and it would be a good step forward. However, it's a shame you have the overhead of field names in the first place. I understand why it is like it is, being schemaless, and I suppose in terms of scaling it isn't a huge issue, as the overhead scales linearly which is manageable. But, in most cases, it's still huge compared to the size of the data itself.

I'm not sure how they'll fix it..and I don't know much about other schemaless DBs, but perhaps some sort of pattern recognization would be appropriate. Now that MongoDB has lots of funding for research, it will be interesting to see what they come up with.

As in an INFORMATION_SCHEMA collection ?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact