This is the architecture we use for eventstream analysis at SnowPlow (https://github.com/snowplow/snowplow).
The code is open source (albeit with rough edges) at https://github.com/akamai-tech/trecul/
While we're on the subject of high-performance alternatives to Hive for event analysis, we've been watching the development of Ben Johnson's behavioural db, called Sky, with interest:
Michael- Trecul looks cool. I'm doing something similar with Sky. I'm building an LLVM-based query language on it called EQL (Event Query Language) that's optimized for evented data. It does a lot of function rewriting to optimize loops and avoid heap allocations so it can crank through tens of millions of events per second. It's not finished yet but it should be done in the next couple weeks.
Hadoop + Hive is a beast which will require multiple high-memory systems just to run without daemons crashing and mysteriously deadlocking the entire cluster (i.e. you'll be searching for non-obvious messages in log files, googling, and reading the source until you learn that some Java developers are still struggling with 30-year-old memory management and error handling techniques). You then need to write custom data loaders, completely architect around the write-once file I/O model, and learn Hive before you can get a single result.
If you actually need the things which Hadoop does better, it's worth that investment - but the decision is akin to knowing that you need to haul a billion pounds of coal, at which point building a railroad makes sense. If you don't know that - and if you don't already know your access patterns well enough to carve them in stone - the overhead cost dwarfs the benefit.
I'd wager they query their data frequently and in a predictable manner, at which point using some sort of data structure actually designed for fast searching (such as a simple btree in most RDBMSs) makes perfect sense here.
If the data is purely for archive purposes and is rarely queried, or is queried in very random patterns that can't really make adequate use of indices, then I'd agree with your suggestion.
But I do agree with you - if the same queries keep coming up frequently, then it's worth putting some tech in place to save on ad hoc querying costs. But an RDBMS isn't a great fit for this - because the data is a) non-relational and b) immutable. Happily there's a whole class of database designed for this type of work - analytics databases such as Greenplum and Infobright (Infobright is a modded MySQL too).
In any case, if you have your event data stored in flat files, then as well as a "raw load" into your analytics database, you can also schedule regular map-reduce jobs to populate specific cubes into your analytics db. This is something we're working on for SnowPlow now as well.
> For InnoDB tables, the row count is only a rough estimate used in SQL optimization.
Though I'll admit to occasional traumatic moments when my workstation crashes, my bash history is polluted by multiple shells filling it, and the stuff I've been ctrl-R invoking (reverse incremental search) for months evaporates.
The problem is curation.
It's really handy to have stuff in history. But there's a ton of crap I don't need.
What I need to do is get better about scripting the stuff that I'm invoking frequently. Though zsh probably has an extension that does this for you, and scratches your back.
Mind: virtually all my shell scripts start off as one-liners that I've invoked and re-edited (M-x M-e FTFW) until I've realized I have something worth saving.
Or make your history permanent. I keep a reasonably sized history in my shell, but also dump each line I type to an eternal history file.
Your insert performance falls off a cliff when a majority of the index pages for the table no longer fit into the innodb buffer pool. After that happens, there is gonna be a bunch of random i/o. You can solve this problem by using partitioning, that way only a single partition's worth of index pages need to fit into the buffer pool to keep inserts into that partition fast. Of course you have to size your partitions accordingly.
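For illustration, here's a minimal sketch of that kind of schema in MySQL DDL; the table and column names are invented, and the partition grain is whatever keeps one partition's indexes resident in your buffer pool:

    -- Hypothetical events table partitioned by month, so inserts into the
    -- current month only need that partition's index pages in the buffer pool.
    CREATE TABLE events (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id    INT UNSIGNED    NOT NULL,
        created_at DATETIME        NOT NULL,
        payload    VARCHAR(255),
        PRIMARY KEY (id, created_at),  -- the partition column must appear in every unique key
        KEY idx_user (user_id)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p201201 VALUES LESS THAN (TO_DAYS('2012-02-01')),
        PARTITION p201202 VALUES LESS THAN (TO_DAYS('2012-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );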
A few other tips. Disable fsync() at commit entirely: set innodb_flush_log_at_trx_commit=0. If you crash during the data load, start over. Set your transaction logs to be as large as possible, which is usually 4G.
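For reference, a rough sketch of those settings; the values are illustrative, and the log-file sizing lives in my.cnf and needs a restart on older MySQL versions:

    -- Stop flushing the redo log to disk at every commit; a crash mid-load
    -- means starting the load over, as noted above.
    SET GLOBAL innodb_flush_log_at_trx_commit = 0;

    -- In my.cnf (combined redo log is capped at 4G in older MySQL versions):
    --   innodb_log_file_size      = 2047M
    --   innodb_log_files_in_group = 2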
Databases (real databases) are using indexing technology that's been proven over not just years but decades of use and refinement. log2(500000000) is still only 28. Double that and it's only 29. And these databases' indexes are typically btrees with much higher branching factors than 2.
One of my favorite articles on this is Dennis Forbes' "The Impact of SSDs on Database Performance and the Performance Paradox of Data Explodification" (http://www.yafla.com/dforbes/The_Impact_of_SSDs_on_Database_...).
An SSD is on par with a 15K disk in price but delivers 10X the performance; throw 10 SSDs in an array and you've got what used to cost $100,000 for about $5K.
So load half a billion rows but don't update it too often!
What's impressive now is not the software but that the hardware is so cheap.
We use Oracle because the partitioning options are better and it has bitmapped indexes. (We wanted more partitions so we could use a hierarchical triangular mesh for point-radius searches.)
But that's the only thing I can think of -- and since the speed/size of MyISAM tables have a lot going for them, I'd be tempted to go with MyISAM. Anybody else?
Also don't forget about MyISAM's lack of row-level locks, and really, nowadays there are fewer and fewer cases where it's faster than InnoDB.
* Any operation on a MyISAM table locks it from writing
What if you partition by HASH(user_id) instead of partitioning by month (http://dev.mysql.com/doc/refman/5.5/en/partitioning-hash.htm...)?
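A minimal sketch of that alternative, with invented names (note MySQL requires the hashed column to appear in every unique key):

    -- Spread rows across 16 partitions by user; a query filtered on user_id
    -- then touches only one partition instead of a whole month's data.
    CREATE TABLE events (
        user_id    INT UNSIGNED NOT NULL,
        created_at DATETIME     NOT NULL,
        payload    VARCHAR(255),
        PRIMARY KEY (user_id, created_at)  -- assumes one event per user per timestamp
    ) ENGINE=InnoDB
    PARTITION BY HASH (user_id)
    PARTITIONS 16;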
"You could drop the indices completely and add them later, but with a table size this big I didn’t think it would help much."
Surely this is one of the main reasons why the load speed went way down as the database size increased. It has to be better to add an index later, possibly much better.
InnoDB secondary indexes are just additional B-trees, with the key storing the indexed columns and the leaf nodes storing the full primary key (I think minus any redundant columns with the index, but don't quote me on that).
When doing a huge import, you always want to do it in order, since InnoDB's clustered indexes mean that the main table data is stored in primary key order. So you don't get B-tree page splits in the primary data structure while doing a sequential import.
Secondary indexes, however, won't have this benefit. Their insertion will effectively be random I/O with lots of page splits and fragmentation. This will be equally true whether you build them at import time or build them later. In both cases, MySQL doesn't have the data ordered by the new secondary index until it has built the index.
Building secondary indexes at import time also benefits from being able to do a chunked concurrent import (multiple LOAD DATA INFILE queries at once). On my hardware I've found this will build a secondary index on a standby slave much faster than an ALTER TABLE will, since an ALTER TABLE rewrites the entire table, in 5.1 anyway.
So actually when adding indexes to existing large tables, I sometimes dump out all the data in parallel INTO OUTFILE queries, drop the table, recreate the table with the new indexes in the DDL, and then re-import all the data with parallel LOAD DATA INFILE queries. As a bonus this also defragments. That said, a single ALTER TABLE is a lot less human work, if you haven't automated your imports and exports.
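A rough sketch of that cycle; chunk boundaries, paths, and table names are invented, and each export and import statement runs on its own connection:

    -- 1) Export in primary-key ranges, several connections at once:
    SELECT * INTO OUTFILE '/export/chunk_000.tsv'
    FROM events WHERE id >= 0 AND id < 5000000;

    -- 2) DROP TABLE, then CREATE TABLE again with the new secondary
    --    indexes included in the DDL.

    -- 3) Re-import the chunks, again several connections at a time:
    LOAD DATA INFILE '/export/chunk_000.tsv' INTO TABLE events;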
Deploying it is just a question of "stop db, replace mysql server binary with percona server binary, start db".
My personal opinion is that unless you have a support contract with MySQL AB, there is very little reason not to use Percona.
They also provide a lot of nice tools:
Pretty happy with Percona as a drop-in replacement for MySQL.
Inserts no longer drag due to huge indexes, deletes are instant (drop a partition of stale data), and SELECTs can limit their lookups to specific underlying files.
This is for a 500m row table of time series data.
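The instant delete is just a partition drop. Assuming monthly partitions named like p201201 (names invented), expiring a month looks something like:

    -- Discards the whole partition at the file level; no row-by-row
    -- DELETE, so it returns almost immediately.
    ALTER TABLE events DROP PARTITION p201201;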
1) Table statistics can grow stale and degrade the engine's ability to select the correct access path. This can be corrected by running CREATE STATISTICS, depending on your DB version.
2) Page fragmentation, not to be confused with disk fragmentation, occurs when the engine allocates non-contiguous space. A rebuild of the db will be necessary if your pages are extremely fragmented and/or you're using a non-optimized allocation size.
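In MySQL terms, the rough equivalents of those two fixes would be something like (table name invented):

    -- 1) Refresh the optimizer's statistics for the table:
    ANALYZE TABLE events;

    -- 2) Rebuild the table and its indexes to defragment pages (this
    --    rewrites the table, so expect it to be slow on a 500m-row table):
    OPTIMIZE TABLE events;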
In general, this is usually bad practice and difficult to scale. I do agree w/ using a single table, but I disagree with the general premise of that statement.
Should it be "chown -R mysql:mysql" ?
(LOAD DATA INFILE has this ticket from 2006 that, sometime during the 2000s, was converted to a feature request.)
I think that's why my load does
FIELDS ESCAPED BY '\\'
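In context, that clause sits in the load statement roughly like this (path and table name invented):

    -- Backslash-escape special characters so embedded tabs and newlines
    -- in field values don't break the row format.
    LOAD DATA INFILE '/data/events.tsv' INTO TABLE events
    FIELDS TERMINATED BY '\t' ESCAPED BY '\\'
    LINES TERMINATED BY '\n';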
Switch your tables to MyISAM and make a 13th table. Insert the current month into the 13th table and query across all 13 tables when you need to get data out. At the end of the month, move all records in the 13th table into its proper month table. (Make the 13th table a memory table and sync it to disk every few minutes, or put it on a really fast SSD.)
It's also unclear to me why you're suggesting the author switch to MyISAM. Presumably they're using InnoDB because it's actually ACID compliant, and even if they're not, according to MySQL's own benchmarks InnoDB is as fast or faster than MyISAM for heavy read workloads, e.g. as described at http://www.oracle.com/partners/en/knowledge-zone/mysql-5-5-i...
The UNION query will be faster because you will be doing a UNION against 12 MyISAM tables instead of a single query against an InnoDB table. MyISAM is much faster in read-only situations. Once you start writing to it, it becomes slower, but the 13th table absorbs the writes so the month tables stay read-only; for that 13th table you might need to go with a slower engine like InnoDB. (Yeah, I know about that benchmark; it's a flawed benchmark.)
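For what it's worth, a read under that scheme would look roughly like this (table names invented):

    -- Read across the twelve static month tables plus the hot 13th table.
    -- UNION ALL skips the duplicate-elimination pass that plain UNION does.
    SELECT user_id, created_at FROM events_2012_01 WHERE user_id = 42
    UNION ALL
    SELECT user_id, created_at FROM events_2012_02 WHERE user_id = 42
    -- ... ten more month tables ...
    UNION ALL
    SELECT user_id, created_at FROM events_current WHERE user_id = 42;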
Do several chunked LOAD DATA INFILE queries in parallel. It's counter-intuitive, but you'll get a performance boost. I usually chunk large data sets into 100+ pieces and do 40 concurrent import queries. Mileage may vary; the sweet spot depends a lot on your hardware and data set.
Don't use mysqldump for the export. SELECT...INTO OUTFILE is faster, especially since you can apply the exact same parallelized chunking technique. Do this on a standby slave (ie, one not receiving reads) with replication stopped, so that you get a consistent dump at a point in time.
The TSV dump doesn't need to be in your database directory. It can be anywhere mysqld can access. It can be beneficial to put it on a separate volume entirely, so that you're reading heavily from one volume while writing heavily to another.
Be mindful of disk scheduler choices in Linux if you're exporting from a faster disk to a slower disk. There are some definite edge cases that can lead to starvation of core processes. Also, be mindful of disk scheduler choices in Linux in general. Avoid use of CFQ on your database data volume.
Watch disk utilization with iostat or another tool of your choice. If you're coming close to saturating one or more volumes, that probably means you're on the right track :)
As other commenters have mentioned, the row count in information_schema (or equivalently SHOW TABLE STATUS) is an estimate, due to how MVCC works. If you need a more accurate count (but not quite perfect count for a specific moment-in-time for a growing data set), do that in parallel chunked SELECT COUNT(*) queries.
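The chunked count is just primary-key range scans run on separate connections; the boundaries here are invented:

    -- Each statement runs on its own connection; sum the results client-side.
    SELECT COUNT(*) FROM events WHERE id >= 0         AND id < 100000000;
    SELECT COUNT(*) FROM events WHERE id >= 100000000 AND id < 200000000;
    -- ... and so on, up to the maximum id.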
I'm super curious about this. Is it because loading data, even in raw CSV rather than SQL format, is CPU-bound rather than I/O-bound?
InnoDB journals data. It's writing the data sequentially and then later arranging it to the correct place, which will also be sequential by primary key. So if you're loading in a bunch of sequential chunks in parallel, I imagine InnoDB can probably order the final B-Tree writes so that they also end up being sequential, even if the initial writes to its logfile ended up being interleaved.
Think of how a disk is organized.
If you abandon MySQL and store data as columns you can load a trillion rows.
But forget I mentioned this.
I've witnessed this tolerance for mediocrity for years and it still continues to blow my mind.
"...I decided to migrate the existing system into a single table setup."
"Alternative Proposed Solutions: MySQL Partitioning and Redis"
I'm surprised he didn't consider Mongo, Couch, etc.
I think what you are referring to is the tokenization of field names:
That issue is what I was referring to, and it would be a good step forward. However, it's a shame you have the overhead of field names in the first place. I understand why it is like it is, being schemaless, and I suppose in terms of scaling it isn't a huge issue, as the overhead scales linearly, which is manageable. But in most cases it's still huge compared to the size of the data itself.
I'm not sure how they'll fix it... and I don't know much about other schemaless DBs, but perhaps some sort of pattern recognition would be appropriate. Now that MongoDB has lots of funding for research, it will be interesting to see what they come up with.