I always knew the company I work for does cool things with MySQL, but I never knew how far we really have pushed the envelope until I see how much more we push MySQL then the twitters and facebooks of the world. Doing some napkin math I would say our largest cluster of MySQL does about 320 Billion inserts a Day across 80 nodes and some of our nodes can peak at about 200K queries per second ( that would be 16 Million QPS if all the nodes were fully loaded ). Meaning at one of our installs we do the same volume as two facebooks !
Coincidently I work on performance monitoring at Facebook. We gave up logging perf data to MySQL because it was a management headache at scale. So instead my colleagues built Scuba, which is essentially a search engine they taught to perform aggregate functions.
I am in no way an official spokesperson for them, just a developer, I don't know how much publicity I am allowed to draw to our back end architecture without getting into trouble. It is after all the thing that separates us from everyone else out there. (BTW I have posted the Who's Hiring post a couple times so my Username is pretty easy to track to whom I work for)
The speed of their transaction search actively discourages me using it. It is extremely slow even when searching by the transaction Id which you would assume at least that would have a decent index on it.
I have, and I can tell you that the speed at which they are able to search for transactions is disturbing. Regardless of if it's related to indexes or another cause, it's still a huge issue for people who have to use PayPal. That appears to be the point the parent comment was trying to make.
Your comment had the potential to be constructive and educational, but instead you took a condescending approach. That's unfortunate.
Just the concept of using time for the logical clock is sketchy. It's not monotonically increasing, it drifts, it can go decrease, etc... Curious what their cross host/region ordering and consistency looks like.
curiously, 4629 tweets a second is 4 million a day, which is more than the number of rows per day (3 million).
this must be someone not thinking, right? i don't see how they can have less rows than tweets. even with de-duplication of repeated text they still need a key (and timestamp) per tweet.
oh! and it gets worse. the next line says "400 million new tweets a day". that one must be plain wrong (it's a rate 100x higher than the number per second), given that the other two are (comparatively) consistent (it would also mean an average of 3 tweets per day per active user (140 million), and i suspect they define active user to be anyone over 1 tweet per month...).
so the section "by the numbers" contains four items, but only two independent values. and appears to be inconsistent twice.
 almost exactly - presumably the per second value is derived from a rough figure for the daily value.
Aside from your bad math above, the reporter heard million when I said billion. More than 3 billion rows per day. That math is easy enough to come by if you do any math on the rest of the numbers: 400M (Tweets per day) * 4 (replication) = 1.6B rows per day to store the Tweets, plus the same amount for an entry in a timeline. So that's 3.2B right there. And there are a lot of other types of indexes.
Assume they normalized the timestamp, and only stored something on the level of "these people tweeted/retweeted this message sometime during this 60 second interval."
The only way to avoid adding a new row per account that tweeted/retweeted would be to store the list of account in a single row, and keep updating it on every re-tweet. This seems like it would be a less than optimal solution.
it certainly seems like you would need to aggregate keys somewhere (if it's not the typo i assumed). that might just be possible in postgres these days (update: a quick google suggests it's in development for 9.3), but i am pretty sure not in mysql. on the other hand, for all i know, at this scale, they may be treating mysql as some kind of distributed hashtable and doing a lot more work in higher layers...
[update: I completely screwed up the maths; looks like this is not rows per tweet, but support data.]
Financial transactions tend to create a lot of data for the logging and recoverability they require, and the logging isn't terribly interesting.
The trick is realizing that things can and do fail in any way for any reason, and being able to automatically recover from any point in that process. Our (WePay's) system currently requires, at an absolute minimum, seven inserts across six tables, though more commonly 9-10 rows across seven, plus updates on several others, and significantly more if we can't authorize the payment on the first attempt (e.g., mistyped ZIP code on your billing address). Only one of those tables is for logging (account_history, which is functionally identical to a check ledger).
I've completely ignored both the logging of state changes (largely redundant at this point, it was more for early debugging a couple years ago, but still sometimes useful for figuring out why something stalled) and recording of fraud-detection data which can easily be a hundred rows, albeit very small ones (80 bytes or so). It's also interesting to look at what updates need to be performed in a transaction and which ones don't, although that's of course irrelevant to the actual amount of data produced.
That doesn't account for the product side of payments at all - recording contents of shopping carts, donations, ticket purchases, etc. That's at best one stateless row, but use your imagination about the data layout for various money-collecting tools.
Archival is interesting and something I'm sure we'll look at more in the future, but right now it would tend to create more problems than it solves -- we have automated data integrity checks running multiple times per day to ensure nothing is out of line, and kicking data out to an archive somewhere would complicate that significantly. We also of course don't have nearly as much data as PayPal being significantly newer, so it's less of a problem.
conflicts/row locks being one of the more common, but that's easy to deal with. It's when an external call to a processor dies halfway through that things get tricky.