Slowest API ever....
Because there are a lot more contributors to request/response performance than just database indexes. Especially when you are at the scale of PayPal.
Your comment had the potential to be constructive and educational, but instead you took a condescending approach. That's unfortunate.
Honest question: Which part of a replication/db system would be using time stamps that weren't universal?
this must be someone not thinking, right? i don't see how they can have fewer rows than tweets. even with de-duplication of repeated text they still need a key (and timestamp) per tweet.
oh! and it gets worse. the next line says "400 million new tweets a day". that one must be plain wrong (it's a rate 100x higher than the per-second figure), given that the other two numbers are (comparatively) consistent. it would also mean an average of 3 tweets per day per active user (140 million), and i suspect they define "active user" as anyone with more than 1 tweet per month...
so the section "by the numbers" contains four items, but only two independent values, and appears to be inconsistent twice.
Almost exactly - presumably the per-second value is derived from a rough figure for the daily value.
Aside from your bad math above, the reporter heard million when I said billion. It's more than 3 billion rows per day. That figure is easy enough to come by if you do the math on the rest of the numbers: 400M (Tweets per day) * 4 (replication) = 1.6B rows per day to store the Tweets, plus the same amount for an entry in a timeline. So that's 3.2B right there. And there are a lot of other types of indexes.
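Spelled out, for anyone checking the arithmetic (a quick back-of-the-envelope using only the figures quoted above):

    # Rows per day implied by the numbers in this thread:
    # 400M Tweets/day, 4x replication, plus a matching timeline entry per Tweet.
    tweets_per_day = 400_000_000
    replication = 4

    tweet_rows = tweets_per_day * replication      # 1.6B rows/day to store the Tweets
    timeline_rows = tweets_per_day * replication   # 1.6B rows/day for timeline entries

    print(f"{tweets_per_day / 86_400:,.0f} Tweets/second on average")
    print(f"{(tweet_rows + timeline_rows) / 1e9:.1f}B rows/day before other indexes")
    # -> ~4,630 Tweets/second, 3.2B rows/day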
So that checks out... I'm guessing the number of new rows per day is excluding tweets.
From the UI, Twitter only shows tweet times to the minute, so you could imagine that there is an optimisation there for merging retweets with the same text within a 60-second period.
The only way to avoid adding a new row per account that tweeted/retweeted would be to store the list of accounts in a single row and keep updating it on every retweet. This seems like it would be a less-than-optimal solution.
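Roughly, that would look something like this (purely hypothetical, just to make the idea concrete; nothing here is Twitter's actual design):

    # Sketch of merging retweets of the same text into one row per minute bucket,
    # keeping the list of retweeting accounts inside that single row.
    import hashlib
    import time
    from collections import defaultdict

    # (text_hash, minute_bucket) -> list of account ids that retweeted it
    merged_rows = defaultdict(list)

    def retweet(account_id, text, ts=None):
        ts = time.time() if ts is None else ts
        minute_bucket = int(ts // 60)                        # UI only shows time to the minute
        text_hash = hashlib.sha1(text.encode()).hexdigest()
        merged_rows[(text_hash, minute_bucket)].append(account_id)  # update, not insert

    retweet(101, "some very popular tweet")
    retweet(202, "some very popular tweet")   # same row updated, no new row

The catch is exactly the update-on-every-retweet part: a popular tweet turns one hot row into a write-contention point, which is presumably why it looks less than optimal.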
[update: I completely screwed up the maths; looks like this is not rows per tweet, but support data.]
> for all i know, at this scale, they may be treating
> mysql as some kind of distributed hashtable and doing
> a lot more work in higher layers...
But do you think major banks on Wall Street, tracking billions of trades a day in near real-time, would be satisfied with these numbers?
It all depends on what your users demand and what they will put up with.
The trick is realizing that things can and do fail in any way for any reason, and being able to automatically recover from any point in that process. Our (WePay's) system currently requires, at an absolute minimum, seven inserts across six tables, though more commonly 9-10 rows across seven, plus updates on several others, and significantly more if we can't authorize the payment on the first attempt (e.g., mistyped ZIP code on your billing address). Only one of those tables is for logging (account_history, which is functionally identical to a check ledger).
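A rough sketch of the shape that takes (a toy schema with invented table/column names, using sqlite3 for brevity - not our actual code):

    # Every step records its state, so a payment that dies part-way through can be
    # found and re-driven later instead of being lost.
    import sqlite3

    def init(db):
        db.executescript("""
            CREATE TABLE payments (id INTEGER PRIMARY KEY, account_id INT,
                                   amount_cents INT, state TEXT);
            CREATE TABLE account_history (id INTEGER PRIMARY KEY, account_id INT,
                                          payment_id INT, amount_cents INT);
        """)

    def create_payment(db, account_id, amount_cents):
        with db:  # one transaction: the related rows commit together or not at all
            cur = db.execute(
                "INSERT INTO payments (account_id, amount_cents, state) VALUES (?, ?, 'created')",
                (account_id, amount_cents))
            payment_id = cur.lastrowid
            # ...a real payment adds several more rows here (fees, instrument, timeline, ...)
            db.execute(
                "INSERT INTO account_history (account_id, payment_id, amount_cents) VALUES (?, ?, ?)",
                (account_id, payment_id, amount_cents))
        return payment_id

    def record_authorization(db, payment_id, ok):
        # The call to the external processor happens outside any DB transaction;
        # only the resulting state change is written transactionally.
        with db:
            db.execute("UPDATE payments SET state = ? WHERE id = ?",
                       ('authorized' if ok else 'auth_failed', payment_id))

    def stuck_payments(db):
        # Anything still 'created' after a crash or a dropped processor call
        # shows up here and gets re-driven rather than abandoned.
        return [r[0] for r in db.execute("SELECT id FROM payments WHERE state = 'created'")]

    db = sqlite3.connect(":memory:")
    init(db)
    pid = create_payment(db, account_id=42, amount_cents=1999)
    record_authorization(db, pid, ok=True)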
I've completely ignored both the logging of state changes (largely redundant at this point; it was more for early debugging a couple of years ago, but it's still sometimes useful for figuring out why something stalled) and the recording of fraud-detection data, which can easily be a hundred rows, albeit very small ones (80 bytes or so). It's also interesting to look at which updates need to be performed in a transaction and which don't, although that's of course irrelevant to the actual amount of data produced.
That doesn't account for the product side of payments at all - recording contents of shopping carts, donations, ticket purchases, etc. That's at best one stateless row, but use your imagination about the data layout for various money-collecting tools.
Archival is interesting and something I'm sure we'll look at more in the future, but right now it would tend to create more problems than it solves -- we have automated data integrity checks running multiple times per day to ensure nothing is out of line, and kicking data out to an archive somewhere would complicate that significantly. We also of course don't have nearly as much data as PayPal being significantly newer, so it's less of a problem.
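As one example of the kind of check archival would complicate - an illustrative reconciliation query against the toy schema sketched above, not one of our actual checks:

    # Every authorized payment should have ledger (account_history) entries that
    # sum to exactly its amount; rows shipped off to an archive would silently
    # drop out of this whole-table reconciliation.
    def mismatched_payments(db):
        return db.execute("""
            SELECT p.id, p.amount_cents, COALESCE(SUM(h.amount_cents), 0) AS ledger_total
            FROM payments p
            LEFT JOIN account_history h ON h.payment_id = p.id
            WHERE p.state = 'authorized'
            GROUP BY p.id, p.amount_cents
            HAVING COALESCE(SUM(h.amount_cents), 0) != p.amount_cents
        """).fetchall()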
Conflicts/row locks are one of the more common ones, but those are easy to deal with. It's when an external call to a processor dies halfway through that things get tricky.
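The easy part is just a retry loop, roughly (a generic sketch, not our actual code; 1213 and 1205 are MySQL's deadlock and lock-wait-timeout error codes):

    import time

    RETRYABLE_MYSQL_ERRORS = {1213, 1205}   # ER_LOCK_DEADLOCK, ER_LOCK_WAIT_TIMEOUT

    def run_with_retry(do_transaction, attempts=3):
        # Replay the whole transaction on a lock conflict; anything else bubbles up.
        for attempt in range(attempts):
            try:
                return do_transaction()
            except Exception as e:
                code = e.args[0] if e.args else None
                if code in RETRYABLE_MYSQL_ERRORS and attempt + 1 < attempts:
                    time.sleep(0.1 * (attempt + 1))   # brief backoff, then retry
                    continue
                raise

The hard case can't be handled that way: if the processor call dies mid-flight you don't know whether the charge went through, so you're back to the recorded intermediate states and reconciliation rather than a blind retry.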
And Skype sounds pretty interesting too.