

Twitter, PayPal reveal database performance - werner
http://www.itnews.com.au/News/317811,twitter-paypal-reveal-database-performance.aspx

======
dkhenry
I always knew the company I work for does cool things with MySQL, but I never
knew how far we had really pushed the envelope until I saw how much more we
push MySQL than the Twitters and Facebooks of the world. Doing some napkin
math, I would say our largest MySQL cluster does about 320 billion inserts a
day across 80 nodes, and some of our nodes can peak at about 200K queries per
second (that would be 16 million QPS if all the nodes were fully loaded).
Meaning at one of our installs we do the same volume as two Facebooks!
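
Checking that napkin math with the figures above (the per-node average is my
own derived number, not something we publish):

    # Sanity check on the cluster figures quoted above.
    SECONDS_PER_DAY = 86_400

    inserts_per_day = 320e9       # 320 billion inserts/day, cluster-wide
    nodes = 80
    peak_qps_per_node = 200_000   # peak queries/second on a single node

    avg_inserts_per_sec = inserts_per_day / SECONDS_PER_DAY  # ~3.7M/s
    avg_per_node = avg_inserts_per_sec / nodes               # ~46K/s per node
    cluster_peak_qps = peak_qps_per_node * nodes             # 16M QPS

    print(f"{avg_inserts_per_sec:,.0f} inserts/s cluster-wide")
    print(f"{avg_per_node:,.0f} inserts/s per node on average")
    print(f"{cluster_peak_qps:,} QPS if every node peaked at once")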

~~~
ericcholis
Out of curiosity, what industry do you work in?

~~~
dkhenry
Performance Monitoring.

~~~
aristus
Coincidentally, I work on performance monitoring at Facebook. We gave up
logging perf data to MySQL because it was a management headache at scale.
Instead, my colleagues built Scuba, which is essentially a search engine they
taught to perform aggregate functions.

<http://gigaom.com/cloud/exclusive-facebooks-scuba-project-dives-into-performance-data/>
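
To give a flavor of what "a search engine taught to perform aggregate
functions" means, here is my own toy illustration (not Scuba's actual
design): scan everything that matches and fold the aggregates on the fly,
instead of relying on pre-built indexes.

    from collections import defaultdict

    # Toy scan-and-aggregate: records are flat dicts, there are no indexes,
    # and a "query" is a predicate plus a group-by over a full scan.
    records = [
        {"endpoint": "/home", "latency_ms": 120},
        {"endpoint": "/home", "latency_ms": 95},
        {"endpoint": "/search", "latency_ms": 340},
    ]

    def avg_latency_by(records, group_key, predicate=lambda r: True):
        sums, counts = defaultdict(float), defaultdict(int)
        for r in records:              # the "search" part: scan everything
            if predicate(r):
                sums[r[group_key]] += r["latency_ms"]
                counts[r[group_key]] += 1
        return {k: sums[k] / counts[k] for k in sums}  # the aggregate part

    print(avg_latency_by(records, "endpoint"))
    # {'/home': 107.5, '/search': 340.0}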

~~~
dkhenry
Lots of people gave up trying to do their own performance logging; it turns
out it's a non-trivial problem when you get to large-scale systems. Which is
why the company I work for exists.

~~~
aristus
Does the name start with New and end with Relic, by chance? :) I'd love to
compare notes sometime. My email's in my profile.

~~~
dkhenry
No, but I understand why you might think that. I sent you an email.

~~~
nc17
Your company is easy to find on Google. It's not stealth, and it's venture-
funded. Curious why you can't or won't mention the name here.

~~~
dkhenry
I am in no way an official spokesperson for them, just a developer, and I
don't know how much publicity I am allowed to draw to our back-end
architecture without getting into trouble. It is, after all, the thing that
separates us from everyone else out there. (BTW, I have posted in the Who's
Hiring thread a couple of times, so it's pretty easy to work out from my
username whom I work for.)

------
ericcholis
Not surprising, especially if you've ever used their interface or API. I'm
reasonably certain that they don't have any indexes.

Slowest API ever....

~~~
ch0wn
The sandbox they provide is even worse in terms of response times.

~~~
nopassrecover
And lacks documentation. Or the documentation that is there is wrong. Or it's
there but missing fundamental details. Or it's right but you question how it
could possibly be a good design decision.

~~~
ch0wn
Or the documentation describes something as it works in the sandbox, but not
in the production system, or vice versa. Developing for PayPal is a lot of
fun.

------
robryan
The speed of their transaction search actively discourages me from using it.
It is extremely slow even when searching by transaction ID, which you would
assume would at least have a decent index on it.

~~~
taligent
Have you EVER actually worked on a reasonably sized system?

Because there are a lot more contributors to request/response performance
than just database indexes. Especially when you are at the scale of PayPal.

~~~
drm237
I have, and I can tell you that the speed at which they are able to search
for transactions is disturbing. Regardless of whether it's related to indexes
or some other cause, it's still a huge issue for people who have to use
PayPal. That appears to be the point the parent comment was trying to make.

Your comment had the potential to be constructive and educational, but instead
you took a condescending approach. That's unfortunate.

------
aidos
_"get a timestamp error if one node attempts to read data that was written by
another node in the future (due to data moving across datelines)"_

Honest question: which part of a replication/db system would be using
timestamps that weren't universal?

~~~
donavanm
Just the concept of using time as the logical clock is sketchy. It's not
monotonically increasing, it drifts, it can go backwards, etc. Curious what
their cross-host/region ordering and consistency looks like.
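
The usual fix is a logical clock in the Lamport style, which is monotonic by
construction; a minimal sketch (not what MySQL replication actually uses):

    # Minimal Lamport-style logical clock: monotonic by construction and
    # consistently ordered across hosts, unlike wall-clock time, which
    # drifts and can jump backwards (e.g. on an NTP step).
    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):
            """Local event: just advance."""
            self.time += 1
            return self.time

        def recv(self, remote_time):
            """On receiving a message, jump past the sender's clock."""
            self.time = max(self.time, remote_time) + 1
            return self.time

    a, b = LamportClock(), LamportClock()
    b.recv(a.tick())  # b is now at 2: it never falls behind what it has seen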

~~~
sanxiyn
"An atomic clock is not that expensive." --
<http://research.google.com/archive/spanner.html>

~~~
jrockway
Indeed. A GPS clock accurate to 50ns is around $25 on eBay.
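
For context, the Spanner paper's approach, roughly paraphrased (a sketch of
the idea, not the real TrueTime API): expose time as an explicit uncertainty
interval, and wait the interval out before a commit becomes visible.

    import time

    # Sketch of the TrueTime idea from the Spanner paper: now() is an
    # interval [earliest, latest] wide enough to cover clock uncertainty,
    # and a commit waits until its timestamp is definitely in the past.
    EPSILON = 0.005  # uncertainty bound; the paper reports single-digit ms

    def tt_now():
        t = time.time()
        return (t - EPSILON, t + EPSILON)

    def commit_wait(commit_ts):
        # After this, no clock anywhere can still read a value <= commit_ts,
        # so timestamp order matches real-time commit order.
        while tt_now()[0] <= commit_ts:
            time.sleep(0.001)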

------
andrewcooke
curiously, 4629 tweets a second is 4 million a day[1], which is more than the
number of rows per day (3 million).

this must be someone not thinking, right? i don't see how they can have
fewer rows than tweets. even with de-duplication of repeated text they still
need a key (and timestamp) per tweet.

oh! and it gets worse. the next line says "400 million new tweets a day". that
one must be plain wrong (it's a rate 100x higher than the number per second),
given that the other two are (comparatively) consistent (it would also mean an
average of 3 tweets per day per active user (140 million), and i suspect they
define active user to be anyone over 1 tweet per month...).

 _so the section "by the numbers" contains four items, but only two
independent values. and appears to be inconsistent twice._

[1] almost exactly - presumably the per second value is derived from a rough
figure for the daily value.

~~~
taligent
Yes, because finding ways to not insert a new row for each tweet is a classic
sign of someone not thinking.

From the UI, Twitter only shows tweets to the minute, so you could imagine
that there is an optimisation there for merging retweets with the same text
within a 60-second period.

~~~
pyre
Assume they normalized the timestamp and only stored something on the level
of "these people tweeted/retweeted this message sometime during this
60-second interval."

The only way to avoid adding a new row per account that tweeted/retweeted
would be to store the list of accounts in a single row and keep updating it
on every retweet. This seems like it would be a less than optimal solution.
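
To make the trade-off concrete, here is a toy version of both layouts
(entirely hypothetical, just to illustrate the point):

    from collections import defaultdict

    # Option A: one row per (re)tweet -- simple appends, lots of rows.
    rows = []
    def record_a(user, text, ts):
        rows.append((user, text, ts))

    # Option B: one row keyed by (text, minute bucket) holding the list of
    # accounts -- far fewer rows, but every retweet rewrites the same hot
    # row, which is why it looks less than optimal under contention.
    buckets = defaultdict(list)
    def record_b(user, text, ts):
        buckets[(text, ts // 60)].append(user)

    for ts, user in enumerate(["alice", "bob", "carol"]):
        record_a(user, "RT: hello", ts)
        record_b(user, "RT: hello", ts)

    print(len(rows))     # 3 rows appended
    print(len(buckets))  # 1 row, rewritten three times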

~~~
andrewcooke
it certainly seems like you would need to aggregate keys somewhere (if it's
not the typo i assumed). that might just be possible in postgres these days
(update: a quick google suggests it's in development for 9.3), but i am pretty
sure not in mysql. on the other hand, for all i know, at this scale, they may
be treating mysql as some kind of distributed hashtable and doing a lot more
work in higher layers...

[ _update: I completely screwed up the maths; looks like this is not rows per
tweet, but support data._ ]
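
for reference, the per-second and per-day figures do reconcile with each
other; it was the rows-per-day number that threw me:

    tweets_per_second = 4629
    print(f"{tweets_per_second * 86_400:,}")
    # 399,945,600 -- i.e. ~400 million/day, matching the
    # "400 million new tweets a day" figure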

~~~
pyre

      > for all i know, at this scale, they may be treating
      > mysql as some kind of distributed hashtable and doing
      > a lot more work in higher layers...
    

If MySQL is just used as a hash table, and all of the interesting stuff
happens at higher layers, then talking about the number of rows isn't that
meaningful.
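
That pattern is real (FriendFeed famously described storing schema-less data
in MySQL this way); a minimal sketch of the idea, with sqlite3 standing in
for MySQL purely so the example runs anywhere:

    import json
    import sqlite3

    # "Database as a dumb hash table": a single key -> opaque-blob table;
    # schema, indexing, and joins all live in the application layer.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body BLOB)")

    def put(entity_id, obj):
        db.execute("INSERT OR REPLACE INTO entities VALUES (?, ?)",
                   (entity_id, json.dumps(obj)))

    def get(entity_id):
        row = db.execute("SELECT body FROM entities WHERE id = ?",
                         (entity_id,)).fetchone()
        return json.loads(row[0]) if row else None

    put("tweet:1", {"user": "alice", "text": "hello"})
    print(get("tweet:1"))  # {'user': 'alice', 'text': 'hello'}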

------
gogetter
Twitter and PayPal don't really need to be that fast. Nor does Facebook.

But do you think major banks on Wall Street, tracking billions of trades a
day in near real time, would be satisfied with these numbers?

It all depends on what your users demand and what they will put up with.

------
spudlyo
If anyone is interested in watching Jeremy's presentation online, it can be
found at:

<http://medianetwork.oracle.com/video/player/1873920417001>

------
keyz
What I would really like to see is a guesstimate of the volume of user
behavior/analytics data that FB/Twitter generate. It would be interesting to
see the average amount of data per user-interaction.

------
VMG
That's surprising to me. I didn't expect financial transactions to create
that much data. Also, wouldn't most of the data be archived and only rarely
accessed?

~~~
Firehed
Financial transactions tend to create a lot of data for the logging and
recoverability they require, and the logging isn't terribly interesting.

The trick is realizing that things can and do fail in any way for any
reason*, and being able to automatically recover from any point in that
process. Our (WePay's) system currently requires, at an absolute minimum,
seven inserts across six tables, though more commonly 9-10 rows across seven,
plus updates on several others, and significantly more if we can't authorize
the payment on the first attempt (e.g., a mistyped ZIP code on your billing
address). Only one of those tables is for logging (account_history, which is
functionally identical to a check ledger).

I've completely ignored both the logging of state changes (largely redundant
at this point; it was more for early debugging a couple of years ago, but
still sometimes useful for figuring out why something stalled) and the
recording of fraud-detection data, which can easily be a hundred rows, albeit
very small ones (80 bytes or so). It's also interesting to look at which
updates need to be performed in a transaction and which ones don't, although
that's of course irrelevant to the actual amount of data produced.

That doesn't account for the product side of payments at all - recording
contents of shopping carts, donations, ticket purchases, etc. That's at best
one stateless row, but use your imagination about the data layout for various
money-collecting tools.

Archival is interesting and something I'm sure we'll look at more in the
future, but right now it would tend to create more problems than it solves --
we have automated data integrity checks running multiple times per day to
ensure nothing is out of line, and kicking data out to an archive somewhere
would complicate that significantly. We also, of course, don't have nearly as
much data as PayPal, being significantly newer, so it's less of a problem.

* conflicts/row locks being one of the more common, but that's easy to deal
with. It's when an external call to a processor dies halfway through that
things get tricky.
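
A stripped-down sketch of that failure point (hypothetical tables and a stub
processor, not WePay's actual schema): the local inserts are atomic inside
one transaction, but the external authorization call is not, so it carries an
idempotency key and a persisted state to resume from.

    import sqlite3
    import uuid

    # Hypothetical two-table slice of a payment flow.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE payments (id TEXT PRIMARY KEY, state TEXT, idem_key TEXT);
        CREATE TABLE account_history (payment_id TEXT, entry TEXT);
    """)

    def start_payment(amount):
        pid, idem_key = str(uuid.uuid4()), str(uuid.uuid4())
        with db:  # one transaction: all local rows land, or none do
            db.execute("INSERT INTO payments VALUES (?, 'pending', ?)",
                       (pid, idem_key))
            db.execute("INSERT INTO account_history VALUES (?, ?)",
                       (pid, f"hold {amount}"))
        return pid, idem_key

    def authorize(pid, idem_key, processor):
        # If this external call dies halfway, a retry can safely resend the
        # same idem_key; the processor deduplicates on it.
        ok = processor(idem_key)
        with db:
            db.execute("UPDATE payments SET state = ? WHERE id = ?",
                       ("authorized" if ok else "failed", pid))

    pid, key = start_payment("10.00")
    authorize(pid, key, processor=lambda k: True)  # stub processor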

------
trustfundbaby
It'd be great to read about people's experiences using Postgres at a scale
comparable to that discussed in this article... does that even happen?

~~~
d0ugal
Yahoo had a huge Postgres database a few years ago. No idea what they are
doing now.
<http://www.computerworld.com/s/article/9087918/Size_matters_Yahoo_claims_2_petabyte_database_is_world_s_biggest_busiest>

And Skype sounds pretty interesting too.
<http://www.slideshare.net/adorepump/small-overview-of-skype-database-tools>

------
hendler
Does the "commercial instance of MySQL" refer to Percona, something else, or
is it simply the MySQL commercial license?

~~~
jeremycole
The article's author is confused. I clearly stated in the presentation that
we're developing our own branch of MySQL, which is available on GitHub:

<https://github.com/twitter/mysql>

