There's a full video of the talk available at http://www.livestream.com/facebookevents/video?clipId=flv_cc...
The fact that it's MySQL makes it even more interesting, given the shift of scalability interest to MongoDB, etc.
This can reap huge benefits and doesn't need to be difficult. Just enable the slow query log in MySQL, use the EXPLAIN command to analyze the results, then add indexes where appropriate. I was able to fix poorly indexed tables in a vendor's application with dramatic results. In one case, a twenty-minute(!) query was reduced to less than a second.
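For instance, a minimal sketch of that workflow (the table and column names here are hypothetical):

    -- Log anything slower than one second (slow_query_log is dynamic in MySQL 5.1+)
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;

    -- For each query that shows up in the log, inspect its execution plan
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

    -- If EXPLAIN reports a full table scan (type = ALL), add a matching index
    CREATE INDEX idx_orders_customer ON orders (customer_id);

Then re-run the EXPLAIN to confirm the new index is actually being used.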
When all you have is a hammer...
McDonald's, to give a famous example, proved analytically a few years back that -- beyond a certain threshold of commonly expected service times -- their customers would rather get semi-slow service on a consistent basis than highly variable service. In a perfect world, of course, average wait times are as short as possible and variance is minimized. But once you're within the acceptable range, there are diminishing returns on speed increases and increasing returns on reduced variance.
Facebook (and Google!) cares about performance variance because it has more of an impact on overall site performance than average performance does. Variable performance has a huge impact on downstream systems, and you can quickly end up with cascading performance problems.
I think that quote is slightly misleading without more context. They prioritize optimizing queries with variable performance over consistently slow ones. They aren't going to be running slow queries on the Facebook home page anyway.
From what I hear, Google and Bing do track response latencies at the 99th percentile and above.
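If you want a rough version of that yourself, here's a sketch that pulls a p99 out of a hypothetical request_log(latency_ms) table, assuming MySQL 8.0's window functions:

    -- Smallest latency at or above the 99th percentile rank
    SELECT MIN(latency_ms) AS p99_ms
    FROM (
        SELECT latency_ms,
               PERCENT_RANK() OVER (ORDER BY latency_ms) AS pct
        FROM request_log
    ) AS ranked
    WHERE pct >= 0.99;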
I'm having trouble understanding the motivation. If a slow query is always slow, then I'm always going to be kept waiting for that page/data. It seems logical to worry more about the queries that keep users waiting 100% of the time than about the queries that keep users waiting <100% of the time.
Does anyone care to explain why this is a good idea (for Facebook at least)?
The harder problem is figuring out why that 20ms query suddenly balloons to 200ms. You can say, "no big deal, it only happens 1% of the time," but if you don't know why, you could make changes to the system that cause it to happen much more frequently and eventually bring the whole system down.
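One way to spot those outliers is to look for statements whose worst case is far from their average. A sketch, assuming MySQL 5.6+'s performance_schema statement digests (the 10x threshold is arbitrary):

    -- Timer columns are in picoseconds; divide by 1e9 for milliseconds
    SELECT DIGEST_TEXT,
           COUNT_STAR,
           AVG_TIMER_WAIT / 1e9 AS avg_ms,
           MAX_TIMER_WAIT / 1e9 AS max_ms
    FROM performance_schema.events_statements_summary_by_digest
    WHERE MAX_TIMER_WAIT > 10 * AVG_TIMER_WAIT
    ORDER BY MAX_TIMER_WAIT DESC
    LIMIT 10;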
Also, there's a bit of UX here. People are much more frustrated by things they don't understand and/or aren't used to. There are parts of GMail that are always slow (archiving a lot of messages). I know this, so I know I have to wait 5-10 seconds. What if sometimes it took 1 second and sometimes it took 20 seconds? What if it took 20 seconds 5% of the time? I'd probably always click again and think something was broken. If it's always slow, I want it to be faster, but at least I know what to expect.
Think of it this way: if you know a certain function will reliably take a little while to complete, you can justify the effort of adding progress indicators and other feedback to let the user know.
But if query performance is unpredictable, even planning the UI design becomes difficult - not to mention the end user's experience.
See Deming's work on statistical process control, used in wartime production during WWII: http://en.wikipedia.org/wiki/W._Edwards_Deming
I'm glad Facebook is following this old-school engineering tradition.
This is talked about here:
The section on diagnosing should be taken with a grain of salt, though. If your company ever gets to the point where you need subsecond monitoring of everything to catch problems, or need to analyze and understand how every layer of your stack performs, you've already won. That amount of attention to scalability means your company has a huge user base. Not only that, it means you have the large and impressive engineering resources to devote to the problem.
That's definitely not my startup, and so the tools described, while definitely useful (and probably fun to build!), aren't anything approaching a priority for me. In the words of the stereotypical Yiddish grandmother, you should be so lucky to have those sorts of problems!
They switch to the biggest machine IBM can give them every few months.
Yeah, almost all of these "web scale" systems can achieve those numbers because they can partition, relax transactional requirements, and scale out. For true system-wide transactions, scaling up still seems to be the only option.
Although, H-Store/VoltDB claims some impressive numbers if your OLTP app fits the requirements.
I'd suspect that if those don't qualify as "very large scale" by your definition (which would be entirely justifiable), it's because such systems aren't currently possible. For example, I don't think it's possible to implement Facebook with SERIALIZABLE isolation, no matter how much money you throw at the problem.
(I'd love to be proven wrong of course)
Normally, whenever physical objects or money is involved, transactions should be considered.
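The textbook case is a money transfer, where both updates must commit together or not at all (a sketch; the accounts table is hypothetical and needs a transactional engine like InnoDB):

    START TRANSACTION;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;  -- or ROLLBACK on error, leaving both balances untouched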
I'm just astonished that a company that's been around as long as Oracle could be so dumb. At the Fortune 5 company where I work, Oracle has a similar practice of gouging us on PeopleSoft licenses, in my opinion to make up for lost DB sales.
Charging a customer license fees per CPU core is just unethical.
It's no wonder... Good riddance.