Hacker News new | past | comments | ask | show | jobs | submit login
7 Years Of YouTube Scalability Lessons In 30 Minutes (highscalability.com)
504 points by yarapavan on Mar 26, 2012 | hide | past | favorite | 70 comments

Most of this is relatively straightforward and unsurprising. But the one part that grabbed me is about "jittering". They insert random delays into timed events (the example given is cache expiration) to prevent a thundering herd problem when all the parts of the distributed system see the event at the same time (and for popular content, presumably repopulate the cache from the backend simulteously).

This is simple enough when described, but is not a technique I've seen applied much in practice or discussed in the community. I'm wondering if it's something that gets reinvented for all the projects that need it or if it's secret sauce known only in youtube. Regardless, I thought it was pretty insightful.

As a datapoint, we use jitter a lot in cache infrastructure at Facebook

The Adblock Plus blog details the thundering herd problems they faced. Their ad blocking lists checked for updates every 5 days. Eventually, many users' update schedules would converge on Mondays because office computers did not run over the weekend. Updates that had been scheduled for Saturday or Sunday would spill over to Monday.


Windows does much the same thing w.r.t. policy refresh (which sucks down files from domain controllers) and update of "occasionally updated" metadata like last logon timestamp.

A related technique is to use distinct prime numbers for periodicities in the first place:


(http://www.stdlib.net/~colmmacc/2009/09/27/period-pain-part-... http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/ for the background material).

The reasons primes are optimal should be obvious :)

As an interesting counter point, Jeff Dean at Google says they prefer to have a known hit every once in a while, so they will have all the cron jobs go off at the same time versus introducing jitter. It perhaps has to do with their emphasis on reducing variability to control the size of the long tail distribution.

A nice extension of this is to use exponentially distributed delays. Then reoccurring events form poisson processes which are really easy to reason about when composed eg if you have N servers firing off events at the same exponentially distributed rate the times are distributed the same as if you had one server firing events at N x the rate and distributing jobs uniformly at random to the other servers.


A counterexample is that the Linux kernel tries to schedule timer events for the same deadline time. That allows the processor to sleep longer because the kernel doesn't need to wake up as often just to handle 1 or 2 timer events.

It's kind of a weird way to describe it. The better way is that you want your caching to be probabilistic rather than deterministic. In general, you want to avoid anything that would get the nodes in a distributed system to harmonize.

The other way to solve the problem though would be to handle that circumstance cleanly. There are ways to resolve a thundering herd without creating a scalability problem.

I've heard it called "splay" before e.g. `chef-client` will take an interval option and a splay option. Splay is a random amount of time added to the interval to prevent thundering herds.

On this topic, an interesting paper from 1994 on adding randomization to network traffic sources:


Fascinating paper, even if it is 18 years old now. Thanks.

In the past i've seen jitter used in all sorts of applications, from cron jobs to configuration management to memcache key expiration. Any time you have a crapload of nodes that all need to do an operation at a specific time you rely on jitter to keep resources from bottlenecking. Probably used anywhere the systems or network has a miniscule amount of resource headroom (like cluster nodes that run at 93% utilization)

Same in multicast DNS which is used in zeroconf:


Search for "random". Cool spec written by Stuart Cheshire of Apple.

I hadn't thought about it in the general case, but I do frequently find myself adding a sleep for PID mod some appropriate constant to the beginning of big distributed batch jobs in order to keep shared resources (NFS, database, etc) from getting hammered all at once.

Why is this not done in electronic exchanges to render HFT operations pointless?

First come, first served is one of the rules that makes the exchanges the exchanges.

What's this? Python? Apache? MySQL? But I thought you had to be running beta-release key value stores, esoteric web servers, and experimental programming languages if you wanted to scale!


Got your experimental programming language:

> Vitess - a new project released by YouTube, written in Go, it’s a frontend to MySQL. It does a lot of optimization on the fly, it rewrites queries and acts as a proxy. Currently it serves every YouTube database request. It’s RPC based.

Ah yes, but it's using a ~30 year old technology (RPC).

Is RPC an actual technology? I thought it was more of a protocol design pattern.

Remote Procedure Call is a design paradigm for synchronous call-and-response network communication. The Sun RPC protocol is an actual technology defined in RFC1057: http://www.ietf.org/rfc/rfc1057.txt

It's not insane, though not terribly relevant in the modern world. The only common technology still using it is NFS.

If you look at the original RPC work by Bruce Nelson [1], it's pretty clear that there's no strict definition of it. I think most would argue that SOAP would be included, which is still pretty common.

1: http://nd.edu/~dthain/courses/cse598z/fall2004/papers/birrel...

No, that's the ambiguity I was addressing. RPC means two things. The protocol is used, for the most part, only by NFS. The concept is pervasive.

That's ONC RPC. Not the only implementation. XDR is still used in places, though you're right that it isn't widely used.

man callrpc

When Youtube refers to Vitess as being RPC-based, they are not referring to Sun RPC (callrpc), but rather to the generic design pattern of exposing service calls over the network. In particular, Vitess makes services callable [-] using either BSON or JSON serialisation over HTTP CONNECT calls.

[-] http://code.google.com/p/vitess/source/browse/go/rpcwrap/

I know the rage is async & nonblocking, but putting Python behind Apache is a good way to get predictable performance without concerns for calls that block, since they're mitigated via threads.

Also, it helps to have money since this approach requires more boxes. But as I said, it's very reliable.

Nope, you just need lots of cash.

speaking of - are they profitable yet?

Fast, cheap, scalable: pick two.

The fastest, easiest, arguably most reliable way to scale is throwing money at it. And apparently it's easier to hire good people who know 20-year old tech rather than 3-year old.

The first 10 minutes are about monetization from one of the Youtube dev advocates. Skip to 9:45 to get to the "good stuff".

As an aside, this fellow is probably one of the best presenters I've seen from the pycon videos for this year. Confident, smooth, not reading from a computer screen or sheet of paper, clearly smart and in firm command of the subject matter.

I'd love to see more talks from him.

as someone forever with his head buried in technical matters, I really enjoyed seeing the talk balanced with some biz.

"They wrote their own BSON implementation which is 10-15 times faster than the one you can download."

Curious to hear more about that one. If true, I hope they open source it, because that could potentially make MongoDB a lot faster for everyone.

EDIT: It's apparently in their vitess code. Relevant code: http://code.google.com/p/vitess/source/browse/#hg%2Fgo%2Fbso...

fwiw I've done some python benchmarking; it happens that I'm actually genuinely needing a faster protocol right now: http://stackoverflow.com/questions/9884080/fastest-packing-o...

I believe it could be faster still. Their Itoa cache is a map[int]string, where it could be a []string. Also, I suspect that a few more primitive type special cases in the first type switch in EncodeField could go a long way.

I got the impression that the 15x they talk of is on the Python side?

I fully agree with Youtube faking data. However, I reckon they are faking a bit too much. Many times I would see 2000 likes and the video having 1700 views (Viral videos that is).

I knew the view counter wasn't propagated but the likes were and I was like: "Damn this is Youtube, kinda disappointing..."

I guess if both were propagated at the same time I wouldn't mind.

I honestly don't understand why they simply don't use out of sync data. You could have nodes periodically send aggregates of likes & views, and then add those in to the total ever N heartbeats. Why bother fudging the in-between.

They're probably propagating the likes and views independently. Which still doesn't explain why they allow counter-intuitive gaps like that instead of fixing them up on the client-side in javascript.

Yeah, the view counter is a clear case where they've gone too far. People care about view counts and it seems like every other video I watch has people complaining in the comments about the view count being obviously wrong.

The only time I notice this is when the view counts freeze at 300-ish for a few hours. That's because they're checking if the views are legit or from a bot, IIRC.

Youtube started off as a dating website?

This has to go down in history as one of the best pivot decisions ever made.

Originally ebay started out as "auction web" hosted on the same site that Pierre Omidyar used for hosting information about the ebola virus.

"Dummer code is easier to grep for and easier to maintain.

The more magical the code is the harder is to figure out how it works."

A nice formulation of the kind of advice I keep reading here in HN.

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." -- Brian Kernighan

Complicated code that isn't easy to maintain isn't smart, it's just complicated. One thing people screw up a lot is implementing some design pattern in a way that ends up shotgunning configuration or business logic over a wide area instead of keeping it in one location and using DRY.

At 11:28 he says, "at last count, there was over a million lines of python running this thing"

Having never worked with code-bases larger than ~50kloc, I have a lot of trouble understanding what 1 million lines of code is needed for, especially considering that python is such a high-level language.

Does anyone have any idea why there would be this much code?

> Does anyone have any idea why there would be this much code?

It's the world's 3rd biggest website with hundreds of billions of views and dozens of millions of users, so maybe that's why.

I think it's a credit to Python that a website that does that and has grown in a fairly haphazard fashion only has about 1000k SLOCs.

Oh, it wasn't meant to be snarky or a slam -- I'm honestly just curious what kinds of things require so much code? E.g., is it one or two things that dominate usually in codebases this size, or is it just a LOT of components, each of which is tens of thousands of lines long? Do these kind of counts usually include auto-generated code?

that would be 1M LOC split across all the projects, libraries and modules.

I love the part on faking data. I take the viewpoint that only software testers care that the comment count is exactly correct in the majority of system. Users don't care.

I always thought it was a bug that view counts on YouTube varied depending on what page you saw them on. It can be different on the video view page, the channel page, and in your dashboard.

Turns out they're just making that shit up.

The publisher who uploaded the video might care. Especially so, if they are selling to advertisers. Comments, ratings, and favorites are three variables we calculate to reflect how engaging a video is.

LOL users care and they notice a LOT..Probably something like 5% of videos have a comment about how the view count is inaccurate.

does it stop them from watching videos or in any way hinder their enjoyment of the content?

If they feel strongly enough about it to leave a comment then I think it's safe to say it does hinder their enjoyment - in the same way that obviously broken things distract and displeasure in any medium.

I faked data for a client once. He had phpbb running and wanted a script that would slowly and randomly generate views on a specific topic.

Since this was just a field in a database, it involved some simple update code.

The results? More people are interested in anything that they think are popular (or they are curious as to why so many people viewed it).

My client got more actual hits overall on these topics.

My biggest gripe with youtube: why are comment almost always repeated? Yea i realize that most you tube comments are relatively worthless but I do tend to speed through them to get a feel for what the response is to a particular video. Inevitably I get through 20 comments and then the same 20 are repeated over again, often they are repeated several times. Perhaps they are trying to give the illusion of lots of comments or assuming the comments don't matter. Personally I find it extremely annoying, I'd rather them block to load more than repeat.

http://www.youtube.com/watch?v=G-lGCC4KKok Video to the article from pycon 2012. Talk starts around 10 minutes.

What about this:

> The number of videos has gone up 9 orders of magnitude and the number of developers has only gone up two orders of magnitude.

2 orders of magnitude means at the very least, going from 9 to 100 developers, which is a huge increase, but it could mean way more. I wonder how big the team really is, and what the changing team dynamics are like on that scale at that pace.

I'm sure many of us are disappointed that YouTube doesn't see consistent presentation of user comments as mission critical.

I am not. From what I have seen of YouTube, comments are vile, and mostly there are two strangers posting pointless arguments about something equally pointless.

For me, YouTube is good for watching videos. If I want to discuss it, I post it on FB.

I wonder how much speedup they could get from PyPy

The part about "Faking Data" is quite worrisome.

Uh, cheating?

"Cheating - Know How to Fake Data

Awesome technique. The fastest function call is the one that doesn’t happen. When you have a monotonically increasing counter, like movie view counts or profile view counts, you could do a transaction every update. Or you could do a transaction every once in awhile and update by a random amount and as long as it changes from odd to even people would probably believe it’s real. Know how to fake data."

So all those people who buy views are kinda screwed now :-) I suspect this is a bad example. I HOPE this is a bad example, if only for the KONY2012 campaign :P

No no, the correct amount of views will be recorded for a specific video, it's just that each webserver doesn't know the exact number all the time. You make each webserver fetch the correct value perhaps every hour, and fake it inbetween. You'll get an ok approximation, users can't tell the difference, and you don't have to fetch the actual number every single pageview.

Sir, thank you :-) I appreciate you clarifying this!

Thumbs up if you're the 311th viewer

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact