
7 Years Of YouTube Scalability Lessons In 30 Minutes  - yarapavan
http://highscalability.com/blog/2012/3/26/7-years-of-youtube-scalability-lessons-in-30-minutes.html
======
ajross
Most of this is relatively straightforward and unsurprising. But the one part
that grabbed me is about "jittering". They insert random delays into timed
events (the example given is cache expiration) to prevent a thundering herd
problem when all the parts of the distributed system see the event at the same
time (and for popular content, presumably repopulate the cache from the
backend simultaneously).

This is simple enough when described, but is not a technique I've seen applied
much in practice or discussed in the community. I'm wondering if it's
something that gets reinvented for all the projects that need it or if it's
secret sauce known only at YouTube. Regardless, I thought it was pretty
insightful.
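
A minimal sketch of the jitter idea in Python, not YouTube's actual code. It assumes a simple in-memory cache; `jittered_ttl`, `JitteredCache`, and the 10% jitter fraction are hypothetical names and values chosen for illustration. Each entry gets a slightly different expiry time, so servers that cached the same item at the same moment do not all go back to the backend together.

```python
import random
import time


def jittered_ttl(base_ttl, jitter_fraction=0.1, rng=random):
    """Return base_ttl shifted by a random amount of up to +/- jitter_fraction."""
    spread = base_ttl * jitter_fraction
    return base_ttl + rng.uniform(-spread, spread)


class JitteredCache:
    """Tiny in-memory cache whose entries expire at slightly different times."""

    def __init__(self, base_ttl, jitter_fraction=0.1, clock=time.monotonic):
        self._base_ttl = base_ttl
        self._jitter_fraction = jitter_fraction
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def set(self, key, value):
        # The expiry is drawn once per entry, at write time.
        ttl = jittered_ttl(self._base_ttl, self._jitter_fraction)
        self._store[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller repopulates it from the backend.
            del self._store[key]
            return None
        return value
```

With `base_ttl=300` and the default fraction, entries expire somewhere between 270 and 330 seconds after being written, which spreads the refill load over a minute instead of a single instant.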

~~~
cpeterso
The Adblock Plus blog details the thundering herd problems they faced. Their
ad blocking lists checked for updates every 5 days. Eventually, many users'
update schedules would converge on Mondays because office computers did not
run over the weekend. Updates that had been scheduled for Saturday or Sunday
would spill over to Monday.

[https://adblockplus.org/blog/downloading-a-file-regularly-
ho...](https://adblockplus.org/blog/downloading-a-file-regularly-how-hard-can-
it-be)

~~~
sk5t
Windows does much the same thing w.r.t. policy refresh (which sucks down files
from domain controllers) and update of "occasionally updated" metadata like
last logon timestamp.

------
blhack
What's this? Python? Apache? MySQL? But I thought you had to be running beta-
release key value stores, esoteric web servers, and experimental programming
languages if you wanted to scale!

/s

~~~
zmj
Got your experimental programming language:

 _> Vitess - a new project released by YouTube, written in Go, it’s a frontend
to MySQL. It does a lot of optimization on the fly, it rewrites queries and
acts as a proxy. Currently it serves every YouTube database request. It’s RPC
based._

~~~
maukdaddy
Ah yes, but it's using a ~30 year old technology (RPC).

~~~
simonw
Is RPC an actual technology? I thought it was more of a protocol design
pattern.

~~~
ajross
Remote Procedure Call is a design paradigm for synchronous call-and-response
network communication. The Sun RPC protocol is an actual technology defined in
RFC1057: <http://www.ietf.org/rfc/rfc1057.txt>

It's not insane, though not terribly relevant in the modern world. The only
common technology still using it is NFS.

~~~
m0th87
If you look at the original RPC work by Bruce Nelson [1], it's pretty clear
that there's no strict definition of it. I think most would argue that SOAP
would be included, which is still pretty common.

1:
[http://nd.edu/~dthain/courses/cse598z/fall2004/papers/birrel...](http://nd.edu/~dthain/courses/cse598z/fall2004/papers/birrell-
rpc.pdf)

~~~
ajross
No, that's the ambiguity I was addressing. RPC means two things. The
_protocol_ is used, for the most part, only by NFS. The _concept_ is
pervasive.

------
mattdeboard
The first 10 minutes are about monetization, from one of the YouTube dev
advocates. Skip to 9:45 to get to the "good stuff".

As an aside, this fellow is probably one of the best presenters I've seen from
the pycon videos for this year. Confident, smooth, not reading from a computer
screen or sheet of paper, clearly smart and in firm command of the subject
matter.

I'd love to see more talks from him.

~~~
tricolon
Wadsworth to the rescue!
[http://www.youtube.com/watch?v=G-lGCC4KKok&wadsworth=1](http://www.youtube.com/watch?v=G-lGCC4KKok&wadsworth=1)

------
hendzen
"They wrote their own BSON implementation which is 10-15 times faster than the
one you can download."

Curious to hear more about that one. If true, I hope they open source it,
because that could potentially make MongoDB a lot faster for everyone.

EDIT: It's apparently in their vitess code. Relevant code:
[http://code.google.com/p/vitess/source/browse/#hg%2Fgo%2Fbso...](http://code.google.com/p/vitess/source/browse/#hg%2Fgo%2Fbson)

~~~
willvarfar
FWIW, I've done some Python benchmarking; as it happens, I genuinely need a
faster protocol right now:
[http://stackoverflow.com/questions/9884080/fastest-
packing-o...](http://stackoverflow.com/questions/9884080/fastest-packing-of-
data-in-python-and-java)

------
pothibo
I fully agree with YouTube faking data. However, I reckon they are faking a
bit too much. Many times I would see a video with 2000 likes but only 1700
views (viral videos, that is).

I knew the view counter wasn't propagated but the likes were and I was like:
"Damn this is Youtube, kinda disappointing..."

I guess if both were propagated at the same time I wouldn't mind.

~~~
cbsmith
I honestly don't understand why they don't simply use out-of-sync data. You
could have nodes periodically send aggregates of likes & views, and then add
those in to the total every N heartbeats. Why bother fudging the in-between?
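
A rough sketch of that aggregation scheme in Python. Everything here is hypothetical (`NodeCounter`, `Aggregator`, `flush_every`) and only illustrates the idea: each node buffers its own likes and views, and the published totals change only once every N heartbeats, with both counters updated in the same step.

```python
from collections import Counter


class NodeCounter:
    """Per-node buffer of likes and views recorded since the last drain."""

    def __init__(self):
        self._pending = Counter()

    def record(self, video_id, kind):
        # kind is a label such as "views" or "likes"
        self._pending[(video_id, kind)] += 1

    def drain(self):
        # Hand over the buffered counts and start a fresh buffer.
        pending, self._pending = self._pending, Counter()
        return pending


class Aggregator:
    """Collects node aggregates and publishes totals every flush_every heartbeats."""

    def __init__(self, flush_every):
        self.flush_every = flush_every
        self.totals = Counter()   # what users would be shown
        self._staged = Counter()  # collected but not yet published
        self._beats = 0

    def heartbeat(self, nodes):
        self._beats += 1
        for node in nodes:
            self._staged.update(node.drain())
        if self._beats % self.flush_every == 0:
            # Likes and views become visible together, in one step.
            self.totals.update(self._staged)
            self._staged.clear()
```

The displayed numbers lag behind reality by up to N heartbeats, but because both counters are published in the same flush, they stay consistent with each other.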

~~~
moe
They're probably propagating the likes and views independently. Which still
doesn't explain why they allow counter-intuitive gaps like that instead of
fixing them up on the client-side in javascript.

------
swalsh
Youtube started off as a dating website?

This has to go down in history as one of the best pivot decisions ever made.

~~~
InclinedPlane
eBay originally started out as "AuctionWeb", hosted on the same site that
Pierre Omidyar used for hosting information about the Ebola virus.

------
ez77
"Dumber code is easier to grep for and easier to maintain.

The more magical the code is, the harder it is to figure out how it works."

A nice formulation of the kind of advice I keep reading here in HN.

~~~
xentronium
"Debugging is twice as hard as writing the code in the first place. Therefore,
if you write the code as cleverly as possible, you are, by definition, not
smart enough to debug it." -- Brian Kernighan

------
apu
At 11:28 he says, "at last count, there was over a million lines of python
running this thing"

Having never worked with codebases larger than ~50kloc, I have a lot of
trouble understanding what 1 million lines of code are needed for, especially
considering that Python is such a high-level language.

Does anyone have any idea why there would be this much code?

~~~
quink
> Does anyone have any idea why there would be this much code?

It's the world's 3rd biggest website with hundreds of billions of views and
dozens of millions of users, so maybe that's why.

I think it's a credit to Python that a website that does that and has grown in
a fairly haphazard fashion only has about 1000k SLOCs.

~~~
apu
Oh, it wasn't meant to be snarky or a slam -- I'm honestly just curious what
kinds of things require so much code? E.g., is it one or two things that
dominate usually in codebases this size, or is it just a LOT of components,
each of which is tens of thousands of lines long? Do these kinds of counts
usually include auto-generated code?

------
cwbrandsma
I love the part on faking data. I take the viewpoint that only software
testers care that the comment count is exactly correct in the majority of
systems. Users don't care.

~~~
ilaksh
LOL, users care and they notice a LOT. Probably something like 5% of videos
have a comment about how the view count is inaccurate.

~~~
kami8845
Does it stop them from watching videos or in any way hinder their enjoyment of
the content?

~~~
ckg
If they feel strongly enough about it to leave a comment then I think it's
safe to say it does hinder their enjoyment, in the same way that obviously
broken things distract and displease in any medium.

------
mbell
My biggest gripe with YouTube: why are comments almost always repeated? Yeah, I
realize that most YouTube comments are relatively worthless, but I do tend to
speed through them to get a feel for what the response is to a particular
video. Inevitably I get through 20 comments and then the same 20 are repeated
over again, often several times. Perhaps they are trying to give the illusion
of lots of comments, or assuming the comments don't matter. Personally I find
it extremely annoying; I'd rather they block to load more than repeat.

------
spdy
<http://www.youtube.com/watch?v=G-lGCC4KKok> Video of the talk behind the
article, from PyCon 2012. The talk starts around 10 minutes in.

------
jcromartie
What about this:

> The number of videos has gone up 9 orders of magnitude and the number of
> developers has only gone up two orders of magnitude.

2 orders of magnitude means at the very least, going from 9 to 100 developers,
which is a huge increase, but it could mean way more. I wonder how big the
team really is, and what the changing team dynamics are like on that scale at
that pace.

------
Tloewald
I'm sure many of us are disappointed that YouTube doesn't see consistent
presentation of user comments as mission critical.

~~~
irahul
I am not. From what I have seen of YouTube, comments are vile, and mostly
there are two strangers posting pointless arguments about something equally
pointless.

For me, YouTube is good for watching videos. If I want to discuss it, I post
it on FB.

------
sylvinus
I wonder how much speedup they could get from PyPy

------
dropshopsa
The part about "Faking Data" is quite worrisome.

------
tbsdy
Uh, cheating?

"Cheating - Know How to Fake Data

Awesome technique. The fastest function call is the one that doesn’t happen.
When you have a monotonically increasing counter, like movie view counts or
profile view counts, you could do a transaction every update. Or you could do
a transaction every once in awhile and update by a random amount and as long
as it changes from odd to even people would probably believe it’s real. Know
how to fake data."

So all those people who buy views are kinda screwed now :-) I suspect this is
a bad example. I HOPE this is a bad example, if only for the KONY2012 campaign
:P

~~~
henrikschroder
No no, the correct amount of views will be recorded for a specific video, it's
just that each webserver doesn't know the exact number all the time. You make
each webserver fetch the correct value perhaps every hour, and fake it
in between. You'll get an OK approximation, users can't tell the difference,
and you don't have to fetch the actual number every single pageview.
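
A minimal Python sketch of that scheme, under stated assumptions: `ApproximateCounter`, `fetch_exact`, `refresh_seconds`, and `max_step` are all hypothetical names, and bumping the shown value by a small random step per page view is just one way to "fake it". The max() on refresh, which keeps the displayed number from going backwards, is my own design choice and is not described in the talk.

```python
import random
import time


class ApproximateCounter:
    """Shows a plausible view count; fetches the exact one only occasionally."""

    def __init__(self, fetch_exact, refresh_seconds=3600.0, max_step=3,
                 clock=time.monotonic, rng=random):
        self._fetch_exact = fetch_exact  # callable returning the true count
        self._refresh_seconds = refresh_seconds
        self._max_step = max_step
        self._clock = clock
        self._rng = rng
        self._shown = fetch_exact()
        self._fetched_at = clock()

    def display_value(self):
        """Called once per page view on this web server."""
        now = self._clock()
        if now - self._fetched_at >= self._refresh_seconds:
            # Time to resync with the real number, without ever going backwards.
            self._shown = max(self._shown, self._fetch_exact())
            self._fetched_at = now
        else:
            # Fake it: bump by a small random amount instead of hitting the backend.
            self._shown += self._rng.randint(1, self._max_step)
        return self._shown
```

The true count is still recorded elsewhere; this only controls what one web server displays between refreshes.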

~~~
tbsdy
Sir, thank you :-) I appreciate you clarifying this!

------
charbach007
Thumbs up if you're the 311th viewer

