
Tumblr Architecture - 15B Page Views a Month and Harder to Scale than Twitter - there
http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html
======
feralchimp
> Posts are about 50GB a day. Follower list updates are about 2.7TB a day.

Holy _schnikeys_! The graph changes are _two orders of magnitude_ larger than
the content additions.
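
As a quick check of the "two orders of magnitude" figure, here is a small sketch that computes the ratio from the two quoted numbers. It assumes decimal units (1 TB = 1000 GB); the variable names are illustrative only.

```python
import math

# Figures quoted from the article; decimal units are an assumption.
posts_gb_per_day = 50
follower_updates_gb_per_day = 2700  # 2.7 TB expressed in GB

ratio = follower_updates_gb_per_day / posts_gb_per_day
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

Under that assumption the ratio is 54x, so somewhat under two full orders of magnitude.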

~~~
thesethings
That caught my eye too, and I guessed on Twitter that it might be a
typo/under-explained.

Somebody from Tumblr (though on a different team) was nice enough to respond
and guessed that it was because all media on S3 was not counted.

[https://twitter.com/#!/adamlaiacano/status/16914079173274009...](https://twitter.com/#!/adamlaiacano/status/169140791732740096)

~~~
mcherm
Thanks. It seemed completely unreasonable to me until I saw this explanation.

------
famoreira
It seems that the only way to scale a site like this is to build decoupled
services, maybe using a message queue (AMQP, etc.). How can I learn more
about these kinds of systems? Also, are there any open source applications built
in a distributed way that I could learn from?

~~~
runjake
I hate to point out the obvious answer, but it's a good introduction with
links to several implementations:

<http://en.wikipedia.org/wiki/AMQP>

~~~
famoreira
I meant some open source applications that are built in a distributed way
(hence using something like AMQP in a publish/subscribe fashion)
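
To make the publish/subscribe idea concrete, here is a minimal sketch of two decoupled services consuming the same event. It is only an in-process toy built on the Python standard library: the `Broker` class, the `post.created` topic, and the service names are all hypothetical, and a real deployment would put an actual broker speaking AMQP between the processes.

```python
import queue
import threading
from collections import defaultdict


class Broker:
    """Toy in-process stand-in for a message broker (not real AMQP)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> subscriber queues
        self._lock = threading.Lock()

    def subscribe(self, topic):
        """Register a new subscriber and return its private queue."""
        inbox = queue.Queue()
        with self._lock:
            self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        """Deliver the message to every subscriber of the topic."""
        with self._lock:
            targets = list(self._subscribers[topic])
        for inbox in targets:
            inbox.put(message)


def run_service(name, inbox, results):
    """Consume messages until the None sentinel arrives."""
    while True:
        message = inbox.get()
        if message is None:
            break
        results.append((name, message))


def demo():
    broker = Broker()
    results = []  # list.append is thread-safe in CPython
    workers = []
    # Hypothetical services; the publisher knows nothing about them.
    for name in ("notifications", "search-indexer"):
        inbox = broker.subscribe("post.created")
        worker = threading.Thread(target=run_service, args=(name, inbox, results))
        worker.start()
        workers.append(worker)

    broker.publish("post.created", {"post_id": 1, "author": "example"})
    broker.publish("post.created", None)  # sentinel: tell each service to stop
    for worker in workers:
        worker.join()
    return sorted(results, key=lambda item: item[0])


if __name__ == "__main__":
    for service, message in demo():
        print(service, "received", message)
```

The point of the design is that the publisher only names a topic, so new consumers can be added without touching the code that creates posts.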

~~~
rabidsnail
I don't really know of any. Nobody writes these sorts of distributed
applications from scratch. They're almost always evolved from monolithic
database-backed applications. And the process of evolution is half development
half ops. I doubt that tumblr as it exists today could be deployed from
scratch, even if you had all of the source code of all of the systems.

~~~
rabidsnail
However, if you want to go truly distributed (much more so than any website I
know of) there are open source projects you can learn from:

<http://freenetproject.org/>

<http://amule.org> (or <http://mldonkey.sourceforge.net/Main_Page> if you're
into functional programming)

<http://www.torproject.org> (especially the hidden services stuff)

And you probably want to read this:
<http://pmg.csail.mit.edu/papers/osdi99.pdf>

------
nessus42
My favorite part of the blog post:

" _Don’t hire people based on their survival through a useless technological
gauntlet. Hire them because they fit your team and can do the job._ "

~~~
rhizome
Those don't appear mutually exclusive to me.

~~~
jebblue
Agreed and to me it sounds incredibly short sighted. People who have survived
technological gauntlets know how to adapt and help instill a sense of
continuity in fellow team members.

~~~
nessus42
I, for the life of me, don't understand why people defend these gauntlets.
Wasn't getting a degree from MIT enough of a gauntlet? Wasn't writing reams
and reams of working, robust, well-documented code proof enough of one's
software engineering skills?

I think so!

~~~
jebblue
Try interviewing, every 2 or 3 years the interviewing crowd has been
influenced by their day's generational thinking (not a bad thing just
something to be aware of); I for one always welcomed people who have passed
through amazing experiences. Apparently from recent interviewing I was/am in
the minority. Oh, I meant the gauntlet of experience, learning under fire; I'm
not a college graduate.

Perhaps I need to use hair coloring.

~~~
fooandbarify
I got the impression from the article that they were referring to passing
through a technological gauntlet _during the interview process_ (i.e.,
complicated programming puzzles) rather than life experience. Is that also
what you are referring to?

~~~
jebblue
You're right, I misunderstood the context of the comment during the first
reading.

------
DLarsen
> Initially an Actor model was used with Finagle, but that was dropped.

The Akka folks do a good job of keeping the actor model on your mind when
building Scala systems, but it's obviously not always the right approach. I'd
love to hear more about how they ended up abandoning the Actor model.

------
retroafroman
This may be a slightly naive question, but why list both "500 web servers" and
"8 nginx"? Seeing as nginx IS a webserver, is it fulfilling a different
function (serving static pages?) compared with the Apache servers?

~~~
simonw
      $ curl -i 'http://www.tumblr.com/'
      HTTP/1.1 302 Found
      Date: Mon, 13 Feb 2012 19:31:35 GMT
      Server: Apache
      P3P: CP="ALL ADM DEV PSAi COM OUR OTRo STP IND ONL"
      Location: https://www.tumblr.com/
      Vary: Accept-Encoding
      X-Tumblr-Usec: D=15547
      Content-Length: 0
      Connection: close
      Content-Type: text/html; charset=UTF-8

      $ curl -i 'http://assets.tumblr.com/images/favicon.gif?2'
      HTTP/1.1 200 OK
      Server: nginx/0.8.53
      Content-Type: image/gif
      Last-Modified: Fri, 15 Apr 2011 22:13:30 GMT
      Accept-Ranges: bytes
      Content-Length: 635
      X-Varnish: 1795992965
      Cache-Control: max-age=2196224
      Expires: Sat, 10 Mar 2012 05:35:30 GMT
      Date: Mon, 13 Feb 2012 19:31:46 GMT
      Connection: keep-alive

So it looks like their application is being served by Apache (and PHP), while
their static assets are served by nginx behind Varnish.
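
The same inference can be scripted. The sketch below pulls the `Server` header out of a raw HTTP response; the `server_header` helper is hypothetical, and the two sample responses are abbreviated copies of the headers quoted in this comment.

```python
def server_header(raw_response):
    """Return the Server header value from a raw HTTP response, or None."""
    lines = raw_response.strip().splitlines()
    for line in lines[1:]:  # skip the status line
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        if name.strip().lower() == "server":
            return value.strip()
    return None


# Abbreviated versions of the two responses quoted in the comment.
APP_RESPONSE = """\
HTTP/1.1 302 Found
Server: Apache
Location: https://www.tumblr.com/
"""

ASSET_RESPONSE = """\
HTTP/1.1 200 OK
Server: nginx/0.8.53
X-Varnish: 1795992965
"""

print("application:", server_header(APP_RESPONSE))
print("static assets:", server_header(ASSET_RESPONSE))
```

Note that a `Server` header only identifies whichever server answered last in the chain, so it is a hint about the setup rather than proof of it.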

~~~
silverlight
Well, their assets are being served from a CDN. Are they counting that in
their server count?

    
    
      $ host assets.tumblr.com
      assets.tumblr.com is an alias for
      assets.tumblr.com.edgesuite.net.
      assets.tumblr.com.edgesuite.net is an alias for
      a1092.g.akamai.net.
      a1092.g.akamai.net has address 69.31.106.32
      a1092.g.akamai.net has address 69.31.106.50

------
B-Scan
After a fairly bad review about Tumblr's availability [1] this was something
they needed to write. They had (and still have) pretty big challenges. It's
not easy to scale something like that and this article confirms that.

[1] <http://news.ycombinator.com/item?id=3468879>

------
matth
Off-topic, but I put together a list of HN members on Tumblr not too long ago:
[http://blog.dozierhudson.com/post/9596967319/list-of-
hacker-...](http://blog.dozierhudson.com/post/9596967319/list-of-hacker-news-
members-on-tumblr)

Get in touch and I'll add you to the list.

~~~
willvarfar
<http://williamedwardscoder.tumblr.com>

------
thebluesky
Always interesting to read about real-world usage of different tools and
languages.

From the article: ... • MySQL (plus sharding) scales, apps don't. • Redis is
amazing. • Scala apps perform fantastically. ...

Some of the organizations (Including Tumblr) using Scala were discussed as
part of this talk: <http://www.youtube.com/watch?v=qqQNqIy5LdM> and slides
here:
[http://mrkn.co/s/video_martin_odersky_what_s_next_for_scala,...](http://mrkn.co/s/video_martin_odersky_what_s_next_for_scala,575/index.html)

------
andybak
I'm stunned by the traffic figures as I don't know one person that uses it
(and I suspect the majority of my non-tech, Facebook loving friends have never
heard of it).

Is that anything to do with my demographic (41, M, UK based)?

~~~
fooandbarify
Yes (demographic). Although all sorts of people use tumblr, it seems to be
most popular among (predominantly American) high-school age (and a bit older)
teenagers. It has settled into a middle ground between Twitter and traditional
blogs for a lot of these users -- single-paragraph posts or photo posts are
the norm, and re-blogging is very common/encouraged.

------
ck2
_of the 500+ million page views a day, 70% of that is for the Dashboard_

"dashboard" is their admin area?

So in theory 700 of their 1000 servers are for people to make a new post?

If I remember correctly, wp.com has a similar problem.

~~~
citricsquid
A relevant comment of my own from a few weeks ago:
<http://news.ycombinator.com/item?id=3468904>

Tumblr is _not_ a blogging platform like Wordpress, it's much more a community
like Reddit. Yes, you _can_ use Tumblr to host your blog, but the majority of
people use the dashboard to interact with others and use their blog to share
their content. Very few people will ever see username.tumblr.com, they'll see
the posts via tumblr.com/dashboard. Like Twitter, very few people visit
twitter.com/citricsquid, they follow me and see my tweets in their stream.

For example, my dashboard right now: <http://i.imgur.com/0YAYv.jpg>
(potentially nsfw material, didn't check)

~~~
skeletonjelly
The list is NSFW depending on where you work for the curious. So I guess the
dashboard is the feed. Not a graph based dashboard, nor an admin section.

------
hungryblank
You do not write an article about technology by vomiting hundreds of bullet
points mentioning every technology a company ever used. It's an appallingly
bad example of technical journalism. This is not informational, it's just
bullet point driven bike shedding.

~~~
benjaminwootton
He puts the articles together by researching from various sources, often
without direct access to the company themselves.

It's going to be a little fragmented, but nonetheless, I found this a hugely
interesting article.

------
didip
I hope they managed to release
staircar([http://engineering.tumblr.com/post/7819252942/staircar-
redis...](http://engineering.tumblr.com/post/7819252942/staircar-redis-
powered-notifications)) before phasing it out with Scala.

------
fauigerzigerk
I'm skeptical about the inbox model.

I use Google Reader for two main reasons. One is that once I subscribe to a
feed, I get access to all messages, not just new ones. The second is that I
can search old messages as well.

So if I understand Tumblr's inbox model correctly, that's exactly the kind of
usage pattern that isn't supported.

~~~
apgwoz
I don't see how this is a problem. Every cell has a copy of every post--that
should make it possible to search it and get all older posts as well.

~~~
fauigerzigerk
Absolutely it should be possible. It just sounded to me as if they don't do
it. But maybe I'm wrong. I'm not a Tumblr user.

~~~
fooandbarify
The UI is very similar to a Twitter dashboard (if you are familiar with that).
I don't think it's possible to search for old posts in either case, and the
style of the site doesn't really lend itself to that kind of use (for better
or worse).

------
unicornporn
Aside from all this, can anybody imagine what their Amazon bill is like? Lucky
for them, they have not-so-demanding venture capitalists to pay their bills. I
honestly wonder how long they will be able to keep growing like they do
without making any money.

------
willvarfar
Which implies that disqus is getting hammered too.

Most blogs have disqus comments on each post; every page-view on tumblr is at
least one fetch on disqus; many more for index pages.

Very clever strategy for tumblr to make comments someone else's problem ;)

~~~
Avshalom
It's not that bad. 70% of views are from the Dashboard, they don't see disqus,
and I suspect that "most blogs" is an artifact of the circles you run in.
Anecdotally, none of the blogs I follow have disqus.

------
democracy
A bit surprising is the tone of apology when touching JVM even with Scala
gloves.

~~~
soc88
Really? Didn't see that. Is this a political or a technical comment? When
building highly-scalable and stable systems, there isn't much except the JVM.
Sure, for some corner cases Erlang might be a solution and if reliability is
less important also .NET.

------
wtn
Do sites like Tumblr use graph databases?

~~~
jrockway
What questions do they need to answer that require the entire graph? It seems
like the most complex thing they may need to calculate is friends-of-friends,
and that's easy to do even with an adjacency list in a SQL database.
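
As a rough illustration of that point, here is a sketch of a friends-of-friends query over an adjacency list, using SQLite from the Python standard library. The `follows` table, its column names, and the sample users are all made up for the example and say nothing about Tumblr's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Adjacency list: one row per directed "follower follows followee" edge.
conn.execute(
    """
    CREATE TABLE follows (
        follower TEXT NOT NULL,
        followee TEXT NOT NULL,
        PRIMARY KEY (follower, followee)
    )
    """
)
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [
        ("alice", "bob"),
        ("alice", "carol"),
        ("bob", "alice"),
        ("bob", "dave"),
        ("carol", "dave"),
        ("carol", "erin"),
    ],
)


def friends_of_friends(conn, user):
    """Users followed by someone `user` follows, excluding `user` and
    anyone `user` already follows directly."""
    rows = conn.execute(
        """
        SELECT DISTINCT f2.followee
        FROM follows AS f1
        JOIN follows AS f2 ON f2.follower = f1.followee
        WHERE f1.follower = ?
          AND f2.followee != ?
          AND f2.followee NOT IN (
              SELECT followee FROM follows WHERE follower = ?
          )
        ORDER BY f2.followee
        """,
        (user, user, user),
    )
    return [row[0] for row in rows]


print(friends_of_friends(conn, "alice"))
```

A single self-join covers the two-hop case; it is only when queries need arbitrary-depth traversal that a plain adjacency list starts to get awkward.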

------
wingo
> New Tumblr > Changed to a JVM centric approach for hiring and speed of
> development reasons.

Interesting.

~~~
douglasfshearer
It seems that in New York it was easier to hire people who had worked at scale
with the JVM, since that is what the nearby banking institutions might
standardise on.

------
Alind
"Internally they had a lot of people with Ruby and PHP experience, so Scala
was appealing. "

Anyone notice this ??? logical?

~~~
spullara
Scala looks a lot more like Ruby than Java does. It makes people feel more
comfortable with the language. Also, a lot of the patterns, like collections,
are more similar between Scala and Ruby than Java and Ruby. This will change
with JDK 8, but for now it is true.

------
its_so_on
I'm sorry, but I have a hard time believing that anything is harder to scale
than Twitter. With an audience of 175 million users, a Twitter field of 140
bytes makes for 22 gigabytes of uncompressed text per person-tweet. If
everyone Tweets ten tweets per hour, that's 220 gigabytes. A one-TB HD would
only be good for 5 hours or so, meaning Twitter would need to buy a good 3000
hard-drives per year. That's probably more than Amazon has in total.
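
For anyone who wants to redo the arithmetic, here is a small sketch using only the figures in this comment: 175 million users, 140 bytes per tweet, and the hypothetical ten tweets per user per hour. The 1 TB drive is taken as 10^12 bytes, which is an assumption, and the calculation ignores metadata, replication, and compression.

```python
USERS = 175_000_000
BYTES_PER_TWEET = 140
TWEETS_PER_USER_PER_HOUR = 10  # the comment's hypothetical rate
DRIVE_BYTES = 10**12           # assumed: a "1 TB" drive in decimal units
HOURS_PER_YEAR = 24 * 365

# One tweet from every user.
bytes_per_round = USERS * BYTES_PER_TWEET
# Ten tweets from every user in an hour.
bytes_per_hour = bytes_per_round * TWEETS_PER_USER_PER_HOUR

hours_per_drive = DRIVE_BYTES / bytes_per_hour
drives_per_year = HOURS_PER_YEAR / hours_per_drive

print(f"per round: {bytes_per_round / 10**9:.1f} GB "
      f"({bytes_per_round / 2**30:.1f} GiB)")
print(f"per hour:  {bytes_per_hour / 10**9:.0f} GB")
print(f"a 1 TB drive lasts about {hours_per_drive:.1f} hours")
print(f"drives per year: about {drives_per_year:.0f}")
```

Under these assumptions one round of tweets is 24.5 GB (about 22.8 GiB), a drive fills in roughly four hours, and a year takes on the order of 2,100 drives, or a little over 2 PB of raw text.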

~~~
recursive
> If everyone Tweets ten tweets per hour

Well, luckily for them, that's not actually the case.

~~~
sakai
Ya... and Amazon almost certainly has more than 3000 TB of storage (3 PB).

~~~
ceejayoz
Hell, they've got pricing tiers for S3 _customers_ that go up to 5PB.

------
joshu
sounds like they have some twitter envy?

------
gaius
When I look at the amount of effort that goes into this, I have to wonder if
it wouldn't have been better to just write it in C in the first place.

------
cagenut
Tumblr is an extremely pageview heavy design, but the industry has moved away
from PV as an important metric for a few years now. Fortunately they're nice
enough to post their quantcast data publicly:
<http://www.quantcast.com/p-19UtqE8ngoZbM>

They're still phenomenal numbers, but IMHO should be much closer to a
100-server environment than a 1000 server one.

~~~
jonknee
100 servers for billions of requests per day seems crazy optimistic. It sounds
like they're running pretty lean as it is.

~~~
sanswork
I've done ~1b/day on around 6-10 1gb/gogrid instances though admittedly the
complexity was lower(ad serving platform) though not just a proxy/static
server. When I read this I actually messaged my partner "Imagine what we could
do with 1000 servers". I imagine a lot of those servers aren't directly
related to serving the site though. The number of support servers required is
usually way underestimated.

~~~
encoderer
I think you're on to something when you contrast the 2 problem domains. Number
of requests is a very naive way to look at load factors.

At the startup where I work, we've got 25-30 Million users, many stats similar to
Tumblr, and we're running it on about 250 Ec2 instances of varying size. I
think if Tumblr's numbers are high at all -- due to rapid iterations and no
time to focus on deep optimizations -- it's maybe 10-15% high, not 90%.

I'm saying this because I've seen periods where our usage numbers are somewhat
flat, even falling, but our hardware demands rise as we provide more features.
When there's just one or two primary ways to use a service (eg "I post status
updates and comment on my friends' status updates") it can be quite easy to
optimize. But add features. Photos. Chat. An in-house ad serving platform.
I18N. Etc. You have different types of interactions with different acceptable
service levels and varying storage requirements.

~~~
jaylevitt
Exactly. My iPhone can process 800 million requests per second. It's just that
those requests are to add two integers.

A request for static content != a request for dynamic content != a request for
inter-user messaging != a request for a recommendation engine.

------
vintagius
It's 2012 and we still embrace high pageviews and architectures to handle them
but ignore revenue.

It's high time every post about insane pageviews and servers was coupled with
solid revenue numbers.

~~~
ceejayoz
It's a fairly shaky assumption that an engineer putting together a MySQL
sharding presentation would know the revenue numbers for the company.

~~~
tomkarlo
Even if they did, they shouldn't be disclosing them.

