
Reddit: Lessons Learned From Scaling To 1 Billion Pageviews A Month - jpmc
http://highscalability.com/blog/2013/8/26/reddit-lessons-learned-from-mistakes-made-scaling-to-1-billi.html
======
peterwwillis
You notice how in these recaps, all you read is "I learned that X does
Y"? They don't seem to offer much in the way of lessons that apply
across situations. It's more like, "If you use this specific key/value
store, tweak the thingimabob to sassyfraz to make sure your dingo does
wibblydong." So if
my platform doesn't use that store, your lesson is pointless. If it's a
problem with an application, it's great that you're pointing it out, but if it
was just oversight by lazy engineers, leave it out.

Then there's the wise lessons on general topics, like the idea that you should
"wait until your site grows so you can learn where your scaling problems are
going to be". I'm pretty sure we _know_ what your scaling problems are going
to be. Every single resource in your platform and the way they are used will
eventually pose a scaling problem. Wait until they become a problem, or _plan_
for them to become a problem?

I'm not that crazy. It really doesn't take _a lot_ of time to plan ahead. Just
think about what you have, take an hour or two and come up with some potential
problems. Then sort the problems based on most-imminent most-horrible factors
and make a roadmap to fix them. I know nobody likes to take time to reflect
before they start banging away, but consider architectural engineering.
Without careful planning, the whole building may fall apart. (Granted,
nobody's going to die when your site falls apart, but it's a good mindset to
be in)

~~~
diego
No, you have no idea what your scale problems are going to be (if you ever
have them). That is because if you get lucky and your application scales, it
(and the world) will change significantly from what it is today.

Let me tell you a story: in 1998 at Inktomi (look it up) we had a distributed
search engine running on Sun hardware. We could not have anticipated that we'd
need to migrate to Linux/PC because Sun's prices would make it impractical for
us to continue scaling using their hardware. It took us three years to make
the switch, and that's one of the reasons we lost to Google. Had we started
two years later (when Gigabit ethernet became available for no-brand PC
hardware), then we would have built the entire thing on Linux to begin with.

"It really doesn't take a lot of time to plan ahead."

Have you ever experienced the growth of a startup to see your infrastructure
cost soar to five, six, seven figures per month? Two hours will get you as far
as "one day we'll probably need to replace MySQL with something else." What you
don't know is what that something else will be. Too many writes per second?
Reads? Need for geographical distribution? A schema that changes all the time
because you need features and fields you never imagined? Will you need to
break up your store into a service-oriented architecture with different types
of components? Will you run your own datacenter, or will you be on AWS? What
will be the maturity level of different pieces of software in two years?

I hope you get the point.

~~~
peterwwillis
Do you really expect me to buy the idea that your company failed because
Gigabit was too expensive? Even if 100Mbit was far cheaper, there are plenty
of workarounds to cheaply increase throughput.

I assume by no-brand you mean custom built, and the cheapest available,
in which case even one gigabit interface may have been difficult, seeing
as 32-bit/33 MHz PCI bus capacity is barely above gigabit speed.
In any case, the money you saved on Sun gear could have built you a sizeable
PC cluster and even with several 100Mbit interfaces would have been more
powerful and cheaper. Really I think it wasn't built on Linux because Sun was
the more robust, stable platform. But I could be crazy.

While I'm being crazy, I should point out that all the other things you
mentioned can be planned for. Anybody who's even read about
high-performance systems design should be able to account for too many
reads/writes! Geographical
distribution is simple math: at some point, there are too many upstream
clients for one downstream server, capacity fills, latency goes through the
roof. A DBA knows all about schema woes. I thought service-oriented
architecture was basic CS stuff? (I don't know, I never went to school) AWS
didn't exist at the time. And the maturity level of your software in two years
will, obviously, be two years more mature.

All of these problems are what someone with no experience or training will run
into. But there should be enough material out there now that anyone can read
up enough to account for all these operational and design problems, and more.
But if your argument is that start-up people shouldn't have to know it because
hey, I haven't got time to figure out how to do it right because I have to
ship right now, I don't buy that for a minute.

There's a paper that some guy wrote years ago that goes over in great detail
every single operational scaling issue you can think of, and it's free. I
don't remember where it is. But it should be required reading for anyone who
ever works on a network of more than two servers.

As an aside: was it _really_ cost that prohibited you from porting to Linux?
This article[1] from 2000 states "adding Linux to the Traffic Server's already
impressive phalanx of operating systems, including Solaris, Windows 2000, HP-
UX, DEC, Irix and others, shows that Inktomi is dedicated to open-source
standards, similar to the way IBM Corp. has readily embraced the technology
for its eServers." And this HN thread[2] has a guy claiming that in 1996 "we
used Intel because with Sun servers you paid an extreme markup for unnecessary
reliability". However, it did take him 4 years to move to Linux. (?!) A lot of
other interesting comments on that thread.

[1] [http://www.internetnews.com/bus-news/article.php/526691/Inkt...](http://www.internetnews.com/bus-news/article.php/526691/Inktomi+to+Embrace+Linux+With+Latest+Traffic+Server.htm)
[2] [https://news.ycombinator.com/item?id=3924609](https://news.ycombinator.com/item?id=3924609)

~~~
asdfjjjjjj
Hi Peter, another ex-Inktomi/ex-Yahoo guy here. I worked on this infrastructure
much later than Diego. Traffic Server is not a significant part of the Inktomi
environment -- you are looking at the wrong thing. Diego is describing the
search engine itself, which ran on Myrinet at that time. It did not run on
100baseT ethernet. Myrinet was costly and difficult to operate, but necessary
as the clusters performed an immense amount of network i/o.

It is also extremely non-trivial to replace your entire network _fabric_
alongside new serving hardware and a new OS platform. These are not
independent web servers, these are clustered systems which all speak peer to
peer during the process of serving a search result. This is very different
from running a few thousand web servers.

Even once migrated to gigE and linux, I watched the network topology evolve
several times as the serving footprint doubled and doubled.

I assure you, there is no single collection of "every single operational
scaling issue you can think of," because some systems have very different
architectures and scale demands -- often driven by costs unique to their
situation.

~~~
peterwwillis
What you're saying makes total sense in terms of complexity and time for
turning out a whole new platform. But to my view it depends a lot on your
application.

Was the app myrinet-specific? If so, I can understand increased difficulty in
porting. But at the same time, in 1999 and 2000, people were already building
real-time clusters on Linux Intel boxes with Myrinet. (I still don't know
exactly what time period his post was referencing.) If Diego's point was that they
didn't move to Linux because Gigabit wasn't cheap enough yet, why did they
stick with the expensive Sun/Myrinet gear, when they could have used
PC/Myrinet for cheaper? I must be missing something.

I can imagine your topology changing as you changed your algorithms or grew
your infrastructure to work around the massive load. I think that's natural.
My point was simply that making an attempt to understand your limitations and
anticipate growth is completely within the realm of possibility. This doesn't
sound unrealistic to me [based on working with HPC and web farms].

What I meant to say was "every _single_ issue", as in, individual problems of
scale, assumptions made about them, and how they affect your systems and end
users. It's a broad paper that generically covers all the basic "pain points"
of scaling both a network and the systems on it. You're going to have specific
concerns not listed, but it points out all the categories you should look at.
I believe it even went into datacenter design...

------
gbog
> Stay as schemaless as possible. It makes it easy to add features. All you
> need to do is add new properties without having to alter tables.

And at the same time they use and praise Postgres a lot, so it cannot be about
NoSQL.

I am wondering what they mean exactly. My own tendency would be to use a
few very big and narrow tables in the form of "who - did - what - when -
where", e.g. "userA - voted up - comment1 - timestamp - foosubreddit",
and "userB - posted - link1 - timestamp - barsubreddit".

Then the same table holds more or less every event happening on the
site, and you are somewhat schemaless, in the sense that adding new
functionality does not require a schema change.

If someone with inside insight can confirm this is not too far from what
the reddit team meant, I'd appreciate it.
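A minimal sketch of that idea, with hypothetical table and column names (SQLite standing in here only to make it runnable):

```python
import sqlite3

# Hypothetical sketch of the "who - did - what - when - where" idea: one
# narrow event table, where new site features add new verbs instead of
# new columns. Table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        actor   TEXT NOT NULL,      -- who:   userA
        verb    TEXT NOT NULL,      -- did:   vote_up, posted, ...
        object  TEXT NOT NULL,      -- what:  comment1, link1
        ts      INTEGER NOT NULL,   -- when:  unix timestamp
        context TEXT                -- where: foosubreddit
    )
""")
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", [
    ("userA", "vote_up", "comment1", 1377500000, "foosubreddit"),
    ("userB", "posted",  "link1",    1377500100, "barsubreddit"),
    # A brand-new feature ("saved") needs no ALTER TABLE:
    ("userA", "saved",   "link1",    1377500200, "barsubreddit"),
])
rows = conn.execute(
    "SELECT verb, object FROM events WHERE actor = ? ORDER BY ts", ("userA",)
).fetchall()
print(rows)  # [('vote_up', 'comment1'), ('saved', 'link1')]
```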

~~~
ketralnis
> And at the same time they use and praise Postgres a lot, so it cannot be
> about NoSQL.

We had a basic schema that essentially made postgres into a K/V store. So we
had both.
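For flavor, a rough sketch of that style of schema: a fixed "thing" table with a few core columns plus a "data" table of arbitrary key/value properties per thing. Names are illustrative, loosely modeled on reddit's open-source code; SQLite is used here only to keep it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE thing (
        thing_id INTEGER PRIMARY KEY,
        ups      INTEGER NOT NULL DEFAULT 0,
        downs    INTEGER NOT NULL DEFAULT 0,
        date     INTEGER NOT NULL
    );
    CREATE TABLE data (
        thing_id INTEGER NOT NULL REFERENCES thing(thing_id),
        key      TEXT NOT NULL,
        value    TEXT NOT NULL,   -- everything serialized to text
        PRIMARY KEY (thing_id, key)
    );
""")
conn.execute("INSERT INTO thing VALUES (1, 10, 2, 1377500000)")
# Adding a new property to a thing is an INSERT, not an ALTER TABLE:
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    (1, "title", "Lessons learned scaling reddit"),
    (1, "url",   "http://example.com/"),
])
title = conn.execute(
    "SELECT value FROM data WHERE thing_id = ? AND key = ?", (1, "title")
).fetchone()[0]
print(title)  # Lessons learned scaling reddit
```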

------
na85
Reddit is an interesting case; they seem to have an almost unlimited amount of
user goodwill. Case in point: I get the "you broke reddit" pageload failure
message an awful lot, and I'm sure others do too. How many other sites have
userbases that would tolerate such a high number of errors?

~~~
hboon
Not many perhaps. But Twitter did too.

------
continuations
> For comments it’s very fast to tell which comments you didn’t vote on, so
> the negative answers come back quickly.

Can you get into more details about how this is used? If reddit needs to
display a page that has 100 comments, do they query Cassandra on the voting
status of the user on those 100 comments?

I thought Cassandra was pretty slow at reads (slower than postgres), so how
does using Cassandra make it fast here?

~~~
extesy
As far as I understand, the user most likely voted on only a small subset of
those 100 comments (say 3), and negative lookups are very fast because of
Bloom filters [1], so all the lookups combined are fast.

[1]
[https://en.wikipedia.org/wiki/Bloom_filter](https://en.wikipedia.org/wiki/Bloom_filter)
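A toy sketch of the mechanism (not Cassandra's actual implementation): a Bloom filter can return false positives but never false negatives, so "you did not vote on this" answers are definitive and cheap.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int doubles as a cheap bit array

    def _positions(self, item):
        # Derive k bit positions from salted hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False here is definitive: the item was never added.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

votes = BloomFilter()
votes.add("user123:comment42")
print(votes.might_contain("user123:comment42"))  # True
print(votes.might_contain("user123:comment99"))  # False (with overwhelming probability)
```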

~~~
continuations
That makes sense. Thanks.

------
jzelinskie
This looks like a summary of the talk on InfoQ on the subject:

[http://www.infoq.com/presentations/scaling-reddit](http://www.infoq.com/presentations/scaling-reddit)

~~~
seiji
highscalability is a strange reposty/blogspam aggregation thing that takes
information from other places and just puts it up on their own site. I think
they started having some original content, but it's still mostly second hand
reports of source material found elsewhere.

(Think of it more as somebody's personal notes about how things work and not
an exclusive source of breaking news or architecture revelations.)

~~~
ketralnis
Yes, nobody was interviewed or anything to put this together. They just
cobbled together some (mostly very old!) articles

~~~
jedberg
I wish they would put this disclaimer at the top of their articles. :(

------
727374
"Treat non-logged-in users as second class citizens. By always giving
logged-out users cached content, Akamai bears the brunt of reddit's traffic.
Huge performance improvement."

This is the lowest of low hanging fruit. Many people don't realize it but a
ton of huge media sites use Akamai to offload most of their "read-only"
traffic.
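The mechanics are just HTTP cache headers. A hypothetical sketch of the decision (header values are illustrative, not reddit's actual configuration):

```python
# Responses to logged-out visitors get public, long-lived cache headers so a
# CDN like Akamai can serve them without hitting the origin; logged-in
# (personalized) responses stay uncacheable.
def cache_headers(logged_in: bool) -> dict:
    if logged_in:
        # Personalized page: the CDN must not cache or share it.
        return {"Cache-Control": "private, no-cache"}
    # Identical page for every logged-out visitor: let the CDN keep it.
    return {"Cache-Control": "public, max-age=300"}

print(cache_headers(False))  # {'Cache-Control': 'public, max-age=300'}
print(cache_headers(True))   # {'Cache-Control': 'private, no-cache'}
```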

~~~
ketralnis
Definitely true, and one of the earliest and longest-standing optimisations.
Even pre-Akamai, we had simple caching, both whole-page and per-object/query.

------
human_error
> Used the Pylons (Django was too slow), a Python based framework, from the
> start

This isn't quite right. It was web.py at the beginning. They started using
Pylons after the Condé Nast acquisition.

~~~
jedberg
Yes, I glossed over that part. We didn't use web.py very long.

~~~
dmead
one time, rob malda commented back to me on something. if you could do the
same, that'd be greeaaaat

~~~
jedberg
I think this kind of thing doesn't really fly on HN, but sure. :)

------
falcolas
I can certainly appreciate what Reddit has accomplished, but the thought of
losing the abilities of a full RDBMS for a key-value store makes my hair stand
on end.

I've yet to find schema changes limiting in my ability to code against a DB
(and I use MySQL, which is one of the most limiting in this regard). Plus, I
appreciate the ability to offload things like data consistency and
relationships to the database. I understand, however, where others might not
feel the same way.

~~~
diego
The long tail of startups will rarely need something other than a relational
database because they won't get to a scale anywhere near Reddit's. It's not
that others don't "feel" the same way; there's a reason all those technologies
exist. If you want to know, go work for Google, Twitter, Facebook, LinkedIn,
etc.

~~~
j_baker
I would argue the exact opposite. The average startup is likely using MySQL as
a glorified key-value store already anyway, and they're likely using it in
lieu of a more appropriate datastore because people tell them they don't need
a NoSQL database until they get to Google-size.

The lesson is: match your database to your use-case, not the other way around.
Need advanced querying/reporting options? Get a warm, fuzzy feeling from a SQL
prompt? Use MySQL. Want a plain jane key-value store? Use Voldemort/Kyoto
Cabinet. Want flexible schemas? Use MongoDB. Want a Key-value store with
secondary indexes and lots of scaling capabilities? Use Cassandra/HBase. Want
a powerful datastore that's supported by a BigCo? Use DynamoDB or Cloud
Datastore.

~~~
asdasf
There are literally zero appropriate use cases for mysql. If you need a
relational database, use one. If you need a network hash table, use one.
Don't use mysql at all.

~~~
falcolas
This is a very common (at least on HN), and very misdirected view of MySQL.

MySQL _is_ a high performing, highly scalable, ACID compliant relational
database, when configured correctly.

The "MySQL is not production ready" meme was perpetuated by some well meaning,
if ill-informed, fans of other RDBMS platforms.

~~~
asdasf
>MySQL is a high performing, highly scalable, ACID compliant relational
database, when configured correctly.

You forgot the "with tons of brokenness, misfeatures, mistakes, problems and
otherwise NOTABUG bugs that will never be fixed and cause immense amounts of
pain". Literally every other RDBMS is a better option, thus there is no reason
to use mysql.

~~~
falcolas
Simply repeating anti-MySQL rhetoric is not going to convince anybody that it
has actual problems, just that you've had bad experiences in the past that
have biased you strongly against it.

It's particularly not going to convince people when it's so widely used (from
Wordpress installations to Facebook), and perhaps more importantly when it's
offered as part of the two largest VPS providers.

On topic, I'd be happy to offer some advice on how to set up MySQL in a way
that limits (or eliminates) the concerns proffered by most "MySQL is not
Production Ready" comments... the two most oft-cited problems being addressed
by the following two my.cnf settings:

    sql-mode=TRADITIONAL
    default-storage-engine=InnoDB
~~~
jacques_chester
The baseline is what counts.

You can avoid buffer overflows in C by using a library that's got safe
strings.

Does that make C safe? Nope.

The Windows NT architecture has an enormously rich security mechanism that can
allow arbitrarily granular security statements to be made about almost
everything. But the default policy until Windows 7 was "pretend you're Windows
95".

Did that make Windows more secure than Unix? Nope.

 _The baseline is what counts_.

~~~
falcolas
The baseline (reading as default configuration) is the only thing that counts?
Then Postgres is unusable for any reasonably sized dataset.

Of course, so is Oracle, SQL Server, and every other database known to man.

You have to tailor the configuration of any database server to meet your
needs. MySQL is no different in this regard.

~~~
jacques_chester
My need is for a database that doesn't silently corrupt my data.

MySQL _is_ different in this regard.

------
chrismealy
_Queues were a saviour. When passing work between components put it into a
queue. You get a nice little buffer._

What does reddit use for queuing?

~~~
jeffasinger
I believe they use RabbitMQ.
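Independent of the broker, the "nice little buffer" point can be shown with a minimal in-process sketch (Python's queue standing in for a real message broker): a bounded queue absorbs a producer burst and applies backpressure instead of letting the burst overwhelm the consumer.

```python
import queue
import threading

work = queue.Queue(maxsize=100)  # bounded: put() blocks when full
results = []

def consumer():
    while True:
        item = work.get()
        if item is None:   # sentinel: shut down
            break
        results.append(item * 2)  # stand-in for real processing

t = threading.Thread(target=consumer)
t.start()
for i in range(5):  # the producer bursts five items at once
    work.put(i)
work.put(None)
t.join()
print(results)  # [0, 2, 4, 6, 8]
```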

------
WestCoastJustin
This appears to be a summary of an InfoQ presentation, which was discussed
about two weeks ago @
[https://news.ycombinator.com/item?id=6222726](https://news.ycombinator.com/item?id=6222726)

------
jjwiseman
"Do not keep secret keys on the instance." I'm curious how people deal with
this--what approaches do you use?

~~~
jedberg
Amazon now provides a service to give you on-instance keys:
[http://aws.amazon.com/iam/faqs/#What_is_IAM_roles_for_EC2_in...](http://aws.amazon.com/iam/faqs/#What_is_IAM_roles_for_EC2_instances)

Before that at Netflix we developed a service that would hand out temporary
keys to the requestor when they presented a proper certificate.

At reddit we put the secret keys on the instance, which was bad. :)

~~~
arohner
Very cool. Are there any publicly available options for non-AWS services?

------
misiti3780
Is it common for people to use Postgres as a key-value store in production
(rather than redis)? This is the first time I have heard of it, and I am just
starting to use Postgres now, so I was a bit surprised.

~~~
rosser
It's still relatively new functionality, so I wouldn't expect to see it in
wide use, but we're currently trying it out in a limited, point-solution kind
of role. (We're a Postgres-mostly shop already.)

So far, everything's working as well as I'd have expected from something
released by the PostgreSQL community.

~~~
yummyfajitas
The functionality was always there:

    CREATE TABLE kvstore (
        key VARCHAR(128) NOT NULL UNIQUE,
        value VARCHAR(128) NOT NULL
    );

Or are you referring to hstore?

~~~
rosser
I'm talking about the JSON-specific functionality, which tends to make doing
JSON-based, document-oriented, key-value-ish things in PostgreSQL
significantly better, easier, faster, whatever-er.
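As a hedged illustration of the idea, with SQLite's json_extract standing in for Postgres's JSON operators (the Postgres syntax differs, e.g. value::json ->> 'karma'); this assumes a SQLite build with the JSON1 extension, which is standard in modern builds:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO docs VALUES (?, ?)",
             ("user:1", json.dumps({"name": "alice", "karma": 42})))
# The database, not the client, digs the field out of the document:
karma = conn.execute(
    "SELECT json_extract(value, '$.karma') FROM docs WHERE key = ?",
    ("user:1",),
).fetchone()[0]
print(karma)  # 42
```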

------
exhaze
Jeremy also gave a great Airbnb tech talk on this topic:

[http://nerds.airbnb.com/reddit-netflix-and-beyond-building-s...](http://nerds.airbnb.com/reddit-netflix-and-beyond-building-scalable-and-reliable-architectures-in-the-cloud/)

------
callmeed
Can someone elaborate/clarify this:

 _> Users connect to a web tier which talks to an application tier._

So, I'm assuming the web tier is nginx/haproxy and the application tier is
Pylons.

Are the 240 servers mentioned all running _both_ the web tier and the app
tier?

~~~
computer
Presumably the web tier does slightly more than just reverse proxying. For
example, it could build (render) pages based on an internal RedditAPI it
queries. This RedditAPI (application layer) would then basically be a
distributed database front-end with some state, like user sessions.

Separating it at that point allows the web tier to offload much of the work
(mostly rendering), while keeping it stateless, thus allowing effortless
scaling of that tier.

------
chum
_Recode Python functions in C_

From a security standpoint, this sounds like a bad idea

~~~
jedberg
By the time it hits the C code it should be sanitized, but yes, it does add
some security overhead.

------
ivanbrussik
Just out of curiosity, what does "stay as schemaless as possible" mean? That
did not read right to me.

------
skeletonjelly
jedberg - you speak of automation; did you use anything (or is there anything
in use currently) that handles auto scaling for EC2? puppet/chef/ansible
etc.? Or was this all done by hand?

------
srj55
hmm...no love for django here.

~~~
falcolas
I'm currently using Django, and I can understand where they are coming from.
The sheer amount of code that a simple request has to go through to receive an
answer is staggering sometimes, such as 18 line tracebacks just to identify an
authentication problem...

My own projects are not yet large enough to have this cause an issue, but I
can see where something the size of Reddit would indeed have issues that even
the most aggressive caching can't resolve.

