Scaling Reddit from 1 Million to 1 Billion – Pitfalls and Lessons [video] (infoq.com)
201 points by veszig on Aug 16, 2013 | 109 comments

What a horrible website. Made an account to get the mp3, just to discover that the mp3 download link doesn't work even then. Thank god I used a throwaway email to sign up. Otherwise I would be even more pissed.


Bugmenot is great for sites like this.


is what i use, much simpler.

I wouldn't call that simpler. You have to register using a fake email, possibly actually go check the inbox and click a link, and possibly be required to log in again now that the account is registered. With bugmenot, all you do is copy and paste a valid username and password.

Another similar option is Mailinator: http://mailinator.com/

Some websites block emails from mailinator which is pretty annoying.

For Firefox users, made even smoother with: https://addons.mozilla.org/en-us/firefox/addon/bloody-viking...

i keep forgetting about bugmenot

Disappointment in the HN community has reached a new high today.

So, instead of discussing the topics in the video, the majority of commenters here are discussing the flaws of the website it's hosted on or debating whether or not reddit is profitable. Neither of which has ANYTHING to do with scaling.

I expected better, people. Seriously.

The dominance of irrelevance is interesting for HN. I was writing exactly the same thing and just saw yours.

In this and some other technical topics, people end up discussing their personal taste in the website's design, their individual UI frustrations with some button on the site, the font, the color, and other random irrelevant topics; in this case, profitability.

It's no excuse, but I see a reason for this: most people feel they understand these irrelevant topics better than they understand the scalability features of the infrastructure Reddit has built. People talk about their comfort zone.

This is the fundamental rule of bikeshedding, which HN has discussed, decried and regurgitated since time immemorial.

Spot on observation IMO

Maybe they can't see the video to begin with because of excessive web advertising ploys. That might explain why they are not commenting on the video. Just a guess.

I'm not trying to be snarky, but your comment is off topic as well.. When does it end?

"ask me anything about running a profitable social media company".

Except that reddit is not profitable.



Funny story about that. When we were owned by Conde, the accounting was a little different (they took on some of the charges, like Akamai), and so we were actually told we were slightly profitable.

When that blog post went up I was as surprised as you to see it wasn't profitable when I was there.

But weren't you "deeply involved in the business?"

Yes. What point are you trying to make?

I think his point is that he doesn't understand accounting and how complicated it can be so he thinks that if you were deeply involved in the business you should know every facet of every debit and credit.

If he's right I guess I wasted about 300 hours studying for Financial Reporting and Analysis section in the Chartered Financial Analyst (CFA) program.

Since this isn't reddit, I won't just say: ^^this.

But yes, I believe he was trying to point out that I couldn't have been involved if I didn't know, but doesn't understand the ins and outs of G&A and other such accounting practices.

I was confused about the memcached problem after moving to the cloud. I understand why network latency may have gone from submillisecond to milliseconds, but how could you improve latency by batching requests? Shouldn't that improve efficiency, not latency, at the possible expense of latency (since some requests will wait on the client as they get batched)? And while maybe efficiency is valuable, why would that be an improvement for a problem they didn't have before?

Sorry that wasn't clear. The latency didn't get better, but what happened is that instead of having to make a lot of calls to memcache it was just one (well, just a few), so while that one took longer, the total time was much less.
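
To make the arithmetic concrete, here is a minimal sketch of why one slower batched call can beat many fast ones. Everything in it is an assumption for illustration: `FakeCache` is a hypothetical in-memory stand-in for a memcached client, and the two cost constants are made-up numbers, not measurements from reddit. Time is simulated by adding up costs rather than by sleeping.

```python
ROUND_TRIP_US = 1000  # assumed network cost per call, in microseconds
PER_KEY_US = 10       # assumed server-side cost per key looked up


class FakeCache:
    """Hypothetical stand-in for a memcached client that tracks simulated time."""

    def __init__(self, data):
        self._data = dict(data)
        self.calls = 0
        self.elapsed_us = 0

    def _charge(self, key_count):
        # Every call pays one round trip plus a small cost per key.
        self.calls += 1
        self.elapsed_us += ROUND_TRIP_US + PER_KEY_US * key_count

    def get(self, key):
        self._charge(1)
        return self._data.get(key)

    def get_multi(self, keys):
        keys = list(keys)
        self._charge(len(keys))
        return {k: self._data[k] for k in keys if k in self._data}


def fetch_one_by_one(cache, keys):
    """One round trip per key."""
    found = {}
    for key in keys:
        value = cache.get(key)
        if value is not None:
            found[key] = value
    return found


def fetch_batched(cache, keys):
    """One round trip for the whole set of keys."""
    return cache.get_multi(keys)


keys = [f"vote:{i}" for i in range(100)]
data = {key: 1 for key in keys}
slow, fast = FakeCache(data), FakeCache(data)

assert fetch_one_by_one(slow, keys) == fetch_batched(fast, keys)
print(slow.calls, slow.elapsed_us)  # 100 calls, 101000 simulated microseconds
print(fast.calls, fast.elapsed_us)  # 1 call, 2000 simulated microseconds
```

Under these assumed costs the single batched call is twice as slow as any individual `get` (2000 vs 1010 microseconds), but the page needs only one of them, so the total drops from about 101 ms to 2 ms.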

I actually did some (simplistic) examples of this in a small presentation to illustrate the performance improvements of batching memcached requests, if anyone's interested: https://speakerdeck.com/robotmay/a-simple-introduction-to-ef... (slides 11 to 14)

That's a better explanation than mine. :) Thanks for the link.

As long as nothing's blocked, latency could go up 'a lot' (sub-ms -> ms, maybe 1ms->2ms with batching) without meaningfully impacting overall throughput.

I can definitely see millions of networked memcache calls being a bottleneck, and if the batching adds another ms per req on average, but removes the bottleneck, then they can serve a lot more users at a cost of 1ms per req.

Is there anything in TFA that would support my theory? I don't know. I don't care enough to endure InfoQ. (I did for a Rich Hickey talk once, lo these various months, and yea it were a minor inconvenience).

Edit: whoa jedbergo!

Though it might not be the situation in the video, introducing batching can decrease system latency. I made this little graph to show how this can work:


So did anyone else take the most important lessons in the video as:

1. Use AWS

2. Use Postgres

3. Use AWS

4. Use Cassandra

5. Use python, so later you can write C when shit needs to go super fast

That's what I got.

I find this interesting, this is really great feedback for me actually.

Those are some of the important lessons, although use (postgres|cassandra) are really too prescriptive. More like "use the right tool or tools for the job".

Also, "use consistent key hashing where appropriate" is another important lesson that I should emphasize more.
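
For readers who have not run into consistent hashing before, a minimal sketch of the general technique follows. This is not reddit's code: the `HashRing` class, the node names, and the choice of 100 virtual points per node are all assumptions made for illustration. The point is that adding a cache node only moves the keys that the new node takes over, instead of reshuffling nearly everything as `hash(key) % node_count` would.

```python
import bisect
import hashlib


def _position(value):
    """Map a string to a stable 64-bit position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Illustrative consistent hash ring with virtual points per node."""

    def __init__(self, nodes, replicas=100):
        self._replicas = replicas
        self._points = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._replicas):
            point = _position(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owners[point] = node

    def remove(self, node):
        for i in range(self._replicas):
            point = _position(f"{node}#{i}")
            self._points.remove(point)
            del self._owners[point]

    def node_for(self, key):
        if not self._points:
            raise ValueError("ring has no nodes")
        # The first point clockwise from the key's position owns the key.
        index = bisect.bisect(self._points, _position(key)) % len(self._points)
        return self._owners[self._points[index]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"key:{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

ring.add("cache-d")
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]

# Every key that changed owner moved to the new node; the rest stayed put.
assert all(after[key] == "cache-d" for key in moved)
print(f"{len(moved)} of {len(keys)} keys moved")
```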

And "build for 3" is another important lesson. It makes scaling much easier.

Thrilled to finally see this on InfoQ -- already an underappreciated resource for technical talks. What is it about video + slides that appeals so little to people?

Large banner and tiny video frame, site rendering completely garbled for 8 seconds until fully loaded, signup required to view the slides (after realizing that the video doesn't cut to slides at all). The talks are interesting, and the interview transcription is nice though there are UX issues there as well– I need to click to view each response, and despite the entire question being hyperlinked clicking it actually does nothing. Show all works thankfully (ah, but the frame breaks my scroll).

I'll probably be back to check out more of the videos, but definitely not because of the site. If the editing is good, YouTube is just fine, otherwise SlideShare plus an audio file is just perfect.


    [F] First Byte Time
    [C] Keep-alive Enabled
    [C] Compress Transfer
    [A] Compress Images
    [A] Progressive JPEGs
    [F] Cache static content
    [ ] Effective use of CDN
Source: http://www.webpagetest.org/result/130816_PE_AYH/

I will just comment in full support of this, it says everything.

> SlideShare plus an audio file is just perfect.

Although slideshare is not as bad as infoq by a long shot, it's not very useable (or fast) either. I much prefer speakerdeck.

Speakerdeck is actually what I was thinking of, but the gist is slides + audio = happy. A more polished UI would do some good for SlideShare, but I don't have any major issues.

> What is it about video + slides that appeals so little to people?

A full transcript with interleaved slide images takes a few minutes to read at most, and lets you control the pace of information absorption. A video with slides, especially when you cannot 2x the talking speed, is a painfully slow data transfer method.

Video + Slides = analog modem

Transcript w/ embedded slides = Google fiber :-)

Of course, if you're into audio books instead of reading, maybe you consider that a feature.

// I push video for a living. It's great for visual explanation like DIY instruction (e.g. woodworking, swapping RAM on a Mac Mini), emotional content, personal story telling, etc. Systems architecture is generally not in one of those categories.

> What is it about video + slides that appeals so little to people

I don't have a solid block of 40 undisturbed minutes to listen to a talk. Give me a transcript and I can read a paragraph here and there as I do other things at my own pace. I might have ten minutes here, ten minutes there. I don't want to be constantly pausing/unpausing the video, or worse - switching between the video and my music.

Plus, if I concentrate, I could read a 40 minute talk in 20 minutes or less.

Basically, when I'm reading, I control the pace. I rarely watch videos that are longer than about 5 minutes (that aren't entertainment, which is entirely different).

Site kinda sucks and is annoying to use for those of us who prefer textual resources and the ability to flip through slides ourselves without clicking around teensy tiny 1px gaps between slides.

I can read a slide deck at a rate of 5s/slide (or faster), pausing to concentrate on the ones that interest me. I don't have the spare time to watch a video of someone, just in case the topic is interesting.

The video is 40 minutes long. If this was a blog post it'd take me maybe 5-15 minutes to read/skim.

Creating a video on youtube and linking to a pdf with the slides is easy enough to do. The value add of having the website procedurally flip slides for me is very, very small.

Does anyone know how the different storage systems are utilized, and why each system is utilized for that purpose? The presenter mentions using memcached, Cassandra, and PostgreSQL, and mentions the same type of data when discussing each (votes, for instance). I would definitely benefit from a more in-depth understanding of how each system is utilized, and why.

Each tool has a different use case. Votes is a great example.

Memcache has no guarantees about durability, but is very fast, so the vote data is stored there to make rendering of pages as quick as possible.

Cassandra is durable and fast, and gives fast negative lookups because of its bloom filter, so it was good for storing a durable copy of the votes for when the data isn't in memcache.

Postgres is rock solid and relational, so it was a good place to store votes as a backup for Cassandra (we could regenerate all the data in Cassandra from Postgres if necessary) and also for doing batch processing, which sometimes needed the relational capabilities.
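
The three roles described above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain dicts stand in for memcached, Cassandra, and Postgres, the `VoteStore` class and its method names are hypothetical, and the write order shown is a guess rather than anything stated in the talk.

```python
class VoteStore:
    """Illustrative three-tier vote storage using dicts as stand-ins."""

    def __init__(self):
        self.cache = {}       # stand-in for memcached: fast, not durable
        self.durable = {}     # stand-in for Cassandra: durable, fast lookups
        self.relational = {}  # stand-in for Postgres: backup and batch jobs

    def record_vote(self, user, link, direction):
        key = (user, link)
        # Assumed write order: most durable store first, cache last.
        self.relational[key] = direction
        self.durable[key] = direction
        self.cache[key] = direction

    def get_vote(self, user, link):
        key = (user, link)
        if key in self.cache:
            return self.cache[key]
        if key in self.durable:
            # Cache miss: read the durable copy and repopulate the cache.
            self.cache[key] = self.durable[key]
            return self.cache[key]
        # Negative lookup: in real Cassandra the bloom filter keeps this cheap.
        return None

    def rebuild_durable(self):
        """Regenerate the Cassandra stand-in from the Postgres stand-in."""
        self.durable = dict(self.relational)


store = VoteStore()
store.record_vote("alice", "link-1", 1)

store.cache.clear()  # simulate memcached losing its contents
print(store.get_vote("alice", "link-1"))  # served from the durable tier

store.durable.clear()  # simulate losing the Cassandra data
store.cache.clear()
store.rebuild_durable()
print(store.get_vote("alice", "link-1"))  # available again after the rebuild
print(store.get_vote("bob", "link-1"))    # bob never voted on this link
```

Note that reads in this sketch never fall through to the relational tier; that matches the description above, where Postgres is the backup and batch-processing store rather than part of the page-rendering path.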

That makes a lot of sense. Were the majority of your systems using this "durability chain" so to speak -- memcached -> Cassandra -> Postgres? Additionally, in retrospect do you find this type of chain to work well, and would you use it again (perhaps you already are over at Netflix)?

Side question: is vote queuing the reason behind the sometimes large drops in score on highly active and popular submissions on reddit?

No. That's generally because once something gets popular and jumps to the front page, it gets a huge boost in visibility, especially from people who weren't looking at the niche subreddit it comes from.

A lot of those people aren't interested in that content, so it will suddenly get an influx of downvotes.

Thanks for the reply.

I'm glad there is an explanation based on user behavior for this phenomenon because admin level vote tampering is such a tired theory.

I wonder if the demise of Digg three years ago and the (supposed) inflow of new users were problematic at the time.

There wasn't actually a very large jump in traffic when Digg v4 was launched. Most of those folks were already reddit users. Traffic bumped a little bit, but not all that much.

It's important to keep in mind that reddit was already doing twice as much traffic as Digg before they launched v4.

Thanks for your input, didn't know that.

I, for one, went from a /r/php lurker to a real user at that time, so I thought there were dozens of us, dozens!

Most likely as Reddit had continued downtime around that time with EC2/EBS scaling issues.

Reddit still uses Pylons, and I imagine moving to Pyramid would be very painful. He's not clear on this in the video.

I don't think I made any indications one way or another, but yes, it would probably be difficult to move.

Great presentation man. Do you think Pylons will serve Reddit for years and years to come? Is there any need for you to switch to Pyramid or another framework? I made the choice to use Pylons for a recent project, and it just feels kind of odd using an old framework which is now in "maintenance only mode", but I truly did not like Pyramid... much less Django.

For me the link works only in Safari. Firefox plays only the video, and Chrome just does not work.

I am curious how much a Go lang rewrite would make a difference in scaling up.

It depends how much time is spent doing computation and how much time is spent doing lookups from disk. If the latter is clearly dominating, then a switch to Go will not help much.

Thanks for the insight. I haven't had much to do with Go, but I hear about a lot of positive benchmarks here on HN. Reddit is kind of an example social web app, and I have read a lot about its architectural changes and scaling efforts. So for social Web 2.0 apps, which are getting older now, I'm curious how much difference Go would make. Mainly for Google applications, apparently Go brings a lot of speed and performance on the same server.

It doesn't really matter what the genre of an application is. What matters is the runtime fundamentals. How much time is spent computing vs waiting for I/O? Whichever one is slower is the current bottleneck and is what you should fret about. Go becomes something to consider if the bottleneck is computation time. It's tangential otherwise.
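
A back-of-the-envelope way to put numbers on this is Amdahl's law. The sketch below assumes each request is a simple sum of compute time and I/O wait with no overlap, and the fractions used are made up for illustration, not figures from reddit.

```python
def overall_speedup(compute_fraction, compute_speedup):
    """Amdahl's law: speedup of the whole request when only compute gets faster."""
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    if compute_speedup <= 0:
        raise ValueError("compute_speedup must be positive")
    io_fraction = 1.0 - compute_fraction  # unchanged by a language rewrite
    return 1.0 / (io_fraction + compute_fraction / compute_speedup)


# I/O-bound request: 10% compute, 90% waiting. A 10x faster language barely helps.
print(f"{overall_speedup(0.10, 10):.2f}x")  # 1.10x

# Compute-bound request: 90% compute. The same 10x matters a great deal.
print(f"{overall_speedup(0.90, 10):.2f}x")  # 5.26x
```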

I read a comment from one of the sysadmins (alienth?) saying their main bottlenecks were I/O.

With all of Reddit's past problems with AWS and EBS (https://news.ycombinator.com/item?id=2339214, https://news.ycombinator.com/item?id=2469838), I figured they would have jumped to Google Compute Engine by now, which has much better IO all around (http://gigaom.com/2013/03/15/by-the-numbers-how-google-compu...).

I laughed. Deft trolling. Bravo.

Why? Looks like a legitimate question to me.

Some time ago the HN hivemind went through a period of blind Node.js love. Now it's going through blind Go love.

Regardless of what you think of either language, "rewrite in X" is not a magic incantation that will spontaneously solve all your architectural issues. Designing a good architecture involves balancing many components, of which your primary implementation language is an important, but not exclusive, element. There are also the organisational issues -- hiring, spending time not adding new features, etc.

Perhaps so, but then it would fit into the category of 'unaware of being a parody of itself' statements/questions that are also (unintentionally) humorous such as the classic "I can tell that site was built in rails from the design".

He forgot building the 2012 servers outlined in Jeff Atwood's blog post "Building Servers for Fun and Prof... OK, Maybe Just for Fun". Gotta do that for bleeding (profusely) edge power.

Reddit's not profitable though..

I wrote this just above but I'll repeat it here:

Funny story about that. When we were owned by Conde, the accounting was a little different (they took on some of the charges, like Akamai), and so we were actually told we were slightly profitable.

When that blog post went up I was as surprised as you to see it wasn't profitable when I was there.

Profitability is not that big a deal for something of the size and importance of Reddit, though.

Firstly, as with Wikipedia, if Reddit were forced to close because of money issues, Reddit could simply post a 'donate now or reddit shuts down' post and they would likely be rolling in millions of dollars.

Second, simply because reddit itself is not profitable does not mean people are not making a lot of money off reddit. The moderator system lends itself very well to a kind of 'corporate capture' of communities where moderators can be (and are) bought off for very tidy sums.

> Firstly, as with Wikipedia, if Reddit were forced to close because of money issues, Reddit could simply post a 'donate now or reddit shuts down' post and they would likely be rolling in millions of dollars.

From what I remember, this is kind of why they started Reddit Gold.


It's ironic that the alternative to advertising is to splash big "DONATE NOW" adverts over a website.

It's not the alternative to advertising (it is advertising); it's the alternative to soliciting and displaying third-party advertisements.

For me personally, it ends up being far more annoying. I really wish wikipedia would just put small unobtrusive text adverts on each page rather than the massive intrusive banners begging for money.

There is an issue of who calls the shots -- if you solicit donations from your users, that's who you are beholden to and need to serve to get money. If you are soliciting third party advertisements, that's who you are beholden to (and if you are using a third-party ad placement service, you are beholden to them as well as, perhaps more than, the actual advertisers.)

That is one issue, indeed. But the downside is that you're hassling your users to give you money, rather than hassling advertisers to give you money.

I'd rather not be hassled as a user.

I would also click on adverts, and buy things if they're useful to me, but I don't think I'd ever donate to a website.

> I really wish wikipedia would just put small unobtrusive text adverts on each page rather than the massive intrusive banners begging for money.

Hi, welcome to your first day on the Internet. Since you're new, let me tell you how things work around here.

There are probably dozens of web sites similar to Wikipedia. But Wikipedia is on the first page of search engine results for just about anything you search for. Why is that? Because people have learned that they can trust them over the last 12.5 years.

When you go to Wikipedia, you know that when you're looking for information on the Battle of Hastings that you aren't going to see ads for anatomy enlargement pills. You won't see any advertising at all in fact. You know that the community at large does a decent job at removing biased information. You know that a company can't buy their way into hiding negative information or promoting positive information.

This level of trust is what causes people to link to Wikipedia thousands of times per day.

So let's say Wikipedia takes your advice. They put a small unobtrusive text advert on each page. Suddenly you're searching for information on acne and an ad for "Acbegone" pops up that promises to cure your problem for 3 easy payments of $19.95. Acbegone ends up becoming a huge advertiser with Wikipedia - spending $1 million per month on advertising. Suddenly Wikipedia gets The Phone Call. "Hi, this is Acbegone. We'd love to continue advertising on your site but your article on acne mentions 10 other products. Get rid of those and we'll double our ad spend with you. Don't get rid of them and we'll be forced to stop advertising." Wikipedia can't make do without the income they've become accustomed to so they make editorial decisions to not mention any product - but still there's that ad from Acbegone. Suddenly Wikipedia seems like one huge cheesy ad. People stop trusting it. People stop linking to it. It stops coming up in search engine results.

For a real world example, see http://en.wikipedia.org/wiki/Digg#Digg_v4

Look at that - a link to Wikipedia.

Welcome to the internet. You seem new.

Everyone goes to Google.com to search for things. You know that when you do a search, you're going to see helpful related adverts.

That level of conflict of interests and possible abuse, privacy concerns etc, means that the entire world uses google as their search engine. Oh and they make billions in profit.

Your hypothesis about an advertiser asking wikipedia to alter content surely applies to google search results.

> Your hypothesis about an advertiser asking wikipedia to alter content surely applies to google search results.

Google indexes other people's content. All Google has to say is, "Sorry, we're not in control of the content others make, our automated systems follow an algorithm we're unable to make one-off tweaks to." It could conceivably cost Google $1MM to make a one-off tweak to their algorithm in terms of programming and testing time.

Wikipedia on the other hand is all content. They have no plausible response other than, "Yeah, it would take 5 minutes to update that but we won't do that for you." Hell, all they'd really have to do is let the advertiser update it as they want and then instruct editors to do nothing.

It really is just different for this and a number of other reasons.

The problem is, if wikipedia did alter articles based on advertisers' demands (which seems pretty far-fetched to me), the public would just alter them back. Or see the edits wikipedia is making and put 2 and 2 together.

>and then instruct editors to do nothing.

Yeah good luck with getting wikipedia editors to comply with that request!

A site like wikipedia would likely have thousands upon thousands of advertisers. They wouldn't be dependent on a few big advertisers. If an advertiser came to wikipedia and asked them to change a page, wikipedia would just say "no", publish the details to make the advertiser look like a douche (cue internet witch hunt, boycott, naming, shaming, etc.), and not care about the 0.000% temporary drop in revenue.

If there is one lesson that HN needs to learn, it's that profitable is not the same thing as important.

I think that the endless stream of stories about websites shutting down and deleting all user content has made it clear to HN users that clearing a profit is pretty important.

So they are important, but need to live off ramen every day (figuratively). What good does that do to them?

I don't think grandparent should be downvoted, he raises a good point. Tons and tons of people use Reddit, but Reddit has a hard time making a living.

> So they are important, but need to live off ramen every day (figuratively). What good does that do to them?

Maybe they enjoy it? Maybe it makes them happy?

Not everything is about money.

Power and fame are strong motivators too.

Look back in time. Read up on other things that had a massive userbase, but were unsustainable.

Look up AllAdvantage - they paid people to surf. Had millions of users, but ultimately failed because their "business model" was idiotic.

Getting millions of users is pretty easy if you pay them to be a user. Someday though, it's only worthwhile if you can build a sustainable business which at least doesn't lose money hand over fist.

If Reddit hadn't got bought and supported by other profitable businesses, I doubt it would have survived.

AllAdvantage was great for free money as a teenager. It only took a few minutes to slap together a VB application to move the mouse a few pixels every minute. I made a few hundred dollars from them while I slept.

I guess I didn't have my act together enough as a teenager to commit fraud over the Internet while I slept.

It wasn't really fraud. The terms of use etc. weren't really clear enough or, let's face it, enforceable.

Was it fraud if you got a dog to play with the mouse, and never looked at the screen? The dog might have still been looking at the adverts!

That was more of a general statement than a comment specifically on reddit's position.

I can make millions of dollars selling condom wrappers, but just because I have made millions of dollars does not mean that I have done something important. I may catch quite a bit of hate for this, but a large portion of HN's content is on things that make money, but are not truly important.

I think the mantra from the startup community is often:

1. Make money doing whatever it takes, e.g. come up with some crappy website, sell it to google, then shut it down.

2. The money problem is solved!

3. Spend money solving world hunger, diseases, philanthropy.

Which IMHO is pretentious BS.

Without cash flow, how far can the site go? Someone needs to pay the bandwidth/server bill. Not to mention it costs time to run a site. If you are working full-time at another job (because you aren't bringing any money in), you won't have any time to work on it.

Reddit succeeded largely because the company that bought them is making money elsewhere.

>> Reddit succeeded largely because the company that bought them is making money elsewhere.

Such nonsense makes me angry. Surely you've heard about companies called Google, Facebook, LinkedIn, that succeeded without being bought and for a long time without making a profit.

The difference is Reddit isn't particularly looking for profit. Unless they completely change their business and way of doing things (unlikely to happen), they probably will never make a profit.

Agreed. If they are re-investing money to grow the product, that might be why they aren't currently profitable. That doesn't indicate a lack of success in my book.

Reddit has an exceptionally low cost to scale ratio. Where are you getting the information that they're not profitable at this point (I know they weren't in the past)?

From the horse's mouth (18th of July, 2013):

Yep, the site is still in the red. We are trying to finish the year at break-even (or slightly above, to have a margin of error) though. [1]

1: http://www.reddit.com/r/TheoryOfReddit/comments/1ihwy8/rathe...

Last time they shared, they were spending ridiculous amounts of money on Amazon EC2 per month. Like ridiculous amounts.

Someday, the money will run out, and they'll have to try and turn a profit.

Their business model is fundamentally bad.

If those outdated figures of ~$50k/month are true, I wonder if they could move to their own infrastructure or dedicated and turn some of those savings into getting from red to black.


We operate a dozen of our own colos, with a virtual colo on AWS for insta-scalable multi-region redundancy, and an Amazon "colo" costs the same as about eight of our own when spun up and serving at least a gigabit of traffic.

However, the difference is less if you're going from zero sys admins to 24/7 sys admins. I'd SWAG the crossover is once your AWS budget exceeds 4 full time sys admins willing to do shift work.

That wouldn't address the lack of a sustainable business model.

Being a forum where low wage/students/anti-corporate/anti-advertising types go and share memes is the elephant in the room problem.

Just because reddit has a lot of anticorporate types doesn't mean they don't also have a lot of pro-corporate types, too.

For example, I wouldn't be surprised if the Internet's largest right-wing community turned out to be one of the subreddits.

/r/all is the internet’s largest right-wing community, on any manner of subjects from race relations in America, to multiculturalism in Europe, to feminism and women’s rights anywhere. Last time I visited was around the Zimmerman verdict, and I couldn’t decide whether the conversation on reddit more closely resembled Free Republic or Stormfront—the major difference being that neither of those other right-wing communities can match redditors in their hatred and fear of women.

> Reddit's not profitable though..


I would find it incredibly strange if reddit is never able to make money. I don't think they are really trying at the moment.

Reddit is an incredibly valuable service. Maybe a lot of people on Hacker News don't see this, but reddit has basically become the Geocities of online discussion communities. The subreddit system has eliminated the "eternal september" problem, since all non-casual users will trickle into the communities that match their interests. Even if reddit loses 90% of its users, it will still be a highly relevant online community. I am certain that they can turn a (modest) profit if they really try.

Reddit will probably never become a massive money machine. But regardless, it is a very influential community. Even community is arguably an understatement at this point, it is really closer to infrastructure. As I have said here before, I would be willing to bet that it is still around in 10 years, with a significant (millions) amount of users.

> I don't think they are really trying at the moment.

I can - sort of - confirm they are not trying hard. Last time I tried to advertise on Reddit, I failed because they could not accept CC payments from mainland Europe ... Just think of all the ad revenue they are losing.

Advertising on Reddit is not the same as advertising in general.

If you advertise on Reddit, you're advertising to a violently anti-corporate anti-advertising audience, who may love you, but very well may hate you. You could be subject to a witch hunt at the drop of a hat.

I very much doubt advertisers would be lining up to advertise to that crowd. They're hardly big spenders either.

This is an unsubstantiated claim. Some of reddit's communities are like this, but most are not. If you're just viewing the front page, you are viewing the lowest common denominator, which could give you this impression. But reddit is a very heterogeneous community.

That doesn't mean you can't learn from it though.
