Hacker News new | past | comments | ask | show | jobs | submit login
Keeping Instagram up with over a million new users in twelve hours (instagram-engineering.tumblr.com)
279 points by mikeyk on April 5, 2012 | hide | past | favorite | 53 comments

What kind of instances are you guys running for Redis/memcached? I am a bit surprised on the numbers here, but to be fair I don't do much in the virtualization world. With low cpu overhead, it sounds like you might be saturating the number of interrupts on the network card if it's not a bandwidth issue. Memcache can usually push 100-300k/s on an 8-core Westmere (could go higher if you removed the big lock). Redis on the other hand with pinned processes to each physical core can do about 500,000/s. We (Twitter) saw saturation around 100,000~ on CPU0, what tipped us off was ksoftirq spinning at 100%. If you have a modern server and network card, just pin each IRQ for every TX/RX queue to an individual physical core.

Those are really useful numbers--I think a lot of it can be chalked up to virtualization, but we should definitely explore more around IRQ pinning for queues. Any good starting points / reading, are you mostly using taskset?

Taskset is fine for the process pinning. Don't forget about hyperthreading, you want to try to keep each thread on each hardware thread. IRQ pinning, see an example script I have:


This will set queues 0-7 to specific smp affinity slots.

Why do you guys use both memcache and Redis ? Redis also has LRU cache functionality.

Because in-house we have a custom version of memcache. We rewrote memcache's slab allocator, and for some use cases, is better at memory efficiency than Redis.

A slight tangent, since I saw that instagram are using both Graphite and Munin- Collectd just added a plugin to send metrics to Graphite. You might want to try it for tracking your machine stats over time.

http://collectd.org/wiki/index.php/Plugin:Write_Graphite http://collectd.org/

Along the same lines, are you doing anything special with munin to make it fast? We've had performance issues with the RRDs and graph generation that led us to pipe metrics to graphite with collectd.

We've had to split munin across three masters (by machine role) because the graphing job was just locking on IO. Munin 2.0 moved over to all-dynamic CGI graphing, but I haven't gotten the chance to play with it yet.

Isn't there a risk with EBS snapshots that the snapshot of a live instance could have been taken while your db engine was in the middle of a transaction and leave the data in the newly spun instance in an inconsistent state?

Is it that EBS snapshots are engineered to prevent this? Or just that it's not likely to happen in practice?

Yes, there is--we take all of our snapshots from a slave, and we stop the slave before taking a snapshot, then XFS-freeze all drives, then take the snapshot, to ensure it's consistent.

Are EBS snapshots not block-level atomic? In theory you should get a PITR image without stopping anything, assuming that:

1) The file system correctly orders or journals operations (I'm not familiar with XFS, but this is the case with FFS2/FreeBSD, ZFS, ext3/4 journaling, etc).

2) The database system correctly orders or journals operations, and properly fsync(s) to disk (which postgreSQL does)

Of course, there's no harm to an abundance of caution with something like this.

They are, but we software-RAID our EBS drives to get better write throughput, and we put the Write-Ahead Logs (WALs) on a different RAID from the main database, so when you have both of those going on, you need something else to atomically snapshot our PG databases.

Congratulations. Really impressive how solid you handled the Android onslaught.

We use statsd, graphite, redis and node as well. You might be interested some of my projects relating to these:

https://github.com/gflarity/nervous https://github.com/gflarity/response https://github.com/gflarity/qdis

Why use Graphite instead of Ganglia? Ganglia uses RRDs. It's been around forever, it's fairly low on resource use, it's fast, and you can generate custom graphs like with Graphite. I actually ended up doing some graphs with google charts and ganglia last time I messed with it. (Also, nobody has really simple tools to tell you which of your 3,000 cluster nodes has red flags in real time and spit them into a fire-fighting irc channel so we had to write those ourselves in python)

"Takeaway: if read capacity is likely to be a concern, bringing up read-slaves ahead of time and getting them in rotation is ideal"

Sorry but this is not 'ideal', this is Capacity Planning 101. If you're launching a new product which you expect to be very popular, take your peak traffic and double or quadruple it and build out infrastructure to handle it ahead of time. I thought this was the whole point of the "cloud"? Add a metric shit-ton of resources for a planned peak and dial it down after.

Sorry but this is not 'ideal', this is Capacity Planning 101. If you're launching a new product which you expect to be very popular, take your peak traffic and double or quadruple it and build out infrastructure to handle it ahead of time. I thought this was the whole point of the "cloud"? Add a metric shit-ton of resources for a planned peak and dial it down after.

Paul is nice so we are nice.

Last time I checked, I haven't built a service with +20mm users. I Googled you. I don't think you have built a service with +20mm users.

Programming is hard. Scaling is harder.

Let's have some empathy here. I bet the Instagram team has parents and siblings and significant others and friends that they haven't seen in a while. I bet they have responsibilities that they have neglected to keep the service up. I'd rather not poop on their head when they are trying to scale their service by millions of users.

This stuff is hard. Leaving a comment on a news aggregation service is easy.

I'm sorry that my comments come off as harsh, but the original line struck me as so completely basic it's like something you would tell someone who had never worked in IT. They clarified later that they had tried to plan ahead but came up a little short, which I can understand; no estimation is perfect.

I have no idea how many users Sportsline had but it was a bunch. Peaks of 64k hits per second on the dynamic layer, up to 8 gigabits sustained traffic in one datacenter... it was pretty ugly on firefighting days. I don't mean to poop on them, but if they're as big as they seem to be I hold them to a higher standard than a 6 month old start-up fresh out of college.

I agree it's hard. The fact that they were able to handle the traffic they did with only a small amount of downtime is a testament to the fact that they did have their shit together (as well they should with the number of users they had already).

Cloud is also (more so hopefully) about, dynamically ramping up based on actual usage. Vs making guesses about future capacity needs. Cloud is to capacity planning as agile is to waterfall.

Re RRD, have you read about graphite? "Graphite originally did use RRD for storage until fundamental limitations arose that required a new storage engine."

Dynamically ramping is nice until your site explodes and you need 20 minutes to get more capacity. Versus just pre-allocating it and not going down. Call me crazy, some people don't like to be down for 20 minutes.

Hmm, didn't know that. Too bad they didn't just extend RRD. Did they say what the limitations were? I see a note about high volume causing lots of writes and implementing caching to deal with it, but that can be dealt with via tuned filesystem parameters...

Ah, I found the page: http://graphite.wikidot.com/whisper RRD can be tuned to ignore 'irregular' data points, or include them all. The timestamp issue can be a problem but there are methods to deal with order of updates (like take them via tcp, or rrd merge tools).

If you have a lot of RAM to spare, an excellent hack is putting your RRDs with the highest amount of writes in a tmpfs volume and rsync'ing them regularly (it's insanely fast, trust me). More on tuning RRD: http://oss.oetiker.ch/rrdtool-trac/wiki/TuningRRD http://oss.oetiker.ch/rrdtool/doc/rrdcached.en.html http://sourceforge.net/apps/trac/ganglia/wiki/rrdcached_inte... http://community.zenoss.org/docs/DOC-4696

To clarify: we did do a fair amount of capacity planning and elastic ramp-up/pre-allocating of our infrastructure, but no prediction is perfect, so the blog was about diagnosing and addressing issues as they crop up under unprecedented load.


Question about quality insta-photos on Android.

I have JPG from SGS2 - http://kia4sale.narod.ru/insta/01.jpg

This is http://kia4sale.narod.ru/insta/02.jpg instaphoto (Earlybird) from Android version

This is http://distilleryimage9.instagram.com/662ade7483ce11e19e4a12... - instaphoto from SGS2 JPG but on iPhone 4.

Question: why instaphoto on Android version in blurry?


What OS are you deploying on EC2?

We use Ubuntu, running Natty Narwhal. Every Ubuntu version before that had some (unique) bugs that would bite us under load. Natty has been by far the most stable.

Thanks for OpenSourcing Node2dm. I think I'll take that for a spin this weekend.

Im curious to know what kind of EC2 instance they are running the master Postgresql on and if they've had any write bottle necks. Im using Postgres for an app, and am worried about running into write issues.

What sort of hosting do you use for your main Pg (and Redis) instances?

They use Amazon EC2 there's some more info here: http://instagram-engineering.tumblr.com/post/12202313862/sto...

From the mentions of EBS snapshots in the article, I'm going to guess they're all on Amazon EC2.

PGFouine is nice, but it needs a major do-over. It would be good written with a plpgsql backend running against database loaded csv log files, so that it could handle huge logs, unlike now.

I am curious to find out why there was a need to develop your own C2DM server - what was lacking in Google's C2DM server? I am a C2DM newbie so pardon my ignorance.

It looks like you guys use Redis for a lot of different functionality. It would be great to see an article on how you guys use Redis.

> We use the counters to track everything from number of signups per second.

Per second... It must be quite a moment when you reach this point.

To put it in perspective : there are less than 5 human births per seconds.

So if they want to keep their counter greater than that for a long time, they'll probably have to extend their market beyond humans.

Very interesting read, but doesn't New Relic do all these things for you? Maybe it's not possible to use with their setup?

I'm interested in comparing statsd to a commercial product like New Relic as well.


Statsd and NewRelic are very different.

NewRelic gives you mainly a predefined set of metrics, where you just have to install the agent to get them. Then. There's an additional module where you can send your own set of metrics and display them.

Statsd on the contrary is 'only' a tool to collect and then display metrics. You have to define everything you eant to measure yourself (or use plugins to your app).

So these two are definitely related, but better used for different (although overlapping) jobs.

Does anyone has some info on the architecture required to maintain a service like this? Servers, db, etc?


are you guys sharding redis? or does it all fit in a single machine?

We have a variety of Redis machines, some of them are in a consistent hash ring (the ones we're using for caching); some are using modulo-based hashing (the ones where losing data on adding more machines isn't an option), and some are just single-node installs.

How do you handle write replication? I haven't found any good document on how to do this (meaning, if the Redis master goes down, a slave should be promoted to master immediately).

You can hypothetically use something like hearbeatd to do it; we run every Redis master with an attached slave and manually failover for now.

For a small team like ours, we prefer solutions that are easy to reason about and get back into a healthy state (it would take one server deploy to point all appservers at new Redis master), rather than fully automated failover and the "fun" split-brain issues that ensue. Of course that may change as we build out our Ops team, etc.

Great stuff, love node2dm, and didn't know about statsd + graphite.

Thanks! This was my first node.js project, would love any feedback on it.

Also, any reason you guys are not using MixPanel for storing events as well, besides of the costs?

Thought about adding a tool like newrelic.com to your toolset?

NewRelic's product screenshots look really nice, but 1) the hosted nature and 2) the price were turnoffs. We might revisit it at some point, though--there are a few features in there that would be awesome to have.

I think New Relic saves us money - the amount of optimizations we've done because of them is just crazy and by now would have saved us a couple servers and hardware upgrades at least.

Their sales dudes like startups, we were only lightly funded when we started using them - garth@newrelic could probably help you.

Had the same first impression, but compared to rolling your own solution it's still a very good ROI. And now I'm getting more and more servers and projects on it, because it's just so handy.

(Btw not connected in any way, jsut a very happy user that convinced enough companies to try them, that I think you would be a good match, too)

What percentage of processing power is spent on making me look like a hipster?

Pretty sure that all happens on the client (your phone).

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact