Why use Graphite instead of Ganglia? Ganglia uses RRDs. It's been around forever, it's fairly low on resource use, it's fast, and you can generate custom graphs like with Graphite. I actually ended up doing some graphs with Google Charts and Ganglia last time I messed with it. (Also, nobody has really simple tools to tell you which of your 3,000 cluster nodes have red flags in real time and spit them into a fire-fighting IRC channel, so we had to write those ourselves in Python.)
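For what it's worth, that kind of tool doesn't need to be fancy. Here's a minimal sketch in the same spirit (not our actual script; the thresholds, node names, and IRC details are all made up): check a few metrics per node and dump anything over a limit into the channel.

    import socket
    import time

    # Made-up thresholds; in practice these would come from whatever Ganglia/Graphite poller you already run.
    THRESHOLDS = {"load_1m": 8.0, "disk_used_pct": 90.0}

    def red_flags(node, metrics):
        """Return human-readable warnings for one node's metrics dict."""
        flags = []
        for key, limit in THRESHOLDS.items():
            value = metrics.get(key)
            if value is not None and value > limit:
                flags.append("%s: %s=%.1f (limit %.1f)" % (node, key, value, limit))
        return flags

    def post_to_irc(lines, server="irc.example.com", port=6667,
                    nick="firebot", channel="#firefight"):
        """Bare-bones IRC client: connect, join, dump the warnings, quit."""
        sock = socket.create_connection((server, port))
        send = lambda msg: sock.sendall((msg + "\r\n").encode())
        send("NICK " + nick)
        send("USER %s 0 * :%s" % (nick, nick))
        time.sleep(2)  # crude: give the server time to finish its welcome burst
        send("JOIN " + channel)
        for line in lines:
            send("PRIVMSG %s :%s" % (channel, line))
        send("QUIT :done")
        sock.close()

    if __name__ == "__main__":
        # Canned sample data standing in for real per-node metrics.
        sample = {"web042": {"load_1m": 23.5, "disk_used_pct": 61.0},
                  "web043": {"load_1m": 1.2, "disk_used_pct": 95.5}}
        warnings = [f for node, m in sample.items() for f in red_flags(node, m)]
        if warnings:
            post_to_irc(warnings)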
"Takeaway: if read capacity is likely to be a concern, bringing up read-slaves ahead of time and getting them in rotation is ideal"
Sorry, but this is not 'ideal'; this is Capacity Planning 101. If you're launching a new product that you expect to be very popular, take your expected peak traffic, double or quadruple it, and build out infrastructure to handle it ahead of time. I thought this was the whole point of the "cloud"? Add a metric shit-ton of resources for a planned peak and dial it down after.
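To make the quoted takeaway concrete, "getting them in rotation" is mostly just routing: writes go to the master, reads round-robin across however many slaves you brought up ahead of time. A toy sketch (the DSNs and the connect function are placeholders, not anyone's real setup):

    import itertools

    class ReadWriteRouter:
        """Send writes to the master; round-robin reads across read-slaves."""

        def __init__(self, connect, master_dsn, slave_dsns):
            self.connect = connect          # e.g. a real DB driver's connect() in practice
            self.master_dsn = master_dsn
            self.slaves = itertools.cycle(slave_dsns) if slave_dsns else None

        def writer(self):
            return self.connect(self.master_dsn)

        def reader(self):
            # Fall back to the master if no slaves are in rotation yet.
            dsn = next(self.slaves) if self.slaves else self.master_dsn
            return self.connect(dsn)

    # Pre-launch, the slave list is sized for the expected peak; dialing it
    # down later is just shrinking this list.
    router = ReadWriteRouter(connect=lambda dsn: dsn,   # stub connect for the example
                             master_dsn="db-master",
                             slave_dsns=["db-slave1", "db-slave2", "db-slave3"])
    print(router.reader())  # db-slave1, then db-slave2 on the next call, ...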
Paul is nice, so we are nice.
Last time I checked, I haven't built a service with +20mm users. I Googled you. I don't think you have built a service with +20mm users.
Programming is hard. Scaling is harder.
Let's have some empathy here. I bet the Instagram team has parents and siblings and significant others and friends that they haven't seen in a while. I bet they have responsibilities they've neglected in order to keep the service up. I'd rather not poop on their heads when they are trying to scale their service by millions of users.
This stuff is hard. Leaving a comment on a news aggregation service is easy.
I'm sorry that my comments come off as harsh, but the original line struck me as so completely basic it's like something you would tell someone who had never worked in IT. They clarified later that they had tried to plan ahead but came up a little short, which I can understand; no estimation is perfect.
I have no idea how many users Sportsline had, but it was a bunch. Peaks of 64k hits per second on the dynamic layer, up to 8 gigabits of sustained traffic in one datacenter... it was pretty ugly on firefighting days. I don't mean to poop on them, but if they're as big as they seem to be, I hold them to a higher standard than a 6-month-old start-up fresh out of college.
I agree it's hard. The fact that they were able to handle the traffic they did with only a small amount of downtime is a testament to the fact that they did have their shit together (as well they should, given the number of users they already had).
Cloud is also (hopefully more so) about dynamically ramping up based on actual usage, versus making guesses about future capacity needs. Cloud is to capacity planning as agile is to waterfall.
Re RRD, have you read about Graphite? "Graphite originally did use RRD for storage until fundamental limitations arose that required a new storage engine."
Dynamically ramping is nice until your site explodes and you need 20 minutes to get more capacity, versus just pre-allocating it and not going down. Call me crazy, but some people don't like to be down for 20 minutes.
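For what it's worth, the two approaches compose: hold a pre-allocated floor through the risky launch window and let reactive scaling take over once real traffic data exists. A toy sizing rule, with all numbers made up and no particular cloud API implied:

    import math

    def desired_capacity(current_rps, rps_per_node, headroom=2.0, prelaunch_floor=0):
        """Scale to observed load times a headroom factor, but never drop
        below a floor that was pre-allocated for the expected launch peak."""
        reactive = int(math.ceil(current_rps * headroom / rps_per_node))
        return max(reactive, prelaunch_floor)

    # Launch week: hold a floor sized for roughly 4x the predicted peak.
    print(desired_capacity(current_rps=12000, rps_per_node=400, prelaunch_floor=120))  # 120
    # After the spike settles: drop the floor and track actual traffic.
    print(desired_capacity(current_rps=3000, rps_per_node=400, prelaunch_floor=0))     # 15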
Hmm, didn't know that. Too bad they didn't just extend RRD. Did they say what the limitations were? I see a note about high volume causing lots of writes and implementing caching to deal with it, but that can be dealt with via tuned filesystem parameters...
Ah, I found the page: http://graphite.wikidot.com/whisper
RRD can be tuned to ignore 'irregular' data points or include them all. The timestamp issue can be a problem, but there are ways to deal with out-of-order updates (like taking them in over TCP, or using RRD merge tools).
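Concretely, the knobs are the DS heartbeat and the RRA xff. A minimal example, with made-up file names and values (assumes the rrdtool CLI is installed):

    import subprocess

    # 60-second step. The heartbeat (third DS field, 600s here) decides how late
    # an update may arrive before the slot is recorded as UNKNOWN; the xff
    # (second RRA field, 0.5) decides how many UNKNOWN primary points an
    # averaged row tolerates. Loosening or tightening these is the
    # "ignore irregular points or include them all" knob.
    subprocess.check_call([
        "rrdtool", "create", "web042_load.rrd",
        "--step", "60",
        "DS:load_1m:GAUGE:600:0:U",
        "RRA:AVERAGE:0.5:1:1440",    # one day of 1-minute averages
        "RRA:AVERAGE:0.5:60:720",    # 30 days of hourly averages
    ])

    # rrdtool itself rejects out-of-order timestamps, which is why people funnel
    # updates through a single writer (rrdcached can listen on a TCP socket) or
    # merge RRD files after the fact.
    subprocess.check_call(["rrdtool", "update", "web042_load.rrd", "N:0.42"])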
To clarify: we did do a fair amount of capacity planning and elastic ramp-up/pre-allocating of our infrastructure, but no prediction is perfect, so the blog was about diagnosing and addressing issues as they crop up under unprecedented load.
"Takeaway: if read capacity is likely to be a concern, bringing up read-slaves ahead of time and getting them in rotation is ideal"
Sorry but this is not 'ideal', this is Capacity Planning 101. If you're launching a new product which you expect to be very popular, take your peak traffic and double or quadruple it and build out infrastructure to handle it ahead of time. I thought this was the whole point of the "cloud"? Add a metric shit-ton of resources for a planned peak and dial it down after.