

Downtime - inmygarage
http://staff.tumblr.com/post/2127872280/downtime

======
martingordon
Contrast this with DHH's response to today's Campfire outage:
<http://twitter.com/#!/dhh>

Not only is 37signals refunding everyone for the month, DHH is replying to
pretty much everyone personally to tell them so.

~~~
pclark
Tumblr has 1.5 million daily users. I think it might make more sense for them
to write an apologetic blog post and reiterate how they're focussing on not
letting this happen again :)

~~~
zachrose
Also, it's free.

~~~
InclinedPlane
All Tumblr users received a full refund for not just a single month but an
entire year!

------
rdl
I wonder how much of the incident and poor handling of the response is due to
losing their particularly good CTO (Marco) a few months ago. I would be a bit
concerned if I were a tumblr user or investor.

Experience with LiveJournal, Friendster, Twitter, etc. has been that problems
don't just magically fix themselves; absent someone with enough vision to
figure out potential problems and actually solve them in technology and
business process, you're kind of fucked. In the case of Twitter, they had
enough money to buy some excellent talent. In the case of LJ, Brad stepped up
as an amazing hacker (with limited financial resources, and really one of the
first to solve the problem). In the case of Friendster, bad executive
leadership (CEO/board being ineffectual, VP Eng being a tool) killed them.

~~~
thesethings
As a longtime Tumblr user, I can tell you that serious performance problems,
frequent downtime blips, and bugs have been there for a long time, way before Marco left.
I'm not saying they're his fault (he seems awesome), just that they don't
correlate with his departure.

~~~
rdl
Maybe part of the problem is that NYC doesn't have as many people familiar
with running high volume consumer web apps at scale?

They have banking people (who use a largely different technology
stack, budget, and way of working), and people who have worked for smaller
companies, but they don't have a huge number of veterans of large consumer
webapps. Google's NYC office probably is more of a sink than a source for ops
talent, and there isn't a large presence of declining big giants to strip for
engineers like there is in Silicon Valley.

Do people move from SF/SV to NYC to work for established companies? I could
see this if you had personal reasons for wanting to move (being from NYC and
going back, or just wanting to go there), but it doesn't seem like a
compelling option otherwise.

------
thesethings
People who don't use Tumblr will interpret this blog post differently from
people who use Tumblr all the time. If you use + love Tumblr (as I do
<http://news.ycombinator.com/item?id=1973546>) you're really less disappointed
in this specific 24 hour failure (these things happen), and more disappointed
about what's not being talked about. This would have been a great time to
start opening up about some specific chronic issues, including a communication
style that is more close-lipped than Apple's.

Still rooting for Tumblr like crazy, but bummed out :(

~~~
haploid
I guess I don't grok the value added by Tumblr. What exactly do they offer
that is a better value proposition than Wordpress or Posterous?

~~~
rorymarinich
Wordpress and Posterous are designed by geeks. Tumblr is designed by a
nongeek.

First off, Tumblr is not primarily a blog. It's what we call a "tumblelog",
which is a stupid word; I like to call it "vomit" or "spew". The fact that
it's the sleekest blog platform out there is secondary to its larger function
as a spewing platform.

Blogs are for creating. Spew is for recycling, for breaking things down into
small teeny pieces, for streams and streams of things which are essentially
meaningless but contribute to a larger whole. Now, not all blogs are strictly
bloglike either; the Linked List format is somewhat in between the two, though
Linked Lists are usually more disciplined in nature. Another way to draw the
distinction is to say that blogs are for _building_ things, while tumblelogs
are for _shaping_ them. Or you could call blogs classical music and tumblelogs
jazz. One is looser, more freeform, more about the movement than about the
individual notes. The other is about the finished product, or about realizing
a vision.

A defining characteristic of a true geek is that he builds. Doesn't matter
what he builds; what matters is that he appreciates structures. So when we
look at a blogging platform, our needs traditionally tend to be focused on
constructing more elaborate things. They also tend to be relatively solitary
in nature. (Wordpress is; I never had a friend that used Posterous and so I
don't know how they handle following things.)

I like slagging on Posterous because I'm still bothered that they get compared
to Tumblr when they're really entirely different. Tumblr's breakthrough was
its deconstruction of the blog format. You could post an image without a
title. You could post a quote without something to frame it. You could post
ANYTHING without a datestamp, or a "posted by" attribute, because nobody cares
about them, they just care about the flow of content. Posterous has lines of
datestamping, and they don't handle title-less posts. They're all about
traditional title-body posting. They're all about "blogs", about these
elaborate constructions. They aren't broken down like Tumblr.

Tumblr's a fucking awesome engine because it removes everything bloggy about
blogs. It's designed for a steady vomitstream of thoughts and ideas. Not your
thoughts or ideas. Anybody's. And it strips away everything we associate with
blogs, and with everything it strips away it becomes sleeker, lither, more
powerful. No comments means no conversations other than reblogs — and reblogs
are great because first off, they let you improvise off anybody else's stream,
and second off, they make it IMPOSSIBLE to participate without being a
"primary creator" with a flow that other people are following, versus blog
comments where every commenter is attached to one site at a time.

That means it attracts people who don't build things. People who just want to
push out content without worrying about being judged for value. But they want
to push it out, because it is a creative and cathartic act just to release
these ideas out into the world, or to change their flow. Lower barrier to
entry, more ability to interact meaningfully. You can participate without any
skill but still find that people are interested in you.

Now, I use Tumblr primarily as a building tool. I find that its theming system
makes it very easy for me to design custom interfaces for complex publishing
sites, and yet still push through the entire site as a streaming feed to any
Tumblr user who wants it. And it's used by lots of serious designers who
appreciate its versatility. But I'm an edge user. Every HN user here who uses
Tumblr is an edge user. The real users don't post here, because they're not
out to build things, they're out to just express themselves loudly and with
fury.

~~~
DanielRibeiro
Your explanation can also be reflected on this comment from the post: _We've
nearly quadrupled our engineering team this month alone_

Without more information, this can either make people trust Tumblr more in
the future (we went from 3 to 12 engineers) or less (we went from 200 to 800).
I'd worry in the latter case, because it might show that Tumblr is not aware of
Brooks's law (<http://en.wikipedia.org/wiki/Brooks%27s_law>), which I always
find unreassuring.
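
To put rough numbers on the two scenarios above: The Mythical Man-Month points out that the number of pairwise communication paths in a team of n people grows as n(n-1)/2. Here is a minimal Python sketch, assuming the worst case where every engineer may need to coordinate with every other. The team sizes are the hypothetical ones from this comment, not Tumblr's real headcount.

```python
def communication_paths(team_size: int) -> int:
    """Number of distinct pairs in a team of the given size: n * (n - 1) / 2."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


# The two hypothetical "quadrupling" scenarios (assumed sizes, for illustration).
for before, after in [(3, 12), (200, 800)]:
    print(
        f"{before} -> {after} engineers: "
        f"{communication_paths(before)} -> {communication_paths(after)} paths"
    )
```

Going from 3 to 12 people takes you from 3 to 66 possible pairs, which is manageable. Going from 200 to 800 takes you from 19900 to 319600, which is where the coordination overhead starts to dominate.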

------
brianwillis
>We’ve nearly quadrupled our engineering team this month alone...

How do you create and keep a good culture when you're growing that fast?
Anyone have experience with this sort of scenario?

~~~
gregable
Adding more people to a software project often has the effect of making it
take longer, at least for mid to near term milestones. Quadrupling sounds like
the adage of pouring gasoline on a fire.

~~~
chunkbot
I'm surprised that this has so many upvotes. Are you all _net_ this
simplistic?

~~~
StavrosK
It's probably that the upvoters all read the Mythical Man-Month.

~~~
gregable
Yes, my comment is a clear rip off of MMM.

------
datums
Todo:

Move the blog/status site outside your network (linode.com)

Work on a process to try and follow if you have another outage.

    - One person to handle communication (blog post / respond to users)

Work on a faster way to recover from such a failure. Maybe have a read-only
version you can switch to ("maintenance mode")?

Done:

Probably the biggest outage you'll face.

20+ hour outages don't usually happen.

Learning from it. . .
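
For what it's worth, the read-only "maintenance mode" item in the todo list could look roughly like the sketch below. This is only an illustration in Python, not how Tumblr's stack works: `ReadOnlyGate`, `Response`, and `app` are made-up names, and a real site would hang this off its own framework's middleware hook and a shared flag and would not use an in-process attribute.

```python
from dataclasses import dataclass

# HTTP methods that do not modify state and can be served from a read replica.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


@dataclass
class Response:
    status: int
    body: str


class ReadOnlyGate:
    """Wraps a request handler and rejects writes while read_only is set."""

    def __init__(self, handler):
        self.handler = handler
        self.read_only = False  # hypothetical switch; flip it during an outage

    def __call__(self, method: str, path: str) -> Response:
        if self.read_only and method.upper() not in SAFE_METHODS:
            # 503 tells clients the refusal is temporary.
            return Response(503, "Read-only maintenance mode: posting is temporarily disabled.")
        return self.handler(method, path)


def app(method: str, path: str) -> Response:
    """Stand-in for the real application (an assumption for this sketch)."""
    return Response(200, f"{method} {path} handled")


gate = ReadOnlyGate(app)
gate.read_only = True
print(gate("GET", "/dashboard").status)  # reads still work
print(gate("POST", "/new/text").status)  # writes are refused
```

The point is that existing blogs stay readable while the write path is being repaired, instead of the whole site going dark.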

~~~
rdl
eBay had some really long outages, and multiple. I'd say a site which has had
one 20h outage is way more likely to have similarly long outages in the future
-- it demonstrates that neither the technical nor procedural measures are
there to prevent them.

Of course, once you have enough 20h outages, you don't have to worry about the
problem anymore.

------
Xuzz
I understand they were busy, but only two or three updates about progress
during the entire ordeal (on Twitter) seemed a little low to me. Their "we'll
be back shortly" page didn't link to their Twitter page either, so for
many people it was just a black hole for updates about when a surprisingly large
chunk of the web would return.

~~~
donohoe
I've been around when chunks of a large web site went down. Everyone is
scrambling to fix it. While we know communication is worth taking time and
effort, it is also difficult when you're focused on a singular task. You do
get blinders.

That's not to say it's a good excuse, and you certainly learn from it - but it is
understandable.

~~~
zecho
Everyone learns quickly after a major outage (they happen) that they need to
have a game plan moving forward for communicating to users and customers. I
hope tumblr took note of this lesson.

------
alanh
Doesn’t actually explain what happened, or why a database cluster outage means
more than a read-only situation.

I understand this is aimed at all users, but I’m still disappointed.

~~~
samratjp
Don't be, their engineers have better things to patch up at the moment than
write a detailed post. Hopefully they will, though, like foursquare did with
their mongodb outage:
<http://nosql.mypopescu.com/post/1265191137/foursquare-mongodb-outage-post-mortem>

------
hdeshev
The postmortem is pretty weak and we geeks would have loved to see more
detail. Tumblr would have gotten some good karma with a detailed explanation.
Well, unless this is all made up just to look impressive and the real reason
was something else like human error that wiped the production DB. But even in
that case honesty would have paid off - remember GitHub's recent DB wipe and
their excellent explanation?

On a side note - anyone know what DB they are using? The cynic in me is
thinking "Hey, another MongoDB + FourSquare 'success' story of webscale
awesomeness."

~~~
bantic
Agreed. I'd like to see more detail. Cf. facebook's recent post-mortem
(<http://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919>).
Lots of detail. When a downtime blog post is as vague as Tumblr's was, my
first instinct is that there's something they are deliberately not telling
about what went wrong. Or maybe they haven't figured out exactly why it went
down yet. I hope a more detailed postmortem is coming.

------
ojbyrne
Pretty weak compared to other post-mortems.

------
tdoggette
They should have a better ongoing communication method than Twitter in the
event of downtime. I'd suggest getting a separate hosting account with a
reliable third party, and planning on running something that looks like (but
isn't) a normal tumblog, but is instead something very reliable, like editing
an html file on a web server that everyone on the team has shell access to.
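
Something along these lines would do, as a rough Python sketch of the "edit an html file" approach. The function names, the page layout, and the example update are all made up for illustration. The only real requirement is that the output is a plain static file any web server can serve with no database behind it.

```python
import html
from datetime import datetime, timezone
from pathlib import Path


def render_status_page(updates):
    """Render a static HTML page from (timestamp, message) tuples, newest first."""
    items = "\n".join(
        f"    <li><strong>{ts.strftime('%Y-%m-%d %H:%M UTC')}</strong> "
        f"{html.escape(msg)}</li>"
        for ts, msg in updates
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        '<head><meta charset="utf-8"><title>Service status</title></head>\n'
        "<body>\n"
        "  <h1>Service status</h1>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


def publish(updates, path):
    """Write the page to disk; the path is wherever the third-party host serves from."""
    Path(path).write_text(render_status_page(updates), encoding="utf-8")


# Hypothetical update, not a real incident timeline.
example = [(datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc), "Investigating a database issue.")]
print(render_status_page(example))
```

Anyone on the team with shell access can rerun this (or just edit the file by hand), and it keeps working when the main site's database doesn't.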

------
wooster
This points to the necessity of a data storage solution which doesn't involve
waiting for hours upon hours for the caches to warm in order for your service
to be reliable.

------
luckyland
My favorite opinion on the handling of this event is, by far, this one:

<http://twitter.com/b6n/status/11877355945463808>

------
shuri
technical details?

