
So, that was a bummer - abraham
http://blog.foursquare.com/2010/10/05/so-that-was-a-bummer/
======
brown9-2
Interestingly the cause of the most severe part of their downtime seems to
still be unknown to them:

 _As a next step, we introduced a new shard, intending to move some of the
data from the overloaded shard to this new one.

We wanted to move this data in the background while the site remained up. For
reasons that are not entirely clear to us right now, though, the addition of
this shard caused the entire site to go down._

To those who use MongoDB - does this sound like something that might have been
caused by MongoDB itself, or Foursquare's use of it?

~~~
houseabsolute
I'm also wondering why this kind of intervention is necessary at all. The
NoSQL solution we use at work has load based automatic splitting, and I'd have
thought (though I haven't confirmed) that this would be an obvious feature to
include.

~~~
mcfunley
I would speculate that it's a poorly-chosen shard key. MongoDB's built-in
sharding uses range-based indexing. If you choose user_id as your shard key,
and those are autoincrementing integers, then you're screwed if newer users
tend to be more active on average than older ones.
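The hotspot mcfunley describes is easy to see in a toy sketch (this is not MongoDB's actual implementation; the ranges and ids are made up). With range partitioning on an autoincrementing user_id, every recently signed-up user falls in the last chunk:

```python
# Toy range-based sharding on an autoincrementing user_id.
# Each shard owns a contiguous chunk of the id space.
RANGES = [(0, 1000), (1000, 2000), (2000, 3000)]  # hypothetical chunks

def shard_for(user_id):
    """Return the index of the shard whose range covers user_id."""
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= user_id < hi:
            return i
    raise KeyError(user_id)

# Every recently signed-up user has a high id, so if newer users are
# the most active on average, their writes all land on the last shard.
hot_shard_writes = [shard_for(u) for u in range(2500, 2600)]
```

Splitting the hot chunk doesn't fix this: the newest ids keep arriving at the tail of the range, so the write hotspot just follows them.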

~~~
jrockway
Wait, people shard on a key other than something approximately random, like an
sha1 hash!?

~~~
mcfunley
Where I work (Etsy) we keep an index server that maps each user to a shard on
an individual basis. There are a number of advantages to it. For example, if
one user generated a ton of activity they could in theory be moved to their
own server. Approximately random works for the initial assignment.

Flickr works the same way (not by coincidence, since we have several former
Flickr engineers on staff).
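A directory-based scheme like the one described might look roughly like this (a sketch; the names are made up and the in-process dict stands in for a real index server):

```python
import random

# Sketch of directory-based sharding: a central index maps each user
# to a shard individually, with a random choice on first assignment.
SHARDS = ["db1", "db2", "db3", "db4"]
user_to_shard = {}  # in practice this lives on a dedicated index server

def shard_for(user_id):
    """Look up a user's shard, assigning one at random on first sight."""
    if user_id not in user_to_shard:
        user_to_shard[user_id] = random.choice(SHARDS)
    return user_to_shard[user_id]

def move_user(user_id, shard):
    """A hot user can be repointed at a dedicated shard.
    (Copying the user's rows to the new shard is not shown.)"""
    user_to_shard[user_id] = shard
```

The extra lookup buys flexibility: because no formula ties a user to a shard, any single user can be relocated without touching anyone else's mapping.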

~~~
jrockway
Sounds like a good system. I've noticed that people tend to do things like
shard based on even/odd, and then they realize that they need three databases.

I've never had either problem though... but if I ever need to shard I plan on
doing it based on object ID. Then one request can be handled by multiple
databases, "for free", increasing throughput and reducing response time.

~~~
fizx
Even/odd isn't the end of the world, but your best next step is then jumping
to mod 4.
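One reason mod 4 is the natural next step (a sketch, not from the thread): doubling the modulus splits each old shard cleanly in two, since k % 4 is always either (k % 2) or (k % 2) + 2, so data only ever moves from a shard to its "child", never crosswise, and only about half the keys relocate:

```python
# Growing from mod-2 to mod-4 sharding: each old shard splits in two.
def old_shard(k):
    return k % 2

def new_shard(k):
    return k % 4

# Only keys whose new shard differs from the old one have to move.
moved = sum(1 for k in range(10_000) if new_shard(k) != old_shard(k))
```

Jumping from mod 2 to mod 3 instead would reshuffle roughly two thirds of all keys across every shard at once.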

~~~
moe
Actually for anonymous sharding (without a central index) a consistent hash is
about the closest you can get to ideal distribution and flexibility. I haven't
looked, but I presume that's what mongo uses under the hood for their
auto-sharding, too.
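For reference, a minimal consistent-hash ring looks something like this (illustrative only; like moe, I haven't checked what MongoDB actually does under the hood). Nodes and keys hash onto the same circle, and a key belongs to the first node point at or after its hash:

```python
import bisect
import hashlib

def _h(s):
    """Hash a string to a point on the ring."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=64):
        # Each node gets many virtual points to smooth the distribution.
        self._points = sorted((_h(f"{n}:{i}"), n)
                              for n in nodes for i in range(replicas))
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key):
        # First node point clockwise from the key's hash (wrapping around).
        i = bisect.bisect(self._hashes, _h(key)) % len(self._hashes)
        return self._points[i][1]
```

The payoff is exactly the flexibility moe mentions: adding one node to an N-node ring remaps only about 1/N of the keys, all of them onto the new node, while everything else stays put.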

------
waxman
The site just went down again, awkwardly enough, only moments after they
published their post-mortem on yesterday's outage.

Clearly, as their blog post indicates, they were unable to trace the root
cause of the problem.

To me, the worst feeling in the world as a developer is when there's a major
bug in your production site, and you can't figure out exactly why it happened.
Then even after you get the site working there's that pit in your stomach of
"what if it strikes again?"

~~~
jackowayed
Not surprisingly, it is a related problem:

> _UPDATE Oct. 5 8:01PM: Our server team is still working to resolve the
> problem. The issue is related to yesterday’s outage._

------
marclove
Well I have nothing constructive to add, but it certainly makes me rethink
using MongoDB in any sort of serious production environment. I've been
hesitant to use MongoDB for this very reason...what happens when things go
wrong? I'm certainly not an expert in how to handle those situations, and
there aren't very many of those experts out there. Unfortunately for
Foursquare, they got caught with their pants down, not having someone on staff
who really understands all the ins & outs of a database technology their
entire service depends upon.

~~~
rbranson
Honestly though, would MySQL or PostgreSQL really have helped out in this
situation? Sharded or not, there's really not much one can do once a server
(or set of replicating servers comprising the shard) starts to become
overloaded. Increasing the capacity of the shard by adding more hardware will
induce a significant amount of load by itself. Of course, that's just one
piece of the puzzle. We still don't know what actually brought the site down
completely; hopefully they'll be able to trace it down and fill us in on that.

~~~
ora600
Here's what I do in these situations (I'm an Oracle DBA, but this should apply
to most loaded shards):

1) Use connection pooling at the application layer to prevent overloading the
DB on any specific shard. This means that if a shard has 16 CPUs, having 16
connections sounds reasonable; additional connections will not give you more
performance. It also means you need to queue and throttle requests at the
application layer, and with some thought you can probably figure out what to
do with the waiting users - show partial results? show a nice whale? A
"loading, please wait" sign?

2) If you didn't do #1 and the DB is getting overloaded, my normal response is
to start shooting down connections. Oracle has a separate unix process per
connection; MySQL has its own way of shooting connections down. Put up a small
script that will kill the correct percentage of sessions to prevent overload
on shared resources. This will generate lots of errors and will cause a
percentage of the users to hate you, but you won't be down.
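Point 1 could be sketched like this (hypothetical names; a real app would use a proper connection-pool library). The idea is to cap in-flight queries per shard at roughly the shard's CPU count and shed load instead of queueing forever:

```python
import threading

class ShardGate:
    """Cap concurrent queries against one shard; shed excess load."""

    def __init__(self, limit=16):  # e.g. a 16-CPU shard
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, query_fn, timeout=0.5):
        # Wait briefly for a free slot; on timeout, degrade gracefully
        # (partial results, a "loading, please wait" page) rather than
        # piling more work onto an already saturated database.
        if not self._slots.acquire(timeout=timeout):
            raise RuntimeError("shard saturated; shed this request")
        try:
            return query_fn()
        finally:
            self._slots.release()
```

The crucial design choice is failing fast at the application layer: a rejected request costs the database nothing, whereas an extra open connection makes the overload worse.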

~~~
rbranson
1) Application connection pooling won't scale. In a scenario like FourSquare,
there is likely a 4:1 ratio of app servers to DB shard servers. Further,
connections don't necessarily equal load.

2) This sounds like a great way to create data inconsistencies, unless you've
got very tight constraints on your database, which is impossible in a sharded
scenario.

I agree though, that ultimately they should have had some way to "fail whale"
instead of getting overloaded.

------
dmytton
This posting in the MongoDB mailing list provides more detail from the
developers:

[http://groups.google.com/group/mongodb-user/browse_thread/thread/66752f49af68619](http://groups.google.com/group/mongodb-user/browse_thread/thread/66752f49af68619)

~~~
dowskitest
Sounds like a side effect of relying on MMAP (and not doing compaction).

"Basically, the issue is that if data migrates to a new shard, there is no re-
compaction yet in the old shard of the old collection. So there could be small
empty spots throughout it which were migrated out, and if the objects are
small, there is no effective improvement in RAM caching immediately after the
migration." \- Dwight Merriman (at the link in the parent).

"The kernel is able to swap/load 4k pages. For a page to be idle from the
point of view of the kernel and its LRU algorithm, what is needed is that
there are no memory accesses in the whole page for some time."

-antirez from [http://antirez.com/post/what-is-wrong-with-2006-programming....](http://antirez.com/post/what-is-wrong-with-2006-programming.html)
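Dwight's explanation can be made concrete with a toy model (every number here is invented): migrate half of a shard's small objects away at random, then count how many 4KB pages end up with no live data at all. Almost none do, so the kernel can evict almost nothing and the RAM working set barely shrinks:

```python
import random

PAGE = 4096
OBJ = 300            # small objects, so roughly 13 per page
N_OBJS = 100_000

random.seed(1)
live = set(range(N_OBJS))
migrated = set(random.sample(sorted(live), N_OBJS // 2))
live -= migrated

# A page only becomes evictable if no surviving object starts in it.
pages_with_live = {(i * OBJ) // PAGE for i in live}
total_pages = (N_OBJS * OBJ) // PAGE + 1
frac_hot = len(pages_with_live) / total_pages  # stays very close to 1.0
```

With ~13 objects per page, the chance that a random 50% migration empties an entire page is about 0.5^13, which is why re-compaction (rewriting survivors densely) is needed before the cache actually benefits.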

------
blantonl
_we noticed that one of these shards was performing poorly because a
disproportionate share of check-ins were being written to it._

I'd love to know the root cause behind this specific issue. Was this a
behavioral issue within the user base, or a technical problem that routed
check-ins to this specific shard more than others?

Since they mentioned they partition their shards by userId that would probably
rule out their routing process. I wonder if there was some event that caused a
certain sharded subsection of users to start sending so many checkins?

And since this was a subset of userIds on the same shard - could this have
been a targeted DOS or SPAM event?

I'm making a very conscious migration to MongoDB so I'm very interested to
hear what the root cause of this was.

------
lipnitsk
Ironically, the servers are being "upgraded" again, as we speak.

------
cagenut
I notice foursquare.com resolves to EC2 - was the MongoDB server that got
overloaded also on EC2? Can you tell us in what way it got "overloaded"
(iowait, mem/swap, raw-cpu)?

------
trustfundbaby
"but... MongoDB is web scale."

--- commenter

~~~
storm
Indeed. <http://www.xtranormal.com/watch/6995033>

------
haberman
Having to manually shard your data seems like so much work when offerings like
App Engine will take care of that for you. It seems like exactly the kind of
thing that you shouldn't have to think about when you're trying to get a
business off the ground.

I can see the lock-in concerns with AppEngine, but an AppEngine level of
abstraction seems so much more appropriate than manually deploying/configuring
an entire infrastructure of proxies, load balancers, web servers, etc.
Especially when an error can take down your whole site, like in this example.

~~~
dmytton
It's not manual sharding. You specify a key and MongoDB is supposed to handle
everything for you. There are manual operations you can perform if you want
(like moving data or splitting), but normally you'd let Mongo handle it all.

------
smoody
What I don't understand is this: Why do companies (like Foursquare, Twitter,
etc.) wait until after their first multi-hour crash before instituting a "this
is how we'll communicate to users when we have downtime" process? I would
assume that everyone has learned from Twitter's historical mistakes at this
point. I would argue that startups -- especially startups that deal with large
numbers of transactions per day -- should start with code and policies for
communicating downtime issues first and launch the product second.

~~~
ddlatham
It's not part of a minimum viable product. When you're trying to get something
out the door, your downtime communication is not your top priority, so you can
afford to improve it later.

~~~
japherwocky
aka "cross that bridge when you come to it"

------
Sukotto
Well, that's another example for the "dangers of making untested changes to
your production environment" pile. Of course, people rarely feel the need to
post when everything works out...

Sucks that such a popular service had such trouble. I look forward to reading
any additional posts they write explaining in more detail exactly what
happened.

------
ankimal
Firstly, as a free user you really shouldn't be asking for HA. (I'm assuming
their paid customers are kicking some butt as I speak... or maybe not... the
site is back up :)).

However, as a business you really want to give ALL your customers HA. It's not
just a reputation thing, it's a "we love you all equally" kinda attitude.

As for MongoDB, we've been using it in production for small, insignificant things.
FWIW, they have replication <http://www.mongodb.org/display/DOCS/Replication>
and some cool new features like Replica Sets for failover and redundancy.
Maybe they missed a trick?

I think the apology post was totally fair and he did categorically mention "..
This blog post is a bit technical. It has the details of what happened, and
what we’re doing to make sure it doesn’t happen again in the future.". They
could have dilly-dallied with words and said "we had a technical failure of a
data nature" and that would have just been plain stupid. So thanks for
the detailed technical write up and hope there is more to follow.

~~~
dhimes
HA?

~~~
allwein
High Availability

~~~
dhimes
Thanks

------
jscore
I'm so happy they're back up, here I was thinking the world would come to a
screeching halt when people cannot check-in to places.

Seriously though, are they THAT important?

~~~
JabavuAdams
No, they're not, but they have to dance the dance and say / do the right
things if they want to keep their users and investors happy.

------
sabat
I like the NoSQL approach as an option. But we should keep in mind:
operationally, these databases/stores are comparatively new, and don't have
the years and years of use that would help find and solve problems like this.
It reminds me of Ebay's 3-day downtime in 1999 -- based on an Ebay mistake and
an Oracle bug. Although Oracle had been around for a while in 1999, OLTP was
still new, and hence the bug.

I'm not blaming a flat-out bug in this case (the cause of the severe part is
still unknown?), but it could also be architectural or operator error.

~~~
blantonl
I'm unable to come up with any reference to a 3 day outage regarding Oracle
and Ebay in 1999. Can you provide more info on this - I'm very interested to
see what happened.

Edit: I found this reference to the 22-hour outage that occurred, and I
remember this outage, but I don't ever remember it being a 3-day outage.

[http://www.internetnews.com/ec-news/article.php/137251/Cost-Of-eBays-22-Hour-Outage-Put-At-2-Million.htm](http://www.internetnews.com/ec-news/article.php/137251/Cost-Of-eBays-22-Hour-Outage-Put-At-2-Million.htm)

~~~
sabat
Here's a link to a Forbes article about it. You're right: my memory was
distorted. The outage was only 22 hours. (I'm sure it _felt_ like three days
to the Ebay admins at the time.)

Fun fact: the "Steve Abatangle" quoted in the article is yours truly, and the
author of the piece is Dan Lyons, now AKA Fake Steve Jobs.

<http://www.forbes.com/forbes/1999/0726/6402238a.html>

------
bhiggins
If there's anyone from Foursquare here, I'm interested in what monitoring you
have in place. How long had the shard been poorly performing before you
noticed? Do you use anything to monitor MongoDB in particular, or load on
servers, anything like that?

------
dotBen
For a service that is trying to be 'mainstream' I think their blog post is
horrible.

There is no way a 'regular user' is going to understand what a shard is, nor
should they care. Ok, they explained sharding in layman's terms but then went
on to talk about "reindexing the shard to improve memory fragmentation
issues"... Woah, that means nothing to 95% of users.

If you experience downtime and you want your users to be sympathetic, then
you've got to explain what's going on in terms they will understand. Sure,
include a technical explanation at the bottom for those inclined, but not as
part of your main body.

~~~
harryh
How do you think it could have been better? We struggled a lot trying to
decide how much technical detail to include. We decided that including more
information (even if a lot of our users didn't understand it) was better than
"something broke, it took a long time to fix."

Would love to hear suggestions on this topic.

~~~
dotBen
The down-voting on my previous reply is sad, but I probably shouldn't be too
surprised. Being way too technical for a mainstream audience is a problem many
people on Hacker News seem to have, so no wonder many would disagree with me.

It's silly for someone on Hacker News to say "well I thought the level of
detail was fine" - of course you would; like the rest of us you're a technical
geek. The point that seems to be lost is that 95% of FourSquare's userbase
ISN'T!

Also FourSquare is one of those startups that, in addition to the YC startups
_(for obvious reasons I guess)_, people give a little more favoritism to than
perhaps other startups of equal quality/interestingness.

 _How do you think it could have been better? We struggled a lot trying to
decide how much technical detail to include. We decided that including more
information (even if a lot of our users didn't understand it) was better than
"something broke, it took a long time to fix." Would love to hear suggestions
on this topic._

Well, I'm not suggesting you wrote "something broke, it took a long time to
fix" - I'm all for transparency. But if you are going to be transparent you
need to communicate at a level at which that transparency can be understood by
all of your readers. I'm sorry if some people on Hacker News don't get that.

So ok, here's how I would have written your post (for time's sake I just did
the intro - I'd have repeated the technical description after this block of copy):

 _Yesterday, we experienced a very long downtime. All told, we were down for
about 11 hours, which is unacceptably long. It sucked for everyone (including
our team – we all check in everyday, too). We know how frustrating this was
for all of you because many of you told us how much you’ve come to rely on
foursquare when you’re out and about. For the 32 of us working here, that’s
quite humbling. We’re really sorry.

Below is an explanation of what happened and what we’re doing to make sure it
doesn’t happen again in the future (a more technical explanation for those
inclined appears further below)

What happened As you can imagine we store a huge amount of data from all of
your user check-ins. We split that data across many servers as it's obviously
far to big to fit onto just one. Starting around 11:00am EST yesterday we
noticed that one of these servers was performing poorly because it was
receiving an unusually high volume of check-ins. Maybe there was an incredibly
popular party that we missed out on! :)

Anyway, after trying various things to improve the performance of that server
we decided to try to add another server to take some of the strain off the
original overloaded server. We wanted to move this data in the background
while the site remained up - however for some reason when we added the new
server the entire site did go down. Ouch!

We tried all sorts of things to ease the strain but nothing seemed to work. By
around 6:30pm EST (phew, what a day!) we decided to try one final idea, which
fortunately worked. Yay!

However it took a further 5 hours to properly test our fix, and so it was
only by around 11:30pm EST that we were able to bring the site back up. Don't
worry, all of your data remained safe at all times, and that hard-won
mayorship is still yours!

..._

Anyway, if people disagree that you should always communicate with your
customers at a level they understand, then I'd urge you to read
[http://steveblank.com/2010/04/22/turning-on-your-reality-distortion-field/](http://steveblank.com/2010/04/22/turning-on-your-reality-distortion-field/)
or [http://www.readwriteweb.com/start/2010/05/is-your-startup-too-geeky.php](http://www.readwriteweb.com/start/2010/05/is-your-startup-too-geeky.php)
(pitching to investors, media or customers - it's all the same issues).

~~~
harryh
Interesting. Thx very much for the feedback. I'm sure you understand it's a
hard balance to strike between technical detail and ease of understanding.
Will strive to make things a bit more on the "ease of understanding" side next
time.

Also considering starting a separate engineering blog where it would probably
be appropriate to go into more detail for those that are interested.

-harryh

~~~
lemming
For what it's worth, I liked the original much better than the proposed
replacement. People aren't idiots, if you give them a good explanation they
appreciate it even if they don't fully understand all the details or
implications. Think of it like going to the doctor - if I have something wrong
with me, I want my doctor to explain it to me in a way I can understand, not
just tell me that I have something generic wrong.

