So, that was a bummer (foursquare.com)
189 points by abraham on Oct 5, 2010 | 92 comments



Interestingly, the cause of the most severe part of their downtime still seems to be unknown to them:

As a next step, we introduced a new shard, intending to move some of the data from the overloaded shard to this new one.

We wanted to move this data in the background while the site remained up. For reasons that are not entirely clear to us right now, though, the addition of this shard caused the entire site to go down.

To those who use MongoDB - does this sound like something that might have been caused by MongoDB itself, or Foursquare's use of it?


Without knowing more about their schema, it's hard to speculate. However, let's assume they had a single, large "checkins" collection, sharded by user ID as they said. That would be distributed across the shards. If the mongod process is down on one shard and therefore the shard isn't responding, the data just wouldn't be available.
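For reference, with MongoDB's built-in sharding a layout like that would be declared roughly as follows (a pymongo sketch; the database, collection, and host names are assumptions on my part):

    from pymongo import MongoClient

    # Connect to a mongos query router (hostname is hypothetical).
    client = MongoClient("mongodb://mongos.example.com:27017")

    # Enable sharding on the database, then shard the checkins
    # collection on user_id, as the post describes.
    client.admin.command("enableSharding", "foursquare")
    client.admin.command("shardCollection", "foursquare.checkins",
                         key={"user_id": 1})

From there mongos routes each query to whichever shard owns that user's range of the key space.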

But it sounds like the shard was still up. As per the sharding FAQ[1]:

"What if a shard is down or slow and I do a query? If a shard is down, the query will return an error. If a shard is responding slowly, mongos will wait for it. You won't get partial results."

They said they had a performance issue on the overloaded shard so perhaps it's possible that mongo believed the shard to be up, when it was instead overloaded. This meant any queries just waited, taking down the whole site.

As I said, this is speculation and as a massive production user of MongoDB ourselves[2], I'm interested to know more.

[1] http://www.mongodb.org/display/DOCS/Sharding+FAQ [2] http://blog.boxedice.com/2010/02/28/notes-from-a-production-...
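As an aside, one client-side mitigation for that "queries just wait" failure mode is to bound how long the application will block on any single query. This is only a sketch under assumed names and hostnames, not what foursquare actually does:

    from pymongo import MongoClient
    from pymongo.errors import AutoReconnect

    # socketTimeoutMS caps how long one operation waits on the wire, so a
    # shard that is technically "up" but hopelessly overloaded fails fast
    # instead of hanging every request that touches it.
    client = MongoClient("mongodb://mongos.example.com:27017",
                         connectTimeoutMS=1000,
                         socketTimeoutMS=2000)
    db = client["foursquare"]

    def recent_checkins(user_id):
        try:
            return list(db.checkins.find({"user_id": user_id}).limit(20))
        except AutoReconnect:
            # Degrade gracefully (cached data, an apology page) rather than
            # letting requests pile up behind the slow shard.
            return []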


"What if a shard is down or slow and I do a query? If a shard is down, the query will return an error. If a shard is responding slowly, mongos will wait for it. You won't get partial results."

That's an approach to CAP that chooses consistency over availability. I wonder why, since it isn't a bank and they could have chosen availability instead.


If in doubt, choose consistency.


But it wasn't the overloaded shard that brought the site down (that was only slowing things down); it was the addition of the new shard:

"For reasons that are not entirely clear to us right now, though, the addition of this shard caused the entire site to go down."


If the data was being moved, it could cause even more load on an already overloaded shard.


I know nothing about MongoDB but why would one shard that's responding slowly affect queries to all the other shards and take the whole site down? I can't believe any database system would be designed to work like that.


> At 6:30pm EST, we determined the most effective course of action was to re-index the shard, which would address the memory fragmentation and usage issues.

Sounds similar to some of the problems we're facing with MongoDB too. Indexes fragment a lot, especially if you delete documents frequently. You have to reindex everything eventually.

MongoDB offers background indexing (except for _id fields!) but it's not really an option if you need the indexes at all times to keep executing queries. Also, you'll need to reindex _id eventually and there's currently no way to do that without blocking the server.
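For the curious, the two operations being contrasted look something like this (pymongo sketch; the venue_id index is just an example field):

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://db.example.com:27017")
    db = client["foursquare"]

    # Secondary indexes can be built in the background, so reads and
    # writes keep flowing while the index is constructed.
    db.checkins.create_index([("venue_id", ASCENDING)], background=True)

    # Rebuilding every index on the collection (including _id) with
    # reIndex blocks the node for the duration, which is the problem
    # described above.
    db.command("reIndex", "checkins")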


I'm also wondering why this kind of intervention is necessary at all. The NoSQL solution we use at work has load-based automatic splitting, and I'd have thought (though I haven't confirmed) that this would be an obvious feature to include.


I would speculate that it's a poorly-chosen shard key. MongoDB's built-in sharding uses range-based indexing. If you choose user_id as your shard key, and those are autoincrementing integers, then you're screwed if newer users tend to be more active on average than older ones.
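Since the built-in sharding doesn't hash the key for you, one workaround is to hash the user id in the application and shard on the hashed value instead. A sketch, with made-up field names:

    import hashlib

    def user_shard_key(user_id):
        # SHA-1 of the autoincrementing id spreads new (and typically more
        # active) users evenly across the key range instead of piling them
        # onto the newest chunk.
        return hashlib.sha1(str(user_id).encode("utf-8")).hexdigest()

    checkin = {
        "user_id": 4242,
        "user_key": user_shard_key(4242),  # the collection is sharded on this field
        "venue_id": "abc123",
    }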


Wait, people shard on a key other than something approximately random, like an sha1 hash!?


Where I work (Etsy) we keep an index server that maps each user to a shard on an individual basis. There are a number of advantages to it. For example, if one user generated a ton of activity they could in theory be moved to their own server. Approximately random works for the initial assignment.

Flickr works the same way (not by coincidence, since we have several former Flickr engineers on staff).
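A minimal sketch of that directory approach (Etsy's real setup is surely different; the collection and shard names here are invented, and I'm using a MongoDB collection as the lookup store purely for illustration):

    import random

    SHARDS = ["shard-01", "shard-02", "shard-03"]  # invented shard names

    def shard_for_user(index_db, user_id):
        # The index server is just a lookup table: user -> shard.
        row = index_db.user_shard_map.find_one({"_id": user_id})
        if row is None:
            # Approximately random placement the first time we see the user.
            row = {"_id": user_id, "shard": random.choice(SHARDS)}
            index_db.user_shard_map.insert_one(row)
        return row["shard"]

    def move_user(index_db, user_id, new_shard):
        # A single hot user can be re-pointed at a dedicated shard
        # once their data has been copied over.
        index_db.user_shard_map.update_one({"_id": user_id},
                                           {"$set": {"shard": new_shard}})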


Sounds like a good system. I've noticed that people tend to do things like shard based on even/odd, and then they realize that they need three databases.

I've never had either problem though... but if I ever need to shard I plan on doing it based on object ID. Then one request can be handled by multiple databases, "for free", improving both throughput and response time.


Even/odd isn't the end of the world, but you would then be best jumping to mod 4.


Actually for anonymous sharding (without a central index) a consistent hash is about the closest you can get to ideal distribution and flexibility. I haven't looked but I presume that's what mongo uses under the hood for their auto-sharding, too.
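For what it's worth, MongoDB's auto-sharding is range-based rather than a consistent hash (as noted above), but a minimal ring looks like this:

    import bisect
    import hashlib

    class HashRing:
        # Toy consistent-hash ring: each shard gets many virtual points on
        # a circle, and a key belongs to the first point clockwise from its
        # own hash.

        def __init__(self, shards, replicas=100):
            self._ring = sorted(
                (self._hash(f"{shard}:{i}"), shard)
                for shard in shards for i in range(replicas)
            )
            self._points = [point for point, _ in self._ring]

        @staticmethod
        def _hash(value):
            return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

        def shard_for(self, key):
            idx = bisect.bisect(self._points, self._hash(str(key))) % len(self._points)
            return self._ring[idx][1]

    ring = HashRing(["shard-01", "shard-02", "shard-03"])
    ring.shard_for(4242)  # adding or removing a shard remaps only ~1/N of the keys

Adding a shard only steals keys from its neighbours on the ring, which is what makes rebalancing cheap.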


Load-based splitting is on the MongoDB roadmap, but doesn't exist yet.


Just because the feature exists, doesn't mean it works.


Which NoSQL store do you use?



That looks potentially useful in some cases. Can you send me the code/docs for that?



Does it work that well for production for you? Really, that's extremely interesting.


Also look at Dynamo-based systems like Cassandra and Riak (Riak seems to have better load balancing at the moment; Cassandra is a bit more "bumpy").


> To those who use MongoDB - does this sound like something that might have been caused by MongoDB itself, or Foursquare's use of it?

MongoDB still has problems with locking too much for certain operations. The 10gen guys keep on introducing "yields" to those operations, but the simple (and fast) database/process-level locking could very well cause this... (not a 100% 'yes', but it seems possible)


The site just went down again, awkwardly enough, only moments after they published their post-mortem on yesterday's outage.

Clearly, as their blog post indicates, they were unable to trace the root problem.

To me, the worst feeling in the world as a developer is when there's a major bug in your production site, and you can't figure out exactly why it happened. Then even after you get the site working there's that pit in your stomach of "what if it strikes again?"


Not surprisingly, it is a related problem:

> UPDATE Oct. 5 8:01PM: Our server team is still working to resolve the problem. The issue is related to yesterday’s outage.


I have a similar feeling toward my car's problems. I think it's the general issue of reproducibility that can cause such a feeling. You know the problem is there, but you're not sure of the cause, nor can you reproduce it reliably. Immediately, the issue becomes top priority because you don't know what to expect.


After this past week, I know exactly what you mean...


Well I have nothing constructive to add, but it certainly makes me rethink using MongoDB in any sort of serious production environment. I've been hesitant to use MongoDB for this very reason...what happens when things go wrong? I'm certainly not an expert in how to handle those situations, and there aren't very many of those experts out there. Unfortunately for Foursquare, they got caught with their pants down, not having someone on staff who really understands all the ins & outs of a database technology their entire service depends upon.


You can get commercial support from 10gen (MongoDB developers) as we do for my company. It's just like MySQL providing support for their enterprise DB. Of course MySQL is much more widely used than MongoDB so there is more community help and knowledge.


Right, I'm aware. Not everybody can afford a contract though, nor should they have to in order to avoid major outages like this.

But I would like to know if Foursquare has a commercial support contract with 10Gen. If they didn't, why not, especially for a service that big? If they did, how was it that 10Gen took that long to fix the problem?


They share USV as an investor. I'd be astonished if they didn't.


Heh...that's one hell of a commercial support contract then. Should be interesting to read 10gen's own detailed post-mortem.


If I were 10Gen, it would be in my best interest to offer support to 4sq, contract or not. People (like us) are watching.


I'm sure that if you're the size of Foursquare, you can afford the contract.

Then again, it's no excuse for bad software. I've only used Mongo on small sites so far, and have been loving it.


It's all fun and games until someone gets a shard in the eye.


I'm sure it was diagnosed quickly. Sometimes you have to copy data, rebuild indexes, etc.


According to that blog post it hasn't actually been properly diagnosed yet. They see the symptoms but they don't know why they're seeing those symptoms.


Honestly though, would MySQL or PostgreSQL really have helped out in this situation? Sharded or not, there's really not much one can do once a server (or set of replicating servers comprising the shard) starts to become overloaded. Increasing the capacity of the shard by adding more hardware will induce a significant amount of load by itself. Of course, that's just one piece of the puzzle. We still don't know what actually brought the site down completely; hopefully they'll be able to track it down and fill us in on that.


Here's what I do in these situations (I'm an Oracle DBA, but this should apply to most loaded shards):

1) Use connection pooling at the application layer to prevent overloading the DB on any specific shard. This means that if a shard has 16 CPUs, having 16 connections sounds reasonable. Additional connections will not give you more performance. This means you need to queue and throttle requests at the application layer, and with some thought you can probably figure out what to do with the waiting users - show partial results? show a nice whale? A "loading please wait" sign? (A rough sketch of such a throttle appears after #2.)

2) If you didn't do #1 and the DB is getting overloaded, my normal response is to start shooting down connections. Oracle has a separate unix process per connection. MySQL has its own way of shooting connections down. Put up a small script that will kill the right percentage of sessions to prevent overload on shared resources. This will generate lots of errors and will cause a percentage of the users to hate you, but you won't be down.
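A rough sketch of the throttle from #1 (limits and names invented; in practice this would live in whatever pool or middleware your app framework already has):

    import threading
    from contextlib import contextmanager

    class ShardBusy(Exception):
        pass

    class ShardThrottle:
        # Cap concurrent DB work for a shard at roughly its core count;
        # anything beyond that waits briefly and is then shed instead of
        # piling onto the database.

        def __init__(self, max_concurrent=16):
            self._slots = threading.BoundedSemaphore(max_concurrent)

        @contextmanager
        def slot(self, wait=0.5):
            if not self._slots.acquire(timeout=wait):
                raise ShardBusy("shard at capacity, shed this request")
            try:
                yield
            finally:
                self._slots.release()

    throttle = ShardThrottle(max_concurrent=16)

    def handle_request(run_query):
        try:
            with throttle.slot():
                return run_query()
        except ShardBusy:
            return "Loading, please wait..."  # partial results / whale page goes here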


1) Application connection pooling won't scale. In a scenario like FourSquare, there is likely a 4:1 ratio of app servers to DB shard servers. Further, connections don't necessarily equal load.

2) This sounds like a great way to create data inconsistencies, unless you've got very tight constraints on your database, which is impossible in a sharded scenario.

I agree though, that ultimately they should have had some way to "fail whale" instead of getting overloaded.


The whole point of MongoDB is that you SHOULDN'T have to worry about a server becoming overloaded! You're giving up a lot for this privilege too, so if that doesn't even work properly then... back to PostgreSQL in my opinion.


MongoDB is designed to help ease the pain of scaling, but what you're asking for is magic. If all of a sudden a large number of requests starts coming through and overloads a shard (think very popular users, like if several very popular users ended up colocated on a shard), how is MongoDB going to anticipate this? Any kind of shard scaling will only work well in scenarios where you have a reasonable increase in load, not a crippling surge.

In addition, it's very, VERY difficult to scale writes against a single object, such as Justin Bieber's profile data, say, if you've got a view counter on it. You can either serialize writes on read like Cassandra does, which has its own drawbacks (the more writers an object has, the more expensive reads become), or you can have single-master-for-an-object sharding like MongoDB employs and most other production sites (Facebook, Flickr, etc) use.
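One common mitigation for a hot counter (not something from the post) is to spread it across several sub-documents and sum them on read; a pymongo sketch with invented names:

    import random
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos.example.com:27017")
    db = client["exampledb"]
    N_SLOTS = 16  # sub-counters per hot object

    def bump_views(profile_id):
        # Each increment lands on a random slot, so no single document
        # becomes the write hot spot.
        db.counters.update_one(
            {"profile": profile_id, "slot": random.randrange(N_SLOTS)},
            {"$inc": {"views": 1}},
            upsert=True,
        )

    def total_views(profile_id):
        # Reads pay the cost instead: sum across the slots.
        return sum(doc.get("views", 0)
                   for doc in db.counters.find({"profile": profile_id}))

It's the same trade-off described above, just pushed into the application: cheaper writes, slightly more expensive reads.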


How is this primarily an issue of overload? One particular shard got overloaded, yes, but the real issue is how that brought the whole system down.


There was no sudden crippling surge in this case as far as I can tell. There was no mass-updating of a single object either. It failed even though it fits your ideal situation pretty much perfectly.


This posting in the MongoDB mailing list provides more detail from the developers:

http://groups.google.com/group/mongodb-user/browse_thread/th...


Sounds like a side effect of relying on MMAP (and not doing compaction).

"Basically, the issue is that if data migrates to a new shard, there is no re-compaction yet in the old shard of the old collection. So there could be small empty spots throughout it which were migrated out, and if the objects are small, there is no effective improvement in RAM caching immediately after the migration." - Dwight Merriman (at the link in the parent).

"The kernel is able to swap/load 4k pages. For a page to be idle from the point of view of the kernel and its LRU algorithm, what is needed is that there are no memory accesses in the whole page for some time."

-antirez from http://antirez.com/post/what-is-wrong-with-2006-programming....
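In practice, on MMAP-era MongoDB the way to reclaim those holes is an offline rewrite of the data files, roughly like this (hostname assumed; repairDatabase blocks the node while it runs, so you'd do it on a secondary or during a maintenance window):

    from pymongo import MongoClient

    # Connect directly to the mongod on the old shard, not through mongos.
    shard = MongoClient("mongodb://old-shard.example.com:27017")

    # serverStatus()["mem"] gives a rough picture of resident vs. mapped
    # memory, i.e. how much of the mapped files is actually hot in RAM.
    mem = shard.admin.command("serverStatus")["mem"]
    print(mem.get("resident"), mem.get("mapped"))

    # Rewriting the database files compacts out the space freed by migrated
    # chunks; on these versions that means a blocking repair of the database.
    shard["foursquare"].command("repairDatabase")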


> we noticed that one of these shards was performing poorly because a disproportionate share of check-ins were being written to it.

I'd love to know the root cause behind this specific issue. Was this a behavioral issue within the user base, or a technical problem that routed check-ins to this specific shard more than others?

Since they mentioned they partition their shards by userId, that would probably rule out their routing process. I wonder if there was some event that caused a certain sharded subsection of users to start sending so many checkins?

And since this was a subset of userIds on the same shard - could this have been a targeted DoS or spam event?

I'm making a very conscious migration to MongoDB so I'm very interested to hear what the root cause of this was.


Ironically, the servers are being "upgraded" again, as we speak.


I notice foursquare.com resolves to EC2; was the MongoDB that got overloaded also on EC2? Can you tell us in what way it got "overloaded" (iowait, mem/swap, raw CPU)?


"but... MongoDB is web scale."

--- commenter



Having to manually shard your data seems like so much work when offerings like App Engine will take care of that for you. It seems like exactly the kind of thing that you shouldn't have to think about when you're trying to get a business off the ground.

I can see the lock-in concerns with AppEngine, but an AppEngine level of abstraction seems so much more appropriate than manually deploying/configuring an entire infrastructure of proxies, load balancers, web servers, etc. Especially when an error can take down your whole site, like in this example.


It's not manual sharding. You specify a key and MongoDB is supposed to handle everything for you. There are manual operations you can perform if you want (like moving data or splitting) but normally you'd let Mongo handle it all.
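Those manual operations are exposed as admin commands through mongos; a sketch (namespace, key values, and shard name are invented):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos.example.com:27017")  # must target a mongos

    # Manually split the chunk that contains user_id 5,000,000 ...
    client.admin.command("split", "foursquare.checkins",
                         middle={"user_id": 5000000})

    # ... then move the upper half to another shard by hand.
    client.admin.command("moveChunk", "foursquare.checkins",
                         find={"user_id": 5000001},
                         to="shard0001")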


I can't imagine any co with that much investment money betting on a platform that they can never take ownership of.


What I don't understand is this: Why do companies (like Foursquare, Twitter, etc.) wait until after their first multi-hour crash before instituting a "this is how we'll communicate to users when we have downtime" process? I would assume that everyone has learned from Twitter's historical mistakes at this point. I would argue that startups -- especially startups that deal with large numbers of transactions per day -- should start with code and policies for communicating downtime issues first and launch the product second.


It's not part of a minimum viable product. When you're trying to get something out the door, your downtime communication is not your top priority, so you can afford to improve it later.


aka "cross that bridge when you come to it"


Agreed. And they did use their Twitter accounts to communicate the problem/downtime (which even MVPs should have & use for status communications). You can be sure their status.foursquare.com blog won't have any more information than their Twitter accounts when they have unexpected downtime in the future.


From experience, I'd say it was on their roadmap; maybe there was even a ticket about having a status.foursquare.com.

But you know there are tons of tickets, tons of priorities, and finally shit happens and some tasks are placed on top of the pile and become priorities.


Well, that's another example for the "dangers of making untested changes to your production environment" pile. Of course, people rarely feel the need to post when everything works out...

Sucks that such a popular service had such trouble. I look forward to reading any additional posts they write explaining in more detail exactly what happened.


Firstly, as a free user you really shouldn't be asking for HA. (I'm assuming their paid customers are kicking some butt as I speak... or maybe not... the site is back up :)).

However, as a business you really want to give ALL your customers HA. It's not just a reputation thing, it's a "we love you all equally" kinda attitude.

As for MongoDB, we've been using it in production for small, insignificant things. FWIW, they have replication http://www.mongodb.org/display/DOCS/Replication and some cool new features like Replica Sets for failover and redundancy. Maybe they missed a trick?

I think the apology post was totally fair and he did categorically mention ".. This blog post is a bit technical. It has the details of what happened, and what we’re doing to make sure it doesn’t happen again in the future.". They could have dilly-dallied with words and said "we had a technical failure of a data nature" and that would have just been plain stupid. So thanks for the detailed technical write-up and hope there is more to follow.


HA?


High Availability


Thanks


I'm so happy they're back up; here I was thinking the world would come to a screeching halt when people can't check in to places.

Seriously though, are they THAT important?


No, they're not, but they have to dance the dance and say / do the right things if they want to keep their users and investors happy.


I like the NoSQL approach as an option. But we should keep in mind: operationally, these databases/stores are comparatively new, and don't have the years and years of use that would help find and solve problems like this. It reminds me of Ebay's 3-day downtime in 1999 -- based on an Ebay mistake and an Oracle bug. Although Oracle had been around for a while in 1999, OLTP was still new, and, hence the bug.

I'm not blaming a flat-out bug in this case (the cause of the severe part is still unknown?), but it could also be architectural or operator error.


I'm unable to come up with any reference to a 3 day outage regarding Oracle and Ebay in 1999. Can you provide more info on this - I'm very interested to see what happened.

Edit: I found this reference to the 22hr outage that occurred, and I remember this outage, but I don't ever remember it being a 3-day outage.

http://www.internetnews.com/ec-news/article.php/137251/Cost-...


Here's a link to a Forbes article about it. You're right: my memory was distorted. The outage was only 22 hours. (I'm sure it felt like three days to the Ebay admins at the time.)

Fun fact: the "Steve Abatangle" quoted in the article is yours truly, and the author of the piece is Dan Lyons, now AKA Fake Steve Jobs.

http://www.forbes.com/forbes/1999/0726/6402238a.html


I remember it being 3 days. Could be faulty memory.


> in 1999, OLTP was still new, and, hence the bug.

No, OLTP was not at all new in 1999. OLTP probably means something other than what you think it means.


Online Transaction Processing. I meant specifically on the web, where life is more chaotic than in traditional environments. I suppose I could have written "web OLTP". By "new", I mean less than 10 years old.


I think the correct reply is not weaseling but 'Sorry, my entire post was completely wrong'. It happens.


Are you sure that was due to an Oracle bug? I heard of one eBay outage while watching a talk on ZFS that was due to misconfigured SAN volumes that overlapped and caused repeated data corruption in the Oracle DB, but wasn't related to any bug in Oracle.


If there's anyone from Foursquare here, I'm interested in what monitoring you have in place. How long had the shard been poorly performing before you noticed? Do you use anything to monitor MongoDB in particular, or load on servers, anything like that?


For a service that is trying to be 'mainstream' I think their blog post is horrible.

There is no way a 'regular user' is going to understand what a shard is, nor should they care. Ok, they explained sharding in layman's terms but then went on to talk about "reindexing the shard to improve memory fragmentation issues"... Woah, that means nothing to 95% of users.

If you experience downtime and you want your users to be sympathetic then you've got to explain what's going on in terms they will understand. Sure, include a technical explanation at the bottom for those inclined, but not as part of your main body.


How do you think it could have been better? We struggled a lot trying to decide how much technical detail to include. We decided that including more information (even if a lot of our users didn't understand it) was better than "something broke, it took a long time to fix."

Would love to hear suggestions on this topic.


I liked that your post didn't start with "we use MongoDB and there's some problems with it". I haven't used MongoDB on production, and while I'm interested to learn about specific issues, the format of this post gives the reader the opportunity to evaluate your problem from a platform-agnostic perspective first. Instead of having Mongo interfere right away, I can think about how our systems might hit the same issues.

To echo others, I'm interested to read the more in-depth post-mortem.

Good luck!


The technical details were great. And I'm sure other devs can learn from your predicament. But, as a user, you didn't answer the question, "When can I use this again?"

It broke. You fixed it. But, "Can I expect this to work again?" "Reliably?" All I heard was that it was broken.

It sounded a lot like, "something broke, it took a long time to fix."


I liked your blog post and appreciated the level of detail you put into it. I'm always wondering why things went down and what's going on behind the scenes and I'm sure many other users appreciate it too. Those that don't likely won't mind the extra information.


Having more MongoDB-specific technical info might be a call-to-arms for other MongoDBAs who might offer their ideas/thoughts. You could also use that as a hiring/scouting opportunity.

Does 4sq have an engineering blog?


The downvoting on my previous reply is sad, but I probably shouldn't be too surprised. The problem of being way too technical for a mainstream audience is a problem many people on Hacker News seem to have, and so no wonder many would disagree with me.

It's silly for someone on Hacker News to say "well I thought the level of detail was fine" - of course you would, like the rest of us you're a technical geek. The point that seems to be lost is that 95% of FourSquare's userbase ISN'T!

Also FourSquare is one of those startups that, in addition to the YC startups (for obvious reasons I guess), people give a little more favoritism to than perhaps other startups of equal quality/interestingness.

> How do you think it could have been better? We struggled a lot trying to decide how much technical detail to include. We decided that including more information (even if a lot of our users didn't understand it) was better than "something broke, it took a long time to fix." Would love to hear suggestions on this topic.

Well, I'm not suggesting you wrote "something broke, it took a long time to fix" - I'm all for transparency. But if you are going to be transparent you need to communicate at a level at which that transparency can be understood by all of your readers. I'm sorry if some people on Hacker News don't get that.

So ok, here's how I would have written your post (for time's sake I just did the intro - I'd have repeated the technical description after this block of copy):

Yesterday, we experienced a very long downtime. All told, we were down for about 11 hours, which is unacceptably long. It sucked for everyone (including our team – we all check in everyday, too). We know how frustrating this was for all of you because many of you told us how much you’ve come to rely on foursquare when you’re out and about. For the 32 of us working here, that’s quite humbling. We’re really sorry.

Below is an explanation of what happened and what we’re doing to make sure it doesn’t happen again in the future (a more technical explanation for those inclined appears further below).

What happened: As you can imagine we store a huge amount of data from all of your user check-ins. We split that data across many servers as it's obviously far too big to fit onto just one. Starting around 11:00am EST yesterday we noticed that one of these servers was performing poorly because it was receiving an unusually high volume of check-ins. Maybe there was an incredibly popular party that we missed out on! :)

Anyway, after trying various things to improve the performance of that server we decided to try to add another server to take some of the strain off the original overloaded server. We wanted to move this data in the background while the site remained up - however for some reason when we added the new server the entire site did go down. Ouch!

We tried all sorts of things to ease the strain but nothing seemed to work. By around 6:30pm EST (phew, what a day!) we decided to try one final idea, which fortunately worked. Yay!

However it took a further 5 hours to properly test our fix, and so it was only by around 11:30pm EST that we were able to bring the site back up. Don't worry, all of your data remained safe at all times, and that hard-won mayorship is still yours!

...

Anyway, if people disagree that you should always communicate with your customer at a level they understand, then I'd urge you to read http://steveblank.com/2010/04/22/turning-on-your-reality-dis... or http://www.readwriteweb.com/start/2010/05/is-your-startup-to... (pitching to investors, media or customers - it's all the same issues).


While I agree on the technical-level bit, the tone in your sample floats between cavalier and condescending. That's worse than technical overkill IMO.


It would be interesting to know which bits you felt came across that way.

Also keep in mind I tried to edit their original post, as was kinda suggested. I'm not sure I'd have written the post quite in the way that they did - but I tried to work with what I had.


Pretty much every one of the attempts at editorial comedy. People are only reading a post like that if they've been inconvenienced. Comedy is hard to pitch at an angry crowd, so unless you're really good at it, it's probably best not to try.


Interesting. Thanks very much for the feedback. I'm sure you understand it's a hard balance to strike between technical detail and ease of understanding. Will strive to make things a bit more on the "ease of understanding" side next time.

Also considering starting a separate engineering blog where it would probably be appropriate to go into more detail for those that are interested.

-harryh


For what it's worth, I liked the original much better than the proposed replacement. People aren't idiots, if you give them a good explanation they appreciate it even if they don't fully understand all the details or implications. Think of it like going to the doctor - if I have something wrong with me, I want my doctor to explain it to me in a way I can understand, not just tell me that I have something generic wrong.


Maybe you don't need to find a balance. Provide a "more (technical) info" link after the general-user description.


I disagree (but I never downvote). Leaving out technical detail, translating everything into end-user-friendly but meaningless phraseology, is the worst thing you can do. Microsoft has done that in all its consumer products and it's infuriating because users have no way of asking someone more knowledgeable for help.

I think end users have a problem if the one thing they need to know is expressed in a way they cannot understand. What they need to know is when the site is going to be up again and what the likelihood of it happening again is. Once they know that, I don't think they have a problem with added technical detail that's not meant for everyone.


How many "regular users" do you really think would spend the time to read about why they were down? They just know it was down and wanted it back up ASAP. And how would you better explain it to a layman? The servers crashed. We worked on it. It took longer than we wanted. They're back up again and we're working to make sure it doesn't happen again. How do you go further than that without getting technical?


Actually, I think that it wasn't even all that technical. They took their time to explain what shards are, etc, so their intended audience was still power users, at best. Not saying it's good or bad, just saying how I felt when I read the post.




