
MongoDB's lead developer: Foursquare outage post mortem - pc
http://groups.google.com/group/mongodb-user/browse_thread/thread/528a94f287e9d77e
======
there
so, in short, a company relying entirely on cloud computing machines for
storing its data, which is presumably being billed according to the memory
usage of those machines, ran out of memory on them, and suffered a large
amount of downtime as a result. mongodb had little to do with the problem,
other than maybe it took longer than expected to migrate data to a third
server.

i'm baffled at how there could be no monitoring or reporting in place to catch
that days or weeks ahead of time, let alone the mere 12 hours the mongodb
developer says were needed to fix the problem without downtime. it's such a
fundamental thing to keep track of for a system designed entirely around
storing a bunch of data in the memory of those 2(!) machines. i have more
servers than foursquare and none of them do even a small fraction of the
amount of processing that theirs do, and yet i have real-time bandwidth,
memory, cpu, and other stats being collected, logged, and displayed, as well
as nightly jobs that email me various pieces of information. nobody at
foursquare ever even logged into those systems and periodically checked the
memory usage manually?
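
even a dumb cron check would have caught this; a rough sketch of the kind of
thing i mean (linux-specific, threshold and alert hook made up):

    # rough sketch: warn when reclaimable memory drops below a made-up
    # threshold. linux-specific: parses /proc/meminfo (values in kB).
    THRESHOLD = 0.10  # hypothetical: warn when <10% of RAM is reclaimable

    def meminfo():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                info[key] = int(rest.split()[0])
        return info

    m = meminfo()
    reclaimable = m["MemFree"] + m.get("Buffers", 0) + m.get("Cached", 0)
    if reclaimable < THRESHOLD * m["MemTotal"]:
        # wire up mail/pager here however you like
        print("WARNING: only %d of %d kB reclaimable"
              % (reclaimable, m["MemTotal"]))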

worse still, during all of this, the initial outage reports were blaming
mongodb or saying the problem was unknown. even at that point nobody at
foursquare realized that the servers were just out of memory?

how did the developers come up with 66 gigabytes of ram to use for these
instances in the first place? was there some kind of capacity planning to come
up with that number or is it just a hard limit of EC2 and the foursquare
developers maxed out the configuration?

~~~
hartror
Exactly. How the hell does a fast-growing, well-funded 24-employee startup NOT
have load monitoring on their database servers! Pay the 10c an hour for a
micro EC2 instance and run Zabbix or one of the half dozen other awesome
monitoring packages out there.

~~~
heresy
If it's anything like here, any one engineer's task list is essentially a
weighted list of fires to put out and technical debt to pay down, in addition
to the features to be delivered next day / week.

Guess more extensive monitoring just got bumped to the top :)

~~~
hartror
If it is anything like any start up I have ever heard of :P

My point is that it is stupidly important to have good monitoring in place and
I am surprised that one of the best known startups had a gaping hole in
theirs.

------
tmountain
The "give a crap" factor at 10gen is nothing short of amazing. The company I
work at has been using Mongo for a while, and anytime we've had an issue,
Eliot has been right there to help us. Mongo is a great product, but like
anything, it has its limits. Learning those limits and taking the time to plan
your infrastructure is a mandatory part of adopting any technology, and Mongo
is no exception.

On a more technical note, it would be nice to have a way to compact indexes
online without having to resort to doing so on a slave, but Mongo is a minimum
two server product to begin with, so it's not the end of the world. Overall
it's a great datastore, and it's only going to get better in the next few
years.

~~~
cies
> Mongo is a minimum two server product to begin with

are you sure? i thought it runs well on one server and even in some sort of
run-locally setup.

~~~
tmountain
Yes, if you care about your data, it's a two server product. Single server
redundancy isn't being added until 1.8. Until then, if the server terminates
abnormally, there's potential for data corruption. The solution right now is
to run a slave.

<http://www.mongodb.org/display/DOCS/Durability+and+Repair>

~~~
sausagefeet
This solution strikes me as handling only the best-case error scenario. What
happens if the
connection between my slave and master goes down then my master is unplugged?
What happens if both slave and master are unplugged, can both databases be
corrupted? What am I missing?

~~~
tmountain
If the connection goes away and your master is unplugged, you will need to run
a repair on the master and then bring the slave back into sync. If both are
unplugged, you will have to run the repair on both. This obviously isn't
ideal, and that's probably why single server durability is the biggest
priority for the next release.

------
donaldc
_In essence, although we had moved 5% of the data from shard0 to the new third
shard, the data files, in their fragmented state, still needed the same amount
of RAM. This can be explained by the fact that Foursquare check-in documents
are small (around 300 bytes each), so many of them can fit on a 4KB page.
Removing 5% of these just made each page a little more sparse, rather than
removing pages altogether._

Interestingly, this is one of the reasons antirez gives as to why redis will
not be using the built-in OS paging system, but instead will use one custom-
written for redis' needs.

~~~
wheels
I don't think those are the same problem -- could you provide a link?

The problem isn't paging, per se -- the paging system is doing exactly what it
should be doing, paging in blocks off of the disk and into memory as they
become hot, which for this use case is always.

The problem is that you get fragmentation in your pages. If you allocate three
records in a row that are 300 bytes, and then need to rewrite the first one to
make it 400 bytes, or delete it altogether, you end up creating a hole there.

The typical strategy for dealing with those holes is to maintain a list of
free blocks which can be used and hope that the distribution of incoming
allocations neatly fits into unused chunks.

However, as they note, if you have a fully compacted 64 GB active data set and
remove 5% of it you just end up with address space that looks like swiss
cheese; it's not set up to elegantly shrink, but to recycle space as it grows.
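
A toy simulation makes the effect concrete (page and record sizes are from the
post-mortem; everything else is invented):

    import random

    PAGE, REC = 4096, 300
    per_page = PAGE // REC                  # ~13 records fit on a 4KB page
    n_pages = 100000
    n_recs = n_pages * per_page
    # start fully compacted, then delete a uniform 5% sample of the records
    dead = set(random.sample(range(n_recs), n_recs // 20))
    empty = sum(1 for p in range(n_pages)
                if all(p * per_page + i in dead for i in range(per_page)))
    print("pages freed entirely: %d of %d" % (empty, n_pages))
    # prints ~0: you lose 5% of the records but essentially none of the
    # pages, so the resident set (and the RAM requirement) barely moves.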

There are a couple of things that I find a bit odd here though. First, they
should have seen this coming before migrating data; it's pretty obvious from
the architecture. Second, they mention the solution being auto-compaction,
which wouldn't have actually helped them.

Auto-compaction is in fact useful, but all that it does is, well, compact
stuff. It means they would have hit the limits later, but once the threshold
was crossed, they'd have the exact same problem. Auto-compaction is either an
offline process that runs in the background or a side-effect of smarter
allocation algorithms. Both of those things need time once you remove data
from an instance to reclaim the holes in the address / memory / disk space ...
which is exactly what they did manually.

The only really sane ways to handle something like this are notifications at
appropriate levels, or block-aware data removal -- e.g. "give me stuff from
the end of the file". I don't know enough about whether mongo uses
continuation records and the like to say how difficult that would be for them.

(Note: Directed Edge's graph database uses a similar IO scheme, so I'm doing
some projecting of our architecture onto theirs, but I assume that the
problems are very similar.)

~~~
donaldc
_I don't think those are the same problem -- could you provide a link?_

You are correct, the actual problem is paging out LRU keys as opposed to
memory holes. The issue is related but not the same.

From <http://antirez.com/post/what-is-wrong-with-2006-programming.html>:

 _Multiply this for all the keys you have in memory and try visualizing it in
your mind: These are a lot of small objects. What happens is simple to
explain, every single page of 4k will have a mix of many different values. For
a page to be swapped on disk by the OS it requires that all contained objects
should belong to rarely used keys. In practical terms the OS will not be able
to swap a single page at all even if just 10% of the dataset is used._
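
The arithmetic behind that: roughly 13 values of ~300 bytes fit on a 4KB page,
and the OS can only evict a page if every one of them is cold. Assuming hot
keys are spread uniformly over the dataset:

    per_page = 4096 // 300            # ~13 values per page
    for hot in (0.01, 0.05, 0.10, 0.30):
        swappable = (1 - hot) ** per_page
        print("%3.0f%% hot keys -> %4.1f%% of pages fully cold"
              % (hot * 100, swappable * 100))
    # 1% hot  -> ~88% of pages evictable
    # 10% hot -> ~25%
    # 30% hot -> ~1%: nearly the whole dataset is pinned in RAM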

~~~
wheels
What antirez is getting at there is actually a much harder (and more
interesting) problem that could be generalized as something like "efficient
data locality for mixed latency access".

However, assuming that all data must actually be in memory (as stated in the
posted email), you don't actually solve the Mongo problem with more efficient
organization of the data set, though compacting could be considered a sub-
problem of the one that antirez describes.

But as I noted in my earlier comment, compacting wouldn't have actually solved
their problems, it just would have delayed them. It's reasonable to ask if all
of their data truly needs to be hot, but even there, you'd eventually hit
diminishing returns as you approached the threshold where your active set
couldn't fit in memory, and there smarter data organization wouldn't actually
fix things once you started pulling chunks out for sharding; you'd still need
to recompact.

~~~
donaldc
I doubt every last bit of their data needs to be hot (i.e. in memory) at any
given time, but without specialized paging along the lines of what antirez has
discussed for redis, _enough_ of their data probably needs to be hot that,
from a paging perspective, all of the vm pages of their data need to be hot.

------
SemanticFog
I like the way MongoDB describes the flaws in their own system, but places the
blame (four) squarely on the customer, where it belongs.

It wasn't a random failure or a sudden spike that caused the crash -- it was
completely predictable growth. Foursquare had already experienced the problem
once, and they had solved it. All they needed to do was monitor their growth
and iterate that solution.

Sure, Foursquare could have had a better sharding algorithm, but that would
only have put off the crash a bit longer. This is a very basic failure -- not
monitoring a system that you know is steadily growing.

~~~
SriniK
I think the mongodb guys are correct - it is an issue of app architecture and
a failure of app monitoring.

------
redthrowaway
The main thing I took from this incident is how much being open and honest
about issues improves a company's image. Foursquare and 10gen could have very
easily played the blame game, or kept their cards close to their chests, and
both would have come off poorly. Instead, they described the problem, owned up
to their role in it, and laid out a framework for how to avoid the problem in
the future. After reading about the issue, I come away with respect for both
groups. Sure, mistakes were made, but they treated their users like adults and
took responsibility for what they did wrong.

Good job, guys. If only the likes of Apple followed this example.

~~~
megablast
What are you talking about? How can you see how the users see foursquare after
this? Sure, in the hacker community this detail is appreciated, but how do you
know how many foursquare users have left because of these outages? How many
people are upset with the company, and are now taking facebook locations or
the other companies more seriously?

~~~
parfe
Left over this? Foursquare isn't a bank. Foursquare is a game. An achievement
system for work lunches.

Of the people I know who use four square the response was "Hm, foursquare
isn't working" _puts phone away_.

------
jasonwatkinspdx
Building sharded systems isn't as simple as throwing consistent hashing into
the code and calling it a day. You have to think carefully about what happens
when nodes exceed capacity. There is good work in academia on distribution
algorithms that gracefully handle reaching capacity (along with data center
structure such as rack awareness) [1]. Alternately, if your algorithm doesn't
handle a shard reaching capacity you need to have the monitoring and processes
in place to ensure you always add more capacity and rebalance before running
out.

Also, this validates Redis's VM position that 4KB pages are too large to
properly manage data swapping in web application storage.

[1]: <http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf>

~~~
t_crayford
I'm wondering if the decision to shard based on users was taken with data (ie,
at that point did each user have a roughly similar number of check-ins); if
not, hashing by that seems kind of fail for that kind of app.

~~~
rcoder
From the email thread, it sounds like the decision to shard on UID was made
mostly to increase locality of data, so that you didn't have to query more
than one node to get a single user's data.

There's no silver bullet here. Hashing on insertion order would basically
guarantee that writes would favor one node over another, while random hashes
would force you to aggregate results from all available nodes for each query.

~~~
jasonwatkinspdx
This stuff can be very counter intuitive. Locality may not be what you want.

For example, last I heard google's search index was sharded by document rather
than by term.

That sounds odd, since if it was sharded by term, then a given search would
only need to go to a handful of servers (one for each term) and then the
intermediate result combined. But with it sharded by document, every query has
to go to all the nodes in each replica/cluster.

It ends up that's not as bad as it seems. Since everything is in ram, they can
answer "no matches" on a given node extremely quickly (using bloom filters
mostly in on processor cache or the like). They also only send back truly
matching results, rather than intermediates that might be discarded later,
saving cluster bandwidth. Lastly it means their search processing is
independent of the cluster communication, giving them a lot of flexibility to
tweak the code without structural changes to their network, etc.
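
Sketched very roughly (a plain set-backed index stands in for the bloom
filter, and the documents are toy-sized):

    # document-sharded search: each shard indexes its own documents, the
    # query fans out to all shards, and each shard can answer "no matches"
    # cheaply (a dict lookup here; in practice a bloom filter in cache).
    class Shard:
        def __init__(self, docs):            # docs: {doc_id: text}
            self.index = {}
            for doc_id, text in docs.items():
                for term in text.split():
                    self.index.setdefault(term, set()).add(doc_id)

        def search(self, terms):
            if not terms or not all(t in self.index for t in terms):
                return set()                 # the fast "no matches" path
            return set.intersection(*(self.index[t] for t in terms))

    def search_all(shards, query):
        terms = query.split()
        hits = set()
        for shard in shards:                 # scatter to every shard...
            hits |= shard.search(terms)      # ...gather only real matches
        return hits

    shards = [Shard({1: "mongo sharding", 2: "redis paging"}),
              Shard({3: "mongo paging internals"})]
    print(search_all(shards, "mongo paging"))    # -> {3}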

Does that mean everyone doing search should shard the same way? Probably not.
You have to design this stuff carefully and mind the details. Using any given
data store is not a silver bullet.

~~~
hboon
Sharding a search system by documents gives you several advantages. You can
scale horizontally by adding additional indexes for new documents. You can
tweak a group of documents more easily. E.g. rank wikipedia higher. Assign
better hardware (if necessary), higher priority, etc. Performance is also more
uniform. Easier to index content at different frequencies. It's also easier
to re-index content after tweaking algorithms.

If you shard a search index by term instead, you will end up with duplicate
documents stored in each index that contain the same term. And for a large
index, you need to scale up your hardware to handle the term or shard by
document within that term anyway.

------
zaidf
Is it acceptable/preferred to store your entire db in RAM? I have little idea
about large systems but feel like this may be hard to scale if your db grows
to hundreds of TB. I'm intrigued to learn more! Anyone know how fb organizes
its massive db storage?

~~~
harryh
> Is it acceptable/preferred to store your entire db in RAM?

This is actually one of the big long term challenges we're going to have to
deal with @ foursquare. Right now we calculate whether you should be awarded a
badge when you check in by examining your entire checkin history (which means
it needs to be in ram so we can load it fast). While this works now, as we
continue to grow it will become more and more of a problem so we'll have to
switch to another method of calculating how badges are awarded. Several
different options here, each with pluses and minuses.

~~~
fleitz
Why not keep the algorithm and process in batches? You don't need to have
EVERY user's checkin history in RAM at any moment. Throw the checkin on a
queue, have a few processes that query a db for the checkin history and you're
fine. Heck, if you delay the awards it's also a good excuse to throw the user
an alert to come back to the site/app.
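
Roughly like this (all of the names, and the one badge rule, are made up):

    import queue

    checkins = queue.Queue()  # stand-in for a real queue (kestrel, rabbit)

    def handle_checkin(user_id, venue_id):
        # the web tier just enqueues and returns; nothing touches the
        # full history at request time
        checkins.put((user_id, venue_id))

    def badges_earned(history):
        # made-up rule: a badge on every 10th checkin
        return ["regular"] if len(history) % 10 == 0 else []

    def badge_worker(load_history, award):
        # a handful of these drain the queue; history reads can hit disk
        # here, and that's fine -- awards are allowed to lag a little
        while True:
            user_id, venue_id = checkins.get()
            history = load_history(user_id) + [venue_id]
            for badge in badges_earned(history):
                award(user_id, badge)  # good excuse for a come-back alert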

~~~
harryh
That is definitely one of the options we are considering. It (obviously)
involves a change to the product, which we have to think about carefully, but
it certainly could help from a technical standpoint.

~~~
carbocation
It also depends on your algorithms. Some algorithms are amenable to "running
tallies," so a third possible approach would be to store and update various
values based only on the aggregate past data plus the incremental data,
instead of looking back through the entire history and recomputing when a new
piece of data comes in. This of course depends on whether or not this is even
theoretically possible with what you're computing.
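
For instance (the badge rules here are invented purely for illustration):

    # keep a small per-user aggregate and fold each new checkin into it,
    # instead of re-reading the entire history on every checkin.
    def new_tally():
        return {"total": 0, "venues": set()}

    def on_checkin(tally, venue_id):
        tally["total"] += 1
        tally["venues"].add(venue_id)
        earned = []
        if tally["total"] == 100:
            earned.append("centurion")     # hypothetical badge
        if len(tally["venues"]) == 50:
            earned.append("explorer")      # hypothetical badge
        return earned
    # O(1) work against a near-constant-size record per checkin; the
    # unbounded history never has to be resident in RAM.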

~~~
jorgeortiz85
This is probably feasible for some badges, but would be hard to do for all of
them.

This approach also increases the complexity of adding new badges, which is
undesirable for product and business reasons.

It could certainly help in some cases though, and it's something we're
considering.

------
jranck
_For example, if we had notifications in place to alert us 12 hours earlier
that we needed more capacity, we could have added a third shard, migrated
data, and then compacted the slaves._

Where did Foursquare find their engineers? I hope no one lost their job here
but this is pretty elementary stuff.

~~~
harryh
It's true that this is elementary in and of itself, but looking at things with
a bit of a wider lens shows the complexity. We're a small engineering team (10
people) working on a product that is growing extremely fast both in terms of
usage and feature set. Meanwhile we're also pretty much constantly re-
architecting things to keep up with growth and also doing the immense work of
growing the company up from 3 people to 33 and beyond (this has turned out to
be WAY HARDER than I would have guessed going in).

Further, there are lots and lots of different things that we need to be
monitoring at any different time to make sure that everything is going ok and
we aren't about to run into a wall. Automated tools can help a lot with this,
but these tools still need to be properly set up and maintained.

I'm not saying we didn't screw up. We had 17 hours of downtime over two days.
We screwed up bad, and we feel horrible about it, and are doing a lot to make
sure that we don't screw up the same way again.

But it's not because we're morons that never thought about the fact that we
should be monitoring memory usage. We just got overwhelmed with the complexity
of all that we're doing at once.

-harryh, foursquare eng lead

~~~
chewbranca
I agree completely about the complexity of monitoring solutions for small
teams with fluctuating applications. However, something as simple as htop
running on an extra monitor would have alerted you to this issue long before
downtime resulted.

~~~
cullenking
Awww come on, it's easy to say from an outside perspective. Regardless, the
problem was handled well in the end, and we all get the benefit of
understanding these limitations better. I think us tech people got the good
side (information) out of this ordeal :)

~~~
chewbranca
I work on a team of 2 where I'm responsible for a handful of servers. I chimed
in because I'm in a similar position; I've been looking for a monitoring
solution for a while now. Things like nagios and zenoss are overkill, but
lack of time has prevented me from finding an ideal solution. That said, I
keep htop open and running at all times, and it's saved my ass on more than
one occasion. I say htop because of the color coding it provides; if things
start going red it attracts my attention.

~~~
sjs
Nagios is pretty nice. It's dead easy to write custom monitors and clients are
everywhere, there's even a Firefox extension. It requires a bit of learning to
get going with but it's not so bad and the pay-off is big.

That said I'm looking at monit too. I hear it's quite nice and has less of a
learning curve.

~~~
hartror
I'll put a vote in for Zabbix as a good option, it has saved us more times
than I can count.

~~~
riffraff
zabbix user here, and it works nicely. But man, it is painful to use.

------
jmulho
Why on earth would you want to keep 236 million check-in documents (66 gig /
300 bytes) in memory?

Here is an idea: write an algorithm that keeps the most recently used 100
million check-in documents in memory. That'll save 38 gig of RAM.

Or how about this idea from the 1970s: write an algorithm that keeps the most
recently used 4 gig of check-in documents in memory. That will save 62 gig of
RAM, and the most recently active 14 million users (4 gig / 300 bytes) will
still enjoy instantaneous response.

Can someone help me out with the math here?
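
For the record, the arithmetic does check out (assuming GiB and 300-byte
documents):

    GiB = 2 ** 30
    print(66 * GiB // 300)                        # ~236 million documents
    print((66 * GiB - 100 * 10**6 * 300) // GiB)  # keep 100M docs: ~38 gig saved
    print(4 * GiB // 300)                         # ~14 million docs in 4 gig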

~~~
ww520
Isn't that what the on-demand disk cache (and paging) from the OS gives you?
Only the data being accessed and used is in memory.

Unless they're constantly doing data churning over the whole dataset, there is
no need to keep everything in memory.

~~~
jmulho
I am talking about the database looking in memory for what it needs to answer
a request; if it doesn't find what it needs, it reads from disk. Since it has
been allowed limited memory, when it puts something new in memory, it has to
kick something old out. It uses a simple algorithm to identify what to keep
and what to kick out. One of the benefits of this (obvious) design is that
when too many users are active at the same time to keep all of them in memory,
a few of them experience a small lag (as opposed to all of them experiencing a
complete crash). If you have a few terabytes of disk then you have to really
be asleep at the wheel (like for a decade) to totally crash.
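
That "simple algorithm" is plain LRU; a toy version (the capacity and the disk
loader are placeholders):

    from collections import OrderedDict

    class LRUCache:
        """Toy LRU: bounded dict that evicts the least recently used."""
        def __init__(self, capacity, load_from_disk):
            self.capacity = capacity              # e.g. 100M documents
            self.load_from_disk = load_from_disk  # fallback on a miss
            self.data = OrderedDict()

        def get(self, key):
            if key in self.data:
                self.data.move_to_end(key)        # refresh recency
                return self.data[key]
            value = self.load_from_disk(key)      # this user sees a small lag
            self.data[key] = value
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)     # kick out the oldest
            return value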

~~~
brown9-2
I was curious about a similar question to yours: if they had 66GB of RAM, why
did going just slightly over the threshold cause such drastic paging for them
- surely queries aren't touching all parts of their dataset equally?

harryh's answer above about querying the user's entire history for each
checkin answers this question though:
<http://news.ycombinator.com/item?id=1769909>

~~~
dedward
If your system is getting pounded with queries, and you suddenly have to page
to serve those queries, the ability of your system to respond to queries and
other operations goes WAY down, and the queries are just going to pile up,
causing further swapping and load - it's a vicious cycle.

------
donaldc
It sounds like even if the check-ins _had_ grown evenly across the two shards,
that would have only saved them for about another month. Without close
monitoring of the growth in memory usage, foursquare was still due for an
outage.

~~~
harryh
This is 100% true. We were always planning on moving to more shards, we just
thought we had more time to do it than we did.

------
moondowner
One thing that I admire about Foursquare is that they don't only use the
technology (MongoDB, Lift Framework, etc.) but also invest in it. If it
weren't for them, I think Lift wouldn't be as advanced a framework as it
is now.

~~~
harryh
Thanks! We do what we can (we're very busy these days!), and only hope to be
able to do more of this sort of thing in the future.

------
JabavuAdams
I find it interesting that with all this talk of scaling, and the specialized
tools that go along with it, 4sq's service was essentially running on 2
servers.

Maybe I only need 1 server to be profitable. Hmm...

------
chrislloyd
117 GB of data and 300 B for a checkin means that Foursquare has had
408,944,640 checkins.

~~~
ehwizard
There are numerous indexes on the data - so actual data is about half of that.

------
alexpopescu
While the details are very interesting, there are still many questions to be
answered (on both sides):

- how difficult would it be to bring up read-only replicas? (hopefully that
should take much less than 11 hours + 6 hours)

- why could the 3rd shard accommodate only 5% of the data?

- how can you plan capacity when using the "wrong" sharding? (basically
leading to unpredictable distributions)

I have posted the rest of the questions here:
<http://nosql.mypopescu.com/post/1265191137/foursquare-mongodb-outage-post-mortem>
as I hope to get some more answers.

~~~
harryh
> how difficult would it be to bring up read-only replicas? (hopefully that
> should take much less than 11 hours + 6 hours)

Bringing up read-only replicas would have been easy, but our appservers are
not currently designed to read data from multiple replicas so it wouldn't have
helped. We hope to make architectural changes to allow for this sort of thing
in the future but aren't there yet.

> why could the 3rd shard accommodate only 5% of the data?

The issue wasn't the amount of data the 3rd shard could accommodate, but the
rate at which data could be transferred off the 1st (overloaded) shard onto the
3rd shard.

> how can you plan capacity when using the "wrong" sharding? (basically
> leading to unpredictable distributions)

We weren't really using the "wrong" sharding. And even the uneven distribution
we saw (about 60%/40%) wasn't totally horrible. Not 100% understanding your
question here.

~~~
dirkgadsden
What exactly is the rate like for transferring data across MongoDB instances
on EBS? Was the overhead of going across EBS volumes the major factor in the
length of the downtime?

The 60%/40% sharding distribution doesn't seem too bad in the big picture;
are there any plans to even that out?

------
kvs
Looks like some data engineering needs to be done to back memory up with disk
storage. I think you guys, by now, have enough information about users and
checkins to create algorithms that can determine "hot" checkins vs. "cold"
checkins as and when they come in. Because not every user is the same, a more
user-centric view can definitely help you come up with a more scalable (or
gracefully failing) system. For example, I don't checkin as much as some
of my friends who checkin 10-15 times a day. So, there is no point in giving
my checkins as much priority as those busy folks.

------
jmulho
Can we clear up one issue? Sharding on userid -- bad or fine? I say fine. If
your algorithm is "if userid between A and M then server0, else (N-Z)
server1", then you could get unbalanced. But if your algorithm is "if random
50/50 hash of userid = 0 then server0, else (random 50/50 hash = 1) server1",
then you are going to stay almost exactly balanced (law of large numbers). You
would never, ever, get anywhere near a 60/40 imbalance. So the choice to use
userid, if implemented correctly, is completely fine.
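
Easy to check with a simulation (one caveat: the balance is over users, and
since check-ins per user vary a lot, bytes per shard can still drift, which
may be where foursquare's 60/40 came from):

    import hashlib

    def shard_of(user_id, n_shards=2):
        # stable pseudo-random hash of the userid, not a range split
        h = hashlib.md5(str(user_id).encode()).hexdigest()
        return int(h, 16) % n_shards

    counts = [0, 0]
    for user_id in range(1000000):
        counts[shard_of(user_id)] += 1
    print(counts)  # ~[500000, 500000]; deviation is on the order of
                   # sqrt(n), a fraction of a percent -- never 60/40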

------
richardw
How much can using SSDs for paging help with this? Or to just serve the
reads, since they're excellent at random access - surely typical of
Foursquare's usage?

------
voxxit
Why they had only two database servers running (with their database in
memory, no less) with 200 million check-ins is completely beyond me.

~~~
seiji
They are dealing with web scale sharded NoSQL realtime geo scala. Old rules
don't apply when 80% of the words describing your company didn't exist two
years ago.

~~~
doty
I honestly cannot tell whether or not that comment was serious.

~~~
epochwolf
Any comment with the words "web scale" should be considered a joke until
proven otherwise. :)

------
robryan
This may be a little naive seeing as I haven't used MongoDB or know anything
about foursquare architecture, but wouldn't a better result have been just to
drop requests that were hitting the disk? Sure some people would have been
getting a sub optimal experience but arguably it's a lot better than not
having your service up at all.

~~~
sofuture
From my understanding, that 'routing' decision would have to take place inside
of Mongo -- null routing DDoS packets on a network is one thing, dropping
certain database requests on the fly inside a database server sounds a
_littttle_ bit trickier.

~~~
dedward
Tunable concurrency control for the number of incoming requests would have
been good - before they even hit mongo.

Then you can just dial things down a bit (making customers wait a bit) and let
the system recover, and tune things back to the point of optimal behaviour,
rather than letting things overload. (You set your concurrency limits at a
point where you know the system is just beginning to slow down but is still
performing satisfactorily.)
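
In its simplest form that's a fixed ticket pool in front of the database, with
the pool size as the tunable knob (the number here is hypothetical):

    import threading

    MAX_IN_FLIGHT = 64  # measured point where the DB is busy but healthy
    tickets = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def query_with_backpressure(run_query):
        # callers wait here instead of piling onto the database; under
        # overload users see added latency rather than a collapse
        with tickets:
            return run_query()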

------
look_lookatme
While I understand this is foremost a monitoring and architecture issue (on
both the mongo and 4sq sides), I'm curious how dangerous it is to run such
IO-dependent systems on EC2. Would the prolonged downtime (due to shard
migration, etc) have been as severe if they were running on hardware? What if
these two servers were on SSD RAID 0?

~~~
dirkgadsden
The duration of the downtime seems to have been dependent on not just disk
throughput, but also network and CPU throughput. Eliot did mention that a
large portion of the downtime was partly caused by the slowness of EBS. Making
a rough estimate, I think the downtime would probably be about 2/3 of how long
it was if it had been on an SSD RAID. However, then you run into the issue of
having to maintain your own servers; and judging by how heavily they are using
EC2 and EBS, Foursquare does not have any desire whatsoever to manage its own
infrastructure.

~~~
look_lookatme
It's an interesting angle if that's the reason they are using EC2. Dedicated
hosting might be more expensive, but hardware managed is pretty efficient at
this point.

Also, given the vast number of EC2 instances they could afford, it seems
counterintuitive to be running only two mongo shards. If you are that
stingy about spreading data around, you might as well be using dedicated.

~~~
dirkgadsden
It doesn't seem that it was necessarily stinginess out of neglect, but rather
stinginess imposed by their situation. Like the article said, it took hours to
create a new shard, downtime that Foursquare definitely did not want.

------
piramida
So some people still use built-in memory allocation for large-scale memory-
intensive projects like mongodb, eh? Pity that; they should have looked at why
exactly memcached (and any other sane piece of memory-write-erase intensive
application) does slab allocations.
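
The memcached trick, roughly: pre-carve memory into fixed size classes so that
a freed 300-byte slot is exactly reusable by the next 300-byte object instead
of leaving an odd-shaped hole. A toy sketch:

    # toy slab allocator: one free list per size class, so frees never
    # create odd-shaped holes -- a freed chunk is always exactly reusable.
    SIZE_CLASSES = [64, 128, 256, 512, 1024]

    free_lists = {c: [] for c in SIZE_CLASSES}
    brk = 0                              # bump pointer into the arena

    def alloc(size):
        global brk
        cls = next(c for c in SIZE_CLASSES if c >= size)
        if free_lists[cls]:              # reuse a freed chunk first
            return free_lists[cls].pop(), cls
        off, brk = brk, brk + cls        # else carve fresh space
        return off, cls

    def free(offset, cls):
        free_lists[cls].append(offset)   # back to its own size class

    a, ca = alloc(300)   # lands in the 512-byte class
    free(a, ca)
    b, cb = alloc(400)   # reuses exactly the same chunk: no new hole
    assert a == b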

------
famousactress
I wonder if there ought to be a move to allow for sharded systems to lump
groups/pages nicely.. kind of like MySQL's partitioning.. where you won't run
into the issue of sparsing out pages when you migrate data.. instead you'd
effectively migrate pages.

~~~
ehwizard
Yes - something that keeps data organized by shard key is being planned.

------
ww520
A typical way to deal with uneven distribution in sharding is to do Hash(key)
Mod P, where Hash is a cryptographic hash and P is the number of partitions.
This way the hash function would randomize the distribution of keys.
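
A minimal version (sha1 here, but any well-mixed hash works):

    import hashlib

    def partition(key, P):
        # the cryptographic hash scrambles the key space, so sequential
        # or clustered keys still spread evenly over the P partitions
        digest = hashlib.sha1(str(key).encode()).digest()
        return int.from_bytes(digest, "big") % P

    print([partition(uid, 3) for uid in range(10)])
    # sequential userids land on effectively random partitions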

------
sasidharm
I am not trying to start a debate about sql vs nosql, but I have an honest
question. What if foursquare was using a standard RDBMS like Oracle? Would
they have run into the same problem?

~~~
andrewf
In my experience with things like postgres, yes.

If your working set (the data that you are touching regularly) grows past
available RAM, you very rapidly switch from a system where disk access plays
no part in most queries, to a system where it is the limiting factor.

Because disks are so much slower, and having queries take only ten times as
long can cause them to rapidly queue up, you end up with an unresponsive
system. The appropriate response is almost always to turn off features to get
the size of your working set back down, which buys you time to either alter
the app to need less data, or get more RAM into the box.

------
2mur
I don't use NoSQL (yet) but I'm curious if there would be a better key to
shard by (instead of user_id)?

------
nailer
To fix Google Group's unnecessary use of monospaced fonts so you can read
this:

<http://lab.arc90.com/experiments/readability/>

------
justlearning
for those who don't want to login to see the post:

<http://gist.github.com/616108>

------
BenSchaechter
That was one of the most well-written, to-the-point explanations I have read
in a long time.

------
darkhorse
and out comes mongo's dirty little secret - you have to have enough ram in
your boxes to hold not just all the data in ram, but all the indexes too, or
it completely shits the bed.

putting hundreds of gigs of ram in a box isn't cheap.

are the foursquare folks considering rewriting with a traditional datastore
like postgres and some memcached in front of it?

~~~
rit
They migrated to MongoDB from Postgres. Specifically because they needed
distributed sharding which Postgres didn't have.

Jorge Ortiz, one of the developers at Foursquare said it best on twitter
yesterday:

"Baffled by the Mongo haters. If foursquare had stayed on Postgres, we
would've had to write our own distributed sharding/balancing/indexing."

"Two problems: 1) Our job is to build foursquare, not a database. 2) These
things are hard. Odds are we would have had even more downtime."

<http://twitter.com/#!/jorgeortiz85/status/26563381834>
<http://twitter.com/#!/jorgeortiz85/status/26563387808>

Which sums it up incredibly well.

Your comment shows a scary lack of depth - this is the "dirty little secret"
of _ANY_ application that needs serious scalable data performance. You need to
put your data in RAM where it is quickly accessible.

As soon as you go to disk especially on a low I/O cloud box you are going to
be in 'extremely slow' territory. This is why, even before the current NoSQL
movement, everyone was using Memcached all over the place. Not because it was
a fad but because they needed speed.

SQL Databases have RAM caches as well which give you a pretty similar
behavior. And big surprise - if you throw more memory at them - they run
faster! Because they cache things in memory! In this case I believe that what
Eliot was highlighting was that Foursquare's performance tolerance
requirements were such that going to disk was not an optimal situation for
them.

There are plenty of applications out there using MongoDB without keeping it
all in RAM; I've deployed and maintain several. Having more memory certainly
gives you better performance, but it tends to act as a most-frequently-used
cache.

It never "Shit the bed" here to use your sophomoric vernacular. As the post
states, it started having to go to the disk after it surpassed the memory
threshold which slowed things down... nothing "crashed" as far as the
description states. But a read/write queue backlog on any database is likely
to exhibit the same behavior.

Your postgres + memcached solution fixes what exactly? They would still need
the same slabs of RAM to solve the problem, lest they go to disk on postgres
and slow to a crawl the same way they did with MongoDB.

------
michaelhalligan
Net impact analysis of outage: A bunch of socially awkward technophiles
couldn't tell their friends that they were eating lunch.

~~~
slantyyz
I wonder if burglaries also went down during the outage, since the bad guys
couldn't tell if their targets were at home or busy being the Mayor at the
local bowling alley.

------
invertedlambda
MongoDB is web scale.

~~~
invertedlambda
You guys are lame - you obviously haven't seen:
<http://www.xtranormal.com/watch/6995033/>

------
narrator
Well Foursquare's Scala/Lift front-end sure has been holding up nicely despite
all the FUD about Lift's stateful architecture. Funny that Mongo DB, which
gets all the scalability hype, is the first thing to have trouble scaling.

~~~
rit
The described problem - imbalance in shard allocation - is not one that's
specific to MongoDB.

In fact one of the huge choices one has to make when deploying, say,
Cassandra, is the partitioning algorithm. Sequential partitioning (say, users
a-l go on partition 1 and m-z on partition 2) gives you certain capabilities
for range queries but also risks overloading one node if one particular user
is significantly larger than others.

Cassandra recommends random partitioning as a way of better balancing your
data across shards.

You're going to run into the same problem on just about any sharded setup
(With any software) - how do you make sure that you are distributing new
chunks/documents/rows in a way that doesn't overload any given server.

~~~
tav
Or your datastore can take care of doing that for you, c.f. BigTable. This is
one of my main gripes with most of the current set of NoSQL offerings — they
leave too many decisions in the hands of developers.

Whilst it would definitely be advantageous for all developers to understand
the intricacies of various CPU and OS scheduling algorithms, it's not an issue
that most developers have to deal with directly. The App Engine datastore, in
particular, proves that it is possible to create NoSQL datastores which don't
force developers to think about issues like load balancing.

~~~
rit
Well - in many cases they leave the decisions in the hands of developers
because they have made a conscious choice to let the developers decide.

BigTable is more than anything a filesystem - it is NOT a database. It lacks
true user configurable indexes, custom sorting, querying etc. These features
and the data storage requirements to support them are a very different
prospect from what a filesystem (even a distributed one) needs.

------
swaits
Right, kick ass. Well, don't want to sound like a dick or nothin', but, ah...
it says on your chart that you're fucked up. Ah, you talk like a fag, and your
shit's all retarded. What I'd do, is just like... like... you know, like, you
know what I mean, like...

~~~
swaits
Geez. No sense of humor. It's a line from Idiocracy. Hilarious.

