i'm baffled at how there could be no monitoring or reporting in place to catch that days or weeks ahead of time, let alone the mere 12 hours the mongodb developer says would have been needed to fix the problem without downtime. it's such a fundamental thing to keep track of for a system designed entirely around storing a bunch of data in the memory of those 2(!) machines. i have more servers than foursquare and none of them do even a small fraction of the processing that theirs do, yet i have real-time bandwidth, memory, cpu, and other stats being collected, logged, and displayed, as well as nightly jobs that email me various pieces of information. nobody at foursquare ever even logged into those systems and periodically checked the memory usage manually?
worse still, during all of this, the initial outage reports were blaming mongodb or saying the problem was unknown. even at that point nobody at foursquare realized that the servers were just out of memory?
how did the developers come up with 66 gigabytes of ram to use for these instances in the first place? was there some kind of capacity planning to come up with that number or is it just a hard limit of EC2 and the foursquare developers maxed out the configuration?
The high-memory quadruple extra large (no, I'm not making that name up) instance offers 68.4GB of ram. Presumably they left the rest for the OS.
Guess more extensive monitoring just got bumped to the top :)
My point is that it is stupidly important to have good monitoring in place and I am surprised that one of the best known startups had a gaping hole in theirs.
(again, NOT saying this applies to foursquare, I have no idea)
My point is, and perhaps the downvote indicates I didn't make it well, that monitoring and alerts are extremely important for any startup. They admit as much in the article: had they known sooner there wouldn't have been any downtime. It isn't an excuse to say it got lost in the crush (not that they are trying to use it as an excuse; they're being very upfront and admitting fault).
We have neither the load nor the team they do, and we have comprehensive monitoring and alerts across our infrastructure as part of the SOE that each server gets. Perhaps we prioritized monitoring more highly than they did because we are a small team?
And I bet they have prioritized monitoring now. Now where did I put the key to that barn door?
It's in the saddlebags.
Maybe there were other things the engineers were working on that brought definite (as in 100%) benefit. Maybe there's a good product/consulting opportunity here. Everybody needs monitoring, but it takes time and resources.
I agree that they are focusing on it now because it happened.
No. of Developers: x
No. of Sysadmins: 0
It's a different job, often with a different career path, and a very different attitude.
Though some of us get to be both Code Monkey and the Scary Monk.
On a more technical note, it would be nice to have a way to compact indexes online without having to resort to doing so on a slave, but Mongo is a minimum two server product to begin with, so it's not the end of the world. Overall it's a great datastore, and it's only going to get better in the next few years.
are you sure? i thought it runs well on one server and even in some sort of run-locally setup.
Interestingly, this is one of the reasons antirez gives as to why redis will not be using the built-in OS paging system, but instead will use one custom-written for redis' needs.
The problem isn't paging, per se -- the paging system is doing exactly what it should be doing, paging in blocks off of the disk and into memory as they become hot, which for this use case is always.
The problem is that you get fragmentation in your pages. If you allocate three records in a row that are 300 bytes, and then need to rewrite the first one to make it 400 bytes, or delete it altogether, you end up creating a hole there.
The typical strategy for dealing with those holes is to maintain a list of free blocks which can be reused, and hope that the distribution of incoming allocations neatly fits into the unused chunks.
However, as they note, if you have a fully compacted 64 GB active data set and remove 5% of it you just end up with address space that looks like swiss cheese; it's not set up to elegantly shrink, but to recycle space as it grows.
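A minimal sketch of that free-list behaviour (a toy allocator for illustration, not MongoDB's actual code; sizes match the 300/400-byte example above):

    # Toy free-list allocator: deletes leave holes that only get reused
    # when a new record happens to fit.
    free = []          # list of (offset, size) holes
    end = 0            # high-water mark of the data file / address space

    def allocate(size):
        global end
        for i, (off, sz) in enumerate(free):
            if sz >= size:                   # first hole big enough wins
                free[i] = (off + size, sz - size)
                if free[i][1] == 0:
                    free.pop(i)
                return off
        off, end = end, end + size           # no hole fits: grow the file
        return off

    def release(off, size):
        free.append((off, size))             # deletion just records a hole

    a = allocate(300); b = allocate(300); c = allocate(300)
    release(a, 300)                          # delete the first record
    d = allocate(400)                        # 400 bytes won't fit the 300-byte hole,
    print(end)                               # so the space keeps growing: prints 1300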
There are a couple of things that I find a bit odd here though: first, they should have seen this coming before migrating data; it's pretty obvious from the architecture. Second, they mention the solution as being auto-compaction, which wouldn't actually have helped them.
Auto-compaction is in fact useful, but all that it does is, well, compact stuff. It means they would have hit the limits later, but once the threshold was crossed, they'd have the exact same problem. Auto-compaction is either an offline process that runs in the background or a side-effect of smarter allocation algorithms. Both of those things need time once you remove data from an instance to reclaim the holes in the address / memory / disk space ... which is exactly what they did manually.
The only really sane ways to handle something like this are notifications at appropriate levels, or block-aware data removal -- e.g. "give me stuff from the end of the file". I don't know if mongo uses continuation records and stuff like that enough to know how difficult that would be for them.
(Note: Directed Edge's graph database uses a similar IO scheme, so I'm doing some projecting of our architecture onto theirs, but I assume that the problems are very similar.)
You are correct, the actual problem is paging out LRU keys as opposed to memory holes. The issue is related but not the same.
Multiply this by all the keys you have in memory and try visualizing it in your mind: these are a lot of small objects. What happens is simple to explain: every single 4k page will contain a mix of many different values. For a page to be swapped to disk by the OS, all of the objects it contains must belong to rarely used keys. In practical terms the OS will not be able to swap a single page at all, even if just 10% of the dataset is actually in use.
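A rough back-of-envelope sketch of why that happens, assuming ~100-byte values packed ~40 to a 4KB page and hot keys spread uniformly at random (figures assumed for illustration):

    # Chance that *every* value on a 4KB page is cold, i.e. the page is swappable.
    objects_per_page = 4096 // 100          # ~40 small values per page
    hot_fraction = 0.10                     # only 10% of the keys are hot
    p_page_all_cold = (1 - hot_fraction) ** objects_per_page
    print(f"{p_page_all_cold:.1%}")         # ~1.5% of pages could actually be swapped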
However, assuming that all data must actually be in memory (as stated in the posted email), you don't actually solve the Mongo problem with more efficient organization of the data set, though compacting could be considered a sub-problem of the one that antirez describes.
But as I noted in my earlier comment, compacting wouldn't have actually solved their problems, it just would have delayed them. It's reasonable to ask if all of their data truly needs to be hot, but even there, you'd eventually hit diminishing returns as you approached the threshold where your active set couldn't fit in memory, and there smarter data organization wouldn't actually fix things once you started pulling chunks out for sharding; you'd still need to recompact.
How to compare this to compaction, in the general case, isn't immediately obvious to me.
Maybe they would be better off identifying regular users; these would require on-the-fly compacting and be hosted on one or two machines, with a third, smaller server for the infrequent users.
It wasn't a random failure or a sudden spike that caused the crash -- it was completely predictable growth. Foursquare had already experienced the problem once, and they had solved it. All they needed to do was monitor their growth and iterate that solution.
Sure, Foursquare could have had a better sharding algorithm, but that would only have put off the crash a bit longer. This is a very basic failure -- not monitoring a system that you know is steadily growing.
Good job, guys. If only the likes of Apple followed this example.
Of the people I know who use foursquare, the response was "Hm, foursquare isn't working" puts phone away.
Of course, not all Americans, but I was surprised when I first saw this.
When we're growing up our parents teach us to take responsibility for our actions, and if we screw up or wrong someone, admit it and make it right. Then we get out in the real world and it's the exact opposite, and the higher you go the worse it seems to get.
People in Canada and the US are more prone to admit mistakes than in more prestige / respect-oriented cultures. I.e. apparent lack of respect has its benefits.
Also, this validates Redis's VM position that 4KB pages are too large to properly manage data swapping in web application storage.
There's no silver bullet here. Hashing on insertion order would basically guarantee that writes favor one node over another, while random hashes would force you to aggregate results from all available nodes for each query.
For example, last I heard google's search index was sharded by document rather than by term.
That sounds odd, since if it were sharded by term, a given search would only need to go to a handful of servers (one per term), with the intermediate results then combined. But with it sharded by document, every query has to go to all the nodes in each replica/cluster.
It turns out that's not as bad as it seems. Since everything is in ram, they can answer "no matches" on a given node extremely quickly (using bloom filters that mostly fit in processor cache, or the like). They also only send back truly matching results, rather than intermediates that might be discarded later, saving cluster bandwidth. Lastly, it means their search processing is independent of the cluster communication, giving them a lot of flexibility to tweak the code without structural changes to their network, etc.
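A rough sketch of that per-node fast-rejection idea (not Google's actual implementation; sizes and hash choices are arbitrary). Each document-sharded node keeps a small Bloom filter over the terms it indexes, so it can say "no matches here" without touching its index:

    import hashlib

    class Bloom:
        def __init__(self, bits=1 << 20, hashes=4):
            self.bits, self.hashes = bits, hashes
            self.array = bytearray(bits // 8)

        def _positions(self, term):
            for i in range(self.hashes):
                h = hashlib.md5(f"{i}:{term}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.bits

        def add(self, term):
            for p in self._positions(term):
                self.array[p // 8] |= 1 << (p % 8)

        def might_contain(self, term):
            return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(term))

    # Each shard adds only the terms from its own documents; a query checks the
    # filter first, so most shards answer "no matches" in microseconds.
    shard_filter = Bloom()
    shard_filter.add("foursquare")
    print(shard_filter.might_contain("foursquare"))  # True
    print(shard_filter.might_contain("mongodb"))     # False (almost certainly)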
Does that mean everyone doing search should shard the same way? Probably not. You have to design this stuff carefully and mind the details. Using any given data store is not a silver bullet.
If you shard a search index by term instead, any document containing several terms ends up duplicated across each of those terms' indexes. And for a large index, you need to scale up your hardware to handle the term, or shard by document within that term anyway.
Exactly. Then a flash crowd occurs and your shard fails to service requests.
While buying that much RAM sounds costly, that's only looking at capacity. Assuming typical 1U servers and common pricing at the moment, you're paying roughly $1k in capital for each 10GB of RAM capacity. However, each of these servers gives you a couple hundred thousand random reads per second. To duplicate that with spinning hard drives would take several hundred spindles at least. SSDs are better, but it still ends up being a lot of devices to duplicate that IO capacity.
This is why virtually everyone at stupendous scale (google, facebook, etc) ends up with very ram centric architectures.
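Rough numbers to make that concrete (the per-node and per-spindle figures below are ballpark assumptions, not measurements):

    # Back-of-envelope: random-read capacity of RAM vs spinning disks.
    reads_per_ram_node = 200_000   # an in-memory node: a couple hundred thousand reads/sec
    reads_per_spindle = 200        # a fast spinning disk: ~100-200 seeks/sec
    print(reads_per_ram_node / reads_per_spindle)   # => 1000.0 spindles to match one node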
There's a simple way to decide how much of which kind of storage you need. Look at the distribution of access times. With current technology, in very rough terms, if an item is accessed more than once a day, it's more cost effective to store it on SSD. If it's accessed more than once in an hour, it's more cost effective to store it in RAM.
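As a sketch, that rule of thumb might look something like this (the thresholds are the ones above; the function and its name are just illustrative):

    def suggest_tier(accesses_per_day):
        """Very rough storage-tier rule of thumb from the comment above."""
        if accesses_per_day >= 24:      # touched more than once an hour
            return "RAM"
        if accesses_per_day >= 1:       # touched more than once a day
            return "SSD"
        return "spinning disk"

    print(suggest_tier(100))   # RAM
    print(suggest_tier(3))     # SSD
    print(suggest_tier(0.1))   # spinning disk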
So far I have never heard of anyone running a commercial RDBMS reiterate the MySQL mantra that you need the entire DB in RAM, and I find it a very puzzling attitude to efficient database usage.
Most of the time, less than 10% of your DB represents the active working-set of your data and a good database-system should be able to analyse what is being used, gather statistics about the data, use those statistics to intelligently optimize your queries and minimize the need for disk IO.
For any decent RDBMS, having to match total memory to the database size is wasting money on RAM for very little extra gain in performance, and this doesn't really make much business sense.
A concrete example would be a server I manage. It has 16GBs of RAM and handles around 5000 GBs worth of DBs. Due to a good RDBMS and intelligent caching and use of statistics, that server has a cache-hit ratio of 98%. That means that only 2% of queries entering the system result in actual disk IO.
In order to get the last 2% of queries served from RAM, I would have to increase the system memory by a factor of 30. At that point just getting another server and setting up replication would be much cheaper and would also allow a theoretical doubling of throughput.
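A quick worked example of why chasing that last 2% is poor value (the latency figures below are assumed, not measured on that server):

    # Assumed latencies: ~0.1 ms for a cached read, ~8 ms for a disk read.
    hit_ratio, ram_ms, disk_ms = 0.98, 0.1, 8.0
    effective = hit_ratio * ram_ms + (1 - hit_ratio) * disk_ms
    print(effective)   # ~0.26 ms average, vs 0.1 ms for an all-in-RAM setup,
                       # for roughly 1/30th of the RAM bill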
Really, having the DB all in RAM makes very little sense at all when you compare it to the costs it involves. No amount of internet argument can convince me that this goal is anything besides a pure waste of money.
"It has 16GBs of RAM and handles around 5000 GBs worth of DBs. Due to a good RDBMS and intelligent caching and use of statistics, that server has a cache-hit ratio of 98%."
This may be your query distribution but it plainly is not everyone's.
"No amount of internet-argument can convince me that this goal is anything besides a pure waste of money."
If Jim Gray's published work doesn't convince you there's not really much left to discuss.
This is actually one of the big long term challenges we're going to have to deal with @ foursquare. Right now we calculate whether you should be awarded a badge when you check in by examining your entire checkin history (which means it needs to be in ram so we can load it fast). While this works now, as we continue to grow it will become more and more of a problem so we'll have to switch to another method of calculating how badges are awarded. Several different options here, each with pluses and minuses.
E.g. for "you've seen foo 4 times today" you'd keep a list of the foos visited. When a new foo comes in you first remove the foos older than 1 day from the list, and insert the new foo. If the list now contains 4 foos you award the badge.
You can use a bitvector to record the badges that have any active state at all, so for badges that have no state associated with them yet (e.g. no foos seen yet) you only pay 1 bit. Or if very few badges have active state on average you could use a list of badges that have state instead of a bitvector, so that you only pay for badges that actually have active state. So total storage is 1 bit per badge or less + a couple of bytes per badge with active state?
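Sketching that storage layout (illustrative field names, not foursquare's schema): one small bitmask per user marks which badges have any in-progress state, and a dict holds state only for those badges.

    class UserBadgeState:
        def __init__(self):
            self.active_mask = 0      # 1 bit per badge type
            self.state = {}           # badge_id -> small blob, only if active

        def touch(self, badge_id, payload):
            self.active_mask |= (1 << badge_id)
            self.state[badge_id] = payload

        def is_active(self, badge_id):
            return bool(self.active_mask & (1 << badge_id))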
Also I recommend giving @naveen a separate shard. The guy just keeps roaming around in the city. Thank you.
This approach also increases the complexity of adding new badges, which is undesirable for product and business reasons.
It could certainly help in some cases though, and it's something we're considering.
To the extent that you do this across the board, you'll have yet another tool to defend against being over capacity.
I suppose if your CPU load is not too high you could even run background processes to monitor "near-badge" status for all users with high enough frequency. You could really get wild with this: people with the most recent check-ins should get scanned first, and of course this list should be updated as a separate entity so it stays small and always in memory.
In this way, the amount of the current day's events is pretty small and you don't need to keep all historic events in memory.
EDIT: I thought about this a bit more, and the checkins are probably time-sensitive, so another 32-bit timestamp (Foursquare Epoch) would need to be stored for each checkin.
For example, starting at ~9pm the data about check-ins at the library may be safely unloaded to disk (the probability of needing this data is low, and reading from disk for those rare cases would do just fine) and be replaced in memory with data about check-ins at clubs, etc...
http://en.wikipedia.org/wiki/Memoization ; see the list of implementations at the bottom - there are ones for Lisp, Python, Perl, Java, etc.
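In Python, the idea fits in a few lines (a toy sketch, not one of the linked implementations):

    def memoize(fn):
        cache = {}
        def wrapper(*args):
            if args not in cache:
                cache[args] = fn(*args)   # compute once, remember the result
            return cache[args]
        return wrapper

    @memoize
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    print(fib(100))   # instant, instead of astronomically slow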
If your disk is spinning at 6000 rpm, a given sector spins past 100 times per second, which makes lookup time anywhere from 0 to 1/100th of a second, or 1/200th of a second on average. Disks only let you look for a limited number of things at once, so you can do less than 1000 disk seeks per second. Period.
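Spelling out that arithmetic (the seek time below is an assumed typical figure):

    rpm = 6000
    rotations_per_sec = rpm / 60                       # 100
    avg_rotational_latency = 0.5 / rotations_per_sec   # 5 ms on average
    seek_time = 0.005                                  # plus a few ms to move the head
    print(1 / (avg_rotational_latency + seek_time))    # ~100 random reads/sec per disk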
If data is living on disk and you have high query volume, it is really, really easy to blow past that limit. The solution is to shard data on multiple machines in RAM. This gives a fixed cost per unit of RAM. As long as you don't exceed available RAM, it works well. Luckily it isn't hard to monitor available RAM and respond in advance. (They didn't do that in this case.)
If you don't care about latency, or have a lower query volume, then you can live with data on disk, and frequently accessed data in RAM. A few well-designed caching layers in front can give you even more headroom. This is much cheaper. But has more complicated failure modes, and they can be tricky to monitor properly.
If you can fit your db into RAM that's the ideal, but of course you also have to have a reliable, persistent backup. The best current compromise is probably PCI based SSD drives.
John Ousterhout (of Tcl fame) and his group at Stanford are working on a project called RAMCloud that's exploring the feasibility of storing data in RAM at all times, using disk only as backup. See his projects page http://www.stanford.edu/~ouster/cgi-bin/projects.php for more info. IMO, given the continuous improvements in RAM prices and network latencies, this type of setup for permanent storage will be the norm in data centers in a few years.
These new projects really seem a lot like re-hashing problems that have already been solved over and over again - people just keep forgetting to do proper systems engineering in the first place. There is no magic bullet.
In short they use sharded mysql instances (sans joins) as a key-value store with memcache on top of that.
I don't understand why they are running the whole NoSQL DB on those monster machines - it just defeats the purpose. And the sharding architecture they mention is quite error-prone. I would go with separate read/write channels for things of their nature - it appears they don't have that either.
Monitoring your working set is tough.
No, we're not necessarily talking about a ramdisk here or specialized software (though we could be) - we're talking about proper system design with enough RAM and tuning to get the response times you need.
Where did Foursquare find their engineers? I hope no one lost their job here but this is pretty elementary stuff.
Further, there are lots and lots of different things that we need to be monitoring at any different time to make sure that everything is going ok and we aren't about to run into a wall. Automated tools can help a lot with this, but these tools still need to be properly set up and maintained.
I'm not saying we didn't screw up. We had 17 hours of downtime over two days. We screwed up bad, and we feel horrible about it, and are doing a lot to make sure that we don't screw up the same way again.
But it's not because we're morons that never thought about the fact that we should be monitoring memory usage. We just got overwhelmed with the complexity of all that we're doing at once.
-harryh, foursquare eng lead
In addition to hiring scalability experts (many will claim to be experts but most aren't), talk to your investors to find outside technical advisors who have worked on, and are still working on, similar problems.
Find advisors from Google, Facebook, and other places who have dealt with these issues. In my experience, you have as much to learn from people who have made mistakes as from people who have succeeded.
Also, in my judgement, you guys are a little too risk-tolerant for your significance (first major Scala 2.8 upgrade, largest MongoDB deploy...).
It's not like your 66GB is hot all the time (or even big by any sort of measure), so that makes no sense to me, and that part of things needs to be further explained by 10gen and/or 4sq.
That said I'm looking at monit too. I hear it's quite nice and has less of a learning curve.
Get that tool in place NOW while you are small, so it's there when you get bigger... so you can hire people to watch it while you move on to other things.
You will always have a lack of time, and always be busy. Spend a weekend putting up Nagios, setting up alerts over SMS or XMPP or Twitter or whatever you want, and then move on. This is fundamental to systems management.
Don't over-think it - just get nagios/cacti (again, try groundwork) up and running and graph EVERYTHING you can... you'll be very, very glad you did.
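Even before Nagios is up, a dumb cron-able check buys a lot. A sketch (Linux-only; the threshold, the alert address, and the reliance on a local "mail" command are all placeholders):

    #!/usr/bin/env python
    # Crude memory check: mail someone when free memory drops below a threshold.
    import subprocess

    THRESHOLD_PCT = 15
    ALERT_ADDRESS = "ops@example.com"

    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, value = line.partition(":")
            meminfo[key] = int(value.split()[0])        # values are in kB

    free_kb = meminfo["MemFree"] + meminfo.get("Cached", 0)
    free_pct = 100.0 * free_kb / meminfo["MemTotal"]
    if free_pct < THRESHOLD_PCT:
        msg = "memory low: %.1f%% free" % free_pct
        subprocess.run(["mail", "-s", msg, ALERT_ADDRESS], input=msg.encode())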
We can help you all with these problems.
Here's how fast/easy it is:
1. Create an account (~30 sec)
2. Add your cloud credentials (~45 sec)
3. Install the monitoring agent (~90 sec/node)
4. Create a CPU/Memory/Disk monitor, query-targeted at all your servers (~60 sec) (example: "provider:EC2")
5. You get an email whenever the monitors you created reach the thresholds you set
There are a lot of other cool features - check 'em out on the site: www.cloudkick.com
All our plans are free for 30 days.
Sharding needs to be managed in a way that evenly distributes the data. I would never do it by user ID unless a proper analysis of the data showed it to be a fairly even distribution.
That doesn't excuse anything, but these oversights can happen even when you've got primo talent on board.
Here is an idea: write an algorithm that keeps the most recently used 100 million check-in documents in memory. That'll save 38 gig of RAM.
Or how about this idea from the 1970s: write an algorithm that keeps the most recently used 4 gig of check-in documents in memory. That will save 62 gig of RAM, and the most recently active 14 million users (4 gig / 300 bytes) will still enjoy instantaneous response.
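The classic building block for that (a toy sketch; the 4GB and 300-byte figures are just taken from the comment above):

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = OrderedDict()

        def get(self, key):
            if key not in self.items:
                return None               # caller falls back to disk
            self.items.move_to_end(key)   # mark as most recently used
            return self.items[key]

        def put(self, key, value):
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # evict least recently used

    # ~4GB of 300-byte check-in documents is roughly 14 million entries
    cache = LRUCache(capacity=4 * 2**30 // 300)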
Can someone help me out with the math here?
Unless they're constantly doing data churning over the whole dataset, there is no need to keep everything in memory.
harryh's answer above about querying the user's entire history for each checkin answers this question though: http://news.ycombinator.com/item?id=1769909
OS caching and paging work pretty well at keeping the frequently used data in memory and leaving the rest on disk, and thus utilize system resources more efficiently.
Maybe I only need 1 server to be profitable. Hmm...
- how difficult would it be to bring up read-only replicas? (hopefully that should take much less than 11 hours + 6 hours)
- why the 3rd shard could accommodate only 5% of the data?
- how can you plan capacity when using the "wrong" sharding? (basically leading to unpredictable distributions)
I have posted the rest of the questions here: http://nosql.mypopescu.com/post/1265191137/foursquare-mongod... as I hope to get some more answers.
Bringing up read only replicas would have been easy, but our appservers are not currently designed to read data from multiple replicas so it wouldn't have helped. We hope to make architectural changes to allow for this sort of thing in the future but aren't there yet.
> - why the 3rd shard could accommodate only 5% of the data?
The issue wasn't the amount of data the 3rd shard could accommodate, but the rate at which data could be transferred off the 1st (overloaded) shard onto the 3rd shard.
> - how can you plan capacity when using the "wrong" sharding? (basically leading to unpredictable distributions)
We weren't really using the "wrong" sharding. And even the uneven distribution we saw (about 60%/40%) wasn't totally horrible. Not 100% understanding your question here.
The 60%/40% sharding distribution doesn't seem too bad in the big picture; are there any plans to even that out?
I know hindsight is 20/20, but I can't imagine the foresight was any worse than 20/30.
Then you can just dial things down a bit (making customers wait a bit) and let the system recover, and tune things back to the point of optimal behaviour, rather than letting things overload. (You set your concurrency limits at a point where you know the system is just beginning to slow down but is still performing satisfactorily)
Also, given the vast number of EC2 instances they could afford, it seems counterintuitive to be running only two Mongo shards. If you are that stingy about spreading data around, you might as well be using dedicated hardware.
If your working set (the data that you are touching regularly) grows past available RAM, you very rapidly switch from a system where disk access plays no part in most queries, to a system where it is the limiting factor.
Because disks are so much slower, and having queries take only ten times as long can cause them to rapidly queue up, you end up with an unresponsive system. The appropriate response is almost always to turn off features to get the size of your working set back down, which buys you time to either alter the app to need less data, or get more RAM into the box.
putting hundreds of gigs of ram in a box isn't cheap.
are the foursquare folks considering rewriting with a traditional datastore like postgres and some memcached in front of it?
Jorge Ortiz, one of the developers at Foursquare said it best on twitter yesterday:
"Baffled by the Mongo haters. If foursquare had stayed on Postgres, we would've had to write our own distributed sharding/balancing/indexing."
"Two problems: 1) Our job is to build foursquare, not a database. 2) These things are hard. Odds are we would have had even more downtime."
Which sums it up incredibly well.
Your comment shows a scary lack of depth - this is the "dirty little secret" of ANY application that needs serious scalable data performance. You need to put your data in RAM where it is quickly accessible.
As soon as you go to disk especially on a low I/O cloud box you are going to be in 'extremely slow' territory. This is why, even before the current NoSQL movement, everyone was using Memcached all over the place. Not because it was a fad but because they needed speed.
SQL Databases have RAM caches as well which give you a pretty similar behavior. And big surprise - if you throw more memory at them - they run faster! Because they cache things in memory! In this case I believe that what Eliot was highlighting was that Foursquare's performance tolerance requirements were such that going to disk was not an optimal situation for them.
There are plenty of applications out there using MongoDB without keeping it all in RAM; I've deployed and maintain several. Having more memory certainly gives you better performance, but it tends to act as a most-frequently-used cache.
It never "Shit the bed" here to use your sophomoric vernacular. As the post states, it started having to go to the disk after it surpassed the memory threshold which slowed things down... nothing "crashed" as far as the description states. But a read/write queue backlog on any database is likely to exhibit the same behavior.
Your postgres + memcached solution fixes what exactly? They would still need the same slabs of RAM to solve the problem, lest they go to disk on postgres and slow to a crawl the same way they did with MongoDB.
In fact one of the huge choices one has to make when deploying, say, Cassandra, is the partitioning algorithm. Sequential partitioning (say, users a-l go on partition 1 and m-z on partition 2) gives you certain capabilities for range queries, but also risks overloading a node if one particular user is significantly larger than others.
Cassandra recommends random partitioning as a way of better balancing your data across shards.
You're going to run into the same problem on just about any sharded setup (With any software) - how do you make sure that you are distributing new chunks/documents/rows in a way that doesn't overload any given server.
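The difference in a nutshell (a toy sketch with two shards; not Cassandra's actual token logic):

    import hashlib

    def range_shard(user_id):
        # sequential/range partitioning: a-l on shard 0, m-z on shard 1
        return 0 if user_id[0].lower() <= "l" else 1

    def hash_shard(user_id, shards=2):
        # random/hash partitioning: even spread, but no cheap range scans
        return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % shards

    users = ["alice", "bob", "mallory", "zed"]
    print([range_shard(u) for u in users])   # [0, 0, 1, 1]
    print([hash_shard(u) for u in users])    # spread depends on the hash, not the name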
Whilst it would definitely be advantageous for all developers to understand the intricacies of various CPU and OS scheduling algorithms, it's not an issue that most developers have to deal with directly. The App Engine datastore, in particular, proves that it is possible to create NoSQL datastores which don't force developers to think about issues like load balancing.
BigTable is more than anything a filesystem - it is NOT a database. It lacks true user configurable indexes, custom sorting, querying etc. These features and the data storage requirements to support them are a very different prospect from what a filesystem (even a distributed one) needs.