so, in short, a company relying entirely on cloud computing machines for storing its data, which is presumably being billed according to the memory usage of those machines, ran out of memory on them, and suffered a large amount of downtime as a result. mongodb had little to do with the problem, other than maybe it took longer than expected to migrate data to a third server.
i'm baffled that there was no monitoring or reporting in place to catch this days or weeks ahead of time, let alone the mere 12 hours the mongodb developer says would have been needed to fix the problem without downtime. it's such a fundamental thing to keep track of for a system designed entirely around storing a bunch of data in the memory of those 2(!) machines. i have more servers than foursquare and none of them do even a small fraction of the processing that theirs do, yet i have real-time bandwidth, memory, cpu, and other stats being collected, logged, and displayed, as well as nightly jobs that email me various pieces of information. did nobody at foursquare ever even log into those systems and periodically check the memory usage manually?
worse still, during all of this, the initial outage reports were blaming mongodb or saying the problem was unknown. even at that point nobody at foursquare realized that the servers were just out of memory?
how did the developers come up with 66 gigabytes of ram to use for these instances in the first place? was there some kind of capacity planning to come up with that number or is it just a hard limit of EC2 and the foursquare developers maxed out the configuration?
Exactly. How the hell does a fast-growing, well-funded, 24-employee startup NOT have load monitoring on their database servers?! Pay the 10c an hour for a micro EC2 instance and run Zabbix or one of the half dozen other awesome monitoring packages out there.
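For what it's worth, even without a full package, a dumb cron check gets you most of the way there. A minimal sketch in Python (assuming the psutil library; the threshold and email addresses are made up):

    # minimal memory-usage alert, run from cron every few minutes
    import smtplib
    from email.mime.text import MIMEText

    import psutil  # third-party; pip install psutil

    ALERT_THRESHOLD = 80.0  # percent -- warn long before the box is full

    def check_memory():
        used = psutil.virtual_memory().percent
        if used >= ALERT_THRESHOLD:
            msg = MIMEText("memory at %.0f%% -- add a shard or compact now" % used)
            msg["Subject"] = "memory alert"
            msg["From"] = "alerts@example.com"   # hypothetical addresses
            msg["To"] = "oncall@example.com"
            smtplib.SMTP("localhost").sendmail(msg["From"], [msg["To"]],
                                               msg.as_string())

    if __name__ == "__main__":
        check_memory()

Not a replacement for real trending and graphing, but it would have flagged this class of problem weeks out.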
If it's anything like here, any one engineer's task list is essentially a weighted list of fires to put out and technical debt to pay down, in addition to the features to be delivered next day / week.
Guess more extensive monitoring just got bumped to the top :)
If it is anything like any start up I have ever heard of :P
My point is that it is stupidly important to have good monitoring in place and I am surprised that one of the best known startups had a gaping hole in theirs.
Which is exactly why the "don't optimize now" and/or "wait until you have users before you worry about capacity planning" crowd is simply clueless. You need to think about your data and how it will scale from day one!
Don't know about them - but there seems to be a somewhat scary, hopefully brief, undercurrent out there where businesses (startups, corporations, whatever) think they can work solely with a few developers and some outsourced services - nobody really taking responsibility for overall architecture and growth.
(again, NOT saying this applies to foursquare, I have no idea)
Like not having a massive downtime that gives their users a reason to try out some massive social network's offering that has already started eating their lunch? Oh wait, no, they didn't prioritize that.
My point is, and perhaps the downvote indicates I didn't make it well, that monitoring and alerts are extremely important for any startup. They admit as much in the article: had they known sooner, there wouldn't have been any downtime. It isn't an excuse to say it got lost in the crush (not that they are trying to use it as an excuse; they're being very upfront and admitting fault).
We have neither the load nor the team they do, and we have comprehensive monitoring and alerts across our infrastructure as part of the SOE that each server gets. Perhaps we prioritized monitoring higher than them because we are a small team?
And I bet they have prioritized monitoring now. Now where did I put the key to that barn door?
Humans don't judge probability and risk very well. It's hard to foresee the future and its problems. Hindsight is 20/20. My bag of cliches is exhausted.
Maybe there were other things the engineers were working on that brought definite (as in 100%) benefit. Maybe there's a good product/consulting opportunity here. Everybody needs monitoring, but it takes time and resources.
I agree; they are focusing on it now because it happened.
The "give a crap" factor at 10gen is nothing short of amazing. The company I work at has been using Mongo for a while, and anytime we've had an issue, Eliot has been right there to help us. Mongo is a great product, but like anything, it has its limits. Learning those limits and taking the time to plan your infrastructure is a mandatory part of adopting any technology, and Mongo is no exception.
On a more technical note, it would be nice to have a way to compact indexes online without having to resort to doing so on a slave, but Mongo is a minimum two server product to begin with, so it's not the end of the world. Overall it's a great datastore, and it's only going to get better in the next few years.
Yes, if you care about your data, it's a two server product. Single server redundancy isn't being added until 1.8. Until then, if the server terminates abnormally, there's potential for data corruption. The solution right now is to run a slave.
This solution strikes me as covering only the best-case scenario. What happens if the connection between my slave and master goes down and then my master is unplugged? What happens if both slave and master are unplugged - can both databases be corrupted? What am I missing?
If the connection goes away and your master is unplugged, you will need to run a repair on the master and then bring the slave back into sync. If both are unplugged, you will have to run the repair on both. This obviously isn't ideal, and that's probably why single server durability is the biggest priority for the next release.
In essence, although we had moved 5% of the data from shard0 to the new third shard, the data files, in their fragmented state, still needed the same amount of RAM. This can be explained by the fact that Foursquare check-in documents are small (around 300 bytes each), so many of them can fit on a 4KB page. Removing 5% of these just made each page a little more sparse, rather than removing pages altogether.
Interestingly, this is one of the reasons antirez gives as to why redis will not be using the built-in OS paging system, but instead will use one custom-written for redis' needs.
I don't think those are the same problem -- could you provide a link?
The problem isn't paging, per se -- the paging system is doing exactly what it should be doing, paging in blocks off of the disk and into memory as they become hot, which for this use case is always.
The problem is that you get fragmentation in your pages. If you allocate three records in a row that are 300 bytes, and then need to rewrite the first one to make it 400 bytes, or delete it altogether, you end up creating a hole there.
The typical strategy for dealing with those holes is to maintain a list of free blocks which can be used, and hope that the distribution of incoming allocations neatly fits into the unused chunks.
However, as they note, if you have a fully compacted 64 GB active data set and remove 5% of it you just end up with address space that looks like swiss cheese; it's not set up to elegantly shrink, but to recycle space as it grows.
There are a couple of things that I find a bit odd here though. First, they should have seen this coming before migrating data; it's pretty obvious from the architecture. Second, they mention the solution being auto-compacting, which wouldn't have actually helped them.
Auto-compaction is in fact useful, but all that it does is, well, compact stuff. It means they would have hit the limits later, but once the threshold was crossed, they'd have the exact same problem. Auto-compaction is either an offline process that runs in the background or a side-effect of smarter allocation algorithms. Both of those things need time once you remove data from an instance to reclaim the holes in the address / memory / disk space ... which is exactly what they did manually.
The only really sane ways to handle something like this are notifications at appropriate levels, or block-aware data removal -- e.g. "give me stuff from the end of the file". I don't know enough about whether mongo uses continuation records and the like to say how difficult that would be for them.
(Note: Directed Edge's graph database uses a similar IO scheme, so I'm doing some projecting of our architecture onto theirs, but I assume that the problems are very similar.)
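To make the fragmentation point concrete, here's a toy simulation (not Mongo's allocator, just an illustration of the arithmetic) of deleting 5% of 300-byte records at random and counting how many 4KB pages actually become reclaimable:

    # how many 4KB pages end up completely empty after a random 5% delete?
    import random

    PAGE_SIZE, RECORD_SIZE = 4096, 300
    PER_PAGE = PAGE_SIZE // RECORD_SIZE              # 13 records per page
    NUM_PAGES = 100_000

    live = [PER_PAGE] * NUM_PAGES                    # live records per page
    total = NUM_PAGES * PER_PAGE
    for record_id in random.sample(range(total), int(total * 0.05)):
        live[record_id // PER_PAGE] -= 1             # that record's page thins out

    empty = sum(1 for n in live if n == 0)
    print("fully empty pages: %d of %d" % (empty, NUM_PAGES))   # ~0

Every page gets a little lighter; essentially none of them empty out.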
Multiply this for all the keys you have in memory and try visualizing it in your mind: These are a lot of small objects. What happens is simple to explain, every single page of 4k will have a mix of many different values. For a page to be swapped on disk by the OS it requires that all contained objects should belong to rarely used keys. In practical terms the OS will not be able to swap a single page at all even if just 10% of the dataset is used.
What antirez is getting at there is actually a much harder (and more interesting) problem that could be generalized as something like "efficient data locality for mixed latency access".
However, assuming that all data must actually be in memory (as stated in the posted email), you don't actually solve the Mongo problem with more efficient organization of the data set, though compacting could be considered a sub-problem of the one that antirez describes.
But as I noted in my earlier comment, compacting wouldn't have actually solved their problems, it just would have delayed them. It's reasonable to ask whether all of their data truly needs to be hot, but even there, you'd eventually hit diminishing returns as you approached the threshold where your active set couldn't fit in memory, and even then smarter data organization wouldn't actually fix things once you started pulling chunks out for sharding; you'd still need to recompact.
I doubt every last bit of their data needs to be hot (i.e. in memory) at any given time, but without specialized paging along the lines of what antirez has discussed for redis, enough of their data probably needs to be hot that, from a paging perspective, all of the vm pages of their data need to be hot.
It seems that if you know your page size and you know your record size, you can compute the optimal chunk size to break your records into, linking the chunks with pointers. This causes more memory accesses, but surely that's worth keeping everything in memory? More chunks per record means more pointers and more accesses, but it should still be way faster in many cases.
How to compare this to compaction, in the general case, isn't immediately obvious to me.
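Here's a rough sketch of the chunking idea (purely hypothetical layout, nothing to do with how Mongo or anyone else actually stores records):

    # split records into chunks sized so a whole number fit per 4KB page,
    # with each chunk reserving room for a link to its successor
    PAGE_SIZE = 4096
    POINTER = 8                                  # assumed size of the link

    def payload_per_chunk(chunks_per_page):
        return PAGE_SIZE // chunks_per_page - POINTER

    def split_record(record: bytes, payload: int):
        return [record[i:i + payload] for i in range(0, len(record), payload)]

    payload = payload_per_chunk(8)               # 504 usable bytes per chunk
    print(len(split_record(b"x" * 1024, payload)))   # a 1KB record -> 3 chunks

The trade-off is exactly the extra pointer chases per record versus the compaction work you avoid.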
I couldn't help but think of that exact issue. I suppose once compacting the data online is built in, this particular issue won't come up again. At the same time, when a machine is overloaded, often times you have even bigger problems. For example, if you are out of memory, you may not be able to create another SSH process to get at the box.
We need an "Ask HN" for cool tools. Maybe there's already a SO question about it. I hadn't heard of vmtouch and probably tons of other things people here use.
Well, the other big advantage being that the OS has no idea about the particular kinds of data you want to store, whereas Redis has quite a bit more data to work with. But yes, vmtouch looks useful for these types of situations (thanks).
Surely if they moved more than 5% across, this would have freed up more memory, despite the fragmentation.
Maybe they would be better off identifying regular users; these would require on-the-fly compacting and be hosted on one or two machines, with a third smaller server for the non-frequent users.
Freeing memory would require that there be an entirely empty page. Each page holds 13-14 objects (4k/300). The chance that 13 consecutive objects were all in the 5% (assuming an independent, random distribution) is 1 in 20^13, putting the expected number of empty pages well below 1. You have to migrate much more data before you can hope to see page-size holes.
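Back-of-the-envelope, under that same independence assumption:

    # expected number of fully emptied 4KB pages after a random 5% removal
    records = 66 * 2**30 // 300          # ~236 million 300-byte check-ins
    per_page = 4096 // 300               # 13 per page
    pages = records // per_page          # ~18 million pages
    print(pages * 0.05 ** per_page)      # ~2e-10 -- effectively zero empty pages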
I like the way MongoDB describes the flaws in their own system, but places the blame (four) squarely on the customer, where it belongs.
It wasn't a random failure or a sudden spike that caused the crash -- it was completely predictable growth. Foursquare had already experienced the problem once, and they had solved it. All they needed to do was monitor their growth and iterate that solution.
Sure, Foursquare could have had a better sharding algorithm, but that would only have put off the crash a bit longer. This is a very basic failure -- not monitoring a system that you know is steadily growing.
The main thing I took from this incident is how much being open and honest about issues improves a company's image. Foursquare and 10gen could have very easily played the blame game, or kept their cards close to their chests, and both would have come off poorly. Instead, they described the problem, owned up to their role in it, and laid out a framework for how to avoid the problem in the future. After reading about the issue, I come away with respect for both groups. Sure, mistakes were made, but they treated their users like adults and took responsibility for what they did wrong.
Good job, guys. If only the likes of Apple followed this example.
What are you talking about? How can you tell how the users see foursquare after this? Sure, in the hacker community this detail is appreciated, but how do you know how many foursquare users have left because of these outages? How many people are upset with the company and are now taking facebook locations or the other companies more seriously?
I don't know how many left because of the outages, but I think it's safe to say that very few, if any, left because of the honest explanation of their cause.
Problem is lawsuits. Admitting guilt can be and is used against you in court. Hence, US businesses don't apologize for anything. This is a refreshing exception.
I see it as a very American characteristic, from the people I have worked with before. Many Americans will never admit fault, never admit they do not know something, and are always looking for someone else to blame. I think this comes from the culture of large companies. I have not seen it at smaller US companies, or at other companies I have worked at.
Of course, not all Americans, but I was surprised when I first saw this.
Yeah, I think it's common, if unstated, knowledge in the US that that's the reason for it. Fear of getting sued by some opportunistic lawyer looking for a quick buck.
When we're growing up our parents teach us to take responsibility for our actions, and if we screw up or wrong someone, admit it and make it right. Then we get out in the real world and it's the exact opposite, and the higher you go the worse it seems to get.
This is a large company / management culture thing, not a US thing. Also, it's a techie versus non-techie thing.
People in Canada and the US are more prone to admit mistakes than in more prestige / respect-oriented cultures. I.e. apparent lack of respect has its benefits.
Building sharded systems isn't as simple as throwing consistent hashing into the code and calling it a day. You have to think carefully about what happens when nodes exceed capacity. There is good work in academia on distribution algorithms that gracefully handle reaching capacity (along with data center structure such as rack awareness) [1]. Alternately, if your algorithm doesn't handle a shard reaching capacity you need to have the monitoring and processes in place to ensure you always add more capacity and rebalance before running out.
Also, this validates Redis's VM position about 4KB pages being too large to properly manage data swapping in web application storage.
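For reference, the mechanical part of consistent hashing is tiny -- here's a bare-bones ring with virtual nodes (a generic sketch, not what Mongo or Cassandra actually ship). The hard part the parent comment is pointing at is everything around it: watching utilization and adding or rebalancing nodes before one fills up.

    # minimal consistent-hash ring with virtual nodes (illustration only)
    import bisect
    import hashlib

    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=64):
            self.ring = sorted((_h("%s#%d" % (n, i)), n)
                               for n in nodes for i in range(vnodes))
            self.points = [p for p, _ in self.ring]

        def node_for(self, key):
            i = bisect.bisect(self.points, _h(key)) % len(self.points)
            return self.ring[i][1]

    ring = Ring(["shard0", "shard1"])
    print(ring.node_for("user:12345"))
    # adding "shard2" later only remaps ~1/3 of the keys, but nothing in here
    # notices that shard0 is running out of RAM -- that still needs monitoring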
I'm wondering if the decision to shard based on users was taken with data (i.e., at that point, did each user have a roughly similar number of check-ins?); if not, hashing by that seems kind of fail for that kind of app.
From the email thread, it sounds like the decision to shard on UID was made mostly to increase locality of data, so that you didn't have to query more than one node to get a single user's data.
There's no silver bullet here. Hashing on insertion order would basically guarantee that writes would favor one node over another, while random hashes would force you to aggregate results from all available nodes for each query.
This stuff can be very counter intuitive. Locality may not be what you want.
For example, last I heard google's search index was sharded by document rather than by term.
That sounds odd, since if it was sharded by term, then a given search would only need to go to a handful of servers (one for each term) and then the intermediate result combined. But with it sharded by document, every query has to go to all the nodes in each replica/cluster.
It turns out that's not as bad as it seems. Since everything is in ram, they can answer "no matches" on a given node extremely quickly (using bloom filters that sit mostly in on-processor cache or the like). They also only send back truly matching results, rather than intermediates that might be discarded later, saving cluster bandwidth. Lastly it means their search processing is independent of the cluster communication, giving them a lot of flexibility to tweak the code without structural changes to their network, etc.
Does that mean everyone doing search should shard the same way? Probably not. You have to design this stuff carefully and mind the details. Using any given data store is not a silver bullet.
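A crude sketch of the document-sharded query path being described -- the cheap per-shard "can't possibly match" test stands in for the bloom filters, and none of this is anyone's actual search code:

    # fan a query out to every document shard; shards reject quickly when they
    # can't possibly match and return only real hits
    class DocShard:
        def __init__(self, docs):
            self.docs = docs                                 # doc_id -> term set
            self.all_terms = set().union(*docs.values()) if docs else set()

        def search(self, terms):
            if not terms <= self.all_terms:                  # fast rejection
                return []
            return [d for d, t in self.docs.items() if terms <= t]

    shards = [DocShard({1: {"mongo", "outage"}, 2: {"redis"}}),
              DocShard({3: {"mongo", "sharding", "outage"}})]
    query = {"mongo", "outage"}
    print([d for s in shards for d in s.search(query)])      # [1, 3]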
Sharding a search system by documents gives you several advantages. You can scale horizontally by adding additional indexes for new documents. You can tweak a group of documents more easily. E.g. rank wikipedia higher. Assign better hardware (if necessary), higher priority, etc. Performance is also more uniform. Easier to index content at different frequencies. It's also easier for re-indexing content after tweaking algorithms.
If you shard a search index by term instead, you will end up with duplicate documents stored in each index that contain the same term. And for a large index, you need to scale up your hardware to handle the term or shard by document within that term anyway.
> That sounds odd, since if it was sharded by term, then a given search would only need to go to a handful of servers (one for each term) and then the intermediate result combined. But with it sharded by document, every query has to go to all the nodes in each replica/cluster.
Exactly. Then a flash crowd occurs and your shard fails to service requests.
Is it acceptable/preferred to store your entire db in RAM? I have little idea about large systems but feel like this may be hard to scale if your db grows to hundreds of TB. I'm intrigued to learn more! Anyone know how fb organizes its massive db storage?
Facebook runs primarily out of ram via memcached. The last numbers I'm aware of were that they had about 200TB in memcache capacity [1]. They use a variety of data stores, but primarily sharded mysql. I don't have recent numbers there, but they were above 1000 master-master pairs as of 2008.
While buying that much ram sounds costly, that's only looking at capacity. Assuming typical 1U servers and common pricing at the moment, you're paying roughly $1k in capital for each 10GB of ram capacity. However, each of these servers gives you a couple hundred thousand random reads per second. To duplicate that with spinning hard drives would take several hundred spindles at least. SSDs are better, but still, it ends up being a lot of devices to duplicate that IO capacity.
This is why virtually everyone at stupendous scale (google, facebook, etc) ends up with very ram centric architectures.
There's a simple way to decide how much of what storage you need [2]. Look at the distribution of access times. With current technology, in very rough terms, if an item is accessed more than once a day, it's more cost effective to store it on SSD. If it's accessed more than once in an hour, it's more cost effective to store it in ram.
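In code the rule of thumb is nothing more than this (the once-a-day / once-an-hour breakpoints are from the comment above and will drift as prices change):

    # choose a storage tier from how often an item is touched
    def storage_tier(accesses_per_day):
        if accesses_per_day > 24:       # more than once an hour -> ram
            return "ram"
        if accesses_per_day > 1:        # more than once a day -> ssd
            return "ssd"
        return "disk"

    for rate in (0.1, 2, 100):
        print(rate, storage_tier(rate))   # disk, ssd, ram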
Anything that is both latency sensitive (web apps are) and requires high throughput is going to be RAM-centric. Even services operating at much smaller scales with "old school" single DB boxes need to keep the vast majority of active data in RAM and use things like battery-backed write caches to get acceptable write throughput. Even with the sophisticated tricks mature RDBMS software uses to squeeze every ounce of performance out of spindles, most "old school" DBAs will tell you that to get high throughput for a database you basically need to have RAM = database size.
I have had very acceptable performance for latency-critical applications where the db has exceeded the available memory by a factor of between 10 and 50.
So far I have never heard anyone running a commercial RDBMS reiterate the MySQL mantra that you need the entire DB in RAM, and I find it a very puzzling attitude to efficient database usage.
While you may just be trying to talk down the MySQL dudes, I don't think that "RAM is faster than disk, so you're more performant if you can fit your data in RAM" is really all that puzzling.
I'm not debating that RAM is faster than disk. I'm debating that you need to have the entire DB in RAM to get good performance.
Most of the time, less than 10% of your DB represents the active working-set of your data and a good database-system should be able to analyse what is being used, gather statistics about the data, use those statistics to intelligently optimize your queries and minimize the need for disk IO.
For any decent RDBMS, having to match total memory with the database size is wasting money on RAM for very little extra gain in performance, and this doesn't really make much business sense.
A concrete example would be a server I manage. It has 16GBs of RAM and handles around 5000 GBs worth of DBs. Due to a good RDBMS and intelligent caching and use of statistics, that server has a cache-hit ratio of 98%. That means that only 2% of queries entering the system result in actual disk IO.
In order to get the last 2% of queries into RAM, I would have to increase the system memory by a factor of 30. At that point just getting another server and setting up replication would be much cheaper and would also allow a theoretical doubling of throughput.
Really, having the DB all in RAM is not something that makes even marginal sense when you compare it to the costs it involves. No amount of internet-argument can convince me that this goal is anything besides a pure waste of money.
There's a reason I mentioned object inter-access times. To properly size storage you need to know your query distribution.
"It has 16GBs of RAM and handles around 5000 GBs worth of DBs. Due to a good RDBMS and intelligent caching and use of statistics, that server has a cache-hit ratio of 98%."
This may be your query distribution but it plainly is not everyone's.
"No amount of internet-argument can convince me that this goal is anything besides a pure waste of money."
If Jim Gray's published work doesn't convince you there's not really much left to discuss.
> Is it acceptable/preferred to store your entire db in RAM?
This is actually one of the big long term challenges we're going to have to deal with @ foursquare. Right now we calculate whether you should be awarded a badge when you check in by examining your entire checkin history (which means it needs to be in ram so we can load it fast). While this works now, as we continue to grow it will become more and more of a problem so we'll have to switch to another method of calculating how badges are awarded. Several different options here, each with pluses and minuses.
Would it be too difficult to keep a state for every badge that you update when a new checkin happens? You'd need some specialized logic for every badge, but not much more complex than you already have. Depending on how much space you save it could be worth it.
E.g. for "you've seen foo 4 times today" you'd keep a list of the foos visited. When a new foo comes in you first remove the foos older than 1 day from the list, and insert the new foo. If the list now contains 4 foos you award the badge.
You can use a bitvector to record the badges that have any active state at all, so for badges that have no state associated with them yet (e.g. no foos seen yet) you only pay 1 bit. Or if very few badges have active state on average you could use a list of badges that have state instead of a bitvector, so that you only pay for badges that actually have active state. So total storage is 1 bit per badge or less + a couple of bytes per badge with active state?
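A sketch of that incremental approach, with a made-up "4 foos in one day" badge (hypothetical rule and storage, obviously not foursquare's actual schema):

    # keep small per-user running state and update it on each check-in instead
    # of replaying the whole history
    import time
    from collections import defaultdict

    DAY = 86400
    recent_foos = defaultdict(list)          # user_id -> recent foo timestamps

    def on_checkin(user_id, venue_type, now=None):
        now = time.time() if now is None else now
        if venue_type != "foo":
            return False
        times = recent_foos[user_id]
        times[:] = [t for t in times if now - t < DAY]   # expire old entries
        times.append(now)
        return len(times) >= 4                           # award the badge?

    for i in range(4):
        awarded = on_checkin(42, "foo", now=1000 + i)
    print(awarded)                                       # True on the 4th check-in

Storage is a handful of timestamps per user with any active state, rather than the full history.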
Why not keep the algorithm and process in batches? You don't need to have EVERY users checkin history in RAM at any moment. Throw the checkin on a queue, have a few processes that query a db for the checkin history and you're fine. Heck, if you delay the awards it's also a good excuse to throw the user an alert to come back to the site/app.
That is definitely one of the options we are considering. It (obviously) involves a change to the product, which we have to think about carefully, but it certainly could help from a technical standpoint.
It also depends on your algorithms. Some algorithms are amenable to "running tallies," so a third possible approach would be to store and update various values based only on the aggregate past data plus the incremental data, instead of looking back through the entire history and recomputing when a new piece of data comes in. This of course depends on whether or not this is even theoretically possible with what you're computing.
Can you design the system such that you can turn off ("darkmode") features in an emergency? For example, in an emergency you can turn off badge-awarding as it is known to be RAM intensive (and degrade by seeking to disk).
To the extent that you do this across the board, you'll have yet another tool to defend against being over capacity.
We have a framework for doing this sort of thing but it's only implemented around certain features so it doesn't provide us a ton of benefit at the moment. Obviously doing work to give us more control like this could be a big win, so we'll probably be doing some of that.
There is no better time than during an outage to add these features. In my experience, band-aids are a better way to resolve an outage than fixing root causes. Make the problem go away by commenting out code, returning empty result sets, or whatever - and re-balance your partitioned database at some later date.
you don't need to keep their entire history in ram; you only need to keep the check-in history for places they are on the verge of becoming mayor of ("you are one day away from becoming mayor of xyz"). that you can calculate offline, and then on check-in all you need to do is see if onedayaway:xyz:user_id => true, which you can keep in memcache
You could set up a user-place relationship and simply increment the number of check-ins for that place, then periodically correct the number, subtracting old ones which no longer count. Or look at the entire history only if there is a chance of getting a badge based on this number and otherwise just increment it. For badges for the total number of check-ins this is even easier, since no recalculation is necessary.
I suppose if your CPU load is not too high you could even run background processes to monitor "near-badge" status for all users with high enough frequency. You could really get wild with this: people with the most recent check-ins should get scanned first, and of course this list should be updated as a separate entity so it stays small and always in memory.
This is similar to one of the problems I used to deal with in tracking a huge volume of billing events. Since the data (checkins) are immutable, it's fairly easy to build summary running totals daily. The total count then becomes the running total plus the current day's count. The summary job can be run offline in the background. Just make sure to mark the current day's events as processed and update the running total in the same transaction.
This way, the number of current-day events is pretty small and you don't need to keep all historic events in memory.
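Roughly, with stand-in data structures for the persisted pieces:

    # total = persisted running total + today's (small) unprocessed batch;
    # a nightly job folds today into the total
    today = []                       # stand-in for today's check-in events
    summary = {"count": 0}           # stand-in for the persisted running total

    def record_checkin(event):
        today.append(event)

    def total_checkins():
        return summary["count"] + len(today)

    def nightly_rollup():
        # in a real system, updating the total and marking events processed
        # would happen in one transaction, as noted above
        summary["count"] += len(today)
        today.clear()

    record_checkin({"venue": "cafe"}); record_checkin({"venue": "bar"})
    print(total_checkins())          # 2
    nightly_rollup()
    print(total_checkins())          # still 2, now served from the summary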
Have you thought about how much extra overhead there is in storing the checkins using MongoDB's BSON structures? Considering how often you do this, you might consider packing a specialized version of a user's history (at least for this purpose -- sort of a denormalization) into an array of 64-bit placeIDs. This means a checkin would only occupy 8 bytes of space and there would be no overhead. 1 billion checkins = 8GB of data. You could even keep these in memcached and use a sort of write-through strategy.
EDIT: I thought about this a bit more, and the checkins are probably time-sensitive, so another 32-bit timestamp (Foursquare Epoch) would need to be stored for each checkin.
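With the timestamp that's about 12 bytes per check-in; a quick sketch with Python's struct module (field sizes as proposed above, nothing foursquare-specific):

    # one check-in packed as a 64-bit placeID plus a 32-bit timestamp
    import struct

    CHECKIN = struct.Struct(">QI")            # 8 + 4 = 12 bytes per check-in

    def pack_history(checkins):
        return b"".join(CHECKIN.pack(place, ts) for place, ts in checkins)

    def unpack_history(blob):
        return [CHECKIN.unpack_from(blob, off)
                for off in range(0, len(blob), CHECKIN.size)]

    blob = pack_history([(981234567890, 1288000000), (1122334455, 1288003600)])
    print(len(blob), unpack_history(blob))    # 24 bytes for two check-ins

So 1 billion check-ins comes out around 12GB instead of 8GB once the timestamp is included.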
Unfortunately we have some badges that require more than just the venueid + timestamp, but some sort of "compressed" checkin format is something we'll probably end up looking at.
Bummer. It's really amazing how 1 seemingly simple (but quite important and engagement driving) feature like instant gratification badges can drive your architecture decisions and keep you guys up at night.
you can analyze the statistics of when/where specific people are most likely to check in and pre-load (or even pre-calculate) the data in advance.
For example, starting at ~9pm the data about check-ins to the library may be safely unloaded to disk (the probability of needing this data is low, and reading from disk for those rare cases would do just fine) and be replaced in memory with data about check-ins to clubs, etc...
I think what you are looking for is called "memoization" and is common in many functional programming languages such as Lisp; however many other languages have this readily available.
Depends on what you are doing. If you have high volume and care about latency, you pretty much have to put everything in RAM.
If your disk is spinning at 6000 rpm, a given sector spins past 100 times per second, which makes lookup time anywhere from 0 to 1/100th of a second, or 1/200th of a second on average. Disks only let you look for a limited number of things at once, so you can do less than 1000 disk seeks per second. Period.
If data is living on disk and you have high query volume, it is really, really easy to blow past that limit. The solution is to shard data on multiple machines in RAM. This gives a fixed cost per unit of RAM. As long as you don't exceed available RAM, it works well. Luckily it isn't hard to monitor available RAM and respond in advance. (They didn't do that in this case.)
If you don't care about latency, or have a lower query volume, then you can live with data on disk, and frequently accessed data in RAM. A few well-designed caching layers in front can give you even more headroom. This is much cheaper. But has more complicated failure modes, and they can be tricky to monitor properly.
> Is it acceptable/preferred to store your entire db in RAM?
If you can fit your db into RAM that's the ideal, but of course you also have to have a reliable, persistent backup. The best current compromise is probably PCI based SSD drives.
John Ousterhout (of Tcl fame) and his group at Stanford are working on a project called RAMCloud that's exploring the feasibility of storing data in RAM at all times, using disk only as backup. See his projects page http://www.stanford.edu/~ouster/cgi-bin/projects.php for more info. IMO, given the continuous improvements in RAM prices and network latencies, this type of setup for permanent storage will be the norm in data centers in a few years.
It doesn't mean the DB files have to be stored in a ramdisk or something like that - you just need enough ram so your working set can fit in ram.
These new projects really seem a lot like re-hashing problems that have already been solved over and over again - people just keep forgetting to do proper systems engineering in the first place. There is no magic bullet.
In general you want to have memcached implemented for stuff like this. Keep the most-used stuff closer/faster with memcached.
I don't understand why they are running the whole nosql db on those monster machines - it just defeats the purpose. And the sharding architecture they are mentioning is quite error prone. I would go with separate read/write channels for things of their nature - it appears they don't have that either.
It's okay to have hundreds of terabytes of data, but you need to keep your "working set" hot. Even if you have 10TB of data, only 1TB of it might actually be hot.
It is easy to turn swap off. There are many reasons why it can be good to turn swap off in servers. (But only if you've got sufficient monitoring to hear about it before you run out of RAM.)
It can be entirely acceptable and preferred. Knowing your working set size and how your memory is being utilized is a critical part of any systems planning... and memory is one heck of a lot faster than disk, hands down, no debate there.
Note, we're not necessarily talking about a ramdisk here or specialized software (though we could be) - we're talking about proper system design with enough ram and tuning to get the response times you need.
For example, if we had notifications in place to alert us 12 hours earlier that we needed more capacity, we could have added a third shard, migrated data, and then compacted the slaves.
Where did Foursquare find their engineers? I hope no one lost their job here but this is pretty elementary stuff.
It's true that this is elementary in and of itself, but looking at things with a bit of a wider lens shows the complexity. We're a small engineering team (10 people) working on a product that is growing extremely fast both in terms of usage and feature set. Meanwhile we're also pretty much constantly re-architecting things to keep up with growth and also doing the immense work of growing the company up from 3 people to 33 and beyond (this has turned out to be WAY HARDER than I would have guessed going in).
Further, there are lots and lots of different things that we need to be monitoring at any given time to make sure that everything is going ok and we aren't about to run into a wall. Automated tools can help a lot with this, but those tools still need to be properly set up and maintained.
I'm not saying we didn't screw up. We had 17 hours of downtime over two days. We screwed up bad, and we feel horrible about it, and are doing a lot to make sure that we don't screw up the same way again.
But it's not because we're morons that never thought about the fact that we should be monitoring memory usage. We just got overwhelmed with the complexity of all that we're doing at once.
In addition to hiring scalability experts (many will claim to be experts but most aren't), talk to your investors to find outside technical advisors who have worked on and are still working on similar problems.
Find advisors from Google, Facebook, and other places who have dealt with these issues. In my experience, you have as much to learn from people who have made mistakes as from people who have succeeded.
Also, in my judgement, you guys are a little too risk-tolerant for your significance (first major Scala 2.8 upgrade, largest MongoDB deploy...).
I didn't realize you only had 10 guys behind the scenes, that's pretty impressive in and of itself. In all actuality this is a good problem to have and one I wouldn't mind having. It means you're growing and fast.
The fail here isn't just that you should have set up a shard 12 hours earlier or whatever. The other fail is that it seems (from the group's post) that you were relying on your data set fitting entirely in RAM, but the underlying I/O system wasn't quite up to scratch in terms of being able to page in/out that additional 1GB of data.
It's not like your 66GB is hot all the time (or even big by any sort of measure), so that makes no sense to me, and that part of things needs to be further explained by 10gen and/or 4sq.
I agree completely about the complexity of monitoring solutions for small teams with fluctuating applications. However, something as simple as htop running on an extra monitor would have alerted you of this issue long before downtime resulted.
Awww come on, it's easy to say from an outside perspective. Regardless, the problem was handled well in the end, and we all get the benefit of understanding these limitations better. I think us tech people got the good side (information) out of this ordeal :)
I work on a team of 2 where I'm responsible for a handful of servers. I chimed in because I'm in a similar position; I've been looking for a monitoring solution for a while now. Things like nagios and zenoss are overkill, but lack of time has prevented me from finding an ideal solution. That said, I keep htop open and running at all times, and it has saved my ass on more than one occasion. I say htop because of the color coding it provides; if things start going red it attracts my attention.
Nagios is pretty nice. It's dead easy to write custom monitors and clients are everywhere, there's even a Firefox extension. It requires a bit of learning to get going with but it's not so bad and the pay-off is big.
That said I'm looking at monit too. I hear it's quite nice and has less of a learning curve.
Stop thinking about it and set it up. There isn't anything to think about. Nagios (I like using it through groundwork) is certainly not overkill... it's thinking like that that gets you in trouble in the first place.
Get that tool in place NOW while you are small, so it's there when you get bigger... so you can hire people to watch it while you move on to other things.
You will always have a lack of time, and always be busy. Spend a weekend putting up Nagios, setting up alerts over SMS or XMPP or Twitter or whatever you want, and then move on. This is fundamental to systems management.
Don't overthink it - just get nagios/cacti (again, try groundwork) up and running and graph EVERYTHING you can... you'll be very, very glad you did.
A bit harsh but a good point. There are a few red flags here. Resource monitoring on the servers, like you mentioned, seems pretty obvious. Especially considering that the system was almost guaranteed to crash when it ran out of RAM.
Sharding needs to be managed in a way that evenly distributes the data. I would never do it by user ID unless a proper analysis of the data showed it to be a fairly even distribution.
It's probably a technical debt thing. FourSquare is immensely popular, and the team probably meant to have those checks and balances in place but got 'too busy' to implement them.
That doesn't excuse anything, but these oversights can happen even when you've got primo talent on board.
When you're a small company growing fast you have an endless list of things you really should have done a long time ago. That's no excuse for not preventing a predictable and painful failure, but do try to remember it's a lot easier to criticize as an observer with no time commitments.
Perhaps it's just my own background that pushes me this way - but as soon as my deployment was growing to eat up anything over about 60% of my available memory, I'd be looking at deploying more. Your high-water mark for spinning up more nodes and/or upgrading shouldn't be your maximum physical ram capacity - that's just nuts.
Why on earth would you want to keep 236 million check-in documents (66 gig / 300 bytes) in memory ?
Here is an idea: write an algorithm that keeps the most recently used 100 million check-in documents in memory. That'll save 38 gig of RAM.
Or how about this idea from the 1970s: write an algorithm that keeps the most recently used 4 gig of check-in documents in memory. That will save 62 gig of RAM, and the most recently active 14 million users (4 gig / 300 bytes) will still enjoy instantaneous response.
I am talking about the database looking in memory for what it needs to answer a request; if it doesn't find what it needs, it reads from disk. Since it has been allowed limited memory, when it puts something new in memory, it has to kick something old out. It uses a simple algorithm to identify what to keep and what to kick out. One of the benefits of this (obvious) design is that when too many users are active at the same time to keep all of them in memory, a few of them experience a small lag (as opposed to all of them experiencing a complete crash). If you have a few terabytes of disk then you have to really be asleep at the wheel (like for a decade) to totally crash.
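What's being described is basically an LRU cache over the check-in documents. A minimal sketch (this is the general technique, not how MongoDB actually behaves -- it memory-maps the data files and lets the OS page):

    # keep the N most recently used documents in memory, read the rest from disk
    from collections import OrderedDict

    class CheckinCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.docs = OrderedDict()

        def get(self, doc_id):
            if doc_id in self.docs:
                self.docs.move_to_end(doc_id)        # mark as recently used
                return self.docs[doc_id]
            doc = self._read_from_disk(doc_id)       # the slow path: small lag
            self.docs[doc_id] = doc
            if len(self.docs) > self.capacity:
                self.docs.popitem(last=False)        # evict the oldest entry
            return doc

        def _read_from_disk(self, doc_id):
            return {"id": doc_id}                    # placeholder

    cache = CheckinCache(capacity=14_000_000)        # ~4GB worth of 300-byte docs
    print(cache.get("checkin:1"))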
I was curious about a similar question to yours: if they had 66GB of RAM, why did going just slightly over the threshold cause such drastic paging for them - surely queries aren't touching all parts of their dataset equally?
IF your system is getting pounded with queries, and you suddenly have to page to serve those queries, the ability of your system to respond to queries and other operations goes WAY down, and the queries are just going to pile up, causing further swapping and load - it's a vicious cycle.
Paging doesn't happen in 300-byte chunks, though. It's usually 4 KB, so if a record or one of its twelve neighbors has been touched, the page stays in. The granularity isn't right.
The problem isn't the page granularity, which is just a convenient size for the general case and works pretty well in most cases. The problem is to not keep excessive data in memory; in this case, they are keeping ALL data in memory due to some architectural decisions.
OS caching and paging work pretty well at keeping the frequently used data in memory and leaving the rest on disk, and thus utilize the system resources more efficiently.
It sounds like even if the check-ins had grown evenly across the two shards, that would have only saved them for about another month. Without close monitoring of the growth in memory usage, foursquare was still due for an outage.
One thing that I admire about Foursquare is that they don't only use the technology (MongoDB, Lift Framework, etc..) but also invest in it. If it wasn't for them I think that Lift wouldn't be such an advanced framework as it is now.
I find it interesting that with all this talk of scaling, and the specialized tools that go along with it, 4sq's service was essentially running on 2 servers.
Maybe I only need 1 server to be profitable. Hmm...
While the details are very interesting, there are still many questions to be answered (on both sides):
- how difficult would it be to bring up read-only replicas? (hopefully that should take much less than 11 hours + 6 hours)
- why could the 3rd shard accommodate only 5% of the data?
- how can you plan capacity when using the "wrong" sharding (basically leading to unpredictable distributions)?
> how difficult would it be to bring up read-only replicas? (hopefully that should take much less than 11 hours + 6 hours)
Bringing up read only replicas would have been easy, but our appservers are not currently designed to read data from multiple replicas so it wouldn't have helped. We hope to make architectural changes to allow for this sort of thing in the future but aren't there yet.
> why could the 3rd shard accommodate only 5% of the data?
The issue wasn't the amount of data the 3rd shard could accommodate, but the rate at which data could be transferred off the 1st (overloaded) shard onto the 3rd shard.
> how can you plan capacity when using the "wrong" sharding (basically leading to unpredictable distributions)?
We weren't really using the "wrong" sharding. And even the uneven distribution we saw (about 60%/40%) wasn't totally horrible. Not 100% understanding your question here.
What exactly is the rate like for transferring data across MongoDB instances on EBS? Was the overhead of going across EBS volumes the major factor in the length of the downtime?
The 60%/40% sharding distribution doesn't seem too bad in the big picture; are there any plans to even that out?
I didn't mean "wrong" in the full sense of the word. What I actually mean is that sharding based on user id will most probably not give/guarantee predictable distributions of the writes generated by users (basically there can be no guarantee that chunk#1 of users will always generate the aprox. same amount of data as chunk#2 of users).
Looks like there is some data engineering to be done to back memory up with disk storage. I think you guys, by now, have enough information about users and checkins to create algorithms that can determine "hot" checkins vs. "cold" checkins as they come in. Because not every user is the same, a more user-centric view can definitely help you come up with a more scalable (or more gracefully failing) system. For example, I don't check in as much as some of my friends who check in 10-15 times a day. So there is no point in giving my checkins as much priority as those busy folks'.
Can we clear up one issue? Sharding on userid -- bad or fine? I say fine. If your algorithm is "if userid between A and M then server0, else (N-Z) server1", then you could get unbalanced. But if your algorithm is "if random 50/50 hash of userid = 0 then server0, else (random 50/50 hash = 1) server1", then you are going to stay almost exactly balanced (law of large numbers). You would never, ever, get anywhere near a 60/40 imbalance. So the choice to use userid, if implemented correctly, is completely fine.
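A quick simulation backs that up, at least for user counts (md5 is just a stand-in for any decent hash; how much data each individual user generates is a separate question):

    # split 2 million synthetic user ids across two shards by hash and
    # measure the imbalance -- it lands well under 1%
    import hashlib

    counts = [0, 0]
    for uid in range(2_000_000):
        h = int(hashlib.md5(str(uid).encode()).hexdigest(), 16)
        counts[h % 2] += 1

    print(counts, abs(counts[0] - counts[1]) / sum(counts))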
How much can using SSD's for paging help with this? Or to just serve the reads, since they're excellent at random access - surely typical of Foursquare's usage?
They are dealing with web scale sharded NoSQL realtime geo scala. Old rules don't apply when 80% of the words describing your company didn't exist two years ago.
Yea but like you said, since half of them are just for redundancy, they were in effect only running 2 database servers. I must say I find it stunning that a service with that much traffic, especially one with an infrastructure that requires the entire database to be in-memory, would be operating with just 2 database servers running on EC2.
I know hindsight is 20/20, but I can't imagine the foresight was any worse than 20/30.
This may be a little naive seeing as I haven't used MongoDB and don't know anything about foursquare's architecture, but wouldn't a better result have been to just drop requests that were hitting the disk? Sure, some people would have been getting a suboptimal experience, but arguably that's a lot better than not having your service up at all.
From my understanding, that 'routing' decision would have to take place inside of Mongo -- null-routing DDoS packets on a network is one thing, dropping certain database requests on the fly inside a database server sounds a littttle bit trickier.
Tunable concurrency control for the number of incoming requests would have been good - before they even hit mongo.
Then you can just dial things down a bit (making customers wait a bit) and let the system recover, and tune things back to the point of optimal behaviour, rather than letting things overload. (You set your concurrency limits at a point where you know the system is just beginning to slow down but is still performing satisfactorily)
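The usual way to express that is a bounded semaphore (or connection-pool limit) in front of the datastore; a generic sketch, not tied to any particular driver:

    # cap in-flight database requests; beyond the limit callers wait briefly
    # or get shed, instead of piling more load onto an already-slow datastore
    import threading

    MAX_IN_FLIGHT = 64                   # tune to where the db still keeps up
    _slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def run_query(query_fn, wait=0.5):
        if not _slots.acquire(timeout=wait):
            raise RuntimeError("overloaded, shedding request")  # or a fallback
        try:
            return query_fn()
        finally:
            _slots.release()

    # usage (hypothetical): run_query(lambda: db.checkins.find_one({"uid": 42}))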
While I understand this is foremost a monitoring and architecture issue (on both the mongo and 4sq sides), I'm curious how dangerous it is to run such IO-dependent systems on EC2. Would the prolonged downtime (due to shard migration, etc.) have been as severe if they were running on real hardware? What if these two servers were on SSD RAID 0?
The duration of the downtime seems to have been dependent on not just disk throughput, but also network and CPU throughput. Eliot did mention that a large portion of the downtime was partly caused by the slowness of EBS. Making a rough estimate, I think the downtime would probably be about 2/3 of how long it was if it had been on an SSD RAID. However, then you run into the issue of having to maintain your own servers; which, judging by the fact that they are using EC2 and EBS so heavily, Foursquare does not have any desire whatsoever to manage their own infrastructure.
It's an interesting angle if that's the reason they are using EC2. Dedicated hosting might be more expensive, but hardware managed is pretty efficient at this point.
Also, given the vast number of EC2 instances they could afford, it seems counterintuitive to be running only two mongo shards. If you are that stingy about spreading data around, you might as well be using dedicated.
It doesn't seem that it was necessarily stinginess or neglect, but rather stinginess imposed by their situation. Like the article said, it took hours to create a new shard, downtime that Foursquare definitely did not want.
So some people still use built-in memory allocation for large-scale, memory-intensive projects like mongodb, eh? Pity that; they should have looked at why exactly memcached (and any other sane memory-write-erase-intensive application) does slab allocation.
I wonder if there ought to be a move to allow for sharded systems to lump groups/pages nicely.. kind of like MySQL's partitioning.. where you won't run into the issue of sparsing out pages when you migrate data.. instead you'd effectively migrate pages.
A typical way to deal with uneven distribution in sharding is to do Hash(key) mod P, where Hash is a cryptographic hash and P is the number of partitions. This way the hash function randomizes the distribution of keys.
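In code that's just (md5 as the stand-in hash):

    # deterministic partition choice: cryptographic hash of the key, mod P
    import hashlib

    def partition_for(key, num_partitions):
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_partitions

    print(partition_for("user:12345", 3))   # a key always maps to the same shard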
I am not trying to start a debate of sql vs nosql, but i have an honest question. What if foursquare was using a standard RDBMS like Oracle? Would they have run into the same problem?
If your working set (the data that you are touching regularly) grows past available RAM, you very rapidly switch from a system where disk access plays no part in most queries, to a system where it is the limiting factor.
Because disks are so much slower, and having queries take only ten times as long can cause them to rapidly queue up, you end up with an unresponsive system. The appropriate response is almost always to turn off features to get the size of your working set back down, which buys you time to either alter the app to need less data, or get more RAM into the box.
and out comes mongo's dirty little secret - you have to have enough ram in your boxes to hold not just all the data in ram, but all the indexes too, or it completely shits the bed.
putting hundreds of gigs of ram in a box isn't cheap.
are the foursquare folks considering rewriting with a traditional datastore like postgres and some memcached in front of it?
Your comment shows a scary lack of depth - this is the "dirty little secret" of ANY application that needs serious scalable data performance. You need to put your data in RAM where it is quickly accessible.
As soon as you go to disk especially on a low I/O cloud box you are going to be in 'extremely slow' territory. This is why, even before the current NoSQL movement, everyone was using Memcached all over the place. Not because it was a fad but because they needed speed.
SQL Databases have RAM caches as well which give you a pretty similar behavior. And big surprise - if you throw more memory at them - they run faster! Because they cache things in memory! In this case I believe that what Eliot was highlighting was that Foursquare's performance tolerance requirements were such that going to disk was not an optimal situation for them.
There are plenty of applications out there using MongoDB without keeping it all in RAM; I've deployed and maintain several. More memory certainly gives you better performance, but it tends to act as a most-frequently-used cache.
It never "Shit the bed" here to use your sophomoric vernacular. As the post states, it started having to go to the disk after it surpassed the memory threshold which slowed things down... nothing "crashed" as far as the description states. But a read/write queue backlog on any database is likely to exhibit the same behavior.
Your postgres + memcached solution fixes what exactly? They would still need the same slabs of RAM to solve the problem, lest they go to disk on postgres and slow to a crawl the same way they did with MongoDB.
Err, you don't... as the article states, their usage and load pattern does. If you want the speed of keeping the DB in RAM, you need the RAM available whether it is mongo, MySQL/NDB or memcached. (-1 for not RTFA)
I wonder if burglaries also went down during the outage, since the bad guys couldn't tell if their targets were at home or busy being the Mayor at the local bowling alley.
Well Foursquare's Scala/Lift front-end sure has been holding up nicely despite all the FUD about Lift's stateful architecture. Funny that Mongo DB, which gets all the scalability hype, is the first thing to have trouble scaling.
The described problem - imbalance in shard allocation - is not one that's specific to MongoDB.
In fact one of the huge choices one has to make when deploying, say, Cassandra is the partitioning algorithm. Sequential partitioning (say, users a-l go on partition 1 and m-z go on partition 2) gives you certain capabilities for range queries but also risks overloading if one particular user is significantly larger than others.
Cassandra recommends random partitioning as a way of better balancing your data across shards.
You're going to run into the same problem on just about any sharded setup (With any software) - how do you make sure that you are distributing new chunks/documents/rows in a way that doesn't overload any given server.
Or your datastore can take care of doing that for you, c.f. BigTable. This is one of my main gripes with most of the current set of NoSQL offerings — they leave too many decisions in the hands of developers.
Whilst it would definitely be advantageous for all developers to understand the intricacies of various CPU and OS scheduling algorithms, it's not an issue that most developers have to deal with directly. The App Engine datastore, in particular, proves that it is possible to create NoSQL datastores which don't force developers to think about issues like load balancing.
Well - in many cases they leave the decisions in the hands of developers because they have made a conscious choice to let the developers decide.
BigTable is more than anything a filesystem - it is NOT a database. It lacks true user configurable indexes, custom sorting, querying etc. These features and the data storage requirements to support them are a very different prospect from what a filesystem (even a distributed one) needs.
Right, kick ass. Well, don't want to sound like a dick or nothin', but, ah... it says on your chart that you're fucked up. Ah, you talk like a fag, and your shit's all retarded. What I'd do, is just like... like... you know, like, you know what I mean, like...