Why MongoDB Never Worked Out at Etsy (mcfunley.com)
137 points by mcfunley 1580 days ago | 70 comments

I was expecting a post describing why MongoDB specifically wasn't a fit for their use case, but the TL;DR version is basically:

"Before you get too excited, the reason for the failure is probably not any of the ones you're imagining. Mainly it's this: adding another kind of production database was a huge waste of time."

The blog title is misleading IMO. It could as well be titled "Why [any other DBMS] Never Worked Out at Etsy" and the conclusion would be the same.

Your proposed title is very confusing and meta. The current title is only misleading if you are going into it with biased expectations about what it's going to teach you. The article is interesting precisely because it is not just another "NoSQL sucks" screed.

What I meant was that the article was not about MongoDB at all. I think thefreeman summarized it better than I did in his reply: the author is talking about his experience running more than one DBMS in production, not about MongoDB in particular.

I'm not even saying whether the article was interesting or not. It surely has its merits, and it touches on a point worth discussing (the downsides of trying to cover one DBMS's weakness by throwing another DBMS into the mix), but I was uninterested. I was expecting to hear why MongoDB wasn't a fit for his use case, to better understand when to use NoSQL and when not to. So yes, I was indeed biased, waiting for something different, but in my defense, I was biased because the title made me think this way.

I think his point was more "Why you shouldn't try to run two different DBMS in production"

That would be a silly point to make. Many people are running multiple DBMS in production without regrets. Each DBMS has strengths and weaknesses in different areas, and for many businesses it doesn't make sense to shove everything into a single DBMS just because things are easier to manage that way.

His point was "in our specific scenario the benefits of using MongoDB were outweighed by the difficulties in having to manage two DBMS in parallel [at a time when MongoDB hadn't been around for long and we had to set up everything ourselves]."

And it depends on how you define "DB". We've recently started using Redis for a number of things, and it has been incomprehensibly faster than our SQL database when it comes to rapidly finding and inserting records in (the equivalent of) a billion-row table. It's also far cheaper to run than significantly beefing up the database server just because of that one bottleneck. Easily worth the time and trouble of setting it up (neither of which was very much).

Are there? Who are these magical people?

If you've got data sitting in one place and then data sitting in another, it's generally a massive fucking pain in the ass.

Speaking from my experience anyway. Bonus points if one of them was written by a company in the early 2000s who didn't trust those newfangled RDBMSes to get it right. Extra bonus points if they thought UTF-8 was for sissies.

I don't consider myself to be magical, but anyway — since you were asking.

We primarily use MySQL and CouchDB (BigCouch). Our user records (accounts and payment) are stored in the RDBMS, while the data the users created is in BigCouch. We enjoy the schemaless nature, durability and scalability of BigCouch a lot.

Depending on your definition of "database system", we also have a hefty Solr index (for search), some Redis (no persistence only to connect systems/services via pubsub) and Memcache (cache).

Yep ... in interacting with many developers, setups like this are almost commonplace.

Soundcloud wrote a blog post http://backstage.soundcloud.com/2011/04/failing-with-mongodb... about the specifics of how they failed in implementing an analytics platform in MongoDB, then went with Cassandra (and why). I have been at a company where it was on developers to deploy mongo clusters, and setting up logging can suck, but still there are no numbers or even application integration specifics here. As someone pointed out, manual denormalization can suck. There are options like MongoHQ and Heroku though, so this shouldn't resolve to "don't try a new data store, it's hard and possibly buggy".

Genuine question: who is using MongoDB successfully in production, and at scale? I'm not aware of anyone myself. I hear of it being used in hackathons etc. because it's so quick to set up, but I'd be curious to know what people are using it with.

I run two sites, one is the perfect use-case for MongoDB - http://www.AUsedCar.com , it's a used car search engine. We've seen nothing but benefits by switching to it from MS SQL Server. Queries are way faster etc... It's a great use case because 99.9% of DB interactions are read-only searches.

My other site, http://www.BudgetSimple.com on the other hand is using SQL Server (in the process of porting to MySQL). It would not be a great use-case for Mongo, because there are usually as many updates, deletes, and inserts as there are reads, and instant database integrity and a schema are important.

Anyone that claims a tool is perfect for every problem is probably wrong. You need to figure out the best one for your use case, and load test, security test, performance test, etc... until you have a good guess for the right answer.

> It's a great use case because 99.9% of DB interactions are read-only searches.

Did you ever consider a datasource like elasticsearch? If yes what made you choose mongo?

No, and I actually used to work for a company that made a similar type of search engine!

That probably would have worked as well (don't think I considered that specific solution). Mongo came out on top because of its wide use (among other things), i.e. it's pretty easy to find support and lots of stories about how to scale it under different scenarios.

Can you give me an example of a query that was made much faster?

The biggest difference was pulling a random record. This is an odd use case for most people, and not even a common one for me, but say I wanted to show you a random car near your location. The more common use-cases that were sped up were any geo-location searches, i.e. you're searching within X miles.
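To make the two query shapes concrete, here is a minimal sketch in plain Python of "everything within X miles" and "a random record near you". It is not MongoDB's geospatial query API and not how the site above implements it; the listings, coordinates, and function names are made up for illustration, and a real database would use a spatial index rather than scanning every record.

```python
import math
import random

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Hypothetical listings; a real system would query an indexed collection.
cars = [
    {"model": "wagon", "lat": 39.9526, "lon": -75.1652},   # Philadelphia
    {"model": "coupe", "lat": 34.0522, "lon": -118.2437},  # Los Angeles
    {"model": "sedan", "lat": 40.7357, "lon": -74.1724},   # Newark
]

def cars_within(lat, lon, radius_miles):
    """Linear scan: keep every car whose distance is within the radius."""
    return [c for c in cars
            if miles_between(lat, lon, c["lat"], c["lon"]) <= radius_miles]

def random_car_near(lat, lon, radius_miles):
    """Pick one nearby car at random, or None if nothing is in range."""
    nearby = cars_within(lat, lon, radius_miles)
    return random.choice(nearby) if nearby else None

# Searching 100 miles around New York City.
nearby = cars_within(40.7128, -74.0060, 100)
pick = random_car_near(40.7128, -74.0060, 100)
```

The scan is O(n) per search, which is exactly the cost a geospatial index is meant to avoid.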

I'm using it for the analytics suite at my company (large ecommerce multinational).

It's weird at first, coming from a background of Access then SQLite then MySQL/PHPMyAdmin, but you get used to it. I essentially treat it like a gigantic Python dictionary object.

The sharding is too much of a ball-ache to set up so I've created an optimal way of distributing/mapping files across our cluster to make use of all machines.

Data integration is nice. As long as you resist the temptation to output each integrated line to the terminal, pymongo and its C extensions can integrate a ~500 byte record in ~0.0001 seconds.

Basically the main advantage is not having a schema whatsoever - you can just add random attributes to documents whenever the hell you want. But later you have to be careful with exception handling since documents might not have the attributes you expect.
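A minimal sketch of that trade-off, assuming documents come back as plain Python dicts (which is how pymongo returns them). The field names, the sample documents, and the `DEFAULT_CURRENCY` fallback are invented for illustration; no database is involved.

```python
# Hypothetical documents from one collection; attributes were added over time.
docs = [
    {"_id": 1, "sku": "A-100", "price": 19.99, "currency": "EUR"},
    {"_id": 2, "sku": "B-200", "price": 5.00},  # written before "currency" existed
    {"_id": 3, "sku": "C-300"},                 # no "price" at all
]

DEFAULT_CURRENCY = "USD"  # assumed fallback, not something the database enforces

def describe(doc):
    """Build a display string while tolerating missing attributes."""
    price = doc.get("price")  # .get returns None instead of raising KeyError
    if price is None:
        return f"{doc['sku']}: price unknown"
    currency = doc.get("currency", DEFAULT_CURRENCY)
    return f"{doc['sku']}: {price:.2f} {currency}"

summaries = [describe(d) for d in docs]
```

Every reader of the collection needs the same defensive handling, which is the "schema moved into the application" cost raised in the replies.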

Where exactly is the net gain in your situation?

So right off the bat, you've lost the querying power that SQL offers. When dealing with data that's intended to be analyzed, that sounds like a pretty big loss.

Clearly its built-in sharding support, which is often touted as one of its biggest benefits, wasn't suitable for you. So you had to invest some time and effort coming up with an alternate system. That sounds like a loss to me.

When it comes to the schema issue, it sounds like you haven't actually reduced the effort or work in any way, but merely pushed it somewhere else. Like you admit, you still do have to deal with the schema, it's just handled within the application logic, rather than the database. That sounds worse to me, especially if there is more than one application using the database.

I'm just not seeing the benefit.

> right off the bat, you've lost the querying power that SQL offers.

Can you point me to any reasonably-priced database systems that allow me to execute SQL queries against a cluster of shared-nothing machines running commodity hardware? I'm not interested in paying $20-100K/TB/year for Vertica and friends.

SQL can be great if you have vast amounts of money or if your data can fit onto a single machine. When neither of those things is true though, nosql DBs become important.

Well are we talking with shard capabilities built-in, or having to roll your own? If you can roll your own then I suggest looking at Galera (we use this for our production MySQL stuff) http://www.codership.com

If it has to have built-in sharding then I can't think of anything off-hand.

MySQL Cluster?

Auto sharding, shared nothing, SQL support, it does a pretty good job of things.


MySQL cluster is another option beyond MySQL Galera (and Cluster actually has a NoSQL layer on top of it.. FYI)... but I have to say if we want to talk about scaling issues for MongoDB, you might not be happy with some of the limitations of NDB.

Foursquare[1] and Codecademy[2] are two that I know of.

[1] https://www.youtube.com/watch?v=GBauy0o-Wzs

[2] https://www.youtube.com/watch?v=RkPmVQNesZA

Gaug.es uses Mongo. Its data model is pretty simple (just counters), but he has blogged about the approach in detail.

MongoDB Counters: http://railstips.org/blog/archives/2011/06/28/counters-every... http://railstips.org/blog/archives/2011/07/31/counters-every...

Kestrel: http://railstips.org/blog/archives/2012/03/05/misleading-tit...

Gets some decent traffic and works well.


They have a list of who's using MongoDB in production. http://www.mongodb.org/display/DOCS/Production+Deployments

How up-to-date is that list? I mean, in terms of removing entries that no longer apply. Some date from 2009 to 2011. Are these systems still in place, and actively being used today?

I've heard of or directly witnessed enough situations where somebody with influence, but maybe not much actual technical experience, pushes for the use of a NoSQL database of some sort. Yes, the project is implemented and often does end up in production for at least some amount of time. But it doesn't survive long. Problems arise, and the system is either discarded, or moved to a more traditional relational database system. That's why I'm curious about that list, and how many of the entries are still valid, as we approach the beginning of 2013.

That page is updated regularly. You can also check out more in-depth case studies of MongoDB deployments here: http://www.10gen.com/customers

And there's a growing list of stories at 10gen.com/presentations

Some good ones to point out: Analytica: http://www.10gen.com/presentations/mongosv-2012/exploring-pu...

Apollo Group (The University of Phoenix) http://www.10gen.com/presentations/mongosv-2012/how-we-evalu...

AOL : http://www.10gen.com/presentations/managing-large-scale-data...

Github: http://www.10gen.com/presentations/mongosv-2012/mongodb-anal...

Banjo: http://www.10gen.com/presentations/real-time-location-based-...

Telefonica: http://www.10gen.com/presentations/mongodb-uk-2012/MongoDB-o...

MapMyFitness: http://www.10gen.com/presentations/mongodb-seattle-2012-mong...

Sailthru: http://www.10gen.com/presentations/mongodb-seattle-2012-sail...

Stripe: http://www.10gen.com/presentations/high-availability-mongodb...

eBay: http://www.10gen.com/presentations/mongodb-ebay

These are all large deployments. It's a mix of small startups, startups that grew up and large engineering companies (like eBay and Apollo group).

I loved this comment 'I noticed that "mongodb.org" is not on the Production Deployments List above. It appears this site uses Confluence which I believe uses PostgreSQL(or MySQL). It's hard to have confidence in an organization that does not even use its own technology.'

CERN. We got too excited and started using it for EVERYTHING (it started just being part of the LHC data analyzing project) and it didn't work in some cases, but for some projects it fitted in perfectly.


We have a very large MongoDB installation running in production and at scale, and it works pretty well.

That said, 99% of our production issues involve bugs in MongoDB and its inability to effectively use all available resources before it becomes unresponsive. I would say it needs a few more generations to become truly solid.

I'm sure the guys at MongoHQ.com would be able to provide you with better examples of "at-scale" setups.

foursquare pretty famously uses Mongo, and they handle a ton of data.

We're using it for Webpop (http://www.webpop.com) and have generally been very happy with it.

We were very well aware of its characteristics when choosing our DB, and didn't go in expecting any magic Web Scale or somehow getting a HA setup with plenty of durability with just one server.

For a multitenant CMS where you want to store documents with custom schemas, need more than just a key/value store and want some capability to do ad-hoc queries against custom fields, MongoDB is a pretty good fit.

Are you using mongodb for the analytics and how do you store/index/query custom fields?

We use mongodb for the analytics as well yes. It's a less obvious choice there than for the CMS part, but it's a good enough fit, in-place updates can be really handy and we prefer not having 2 different databases.

We don't index custom fields and for queries where that would be required we do the actual querying with ElasticSearch, but for simple filters on a custom field or the like, Mongo does fine.

We use it for a bunch of stuff at Disney. Games, websites, etc.

The bottom line is that MongoDB and MySQL are two different persistent data structures. MySQL is a more powerful data structure that can do more things. MongoDB is less powerful, but is more efficient at certain things. Due to premature optimization or shortsightedness, some folks romanticize the efficiency of a less powerful data structure (MongoDB) and fail to realize that their application really needs the more powerful relational data structure.

These things should be good learning examples for all.

This also makes it sound like whoever intervened to rewrite said feature in sharded mysql had an easy time. Usually this would not be an obvious port. However we don't know the technical nature of the feature or specifically why it failed.

I'm the developer that helped migrate the data from mongo to mysql (under mcfunley's supervision). Even a straightforward data migration becomes complicated when you have to do it without affecting the production feature or consumers of your public api (parallel writes to both dbs, snapshot and move the historical data, switch reads to the new db, etc). In addition, we took the opportunity to move the feature to a sharded architecture and to rethink the schema. Anyway, you're right in that it wasn't exactly an obvious or easy port.
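The sequence described above (parallel writes, backfill of historical data, then switching reads) can be sketched roughly as follows. This is not Etsy's code: both "databases" are plain dicts, and the class name, methods, and keys are hypothetical. A real migration also has to handle deletes, failures between the two writes, and verification.

```python
class MigratingStore:
    """Wraps an old and a new store during a migration (both plain dicts here)."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.read_from_new = False  # flipped once the backfill is complete

    def write(self, key, value):
        # Step 1: parallel writes keep both stores current for new data.
        self.old[key] = value
        self.new[key] = value

    def backfill(self):
        # Step 2: copy historical records without clobbering newer dual writes.
        for key, value in list(self.old.items()):
            self.new.setdefault(key, value)

    def switch_reads(self):
        # Step 3: only cut over when the new store holds every key.
        if self.old.keys() - self.new.keys():
            raise RuntimeError("backfill incomplete")
        self.read_from_new = True

    def read(self, key):
        source = self.new if self.read_from_new else self.old
        return source.get(key)


old_db = {"listing:1": "vintage lamp"}  # pre-existing data in the old store
store = MigratingStore(old_db, {})
store.write("listing:2", "wool scarf")  # arrives mid-migration, lands in both
store.backfill()
store.switch_reads()
```

Keeping the old store authoritative for reads until the cutover is what lets the migration run without affecting consumers, at the price of doubled write load for the duration.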

Congrats then! I've seen mongo features start simple, then end up containing a bunch of embedded lists of primary keys to SQL or other data stores, rather than what would be a bridge table or two. Not as elegant if you're not putting everything there (as some people here say is mandatory to prevent the overhead of running mongo on top of everything else, like you mentioned).

This is a really good point that doesn't bring up any direct slams against a particular tool; +1 to the author for that.

I've found using Mongo as a stop-gap for consuming JSON APIs extremely useful. You could probably s/Mongo/{nosqldb} there since it's nothing earth shattering.

However, as the only tech guy in our startup I'm always looking harder at Redis than Mongo for most of the problems for which a NoSQL solution might be tempting. I've recently had a lot of success with JSON in Postgres and knowing HStore is always there if I need it has firmly cemented my opinion that I don't need a separate NoSQL solution (yet). (Of course I am merely persisting data in JSON format- not querying on it).

Maybe it should be titled...

"Why MongoDB Never Worked Out Two Years Ago When We Tried to Run It For Our First Time For One Feature, And Beside Another Database Which We Really Considered Production."

I've seen and used MongoDB on multiple projects, big and small, and it's fine. It's a database that stores data. Use it for that purpose and you will be ok.

You didn't read the article because the point was that the lesson learned was that if you are going to have two data stores the human tendency is for one to be a second class citizen with regards to support by ops, etc.

This is totally reasonable. MongoDB, more than any other "NoSQL" database, directly competes with MySQL/Postgres as a general-purpose application database. I don't see a need to have more than one for most applications - at least as long as there is only one development/support team for that application.

Most of the production deployments I find on the internet are around 3-5 nodes. Are there any production clusters that are running 500-600+ nodes?

Disney runs over 1400 instances according to this presentation: http://www.10gen.com/presentations/mongosv-2011/a-year-with-...

Also foursquare runs a very large MongoDB deployment. http://www.10gen.com/presentations/mongodb-foursquare-cloud-...

Craigslist: http://www.10gen.com/customers/craigslist

Shutterfly also has a very large deployment: http://www.10gen.com/customers/shutterfly

1400 deployed instances doesn't necessarily equate to a 1400 node cluster. It seems to be very common for these companies to have several small to medium sized clusters...nevertheless, still pretty large deployments.

Wondering why everybody is using MySQL if Postgresql is supposed to be better. Are there any (startup) success stories involving Postgresql?

Perhaps one reason is that more hosting providers support MySQL but not PostgreSQL. Amazon AWS, Google Cloud SQL, and numerous others offer hosted MySQL solutions but not PostgreSQL. I'm unsure how much overall usage such service providers account for though; it would be an interesting stat.

Here and on Reddit I can read stories about people who are free to choose whatever they want. Nearly nobody uses Java or PHP if they can decide for themselves.

But it's always MySQL. Starting with MySQL, going back to MySQL, staying with MySQL.

I'm surprised to hear this coming from Etsy, a place I thought of as doing deployment right.

All these things should be simple. You already have (or should have) a unified system for dealing with logging/monitoring/graphing/init scripts/backup across multiple services that are far more different from each other than they are from mongodb (Sharding strategy and slow queries are probably an application-level concern). It shouldn't be hard - in fact it should be trivial - to add one more service. At last.fm (disclaimer: my experience was brief and getting on for two years ago) it felt like we were running every database under the sun, but we had a unified system for doing deployment/monitoring/everything, so it was no bother to add one more if an application wanted it.

Misleading article title's summary: We tried to use a technology that was less mature than another technology. We had to figure some stuff out that had already been figured out on the more mature technology. Using two technologies was more complicated than using one.

Oh look, another "We thought Mongo was a silver bullet and found out that was wrong" post.

Except, as others have said, that is not what this article is saying at all. That said, your comment seems to imply that you just read the headline, and (perhaps understandably) didn't actually read the article.

I've learned, especially on HN, that article titles can be extremely misleading.

"Mongo tries to make certain things easier, and sometimes it succeeds, but in my experience these abstractions mostly leak. There is no panacea for your scaling problem. You still have to think about how to store your data so that you can get it out of the database. You still have to think about how to denormalize and how to index."

I read the article. That statement makes it sound very much like they thought Mongo would be silver bullet for that feature.

I think the real issue here is that most people don't understand /how/ to use MongoDB.

The best use case for MongoDB is as a document store. I can essentially cache numerous MySQL requests into a compiled set of useful information. Especially if the information changes somewhat infrequently, then instead of running MySQL requests for every page load I can pull the information from MongoDB. In most cases when I use MongoDB, it's not as a persistent data store, but as a "compiled" data store.

MongoDB also has some useful set operations.

I for one don't believe that MongoDB is /directly/ competing with MySQL, Postgres, etc. but rather enhances these databases.

Whenever I see articles come up like this one mentioning MongoDB, I wonder not why people decided to go with Mongo, but why they didn't go with some of the alternatives out there? For my part, we use Couchbase to great success and it fixes many of the complaints against MongoDB. Then there's Riak and countless others with well established quality installations. To me MongoDB seems the buzzword NoSQL engine that gets used for 'play' projects, but not much in the way of real-world implementations. Thoughts?

I do not see any of those other NoSQL databases as really being equivalent. MongoDB intends to be a general-purpose application database. It has many of the features developers expect from MySQL/Postgres, such as arbitrary numbers of indexed fields, partial record updates, aggregation queries (simpler than Map/Reduce) and many others. Couchbase may be much closer in feature-set but its developers claim they do not really compete with Mongo.

I do not see Riak or Cassandra as competing at all. In fact I would expect most applications that use Riak or Cassandra are also using a general-purpose database as well (such as MySQL or Mongo). You could use some of those databases as a general purpose database but it would be more work for little benefit. It makes more sense to me to use Riak or Cassandra for use-cases that really need high-throughput and unlimited write-scalability and use an app database for things like user accounts and preference management and all the little things that can take up a lot of development time but will never have really demanding runtime requirements (for 99.99% of internet apps).

Good points for sure. I think though I'd personally look at the different solutions on both an architectural and feature basis. A good number of the reasons that the original article listed as issues they came across, were outside the realm of features available in the actual MongoDB system (more or less) such as problems with logging, monitoring, backups, etc and were more architectural issues. To be certain, these can (and probably will) be issues with other systems to investigate.

So, if someone asked you why they should use "Riak and countless others" and not MongoDB, what would you really say? Also, you seemed to imply that Riak was a go-to solution (my wording) while implying that MongoDB was more of a fringe "buzzword" technology ... when, counting features, I think most would acknowledge MongoDB as being more mainstream.

There are a large number of well-established and quality installations of MongoDB. It works really well at both small and large scale and with a bit of tweaking (like any technology), can perform nicely.

I would say that (and I suppose this is a bit of a weasel words style answer) each is going to have their benefits and drawbacks and they would have to investigate each for their particular use-case to determine which would best work for them. I think it's pretty true though. I didn't mean to imply that Riak is a go-to solution at all. It was more to give a couple 'options' to explore if you were to start a list along with MongoDB.

I agree and I do think that often times people will select a technology and think it should work like some other technology, become frustrated when it doesn't and then, in turn, blast the technology on only the merits they understand.

There are certainly reasons for using Riak, HBase, Cassandra, etc. and there are reasons for using MongoDB. It is when people seem to act confused when their hammer isn't acting like a screwdriver that we get these blog posts.

You can't currently change only a single field (e.g. increment this field) in either Couchbase or Riak.

Range sharding (for SaaS, shard by client_id).

No sorting by value on Couchbase indexes? And many other small features.

On the other hand, what I love about Couchbase: no mongos, all servers are equal.

Thanks, all good points.

I'm curious on the index sorting though, do you mean in terms of specifying what to sort on, or that you can't sort at all? As far as I understood the new indexing capabilities allow at the very least to sort on numeric values and similar.

You can't sort the results of a view by value:


Did you read the article? His main reason had nothing to do with Mongo itself.

Yes, yes I did. My question is why it was selected in the first place, rather than a period of exploration of the benefits and drawbacks of the different NoSQL solutions (which perhaps might have stopped this particular test-case from being a failure). The article doesn't mention those details though.

Also, we DO implement a two DB setup... Couchbase and MySQL. They both have their place.

I don't think you read it very carefully; the second sentence links to the earlier article where they list the reasons they chose that tech over others.

> I wrote about what I was thinking at the time here [1]

[1] http://codeascraft.etsy.com/2010/05/19/mongodb-at-etsy/

Then you would know that they were doing this in 2010, before Couchbase.

I'm so glad to see this post. I remember having a conversation with someone from Etsy at one point and they made an offhand comment about MongoDB having been a terrible idea with a hint that there was a longer story to it than we had time for. I've been curious about the story ever since.

Finally, some closure!
