MongoDB queries don’t always return all matching documents (engineering.meteor.com)
444 points by dan_ahmadi on June 7, 2016 | 397 comments



Said it before, will say it again... "MongoDB is the core piece of architectural rot in every single teetering and broken data platform I've worked with."

The fundamental problem is that MongoDB provides almost no stable semantics to build something deterministic and reliable on top of it.

That said. It is really, really easy to use.


As a guy who works on ACID database internals, I'm appalled that people use MongoDB. You want a document store? Use Postgres. Why on earth would you use a database that makes so little in the way of guarantees about what results you get from it? I think most people have really low load and concurrency, so things seem to work. When things get busier you're in for a world of pain. Look, I get that it's easy to use and easy to get started with, but you're going to pay for all of that later.


I was always a big fan of calling MongoDB the "Snapchat for databases". I think people even had some stickers printed for it..



Hey! Snapchat is pretty useful for seeing snapshots into the not completely dull parts of my friends' lives.

It has its potential flaws in being able to save the images/videos but please don't do it the disservice of likening it to MongoDB.


Oops - corrected, I meant "Snapchat for Databases" rather than "the Snapchat of Databases". Also, semi-OT: your reply was hilarious.


> Why on earth would you use a database that makes so little in the way of guarantees about what results you get from it?

Because some people can't stand having to work with SQL, migrations, schema and constraints, it's as simple as that (that's not my opinion, that's just the rationale behind MongoDB). Even if you use Postgres with the JSON column type, you still need to write SQL queries and schemas.

In the context of analytics, it might make sense, I'm not a big data analyst, but I've seen MongoDB used to centralize logs.


> Because some people can't stand having to work with SQL, migrations, schema and constraints

The thing is, if you actually try and write an app using MongoDB, you will rapidly find that you:

1) Have migrations (except they're going to be some scary ad hoc nodejs script that loops through your document store and modifies fields on the fly).

2) Have schemas (except they'll be implicit and undocumented)

3) Have constraints (except they'll be hidden inside your app logic, and violating them will cause data corruption).

The biggest lie about NoSQL databases is that they're schemaless. If you're EVER going to read the data back and do anything with it, it has a schema.
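To make points 2 and 3 concrete, here is a minimal Python sketch using plain dicts as stand-ins for stored documents. No MongoDB driver is involved, and `USER_SCHEMA`, `validate_user` and the field names are hypothetical. It only illustrates that once the store enforces nothing, the schema check ends up living in application code.

```python
# Hypothetical "implicit schema": nothing in a schemaless store enforces this,
# so the application has to carry the check itself.
USER_SCHEMA = {"email": str, "age": int}


def validate_user(doc):
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for field, expected in USER_SCHEMA.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            actual = type(doc[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# Documents written by three different versions of the same (imaginary) app.
docs = [
    {"email": "a@example.com", "age": 31},
    {"email": "b@example.com", "age": "31"},  # age stored as a string
    {"mail": "c@example.com", "age": 40},     # field was renamed at some point
]
reports = [validate_user(d) for d in docs]
```

Only the first document passes; the other two are exactly the kind of drift a declared schema would have rejected at write time.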


This.

My company used mongo for years before we got our shit together.

Schemas were always implicit (until we got our shit together and started defining and enforcing them with Python Schematics).

Migrations were crazy scripts you run in prod or hacks you stick into your code to "transition".

And yes, surprise constraints left and right causing awful anti-patterns. One-character key names to save disk. Hashed values for indexed keys to save memory. Awkward structuring to improve query performance.

The worst part is, we now have tons of important data in these databases and almost no one understands the legacy crazy app logic that makes them tick.


Basically reinventing many of the features of a relational db at the application level.


> used mongo for years before we got our shit together.

That's actually a legit use case. Use MongoDB while you get your shit together. I use global variables while I'm noodling around in code. Eventually I refactor.


I think this is a recipe for disaster. First, there are basic things that you should do from the get-go, e.g. not using globals. Second, the problem with "eventually I'll do it right" is that by that time, your stuff is out in the open, used by clients and heavily depended upon, and you have no way of refactoring. A company that uses a bad piece of technology will suffer many years before they could replace it.


> basic things ... not using globals

Depends on the language, I suppose. I'm more productive with Python when I write everything procedurally and refactor into functions, classes, etc. every dozen lines or so. It's more fun than writing UML diagrams (and seems to produce better code, too!).

Or do you think so long that your head aches and your colleague Hephaestus splits you open to find a fully-formed cooperative multiple inheritance hierarchy?


The problem is, in a great portion of real world projects "eventually" never comes and there's just no time for any major refactoring or replacing technologies since you are too busy implementing the feature that was needed two weeks ago.


I've often dreamed of a specific type of software built and released as "prototypeware", where any app created using it will have certain built-in scaling limits—and going past them will irrevocably force the app into a read-only mode. It would warn anyone monitoring it well in advance of hitting such a limit, of course. But there'd be no way to just slide the limit upward or otherwise tarry. It'd force the migration to something better just as if it were a Big Customer with Enterprise Compliance Demands.
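A toy Python sketch of that mechanism, purely as an illustration. `PrototypeStore`, its limit and its warning ratio are hypothetical names and numbers, and nothing in plain Python can make the switch truly irrevocable, so treat this as the shape of the idea rather than an enforceable implementation.

```python
class ReadOnlyError(RuntimeError):
    """Raised when a write is attempted after the hard limit was reached."""


class PrototypeStore:
    """In-memory store that warns near its limit, then refuses all writes."""

    def __init__(self, hard_limit, warn_ratio=0.8):
        self._hard_limit = hard_limit
        self._warn_at = int(hard_limit * warn_ratio)
        self._rows = []
        self._read_only = False
        self.warnings = []

    @property
    def read_only(self):
        # Deliberately no setter: the sketch offers no way to lift the limit.
        return self._read_only

    def insert(self, row):
        if self._read_only:
            raise ReadOnlyError("hard limit reached; migrate to a real system")
        self._rows.append(row)
        count = len(self._rows)
        if count >= self._hard_limit:
            self._read_only = True
        elif count >= self._warn_at:
            self.warnings.append(f"{count}/{self._hard_limit} rows used")

    def rows(self):
        return list(self._rows)


store = PrototypeStore(hard_limit=5)
rejected = 0
for i in range(7):
    try:
        store.insert({"id": i})
    except ReadOnlyError:
        rejected += 1
```

Reads keep working after the limit is hit; only writes are refused, which is the "read-only mode" described above.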

If an enforceable mechanism like that existed, I'd be a lot more confident in mocking things up. Stick SQLite in for the database, munge HTML and Javascript together, whatever—it's literally going to slap away the hand of anyone who tries to use it on a production workload, so why not?

(Going further, it'd be interesting to create some sort of quagmire of a software license, specifically for prototypeware, such that you'd be forced to rewrite all the prototype code instead of reusing even a hair of it in production. Maybe something like reassigning the IP to a trust, with the trust having an obligation to sue anyone and everyone who tries to create derivative works of the code they've been handed?)


This will not work. The whole "prototype" idea assumes once you grow out of the "prototype" phase you have the time, money, manpower, etc. to rewrite the whole thing based on solid, powerful technology and tools. That is, more often than not, not the case.

The first problem is that every tool has demands, especially the limited ones, and you end up writing your application around those limits and demands, using platform-specific code that will have to be discarded and re-written come the migration.

The second problem is that these tools dictate design, and once you try migrating, you still have an application designed around the prototype tools, which make a lot of concessions and have design flaws because of that.

Finally, I've never understood the need for learning a specific tool, platform or language for "rapid prototyping". Use the tools you will use eventually, it's not that building something in, say, Java from scratch will take an order of magnitude more time and effort than building it on Node.js, despite all the hype, especially if you're a Java shop.


> it's not that building something in, say, Java from scratch will take an order of magnitude more time and effort than building it on Node.js, despite all the hype, especially if you're a Java shop.

I think we're picturing different things here. You're picturing having software engineers make the prototype, and then having the same engineers do the final implementation. Meanwhile, I'm picturing two different teams, with different competencies—one who knows a prototyping toolchain backward and forward and is extremely productive in it, and the other who knows a solidly-architected platform just as well.

The classical pipeline in the animation industry is to have two separate "teams" of artists. One team does concept illustration and storyboarding, and the other does keyframe animation and in-betweening. The first of the two teams is essentially a team of prototypers. Their output is a product which stands on its own for internal evaluation purposes—but which isn't commercially viable "in production." (Nobody really wants to watch 1FPS sketches.) So, after the storyboarding is complete, the whole product is redone by the actual animators into the more familiar product of 24FPS tweened vector-lines or CGI model-joint movements.

The more familiar case of this for web development is where the "prototype" is a PSD file. Professional capital-D Designers are usually Photoshop experts—they're very productive in it, and can mock up something that can be evaluated for being "what the customer wants" quickly, with rapid iteration if it's not right. Once they've got the customer's sign-off, their output product—their prototype—can be tossed over to development staff to "make it work." (There are also an increasing number of interaction-design prototyping apps targeting the same set of designers, under the theory that they'll be able to become productive in quickly iterating the "feeling" of an app with a customer in the way they're already doing with the "look" of the app. I haven't met a designer that uses one of these professionally, but I think that's mostly because there aren't any of these yet well-known enough to be taught in art schools.)

But when it comes to workflow and use-case design, we don't really see the equivalent pipeline. Looking through the lens of separated "prototyper" and "engineer" roles, there are clearly tons of software-development tools that were intended to be used purely by "prototypers": Rails' view scaffolding, for example. But since this role isn't separate, these things get used by engineers, and sneered at, since, as you said, it's no more effort—when you're already an engineer—to just engineer the thing right from the beginning.

Interestingly, all of the true examples of workflow prototyping I can think of come from the specific domain of game development—but even there, nobody seems to realize that prototyping is the goal of these tools, and tries to misuse them as "production" tools. RPG Maker, seen as a tool for making a commercial RPG, is total crap. RPG Maker, seen as a tool for prototyping an RPG, is an excellent tool. Its output is effectively a sketch, a cartoon in the classical sense:

> The concept [of a cartoon] originated in the Middle Ages and first described a preparatory drawing for a piece of art, such as a painting, fresco, tapestry, or stained glass window.

A cartoon is a prototype used to communicate intent. Yes, you (as the producer of the finished piece) can cartoon together with a client to iterate on a proposal. But much more interestingly, a client can learn to cartoon on their own—and then, in place of a long design document, they can submit their cartoon to you. An RPG Maker game project is the best possible thing I could hope to receive as a design proposal from a client asking for me to make an RPG. It forces all the same decisions to be made that making the actual commercial game does—and thus embeds the answers to those decisions in the product—but it doesn't require the same skillset to create that the commercial game does, so the client can do it themselves. The prototyping tool, here, is doing the "iterating on a design together" job of the designer for them.

We do have one common prototyping tool in the software world—Excel. A complex Excel spreadsheet is a cartoon of a business process, that nearly anyone can make. We as engineers might hate them, because people generally have no sense of project organization when making them—but every project to convert an Excel "app" will take far less time than one that involves collecting the business requirements yourself. The decisions have already been made, and codified, into the spreadsheet. You don't have to sit there forcing the client to make them. The process of cartooning has forced them to do it themselves.

---

To summarize: software prototyping tools aren't for engineers—if you have an engineer's mindset, you'll prototype at the speed of sound engineering practice, so prototype tools won't be any help to you; and you'll be more familiar with the production-quality tools anyway, so you'll be more productive in those than with the prototyping toolset.

But software prototyping tools definitely have uses: they can help designers to iterate on a "functional mock-up" to capture a client's intent; or they can even help clients to create those same mock-ups on their own. This is why "prototypeware" makes sense as software—but also why it should be self-limiting from being used in production. The prototype app wasn't created by someone with an engineering mindset—so there's no way it could end up well-engineered. Its purpose is to serve as a cartoon, a communication to an engineer; not to function in production on its own.

(Mind you, prototypeware could be made to function as an MVP in closed-alpha test scenarios, in the same way that the MVPs of many startups are actually backed by manual human action in their early stages. The point there is to test the correctness of the codified business process, rather than to support a production workload.)


There is nothing as long lasting as a temporary solution.

I've just fixed up some code marked "proof of concept" that had been in production for a decade...

Admittedly some people's PoC work is better than what some consider to be release ready, but still this was not intended to be in that state for that long.


I'm not talking about major refactoring, but the kind of refactoring that I do every few minutes.


I don't think that is an apt comparison. Replacing your database backend, at the minimum, usually requires a massive migration of data, and possibly even changes to your entire architecture.

A refactoring does not change behavior, and can be performed in minor -- and in your example of a global variable, perhaps even trivial -- increments.

edit: typo


Sure, there's a continuum of refactoring, from trivial to complete re-write.

Any time the data schema(s) change, you need to migrate. I'll bet that even when sticking with the same database flavor you'll need to migrate a handful of times over the first few months. Requirements change, blah, blah. After the first couple migrations, you refactor to make that less painful. Eventually it might get to the point that your persistence layer is fairly abstracted and you can change databases without ripping apart everything else. Doesn't happen with every project, but sometimes.


My concern would be whether Mongo will cause me to lose data.... "To recap: MongoDB is neither AP nor CP. The defaults can cause significant loss of acknowledged writes. The strongest consistency offered has bugs which cause false acknowledgements, and even if they're fixed, doesn't prevent false failures." https://aphyr.com/posts/284-call-me-maybe-mongodb

...or get my data corrupted: "When MongoDB is all you have, it’s a cache with no backing store behind it. It will become inconsistent. Not eventually consistent — just plain, flat-out inconsistent, for all time. At that point, you have no options. Not even a nuclear one. You have no way to regenerate the data in a consistent state." http://www.sarahmei.com/blog/2013/11/11/why-you-should-never...

When you refactor or rewrite your code, you have the old code in version control, can write tests to confirm that it still works as expected, and there's no inherent time pressure.

If you pick an unreliable database and your data has been or is being lost and/or corrupted, it's more like a "try to stop the bleeding before the patient dies" situation.

That's not the time I want to be considering changing databases.


Likely better to scrape the data out through the app (if it's a web app) than to try to talk to that sort of database directly. The app would at least put names to everything.


I often liken NoSQL databases to dynamically typed languages.

With a NoSQL database, you have an implicit schema, but it will only be enforced and fail at runtime - when your code expects a field but fails to find it, for instance.

With a dynamically typed language, you have implicit types, but only enforced at runtime - when your code expects a value to be an int but finds a boolean, for instance.
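Both failure modes fit in a few lines of Python. This is a minimal sketch with dicts standing in for documents; `total_age` and the sample data are made up for illustration.

```python
def total_age(docs):
    # The "schema" here is implicit: every doc must have an integer "age".
    return sum(doc["age"] for doc in docs)


def try_total(docs):
    """Return the total, or the name of the exception raised while reading."""
    try:
        return total_age(docs)
    except (KeyError, TypeError) as exc:
        return type(exc).__name__


ok = try_total([{"age": 30}, {"age": 41}])
missing_field = try_total([{"age": 30}, {"name": "no age here"}])  # implicit schema violated
wrong_type = try_total([{"age": 30}, {"age": "41"}])               # implicit type violated
```

Nothing complains when the bad documents are written; the error only shows up when some later piece of code reads them.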

And both are fine, there is a need for both. I can see how the flexibility of being able to change, well, everything by just flipping a switch in your head ("this is an int now") might be helpful for, say, data exploration problems.

It's just that in a production environment, these features of NoSQL databases and dynamically typed languages turn into massive sources of problems and oh god, just don't.


You and Lazare are right on the money. And the thing with the database is that the code that inserts/updates it has to agree with all the querying code about what the implicit schema should be - but it's implicit and scattered around your code - so on a large team it's very hard for everyone to understand that implicit contract and it's going to be a constant source of production bugs.

Schemas don't change that much compared to code; having a strict schema enforced by the database saves you so much time and pain and downtime in the long run.


This list makes me want to cry a little. It rings too true.

> Have migrations (except they're going to be some scary ad hoc nodejs script that loops through your document store and modifies fields on the fly).

I literally just spent the better part of tonight AND yesterday evening dealing with one of these scripts. I had pulled down the production table to locally test the script (gross), but when I later ran it in the production environment, we'd somehow had an array sneak into what was an object field. The whole thing just felt like a mess.


Oh god, I'm getting flashbacks.

Because you can't just test it on one document and see if it works; you have no guarantee that all the documents will be identical. And if the migration script crashes halfway through... oh man.


> And if the migration script crashes halfway through... oh man.

Schema-issues and typing aside, I looked at MongoDB just long enough to find out there are no transactions, then ran away, quickly.

For a lot of tasks, I guess I would find MongoDB very useful, but lack of transactions is a complete deal breaker for me. Not having a real schema, referential integrity and all that makes them even more important, IMHO.

At work, I have had more than one quickly-hacked-together Perl script crash on me in the middle of a run. Having proper transactions has saved my butt repeatedly.


By contrast, with Postgres you'd have an explicit schema which all records obey, and even migrations can be done in transactions.
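A small sketch of what "migrations in transactions" buys you. It uses Python's built-in `sqlite3` as a stand-in, since SQLite, like Postgres, can roll back schema changes inside a transaction. The `users` table and the `migrate` helper are hypothetical.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN / COMMIT / ROLLBACK statements below are the only
# transaction control in play.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age TEXT)")
conn.executemany(
    "INSERT INTO users (age) VALUES (?)",
    [("31",), ("40",), ("unknown",)],  # the last row will break the migration
)


def migrate(conn):
    """Add an integer column and backfill it; undo everything on failure."""
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE users ADD COLUMN age_years INTEGER")
        rows = conn.execute("SELECT id, age FROM users").fetchall()
        for user_id, age in rows:
            conn.execute(
                "UPDATE users SET age_years = ? WHERE id = ?",
                (int(age), user_id),  # int("unknown") raises ValueError
            )
        conn.execute("COMMIT")
        return True
    except ValueError:
        conn.execute("ROLLBACK")
        return False


migrated = migrate(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

The script "crashes halfway through", but the rollback removes both the partial backfill and the new column, so the table is exactly as it was before the attempt.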


Mongo has its weaknesses, yes. Its main strength is cited as its simplicity, or that it's quick to get something out the door.

I agree with your last comment. I can't help but laugh at people who think they would get away with designing a database with no schema. Schemaless for me meant that unless you enforce constraints, there won't be any.

There's a reason why there are ORMs even though Mongo drivers are sufficient for most cases.

1) I've always designed my data with future changes in mind. I often spend up to an hour thinking of possibilities of data that I want to store in a collection, before writing the schema. The flexibility I have with Mongo is that if I think I need a field but am unsure of the exact data type to store (is it a string, an array of strings, or an array of objects with strings?), I just leave the field as an object and change it later. The plus being that as long as I haven't stored anything with that field, I can always change its type without a 'migration script'.

I've only needed to 'migrate' by updating documents 4-5 times: when GeoJSON landed, and a few other times when I needed small changes to my data.

3) The only way I can think of enforcing constraints on < 3.2 is through indices, which is insufficient. Most ORMs do the enforcing. I've never needed to enforce them at an app level.

I've used MongoDB primarily for its Geo support, and JSON enabling me to get things done quicker relative to maintaining SQL tables. I've got a small but interesting use case, public transit. https://movinggauteng.co.za and https://rwt.to.

When I started with the projects, PostgreSQL + PostGIS felt like a black box, and I wanted something that would give me ease and flexibility. At the time hstore was the talk of the day, but seemed to not meet my needs.

It would now with JSON, but I'll stick with Mongo for now.


Exactly. While the process of designing the structure of your data can make you feel like “you're not getting real work done”, in the long run, it actually prevents headaches caused by inconsistent data. Data always has a structure, it's just that some people are too lazy or mentally feeble to figure out what it is.


For me that is the most important aspect of starting / designing an application. If the data model is accurate, then the code falls into place easily. If it's not quite right, more and more code ends up in the application trying to make up for the poor data model.


I second this.

My first task in any project is to design the whole data model based on current requirements and while designing it I think of the interfaces and how would they read and write data (to refine requirements). Writing views and actions/APIs on top of well-formed data model then becomes a breeze.


It's so un-agile (but it works).


Agile doesn't mean "don't gather requirements or plan anything." It just means that you evaluate your results frequently and maybe change course, instead of waiting until the end when you're "done".


Schema on read, as opposed to schema on write, as it says in the excellent "Designing Data Intensive Applications" book ... nightmarish to deal with.


To add to your point about schemas. The new generation has not learned that the data almost always outlives whatever throwaway front end was written to work with said data. Tying the data to some sort of flavor of the month framework is setting up for all sorts of pain later.

I despise mysql, but even it is better than mongo. At least with it I can easily transition the data to many different uses.

Also, a point I find amusing is that many users of NoSQL claim schemaless and then go and write a layer on top of the datastore to enforce a schema. It would have been so much simpler to use an RDBMS out of the gate instead of badly implementing one.


> If you're EVER going to read the data back and do anything with it, it has a schema.

You are giving the NoSQL crowd too much credit. Some abominations have no recognizable schema at all. The data store will just contain arbitrary dump of data which different developers decided their "schema" should be. The number of "columns" will vary, the "columns" will have arbitrary formats, so on and so forth.

If one developer decided to separate name into "first: John", "last: Doe", you will have that. If another decided to have "name: John Doe", that's what will be there. If one developer decided social security should be "SSN: 123-45-6789" and another decided it should be "SSN: 123456780", well, you are going to have fun cleaning up the data at the business or even application layer.
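For a taste of that cleanup, here is a minimal Python sketch that folds the two record shapes above into one. `normalize_person` and its rules (split the name on the first space, accept only nine-digit SSNs) are assumptions made for illustration; real data would need far more cases.

```python
import re


def normalize_person(doc):
    """Map either record shape onto {"first", "last", "SSN"}."""
    if "name" in doc:
        # Naive assumption: the first space separates first and last name.
        first, _, last = doc["name"].partition(" ")
    else:
        first, last = doc.get("first", ""), doc.get("last", "")

    digits = re.sub(r"\D", "", doc.get("SSN", ""))
    if len(digits) == 9:
        ssn = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
    else:
        ssn = None  # unparseable: flag it instead of guessing
    return {"first": first, "last": last, "SSN": ssn}


records = [
    {"first": "John", "last": "Doe", "SSN": "123-45-6789"},
    {"name": "John Doe", "SSN": "123456780"},
]
cleaned = [normalize_person(r) for r in records]
```

Every new variant a developer invents means another branch in a function like this, which is the cost the comment is pointing at.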

But that's not even the big issue with MongoDB. It's their lack of ACID compliance!


#2 - it's basically "schema on read" vs "schema on write".

You always have a schema; where and how it's defined is the only question.


> Because some people can't stand having to work with SQL, migrations, schema and constraints

The real question is “How come these people are allowed anywhere near data stores?” SQL isn't ideal, but how many of the alternatives are better at protecting the integrity of your data?


Thank you. Not everything is easy. This is the difference between engineering and 'hacking'. Hacking is not something to aspire to; it's something you do because of crushing, external pressures.


[dead]


> At the end of the day, a SQL database doesn't represent the data in a way the programmer uses the data.

Errrr that's exactly what they do, unless you've got a terrible schema and haven't thought about your data enough. The thing about 'sql databases' is you can use the power of sql to fetch the data in any representation you want.


ORMs are a really bad attempt to force a square peg into a round hole. The mismatch between the relational model and object-oriented design principles is simply too big.

In the relational model:

(0) A relation is a collection of tuples of primitive values. Every relation has a relation schema, which determines the arity of its tuples and the type of each tuple component. In other words, the relational model is first-order.

(1) There are a few basic operators for computing relations from other relations (relational algebra).

(2) There is a mathematical theory (database normalization) of how to design primitive relation schemas to avoid storing duplicate information, and running into insertion, update and deletion anomalies.
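Points (0) and (1) can be sketched in a few lines of Python, treating a relation as a set of typed tuples and writing three of the basic operators as ordinary functions. The `Employee` and `Dept` relations and the helper names are made up for illustration; this is a toy model, not how any database implements them.

```python
from typing import NamedTuple


class Employee(NamedTuple):  # relation schema: arity 2, (str, int)
    name: str
    dept_id: int


class Dept(NamedTuple):      # relation schema: arity 2, (int, str)
    dept_id: int
    title: str


employees = {Employee("Ada", 1), Employee("Linus", 2), Employee("Grace", 1)}
depts = {Dept(1, "Research"), Dept(2, "Kernel")}


def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return {t for t in relation if predicate(t)}


def project(relation, *fields):
    """Projection: keep only the named components of each tuple."""
    return {tuple(getattr(t, f) for f in fields) for t in relation}


def natural_join(left, right, key):
    """Join on a shared field, emitting the key column only once."""
    return {
        (*l, *(v for f, v in zip(r._fields, r) if f != key))
        for l in left
        for r in right
        if getattr(l, key) == getattr(r, key)
    }


research_names = project(select(employees, lambda e: e.dept_id == 1), "name")
joined = natural_join(employees, depts, "dept_id")
```

Each operator takes relations and returns a relation, which is what lets queries be composed freely and rearranged by an optimizer.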

On the other hand, in a pure object-oriented program:

(0) An object is a collection of data and operations on it. The data is hidden from the rest of the program, so the only way to operate on it is to use the object's operations. The operations may take objects as arguments and return objects as results, so objects are intrinsically higher-order.

(1) In general, there are no limits on how one can define a single object's operations. However, it's impossible to define operations which require knowledge of the internal representation of two or more objects at a time.

(2) There are heuristic guidelines (e.g., SOLID principles) for designing flexible object-oriented systems. However, they lack any sort of rigorous foundation beyond “it seems to work in practice”, so object-oriented designers may deviate from these guidelines at their own discretion.

---

For data-oriented applications, it's pretty clear to me that the relational model has important advantages over object-orientation:

(0) The decoupling between data and operations allows the database designer to focus exclusively on data integrity constraints, instead of anticipating whatever queries users will want to make.

(1) The limited expressiveness of relational algebra (with no recursively defined relations) is also a blessing, because it makes automated query optimization tractable in practice.

While objects present problem after problem:

(0) Object graphs are intrinsically directed, and must be traversed in the direction of their links. This makes queries less declarative.

(1) Objects have a notion of identity, which destroys many opportunities for using equational reasoning to build large queries. This also makes queries less declarative.

Of course, the relational model says nothing about general-purpose programming, whereas object-orientation does. But there exist other paradigms for general-purpose programming that are less badly in conflict with the relational model. For instance, functional and logic programming:

(0) Don't reject the use of first-order data, decoupled from operations.

(1) Prefer the notion of mathematical variable, whose meaning is given by substitution (a first-order operation), to imperative assignment, whose meaning is given by certain predicate transformers (intrinsically higher-order gadgets).


I was going to say I wish there was a way to bookmark single comments. Then I realized the reply page is essentially a link to a single comment:

https://news.ycombinator.com/reply?id=11861520&goto=item%3Fi...

If anyone wants to bookmark this one.


Clicking on the time ("X hours ago") gives you https://news.ycombinator.com/item?id=11861520


Is having seen something used some way a leading indicator of it being a good idea to have used that thing that way?

Because I've seen Excel used as a database with all kinds of macros and VBA scripts bolted-on/embedded to provide the workbook various shapes of stored-procedure and query capability... but, while sorta impressive in a "Holy crap, lol wut?" kind of way, I'm not sure any of the uses like that I observed were actually good ideas. Full of epic cleverness and ingenuity? Definitely. A good idea? Probably not.


Did they make the company a lot more money than they cost? If so they were probably a good idea. Not all code needs to be "pretty" to serve a purpose. I've seen some pretty epic hacks that I know generated hundreds of thousands of dollars of new revenue.


They often had a significant "bus factor" problem as a result of this in the best cases, and in the worst cases these mountains of hacks were a massive impediment to growth and/or evolution to meet changing marketplace demands... despite being a central pillar of data management and revenue as it existed in the status quo.


In my experience in market research, advanced spreadsheet programming with macros and pivot tables and whatnot is more a contemporary incarnation of Reporting than raw database querying and operations.


So my question is then: Why not use CouchDB instead? I don't see what Mongo gives you over that and CouchDB is at least dependable and predictable in its operation.


CouchDB is too reliable and actually fsyncs your documents to disk. That is plain boring. I like to live on the edge and have some documents go to /dev/null once in a while. Life is just more exciting that way ;-)

/s


I really like Couch, I wish it had more adoption than it seems to have and that its ecosystem was more mature than it seems to be... and that javascript wasn't its first class citizen.

But it's a really cool database (though I'm partial to rethinkdb now)


Even Javascript is sort of a second-class citizen in Couch. The real first-class citizen is native Erlang code running unsandboxed in the server context. If you want high(er) performance, that's where you go. (Alternatives to this have been discussed, like embedding the luerl Lua interpreter to give the option of a sandboxed programming target without the IPC cost. Nothing in the immediate pipeline, though.)


> though I'm partial to rethinkdb now

Me too. RethinkDB is my document database of choice these days. In my experience, it's proven to be reliable and fast and the development team very responsive and helpful. They also seem quite mindful when it comes to new features and will delay things for years (e.g. auto-failover, which they now support but it took a while) if rushing it would impact quality.

That's what I want from a database: first and foremost it must be solid and not lose my data. Everything else (including high availability) can come after.


Is CouchDB still alive? I spent a weekend playing with it in January, but it seemed to be a very quiet project, with the last stable release being almost two years ago.


Most of the activity happens at Couchbase now, the company that the inventor D. Katz founded based on CouchDB technology. You can still use Couchbase for free, but it's possible to pay for support. The coolest thing they have is Couchbase Lite, the mobile version of CouchDB, which lets you replicate with your server. I find it a very interesting alternative to Core Data, Parse and co., and we use it in production.


Good to know, will shift attention to Couchbase and re-assess. Thanks!


> some people can't stand having to work with SQL, migrations, schema and constraints, it's as simple as that

Use the right tool for the job, right? Admittedly something like MongoDB could be the right tool for the job (examples around here include RethinkDB and CouchDB). MongoDB, however, is like a hammer with no head.


In this case, please use PostgreSQL jsonb data tables. Just as powerful as Mongo... but with the stability and guarantees of Postgres.


It has been pointed out before that json(b) has problems with indexing. IIRC the cost estimates of indexes on JSON data are static, and therefore very rarely accurate. I'm terribly sorry but couldn't find a reference with 5min of searching. I still like postgres over mongo


http://postgresql.nabble.com/working-around-JSONB-s-lack-of-...

Of course, that doesn't mean indexes never help. See for example http://blog.2ndquadrant.com/jsonb-type-performance-postgresq....

I guess the workaround would be creating indexes on computed columns that query from the json data, together with changing one's queries to use that computed field. For example, with a json column storing names in various places, a computed column could collect all of them in an array. An index on that computed column will have good statistics.

Bottom-line: if you want your queries to run fast, you will have to tell your store what kind of data you have and what kind of queries you will run. Otherwise, there's little the store can do.

Having a traditional database with various constraints is a way to give that information. With json columns, you may have to do it in another way (for now).
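
A minimal sketch of that workaround, with loud assumptions: SQLite (via Python's standard library) stands in for Postgres, the field is extracted by the application at insert time rather than by a database-side computed column, and the `docs` table and `author` field are made up for illustration. The point is only that the query targets an ordinary indexed column instead of the JSON blob.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# "body" holds the raw JSON document; "author" is a copy of the one field we query by.
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL, author TEXT)")
conn.execute("CREATE INDEX docs_author_idx ON docs (author)")


def insert_doc(conn, doc):
    """Store the raw JSON plus the extracted field (hypothetical helper)."""
    conn.execute(
        "INSERT INTO docs (body, author) VALUES (?, ?)",
        (json.dumps(doc), doc.get("author")),
    )


def find_by_author(conn, author):
    """Query the extracted column, not the JSON text, so the index can apply."""
    rows = conn.execute(
        "SELECT body FROM docs WHERE author = ? ORDER BY id", (author,)
    )
    return [json.loads(body) for (body,) in rows]


for doc in (
    {"author": "ann", "title": "a"},
    {"author": "bob", "title": "b"},
    {"author": "ann", "title": "c"},
):
    insert_doc(conn, doc)
```

The cost is the one described above: you have to decide up front which fields matter and change your queries to use them.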


Who knew you'd need to learn how to use tools to use them!


> Because some people can't stand having to work with SQL,migrations,schema and constraints, it's as simple as that

So use an ORM that understands Postgres' JSON columns. Don't need to write a single SQL statement, automagic migrations, no explicit schema (unless you make one), no constraints (unless you add them).

It works great, we did a rather large project last year using Django's ORM and postgres where we didn't know the final data schema until months after launch.


So get them to use ToroDB.


Never heard of ToroDB. Just checked out the website and it looks interesting, however the tagline "The first NoSQL and SQL database" is untrue.

At least OrientDB has had both schema+schema-free and SQL + NoSQL querying interfaces.

That is, you can optionally supply a schema for your documents. IIRC you could choose either schemaless, schema or mixed (where mixed allows fields not in the schema to exist as schemaless fields).

The default query language was SQL with "enhancements" (to allow for graph traversal), but you could also query with Gremlin. Not sure if this is still the case or not as I don't use OrientDB.

The above was true in 2012 and possibly a lot earlier. I see ToroDB's first Github commit was in 2014.


We have rectified the web site. It no longer says "first"; it now just states the fact that it's open source.

Thanks!


Using a document oriented data store as a log aggregator?

Please lord, take him in his sleep.


It does work for a certain volume of data. You can index fields you're interested in, even do so after the fact, and it's like any other database in that case. And sometimes you have small apps that do need complete historical log data, so Kafka et al just introduce unnecessary complexity since you'd need to aggregate into a key value store anyways.

But if you do this, god forbid you go beyond where indices can fit in RAM of a single machine. And you will do so, with probability one given your product doesn't shut down. So you're running a gauntlet against a redesign.


Try some Lucene based software and tell me how it goes.


It’s useful for prototyping. When you don’t know which schema you’ll end up using having an *SQL database is tedious because you have to do migrations every time you change the schema. Once you’re done prototyping you can switch to a better alternative.


How does that work?

If you need to preserve data between the application versions then you still get all the headaches with MongoDb (either migrating the data or supporting multiple schema versions when you read the data, oh the fun!).

If you don't need to preserve data between the versions then you don't need to write migration scripts in SQL, just scrap everything and pretend it's the first version of the application.
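
For anyone who hasn't had the pleasure, "supporting multiple schema versions when you read the data" tends to look something like the sketch below. Everything in it is hypothetical (the `schema_version` tag, the field names, the `upgrade` helper); it only illustrates the upgrade-on-read pattern, not any particular driver's API.

```python
# Hypothetical upgrade-on-read: each stored document carries a version tag,
# and readers walk it forward one step at a time to the current shape.

CURRENT_VERSION = 3


def _v1_to_v2(doc):
    # v1 stored a single "name" string; v2 splits it into first/last.
    first, _, last = doc.pop("name", "").partition(" ")
    doc["first_name"], doc["last_name"] = first, last
    return doc


def _v2_to_v3(doc):
    # v3 added a "tags" list that older documents lack.
    doc.setdefault("tags", [])
    return doc


UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(doc):
    """Return a copy of doc upgraded to CURRENT_VERSION."""
    doc = dict(doc)  # shallow copy so the caller's document is untouched
    version = doc.get("schema_version", 1)  # untagged documents are treated as v1
    while version < CURRENT_VERSION:
        doc = UPGRADES[version](doc)
        version += 1
    doc["schema_version"] = version
    return doc
```

Every schema change adds another step to `UPGRADES`, and every reader has to go through `upgrade`, forever, or until someone runs a real migration.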


I guess this is a question of how useful and deployable the prototype should be. Why not just have an in-memory object cache, literally a hashmap, for your DAL? If you're composing app-level code, you don't need to know what the backend does to your data. You could even create a simple method to populate the data at app boot in the dev profile. When you figure out the storage requirements and finalized model, build your db.

This would save you time on picking the db, schema changes or even migration changes in Mongo. You don't have to worry about bad documents from an earlier app revision.
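
Roughly what I mean, as a sketch. The class and method names are invented for illustration; the only idea is that app code talks to a small interface and the "database" behind it is a dict.

```python
import copy
import itertools


class InMemoryStore:
    """Hypothetical data-access layer backed by a plain dict."""

    def __init__(self, seed=()):
        self._docs = {}
        self._ids = itertools.count(1)
        for doc in seed:  # e.g. fixture data loaded at app boot in a dev profile
            self.insert(doc)

    def insert(self, doc):
        doc_id = next(self._ids)
        self._docs[doc_id] = copy.deepcopy(doc)  # copy so callers can't mutate stored data
        return doc_id

    def get(self, doc_id):
        doc = self._docs.get(doc_id)
        return copy.deepcopy(doc) if doc is not None else None

    def find(self, **criteria):
        """Return copies of all documents whose fields equal the given values."""
        return [
            copy.deepcopy(doc)
            for doc in self._docs.values()
            if all(doc.get(key) == value for key, value in criteria.items())
        ]

    def delete(self, doc_id):
        return self._docs.pop(doc_id, None) is not None
```

Swapping in a real backend later means reimplementing these four methods, assuming the app code never reaches past them.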


> You could even create a simple method to populate the data at app boot in the dev profile

I have ~40M documents, and it’s still a subset of what I’ll manage in production. It takes a few hours to load them in the database.


In that case you can start with postgres and stuff your documents in a single json column, accessing it just as you would have in mongo while you're prototyping and don't care about speed and indexing, and when you're done, you can just change that table to a more proper structure without changing databases.


I have never managed to build a prototype that didn't end up being used in a production setting.

I am trying to remember if I ever built a prototype that got rewritten. Probably not.


It probably won’t get rewritten but the DB-specific code is in one module so it’s easy(-ish) to swap the underlying DB.


Use SQLite and let the tool drop and recreate all the tables during prototyping. No migrations required.
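
A sketch of the drop-and-recreate approach using Python's built-in sqlite3 module. The `SCHEMA` dict, table names, and `reset_database` helper are hypothetical; a real tool (an ORM, say) would generate the DDL for you.

```python
import sqlite3

# Hypothetical prototype schema; edit freely, since nothing is preserved.
SCHEMA = {
    "users": "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)",
    "notes": "CREATE TABLE notes (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)",
}


def reset_database(conn):
    """Drop every table named in SCHEMA and recreate it from scratch."""
    with conn:  # commit on success, roll back on error
        for table, ddl in SCHEMA.items():
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            conn.execute(ddl)


conn = sqlite3.connect(":memory:")  # use a file path to keep data between runs
reset_database(conn)
conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
reset_database(conn)  # schema changed? throw everything away and start over
```

This only works while the data is disposable, which is exactly the prototyping case.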


SQLite can’t handle a large number of documents; I have ~40M of them.


Exactly. Use the right tool for the right job. Start prototyping and development with MongoDB and then migrate to Postgres, or Cassandra or whatever suits your use case better.


Oops.. did I say something wrong?


Because postgres doesn't focus on the first-five-minutes experience the way Mongo does. Even the name is hard to say.

Mongo is a dumb, dead-end platform, but they know how important ease-of-use is.


This is key and often overlooked - MongoDB is so popular not because it's the best database but because it's so easy to get started with. Download/unzip/run to have a database engine ready. It also helps that you can also immediately store anything without any prior setup steps.

Postgres/mysql/sqlserver/etc are nowhere near as easy to install, as fast to get started with or as portable to move around.


Postgres members should listen to this and have a simple getting started guide for OS X, Windows, and Linux. I tried brew install postgresql. There was no single place that told me how to start the server, access the command line, create a db, etc.


On OSX there is the fantastic http://postgresapp.com/ . It installs into /Applications so it is easy to remove, and comes with a start/stop GUI and taskbar icon. Great for local development.

But installing and configuring Postgres "properly" on a server is still something of a challenge. Do I need to modify random_page_cost on a SSD or not? What are good memory limits on modern big servers? What exactly needs to go into pg_hba.conf?

None of these seem too difficult after reading a few tutorials and wikis, but it would be nice if the server set itself up with reasonable defaults based on the machine it's running on.


Getting started with PostgreSQL on Linux is actually trivial. What is annoying, though, is that there are lots of guides which talk about editing pg_hba.conf, which is not necessary for the simplest setup. The default pg_hba.conf is good in most distros.


We must have different definitions of trivial compared to what I had to go through every time - it's a mess of an install process that takes tweaking config files just to have it even listen to external requests.


With the ease of services like AWS, we never installed a database server. Pick a database flavor, version, click, click and you're up. I suppose designing the schema takes a little effort, but I find it much easier than properly architecting software.


Many if not most installations are still being done on actual dev machines and servers. While RDS and other managed services are nice, they're just a small fraction of the usage.

Also the fact that managed services help so much only speaks to the fact of how difficult these relational databases typically are to work with operationally.


Well, that's not entirely true.

If you're on a Mac you can download postgresql.app[0], which puts a small icon in the top right status bar. You don't have to set up users or permissions or anything; it's super easy to set up. Getting it on prod can come later, but for the first five minutes it works.

(granted, this neglects contrib extensions like hstore)

[0] http://postgresapp.com/


It's not just installing for development, mongodb (other than its ridiculous clustering) is very easy to install on production servers as well and moving an installation is basically just zipping up the folder and moving it somewhere else.


That sets up postgres. It doesn't let you get started doing CRUD operations inside your postgres though - which is where MongoDB shines, "just store my data, fuck it"


Oh, it stores data, just don't try to get it out again...


Mongo also goes to every conference and pitches themselves constantly. Maybe postgres needs a marketing person :)


Yes marketing helps, but you can't ignore just how easy mongodb is to run and move around.


> Mongo is a dumb, dead-end platform, but they know how important ease-of-use is.

By “ease of use”, do you mean “ease of making something that seems to work” or “ease of making something that actually works”? I've never used a schema-free database, and ended up thinking to myself “I'm completely sure this database can't possibly contain garbage data”. Or do programmers simply not care about data integrity anymore?


The single biggest source of grief in our production database has been the one JSON field we used once to avoid adding another table. That goddamn thing has crashed the server so many times with invalid data, that I'm never using anything schemaless again. We recently migrated to a proper table and I'm thanking my lucky stars I finally got rid of that devil.


JSON field? You are lucky. We have a pickled Python object stored in our database. Wonderful for debugging when it's pretty much unreadable.


My condolences :(


> I'm thanking my lucky stars I finally got rid of that devil.

All that I can say is congrats, man!


Prototyping on Mongo is bad. MySQL takes 10 seconds to install. I'm imagining with the public Docker registry, so does Postgres.

How many times have you promised to fix something later and then later comes and...

Prototyping is not an excuse for laziness. It does feel like some programmers don't care.


You can insult the developers that use Mongo or you can look at how to get those users onto a better platform. With the modern expectations of full-stack development, is it any wonder that something promising simplicity and zero-configuration data storage does well?


> MySQL

Why not just prototype with Sqlite? You don't even need a server.


Well, I have used one and had that assertion. Ease of use means making something that works. It seems we're condoning lack of knowledge or experience with programming. If you've never used databases in your life, whether you're using SQL or NoSQL you'll likely end up with rubbish data. In the SQL world it could be that you're storing time in the wrong format, concatenating long fields, or not normalising when you should or the other way around.

In NoSQL you could be reinventing the wheel, or storing data that you can't query efficiently because you can't index it well etc.

All the excuses of not using some document stores beyond ACID really sound like people won't know what the heck they're doing.


> Well, I have used one and had that assertion. Ease of use means making something that works.

For me, it means, under no circumstance, no interleaving of transactions or scheduling of commands, nothing, nichts, nada, can the database be in a state where a business rule is violated. If I need to worry what silly intermediate transaction state can be observed from another transaction, or if I need to worry whether a master record can be deleted without cascade-deleting everything that references it, then the DBMS has failed me.

> not normalising when you should or the other way around.

I've never seen a situation where anything less than 3NF (actually, ideally, at least EKNF) is acceptable.
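
To illustrate the master-record point, here is a small sketch of the database itself refusing orphans and cascading deletes. It uses SQLite through Python's standard library purely because it runs anywhere; the `customers` and `orders` tables are made up, and any serious RDBMS offers the same kind of constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")

# An order pointing at a customer that does not exist is rejected outright.
try:
    conn.execute("INSERT INTO orders (id, customer_id) VALUES (11, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the master record removes its dependents in the same statement.
conn.execute("DELETE FROM customers WHERE id = 1")
remaining_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

No application code had to remember the rule; the schema carries it.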


What they neglect to mention is not that you don't need a DBA, but that you are the DBA yourself, with all the responsibilities that go along with that role. Who's getting paged at 3am now...?


If that's really a problem, host your database on a cloud platform and let those guys do the job of a DBA. Works for me at present. I am aware it's not going to be a solution for everyone, though it still sounds a lot better than being your own Mongo DBA.


The DBA is the role that is responsible for the organization's data. Even if you outsource the routine tasks such as "doing backups" you still need someone to assume that role.


Cloud service doesn't really replace DBA.

Yes, it will help you cover cases like the server physically exploding, but that's basically irrelevant. Most problems where you need a DBA are caused either by data corruption caused by application code or a developer, or by performance issues caused by DB structure. In those cases the cloud platform won't do anything for you; they just host the server. They can restore backups, do monitoring and tune the server, not your particular app/db structure, but all the big problems are there.


"most problems where you need a DBA are caused either by data corruption caused by application code or developer, or performance issues caused by DB structure"

Is running Mongo going to solve any of those problems? Without a rigidly enforced schema I would guess those problems are going to be amplified rather than solved.


This will be highly subjective but you need to get over the "postgresql has the longest feature list so why don't you use it". The last startup I have been involved with tried to use PostgreSQL and needed to move to MySQL (yeah, well) because commercial support was both more expensive and less useful than what we were able to get for MySQL. Perhaps today it's different.

While I no longer use PostgreSQL much, every time I need to touch it, it seems rather developer unfriendly; just last month I found MySQL, heck even SQLite supports triggers with code inlined into the trigger body but PostgreSQL mandates writing a separate function for the trigger. And, of course, it needs to be in plpgsql because reasons. The most trivial "let's calculate another column" becomes a complicated nightmare.

So then if you don't want to use PostgreSQL what then? The answer now is MySQL, again, because 5.7 has JSON.

And mind you, I have grown to dislike MongoDB slowly over the years as new types of queries have appeared and it's a complete mess by now. There was an excellent article on this posted on Linkedin of all places this March https://www.linkedin.com/pulse/mongodb-frankenstein-monster-...

It's really interesting how MySQL is the most usable and most supported database by now...
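
For reference, this is the kind of inline trigger body I mean, shown with SQLite through Python's sqlite3 module so it can be run as-is. The `items` table and the `items_total` trigger are made-up examples of the "let's calculate another column" case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, price INTEGER, qty INTEGER, total INTEGER)"
)
# The trigger logic lives directly in the trigger body; no separate function is declared.
conn.execute("""
    CREATE TRIGGER items_total AFTER INSERT ON items
    BEGIN
        UPDATE items SET total = NEW.price * NEW.qty WHERE id = NEW.id;
    END
""")
conn.execute("INSERT INTO items (price, qty) VALUES (3, 4)")
total = conn.execute("SELECT total FROM items").fetchone()[0]
```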


> This will be highly subjective but you need to get over the "postgresql has the longest feature list so why don't you use it".

I think if you re-read it, you might see that at no point did the post that you're replying to imply that Postgres was preferred because it had a longer list of features. They're speaking entirely about the strong guarantees that an ACID system gets you.

Document stores are only mentioned because this is one of the (incorrectly) perceived advantages that Mongo has over Postgres and other databases.


"just last month I found MySQL, heck even SQLite supports triggers with code inlined into the trigger body but PostgreSQL mandates writing a separate function for the trigger"

The right response, as a postgres developer, is to agree that you describe a useful feature, and perhaps implement it to help other users.

But my advice to you is to be willing to put up with some short-term annoyances. Sometimes the best choices are a little annoying, and if you refuse to consider them, it will cost you (or your employer) much more later.


It is listed on the TODO https://wiki.postgresql.org/wiki/Todo page, apparently since 2012. I haven't coded in C since 1998, I do not think you want me to touch the PostgreSQL code base.


> just last month I found MySQL, heck even SQLite supports triggers with code inlined into the trigger body but PostgreSQL mandates writing a separate function for the trigger

I found something similar (and in the last month too) – insofar as we're talking missing popular features – but with MySQL's and Postgres's positions reversed.

`ALTER TABLE ... ADD CONSTRAINT CHECK ...` runs on MySQL without an issue, and so does any INSERT or UPDATE violating that CHECK constraint. A bug was filed in 2004.


My response to the comment for triggers and PGSQL would be: Postgres tries its best to stop you shooting yourself in the foot.

Similar to the top comment, all the real problems I've ever encountered with postgres (heck, all major RDBMS's for that matter) come from certain areas, mainly triggers.


A startup whose back end I helped architect uses MongoDB for everything. Before starting the project I asked the CTO not to use Mongo, as it was not the right fit. Basically, they needed more relational features. The CTO chose Mongo because he thought every startup uses it, so why not us. Now they are suffering as they need ACID and relational features. They want to rewrite to Postgres, but they are heavily invested and it's not easy to go back.


Sounds like a startup with a very short timeframe.


Postgres wasn't always a great document store...there was definitely a time period where if you wanted to take a document-oriented approach to data modeling, MongoDB was a good way to go. JSONB was only added in the last minor version of Postgres, and while the JSON and HSTORE types were available, it didn't give you quite the same speed. Now that JSONB is a thing, I think the two databases are more comparable as a document store.


Is that actually true? I mean the part about Postgres might be, I don't know. But was there a time when MongoDB was a good way to go?

Was there ever a time when it actually worked consistently well at something that was database shaped? Because I started dealing with it in ~2010 I think, and it wasn't a suitable database for anything other than toy projects or throw-away data back then, and while it's many versions newer, it still appears to be pretty fast and loose with its supposed system guarantees.

There was a point when they raised $100+ Million in funding that I thought they'd take that money and actually build a database. At least as recently as last Summer that wasn't a reality yet.


> But was there a time when MongoDB was a good way to go?

Not in any way, shape, or form. Keep in mind, as bad as it is now, it was vastly worse when it launched.


A text/blob field with a normalized key column or two were always vastly superior. We're talking about data loss at an incredible level. I mean, a new Jepsen test comes out and this community goes bonkers over how database X might suffer a split-brain problem for a few milliseconds under an extreme condition but Mongo on a single instance has never been safe, and people are making excuses for it.


Somewhat surprisingly, recently Aphyr came close to recommending MongoDB. From https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5 :

> I’ve hesitated to recommend RethinkDB in the past because prior to 2.1, an operator had to intervene to handle network or node failures. However, 2.1’s automatic failover converges reasonably quickly, and its claimed safety invariants appear to hold under partitions. I’m comfortable recommending it for users seeking a schema-less document store where inter-document consistency isn’t required. Users might also consider MongoDB, which recently introduced options for stronger read consistency similar to Rethink–the two also offer similar availability properties and data models.


I've been using Postgres/jsonb for JSON document store. It works OK - the query capabilities are still a little rough (9.5 is better than 9.4), and some frameworks like Loopback don't support JSON in Postgres yet (not sure which ones do), but it's definitely capable and reliable...


The support in .net for Jsonb is awesome. Support by npgsql and a document database api written on top called Marten.


What do you think about noSQL in general. From what I could follow from aphyr rethinkdb seems pretty awesome. I like it a lot, but I am also not getting a ton of traffic on localhost:3000...


> What do you think about noSQL in general.

That is an unanswerable question since it's about everything. "NoSQL" is a huge variety of techniques - many of them yet to be invented, that only have one thing in common: "not SQL". From document storage over key/value storage to graph databases. Anyone who tells you what they think "about NoSQL" either has to redirect the question to become a useful one, or if they actually attempt to answer it take your popcorn and expect entertainment at best.

"Types and examples of NoSQL databases" - https://en.wikipedia.org/wiki/NoSQL#Types_and_examples_of_No...


Fair enough. I mentioned rethinkdb above because I find it very intuitive and versatile. I've used mongodb a fair bit but I like rethinkdb better for a host of reasons. I guess what I meant was, I thought mongodb was ok; however, everyone here seems to have always known it was deeply flawed. I tried to follow the jepsen report on rethink but I don't fully comprehend the tradeoffs/benchmarks etc., and was curious what others thought about it.


Here's one:

You're dealing with a torrent of incoming semi-unstructured data, where losing a good chunk of it is minor nuisance because you only need a decent sample, from which you extract data.

In those kind of scenarios, making it easy to work on the code can often be far more important than reliability.

I have a project like that now. I'd love to use Postgres, and probably will eventually once things "settle down" and we know what data we need to store. But for now MongoDB is the "quick and dirty" solution. We define a schema client side for everything we nail down, so as we nail down more aspects of what data to process, it gets easier to transition to a proper database.

As ORMs get better support for Postgres' JSON capabilities, it will likely get less and less appealing to use MongoDB for stuff like this too.


Ah, yes multiple downvotes without explanation, that always convinces me.


That's a totally valid reason to use Mongo, now if only they'd market themselves that way.


But if you need to scale horizontally use Rethinkdb or Cassandra.


Because Postgres doesn't have any clustering / HA features. So out of the box it doesn't scale well.


It HAS them, just not built-in tooling to make using them easy. 2ndQuadrant's repmgr gets you partway there, I'm really hoping to see them revamp it now that pg_rewind is a thing to make restoring a failed master less of a pain in the butt (this is literally the only reason I don't bother with HA right now, it's usually much easier for me to get the DB back online or restore from a barman backup than deal with replication).


If you want that, you can always use a variant of Postgres that does, like Greenplum, Citus, and a few others. They're battle-proven. There's also MySQL and its variants.

Not to mention that there are NoSQL alternatives that have a better track record than Mongo, like Cassandra.


Are there NoSQL alternatives that have a worse track record than Mongo?


There probably are, but they just don't enjoy the same level of popularity as Mongo or they'd be scrutinized just as much.


Cassandra, HBase etc had checkered pasts with plenty of their own data loss and inconsistency bugs.

Now they are considered two of the most rock solid NoSQL databases. The hatred towards MongoDB really is pretty irrational given just how popular the database is.


It is anything but irrational.

https://aphyr.com/posts/322-jepsen-mongodb-stale-reads

If it isn't already fixed, they need to fix it. I can't vouch for HBase, but Cassandra is stable and reliable now. Cassandra is also easier to scale.


FWIW, I don't think Cassandra is particularly any better today semantically than it used to be. Merge conflicts are still at the cell level rather than the row level, and wall-clock time is still the way that LWW resolution is determined. It lets you mix strongly consistent and eventually consistent data together, which makes no sense.

But the difference is that Cassandra is reliably "broken" in those ways, and as a result there are ways of using it which don't lean heavily on those weaknesses. Such as writing only immutable data or isolating all data that will be used in paxos transactions into their own column families by convention, etc.

Cassandra more or less behaves exactly as it claims that it does. So you can do a somewhat thorough investigation of its system semantics and know what you can rely on and what you can't. MongoDB doesn't even uphold the system semantics it claims that it has, so it's just broken in weird and esoteric ways that you discover mostly by accident.
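
A toy model of the cell-level, wall-clock LWW rule, in case the consequences aren't obvious. This is not Cassandra's code or API; `write` and `merge` are hypothetical helpers that only encode "per column, the cell with the larger timestamp wins".

```python
def write(values, timestamp):
    """Build a row version where every cell carries the writer's clock reading."""
    return {column: (value, timestamp) for column, value in values.items()}


def merge(a, b):
    """Cell-level last-write-wins: per column, keep the cell with the larger timestamp."""
    merged = dict(a)
    for column, cell in b.items():
        if column not in merged or cell[1] > merged[column][1]:
            merged[column] = cell
    return merged


# One client writes a complete row.
row = write({"city": "Paris", "country": "France"}, timestamp=100)
# Another client updates only the city slightly later.
row = merge(row, write({"city": "Berlin"}, timestamp=101))
# This write happens last in real time, but its node's clock is behind,
# so it silently loses to the older "France" cell.
row = merge(row, write({"country": "Germany"}, timestamp=99))

current = {column: value for column, (value, _) in row.items()}
```

The row ends up as Berlin/France, a combination no client ever wrote as a whole. That is the "reliably broken" behaviour: predictable enough that you can design around it, for example by only writing immutable data.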


Scale to what though? RDBMSs can easily handle large loads, have replication, etc... At the point where you need true scaling, you'll have a much better idea of your problem and can solve it appropriately.


Or if you want a rock solid json document-based db with an awesomely killer api - use RethinkDb.

It's a phenomenal product.


Why on earth haven't I come across this information before? I spent a crazy amount of time researching frameworks before settling for Meteor, and never came across this.


MongoDB: Because /dev/null doesn't support sharding.


But it does support sharting. Your entire app. Into the ether.


At webscale


Ha ha, you said "sharting."


nc -vv -l 27017 > /dev/null

That and some round robin dns entries should be good enough for prod use, yeah?

('course it's not webscale ready till it's deployable as a docker container which contains critical nodejs code with a dependency on leftpad.js...)


These days all the cool kids use LPaaS on http://left-pad.io/, it scales much better than running leftpad on a local machine.


If you use ncat you can even get TLS support.


> Look I get that's it's easy to use and easy to get started with

Yes, and that's why I'm using it.

> but you're going to pay for all of that later. [...] When things get busier you're in for a world of pain.

Will never happen.


I worked at a Data Analytics start up in Palo Alto back in 2011 and we had 8 or 9 databases in our arsenal for storing different types of data. MongoDB was by far the worst and most unstable database we had. It was so bad that for the presidential debate, I had to stay up and flip servers all night because even though the shards were perfectly distributed, the database would crash and fail over to two other machines which couldn't handle our entire social media stream. We ended up calling some guys from MongoDB in to help us troubleshoot the issue and the guy basically said "Yeah we know that's a limitation; you should probably buy more machines to distribute the load." I like the concept of Mongo, but there are other more robust NoSQL databases to choose from.


I had a similar experience in 2011 with Mongo, where we were running map reduce jobs, which Mongo advertised support for. The whole system got blocked from running the map reduce, and the 10gen consultant sighed when we told him we were running map reduce jobs.


Which isn't the best argument to make against MongoDB since you should have known - it's even part of their course curriculum - that map/reduce is not the optimal way to aggregate in MongoDB. They have their own aggregation framework (https://docs.mongodb.com/manual/core/aggregation-pipeline/).

I have no intention of defending MongoDB because what do I know, never worked with it in real life - but just out of curiosity I took the free courses they offer (https://university.mongodb.com/) and I find that a sizable share of the complaints about MongoDB come from people who don't seem to have learned much about the product they are using. It's like people complaining their new truck behaves badly in water.

A lot of critics seem to have chosen MongoDB when they needed a SQL DB from day one. If you need full flexibility to (re)combine data you need SQL, for example. A document store isn't "schema-less" at all - much of the schema is built-in and very inflexible after that.


NodeJS + MongoDB is this generation's Laurel and Hardy stack ("look at this mess you got me into"). Last generation's was PHP + MySQL.


I actually wonder what the correlation is between PHP use and MongoDB use. They both have an attitude that mistakes ease for simplicity, a philosophy that puts correctness way down the priority list, and an easy introduction with a heavy ongoing maintenance tax.


Actually PHP and MySQL is a pretty respectable stack today if you use it right.


That's because we've learned to avoid most of the pitfalls of the tools.


Not to mention that many MongoDB drivers are worse than MongoDB itself, adding insult to an already unprovoked vicious injury.

The official Java driver is the easiest way to waste otherwise useful CPU time due to its blocking nature and wasteful threading model.


Yes it is easy to use. But too bad it also fails "transactions" silently so that you don't even know if your changes were "committed" or not. Don't worry, it only happens every once in a while so it's not a big deal...

Unless you are coinbase or an organization that deals with money/bitcoins/etc and you need ACID compliant transactions so that "debits/credits" don't just magically disappear.

When the bitcoin craze was going crazy, coinbase had all kinds of problems due to their mongodb backend.
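
For contrast, here is a sketch of what an ACID transaction buys you for debits and credits, using SQLite via Python's standard library. The `accounts` table, the names, and the `transfer` helper are hypothetical; the point is that the two updates either both apply or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would overdraw src, so the earlier credit is rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit is deliberately applied first: when the debit then violates the CHECK constraint, the rollback undoes the credit as well, so money never appears or vanishes.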


I remember during peak mongo that if you weren't working on a project that used mongo, other devs looked down their noses at you.


But that was only because they had to stand on their tippy-toes and tilt their heads back to avoid drowning in PagerDuty alerts. ;-)


It's pretty easy to use... until you have to normalize data and query across one or two joins. I've been forced to build with mongo for the past few months (still not sure why) and I can't think of a single valid use-case for this rubbish.

If you need denormalized/distributed caching, Redis does a good job. If you need to store some unstructured json blobs, postgres and now sql server 2016 can do that. If you need reliable syncing for offline capable apps, you probably want CouchDB. If you need real time, use Rethink. Obviously, relational data belongs in a relational database.

I think the problem is that all of these databases do one or two things really well. Mongo tries to do all of these things, and does so very poorly.


All I ever hear is terrible things about it. I'm not a "gotta hear both sides" kind of person so that means something to me.


I (used to) hear lots of good stuff, but the type of devs were always hype driven. Asking for a reason why Mongo was used, the reply sounded just like the marketing hype on Mongo's homepage - lots of buzzwords and catchphrases ("big data", "schemaless") with no substance to the reason for choosing it.


I've just migrated one project from mongo to postgresql and I advise you to do the same. It was my mistake to use mongo, after I found a memory leak in cursors on the first day I used the db, which I reported and they fixed. It was 2015... If you have a lot of relations in your data don't use mongo, it's just hype. You will end up with collections without relations and then do joins in your code instead of having the db do it for you.
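
What "joins in your code" ends up looking like, as a sketch. The two lists stand in for collections and all field names are made up.

```python
from collections import defaultdict

# Two hypothetical "collections" with no relation enforced between them.
users = [
    {"_id": 1, "name": "alice"},
    {"_id": 2, "name": "bob"},
]
orders = [
    {"_id": 10, "user_id": 1, "total": 25},
    {"_id": 11, "user_id": 1, "total": 40},
    {"_id": 12, "user_id": 3, "total": 5},  # dangling reference: there is no user 3
]


def join_orders_to_users(users, orders):
    """Hand-rolled version of users LEFT JOIN orders ON orders.user_id = users._id."""
    by_user = defaultdict(list)
    for order in orders:
        by_user[order["user_id"]].append(order)
    return [{**user, "orders": by_user.get(user["_id"], [])} for user in users]


joined = join_orders_to_users(users, orders)
```

Note that order 12 simply disappears from the result. Nothing stopped it from being written, and nothing complains when it is skipped.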


Even if you want NoSQL I'd use RethinkDB over Mongo any day. Way better query language, real-time support, and relational/regular SQL-like stuff.

https://rethinkdb.com/docs/comparison-tables/


And the use case of this post is exactly what RethinkDB does better: "One of our services periodically polls the database and reads the list of running containers with the query..."


> don't use mongo, it's just hype

I'm kind of curious as to where this hype is. I've almost never heard anybody say anything positive about mongodb. All I ever see is people saying it's terrible / hilarious for various reasons.


Like with any online community, Hacker News can be kind of an echo chamber where groupthink reigns and alternative points of view aren't encouraged. MongoDB hype has died down here, but there are still some people that are fans.

There are some things MongoDB does fairly well:

* MongoDB is really easy to use

* Document databases can be great and flexible solutions for some kinds of projects

* Documentation is fairly good so learning the basics isn't too hard even if you know nothing about it

* scales fairly well at the initial stages

* arguably quicker to get a project off of the ground with than traditional RDBMSs, which might be the most important consideration for any startup even if a complete rewrite would eventually need to take place

That being said, I've used MongoDB significantly before and it wouldn't be my first choice for most types of new project: PostgreSQL probably would be.


About the only thing I agree with is how great their docs are.

* Mongo is only easy to learn. Beyond simple demos, it gets harder and harder to use as projects evolve, i.e. you have to do a lot of work yourself. Imo this is a common problem with nosql datastores that isn't exclusive to Mongo

* "Document databases can be great and flexible solutions for some kinds of projects": Postgresql has been able to work directly with JSON for some time now. There are also other document datastores that are more reliable than Mongo

* Scaling with Mongo is difficult, specifically the crazy setup. Even if you set it up properly, the results don't tend to match the marketing https://aphyr.com/posts/322-jepsen-mongodb-stale-reads

* "arguably quicker to get a project off of the ground with than traditional RDBMSs": unless you're using Meteor, I'm also going to disagree here. Most frameworks target a relational database by default. Developing by convention tends to get you off the ground much faster than using something more specialized and niche


But they do make great mugs. Who here doesn't have at least a couple of MongoDB mugs. I don't use MongoDB and still have a bunch from random conferences over the last 3-4 years.



Idk about hype, but for node.js virtually every tutorial uses it. I think the stereotype is that postgres locks you into a data model but your needs vary drastically. I suspect retooling the schema in mongo is harder than is widely claimed, while in pg/msql it is slightly easier than is often claimed.

I doubt they meet in the middle, but both have great use cases. scaffolding out a quick isomorphic js app is great for mongodb for example. pg is faster and more robust.

it's just tech, depends what you are optimizing for


> scales fairly well at the initial stages

Not sure what that means, but scalability is the worst thing about mongo (though my experience with mongo is all from ~2.5 ish years ago). As soon as your working set of indices gets bigger than memory, performance falls off a cliff and your entire app grinds to a halt. In my experience, mysql and postgres have a more gradual decline in performance so you have some time with only mildly degraded performance to figure out a solution (plus they have more options for tuning which can buy you more time).


The HN crowd tends to insult it, but outside of HN people hype it up. People keep talking about how much they love the MEAN stack. It's huge in the hackathon crowd, due to them sponsoring many of them, and its low learning curve.

I wish people would stop using acronyms and realize Express/Angular/Node is just as good with something other than Mongo.


I can confirm. On HN, meetups, Twitter, etc. no one talks about MEAN stack any more because of Mongo and the Angular 1/2 split (and React's popularity). In the last hackathon I went to, like most "corporate sponsored" ones, you got a special prize for using it.


Are people using React starter kits? The major issue I'd see with React at hackathons is the large amount of configuration you typically need to do before you get started. I dislike starter kits due to the additional complexity overhead, but I can imagine for a throwaway project they'd be fine.


thanks for spelling that out for me

great incentive to dive into React further and ditch Mongo


if you are just messing around you might be interested in vuejs & rethinkdb. found it a pretty intuitive stack, much easier than react for me


I think this is part of the reason why RoR, MEAN and other such frameworks are so popular. They tend to be easier to set up and many developers love that. My own theory is that there are so many bad devs out there, and that contributes to the hype because so many devs end up adopting it.

Take it with a grain of salt...


MEAN isn't as easy to set up and I would not compare it to Rails at all. Rails is a backend framework—MEAN is just a bunch of technologies used together to have an API and a SPA. I'd actually argue, from personal experience, that newbies have a lot of trouble with MEAN because while there's a CLI tool MEAN doesn't have the community support, conventions, etc. that Rails does. Plus, you have to learn too many things at once.


> outside of HN people hype it up

Just today CNBC announced its list of "the 2016 CNBC Disruptor 50 companies"[1]

MongoDB is #19. It apparently is:

   Big data's thought leader

[1] http://www.cnbc.com/2016/06/07/2016-cnbcs-disruptor-50.html


It's accurate. MongoDB is killing it in the Big Data space.

Because of its excellent integration with Spark/Hadoop and the schemaless nature it's very useful in the analytics space.


I like the Postgres/Express/Ember/Node stack, if not just for the acronym.


Their website is quite hype prone:

>The Standard for Modern Applications

>MongoDB 3.0 features performance and scalability enhancements that place MongoDB at the forefront of the database market as the standard DBMS for modern applications.

>Also included in the release is our new and highly flexible storage architecture, which dramatically expands the set of mission-critical applications that you can run on MongoDB.

>These enhancements and more allow you to use MongoDB 3.0 to build applications never before possible at efficiency levels never before attainable.


The standard for the golden age of crap.

Hang out in the freenode room for mongo and you will see that most people, working on production software, make inquiries that reflect a complete lack of knowledge and common sense, and leave you facepalming with no hope in humanity at large.


The people who wrote that are not the same people who write the database.


It's massively successful in the entry level web coding world due to a combination of good marketing and the belief that it gives you unlimited scale 'for free' and everything 'just works.'

Not saying those are impossible goals but so far no database has managed to deliver that. Building a large-scale endlessly-scalable database is still very hard and very detailed and easy to screw up.


It got pretty hyped a few years back when it was new, and 'NoSQL' stuff was just coming out.


It was the hottest thing 2-4 years ago. Hype is probably the biggest association for anyone who lived through that and didn't drink the Kool-Aid.


When it first came out it was hyped big time. Every other article was about how great Mongo was. About a year later the fallout started bubbling up. Digg went down for one or two weeks because of switching over to Mongo-- Reddit saw an influx of users and now is what it is.


I recall that Digg moved at least some part of their site to cassandra. Did they in fact do both?


The hatred for MongoDB mostly comes from the PostgreSQL supporters camp. The rest of us are just using whatever tool makes sense.

MongoDB is the fastest database I've ever used. The easiest to get running, stable and scaled out. The best documentation by far and has excellent integration e.g. Spark, Hadoop.

If you are doing Big Data it's a great tool in the arsenal.


It's fast because it doesn't actually save your data when you ask it to. It saves it later, when it feels like it, maybe. Also, the hate I have for Mongo is from first-hand experience with the wretched thing. The scaling is insane. The next jump after your first server is like 9 servers (2 repsets w/ 3 servers each w/ 3 config nodes). I don't have that kind of cash lying around for servers, especially when instance-for-instance, Postgres can easily handle 10x the write load of mongo.


RethinkDB does all of this better, and with good engineering and API design behind it.


> If you have a lot of relations in your data don't use mongo

Why would you use Mongo if you have lots of relational data? Why would you not start with a relational database for that?

I know Mongo has issues but it's never going to beat an RDBMS on relational queries.


I think this is a common pitfall though: people start off thinking they don't really have relational data and then realize they actually do. Now they have a pile of code integrated with a DB that doesn't do relations well and can't be ported easily and then encounter cool bugs like this. No bueno.


We start our apps with mongo, and design them with a migration plan to postgres. We've found it's very easy to rapidly develop the application with mongo due to its flexibility. Once we understand where our app is headed and what our relationships actually are, we pretty much pull the plug out of mongo and stick it in postgres. If you build a reasonably intelligent query wrapper it's fairly effortless. That being said, we're thinking of moving our early prototyping to Rethink now that it's made some strides.


Meanwhile, normalized data is what gives me so much flexibility when using Postgres at the start of an app. I just store my data as generically as possible and usually all I need to change under churn is the queries.

Denormalizing on day 1 (Mongo) has you making guesses about your data access patterns at the worst possible time instead of just thinking about the data itself.


> Denormalizing on day 1

This has nothing to do with database choice. This is just shitty development. It's a strawman at best.


Can you explain what you mean? As far as I know, normalizing is a nonsensical process for a document store. You can only normalize a relational schema.


> You can only normalize a relational schema.

Normalization is just a method of organization to minimize repetition of data. It has nothing to do with efficiency of operation. This is perfectly valid code:

    person = {
        _id: "person123",
        username: "lloyd-christmas"
    }

    comment = {
        _id: "comment123",
        person: "person123",
        text: "This is how I start"
    }

You don't have to do:

    person = {
        _id: "person123",
        username: "lloyd-christmas"
    }

    comment = {
        _id: "comment456",
        person: {
            _id: "person123",
            username: "lloyd-christmas"
        },
        text: "This is also valid"
    }

Sure, a join is faster than the first one where you'll have to hit the DB twice. The point is that you don't have to START with denormalizing everything. I start with normalized data and do more DB reads than I need. I figure out how the application uses my data as I go along, and denormalize the pieces I need only once I need them and am confident I won't bump into consistency issues (my username isn't updating every 5 seconds). Through this process I realize what the actual relationships are in my application and how my app functions request to request. This allows me to better structure my data. This is a quick update in mongo and usually a couple of lines of refactoring in application logic.

Obviously this is just an MCVE. My original point was that I find this to be a drastically more flexible process than starting off relational.
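
For illustration, a minimal sketch of the two shapes above using plain in-memory JavaScript rather than a real MongoDB driver. The `people` and `comments` maps and both helper functions are hypothetical stand-ins: the normalized read does two lookups (standing in for two round trips), and an embedded copy goes stale after a rename unless it is refreshed.

```javascript
// In-memory stand-ins for two collections (hypothetical; no real driver involved).
const people = new Map([
  ["person123", { _id: "person123", username: "lloyd-christmas" }],
]);
const comments = new Map([
  ["comment123", { _id: "comment123", person: "person123", text: "This is how I start" }],
]);

// Normalized read: two lookups, standing in for two round trips to the DB.
function getCommentWithAuthor(commentId) {
  const comment = comments.get(commentId);
  if (!comment) return null;
  const author = people.get(comment.person) || null;
  return { ...comment, person: author };
}

// Denormalized shape: embed a copy of the author's fields in the comment.
function denormalizeComment(commentId) {
  const comment = comments.get(commentId);
  const author = people.get(comment.person);
  return { ...comment, person: { _id: author._id, username: author.username } };
}

// Take an embedded copy, then rename the user in the one normalized place.
const embedded = denormalizeComment("comment123");
people.get("person123").username = "harry-dunne";

const joined = getCommentWithAuthor("comment123");
console.log("normalized read:", joined.person.username);
console.log("embedded copy:  ", embedded.person.username);
```

The normalized read picks up the rename for free, while the embedded copy keeps the old username until something rewrites it, which is the consistency cost the comment is weighing against the extra read.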


You can normalize data in Mongo and replicate relational database features in your application layer.

I'm just curious how that's an upside over using a relational database from the beginning when your plan is to migrate to a relational database anyway.

Switch out "Mongo" for "Postgres" in your bulk paragraph and you have the same scenario but with less work on your part and more features to help establish your data model.

One upside I can see is if you're more familiar with Mongo, so that using a relational database would slow you down.


Maybe my other comment might help clarify: https://news.ycombinator.com/item?id=11858939

The bottom level of our application layer is a query builder which is almost a drop-in replacement between mongo and postgres. By the time that layer is built out, we know what our database needs to look like. I find adding/dropping fields and models in mongo to be drastically faster than moving models around in postgres. The above example would obviously end up in the same structure, whether we started with relational or not. It was nothing more than demonstrating an iterative process where you don't need to START denormalized just because "that's why you use nosql".

We try to be as incremental as possible when building our apps, and have found that using nosql allows for 20 small refactors that often end up being 2 larger refactors with a relational db. We've just found that it ends up being a faster production process, and we end up with a much more application-specific database instead of just "This is a Person, this is an Address, this is a Comment". Sure, we know beforehand that the application will contain all those components. We don't necessarily know how they'll be used in a request-by-request basis, and whether or not they will actually end up being one-to-one, one-to-many, or many-to-many.


"It has nothing to do with efficiency of operation."

To nit-pick, I think normalisation improves the efficiency of updates, as you only have to modify the one place where the piece of data lies.


This feels like you lack an architect who can see the bigger picture of your applications. I don't mean that insultingly, but with experience you tend not to need that second system syndrome. Do you find there's less rewriting over time as you become more experienced? Or is it just the kind of projects you work on?


> This feels like you lack an architect who can see the bigger picture of your applications.

Quite the opposite. We feel that going in with the assumption that you know where the application is going to end up is hard-headed. However, acknowledging that the situation will definitely change doesn't absolve you of planning it out properly given the information currently available to you.

> Or is it just the kind of projects you work on?

We build mainly internal facing or b2b apps in the medtech space. Given that we need to integrate with larger players that don't really care much for small businesses, we can receive slow response times for data/api requests from any external sources we deal with.

e.g., recently we built an application for a home-town pharmacy where we were forced to use two databases: one under our control and one which was controlled by a pharmacy management system. We needed to update certain models in their database while reading from other ones. They promised to build out a few stored procedures that we needed. They flat out lied on a couple, and then quoted us an 8-month turnaround on the other ones. We'd be stuck with them regardless; expecting a curveball like that allows us to rapidly adapt.

Obviously, the pharmacy's business model doesn't change very rapidly. Iterating over the app through a few of their business cycles tends to give you enough knowledge of what you can build them, as well as what they really want. Regardless of how much time we spend planning with them, they'll always leave us with some form of an XY problem that we'll only understand after they use the application for an extended period.

We don't deal with web-scale, so the problems generally encountered in the mongo complaint arena tend to be irrelevant to us. Given our use case, I think the decision is pretty reasonable.

> Do you find there's less rewriting over time as you become more experienced?

We've factored that into our development strategy, which is why I mentioned the query wrapper. That query wrapper paired with a DAO level that's reasonably database agnostic makes our transition fairly simple and quick.
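
A rough sketch of what such a query wrapper could look like. This is not the commenter's actual code: the query shape `{ from, where }` and the functions `toMongo` and `toPostgres` are invented for illustration, and only equality conditions are handled. The idea is that the application builds one backend-neutral description and each backend gets its own small translator.

```javascript
// Hypothetical backend-neutral query description: { from, where: { field: value } }.

// Equality conditions map directly onto a Mongo-style filter document.
function toMongo(query) {
  return { collection: query.from, filter: { ...query.where } };
}

// The same description rendered as a parameterized SQL statement.
function toPostgres(query) {
  const fields = Object.keys(query.where);
  // Identifiers are interpolated for brevity; a real wrapper would need to
  // validate or quote table and column names.
  const conditions = fields.map((field, i) => `${field} = $${i + 1}`);
  const whereClause = conditions.length ? ` WHERE ${conditions.join(" AND ")}` : "";
  return {
    text: `SELECT * FROM ${query.from}${whereClause}`,
    values: fields.map((field) => query.where[field]),
  };
}

const query = { from: "comments", where: { person: "person123" } };
const mongoQuery = toMongo(query);
const pgQuery = toPostgres(query);

console.log(mongoQuery);
console.log(pgQuery);
```

With the DAO layer calling only the neutral form, swapping backends comes down to swapping the translator, at least for the subset of queries both sides can express.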


I've actually observed the opposite, that the more experienced you get, the more rewriting occurs and the less you tend to plan things out ahead of time, until you're Google/Facebook level and just accept that you will be rewriting constantly.


I'm not necessarily sure that you actually rewrite more, but that you're more willing to accept the current status and plan on some rewrites.


That's not what "second system syndrome" means.


Because then you start moving your RDBMS logic into the application layer. And AFAIK, this is not the place for it.


It will never beat it on non-relational queries either. To answer why: it was suggested by someone who said it's the new modern db standard and that NoSQL is the clear winner, and who then showed me the comparison with SQL queries on their site. It all looked like a substitute for an RDBMS, but it's not.


PostgreSQL and jsonb might make it a more seamless migration than some may think:

https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_...

JSONB values are indexable and queryable, so there aren't very many downsides.


> If you have a lot of relations in your data don't use mongo, it's just hype. You will end up with collections without relations and then do joins in your code instead of having db do it for you.

So... you are not against MongoDB but against NoSQL in general? I've used MongoDB and I've never ended up with lots of joins in my code. But I guess it all depends on the use case and how you've structured your data. Document databases are not a silver bullet.


ArangoDB, in contrast to MongoDB, is an ACID document store that also supports graphs.


> If you have a lot of relations in your data don't use mongo

Perhaps you should have been using a relational database from the get go.

Sounds more like your issue, not mongo's.


Contrary to what the name implies, most relational databases don't handle querying relation-heavy data well. If you need to hit plenty of relations in your queries, instead consider something that is optimized for that, like a graph database (or multi-model including graph).


I still want to know how anybody is making money off of data that doesn't have a bunch of relationships in it.


If you're currently using MongoDB in your stack and are finding yourselves outgrowing it or worried that an issue like this might pop up, you owe it to yourself to check out RethinkDB:

https://rethinkdb.com/

It's quite possibly the best document store out right now. Many others in this thread have said good things about it, but give it a try and you'll see.

Here's a technical comparison of RethinkDB and Mongo: https://rethinkdb.com/docs/comparison-tables/

Here's the aphyr review of RethinkDB (based on 2.2.3): https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfigu...


How does it compare to Couchbase? That seems to be lighting the world on fire in that space lately.


I'm not sure if lack of overbearing marketing speak counts for something, but RethinkDB definitely has that going for it.

I'm not an expert on couchbase (and neither on RethinkDB, to be frank, though I am a huge fan), but here's what RethinkDB has going for it:

- Changefeeds - easily open a persistent connection to the server and get updates when the results of an almost arbitrary query change.

- Joins

- Expressive query language that is pretty functionally minded, really shines in their clojure/haskell drivers

- Excellent client libraries, well maintained

- Geospatial queries/objects

- Amazing admin interface (it has been amazing for a long time, too, not a recent change)

- First class consideration of replication & sharding (it is not a bolt-on in any way shape or form)

- API-driven cluster configuration

- API driven permissions management (this is relatively new)

- Excellent, easy to follow documentation

There are more things, but this is just what I can think of off the top of my head.

The team at RethinkDB is also just great -- I've met them in person and gotten help from them and they're straight shooters.

They've also got this great project coming up called Horizon: https://www.youtube.com/watch?v=Sb1lH5mvYmU

Also they have a video up with a member of the team building a realtime game with React Native: https://www.youtube.com/watch?v=xRK0SYSgVF0

Maybe someone who is very familiar with couchbase can help make a list... I'll start it off:

- Custom query language

- First class consideration for scale -- replication and sharding


I'm ex-Couchbase, so I can probably give a reasonably informed but independent view on this.

Firstly, regarding the marketing, it may not have been to many people's tastes - but it definitely worked, and achieved a lot of what was set out in terms of raising the awareness of what was a decent product that wasn't as well known as its competitors. There may be cases where people avoid it because they don't like the marketing, but the reality, having seen its effect, is they are in the minority, and would probably serve themselves better by assessing products based on technology rather than spiel.

Now, on the actual technology!

What Couchbase has historically been good at is highly scalable Key-Value access, at very high performance and low latency. Performance is comparable to Redis, but CB has much more mature sharding, clustering and HA, e.g. fully online growing/shrinking of the cluster, protection from node failures, rack/zone failures and data center failures. Redis may be a good fit for single-machine caching situations, and also has its own advantages in terms of its data structure support, etc.

Quality of SDKs is pretty subjective, but I'd say the 2.x re-write of Couchbase's SDKs makes them very solid. The Java SDK in particular is extremely good both in performance and by providing native RxJava interfaces.

In terms of query interface, there's geospatial and a new freetext capability on the way.

Couchbase chose to go down the route of a SQL based interface as their main query language. This seems to be a bit love/hate with developers, with some delighted and some perplexed. Maybe for devs, higher level interfaces like Spring are increasingly important anyway?

The native interface being SQL based is usually very popular with the BI / Reporting side of things.

Changefeeds (continuous query?) is a feature not in Couchbase which I would very much like to see in the future. One thing I would say is that it's something you have to be very careful in the design of to ensure scalability and performance. Consistency is something which would obviously need thought as well.


horizon is already here https://horizon.io/


RethinkDB's query language is infinitely better (among many things).


I assume Rethinkdb would not have this bug, could anyone from the team confirm this is the case?


Wouldn't that be like saying "Please confirm that MySQL doesn't have this Filemaker bug"?

Correct me if I'm wrong, but it's a completely different database.


A lot of Mongo DB bashing on HN. We use it and I love it. Of course we have a dataset suited perfectly for Mongo - large documents with little relational data. We paid $0 and quickly and easily configured a 3 node HA cluster that is easy to maintain and performs great.

Remember, not all software needs to scale to millions of users so something affordable and easy to install, use, and maintain makes a lot of sense. Long story short, use the best tool for the job.


This has also been my experience. Millions of large documents on a single (beefy) node with a single user it's been fine. Although, the sysadmins had previously left me with flat file xml on shared storage so the bar was pretty low.


Ha ha, had to scroll all the way down to find a positive comment. It's actually a great paradigm; not every problem fits into a relational box.


Oh, the fud of it.

The behavior is well documented here https://jira.mongodb.org/browse/SERVER-14766

and in the linked issues. Seasoned users of mongodb know to structure their queries to avoid depending on a cursor if the collection may be concurrently updated by another process.

The usual pattern is to re-query the db in cases where your cursor may have gone stale. This tends to be habit due to the 10-minute cursor timeout default.
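
To make the failure mode concrete, here is a toy model in plain JavaScript. It is not MongoDB's implementation: the `docs` map, the `[state, _id]` index entries and the `scan` generator are all made up for illustration. It shows one way a scan that walks a sorted index can skip a document that matched the query both before and after a concurrent update, and why re-querying picks it up.

```javascript
// Toy model only: a "collection" plus a sorted secondary index on `state`.
const docs = new Map([
  ["a", { _id: "a", state: "queued" }],
  ["b", { _id: "b", state: "running" }],
  ["c", { _id: "c", state: "running" }],
]);

// Order index entries by state, then by _id.
function compareEntries(x, y) {
  if (x[0] !== y[0]) return x[0] < y[0] ? -1 : 1;
  if (x[1] !== y[1]) return x[1] < y[1] ? -1 : 1;
  return 0;
}

// Index entries are [state, _id] pairs, rebuilt in sorted order on demand.
function indexEntries() {
  return [...docs.values()].map((d) => [d.state, d._id]).sort(compareEntries);
}

// A cursor that remembers the last index entry it returned and resumes after it.
function* scan(wantedStates, betweenSteps) {
  let last = null;
  let step = 0;
  while (true) {
    const next = indexEntries().find(
      (e) => (last === null || compareEntries(e, last) > 0) && wantedStates.has(e[0])
    );
    if (!next) return;
    last = next;
    yield docs.get(next[1]);
    step += 1;
    betweenSteps(step); // simulate another writer running between reads
  }
}

const wanted = new Set(["queued", "running"]);

// After two documents have been read, "c" moves to an index key behind the cursor.
const seen = [...scan(wanted, (step) => {
  if (step === 2) docs.get("c").state = "queued";
})].map((d) => d._id);

// Running the query again, with no concurrent writer, returns every match.
const requeried = [...scan(wanted, () => {})].map((d) => d._id);

console.log("first scan:", seen);
console.log("re-query:  ", requeried);
```

In this model the first scan returns only `a` and `b`, even though `c` matched the whole time; the second scan returns all three.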

MongoDB may not be perfect, but like any tool, if you know its limitations it can be extremely useful, and it certainly is way more approachable for programmers who do not have the luxury of learning all the voodoo and lore that surrounds SQL-based relational DBs.

Look for some rational discussion at the bottom of this mongo hatefest!


I wouldn't call a JIRA ticket good documentation.

While I agree that it's good to know the limitations of the tools you chose, those limitations should be clearly spelled out in the documentation.

I don't think most programmers have the luxury of learning all the voodoo and lore that surrounds MongoDB from JIRA tickets and blog posts.


> I don't think most programmers have the luxury of learning all the voodoo and lore that surrounds MongoDB from JIRA tickets and blog posts.

That's how I learned everything I know about most FOSS products I have encountered - through the code pages and social media surrounding the project.

Pretty much everything about the mongodb hate derives from their marketing and sales. The truth is, they've obviously stumbled onto something the market wants, otherwise they would never have become so successful.

For me, as a long-time programmer with no database experience, the mental mapping of JSON constructs as both data and query language was far easier for me to absorb than the relational model, which didn't fit the paradigms that I was used to.

At my present gig, we've used Mongo DB for two years, scaling up to quite a large production setup. Like any technology it has strengths and weaknesses, but it has not been the utter failure that readers of Hacker News would be led to expect. We adopted it knowing quite a bit about its history, and it has turned out to be an excellent choice that has held up over time.

Periodically we've considered switching to postgres, and we may do so for part of our stack. But for the core jobs of data collection and batch processing data with fluid schema, I'm pretty sure we will stick with mongodb for the duration.

It's just a tool, folks.


Uh... things that seem to work fine in Mongo under light load tend to fall over during heavy load. Things like this problem. Like key reordering: https://jeremywsherman.com/blog/2013/04/23/key-reordering-ru...

Yes, with enough extremely careful coding knowing exactly all the internals of the database, you can probably avoid these giant gotchas. But that's a hell of a lot of work for a DB that's supposed to be easy.

And then you're still dealing with a database without ACID guarantees. No transactions, very little atomicity.... and good luck if your server crashes... we've had multiple customers have their DB corrupted that way.

MongoDB is only good for storing data you don't really care about... in which case, why are you bothering to store it at all?

What most people end up storing in Mongo: strongly schema'ed, relational data that is critical to their application. This is exactly what mongo is not for.


> The usual pattern is to re-query the db in cases where your cursor may have gone stale.

You mean, every time you query the DB? Do you also need to re-query the re-queries?


Strongly biased comment here, but hope it's useful.

Have you tried ToroDB (https://github.com/torodb/torodb)? It still has a lot of room for improvement, but it basically gives you what MongoDB does (even the same API at the wire level) while transforming data into a relational form. Completely automatically, no need to design the schema. It uses Postgres, but it is far better than JSONB alone, as it maps data to relational tables and offers a MongoDB-compatible API.

Needless to say, queries and cursors run under REPEATABLE READ isolation mode, which means that the problem stated by OP will never happen here. Problem solved.

Please give it a try and contribute to its development, even just with providing feedback.

P.S. ToroDB developer here :)


How does ToroDB handle sharding across multiple instances?


Right now ToroDB handles sharding at the backend (RDBMS) level, with those dbs that support that. There's currently a Greenplum-based backend in the works, that obviously handles sharding by itself. Also CitusDB is on the roadmap.

At a later release, we also plan to natively support MongoDB's sharding protocol.


My general feeling is that MongoDB was designed by people who hadn't designed a database before, and marketed to people who didn't know how to use one.

Its marketing was pretty silly about all the various things it would do, when it didn't even have a reliable storage engine.

Its defaults at launch would consider a write stored when it was buffered for send on the client, which is nuts. There are lots of ways to solve the problems that people use MongoDB for, without all of the issues it brings.


I really agree with your sentiments; that first paragraph is a great quote. I grew quite averse to MongoDB after researching it. While I never found this specific caveat, I found other very worrying decisions.

> reliable storage engine

By "reliable" I assume you mean "consistent?" While MongoDB claims that it's CP (which it's not, as per the article) there's nothing wrong with inconsistent databases (AP, e.g. CouchDB). Mathematically there is no reason for MongoDB to behave like this. It's fundamentally broken; it's neither AP nor CP.


I actually mean reliable. It's probably different now, but at launch, the defaults were fsync'ing every 30 seconds or so. It would literally just apply the change to a memory-mapped buffer and just fsync it once in a while.
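
A small sketch of the distinction being drawn, using Node's `fs` module. This is not how MongoDB's storage engine is written (it uses plain `write` calls rather than a memory-mapped file), and the file name and helper functions are invented for the example. It only shows the difference between acknowledging a write once the OS has buffered it and acknowledging it after an explicit fsync.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Scratch file in a fresh temp directory (name is arbitrary).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "durability-"));
const file = path.join(dir, "journal.log");
const fd = fs.openSync(file, "a");

// "Fast" write: returns once the OS has accepted the bytes into its cache.
// A crash or power loss before the next flush can lose them.
function writeBuffered(record) {
  fs.writeSync(fd, record + "\n");
}

// "Durable" write: also asks the OS to flush the file to stable storage
// before returning, which costs a lot more per write.
function writeDurable(record) {
  fs.writeSync(fd, record + "\n");
  fs.fsyncSync(fd);
}

writeBuffered("acknowledged, not yet flushed");
writeDurable("acknowledged after fsync");
fs.closeSync(fd);

const lines = fs.readFileSync(file, "utf8").trim().split("\n");
console.log(lines);
```

Both records are readable afterwards on a healthy machine, which is why the buffered path looks fine in a benchmark; the difference only shows up when the machine dies between the write and the next flush.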

They did that so they could look good in benchmarks, and it's why they recommended so strongly that your working set completely fit in RAM or else things would fall apart (pro-tip, any system that recommends that has a poorly designed storage engine).

They also screwed up the consistent side of things as well.


I have moved from Mongo to Cassandra in a financial time series context, and it's what I should have done straight from the get-go. I don't see Cassandra as that much more difficult to set up than Mongo, certainly no harder than Postgres IMHO, even in a cluster, and what you get leaves everything else in the dust if you can wrap your mind around its key-key-value store engine. It brings enormous benefits to a huge class of queries that are common in timeseries, logs, chats etc, and with it, no-single-point-of-failure robustness, and real-deal scalability. I literally saw a 20x performance improvement on range queries. Cannot recommend it more (and no, I have no affiliation to Datastax).


Genuinely curious: when you say "it brings enormous benefits to a huge class of queries that are common in timeseries", what are you referring to, exactly?

I run Cassandra in production and I love its operational simplicity, scale-out design, and write performance. But I think its support for time series is perhaps over-hyped. To me, it seems the only queries you can run in Cassandra are a key lookup (partition key row get) and a column slice (partition key row get filtered by an ordered range of columns). This allows for a certain time series use case e.g. where each row represents exactly one series, and where the only thing you want to do with a series is to get its raw values. But it doesn't allow for many of the things I personally think of when I think about "time series queries", e.g. resampling, aggregates, rollups, and the like.


I am referring to anything that resembles a range query, ie, where you require a bunch of contiguous information queried on a single key. Think "give me all of this person's chat entries from x time to y time", or indeed "give me all this topic's comment entries from x time to y time" (but not both - only one of the above would be efficiently stored - you decide which it would be).

Cassandra, as you know, forces a certain amount of "low level awareness" requirement on the programmer because to tap into its uniqueness, you need to know how you will query stuff, so that Cassandra will ensure that the most common range queries are contiguously stored in rows. All other databases hide the on-disk storage order from you in an abstraction, and you can find atomisation causing inefficiency. Cassandra forces you to think about it, and in return, guarantees contiguous storage order on disk along one of your keys so that along that key, retrieval is lightning fast as it requires only one pass.
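A toy Python model of the idea, in case it helps. It is purely illustrative and has nothing to do with Cassandra's real storage engine; `PartitionedStore` and its methods are names I made up. Each partition key owns a list kept sorted by timestamp, so a time range on one key is a single contiguous slice.

```python
import bisect
from collections import defaultdict

class PartitionedStore:
    """Toy model: partition key -> rows kept sorted by (timestamp, value)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, ts, value):
        # Keep each partition ordered by timestamp as rows arrive.
        bisect.insort(self._partitions[partition_key], (ts, value))

    def range_query(self, partition_key, start_ts, end_ts):
        """Return rows with start_ts <= ts < end_ts as one contiguous slice."""
        rows = self._partitions.get(partition_key, [])
        lo = bisect.bisect_left(rows, (start_ts,))
        hi = bisect.bisect_left(rows, (end_ts,))
        return rows[lo:hi]

store = PartitionedStore()
store.insert("alice", 3, "c")
store.insert("alice", 1, "a")
store.insert("alice", 2, "b")
store.insert("bob", 2, "x")
print(store.range_query("alice", 2, 4))  # [(2, 'b'), (3, 'c')]
```

Note that the store is only ordered along the one key you picked. Asking for "everything from every person between x and y" means visiting every partition, which is the "but not both" tradeoff described above.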

Basically, both spinning disks and SSDs are, in essence, 1D media (i.e. they have a lot in common with tape) in the sense that along one dimension you can read stuff massively fast, but as soon as you need to seek (i.e. start using dimension 2), even on an SSD, your performance dramatically declines. Cassandra forces you to think about your queries so that they will be "aligned" along the most efficient direction on disk.

Now, agreed that if your queries cannot be aligned along said direction, then Cassandra drops to being no better than all the others, and penalises you with some complexity. That includes some examples of aggregates, resampling etc (though I would argue that the order-of-magnitude faster contiguous read still helps these). Some of this can be mitigated with denormalisation, i.e. storing stuff more than once, in transposed or sub-sampled orders, something that relational DB purists will hate, with some justification (potential for inconsistency).

FWIW Riak TS sounds promising with automatic "blob" style storage etc and resampling capabilities, which might take Cassandra on quite explicitly and in a higher-level, more convenient way. I am about to evaluate it because I agree with you that the resampling capability in particular could be better supported in Cassandra, though ultimately, both databases will still be limited by the underlying D1 v D2 "contiguous v seek" capabilities of the storage, so I'm not expecting miracles from Riak.

By the way, I'm not even touching on Cassandra's scale-out ease. More perf needed? Literally just add boxes. It would be unfair not to mention the cost of this, though: Cassandra's node-level consistency tradeoffs for very recently added data, which is, if I recall correctly, why Facebook went to HBase. You can force consistency at the query level, but performance can suffer.


After some truly horrific experiences with Riak K/V, especially combined with Riak Solr, I won't touch anything from Basho with a ten-foot pole. Not sure what's going on over there, but the reality of Riak in production was miles away from what Basho's sales claimed was possible. And yes, we even spent about 4 months working with their tech support. It almost seems that "It's based on Erlang thus it scales" was the entirety of their design work.

I've also worked with Cassandra and have nothing but good to say about it; it did what we asked right out of the box. DataStax was really helpful as well.

--

And I have no affiliation with either Basho or DataStax, just really happy with one product and completely blown away by the poor performance of the other.


Did you evaluate Postgres with something like Citus to handle timeseries data?


Weird to see that Mongo is still around. We started to use it on a project ~4 years ago. Easy install, but that's where the problems started. Overall terrible experience. Low performance, messy syntax, unreadable documentation.

They seem to still have this outstanding marketing team.


Should an infrastructure company be advertising the fact that it didn't research the technology it chose to use to build its own infrastructure?

All these people saying Mongo is garbage are likely neckbeard sysadmins. Unless you're hiring database admins and sysadmins, Postgres (unless managed - then you have a different set of scaling problems) or any other traditional SQL store is not a viable alternative. This author uses Bigtable as a point of comparison. Stay tuned for his next blog post comparing IIS to Cloudflare.

Almost every blog post titled "why we're moving from Mongo to X" or "Top 10 reasons to avoid Mongo" could have been prevented with a little bit of research. People have spent their entire lives working in the SQL world, so throw something new at them and they reject it like the plague. Postgres is only good now because they had to add some of the features in order to compete with Mongo. Postgres has been around since 1996 and you're only now using it? Tell me more about how awesome it is.


My goal in writing this post was not to convince people to use or not use MongoDB, but to document an edge case that may affect people who happen to use it for whatever reason, which as far as I could tell was inadequately documented elsewhere.


Only the first line was directed at you - and it was more in jest. Everything else was directed more at the other commenters and Mongo detractors in general.


While I love to hate on MongoDB as much as the next guy, this behavior is consistent with read-committed isolation. You'd have to be using Serializable isolation in an RDBMS to avoid this anomaly.


I think this is incorrect, but it's not as simple as the other replies are making it out to be.

Under read-committed isolation, within a single operation, you must not be able to see inconsistent data. So if you do "SELECT *" on a table while rows are being updated, you're guaranteed to always see either the old value or the new value. But if you do two separate statements, "SELECT * WHERE value='new'" and "SELECT * WHERE value='old'" in the same transaction, you may not see the row because its value could have changed. Serializable isolation prevents this case, typically by holding locks until the transaction commits.

It gets messy because the ANSI SQL isolation levels are of course defined in terms of SQL statements, which don't map perfectly to the operations that a MongoDB client can do. Mongo apparently treats an "index scan" as a sequence of many individual operations, not as a single read. So you could argue that it technically obeys read-committed isolation, but it definitely violates the spirit.


This is worse than read-committed because you're not even seeing the old state of the document. If an update moves a document around within the results, and it ends up in the portion you've already read, you just don't see it at all.
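Here's a toy Python simulation of how that happens. It's my own simplified model of a cursor walking a sorted index, not MongoDB's actual implementation, and `scan_index` and its parameters are hypothetical. The cursor remembers the last entry it returned and resumes after it; a writer moves one document to a smaller key partway through.

```python
import bisect

def scan_index(index, update_after, doc_id, new_key):
    """Walk (key, doc_id) entries in key order, one entry per step.

    After `update_after` entries have been returned, a concurrent writer
    moves `doc_id` to `new_key`. Pass update_after=None for no update.
    Returns the doc ids the scan actually saw.
    """
    entries = sorted(index)
    seen, last, steps = [], None, 0
    while True:
        if steps == update_after:
            # Concurrent update: drop the old index entry, add the new one.
            entries = sorted(
                [(k, d) for k, d in entries if d != doc_id] + [(new_key, doc_id)]
            )
        # Resume just past the last entry the cursor returned.
        pos = 0 if last is None else bisect.bisect_right(entries, last)
        if pos >= len(entries):
            return seen
        last = entries[pos]
        seen.append(last[1])
        steps += 1

index = [(10, "a"), (20, "b"), (30, "c"), (40, "d")]
print(scan_index(index, None, "d", 5))  # ['a', 'b', 'c', 'd']
print(scan_index(index, 2, "d", 5))     # ['a', 'b', 'c']
```

In the second run "d" exists the whole time, but it jumps from ahead of the cursor to behind it, so the scan returns neither the old version nor the new one.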


The article suggests that tuples being moved to different storage locations can cause them to not show up in a table scan.

No such thing can happen in a sane RDBMS, no matter the transaction isolation level.


In postgres (and a fair number of other databases) you'll not see that anomaly, even with read committed. Usually you'll want to have stricter semantics for an individual query, than for the whole transaction.


With read-committed you see the old state.


Quoting from the very first paragraph of the blog post:

> Specifically, if a document is updated while the query is running, MongoDB may not return it from the query — even if it matches both before and after the update!

How's that compatible with READ COMMITTED isolation level?

