Said it before, will say it again... "MongoDB is the core piece of architectural rot in every single teetering and broken data platform I've worked with."
The fundamental problem is that MongoDB provides almost no stable semantics to build something deterministic and reliable on top of it.
As a guy who works on ACID database internals, I'm appalled that people use MongoDB. You want a document store? Use Postgres. Why on earth would you use a database that makes so little in the way of guarantees about what results you get from it? I think most people have really low load and concurrency, so things seem to work. When things get busier you're in for a world of pain. Look, I get that it's easy to use and easy to get started with, but you're going to pay for all of that later.
> Why on earth would you use a database that makes so little in the way of guarantees about what results you get from it?
Because some people can't stand having to work with SQL, migrations, schema and constraints, it's as simple as that. (That's not my opinion, that's just the rationale behind MongoDB.) Even if you use Postgres with the JSON column type, you still need to write SQL queries and schemas.
In the context of analytics it might make sense; I'm not a big-data analyst, but I've seen MongoDB used to centralize logs.
My company used mongo for years before we got our shit together.
Schemas were always implicit (until we got our shit together and started defining and enforcing them with Python Schematics).
Migrations were crazy scripts you run in prod or hacks you stick into your code to "transition".
And yes, surprise constraints left and right causing awful anti-patterns. One-character key names to save disk. Hashed values for indexed keys to save memory. Awkward structuring to improve query performance.
The worst part is, we now have tons of important data in these databases and almost no one understands the legacy crazy app logic that makes them tick.
> used mongo for years before we got our shit together.
That's actually a legit use case. Use MongoDB while you get your shit together. I use global variables while I'm noodling around in code. Eventually I refactor.
I think this is a recipe for disaster. First, there are basic things that you should do from the get-go, e.g. not using globals. Second, the problem with "eventually I'll do it right" is that by that time, your stuff is out in the open, used by clients and heavily depended upon, and you have no way of refactoring. A company that uses a bad piece of technology will suffer many years before they could replace it.
Depends on the language, I suppose. I'm more productive with Python when I write everything procedurally and refactor into functions, classes, etc. every dozen lines or so. It's more fun than writing UML diagrams (and seems to produce better code, too!).
Or do you think so long that your head aches and your colleague Hephaestus splits you open to find a fully-formed cooperative multiple inheritance hierarchy?
The problem is, in a great portion of real world projects "eventually" never comes and there's just no time for any major refactoring or replacing technologies since you are too busy implementing the feature that was needed two weeks ago.
I've often dreamed of a specific type of software built and released as "prototypeware", where any app created using it will have certain built-in scaling limits—and going past them will irrevocably force the app into a read-only mode. It would warn anyone monitoring it well in advance of hitting such a limit, of course. But there'd be no way to just slide the limit upward or otherwise tarry. It'd force the migration to something better just as if it were a Big Customer with Enterprise Compliance Demands.
If an enforceable mechanism like that existed, I'd be a lot more confident in mocking things up. Stick SQLite in for the database, munge HTML and Javascript together, whatever—it's literally going to slap away the hand of anyone who tries to use it on a production workload, so why not?
(Going further, it'd be interesting to create some sort of quagmire of a software license, specifically for prototypeware, such that you'd be forced to rewrite all the prototype code instead of reusing even a hair of it in production. Maybe something like reassigning the IP to a trust, with the trust having an obligation to sue anyone and everyone who tries to create derivative works of the code they've been handed?)
This will not work. The whole "prototype" idea assumes that once you grow out of the "prototype" phase you have the time, money, manpower, etc. to rewrite the whole thing on solid, powerful technology and tools. That is, more often than not, not the case.
The first problem is that every tool has demands, especially the limited ones, and you end up writing your application around those limits and demands, using platform-specific code that will have to be discarded and re-written come the migration.
The second problem is that these tools dictate design, and once you try migrating, you still have an application designed around the prototype tools, which make a lot of concessions and have design flaws because of that.
Finally, I've never understood the need for learning a specific tool, platform or language for "rapid prototyping". Use the tools you will use eventually, it's not that building something in, say, Java from scratch will take an order of magnitude more time and effort than building it on Node.js, despite all the hype, especially if you're a Java shop.
> it's not that building something in, say, Java from scratch will take an order of magnitude more time and effort than building it on Node.js, despite all the hype, especially if you're a Java shop.
I think we're picturing different things here. You're picturing having software engineers make the prototype, and then having the same engineers do the final implementation. Meanwhile, I'm picturing two different teams, with different competencies—one who knows a prototyping toolchain backward and forward and is extremely productive in it, and the other who knows a solidly-architected platform just as well.
The classical pipeline in the animation industry is to have two separate "teams" of artists. One team does concept illustration and storyboarding, and the other does keyframe animation and in-betweening. The first of the two teams is essentially a team of prototypers. Their output is a product which stands on its own for internal evaluation purposes—but which isn't commercially viable "in production." (Nobody really wants to watch 1FPS sketches.) So, after the storyboarding is complete, the whole product is redone by the actual animators into the more familiar product of 24FPS tweened vector-lines or CGI model-joint movements.
The more familiar case of this for web development is where the "prototype" is a PSD file. Professional capital-D Designers are usually Photoshop experts—they're very productive in it, and can mock up something that can be evaluated for being "what the customer wants" quickly, with rapid iteration if it's not right. Once they've got the customer's sign-off, their output product—their prototype—can be tossed over to development staff to "make it work." (There are also an increasing number of interaction-design prototyping apps targeting the same set of designers, under the theory that they'll be able to become productive in quickly iterating the "feeling" of an app with a customer in the way they're already doing with the "look" of the app. I haven't met a designer that uses one of these professionally, but I think that's mostly because there aren't any of these yet well-known enough to be taught in art schools.)
But when it comes to workflow and use-case design, we don't really see the equivalent pipeline. Looking through the lens of separated "prototyper" and "engineer" roles, there are clearly tons of software-development tools that were intended to be used purely by "prototypers": Rails' view scaffolding, for example. But since this role isn't separate, these things get used by engineers, and sneered at, since, as you said, it's no more effort—when you're already an engineer—to just engineer the thing right from the beginning.
Interestingly, all of the true examples of workflow prototyping I can think of come from the specific domain of game development—but even there, nobody seems to realize that prototyping is the goal of these tools, and tries to misuse them as "production" tools. RPG Maker, seen as a tool for making a commercial RPG, is total crap. RPG Maker, seen as a tool for prototyping an RPG, is an excellent tool. Its output is effectively a sketch, a cartoon in the classical sense:
> The concept [of a cartoon] originated in the Middle Ages and first described a preparatory drawing for a piece of art, such as a painting, fresco, tapestry, or stained glass window.
A cartoon is a prototype used to communicate intent. Yes, you (as the producer of the finished piece) can cartoon together with a client to iterate on a proposal. But much more interestingly, a client can learn to cartoon on their own—and then, in place of a long design document, they can submit their cartoon to you. An RPG Maker game project is the best possible thing I could hope to receive as a design proposal from a client asking for me to make an RPG. It forces all the same decisions to be made that making the actual commercial game does—and thus embeds the answers to those decisions in the product—but it doesn't require the same skillset to create that the commercial game does, so the client can do it themselves. The prototyping tool, here, is doing the "iterating on a design together" job of the designer for them.
We do have one common prototyping tool in the software world—Excel. A complex Excel spreadsheet is a cartoon of a business process, that nearly anyone can make. We as engineers might hate them, because people generally have no sense of project organization when making them—but every project to convert an Excel "app" will take far less time than one that involves collecting the business requirements yourself. The decisions have already been made, and codified, into the spreadsheet. You don't have to sit there forcing the client to make them. The process of cartooning has forced them to do it themselves.
---
To summarize: software prototyping tools aren't for engineers—if you have an engineer's mindset, you'll prototype at the speed of sound engineering practice, so prototype tools won't be any help to you; and you'll be more familiar with the production-quality tools anyway, so you'll be more productive in those than with the prototyping toolset.
But software prototyping tools definitely have uses: they can help designers to iterate on a "functional mock-up" to capture a client's intent; or they can even help clients to create those same mock-ups on their own. This is why "prototypeware" makes sense as software—but also why it should be self-limiting from being used in production. The prototype app wasn't created by someone with an engineering mindset—so there's no way it could end up well-engineered. Its purpose is to serve as a cartoon, a communication to an engineer; not to function in production on its own.
(Mind you, prototypeware could be made to function as an MVP in closed-alpha test scenarios, in the same way that the MVPs of many startups are actually backed by manual human action in their early stages. The point there is to test the correctness of the codified business process, rather than to support a production workload.)
There is nothing as long lasting as a temporary solution.
I've just fixed up some code marked "proof of concept" that had been in production for a decade...
Admittedly some people's PoC work is better than what some consider to be release ready, but still this was not intended to be in that state for that long.
I don't think that is an apt comparison. Replacing your database backend, at the minimum, usually requires a massive migration of data, and possibly even changes to your entire architecture.
A refactoring does not change behavior, and can be performed in minor -- and in your example of a global variable, perhaps even trivial -- increments.
Sure, there's a continuum of refactoring, from trivial to complete re-write.
Any time the data schema(s) change, you need to migrate. I'll bet that even when sticking with the same database flavor you'll need to migrate a handful of times over the first few months. Requirements change, blah, blah. After the first couple migrations, you refactor to make that less painful. Eventually it might get to the point that your persistence layer is fairly abstracted and you can change databases without ripping apart everything else. Doesn't happen with every project, but sometimes.
My concern would be whether Mongo will cause me to lose data... "To recap: MongoDB is neither AP nor CP. The defaults can cause significant loss of acknowledged writes. The strongest consistency offered has bugs which cause false acknowledgements, and even if they're fixed, doesn't prevent false failures." https://aphyr.com/posts/284-call-me-maybe-mongodb
...or get my data corrupted: "When MongoDB is all you have, it’s a cache with no backing store behind it. It will become inconsistent. Not eventually consistent — just plain, flat-out inconsistent, for all time. At that point, you have no options. Not even a nuclear one. You have no way to regenerate the data in a consistent state." http://www.sarahmei.com/blog/2013/11/11/why-you-should-never...
When you refactor or rewrite your code, you have the old code in version control, can write tests to confirm that it still works as expected, and there's no inherent time pressure.
If you pick an unreliable database and your data has been or is being lost and/or corrupted, it's more like a "try to stop the bleeding before the patient dies" situation.
That's not the time I want to be considering changing databases.
Likely better to scrape the data out through the app (if it's a web app) than to try to talk to that sort of database directly. The app would at least put names to everything.
I often liken NoSQL databases to dynamically typed languages.
With a NoSQL database, you have an implicit schema, but it will only be enforced and fail at runtime - when your code expects a field but failed to find it, for instance.
With a dynamically typed language, you have implicit types, but only enforced at runtime - when your code expects a value to be an int but finds a boolean, for instance.
And both are fine, there is a need for both. I can see how the flexibility of being able to change, well, everything by just flipping a switch in your head ("this is an int now") might be helpful for, say, data exploration problems.
It's just that in a production environment, these features of NoSQL databases and dynamically typed languages turn into massive sources of problems and oh god, just don't.
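A toy sketch of what "enforced only at runtime" means in both cases (field and function names invented for illustration):

    # Implicit schema: the reading code expects an "email" field,
    # but nothing checked that at write time.
    doc = {"name": "Jane"}
    try:
        email = doc["email"]          # fails here, at read time
    except KeyError as err:
        print("missing field:", err)

    # Implicit types: nothing checks that price is a number.
    def total(price, qty):
        return price * qty

    print(total("19.99", 2))          # quietly prints "19.9919.99" instead of 39.98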
You and Lazare are right on the money. And the thing with the database is that the code that inserts/updates it has to agree with all the querying code about what the implicit schema should be - but it's implicit and scattered around your code - so on a large team it's very hard for everyone to understand that implicit contract and it's going to be a constant source of production bugs.
Schemas don't change that much compared to code, having a strict schema enforced by the database saves you so much time and pain and downtime in the long run.
This list makes me want to cry a little. It rings too true.
> Have migrations (except they're going to be some scary ad hoc nodejs script that loop through your document store and modify fields on the fly).
I literally just spent the better part of tonight AND yesterday evening dealing with one of these scripts. I had pulled down the production table to locally test the script (gross), but when I later ran it in the production environment, we'd somehow had an array sneak into what was an object field. The whole thing just felt like a mess.
Because you can't just test it on one document and see if it works; you have no guarantee that all the documents will be identical. And if the migration script crashes halfway through... oh man.
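For what it's worth, a more defensive version of that kind of ad hoc script might look like the sketch below (pymongo; collection and field names are invented). It skips documents that don't match the expected shape instead of crashing, and tags each migrated document so a crash halfway through can simply be re-run:

    from pymongo import MongoClient

    coll = MongoClient()["appdb"]["users"]

    # Only touch documents that haven't been migrated yet, so the script is re-runnable.
    for doc in coll.find({"schema_version": {"$ne": 2}}, no_cursor_timeout=True):
        addr = doc.get("address")
        if not isinstance(addr, dict):      # the surprise "array snuck in" case
            print("skipping", doc["_id"], type(addr))
            continue
        coll.update_one(
            {"_id": doc["_id"]},
            {"$set": {"address.country": addr.get("country", "unknown"),
                      "schema_version": 2}},
        )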
> And if the migration script crashes halfway through... oh man.
Schema-issues and typing aside, I looked at MongoDB just long enough to find out there are no transactions, then ran away, quickly.
For a lot of tasks, I guess I would find MongoDB very useful, but lack of transactions is a complete deal breaker for me. Not having a real schema, referential integrity and all that makes them even more important, IMHO.
At work, I have had more than one quickly-hacked-together Perl script crash on me in the middle of a run. Having proper transactions has saved my butt repeatedly.
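The kind of all-or-nothing behavior being described is roughly this (a sketch using psycopg2; table and column names invented): if the script dies halfway through, the rollback leaves the database exactly as it was.

    import psycopg2

    conn = psycopg2.connect("dbname=appdb")
    try:
        with conn:                          # one transaction for the whole batch
            with conn.cursor() as cur:
                cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
                cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
        # reaching here means both updates committed together
    finally:
        conn.close()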
Mongo has its weaknesses, yes; its main strength is usually cited as its simplicity, or that it's quick to get something out the door.
I agree with your last comment. I can't help but laugh at people who think they would get away with designing a database with no schema. Schemaless for me meant that unless you enforce constraints, there won't be any.
There's a reason why there are ORMs even though Mongo drivers are sufficient for most cases.
1) I've always designed my data with future changes in mind. I often spend up to an hour thinking about the possibilities of data that I want to store in a collection before writing the schema. The flexibility I have with Mongo is that if I think I need a field but am unsure of the exact data type to store (is it a string, an array of strings, or an array of objects with strings?), I just leave the field as an object and change it later. The plus being that as long as I haven't stored anything in that field, I can always change its type without a 'migration script'.
I've only needed to 'migrate' by updating documents 4-5 times: when GeoJSON landed, and a few other times when I needed small changes to my data.
3) The only way I can think of enforcing constraints on < 3.2 is through indices, which is insufficient. Most ORMs do the enforcing. I've never needed to enforce them at an app level.
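To make the "constraints via indices" point concrete, here's a minimal sketch (pymongo; field name invented): a unique index is about the only integrity rule the server itself will enforce for you on those older versions.

    from pymongo import MongoClient
    from pymongo.errors import DuplicateKeyError

    users = MongoClient()["appdb"]["users"]
    users.create_index("email", unique=True)   # the "constraint"

    users.insert_one({"email": "a@example.com"})
    try:
        users.insert_one({"email": "a@example.com"})   # rejected by the unique index
    except DuplicateKeyError:
        print("duplicate rejected")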
I've used MongoDB primarily for its Geo support, and JSON enabling me to get things done quicker relative to maintaining SQL tables. I've got a small but interesting use case, public transit. https://movinggauteng.co.za and https://rwt.to.
When I started with the projects, PostgreSQL + PostGIS felt like a black box, and I wanted something that would give me ease and flexibility. At the time hstore was the talk of the day, but seemed to not meet my needs.
It would now with JSON, but I'll stick with Mongo for now.
Exactly. While the process of designing the structure of your data can make you feel like “you're not getting real work done”, in the long run, it actually prevents headaches caused by inconsistent data. Data always has a structure, it's just that some people are too lazy or mentally feeble to figure out what it is.
For me that is the most important aspect of starting / designing an application. If the data model is accurate, then the code falls into place easily. If it's not quite right, more and more code ends up in the application trying to make up for the poor data model.
My first task in any project is to design the whole data model based on current requirements and while designing it I think of the interfaces and how would they read and write data (to refine requirements). Writing views and actions/APIs on top of well-formed data model then becomes a breeze.
Agile doesn't mean "don't gather requirements or plan anything." It just means that you evaluate your results frequently and maybe change course, instead of waiting until the end when you're "done".
To add to your point about schemas. The new generation has not learned that the data almost always outlives whatever throwaway front end was written to work with said data. Tying the data to some sort of flavor of the month framework is setting up for all sorts of pain later.
I despise mysql, but even it is better than mongo. At least with it I can easily transition the data to many different uses.
Also, and a point I find amusing is that many users of nosql claim schemaless and then go and write a layer on top of the datastore to enforce a schema. It would have been so much simpler to use a RDMS out the gate instead of badly implementing one.
> If you're EVER going to read the data back and do anything with it, it has a schema.
You are giving the NoSQL crowd too much credit. Some abominations have no recognizable schema at all. The data store will just contain an arbitrary dump of data in whatever shape different developers decided their "schema" should be. The number of "columns" will vary, the "columns" will have arbitrary formats, and so on and so forth.
If one developer decided to separate the name into "first: John", "last: Doe", you will have that. If another decided to have "name: John Doe", that's what will be there. If one developer decided social security should be "SSN: 123-45-6789" and another decided it should be "SSN: 123456780", well, you are going to have fun cleaning up the data at the business or even application layer.
But that's not even the big issue with MongoDB. It's their lack of ACID compliance!
> Because some people can't stand having to work with SQL, migrations, schema and constraints
The real question is “How come these people are allowed anywhere near data stores?” SQL isn't ideal, but how many of the alternatives are better at protecting the integrity of your data?
Thank you. Not everything is easy. This is the difference between engineering and 'hacking'. Hacking is not something to aspire to; it's something you do because of crushing, external pressures.
> At the end of the day, a SQL database doesn't represent the data in a way the programmer uses the data.
Errrr, that's exactly what they do, unless you've got a terrible schema and haven't thought about your data enough. The thing about 'SQL databases' is that you can use the power of SQL to fetch the data in any representation you want.
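For example, a perfectly normalized schema can still be served up in document shape when that's what the code wants; a rough sketch (psycopg2; table and column names invented):

    import psycopg2

    conn = psycopg2.connect("dbname=appdb")
    with conn, conn.cursor() as cur:
        # One JSON document per customer, built from two normalized tables.
        cur.execute("""
            SELECT json_build_object(
                     'name',   c.name,
                     'orders', json_agg(json_build_object('id', o.id, 'total', o.total))
                   )
            FROM customers c
            JOIN orders o ON o.customer_id = c.id
            GROUP BY c.id, c.name
        """)
        for (doc,) in cur.fetchall():
            print(doc)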
ORMs are a really bad attempt to force a square peg into a round hole. The mismatch between the relational model and object-oriented design principles is simply too big.
In the relational model:
(0) A relation is a collection of tuples of primitive values. Every relation has a relation schema, which determines the arity of its tuples and the type of each tuple component. In other words, the relational model is first-order.
(1) There are a few basic operators for computing relations from other relations (relational algebra).
(2) There is a mathematical theory (database normalization) of how to design primitive relation schemas to avoid storing duplicate information, and running into insertion, update and deletion anomalies.
On the other hand, in a pure object-oriented program:
(0) An object is a collection of data and operations on it. The data is hidden from the rest of the program, so the only way to operate on it is to use the object's operations. The operations may take objects as arguments and return objects as results, so objects are intrinsically higher-order.
(1) In general, there are no limits on how one can define a single object's operations. However, it's impossible to define operations which require knowledge of the internal representation of two or more objects at a time.
(2) There are heuristic guidelines (e.g., SOLID principles) for designing flexible object-oriented systems. However, they lack any sort of rigorous foundation beyond “it seems to work in practice”, so object-oriented designers may deviate from these guidelines at their own discretion.
---
For data-oriented applications, it's pretty clear to me that the relational model has important advantages over object-orientation:
(0) The decoupling between data and operations allows the database designer to focus exclusively on data integrity constraints, instead of anticipating whatever queries users will want to make.
(1) The limited expressiveness of relational algebra (with no recursively defined relations) is also a blessing, because it makes automated query optimization tractable in practice.
While objects present problem after problem:
(0) Object graphs are intrinsically directed, and must be traversed in the direction of their links. This makes queries less declarative.
(1) Objects have a notion of identity, which destroys many opportunities for using equational reasoning to build large queries. This also makes queries less declarative.
Of course, the relational model says nothing about general-purpose programming, whereas object-orientation does. But there exist other paradigms for general-purpose programming that are less badly in conflict with the relational model. For instance, functional and logic programming:
(0) Don't reject the use of first-order data, decoupled from operations.
(1) Prefer the notion of mathematical variable, whose meaning is given by substitution (a first-order operation), to imperative assignment, whose meaning is given by certain predicate transformers (intrinsically higher-order gadgets).
Is having seen something used in some way a leading indicator that using it that way was a good idea?
Because I've seen Excel used as database with all kinds of macros and VBA scripts bolted-on/embedded to provide the workbook various shapes of stored-procedure and query capability... but, while sorta impressive in a "Holy crap, lol wut?" kind of way, I'm not sure any instance I observed of uses like that were actually good ideas. Full of epic cleverness and ingenuity? Definitely. A good idea? Probably not.
Did they make the company a lot more money than they cost? If so they were probably a good idea. Not all code needs to be "pretty" to serve a purpose. I've seen some pretty epic hacks that I know generated hundreds of thousands of dollars of new revenue.
They often had a significant "bus factor" problem as a result of this in the best cases, and in the worst cases these mountains of hacks were a massive impediment to growth and/or evolution to meet changing marketplace demands... despite being a central pillar of data management and revenue as it existed in the status quo.
In my experience in market research, advanced spreadsheet programming with macros and pivot tables and whatnot are more a contemporary incarnation of Reporting than raw database querying and operations.
So my question is then: Why not use CouchDB instead? I don't see what Mongo gives you over that and CouchDB is at least dependable and predictable in its operation.
CouchDB is too reliable and actually fsyncs your documents to disk. That is plain boring. I like to live on the edge and have some documents go to /dev/null once in a while. Life is just more exciting that way ;-)
I really like Couch, I wish it had more adoption than it seems to have and that its ecosystem was more mature than it seems to be... and that javascript wasn't its first class citizen.
But it's a really cool database (though I'm partial to RethinkDB now).
Even Javascript is sort of a second-class citizen in Couch. The real first-class citizen is native Erlang code running unsandboxed in the server context. If you want high(er) performance, that's where you go. (Alternatives to this have been discussed, like embedding the luerl Lua interpreter to give the option of a sandboxed programming target without the IPC cost. Nothing in the immediate pipeline, though.)
Me too. RethinkDB is my document database of choice these days. In my experience it's proven to be reliable and fast, and the development team is very responsive and helpful. They also seem quite mindful when it comes to new features and will delay things for years (e.g. auto-failover, which they now support but which took a while) if rushing them would impact quality.
That's what I want from a database: first and foremost it must be solid and not lose my data. Everything else (including high availability) can come after.
Is CouchDB still alive? I spent a weekend playing with it in January, but it seemed to be a very quiet project, with the last stable release being almost two years ago.
Most of the activity happens at Couchbase now, the company that the inventor D. Katz founded based on CouchDB technology. You can still use Couchbase for free, but it's possible to pay for support. The coolest thing they have is Couchbase Lite, the mobile version of CouchDB, which lets you replicate with your server. I find it a very interesting alternative to Core Data, Parse and co., and we use it in production.
> some people can't stand having to work with SQL, migrations, schema and constraints, it's as simple as that
Use the right tool for the job, right? Admittedly something like MongoDB could be the right tool for the job (examples around here include RethinkDB and CouchDB). MongoDB, however, is like a hammer with no head.
It has been pointed out before that json(b) has problems with indexing. IIRC the cost estimates for indexes on JSON data are static, and therefore very rarely accurate. I'm terribly sorry, but I couldn't find a reference with 5 minutes of searching. I still like Postgres over Mongo.
I guess the workaround would be creating indexes on computed columns that query from the json data, together with changing one's queries to use that computed field. For example, with a json column storing names in various places, a computed column could collect all of them in an array. An index on that computed column will have good statistics.
Bottom-line: if you want your queries to run fast, you will have to tell your store what kind of data you have and what kind of queries you will run. Otherwise, there's little the store can do.
Having a traditional database with various constraints is a way to give that information. With json columns, you may have to do it in another way (for now).
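A rough sketch of that workaround (psycopg2; table and key names invented): index an expression extracted from the JSON, so the planner gets both an index and real statistics for it, and query through the same expression.

    import psycopg2

    conn = psycopg2.connect("dbname=appdb")
    with conn, conn.cursor() as cur:
        # Expression index over a value pulled out of the jsonb column.
        cur.execute("CREATE INDEX IF NOT EXISTS docs_name_idx ON docs ((body ->> 'name'))")
        # Queries must use the same expression to benefit from it.
        cur.execute("SELECT body FROM docs WHERE body ->> 'name' = %s", ("Jane",))
        rows = cur.fetchall()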
> Because some people can't stand having to work with SQL, migrations, schema and constraints
So use an ORM that understands Postgres' JSON columns. You don't need to write a single SQL statement, and you get automagic migrations, no explicit schema (unless you make one), and no constraints (unless you add them).
It works great, we did a rather large project last year using Django's ORM and postgres where we didn't know the final data schema until months after launch.
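A rough sketch of that setup (Django; model and field names invented). On the Postgres backend, JSONField maps to a jsonb column, so the free-form part of the data can evolve without migrations while the columns you do care about stay typed and constrained:

    from django.db import models
    from django.contrib.postgres.fields import JSONField  # models.JSONField on newer Django

    class Event(models.Model):
        created = models.DateTimeField(auto_now_add=True)  # normal, constrained column
        payload = JSONField(default=dict)                   # free-form document part

    # Querying into the document without writing SQL:
    # Event.objects.filter(payload__type="signup")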
Never heard of ToroDB. Just checked out the website and it looks interesting, however the tagline "The first NoSQL and SQL database" is untrue.
At least OrientDB has had both schema+schema-free and SQL + NoSQL querying interfaces.
That is, you can optionally supply a schema for your documents. IIRC you could choose either schemaless, schema or mixed (where mixed allows fields not in the schema to exist as schemaless fields).
The default query language was SQL with "enhancements" (to allow for graph traversal), but you could also query with Gremlin. Not sure if this is still the case or not as I don't use OrientDB.
The above was true in 2012 and possibly a lot earlier. I see ToroDB's first Github commit was in 2014.
It does work for a certain volume of data. You can index fields you're interested in, even do so after the fact, and it's like any other database in that case. And sometimes you have small apps that do need complete historical log data, so Kafka et al just introduce unnecessary complexity since you'd need to aggregate into a key value store anyways.
But if you do this, god forbid you go beyond where indices can fit in RAM of a single machine. And you will do so, with probability one given your product doesn't shut down. So you're running a gauntlet against a redesign.
It’s useful for prototyping. When you don’t know which schema you’ll end up using having an *SQL database is tedious because you have to do migrations every time you change the schema. Once you’re done prototyping you can switch to a better alternative.
If you need to preserve data between the application versions then you still get all the headaches with MongoDb (either migrating the data or supporting multiple schema versions when you read the data, oh the fun!).
If you don't need to preserve data between the versions then you don't need to write migration scripts in SQL; just scrap everything and pretend it's the first version of the application.
I guess this is a question of how useful and deployable the prototype should be. Why not just have an in-memory object cache, literally a hashmap, for your DAL? If you're composing app-level code, you don't need to know what the backend does to your data. You could even create a simple method to populate the data at app boot in the dev profile. When you figure out the storage requirements and finalize the model, build your DB.
This would save you time on picking the DB, and on schema changes or even migration changes in Mongo. You don't have to worry about bad documents from an earlier app revision.
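A minimal sketch of that hashmap-as-DAL idea (all names invented): app code talks to this interface, and the real database shows up behind the same methods later.

    import itertools

    class InMemoryUserStore:
        def __init__(self, seed=()):
            self._ids = itertools.count(1)
            self._rows = {}
            for row in seed:                  # dev-profile boot data
                self.save(row)

        def save(self, row):
            row_id = row.get("id") or next(self._ids)
            self._rows[row_id] = {**row, "id": row_id}
            return row_id

        def get(self, row_id):
            return self._rows.get(row_id)

    store = InMemoryUserStore(seed=[{"name": "Jane"}])
    print(store.get(1))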
In that case you can start with postgres and stuff your documents in a single json column, accessing it just as you would have in mongo while you're prototyping and don't care about speed and indexing, and when you're done, you can just change that table to a more proper structure without changing databases.
Exactly. Use the right tool for the right job. Start prototyping and development with MongoDB and then migrate to Postgres, or Cassandra, or whatever suits your use-case better.
This is key and often overlooked - MongoDB is so popular not because it's the best database but because it's so easy to get started with. Download/unzip/run to have a database engine ready. It also helps that you can immediately store anything without any prior setup steps.
Postgres/mysql/sqlserver/etc are nowhere near as easy to install, as fast to get started with or as portable to move around.
The Postgres folks should listen to this and have a simple getting-started guide for OS X, Windows, and Linux. I tried brew install postgresql. There was no single place that told me how to start the server, access the command line, create a DB, etc.
On OSX there is the fantastic http://postgresapp.com/ . It installs into /Applications so it is easy to remove, and comes with a start/stop GUI and taskbar icon. Great for local development.
But installing and configuring Postgres "properly" on a server is still something of a challenge. Do I need to modify random_page_cost on a SSD or not? What are good memory limits on modern big servers? What exactly needs to go into pg_hba.conf?
None of these seem too difficult after reading a few tutorials and wikis, but it would be nice if the server set itself up with reasonable defaults based on the machine it's running on.
Getting started with PostgreSQL on Linux is actually trivial. What is annoying, though, is that there are lots of guides which talk about editing pg_hba.conf, which is not necessary for the simplest setup. The default pg_hba.conf is good in most distros.
We must have different definitions of trivial compared to what I had to go through every time - it's a mess of an install process that takes tweaking config files just to have it even listen to external requests.
With the ease of services like AWS, we never installed a database server. Pick a database flavor, version, click, click and you're up. I suppose designing the schema takes a little effort, but I find it much easier than properly architecting software.
Many if not most installations are still being done on actual dev machines and servers. While RDS and other managed services are nice, they're just a small fraction of the usage.
Also the fact that managed services help so much only speaks to the fact of how difficult these relational databases typically are to work with operationally.
If you're on a Mac you can download postgresql.app[0], which produces a small icon in the top-right status bar. You don't have to install users or permissions or anything; it's super easy to set up. Getting it on prod can come later, but for the first five minutes it works.
(granted, this neglects contrib extensions like hstore)
It's not just installing for development, mongodb (other than its ridiculous clustering) is very easy to install on production servers as well and moving an installation is basically just zipping up the folder and moving it somewhere else.
That sets up postgres. It doesn't let you get started doing CRUD operations inside your postgres though - which is where MongoDB shines, "just store my data, fuck it"
> Mongo is a dumb, dead-end platform, but they know how important ease-of-use is.
By “ease of use”, do you mean “ease of making something that seems to work” or “ease of making something that actually works”? I've never used a schema-free database, and ended up thinking to myself “I'm completely sure this database can't possibly contain garbage data”. Or do programmers simply not care about data integrity anymore?
The single biggest source of grief in our production database has been the one JSON field we used once to avoid adding another table. That goddamn thing has crashed the server so many times with invalid data, that I'm never using anything schemaless again. We recently migrated to a proper table and I'm thanking my lucky stars I finally got rid of that devil.
You can insult the developers that use Mongo or you can look at how to get those users onto a better platform. With the modern expectations of full-stack development, is it any wonder that something promising simplicity and zero-configuration data storage does well?
Well, I have used one and had that assertion. Ease of use means making something that works. It seems we're condoning lack of knowledge or experience with programming. If you've never used databases in your life, whether you're using SQL or NoSQL you'll likely end up with rubbish data. In the SQL world it could be that you're storing time in the wrong format, concatenating long fields, or not normalising when you should or the other way around.
In NoSQL you could be reinventing the wheel, or storing data that you can't query efficiently because you can't index it well etc.
All the excuses for not using some document stores, beyond ACID, really sound like people don't know what the heck they're doing.
> Well, I have used one and had that assertion. Ease of use means making something that works.
For me, it means, under no circumstance, no interleaving of transactions or scheduling of commands, nothing, nichts, nada, can the database be in a state where a business rule is violated. If I need to worry what silly intermediate transaction state can be observed from another transaction, or if I need to worry whether a master record can be deleted without cascade-deleting everything that references it, then the DBMS has failed me.
> not normalising when you should or the other way around.
I've never seen a situation where anything less than 3NF (actually, ideally, at least EKNF) is acceptable.
What they neglect to mention is not that you don't need a DBA, but that you are the DBA yourself. With all the responsibilities that go along with that role. Who's getting paged at 3am now...?
If that's really a problem, host your database on a cloud platform and let those guys do the job of a DBA. Works for me at present. I am aware it's not going to be a solution for everyone, though it still sounds a lot better than being your own Mongo DBA.
The DBA is the role that is responsible for the organization's data. Even if you outsource the routine tasks such as "doing backups", you still need someone to assume that role.
Yes, it will help you cover cases like where the server physically explodes, but that's basically irrelevant; most problems where you need a DBA are caused either by data corruption caused by application code or developer, or performance issues caused by DB structure - in those cases the cloud platform won't do anything for you, they just host the server. They can restore backups, do monitoring and tune the server, not your particular app/db structure - but all the big problems are there.
"most problems where you need a DBA are caused either by data corruption caused by application code or developer, or performance issues caused by DB structure"
Is running Mongo going to solve any of those problems?
Without a rigidly enforced schema I would guess those problems are going to be amplified rather than solved.
This will be highly subjective but you need to get over the "postgresql has the longest feature list so why don't you use it". The last startup I have been involved with tried to use PostgreSQL and needed to move to MySQL (yeah, well) because commercial support was both more expensive and less useful than what we were able to get for MySQL. Perhaps today it's different.
While I no longer use PostgreSQL much, every time I need to touch it, it seems rather developer-unfriendly; just last month I found MySQL, heck even SQLite supports triggers with code inlined into the trigger body but PostgreSQL mandates writing a separate function for the trigger. And, of course, it needs to be in plpgsql because reasons. The most trivial "let's calculate another column" becomes a complicated nightmare.
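For the record, the boilerplate being complained about looks roughly like this (a sketch using psycopg2 to drive the DDL; table and column names invented): the one-line "calculate another column" rule needs its own plpgsql function plus a trigger pointing at it.

    import psycopg2

    DDL = """
    CREATE OR REPLACE FUNCTION set_total() RETURNS trigger AS $$
    BEGIN
        NEW.total := NEW.price * NEW.qty;   -- the one line we actually wanted
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS set_total_trg ON order_lines;
    CREATE TRIGGER set_total_trg
        BEFORE INSERT OR UPDATE ON order_lines
        FOR EACH ROW EXECUTE PROCEDURE set_total();
    """

    conn = psycopg2.connect("dbname=appdb")
    with conn, conn.cursor() as cur:
        cur.execute(DDL)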
So then if you don't want to use PostgreSQL what then? The answer now is MySQL, again, because 5.7 has JSON.
And mind you, I have grown to dislike MongoDB slowly over the years as new types of queries have appeared; it's a complete mess by now. There was an excellent article on this posted on LinkedIn of all places this March: https://www.linkedin.com/pulse/mongodb-frankenstein-monster-...
It's really interesting how MySQL is the most usable and most supported database by now...
> This will be highly subjective but you need to get over the "postgresql has the longest feature list so why don't you use it".
I think if you re-read it, you might see that at no point did the post that you're replying to imply that Postgres was preferred because it had a longer list of features. They're speaking entirely about the strong guarantees that an ACID system gets you.
Document stores are only mentioned because this is one of the (incorrectly) perceived advantages that Mongo has over Postgres and other databases.
"just last month I found MySQL, heck even SQLite supports triggers with code inlined into the trigger body but PostgreSQL mandates writing a separate function for the trigger"
The right response, as a postgres developer, is to agree that you describe a useful feature, and perhaps implement it to help other users.
But my advice to you is to be willing to put up with some short-term annoyances. Sometimes the best choices are a little annoying, and if you refuse to consider them, it will cost you (or your employer) much more later.
It is listed on the TODO https://wiki.postgresql.org/wiki/Todo page, apparently since 2012. I haven't coded in C since 1998, I do not think you want me to touch the PostgreSQL code base.
> just last month I found MySQL, heck even SQLite supports triggers with code inlined into the trigger body but PostgreSQL mandates writing a separate function for the trigger
I found something similar (and in the last month too) – insofar as we're talking missing popular features – but with MySQL's and Postgres's positions reversed.
`ALTER TABLE ... ADD CONSTRAINT CHECK ...` runs on MySQL without an issue, and so does any INSERT or UPDATE violating that CHECK constraint. A bug was filed in 2004.
My response to the comment for triggers and PGSQL would be: Postgres tries its best to stop you shooting yourself in the foot.
Similar to the top comment, all the real problems I've ever encountered with postgres (heck, all major RDBMS's for that matter) come from certain areas, mainly triggers.
A startup whose back end I helped architect uses MongoDB for everything. Before starting the project I asked the CTO not to use Mongo, as it was not the right fit; basically they needed more relational stuff. The CTO chose Mongo because he was thinking every startup uses it, so why not us. Now they are suffering, as they need ACID and relational features. They want to rewrite on Postgres, but they are heavily invested and it's not easy to go back.
Postgres wasn't always a great document store... there was definitely a time period where, if you wanted to take a document-oriented approach to data modeling, MongoDB was a good way to go. JSONB was only added fairly recently (in 9.4), and while the JSON and HSTORE types were available before that, they didn't give you quite the same speed. Now that JSONB is a thing, I think the two databases are more comparable as document stores.
Is that actually true? I mean the part about Postgres might be, I don't know. But was there a time when MongoDB was a good way to go?
Was there ever a time when it actually worked consistently well at something that was database shaped? Because I started dealing with it in ~2010 I think, and it wasn't a suitable database for anything other than toy projects or throw-away data back then, and while it's many versions newer, it still appears to be pretty fast and loose with its supposed system guarantees.
There was a point when they raised $100+ Million in funding that I thought they'd take that money and actually build a database. At least as recently as last Summer that wasn't a reality yet.
A text/blob field with a normalized key column or two was always vastly superior. We're talking about data loss at an incredible level. I mean, a new Jepsen test comes out and this community goes bonkers over how database X might suffer a split-brain problem for a few milliseconds under an extreme condition, but Mongo on a single instance has never been safe, and people are making excuses for it.
> I’ve hesitated to recommend RethinkDB in the past because prior to 2.1, an operator had to intervene to handle network or node failures. However, 2.1’s automatic failover converges reasonably quickly, and its claimed safety invariants appear to hold under partitions. I’m comfortable recommending it for users seeking a schema-less document store where inter-document consistency isn’t required. Users might also consider MongoDB, which recently introduced options for stronger read consistency similar to Rethink–the two also offer similar availability properties and data models.
I've been using Postgres/jsonb as a JSON document store. It works OK - the query capabilities are still a little rough (9.5 is better than 9.4), and some frameworks like Loopback don't support JSON in Postgres yet (not sure which ones do), but it's definitely capable and reliable...
What do you think about NoSQL in general? From what I could follow from aphyr, RethinkDB seems pretty awesome. I like it a lot, but I am also not getting a ton of traffic on localhost:3000...
That is an unanswerable question, since it's about everything. "NoSQL" is a huge variety of techniques - many of them yet to be invented - that only have one thing in common: "not SQL". It runs from document storage over key/value storage to graph databases. Anyone who tells you what they think "about NoSQL" either has to redirect the question into a useful one or, if they actually attempt to answer it, grab your popcorn and expect entertainment at best.
Fair enough. I mentioned RethinkDB above because I find it very intuitive and versatile. I've used MongoDB a fair bit but I like RethinkDB better for a host of reasons. I guess what I meant was, I thought MongoDB was OK, yet everyone here seems to have always known it was deeply flawed. I tried to follow the Jepsen report on Rethink but I don't fully comprehend the tradeoffs/benchmarks etc., and was curious what others thought about it.
You're dealing with a torrent of incoming semi-unstructured data, where losing a good chunk of it is a minor nuisance because you only need a decent sample from which to extract data.
In those kinds of scenarios, making it easy to work on the code can often be far more important than reliability.
I have a project like that now. I'd love to use Postgres, and probably will eventually once things "settle down" and we know what data we need to store. But for now MongoDB is the "quick and dirty" solution. We define a schema client-side for everything we nail down, so as we nail down more aspects of what data to process, it gets easier to transition to a proper database.
As ORMs get better support for Postgres' JSON capabilities, it will likely get less and less appealing to use MongoDB for stuff like this too.
It HAS them, just not built-in tooling to make using them easy. 2ndQuadrant's repmgr gets you partway there, I'm really hoping to see them revamp it now that pg_rewind is a thing to make restoring a failed master less of a pain in the butt (this is literally the only reason I don't bother with HA right now, it's usually much easier for me to get the DB back online or restore from a barman backup than deal with replication).
If you want that, you can always use a variant of Postgres that does, like Greenplum, Citus, and a few others. They're battle-proven. There's also MySQL and its variants as well.
Not to mention there are NoSQL alternatives that have a better track record than Mongo, like Cassandra.
Cassandra, HBase etc had checkered pasts with plenty of their own data loss and inconsistency bugs.
Now they are considered two of the most rock solid NoSQL databases. The hatred towards MongoDB really is pretty irrational given just how popular the database is.
FWIW, I don't think Cassandra is particularly any better today semantically than it used to be. Merge conflicts are still resolved at the cell level rather than the row level, and wall-clock time is still the way LWW resolution is determined. It lets you mix strongly consistent and eventually consistent data together, which makes no sense.
But the difference is that Cassandra is reliably "broken" in those ways, and as a result there are ways of using it which don't lean heavily on those weaknesses. Such as writing only immutable data or isolating all data that will be used in paxos transactions into their own column families by convention, etc.
Cassandra more or less behaves exactly as it claims that it does. So you can do a somewhat thorough investigation of its system semantics and know what you can rely on and what you can't. MongoDB doesn't even uphold the system semantics it claims that it has, so it's just broken in weird and esoteric ways that you discover mostly by accident.
Scale to what though? RDBMs can easily handle large loads, have replication, etc... At the point where you need true scaling, you'll have a much better idea of your problem and can solve it appropriately.
Why on earth haven't I come across this information before? I spent a crazy amount of time researching frameworks before settling for Meteor, and never came across this.
I worked at a Data Analytics start up in Palo Alto back in 2011 and we had 8 or 9 databases in our arsenal for storing different types of data. MongoDB was by far the worst and most unstable database we had. It was so bad that for the presidential debate, I had to stay up and flip servers all night because even though the shards were perfectly distributed, the database would crash and fail over to two other machines which couldn't handle our entire social media stream. We ended up calling some guys from MongoDB in to help us troubleshoot the issue and the guy basically said "Yeah we know that's a limitation; you should probably buy more machines to distribute the load." I like the concept of Mongo, but there are other more robust NoSQL databases to choose from.
I had a similar experience in 2011 with Mongo, where we were running map-reduce jobs, which Mongo advertised support for. The whole system got blocked from running the map-reduce, and the 10gen consultant sighed when we told him we were running map-reduce jobs.
Which isn't the best argument to make against MongoDB since you should have known - it's even part of their course curriculum - that map/reduce is not the optimal way to aggregate in MongoDB. They have their own aggregation framework (https://docs.mongodb.com/manual/core/aggregation-pipeline/).
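For reference, the pipeline approach the docs push looks something like this (pymongo; collection and field names invented):

    from pymongo import MongoClient

    events = MongoClient()["appdb"]["events"]

    # Count page views per page, server-side, without map/reduce.
    pipeline = [
        {"$match": {"type": "pageview"}},
        {"$group": {"_id": "$page", "hits": {"$sum": 1}}},
        {"$sort": {"hits": -1}},
    ]
    for row in events.aggregate(pipeline):
        print(row["_id"], row["hits"])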
I have no intention of defending MongoDB because what do I know, never worked with it in real life - but just out of curiosity I took the free courses they offer (https://university.mongodb.com/) and I find that a sizable share of the complaints about MongoDB come from people who don't seem to have learned much about the product they are using. It's like people complaining their new truck behaves badly in water.
A lot of critics seem to have chosen MongoDB when they needed a SQL DB from day one. If you need full flexibility to (re)combine data you need SQL, for example. A document store isn't "schema-less" at all - much of the schema is built-in and very inflexible after that.
I actually wonder what the correlation is between PHP use and MongoDB use. They both have an attitude that mistakes ease for simplicity, a philosophy that puts correctness way down the priority list, and an easy introduction with a heavy ongoing maintenance tax.
Yes, it is easy to use. But too bad it also fails "transactions" silently, so that you don't even know if your changes were "committed" or not. Don't worry, it only happens every once in a while, so it's not a big deal...
Unless you are coinbase or an organization that deals with money/bitcoins/etc and you need ACID compliant transactions so that "debits/credits" don't just magically disappear.
When the bitcoin craze was at its peak, Coinbase had all kinds of problems due to their MongoDB backend.
It's pretty easy to use... until you have to normalize data and query across one or two joins. I've been forced to build with mongo for the past few months (still not sure why) and I can't think of a single valid use-case for this rubbish.
If you need denormalized/distributed caching, Redis does a good job.
If you need to store some unstructured json blobs, postgres and now sql server 2016 can do that.
If you need reliable syncing for offline capable apps, you probably want CouchDB.
If you need real time, use Rethink
Obviously, relational data belongs in a relational database.
I think the problem is that all of these databases do one or two things really well. Mongo tries to do all of these things, and does so very poorly.
I (used to) hear lots of good stuff, but the devs saying it were always hype-driven. Asking for a reason why Mongo was used, the reply sounded just like the marketing hype on Mongo's homepage - lots of buzzwords and catchphrases ("big data", "schemaless") with no substance to the reason for choosing it.
I've just migrated one project from Mongo to PostgreSQL, and I advise you to do the same. It was my mistake to use Mongo: I found a memory leak in cursors on the first day I used the DB, which I reported and they fixed. That was 2015. If you have a lot of relations in your data, don't use Mongo, it's just hype. You will end up with collections without relations and then do joins in your code instead of having the DB do it for you.
And the use case of this post is exactly what RethinkDB does better: "One of our services periodically polls the database and reads the list of running containers with the query..."
I'm kind of curious as to where this hype is. I've almost never heard anybody say anything positive about mongodb. All I ever see is people saying it's terrible / hilarious for various reasons.
Like with any online community, Hacker News can be kind of an echo chamber where groupthink reigns and alternative points of view aren't encouraged. MongoDB hype has died down here, but there are still some people that are fans.
There are some things MongoDB does fairly well:
* MongoDB is really easy to use
* Document databases can be great and flexible solutions for some kinds of projects
* Documentation is fairly good so learning the basics isn't too hard even if you know nothing about it
* scales fairly well at the initial stages
* arguably quicker to get a project off the ground with than traditional RDBMSs, which might be the most important consideration for any startup even if a complete rewrite would eventually need to take place
That being said, I've used MongoDB significantly before and it wouldn't be my first choice for most new projects: PostgreSQL probably would be.
About the only thing I agree with is how great their docs are.
* Mongo is only easy to learn. Beyond simple demos, it gets harder and harder to use as projects evolve, i.e. you have to do a lot of work yourself. IMO this is a common problem with NoSQL datastores that isn't exclusive to Mongo
* "Document databases can be great and flexible solutions for some kinds of projects": Postgresql has been able to work directly with JSON for some time now. There are also other document datastores that are more reliable than Mongo
* "arguably quicker to get a project off of the ground with than traditions RDBMs" unless you're using Meteor, I'm also going to disagree here. Most frameworks target a relational database by default. Developing by convention tends to get you off the ground much faster than using something more specialized and niche
But they do make great mugs. Who here doesn't have at least a couple of MongoDB mugs? I don't use MongoDB and still have a bunch from random conferences over the last 3-4 years.
Idk about hype, but for Node.js virtually every tutorial uses it. I think the stereotype is that Postgres locks you into a data model while your needs vary drastically. I suspect retooling a schema in Mongo is harder than is widely claimed, while in Postgres/MySQL it's slightly easier than is often claimed.
I doubt they meet in the middle, but both have great use cases. Scaffolding out a quick isomorphic JS app is a great fit for MongoDB, for example; Postgres is faster and more robust.
It's just tech; it depends what you are optimizing for.
Not sure what that means, but scalability is the worst thing about mongo (though my experience with mongo is all from ~2.5 ish years ago). As soon as your working set of indices gets bigger than memory, performance falls off a cliff and your entire app grinds to a halt. In my experience, mysql and postgres have a more gradual decline in performance so you have some time with only mildly degraded performance to figure out a solution (plus they have more options for tuning which can buy you more time).
The HN crowd tends to insult it, but outside of HN people hype it up. People keep talking about how much they love the MEAN stack. It's huge in the hackathon crowd, due to them sponsoring many of them, and its low learning curve.
I wish people would stop using acronyms and realize Express/Angular/Node is just as good with something other than Mongo.
I can confirm. On HN, meetups, Twitter, etc. no one talks about MEAN stack any more because of Mongo and the Angular 1/2 split (and React's popularity). In the last hackathon I went to, like most "corporate sponsored" ones, you got a special prize for using it.
Are people using React starter kits? The major issue I'd see with React at hackathons is the large amount of configuration you typically need to do before you get started. I dislike starter kits due to the additional complexity overhead, but I can imagine for a throwaway project they'd be fine.
I think this is part of the reason why RoR, MEAN and other such frameworks are so popular. They tend to be easier to set up and many developers love that. My own theory is that there are so many bad devs out there, and that contributes to the hype because so many devs end up adopting it.
MEAN isn't as easy to set up and I would not compare it to Rails at all. Rails is a backend framework; MEAN is just a bunch of technologies used together to have an API and a SPA. I'd actually argue, from personal experience, that newbies have a lot of trouble with MEAN because, while there's a CLI tool, MEAN doesn't have the community support, conventions, etc. that Rails does. Plus, you have to learn too many things at once.
>MongoDB 3.0 features performance and scalability enhancements that place MongoDB at the forefront of the database market as the standard DBMS for modern applications.
>Also included in the release is our new and highly flexible storage architecture, which dramatically expands the set of mission-critical applications that you can run on MongoDB.
>These enhancements and more allow you to use MongoDB 3.0 to build applications never before possible at efficiency levels never before attainable.
Hang out in the freenode room for mongo and you will see that most people, working on production software, make inquiries that reflect a complete lack of knowledge and common sense, and leave you facepalming with no hope for humanity at large.
It's massively successful in the entry level web coding world due to a combination of good marketing and the belief that it gives you unlimited scale 'for free' and everything 'just works.'
Not saying those are impossible goals but so far no database has managed to deliver that. Building a large-scale endlessly-scalable database is still very hard and very detailed and easy to screw up.
When it first came out it was hyped big time. Every other article was about how great Mongo was. About a year later the fallout started bubbling up. Digg went down for one or two weeks because of switching over to Mongo-- Reddit saw an influx of users and now is what it is.
The hatred for MongoDB mostly comes from the PostgreSQL supporters camp. The rest of us are just using whatever tool makes sense.
MongoDB is the fastest database I've ever used. The easiest to get running, stable and scaled out. The best documentation by far and has excellent integration e.g. Spark, Hadoop.
If you are doing Big Data it's a great tool in the arsenal.
It's fast because it doesn't actually save your data when you ask it to. It saves it later, when it feels like it, maybe. Also, the hate I have for Mongo is from first-hand experience with the wretched thing. The scaling is insane. The next jump after your first server is something like 9 servers (2 replica sets with 3 servers each, plus 3 config nodes). I don't have that kind of cash lying around for servers, especially when, instance for instance, Postgres can easily handle 10x the write load of Mongo.
I think this is a common pitfall though: people start off thinking they don't really have relational data and then realize they actually do. Now they have a pile of code integrated with a DB that doesn't do relations well and can't be ported easily and then encounter cool bugs like this. No bueno.
We start our apps with mongo, and design them with a migration plan to postgres. We've found it's very easy to rapidly develop the application with mongo due to its flexibility. Once we understand where our app is headed and what our relationships actually are, we pretty much pull the plug out of mongo and stick it in postgres. If you build a reasonably intelligent query wrapper it's fairly effortless. That being said, we're thinking of moving our early prototyping to Rethink now that it's made some strides.
Meanwhile, normalized data is what gives me so much flexibility when using Postgres at the start of an app. I just store my data as generically as possible and usually all I need to change under churn is the queries.
Denormalizing on day 1 (Mongo) has you making guesses about your data access patterns at the worst possible time instead of just thinking about the data itself.
Can you explain what you mean? As far as I know, normalizing is a nonsensical process for a document store. You can only normalize a relational schema.
Normalization is just a method of organization to minimize repetition of data. It has nothing to do with efficiency of operation. This is perfectly valid code:
  person = {
    _id: "person123",
    username: "lloyd-christmas"
  }

  comment = {
    _id: "comment123",
    person: "person123",
    text: "This is how I start"
  }
You don't have to do:
  person = {
    _id: "person123",
    username: "lloyd-christmas"
  }

  comment = {
    _id: "comment456",
    person: {
      _id: "person123",
      username: "lloyd-christmas"
    },
    text: "This is also valid"
  };
Sure, a join is faster than the first one where you'll have to hit the DB twice. The point is that you don't have to START with denormalizing everything. I start with normalized data and do more DB reads than I need. I figure out how the application uses my data as I go along, and denormalize the pieces I need only once I need them and am confident I won't bump into consistency issues (my username isn't updating every 5 seconds). Through this process I realize what the actual relationships are in my application and how my app functions request to request. This allows me to better structure my data. This is a quick update in mongo and usually a couple of lines of refactoring in application logic.
Obviously this is just an MCVE. My original point was that I find this to be a drastically more flexible process than starting off relational.
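To make that concrete, here's a rough one-off sketch in mongo shell syntax (the people/comments collection names are just made up for the example) of what that later denormalization step can look like, once you've decided comments should embed the username:

  // Backfill: copy the (rarely changing) username into each comment.
  db.comments.find().forEach(function (c) {
    var p = db.people.findOne({ _id: c.person });
    db.comments.update(
      { _id: c._id },
      { $set: { person: { _id: p._id, username: p.username } } }
    );
  });

  // Reads that used to take two queries now take one:
  var comment = db.comments.findOne({ _id: "comment123" });
  print(comment.person.username);

After that it's the couple of lines of application-logic refactoring mentioned above: swap the second lookup for the embedded field.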
You can normalize data in Mongo and replicate relational database features in your application layer.
I'm just curious how that's an upside to using a relational database from the beginning when your plan is to migrate to a relational database anyways.
Switch out "Mongo" for "Postgres" in your bulk paragraph and you have the same scenario but with less work on your part and more features to help establish your data model.
One upside I can see is if you're more familiar with Mongo, such that using a relational database slows you down.
The bottom level of our application layer is a query builder which is almost a drop-in replacement between mongo and postgres. By the time that layer is built out, we know what our database needs to look like. I find that adding/dropping fields and models in mongo is drastically faster than moving models around in postgres. The above example would obviously end up in the same structure, whether we started with relational or not. It was nothing more than demonstrating an iterative process where you don't need to START denormalized just because "that's why you use nosql".
We try to be as incremental as possible when building our apps, and have found that using nosql allows for 20 small refactors where a relational db would often force 2 larger refactors. We've just found that it ends up being a faster production process, and we end up with a much more application-specific database instead of just "This is a Person, this is an Address, this is a Comment". Sure, we know beforehand that the application will contain all those components. We don't necessarily know how they'll be used on a request-by-request basis, and whether or not they will actually end up being one-to-one, one-to-many, or many-to-many.
This feels like you lack an architect who can see the bigger picture of your applications. I don't mean that insultingly, but with experience you tend not to need that second system syndrome. Do you find there's less rewriting over time as you become more experienced? Or is it just the kind of projects you work on?
> This feels like you lack an architect who can see the bigger picture of your applications.
Quite the opposite. We feel that going in with the assumption that you know where the application is going to end up is hard-headed. However, acknowledging that the situation will definitely change doesn't absolve you of planning it out properly given the information currently available to you.
> Or is it just the kind of projects you work on?
We build mainly internal facing or b2b apps in the medtech space. Given that we need to integrate with larger players that don't really care much for small businesses, we can receive slow response times for data/api requests from any external sources we deal with.
e.g., Recently we built an application for a home-town pharmacy where we were forced to use two databases; one under our control and one controlled by a pharmacy management system. We needed to update certain models in their database while reading from other ones. They promised to build out a few stored procedures that we needed. They flat out lied on a couple, and then quoted us an 8-month turnaround on the other ones. We'd be stuck with them regardless, but expecting a curveball like that allows us to rapidly adapt.
Obviously, the pharmacy's business model doesn't change very rapidly. Iterating over the app through a few of their business cycles tends to give you enough knowledge of what you can build them, as well as what they really want. Regardless of how much time we spend planning with them, they'll always leave us with some form of an XY problem that we'll only understand after they use the application for an extended period.
We don't deal with web-scale, so the problems generally encountered in the mongo complaint arena tend to be irrelevant to us. Given our use case, I think the decision is pretty reasonable.
> Do you find there's less rewriting over time as you become more experienced?
We've factored that into our development strategy, which is why I mentioned the query wrapper. That query wrapper paired with a DAO level that's reasonably database agnostic makes our transition fairly simple and quick.
I've actually observed the opposite, that the more experienced you get, the more rewriting occurs and the less you tend to plan things out ahead of time, until you're Google/Facebook level and just accept that you will be rewriting constantly.
And it will never beat it on non-relational queries either.
To answer why: it was suggested by someone who said it's the new modern DB standard and that NoSQL is the clear winner, and then showed me a comparison on their site with SQL queries. It all looked like a substitute for an RDBMS, but it's not.
> If you have a lot of relations in your data don't use mongo, it's just hype. You will end up with collections without relations and then do joins in your code instead of having db do it for you.
So... you are not against MongoDB but against NoSQL in general? I've used MongoDB and I've never ended up with lots of joins in my code. But I guess it all depends on the use case and how you've structured your data.
Document databases are not a silver bullet.
Contrary to what the name implies, most relational databases don't handle querying relation-heavy data well. If you need to hit plenty of relations in your queries, instead consider something that is optimized for that, like a graph database (or multi-model including graph).
If you're currently using MongoDB in your stack and are finding yourselves outgrowing it or worried that an issue like this might pop up, you owe it to yourself to check out RethinkDB:
It's quite possibly the best document store out right now. Many others in this thread have said good things about it, but give it a try and you'll see.
I'm ex-Couchbase, so I can probably give a reasonably informed but independent view on this.
Firstly, regarding the marketing, it may not have been to many people's tastes - but it definitely worked, and achieved a lot of what was set out in terms of raising the awareness of what was a decent product that wasn't as well known as its competitors. There may be cases where people avoid it because they don't like the marketing, but the reality, having seen its effect, is they are in the minority, and would probably serve themselves better by assessing products based on technology rather than spiel.
Now, on the actual technology!
What Couchbase has historically been good at is highly scalable key-value access, at very high performance and low latency. Performance is comparable to Redis, but CB has much more mature sharding, clustering and HA, e.g. fully online growing/shrinking of the cluster, protection from node failures, rack/zone failures and data center failures. Redis may be a good fit for single-machine caching situations, and also has its own advantages in terms of its data structure support, etc.
Quality of SDK's is pretty subjective, but I'd say the 2.x re-write of Couchbase SDK's makes them very solid. The Java SDK in particular is extremely good both in performance and by providing native RxJava interfaces.
In terms of query interface, there's geospatial and a new freetext capability on the way.
Couchbase chose to go down the route of a SQL-based interface as their main query language. This seems to be a bit love/hate with developers, with some delighted and some perplexed. Maybe for devs it's really higher-level interfaces like Spring that are increasingly important anyway?
The native interface being SQL based is usually very popular with the BI / Reporting side of things.
Changefeeds (continuous queries?) are a feature not in Couchbase which I would very much like to see in the future. One thing I would say is that it's something you have to be very careful in the design of to ensure scalability and performance. Consistency is something which would obviously need thought as well.
A lot of Mongo DB bashing on HA. We use it and I love it. Of course we have a dataset suited perfectly for Mongo - large documents with little relational data. We paid $0 and quickly and easily configured a 3 node HA cluster that is easy to maintain and performs great.
Remember, not all software needs to scale to millions of users so something affordable and easy to install, use, and maintain makes a lot of sense. Long story short, use the best tool for the job.
This has also been my experience. Millions of large documents on a single (beefy) node with a single user, and it's been fine. Although, the sysadmins had previously left me with flat-file XML on shared storage, so the bar was pretty low.
This behavior is also discussed in the linked issues. Seasoned users of mongodb know to structure their queries to avoid depending on a cursor if the collection may be concurrently updated by another process.
The usual pattern is to re-query the db in cases where your cursor may have gone stale. This tends to be habit due to the 10-minute cursor timeout default.
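As a rough sketch with the Node.js driver (collection name and query are made up, and the exact error you catch varies by driver and server version), the habit looks something like this:

  const { MongoClient } = require('mongodb');

  async function listRunning(db) {
    // Re-issue the query from scratch if the cursor goes stale,
    // e.g. a CursorNotFound error after the idle timeout.
    for (let attempt = 0; ; attempt++) {
      try {
        return await db.collection('containers')
          .find({ status: 'running' })
          .toArray();
      } catch (err) {
        if (attempt >= 1) throw err; // only retry once
      }
    }
  }

  // usage: const client = await MongoClient.connect(url);
  //        const running = await listRunning(client.db('ops'));

It papers over the timeout, but of course re-querying doesn't fix the missed-document behavior the post describes.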
MongoDB may not be perfect, but like any tool, if you know its limitations it can be extremely useful, and it certainly is way more approachable for programmers who do not have the luxury of learning all the voodoo and lore that surrounds SQL-based relational DB's.
Look for some rational discussion at the bottom of this mongo hatefest!
> I don't think most programmers have the luxury of learning all the voodoo and lore that surrounds MongoDB from JIRA tickets and blog posts.
That's how I learned everything I know about most FOSS products I have encountered - through the code pages and social media surrounding the project.
Pretty much everything about the mongodb hate derives from their marketing and sales. The truth is, they've obviously stumbled onto something the market wants, otherwise they would never have become so successful.
For me, as a long-time programmer with no database experience, the mental mapping of JSON constructs as both data and query language was far easier for me to absorb than the relational model, which didn't fit the paradigms that I was used to.
At my present gig, we've used Mongo DB for two years, scaling up to quite a large production setup. Like any technology it has strengths and weaknesses, but it has not been the utter failure that readers of Hacker News would be led to expect. We adopted it knowing quite a bit about its history, and it has turned out to be an excellent choice that has held up over time.
Periodically we've considered switching to postgres, and we may do so for part of our stack. But for the core jobs of data collection and batch processing data with fluid schema, I'm pretty sure we will stick with mongodb for the duration.
Yes, with enough extremely careful coding knowing exactly all the internals of the database, you can probably avoid these giant gotchas. But that's a hell of a lot of work for a DB that's supposed to be easy.
And then you're still dealing with a database without ACID guarantees. No transactions, very little atomicity.... and good luck if your server crashes... we've had multiple customers have their DB corrupted that way.
MongoDB is only good for storing data you don't really care about... in which case, why are you bothering to store it at all?
What most people end up storing in Mongo: strongly schema'ed, relational data that is critical to their application. This is exactly what mongo is not for.
Strongly biased comment here, but I hope it's useful.
Have you tried ToroDB (https://github.com/torodb/torodb)? It still has a lot of room for improvement, but it basically gives you what MongoDB does (even the same API at the wire level) while transforming data into a relational form. Completely automatically, no need to design the schema. It uses Postgres, but it is far better than JSONB alone, as it maps data to relational tables and offers a MongoDB-compatible API.
Needless to say, queries and cursors run under REPEATABLE READ isolation mode, which means that the problem stated by OP will never happen here. Problem solved.
Please give it a try and contribute to its development, even just with providing feedback.
Right now ToroDB handles sharding at the backend (RDBMS) level, with those dbs that support it. There's currently a Greenplum-based backend in the works, which obviously handles sharding by itself. Also CitusDB is on the roadmap.
At a later release, we also plan to natively support MongoDB's sharding protocol.
My general feeling is that MongoDb was designed by people who hadn't designed a database before, and marketed to people who didn't know how to use one.
Its marketing was pretty silly about all the various things it would do, when it didn't even have a reliable storage engine.
Its defaults at launch would consider a write stored when it was buffered for send on the client, which is nuts. There's lots of ways to solve the problems that people use MongoDB for, without all of the issues it brings.
I really agree with your sentiments; that first paragraph is a great quote. I grew quite averse to MongoDB after researching it. While I never found this specific caveat, I found other very worrying decisions.
> reliable storage engine
By "reliable" I assume you mean "consistent?" While MongoDB claims that it's CP (which it's not, as per the article) there's nothing wrong with inconsistent databases (AP, e.g. CouchDB). Mathematically there is no reason for MongoDB to behave like this. It's fundamentally broken; it's neither AP nor CP.
I actually mean reliable. It's probably different now, but at launch, the defaults were fsync'ing every 30 seconds or so. It would literally just apply the change to a memory-mapped buffer and fsync it once in a while.
They did that so they could look good in benchmarks, and it's why they recommended so strongly that your data completely fit in RAM or else things would fall apart (pro-tip: any system that recommends that has a poorly designed storage engine).
They also screwed up the consistent side of things as well.
I have moved from Mongo to Cassandra in a financial time series context, and it's what I should have done straight from the getgo. I don't see Cassandra as that much more difficult to setup than Mongo, certainly no harder than Postgres IMHO, even in a cluster, and what you get leaves everything else in the dust if you can wrap your mind around its key-key-value store engine. It brings enormous benefits to a huge class of queries that are common in timeseries, logs, chats etc, and with it, no-single-point-of-failure robustness, and real-deal scalability. I literally saw a 20x performance improvement on range queries. Cannot recommend it more (and no, I have no affiliation to Datastax).
Genuinely curious: when you say "it brings enormous benefits to a huge class of queries that are common in timeseries", what are you referring to, exactly?
I run Cassandra in production and I love its operational simplicity, scale-out design, and write performance. But I think its support for time series is perhaps over-hyped. To me, it seems the only queries you can run in Cassandra is a key lookup (partition key row get) and a column slice (partition key row get filtered by an ordered range of columns). This allows for a certain time series use case e.g. where each row represents exactly one series, and where the only thing you want to do with a series is to get its raw values. But it doesn't allow for many of the things I personally think of when I think about "time series queries", e.g. resampling, aggregates, rollups, and the like.
I am referring to anything that resembles a range query, ie, where you require a bunch of contiguous information queried on a single key. Think "give me all of this person's chat entries from x time to y time", or indeed "give me all this topic's comment entries from x time to y time" (but not both - only one of the above would be efficiently stored - you decide which it would be).
Cassandra, as you know, forces a certain amount of "low level awareness" requirement on the programmer because to tap into its uniqueness, you need to know how you will query stuff, so that Cassandra will ensure that the most common range queries are contiguously stored in rows. All other databases hide the on-disk storage order from you in an abstraction, and you can find atomisation causing inefficiency. Cassandra forces you to think about it, and in return, guarantees contiguous storage order on disk along one of your keys so that along that key, retrieval is lightning fast as it requires only one pass.
Basically, both spinning disks but also SSDs, are in essence, 1d media (ie, a lot in common with tape) in the sense that along one dimension you can read stuff massively fast, but as soon as you need to seek (ie start using dimension 2), even on an SSD, your performance dramatically declines. Cassandra forces you to think about your queries so that they will be "aligned" along the most efficient direction on disk.
Now agreed that if your queries cannot be aligned along said direction, then Cassandra drops to being no better than all the others, and penalises you with some complexity. That includes some examples of aggregates, resampling etc (though I would argue that the order-of-magnitude faster contiguous read still helps these). Some of this can be mitigated with denormalisation, i.e. storing stuff more than once, in transposed or sub-sampled orders, something that relational DB purists will hate, with some justification (potential for inconsistency).
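A rough sketch of that modelling with the Node cassandra-driver (keyspace, table and column names are all made up): the partition key picks the partition, the clustering key fixes the on-disk order within it, so the range query below is a single contiguous read.

  const cassandra = require('cassandra-driver');

  const client = new cassandra.Client({
    contactPoints: ['127.0.0.1'],
    localDataCenter: 'datacenter1',
    keyspace: 'chatapp' // assumed to exist already
  });

  // One partition per user; messages stored contiguously in time order.
  const ddl = `CREATE TABLE IF NOT EXISTS chat_messages (
    user_id text, ts timestamp, body text,
    PRIMARY KEY ((user_id), ts)
  ) WITH CLUSTERING ORDER BY (ts DESC)`;

  async function setup() {
    await client.connect();
    await client.execute(ddl);
  }

  async function chatHistory(userId, from, to) {
    const rs = await client.execute(
      'SELECT ts, body FROM chat_messages WHERE user_id = ? AND ts >= ? AND ts < ?',
      [userId, from, to],
      { prepare: true }
    );
    return rs.rows;
  }

The transposed copy mentioned above would just be a second table with the keys swapped (e.g. topic first), written alongside this one.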
FWIW Riak TS sounds promising, with automatic "blob"-style storage and resampling capabilities, which might take Cassandra on quite explicitly and in a higher-level, more convenient way. I am about to evaluate it because I agree with you that the resampling capability in particular could be better supported in Cassandra, though ultimately both databases will still be limited by the underlying D1 vs D2 "contiguous vs seek" characteristics of the storage, so I'm not expecting miracles from Riak.
By the way, I'm not even touching on Cassandra's scale-out ease. More perf needed? Literally just add boxes though it would be unfair not to comment on the cost of this, which is Cassandra's node-level consistency tradeoffs for very recently added data, and which is, if I recall correctly, why Facebook went to Hbase. You can force consistency at the query level, but performance can suffer.
After some truly horrific experiences with Riak K/V, especially combined with Riak Solr, I won't touch anything from Basho with a ten-foot pole. Not sure what's going on over there, but the reality of Riak in production was miles away from what Basho's sales claimed was possible. And yes, we even spent about 4 months working with their tech support. It almost seems that "It's based on Erlang, thus it scales" was the entirety of their design work.
I've also worked with Cassandra and have nothing but good to say about it, did what we asked it right out-of-the-box. Datastax was really helpful as well.
--
And I have no affiliation with either Basho nor Datastax, just really happy with one product and completely blown away with the poor performance of the other.
Weird to see that Mongo is still around. We started to use it on a project ~4 years ago. Easy install, but that's where the problems started. Overall a terrible experience: low performance, messy syntax, unreadable documentation.
They seem to still have this outstanding marketing team.
Should an infrastructure company be advertising the fact that it didn't research the technology it chose to use to build its own infrastructure?
All these people saying Mongo is garbage are likely neckbeard sysadmins. Unless you're hiring database admins and sysadmins, Postgres (unless managed - then you have a different set of scaling problems) or any other traditional SQL store is not a viable alternative. This author uses Bigtable as a point of comparison. Stay tuned for his next blog post comparing IIS to Cloudflare.
Almost every blog post titled "why we're moving from Mongo to X" or "Top 10 reasons to avoid Mongo" could have been prevented with a little bit of research. People have spent their entire lives working in the SQL world, so throw something new at them and they reject it like the plague. Postgres is only good now because it had to add some of these features in order to compete with Mongo. Postgres has been around since 1996 and you're only now using it? Tell me more about how awesome it is.
My goal in writing this post was not to convince people to use or not use MongoDB, but to document an edge case that may affect people who happen to use it for whatever reason, which as far as I could tell was inadequately documented elsewhere.
Only the first line was directed at you - and it was more in jest. Everything else was directed more at the other commenters and Mongo detractors in general.
While I love to hate on MongoDB as much as the next guy, this behavior is consistent with read-committed isolation. You'd have to be using Serializable isolation in an RDBMS to avoid this anomaly.
I think this is incorrect, but it's not as simple as the other replies are making it out to be.
Under read-committed isolation, within a single operation, you must not be able to see inconsistent data. So if you do "SELECT * FROM t" on a table while rows are being updated, you're guaranteed to always see either the old value or the new value. But if you run two separate statements, "SELECT * FROM t WHERE value='new'" and "SELECT * FROM t WHERE value='old'", in the same transaction, you may miss the row entirely because its value could have changed in between. Serializable isolation prevents this case, typically by holding locks until the transaction commits.
It gets messy because the ANSI SQL isolation levels are of course defined in terms of SQL statements, which don't map perfectly to the operations that a MongoDB client can do. Mongo apparently treats an "index scan" as a sequence of many individual operations, not as a single read. So you could argue that it technically obeys read-committed isolation, but it definitely violates the spirit.
This is worse than read-committed because you're not even seeing the old state of the document. If an update moves a document around within the results, and it ends up in the portion you've already read, you just don't see it at all.
In postgres (and a fair number of other databases) you'll not see that anomaly, even with read committed. Usually you'll want to have stricter semantics for an individual query, than for the whole transaction.
Quoting from the very first paragraph of the blog post:
> Specifically, if a document is updated while the query is running, MongoDB may not return it from the query — even if it matches both before and after the update!
How's that compatible with READ COMMITTED isolation level?
The real problem with Mongo is that it's so enjoyable to start a project with that it's easy to look for ways to continue using it even when Mongo's problems start surfacing. I'll never forget how many problems my team ended up facing with Mongo. Missing inserts, slow queries with only a few hundred records, document size limits. All while Mongo was paraded as web scale in talks.
I haven't looked at Apollo, but GP should've explained that GraphQL is not a database and can be hooked up to any backend, so with it I'm guessing you can use any kind of database in Apollo/Meteor apps. Still, kind of weird.
I remember when Meteor was the JavaScript flavour-of-the-month and everyone was saying it will kill Rails. I wanted to believe so I looked into Meteor, then I saw its dependence on MongoDB... Nope!
If you study the backend JavaScript ecosystem you'll see no Rails-like all-in-one framework has ever succeeded. They're just not part of the culture. The only Rails-like thing in the JS ecosystem that's popular and great is Ember but that's only frontend.
MongoDB reminds me of an old saying that if you have a problem and you use a regex to solve it, you end up with two problems.
I have personally used MongoDB in production two times for fairly busy and loaded projects, and both times I ended up to be the person that encouraged migrating away from MongoDB to a SQL based storage solution. Even at my current job there's still evidence that MongoDB was used for our product, but eventually got migrated to PostgreSQL.
Most of the times I've thought that I chose the wrong tool for the right job, which may be true, but still leaves a lot of thought about the correct application. Right now I have a MongoDB anxiety - as soon as I start thinking about maybe using it(with an emphasis on maybe), I remember all the troubles I went through and just forget it.
It is certainly not a bad product, but it's a niche product in my opinion. Maybe I just haven't found the niche.
This single issue would make me not want to use MongoDB. I'm sure there are design considerations around it but I rather use something that has sane semantics around these edge cases.
Not when they're changing rapidly, anyway. Well, that's relaxed consistency for you.
Does this guy have so many containers running that the status info can't be kept in RAM? I have a status table in MySQL that's kept by the MEMORY engine; it's thus in RAM. It doesn't have to survive reboots.
A bit late to the party, but we use Couch + Pouch at work.
It is in some ways magic, and I literally don't know how we'd achieve what we do without it.
But.
The big pain for us is filtered replication: we have a bunch of cell phones (Pouch) that should only see some of the documents from the server (Couch). No matter how you slice it, you pretty much need filtered replication in there somewhere.
And it's sllllooooooowwwwwwwww. It basically canes a CPU core per replicating user while it's running, which means you can only have about as many users replicating at any given time as you have cores, which is terrible. There isn't really much of a solution for it, except "don't use filtered replication".
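For anyone who hasn't seen it, the setup looks roughly like this (PouchDB on the phone; the database URLs, design doc and filter names are hypothetical). The server re-runs that JS filter over the changes feed for every replicating client, which is where the per-user CPU burn comes from:

  // Server-side filter function, stored in a design doc on the Couch side:
  //   function (doc, req) { return doc.owner === req.query.user; }

  var PouchDB = require('pouchdb');

  var local = new PouchDB('clinic');
  var remote = new PouchDB('https://couch.example.com/clinic');

  local.replicate.from(remote, {
    live: true,
    retry: true,
    filter: 'app/by_owner',
    query_params: { user: 'alice' }
  });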
I'm not the original poster, but I can give you some of my limited experience with CouchDB from an application I inherited. The original idea for the project still seems like a good idea to me. Basically they wanted to record events that came into the system and store them in a write only ledger. Then they wanted to version every change so that you have an audit trail. Finally they wanted to be able to create views of that ledger to create the kind of data that they would work with on a day to day basis. For this, CouchDB seems like a perfect fit.
Unfortunately, it didn't work out as well as one might hope because the people who implemented the idea didn't seem to be able to resist using the DB the way they would use a relational db. Instead of maintaining the concept of a write only ledger, they started to use it as a data store for things that were ephemeral. Also, instead of replicating the db, using a view to create a new db that was optimal for certain queries, they wrote a huge number of views in the main db. Finally they organised the views by relation rather than by use, so you would have 60-80 views in the same design document that would have to be reindexed if one of them changed.
The result was something with very poor performance and where the storage for the indexes was more than an order of magnitude more than the storage for the documents themselves.
CouchDB is also not super speedy at the best of times. There is a lot of latency involved in serializing the documents and farming them out to view servers, etc. So it takes a good 10 minutes to process a million documents, but you will find that your CPU is chugging along at 30-40% utilisation.
Having said all that, one of the things I want to try (but have only done some preliminary trials with) is to keep the concept of the write only ledger, but to replicate the db into several views of the data (some with severely restricted content). Then instead of building something like a rails application to farm out the data, make "couch applications" where you serve the HTML and JS directly from attachments on documents in the DB. In fact, I've written a React application to allow users to interact with portions of the data and it was quite simple. Then you can write a really small coordinating application to allow users to navigate to the parts of the system (really single page apps) that they want to use.
Again, the nice thing about this is that you have a write only data store with versioning and the ability to audit history. You have views that allow you to interact with a small subset of the overall data. You can easily write single page applications where deployment is as easy as pushing a document to the DB. Replication is relatively cheap and you can move expensive view creation to restricted versions of the DB. You can stick the whole thing behind a load balancer and scale it as cheaply as setting up a new replication (again just another document in your DB).
But, I will warn you. Don't use it like you would a relational DB, or else you will be in for a world of hurt. Especially you will see comments in this thread about migrations. If you are migrating your data, by definition you do not have a write-only-with-versioning application. Your application will have to deal with multiple versions of data or else you will not have the ability to audit history. If you do not care about this, then possibly there are better solutions than this.
> one of the things I want to try (but have only done some preliminary trials with) is to keep the concept of the write only ledger, but to replicate the db into several views of the data (some with severely restricted content).
How is that different from the current concept of CouchDB views? You meant to replicate the DB to various different places and use the same CouchDB views from there? Or you meant something like a replication-view, in which some calculations are done with the documents in the source database and the target database receives the result of those calculations as their primary documents?
Yes, the latter. One of the main problems I've seen is that indexing things that you will never query is both expensive in time and space. Also it's amazing how many views tend to have exactly the same data, only sorted differently. And the reason to sort it differently is because you only want to work on a subset of the data, but you can only restrict the query in a contiguous section of keys.
An example of this might be that you have a large number of daily reports. They all need different aspects of the data, but you end up writing views that sort by date and then collate the result in the server. So you end up maintaining an index for data that you will never query again and you are doing lots of extra processing merging the data after the query. Much better to replicate one day's worth of data to a new db every evening (possibly setting up a continuous replication to keep it up to date) and then add views on that db to do what you want. Like I said a full replication of a million documents takes about 10 minutes, so it's a reasonable thing to do.
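As a sketch of that per-day approach (every name here is hypothetical): a filtered replication copies one day's documents into their own small database, and the expensive views live only there.

  // 1. Replication document (e.g. saved into the _replicator database):
  var replication = {
    source: "ledger",
    target: "ledger_2016_05_01",
    create_target: true,
    filter: "reports/by_day",           // filter function in the source db
    query_params: { day: "2016-05-01" }
  };

  // 2. The filter function, in the source db's "reports" design doc:
  //    function (doc, req) { return doc.day === req.query.day; }

  // 3. A view defined only in the small per-day db:
  var dailyTotals = {
    map: function (doc) { emit(doc.customer_id, doc.amount); }.toString(),
    reduce: "_sum"
  };

Add "continuous": true to the replication document if, as suggested above, you want it kept up to date through the day.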
I like the concept of views. I think it is the best part of CouchDB: saving data and later defining views (which should be very fast). Once I wrote an app that stored enormous documents with lots of data, and the views were used to turn that data into queryable information later.
I like what you suggested very much, because what CouchDB can currently do with views is very limited compared to what a powerful views implementation could do, and yours is a good suggestion of how to do it better.
If I understand correctly, this method says "only scan the built-in _id index, not any other index". Which means that you will not hit this index-specific bad behavior, but also that you won't get the performance benefits of using a secondary index.
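In mongo shell terms that's roughly this (collection name made up):

  // Force the planner to walk the built-in _id index instead of the
  // secondary index whose entries move around during updates:
  db.containers.find({ status: "running" }).hint({ _id: 1 })

If I remember right, older servers (pre-4.0) also exposed cursor.snapshot() for much the same effect, at much the same cost.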
Seriously, who looks at MongoDB and thinks "this is a sane way of doing things"?
To be fair, I've never been much of a fan of the whole NoSQL solution, so I may be biased, but what real benefits do you gain from using NoSQL over anything else?
Benefits of using MongoDB? Nothing. There are on the other hand other NoSQL systems which offer real benefits. Like Cassandra which gives reliable distributed database with not too much effort.
I worked with MongoDB quite a lot in context of Rails applications. While it has performance issues and can generally become pain because of lack of relations features, it also allows for really fast prototyping (and I believe that Mongoid is much nicer to work with than Active Record).
When you're developing MVPs and working with ever-changing designs and features, the ability to skip the whole migration step comes in really handy. I would however recommend to anybody to keep a migration plan for the moment the product stabilizes. If you don't, you end up in a world of pain.
RethinkDB has been really mindful of its consistency, has a built-in story for observable collections, and a really great UI out of the box. I really hope I get to use it for more than toying around.
I'm not sure that I'd choose Mongo over alternatives these days, if I'm on AWS and can use RDS, I'd go PostgreSQL mostly, but RethinkDB, ElasticSearch, C*, and others also have their places.
Mongo in one word: popular. Couchbase, not so much. I'd hazard a bet that you'd find some dirt under that couch if you went looking, but not enough people have looked at it yet.
There are like 5+ storage engines available for Mongo, probably more, but those are just the ones I'm aware of, plus various forks, like TokuMX and Percona, etc... This is all FUD.
So it's a bit weak in the design department, offers a bit less rigid semantics than one might hope, and from the start it's a technology that was almost a reaction to the rigid and enterprise-y systems of old.
Unless you want to code every RDBMS and enterprise feature in the application layer, don't use Mongo; use Postgres or use MarkLogic. It is 'nosql', but it is ACID compliant and uses MVCC, so what the queries return is predictable.
CouchDB is rock solid. Used it for 5 years now. Never got corrupted data. Has master-to-master replications. Really shines in sometimes offline operation mode (with re-sync on reconnect).
I use that extensively to build custom replication cluster topologies (overlapping rings, star, hierarchy), etc.
Has HTTP interface so easy to build clients for.
Transactions are per document only, so you have to design your application to accommodate that. Raw single-document write speed is not as fast as Mongo or Postgres, but I noticed that in a concurrent environment, with multiple connections writing, it scaled pretty well.
Moreover, CouchDB 2.0 will have built-in clustering from code donated by Cloudant. And it will also have a query language similar to MongoDB's (instead of having to use JavaScript / Python / other map-reduce functions).
Did you miss the part about how they're running a hosting platform that stores details about the status of containers for all their customers?
Zookeeper is fine for things like service discovery that deal with a bounded amount of data. You don't want to use it for something where the amount of data depends on, say, how many containers your customers decide to start. Every ZK server keeps all of its data on the Java heap, so if your data gets too big, pow. How big is too big? Don't worry, you'll find out the hard way sooner or later!
Plus, there's no sharding -- every write operation has to be acknowledged by a majority of nodes in your cluster. So for write-heavy workloads (which is what I would expect a service status dashboard to experience) your cluster actually gets slower if you try to add more machines.
Zookeeper slows down when you add nodes since quorum/consensus is larger. You can mitigate some of this with non-voting nodes (observer nodes) but only up to certain extent. So yes, a single Zookeeper cluster won't scale horizontally.
But that doesn't limit the amount of independent clusters you can have.
The reason I suggested Zookeeper is because it offers you ephemeral nodes, which is convenient to mark stuff as unavailable.
I am pretty excited about CockroachDB. It's still in beta so it's not suggested for production use yet, but it's being designed pretty carefully and by a great team. Check them out: cockroachlabs.com
It's not, go ahead and use it, learn and gain experience. It's not a replacement for SQL databases. It doesn't have joins, and the biggest issue academics and sysadmins have is that it's not fully ACID compliant, so no transactions, for example.
If I was writing this 2 years ago, I would say horizontal scaling is much easier. Add a node to your replica, watch it catch up, and continue.
Have data stored in an array 4 levels deep? Mongo will find it for you. It's only difficult to switch to an alternative to the extent that you've convoluted your schema in an unfriendly way. Most migration entails normalising your data into different SQL tables and exporting it. It's not the rocket science people make it out to be.
I use SQL at work, Oracle, SAS, a bit of MySQL and sometimes Postgres - I'm a consultant.
I have tried some NoSQL DBs but always come back to Mongo, for personal projects.
I've done a few prototypes for clients using Mongo, but those are almost always for geospatial support.
I'm prototyping in meteor using MongoDB and Compute Engine.
I have two VM instances in google cloud platform. One is a web app and the other is a MongoDB instance.
They are in the same network. The connection I use is their internal IP.
Can other people eavesdrop on the traffic between my two instances?
TL;DR During updates, Mongo moves a record from one position in the index to another position. It does this in-place without acquiring a lock. Thus during a read query, the index scan can miss the record being updated, even if the record matched the query before the update began.
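A rough sketch of that failure mode in shell terms (collection, field and values are made up to mirror the container-status example):

  db.containers.createIndex({ state: 1 })

  // Reader: index scan over state, in index order
  db.containers.find({ state: { $in: ["running", "starting"] } })

  // Concurrent writer: the container finishes starting up
  db.containers.update({ _id: "c42" }, { $set: { state: "running" } })

  // "running" sorts before "starting", so the document's index entry jumps
  // backwards; if the scan has already passed the "running" entries, it never
  // returns the document even though it matched both before and after the
  // update.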
Seriously. I was looking at DB usage statistics recently and was appalled MongoDB is still so popular. I thought it was done, nail in the coffin, when https://www.youtube.com/watch?v=b2F-DItXtZs came out 6 years ago, I haven't followed it much since then apart from the occasional post like this whose content is just "you thought it was bad already? haha it's worse."
That crosses into personal attack, which is not allowed on Hacker News.
Also, please don't create many obscure throwaway accounts on HN. This forum is a community. Anonymity is fine, but users should have some consistent identity that other users can relate to. Otherwise we may as well have no usernames and no community at all, and that would be an entirely different forum.
"Power" in an organisation is the formal authority necessary to fulfill your reponsibilities. The choice of a particular database technology doesn't make the responsibility for the security, availability etc of the data "go away". As I say you can outsource the mundane tasks like "doing backups" but the buck still stops with someone. If you don't know who that is, it might be you...
Programmers should have more than a passing understanding of database administration. They should understand normalization / constraints / ACID / SQL / etc. This was core curriculum in my CS undergraduate degree. More importantly, they should understand why these things are important and when they are needed. Most of the commenters here are probably programmers who understand these things -- or programmers who have gotten burned by database stew and learned the hard way. It seems like you would do well to take a deep breath, drop the antagonistic view of traditional databases (just another tool), and educate yourself on their use and implementation.
The people who work on databases are (in my estimation) usually up there with those who work on operating systems, compilers, and video games. Most programmers are simply not dealing with very interesting constraints in terms of latency, extensibility, throughput, storage, concurrency, or other challenging requirements, but database people are. Don't be irritated when they ask for better tools.
Your many accounts trolling this thread are a serious abuse of Hacker News, and we've banned all of them. Please don't do anything like this here again.
I have been a DBA for 20 years, Oracle, SQL Server, Sybase, Informix, Postgres, MySQL/MariaDB... About the only big name I haven't had serious experience with is DB2. We did hot-releases into Prod routinely.
I have absolutely no idea what a "SQL migration" is or why it would be so hard. I assume it's just a bogeyman made up by the MongoDB snake-oil salesmen. It is certainly not something "traditional" database people actually do, or worry about.
> I have absolutely no idea what a "SQL migration" is
Transforming data to conform to a new schema. It's a tedious process that could use some automation, but ditching schemas because data migrations are tedious is like ignoring traffic rules because you're in a hurry.
> I assume it's just a bogeyman made up by the MongoDB snake-oil salesmen.
It's just like dynamic type marketing: “If you never define the structure of your data, you never need to worry about it!” The only problem is that it isn't true.
Oh, that doesn't happen much, because serious organizations have dozens of apps connected to a DB and no-one would be crazy enough to big-bang change them all at once to use a different schema. Adding columns and tables tho' is easy, populating new tables from existing, replacing tables with views, no problem.
I'd argue the inverse. Serious organisations shouldn't have lots of different apps reading from each other's databases, for the reason you suggest above. It makes changes difficult and the application landscape fragile. It also encourages not having well-documented interdependencies between your apps (and consequently no way of knowing what the impact of any given change to an app will have).
It's not uncommon however in large organisations with mature applications, without strict IT policies.
Complex applications that are undergoing active changes shouldn't be constrained in this manner, and in that context a SQL migration is more of an in-place upgrade of the data structures that hold the application data between different versions of the application.
Adding columns, adding tables, replacing columns with other columns and managing how the data should look in the new schema, based on the 'old' data structures.
They are not "each others" databases tho'. It is "the database", deliberately chosen as the integration point. The alternative is a horrific tangle of replication, or a vast undiscoverable landscape of tiny APIs.
That sounds horrible. No wonder you've never done a migration - it's been made impossible. Why would you not hide the database behind an API, so that you CAN migrate it, change it, etc? This is like not having a separate data layer in your application, except it's across many applications all at once.
What API do you suggest? Bear in mind it needs to provide the same results and enforce the same constraints for everything from COBOL to Java to Excel to ColdFusion to Python to R and any other future language. I mean you could write your business logic in all of them, and maintain it in all of them, but that would be insane.
Yes, and the way you solve it is by having an API that holds your business logic. That's how you avoid duplicating business logic in all those applications. Is your database just chock full of insane triggers and functions?
Business logic doesn't belong in a database. Data does.
Doing large incompatible migrations is unavoidable eventually, so avoiding them doesn't solve the problem either it just postpones the problem until the entire enterprise has to move to an entirely "new system" at fantastic cost. For some reason the old system never really dies so after the multimillion dollar migration the old zombie system still supports some business functions, leading to that replication anyway.
Both solutions are unpalatable: "the database is the integration point we can't ever change" is bad and "no apps have a central ground truth database" is also painful.
I'd prefer the latter with small APIs and more duplication, than the common enterprise solution with a massive holy unchangable db directly serving apps.
Yes. But "adding" is a pretty small subset of changes. If you need to do even a trivial migration like the classic "merge first and last name columns into a single column" is pretty hairy if you have a dozen legacy apps that assume "customers" have "firstname".
In the end it means you give the data an elevated role in the system, and you write the code around the data. When the code would be cheaper to write or maintain with a different data representation - you can't. It's a comfortable situation for a while, but when you eventually grow out of the current system it's a world of hurt.
Like I said - it's two bad (non) solutions but I very much prefer duplication, isolation, inconsistency, a forest of small obscure APIs etc. where I can add/remove/normalize/denormalize/add caching etc - to "the data is the ground truth and the schema we can only change through addition" non-solution.
> In the end it means you give the data an elevated role in the system, and you write the code around the data.
That's how things are meant to be. The structure of a program naturally follows that of the data it manipulates.
> When the code would be cheaper to write or maintain with a different data representation - you can't.
Who says that? The real problem is that most programming languages aren't capable of integrating database schemas into the type-checking process, so when you change a schema, it's difficult to make sure applications follow through in a consistent manner. That's what makes change unnecessarily risky and painful. But it's possible to do better than that: http://impredicative.com/ur/
In such a scenario, none of the apps "own" the data - the DB is a primary source of truth and consistency is enforced there instead of relying that the code in the many apps will match.
Application logic can undergo active changes and refactoring without changes to the permanent data storage; indeed, a 1-to-1 correspondence between DB and application data structures isn't expected. It's just like backwards compatibility with file formats: changing the application version shouldn't require incompatible in-place changes to the backing data persistence layer. You may need to add some new things, but the old version of the app should work fine with the updated DB.
I'm not going to comment on the content of your posts - others have done that. But you might want to check your tone. You sound like a petulant teenager offended by the mere notion that someone might think you wrong.
It honestly makes it really hard not to dismiss whatever you say out of hand. If you don't really care about people taking you seriously then by all means keep it up, but if you're trying to participate in a grown up conversation, you might want to stop throwing tantrums. You know, the basic rule of human interactions, show people the respect you expect them to show you?
> Unless you have never written migrations in SQL before you would know that they are even scary and a big bunch of sql
I've written a ton of SQL migrations this year, none of them are scary or even really count as "big bunch of sql". If this describes your code, you should probably stop and rethink what you're doing, because that's bad practice.
The difference between, eg, merging your "first_name" and "last_name" fields into a single "name" field with Mongo or a SQL DB is basically that the SQL DB has better tooling so it'll be cleaner, faster, and more concise.
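Roughly what the Mongo side of that migration tends to look like in practice (shell syntax, hypothetical collection; on the older servers of that era there's no transaction to wrap it in, and nothing guarantees every document even has both fields):

  db.users.find({ first_name: { $exists: true } }).forEach(function (u) {
    db.users.update(
      { _id: u._id },
      {
        $set: { name: (u.first_name || "") + " " + (u.last_name || "") },
        $unset: { first_name: "", last_name: "" }
      }
    );
  });

The SQL version is an UPDATE plus a couple of ALTER TABLEs that the framework's migration tooling typically runs inside one transaction for you (at least on Postgres).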
> Now in your migration script you have to hack SQL and language-based ORM commands
Also false. Nothing requires you to use the ORM commands if they don't make your life easier; literally last week I wrote a migration for some ORM based code using pure vanilla SQL because it was more appropriate. If you know what your ORM is doing, it's trivial. And if you don't, well, you have bigger problems.
> Actually they can be just as explicit and documented.
With a relational DB: Check the table schema.
With mongo: Dig through the app code and try to reverse engineer what fields the code is expecting. (Or pull up a few documents and see what fields actually exist, then ponder how you know whether you're looking at a document that uses the latest schema, or maybe an outdated or dead document using an outdated schema. Also, how do you know what optional fields might be missing? Or what possible values the currently null fields might have? Good luck with that.)
> Mongo even has schemas now I think.
No. Mongo still lists flexible schemas as a feature, ie, they do not enforce any specific schema.
> SQL constraints are very limited and literally every application has additional constrains
Yeah, but what relational DBs are very good at is enforcing foreign key constraints. If your data is relational at all (and let's be real, very, very little world data isn't at least somewhat relational), you'll need it, and Mongo doesn't have it. (Well, Mongo doesn't even have foreign keys, but it lacks the equivalent feature for dealing with denormalized data too.) Managing to set some non-numeric characters in your phone number field when some of your app code contains a hard assumption it will be numeric is bad, yeah. But when you start screwing around with relations, foreign keys, failing to properly propagate changes to every copy of a denormalized data structure...oh man, you can spend days trying to untangle that mess. And it's a class of error that relational DBs don't have unless you misuse them badly.
In short, you seem to be suggesting that as long as you're blindly using an ORM, Mongo is almost as good and flexible as Postgres. Which...sure, okay? I guess?
> what relational DBs are very good at is enforcing foreign key constraints. If your data is relational at all [..]
Nitpick: the "relational" in "relational database" is not a reference to the concept of foreign keys, but to "relational algebra", the mathematical model for regular data structures that SQL is based on.
I disagree with this. After doing a bunch of work with SQLAlchemy and Hibernate over the years, I've developed a major dislike of ORMs as a replacement for SQL and would much rather write my queries in plain SQL. My current favourite SQL library is HugSQL[1], which lets you write your queries in SQL in .sql files; HugSQL loads them and provides functions you can call to invoke each query.
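For anyone who hasn't seen it, a HugSQL query file is just annotated SQL; from memory (and with made-up table and query names) it looks roughly like this:

    -- :name user-by-email :? :1
    -- Fetch a single user row by email address.
    SELECT id, email, created_at
    FROM users
    WHERE email = :email

The library reads the comment header and generates a function you call with a parameter map like {:email "..."}.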
I find SQL quite solid for querying and updating data. It's what it was designed to do. Much more solid than the query APIs I've seen in ORMs.
However, more importantly, when performance matters it is very difficult to properly understand the performance characteristics of complex ORM queries (I often find myself making the ORM generate the raw SQL, which I then manually review to work out what it's doing, eg which indexes it uses), and I find it much easier to optimise SQL than ORM calls.
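That review step is usually just pasting the generated SQL into EXPLAIN; a sketch with hypothetical tables:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at > now() - interval '7 days';
    -- the plan shows which indexes are hit, the join strategy, and estimated vs actual row counts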
My point isn't to badmouth ORMs. If you like them and they solve your problems, that's awesome. My point is that calling SQL a "shitty language" is purely subjective, your opinion, and other people have different opinions. SQL is not an objectively shitty language. I think it's a pretty good data query and manipulation language. Not perfect, but definitely not shitty.
And no, I'm not a DBA. I'm a programmer who hates having to mess around with databases. It is for this reason that I want things to just stay out of my way so I can get my work done with minimal touching of the database. I've recently found myself forced to do things I never wanted to do (1. change data capture on a MySQL database; 2. query and table optimisation to make a database perform and scale better) and for #2, my life would have been a lot easier if I could have worked with raw SQL.
Tl;dr - you used a crappy ORM and now think all ORM libraries suck. So... use MongoDB... which is always accessed through an ORM...
All your comments come across as confused, mate. Enforcing arbitrary constraints belongs in the business logic of your app, but dealing with consistency issues does not. Postgres handles row constraints just fine - you can enforce any arbitrary logic (e.g. colX != colY and it's not the first Tuesday of the month) - and combined with stuff like expression indexes, the json type and actual schemas, it leaves me confused as to why you would advocate mongo so hard.
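To make that concrete, a rough sketch of the kind of thing plain Postgres gives you (table and column names invented):

    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL,
        CONSTRAINT payload_has_type CHECK (payload ? 'type')  -- every document must carry a "type" key
    );
    CREATE INDEX events_type_idx ON events ((payload->>'type'));  -- expression index on a field inside the JSON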
Just because you write your migrations in a language like JavaScript doesn't mean it's any better than pure SQL.
Your replies in the comments are rude and borderline fanatical in tone. There are two classes of constraints: one is 'this string shouldn't be longer/shorter than x' or 'this value should only be between 1 and 10'. The other kind is 'on the first Tuesday of every month, only during a full moon, can this record be deleted, and only by a user with permission X'.
The first can easily be modelled in SQL, the second should be modelled in your app. The first are not arbitrary, the second are. Not hard to comprehend mate. Just because SQL can't model all constraints doesn't mean you should throw it out the window and reach for mongo.
Comments like 'none of you understand this because you're not programmers and therefore know nothing' are incredibly insulting. I'm of the opinion that you know nothing, and if you're happy knowing nothing with fellow MongoDB users then so be it. Just keep your nonsense away from me, thanks.
> A number between 1 and 10 on Tuesday is only slightly different. You can say it is not arbitrary, whatever.
It's not arbitrary, because the constraints at least in Postgres are themselves constrained to validating a single row in a single table. Your app code can be truly arbitrary, you could trigger a device that posts a carrier pigeon to someone many miles away and waits for it to return with a result. That's arbitrary.
> Just in case you forgot the original claim, it was that if you use a SQL database, the constraints are in your database, but if you use mongo, the constraints are in your code.
Sorry, that's not what I was claiming. I was stating, correctly, that constraints relating to the data belong in the database. Constraints belonging to the business logic that powers your app belong in your app. You can be 100% sure that a constraint in your database will be respected, whereas the same can't be said of your app's code.
If you are given a specification that says "no username can be longer than 12 characters", that's what you add to your database constraints. A specification that says "No emails from temporary email providers are allowed" belongs in your app code. Clear separation of concerns, something MongoDB has no clue about. 'Just validate everything in your code, and if you get junk data in your database then I fucking hope your code can handle it'. Please.
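The username rule from that spec is literally one line of DDL (sketch, hypothetical "users" table):

    ALTER TABLE users
        ADD CONSTRAINT username_max_len CHECK (char_length(username) <= 12);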
> Therefore, saying that it is a great win that a SQL database can do some basic constraints checking
It can do pretty complex constraints. On the data. The more complex, completely arbitrary constraints belong in your app code.
> is a ridiculous argument because it hardly contributes any benefit due to the fact that even more complex constraints are in the code.
What. That makes no sense. So having 100% certainty that no column value in your database will violate any of the constraints you place on it 'hardly contributes any benefits' because... you have other, unrelated, business constraints in your app? What logic is this?
Your code enforces consistency, detects when a transaction fails a consistency rule, and rolls back the changes to the state at the start of the transaction? That's some great code, I'm sure, but it's a lot of effort.
The only limitation on an SQL check constraint is that you cannot use columns in other tables.
As has been pointed out by many of us slow learners, it's far easier to understand constraints in SQL than it often is to dive into a code base.
> No the difference is one has to be written in a shitty language (SQL), whereas the other can be written in an actual programming language.
SQL is a DSL. It has pros and cons for data manipulation. For many combinations of (particular programmer) + (particular thing to get done) + (set of considerations), SQL turns out to be a better tool than a more general-purpose programming language. YMMV, obviously.
> Err no it isn't false. It is based on real-world experience, rather than "theory" about "what you can do". In the real world, people use ORMs and the ORM migration library, and then they find that the migration library doesn't do what they need it to do, so they have to use raw SQL. But then they run into the problem that because the ORM handles all of the generation of indexes and column names and so on, they have to figure out the correct naming of everything so that the ORM understands what you have done. This often results in failure because the ORM does not want you to tinker with the schema yourself.
It sounds like you're making a decent argument against manually modifying a database schema at the SQL level, behind the back of an ORM that can't gracefully handle such surprises. But I don't think anyone is disagreeing with you about that.
> Which requires you to context switch from the code to the schema and piece together all of the relations based on foreign keys.
Many programmers can handle that complexity. Maybe it's the additional layer of indirection/complexity added by your ORM that makes the particular systems you deal with too hard for you to reason about?
You have a house with an array of rooms, and each room has a house with an array of rooms? Etc.
Unless that's a reference to a house in the room class, that is not in any way a good design. And if it is a reference, then you implement it like this:
House is a table
Room is a table that has a foreign key that references the House primary key.
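In Postgres terms, roughly (a sketch; column names invented):

    CREATE TABLE house (
        id   bigserial PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE room (
        id       bigserial PRIMARY KEY,
        house_id bigint NOT NULL REFERENCES house(id),  -- "the rooms of a house" is just this one foreign key
        label    text NOT NULL
    );

"Give me a house and its rooms" is then a single join, and the database guarantees no room can ever point at a house that doesn't exist.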
> No the difference is one has to be written in a shitty language (SQL), whereas the other can be written in an actual programming language.
Your other comments already make you sound like a petulant child and a moron (or at the very least, someone with no actual real-world experience), but this is just the cherry on the cake.
SQL is extremely powerful, and using it effectively doesn't take much more effort than learning another new mainstream programming language.
Anyway, enjoy your tortured use of shitty ORMs + Mongo for your toy projects; move aside and let the real programmers go on with their work.
Arbitrary constraints which look at multiple rows require serializable isolation to work under concurrency, and most people do not run their databases at serializable. But you get pretty far with single row check constraints and exclusion constraints (exclusion constraints can be used for checking against range overlap).
There's no need for the serializable isolation level in the common case. It depends on the particular data model, the particular access pattern, and the particular database implementation (MVCC or locking, and how exactly transaction isolation levels are implemented - e.g. Oracle's "serializable" is closer to PostgreSQL's "repeatable read").
Anyway, any sufficiently modern RDBMS provides a pretty good level of performance even at the serializable isolation level, thanks to decades of tuning and research in the field.
Not in the common case, but it is necessary in the general case. One of the main motivations I have heard from the few people actually using serializable in their systems is the ability to enforce arbitrary constraints under parallelism.
And, yes, performance should in general still be good, but there is less knowledge out there about how to solve the performance issues specific to serializable since there are few people who use it (at least in the PostgreSQL world).
Well, not many people use serializable in their systems (for the whole system, for every transaction), and that is a common and reasonable approach. They use the minimal isolation level for each particular transaction that keeps their data consistent. The most widely used safe default is repeatable read, in PostgreSQL terms.
Broadly speaking, there are of course some quirks in the field - start with "A Critique of ANSI SQL Isolation Levels"[1] by Jim Gray et al., an author of the highly respected fundamental book about transactions[2]. But the very kind of problems discussed in the RDBMS world contrasts sharply with the "ACID? why do we need it?" attitude that is so common in the world of "web scale NoSQL".
> "using serializable in their systems is the ability to enforce arbitrary constraints under parallelism"
That's a funny way to say that. :-) Because what it really means is, "wanting to enforce a total lack of parallelism when under parallelism." I mean I get what you're saying, it was just funny to read.
Can you give a use-case for such a constraint that can not be avoided by a better data organization? I'm not sure why you would deliberately design your data with constraints based on other row contents.
Also: how do you enforce such constraints with MongoDB exactly? If the answer is "do it in the application", then your answer applies to relational databases too.
A common example is making sure that every event in a bookkeeping system balances to zero (e.g. three rows: -125 EUR bank, +100 EUR office materials, +25 EUR tax). To enforce this at the database level you either need to run at SERIALIZABLE, take a table lock, or do something horrible with your database structure (like putting all rows of the event in a JSON blob).
When I have solved this case I have done it in the application by making sure we are only using a small set of carefully audited stored procedures to modify the event table.
MongoDB does not even try to solve this kind of problem, and I am the wrong guy to ask if you want someone to try to make a case for using it anywhere.
> A common example is making sure that every event in a bookkeeping system balances to zero (e.g. three rows: -125 EUR bank, +100 EUR office materials, +25 EUR tax). To enforce this at the database level you either need to run at SERIALIZABLE, take a table lock, or do something horrible with your database structure (like putting all rows of the event in a JSON blob).
Or you deny direct table inserts, and provide a stored procedure for inserting multiple rows in one transaction. But I see how it could be useful in the generic case.
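A rough sketch of that approach in PL/pgSQL (names invented, error handling kept minimal): the function takes all rows of one event, refuses them unless they sum to zero, and inserts them in a single statement.

    CREATE TABLE entry (
        event_id bigint  NOT NULL,
        account  text    NOT NULL,
        amount   numeric NOT NULL
    );

    CREATE OR REPLACE FUNCTION insert_event(p_event_id bigint,
                                            p_accounts text[],
                                            p_amounts  numeric[])
    RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        -- (an empty event slips through this check; a real version would also validate array lengths)
        IF (SELECT sum(a) FROM unnest(p_amounts) AS a) <> 0 THEN
            RAISE EXCEPTION 'event % does not balance to zero', p_event_id;
        END IF;
        INSERT INTO entry (event_id, account, amount)
        SELECT p_event_id, acc, amt
        FROM unnest(p_accounts, p_amounts) AS t(acc, amt);
    END;
    $$;

    -- then revoke direct INSERT on the table from the app role and grant EXECUTE on the function instead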
Which is exactly what I did when I had to solve the problem. :) And you are right in mentioning that the fact that you can restrict modification of the database to only certain stored procedures is a great help here.
With on commit triggers or deferrable check constraints it would be easier, but in Oracle you could use a materialized view with "on commit refresh" and add triggers to this MV.
You don't have to avoid such constraints if they are actually natural to the data model. It's much faster (and safer, thanks to transactions) to check new data against old data when the checking code is as close to the data as possible, instead of making a couple of extra round trips between the DBMS and the business logic layer.
I've just commented with pretty much the same as you're saying.
Some people sound like they haven't even tried Mongo.
It's great that PG has JSON support now, but a few years ago people were hacking together stuff on hstore, storing their geo objects as binary blobs that you can't read without extensions, etc.
SQL databases have been playing catch up with JSON, MSSQL being the worst. Even with JSON support, it's still a pain looking at some SQL queries that have to be written.
If postgres does it for you, don't go bashing everything else that tries to be an alternative I feel.
> If postgres does it for you, don't go bashing everything else that tries to be an alternative I feel.
People bash on MongoDB because they care. People just want to raise awareness that there are two types of MongoDB users: Those who understand it is a deeply broken system, and those who haven't used it thoroughly. The truly careless and anti-social approach would be for people to leave no negative comments about MongoDB and let others fall into the same trap.
Every time I hear arguments for going back to relational databases, I remember all the scalability problems I lived through for 15 years in relational hell before switching to Mongo.
The thing about relational databases is that they do everything for you. You just lay the schema out (with ancient E-R tools maybe), load your relational data, write the queries and indexes, and that's it.
The problem was scalability, or any tough performance situation really. That's when you realized RDBMSs were huge lock-ins, in the sense that they would require an enormous amount of time to figure out how to optimize queries and db parameters so that they could do that magic outer join for you. I remember queries that would take 10x more time to finish just by changing the order of tables in a FROM. I recall spending days trying different Oracle hints just to see if that would make any difference. And the SQL way, with PK constraints and things like triggers, just made matters worse by claiming the database was actually responsible for maintaining data consistency. SQL, with its natural-ish language syntax, was designed so that businessmen could query the database directly about their business, but somehow that became a programming interface, and finally things like ORMs were invented that actually translated code into English so that a query compiler could translate that back into code. Insane!
Mongo, like most NoSQL, forces you to denormalize and do data consistency in your code, moving data logic into solid models that are tested and versioned from day one. That's the way it's supposed to be done; it sorta screams "take control over your data, goddammit". So yes, there's a long way to go with Mongo or any generalist NoSQL database really, but an RDBMS seems like a step back even if your data is purely relational.
> And the SQL-way, with PK constraints and things like triggers, just made matters worse by claiming the database was actually responsible for maintaining data consistency.
I...just...I can't.
Your database should ALWAYS be responsible for maintaining consistency. Your application will likely be dead in a couple years, and if it isn't then you are going to end up having something else interfacing with the database at some point, guaranteed. The only sane place to guarantee consistency is in the database itself, otherwise every time you change constraints you are going to be updating every integration you have and hoping you didn't miss something.
Suppose you're writing a system for vacation rentals. Here are two data consistency rules: 1) only one user may have a given email address 2) a given property can't have two overlapping rentals.
Postgres can enforce both, in a way that's not subject to race conditions. 1) Is a unique constraint and 2) is an exclusion constraint.
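A sketch of both rules in Postgres DDL (table and column names invented):

    CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets = and && live in one GiST exclusion constraint

    CREATE TABLE users (
        id    bigserial PRIMARY KEY,
        email text NOT NULL UNIQUE              -- rule 1: one account per email address
    );

    CREATE TABLE rental (
        id          bigserial PRIMARY KEY,
        property_id bigint NOT NULL,
        period      daterange NOT NULL,
        EXCLUDE USING gist (property_id WITH =, period WITH &&)  -- rule 2: no overlapping rentals per property
    );

Any insert that would create an overlap fails atomically, no matter how many application servers are writing at once.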
As far as I'm aware, the only way for application code to do this would be for it to 1) lock an entire table/collection of data 2) do a query to see what data is there 3) check the new data against the existing 4) write new data 5) unlock the database.
Postgres probably has to do the same thing conceptually, but by using indexes and running the checks in C on the same machine where the data lives, it can do it very quickly.
I've been in the opposite situation and I couldn't disagree more. But I will say this, it's always possible to take an RDBMS model and de-normalize it and use it like a NoSQL database (like reddit does, for example) but it's not possible to go the other way.
I don't mean it's possible to use a different technology; I mean within the same technology (postgres, for example) you can use it both as a normalized relational database and/or as a de-normalized document store.
The article is interesting, but the title is FUD.
Besides, all this is not unexpected:
> How does MongoDB ensure consistency?
> Applications can optionally read from secondary replicas, where data is eventually consistent by default. Reads from secondaries can be useful in scenarios where it is acceptable for data to be slightly out of date, such as some reporting applications.
> The fundamental problem is that MongoDB provides almost no stable semantics to build something deterministic and reliable on top of it.
That said, it is really, really easy to use.