I think the NoSQL trend is something similar to what is going on with programming language trends (in industry, not in academia). Many commonly used languages were statically typed, such as C, C++, and Java. Then, dynamically typed languages became popular because they are easy to use, although they give up a lot of the safety and static analysis capability of the older languages. Let's call that the second phase; it is exemplified by Ruby and Python. Now, newer languages with type inference are coming out that combine the safety of the older languages with the ease of use of the dynamic languages -- Scala, Go, etc. Let's call that the third phase.
In the database world, we are still in the second phase, which is exemplified by databases such as Mongo, which have started from scratch and thrown out many of the good ideas embodied in older databases. I suppose that the new JSON capabilities of Postgres are an attempt to move database design into the third phase, which combines the ease of use of NoSQL with the many superior qualities of traditional relational databases.
There's definitely a lot of throwing the baby out with the bath water. ACID is hard on a distributed system. Let me rephrase that. Performant ACID is hard on a distributed system.
ACID and schemas are not actually coupled. You can have strict schemas without ACID and vice versa.
Relational stores without ACID are very hard to reason about: any associations created along with the primary entity may or may not be there at lookup time. Documents solve this by keeping the associated data embedded in a single record.
I agree with your prediction that we will have document schemas enforced in the DB.
> There's definitely a lot of throwing the baby out with the bath water. ACID is hard on a distributed system. Let me rephrase that. Performant ACID is hard on a distributed system.
> Relational stores without ACID are very hard to reason about. Any associations that are created with the primary entry may or may not be there at lookup. This is solved via documents.
This is only solved via documents for applications which don't need relational consistency or transactions, which is a minority of web applications.
Most applications of MongoDB that I've seen do not fall into that category; further, MongoDB is marketed as a generic solution, not as a very application-specific one.
As soon as you find your application does need transactional consistency between entries, you'll be up the creek making your own paddle, since unlike SQL databases, where ACID is at least supported, MongoDB just gives you:
> $isolated does not work with sharded clusters. [0]
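For reference, $isolated is the most you get, and it isn't a transaction. A rough mongo-shell sketch (the collection and fields are made up):

    // Block other writers from interleaving with this multi-document
    // update. There is no rollback on error, and per the quote above
    // it does not work on sharded clusters.
    db.accounts.update(
      { status: "active", $isolated: 1 },  // query plus the $isolated flag
      { $set: { flagged: true } },
      { multi: true }                      // touch every matching document
    );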
Well, somewhat. The C in ACID (consistency) is impossible without schemas, because they contain the data validation rules. Without schemas you'll have to do all data validation in the application layer. Which is probably what you want, given that you're going schemaless, but it's AID at the database layer, not ACID.
There are also additional considerations that weren't touched on in the article, and they're not functional but architectural.
Sometimes you have payloads that can't fit in a traditional RDBMS, and SQL/relational databases don't cluster/shard well.
NoSQL brings convenient clustering/sharding to your data. Anecdote: an old team chose Cassandra solely for its admin tools (which are really cool), even though we didn't actually need a NoSQL database; I lost that battle, but at least we all learned something.
I don't remember much automated testing being done in those early days of PHP, and I don't see how your theory would explain the current trend back towards typed languages.
Perl was one of the first hugely successful dynamic languages, though I will politely defer to Shell and TCL folks for a rigorous discussion of that claim. (:
Perl has been really serious about testing since early on, and, to my understanding, remains one of the most test-centric major languages out there.
On the other hand, you could use static typing to avoid writing a lot of tests. Maybe it's a bit of a preference thing; I don't mind either approach personally.
I think automated testing is more about integration and regression than code correctness. If you have a web app that connects to a Java backend, for example, static typing won't prevent issues. You need to test the API calls and the UI, and make sure the right requests are made, and the expected outcomes occur. This was done manually by QA people in the past, but it's obviously more scalable to automate it.
The third phase already exists. It's SQL engines, e.g. Hive, Presto, SparkSQL, SAP Vora, etc., on top of NoSQL, and something almost every big company is using today.
SparkSQL in particular can "run all 99 TPC-DS queries, which require many of the SQL:2003 features".
It may or may not be apt, but it just felt like I was dumping JSON objects wherever, with a loosely applied path of db.collection to keep things together.
I wish the OP had also drawn a comparison between Postgres and Cassandra (which he mentions early on).
Based on the schema he writes at the beginning, it looks as though he works mostly with time-series. My understanding is that Cassandra is particularly well-suited for these types of workloads (definitely more so than MongoDB?). If the write throughput is such that a single server can handle it, how would Postgres compare with Cassandra? What are the distinctions in read-latencies? Storage space?
So true. I often see people choose a database without considering the pros and cons; they often consider only what they as developers want to work with or try, and don't consider the nature of the data. Time-series data fits best in Cassandra and some other column-oriented DBs, but definitely not in MongoDB.
Time series data is actually a great fit for MongoDB.
You create a collection for each measurement, JSON document for each day and then preallocate the fields for each hour/minute/second. With batching you can update 100K documents a second pretty easily off one node. For example:
https://blog.serverdensity.com/using-mongodb-as-a-time-serie...
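The pattern looks roughly like this (a sketch using the Node driver; the metric and field names are made up, and the per-second level is omitted for brevity):

    // One document per metric per day, with every hour/minute slot
    // created up front so later updates happen in place.
    function preallocateDay(db, metric, day) {
      const values = {};
      for (let h = 0; h < 24; h++) {
        values[h] = {};
        for (let m = 0; m < 60; m++) values[h][m] = null;  // reserve the slot
      }
      return db.collection(metric).insertOne({ day: day, values: values });
    }

    // Recording a sample is then a single in-place field update:
    // db.collection('cpu_load').updateOne(
    //   { day: '2016-03-01' },
    //   { $set: { 'values.13.37': 0.42 } }  // hour 13, minute 37
    // );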
I read that more as what he normally uses for tests and demos since it can be really easily generated using native SQL functions, and not necessarily his day job. Not that it matters much, I guess, and I agree it'd be neat to see that compared to Cassandra.
That's pretty much it. No time series just yet, just an existing Mongo set that I'm trying to wrap my head around, and that means picking it up ASAP.
I've messed around with Cassandra, but not in much depth. I ran into the way it treats NULL, and that all searchable fields must be indexed, and put it on the backburner for later.
From my experience building and deploying production systems with MongoDB, the primary benefit is not needing a migration or ORM framework.
- Deployments get much simpler when migrations for additive changes are unnecessary. Hundreds of deployments a week!
- JSON queries are easy to pass around and decorate in application code. Hydration from a dictionary is straightforward. Combined, an ORM isn't really necessary.
> primary benefit is not needing a migration or ORM framework
You don't need an ORM, but they are handy. I'm pretty sure it wouldn't be an issue to load rows from an RDBMS, using a language like Python, as a dictionary, and just decorate the snot out of that dictionary to make it look like an object.
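The same idea in Node, since node-pg (mentioned elsewhere in the thread) already hands each row back as a plain object; the table and column names here are made up:

    const { Client } = require('pg');

    async function loadUsers() {
      const client = new Client();  // connection details come from env vars
      await client.connect();
      const { rows } = await client.query('SELECT id, name, email FROM users');
      await client.end();
      // Decorate each plain row object -- no ORM required.
      return rows.map(row => ({
        ...row,
        displayName() { return this.name + ' <' + this.email + '>'; },
      }));
    }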
Anyway, regarding migrations, I sort of agree. Migrations can be terrifying and slow, but at least you're allowed to make certain assumptions about your data afterwards. Assumptions like "this field will always be there".
With MongoDB, and any other NoSQL database I've used, you need to handle your "migrations" in the application code. So now you have a block of code that checks if a given attribute exists in your JSON document.
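Something like this tends to accumulate on every read path (a sketch; the fields and versions are invented):

    // Every reader has to tolerate documents written under older shapes.
    function normalizeUser(doc) {
      // v1 documents had a single `name`; v2 split it into first/last.
      if (doc.name && !doc.firstName) {
        const parts = doc.name.split(' ');
        doc.firstName = parts.shift();
        doc.lastName = parts.join(' ');
      }
      // v3 added `createdAt`; default it for anything older.
      if (!doc.createdAt) doc.createdAt = new Date(0);
      return doc;
    }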
It's not necessarily a problem, but it is something you need to consider. Personally I much prefer RDBMS, because I actually like SQL and the ability to make ad hoc queries easily. If JavaScript is advertised as the query language for a database I'm not terribly interested, unless the pay off is extremely high.
Is NoSQL really all that much different than relational? At the end of the day, isn't a collection pretty much like a table? And don't you have to tie two collections together with a common ID (i.e., a join)?
NoSQL is less rigid but your code pretty much has to be as rigid, right?
This is often what happens when Mongo is used, except now the task of joining and aggregating is pushed up to the application, instead of getting it for [mostly] free from the DB itself.
In the process of pushing it up to the app, you lose ACID guarantees. Without ACID guarantees, your ability to view/update a snapshot in time of all your data becomes a very difficult problem to solve. As such, it's hard to say the two methodologies are equivalent.
When you start putting relational data into a non-relational DB, you are going to have a bad time.
In addition to losing ACID guarantees, you lose single-query joins. They're performed in memory, sequentially, with all the incurred costs of a waterfall series of requests and whatever network communication is involved.
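Concretely, the application-level "join" ends up looking something like this (a sketch with made-up collection and field names):

    // Two round trips instead of one query, and nothing guarantees the
    // data didn't change between them.
    async function ordersWithCustomers(db) {
      const orders = await db.collection('orders')
        .find({ status: 'open' }).toArray();
      const ids = orders.map(o => o.customerId);
      const customers = await db.collection('customers')
        .find({ _id: { $in: ids } }).toArray();
      // Stitch the two result sets together in memory.
      const byId = new Map(customers.map(c => [String(c._id), c]));
      return orders.map(o => ({ ...o, customer: byId.get(String(o.customerId)) }));
    }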
The entire premise of NoSQL is based on two (perceived) shortcomings of SQL databases: its rigid structure limits flexibility and its transactional safety limits speed and scale. So it really depends on what you mean. No, NoSQL isn't that different from regular SQL in that it still provides an object-based data storage and retrieval facility. Yes, NoSQL is different from regular SQL because it does away with the most fundamental properties of SQL.
> isn't a collection pretty much like a table
No, it isn't. A table (or relation) has predefined columns, and each row adheres to the data type and constraints of each column. If you were to model a NoSQL "collection" as a table, you'd end up with a sparse matrix where, in the worst case, the number of columns approaches number_of_rows * number_of_distinct_attributes of the original collection.
> you have to tie two collections together with a common ID
Nothing in SQL requires you to join collections based on a common ID (though it is the most useful). But still, an SQL join has a predefined result structure regardless of the actual table contents or join condition. There is no predicting the data structure of a NoSQL join.
My experience says that the issue is not differences, but bad architectural decisions. Most applications out there are glorified CRUDs. They are perfectly serviced by SQL databases. NoSQL fits a smaller set of systems whose scope is mostly passing data around. PostgreSQL now supports JSON as a datatype and allows querying it. Works beautifully. No need for MongoDb at all.
My understanding, which is admittedly minimal, is that MongoDB provides some conveniences around partitioning and distributing your dataset that Postgres does not without additional work. Simply supporting JSON accommodates MongoDB's approach to schemas (not having them; er, pushing constraints to application code), but doesn't address the distribution concern.
How well does MongoDB support this need, and how hard is it to achieve equivalent results with Postgres? What are the drawbacks?
Side note: you can have constraints, `check`s, indexes and triggers on `json` and `jsonb` columns in Postgres. In fact, you can have everything other data types have. So "no schema" or "moving schema to app level" doesn't have to be the case, when using PG.
Exactly. With Postgres, you can add CHECK constraints to enforce specific keys, with any level of nesting desired. So an INSERT could reject a document for having missing fields, fields of improper type, etc. It all depends on how strict you want to be.
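A sketch of what that looks like through node-pg (the table and key names are invented):

    const { Client } = require('pg');

    async function createEventsTable() {
      const client = new Client();
      await client.connect();
      await client.query(`
        CREATE TABLE events (
          id  serial PRIMARY KEY,
          doc jsonb NOT NULL,
          -- the document must contain a user_id key...
          CONSTRAINT has_user_id CHECK (doc ? 'user_id'),
          -- ...and a created_at that casts cleanly to a timestamp
          CONSTRAINT valid_ts CHECK ((doc->>'created_at')::timestamptz IS NOT NULL)
        )`);
      // An INSERT with a missing or malformed field is now rejected
      // by the database, not by application code.
      await client.end();
    }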
Mongo's partitioning/distribution/replication does not work as well as it should. In particular, my team has seen several serious bugs pertaining to replication in release versions that were so bad, they should have blocked the release.
I will argue _against_ MongoDB all day, but there is most certainly a time and place for it. If you have a real firehose of data, don't care too much about accuracy, and really need good speed... MongoDB is not an awful choice.
How well does it stand up against other "NoSQL" solutions that compete in the same space? My impression has been that it tends to compete poorly with a number of other solutions on data integrity, and my intuition is that its appeal is correlated with the proliferation of JavaScript and the convenience of interoperating with it.
From my experience, the problem is not how does any NoSQL solution stand up vs. any other NoSQL solution. The question to ask is "What caveats affect me the least?" and then go with that NoSQL solution. In the case of MongoDB, the biggest issue (at least in the past) is that the caveats simply aren't known well, or at all!
Compare this to something like Elasticsearch, which is well-known (and well-documented) to have data loss in certain situations.
Is a relational database really all that much different from a file system? At the end of the day isn't a directory pretty much like a table? And you tie data together with their common id, or file name?
I think there is a difference, especially in practice. Relations are a first class citizen in relational databases. Integrity is enforced. Most types of operations can be bundled up in transactions.
That abstraction differs from non relational data stores and you have to design your application to fit it.
Well, the predecessor to DB2, IMS, is basically a database in a file system. Sure, IMS is hierarchical and not relational, but it is a full-on database with fields and indexes. In fact, Mongo (and other NoSQL solutions) seem to have a lot in common with IMS, including the lack of a proper schema and referential integrity.
Well, there are a few other massive differences: an RDBMS basically has its own version control built in (which is what allows ACID), and by containing the logic for joins, it can plan them based on statistics (which can't be done in an upper layer).
But you can still insert inconsistent data into the database with the CLI, or otherwise. And `default` values defined inside Mongoose are no good when filtering or searching documents.
There are plenty more problems with emulating a schema in app code (Mongoose or anything else), so it's not that trivial.
I'm building a Node.js app where I was thinking of using MongoDB but I'm reconsidering it after reading the comments here.
The database's only job is to keep persistence between server restarts. On server start I fetch a collection from the database and populate a variable with it, which I use for any direct access.
When a user updates something it will update the variable as well as updating the collection in the database.
Basically any database should do for this use case; a simple key-value store is probably best. I'd aim for keeping the technological complexity as low as possible.
What is the simplest tool to do the job?
Or maybe better yet: use something that's already (or will be) in your technology stack, to avoid having one more thing in it.
For me it's definitely MongoDB but I'm afraid of data loss etc. My example was simplified. In reality I want to update a sub-array within the collection, like this: collection['some_identifier']['items'][3] = 123. Overwriting the whole 'some_identifier' document seems wasteful. With MongoDB it's easy though to update nested sub-data.
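Something like the following, mirroring my pseudocode above (a rough sketch with the Node driver):

    db.collection('some_collection').updateOne(
      { _id: 'some_identifier' },
      { $set: { 'items.3': 123 } }  // only the element at index 3 is rewritten
    );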
Poor wording on my part. My point isn't about performance, it's about the API. Other than selecting documents from a single collection (.find), Mongo's queries are shockingly weird and inconsistent to me.
Compare the aggregation framework to what SQL has, for example:
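Roughly, the same grouped sum in both (a sketch; the collection, table, and field names are made up):

    // SQL: one short, familiar clause:
    //   SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id;
    //
    // The MongoDB aggregation pipeline for the same question:
    db.collection('orders').aggregate([
      { $group: { _id: '$customerId', total: { $sum: '$total' } } }
    ]).toArray();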
If you ever want to scale your app to more than one process (CPU core), or more than one server, you'll have a problem...
Using Redis could be a solution (if Redis can accommodate your read pattern), or you'll need a pub/sub system (like that of Redis) to propagate DB changes from one process to the others...
It can also read from and write to Redis using Foreign Data Wrappers. Pretty neat, if you want your DB to be the single place you store data, but need something faster for reads or caching.
If you want to do updates you want a traditional transactional database. If you really need massive scalability (which seems unlikely) you should figure out a way to store your data in a log-structured/append-only way, and then use Cassandra or Riak.
Thanks! In my real application I want to be able to update a sub-array of the document, like this: collection['some_identifier']['items'][3] = 123. Just setting the value for the key (e.g. overwriting the whole document) seems wasteful since I only want to update a portion of it. With MongoDB it's easy to set sub-data so I guess I'm looking for something where it's as easy but without the downsides.
We've been running node+postgres in production for a couple of years and it's been great so far. If you have advanced questions, #postgresql on freenode is always there for help, and the node-pg libraries are very mature.
It would probably work but in my real application it's a bit more complicated since I only want to update a sub-array of the document, like this: collection['some_identifier']['items'][3] = 123.
I guess using a plain file would mean I have to overwrite everything on each write, which seems wasteful.
Sure, it depends on your use case though. For infrequent updates I don't think it's wasteful to just overwrite the file, it seems less wasteful to me than running a whole database. But it all depends on your setup :)
Also, I think updating a sub-array triggers a rewrite of the whole document in MongoDB. In any case I would avoid MongoDB if you can; it seems Redis might be a better fit for you.
I note the smiley, but could you actually expand on that comment for someone who is only really familiar with relational databases?
I've recently become interested in using PouchDB (a CouchDB implementation) for syncing offline content between a user's browser and the server, where the content is based on a set of fixed documents. I believe that MongoDB can provide similar replication tricks. Once on the server, I would expect to extract the data from the documents into a relational warehouse database (probably PostgreSQL, as it's one of the databases I know best).
A document store _seems_ like the right tool for the job, but could you expand on why something like Mongo or Couch is a bad idea for situations where you're just storing/syncing documents?
Couch is fine, other NoSQL is fine. It's just Mongo I don't like:
- weak, if any, guarantees on data consistency and safety;
- poor API. If you need anything more complicated than `findOne`, there are at least 3 unrelated, incompatible syntaxes with major differences (see the sketch after this list);
- performance is not what the marketing team says it is. Even on Mongo's home JSON turf, Postgres is pretty comparable.
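The three syntaxes, asking roughly the same question of a made-up collection (mongo-shell sketch):

    // 1. The query language proper:
    db.orders.find({ status: 'open' });

    // 2. The aggregation pipeline:
    db.orders.aggregate([
      { $match: { status: 'open' } },
      { $group: { _id: null, n: { $sum: 1 } } }
    ]);

    // 3. Map/reduce, with inline JavaScript:
    db.orders.mapReduce(
      function () { emit(this.status, 1); },
      function (key, values) { return Array.sum(values); },
      { out: { inline: 1 } }
    );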
If you have data and queries that will benefit from document store, go for it. But don't choose only based on marketing.
There are many more rants on the web; I like this one:
Look up CouchDB's problems; several of its customers have given up on it (more than MongoDB's, it seems).
It's not a bad idea; MongoDB gets a lot of undeserved hate (some of it is deserved, but that's the minority). However, I'm not sure it would do syncing as you want automatically (CouchDB is probably closer out of the box to what you want).
I started with Mongo and used it for a little over 2 years. Then a new project used Postgres (2 more years). After that I got a chance to work with Couch.
Whenever I encounter Mongo nowadays, it's either simulating multi table joins on app level, or inventing locking with Redis.
Part of the issue is people storing relations in NoSQL just because, but I can't think of any use cases where Mongo wins over other NoSQL solutions. Perhaps you have some?
Image storage.* When working on any type of image-based data science, it's incredibly easy to manage, it's fast, and the metadata handling is incredibly convenient. The drivers are pretty consistent and readable across languages. Assuming that you know JavaScript, it's a pretty quick and easy setup to add a lightweight Node server with an image cropper on the front end, then set up a Python app to do your data processing once you're done munging. I set that up any time there is an image-related competition on Kaggle, or when setting up HAL-like robots on my Raspberry Pi to mess with my roommates: "I'm sorry Dave, I can't let you cook that".
* And all you "OMGZORS USE A FILE SYSTEM!" folks can keep your mouths shut. This is drastically faster and much more manageable for the use case.
Very few organizations in the world truly need to scale out their database. Their problem is usually one of bad design. For those that do need to scale out, NoSQL is not a magic solution either. The scaling-out solution for SQL databases these days is pretty much replication, sharding, and partitioning tables. Postgres does these very well.
> The scaling-out solution for SQL databases these days is pretty much replication, sharding, and partitioning tables. Postgres does these very well.
I tend to disagree that Postgres does partitioning well; partition table resolution is rather dumb, there is no hinting (e.g. there is no way to tell Postgres that each query maps to a unique table), and no caching of CHECK results.
> Very few organizations in the world truly need to scale out their database.
Every organization that has a database wants to scale out to at least a 2-node cluster, just to avoid having a single point of failure. Scaling is not always about performance; at first, it's more about availability.
Almost every major organisation will have a big data analytics program doing feature extraction, modelling, machine learning etc. Usually this starts with Hadoop/Spark with data in HDFS but then a point comes when you want this in a database.
PostgreSQL is awful at this, which is why almost no one uses it in this space. Apart from the lack of native drivers, it is simply too difficult to do basic clustering. And the fact that it isn't built into the product doesn't give you a lot of confidence that it is (a) going to work and (b) going to be supported.
And yes, for those that scale out, NoSQL is a magic solution. That's why they are so popular. I can deploy a 40-node Cassandra cluster in 30 mins and guarantee it works. Likewise, MongoDB replica sets are ridiculously easy.
Well, we ran a billion queries per day on a 1.4TB system I was in charge of a couple years ago. Now we're building a 500+TB reporting system and probably putting it on Postgres-XL. Seems to scale fine to me. ;)
Only loosely, insofar as bad performance can make scaling more important; however, scaling is not about performance. A super-fast DB could scale poorly, while one that scales well could perform worse than the fast one but scale better. Scaling is about growing across machine boundaries, so that adding more machines increases the ability to handle more load. That doesn't require good performance; it requires good architecture.
Agreed. Scaling out databases is never something you want to do, particularly for Postgres.
But it's something you're less likely to need if you structure your data in an RDBMS. If you know your SQL, then optimizations should be preferred before scaling out. Not just because you need to, but because they can improve user experience (server response latency).
1. In many cases, there is a ~30ms floor on operations if a server is non-local. That's true of virtually all services on AWS -- whether SQS, SNS, SES, RDS (MySQL/PostgreSQL), etc.
2. In many cases, rows are missing. A table has a million rows. Ten are deleted. It's not compacted. So it's not a simple size lookup. A lot of this depends on how the data is stored/indexed. That's something which is naively pretty slow, but easily optimized.
In Postgres, because it uses MVCC, every count requires verifying each table row in the data segment. This means a sequential scan over the contents of the table, to check that the rows exist and are visible to the current session.
Few operations have this particular limitation, and it's been a sticking point in the Postgres world for years. Unfortunately there's no easy way around it.
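If an estimate is acceptable, the common compromise is to read the planner's statistics instead of counting (a node-pg sketch; the table name is a placeholder):

    const { Client } = require('pg');

    async function approxCount(table) {
      const client = new Client();
      await client.connect();
      // reltuples is maintained by VACUUM/ANALYZE: an estimate, but
      // reading it is O(1) instead of a full sequential scan.
      const { rows } = await client.query(
        'SELECT reltuples::bigint AS approx FROM pg_class WHERE relname = $1',
        [table]
      );
      await client.end();
      return rows[0].approx;
    }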
> I tried to scale Postgres a few months ago but I found that it is quite tedious.
If your application is big enough that you really need to scale postgres, I assure you, you will have a whole lot of tedious problems - not just sharding.