
NoSQL Databases: A Survey and Decision Guidance - DivineTraube
https://medium.com/baqend-blog/nosql-databases-a-survey-and-decision-guidance-ea7823a822d#.z0xgmd1a4
======
jcadam
On my personal projects, my default database choice is Postgres unless I have
a glaring, obvious reason to choose something NoSQL (in which case I like
CouchDB if the situation calls for a document store, don't know much about the
others).

At work, it doesn't matter if we have a perfect use case for a particular
NoSQL solution (I'm working on something now that I think would be a great fit
for a graph database like Neo4j). We use Oracle, because it's expensive, we're
already paying for it, so damn it, we're going to use it.

~~~
Mister_Snuggles
I do the same.

If I'm doing something where I'm storing JSON (e.g., from an external
service), but only care about certain bits of it, I'll store the bits I care
about in columns, then put the whole JSON object into a jsonb column.

The key advantage that I see is that when I decide that a new piece of the
JSON needs to be cared about, I can do something like this:

    
    
        -- modify data types as required
        ALTER TABLE blah ADD COLUMN newthing text;
        CREATE INDEX ON blah (newthing); -- if needed
        UPDATE blah SET newthing = jsoncolumn->>'new_thing';

~~~
takeda
Just a nitpick, but I believe it would be more efficient to create the index
as the last step here.

~~~
brightball
Not necessarily.

If the table is already fairly large you'd have to use Postgres CREATE INDEX
CONCURRENTLY to avoid locking the table while it sorted out the index. The
only downside is that you have very little control of the process.

If you already have the data available in a JSONB column you can create the
column, create the index and then set your app code to look to the JSON data
if the column is null during transition while a background job goes through
and populates rows either individually or in smaller batches to avoid
disrupting user experience.
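
A minimal sketch of that transition, under stated assumptions: the table and column names (blah, jsoncolumn, newthing) are borrowed from the SQL earlier in the thread, the helper names are made up for illustration, and SQLite stands in for Postgres only so the example runs without a server. The read path falls back to the JSON while the column is still NULL, and the backfill walks the table in small id-ordered batches.

```python
import json
import sqlite3


def read_newthing(conn, row_id):
    """Read path during the transition: prefer the column, else the JSON."""
    row = conn.execute(
        "SELECT newthing, jsoncolumn FROM blah WHERE id = ?", (row_id,)
    ).fetchone()
    if row is None:
        return None
    newthing, raw = row
    if newthing is not None:
        return newthing
    return json.loads(raw).get("new_thing")


def backfill_batch(conn, after_id=0, batch_size=100):
    """Populate one batch of rows with id > after_id whose column is NULL.

    Returns the last id handled, or None when nothing is left to do.
    Paging by id guarantees progress even if some JSON lacks the key.
    """
    rows = conn.execute(
        "SELECT id, jsoncolumn FROM blah "
        "WHERE id > ? AND newthing IS NULL ORDER BY id LIMIT ?",
        (after_id, batch_size),
    ).fetchall()
    if not rows:
        return None
    updates = [(json.loads(raw).get("new_thing"), rid) for rid, raw in rows]
    with conn:  # one short transaction per batch
        conn.executemany("UPDATE blah SET newthing = ? WHERE id = ?", updates)
    return rows[-1][0]


def demo():
    """Build a 250-row table, then backfill it in batches of 100."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blah (id INTEGER PRIMARY KEY, jsoncolumn TEXT, newthing TEXT)"
    )
    conn.executemany(
        "INSERT INTO blah (id, jsoncolumn) VALUES (?, ?)",
        [(i, json.dumps({"new_thing": f"value-{i}"})) for i in range(1, 251)],
    )
    conn.commit()
    before = read_newthing(conn, 7)  # column still NULL, served from JSON
    last_id, batches = 0, 0
    while True:
        last_id = backfill_batch(conn, last_id, 100)
        if last_id is None:
            break
        batches += 1
    remaining = conn.execute(
        "SELECT COUNT(*) FROM blah WHERE newthing IS NULL"
    ).fetchone()[0]
    after = read_newthing(conn, 7)  # now served from the column
    return before, after, batches, remaining
```

In a real deployment the loop would presumably sleep between batches and run as a background job; the batch size is the knob that trades backfill speed against lock time per transaction.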

------
lostcolony
One slightly concerning thing about this is how the table at the end, which
says what popular NoSQL DBs do, was populated. And here I quote: "The methodology
used to identify the specific system properties consists of an in-depth
analysis of publicly available documentation and literature on the systems.
Furthermore, some properties had to be evaluated by researching the open-
source code bases, personal communication with the developers as well as a
meta-analysis of reports and benchmarks by practitioners."

Which, looking at the table, doesn't seem to include Aphyr's work (Jepsen), so
much of what is being said may in fact be wrong. I'd prefer a chart populated
from that (though admittedly such a chart would largely be "undefined",
"inconsistent", etc).

~~~
DivineTraube
We tried our best to incorporate these results in the cases where they reflect
fundamental shortcomings, not bugs (of which Aphyr found quite a few). But you
are right, we could include a list of popular examples where description and
experimental findings diverge. Which cases did you have in mind? MongoDB being
CP?

~~~
lostcolony
And Redis, yeah. In general, seeing his work has made me very cagey about
trusting any DB's marketing claims; I'd love to see citations as to what
resources went into each claim, as that would tell me "this has been tested
by a third party" vs "this has been claimed in marketing docs and YMMV; you
should probably test it yourself".

------
antonios
This article will soon become outdated: CouchDB 2.0 is in release candidate
stage, and will soon be released. It features built-in and automatic sharding
and clustering and a new, MongoDB-inspired document query language...and lots
of small improvements as well.

For more information, read the excellent blog posts from their official blog:
[https://blog.couchdb.org/](https://blog.couchdb.org/)

~~~
DivineTraube
I'm also very curious about CouchDB 2.0 and whether CouchDB will be able to
make a comeback. When I talked to Adam Kocoloski about a year ago, I got the
impression that it takes a very good technical approach (consistent hashing,
causal hash histories, MongoDB queries, etc.). However, the implementation is
now mainly driven by Cloudant, and CouchDB is kept alive by them. I hope they
manage to rekindle the old fascination it once had.

In any case I will be happy to update the article and include it.

------
SmellTheGlove
This is pretty useful for someone like me who's been in large companies using
Teradata forever, just to understand the NoSQL products out there and
potential use cases. On any given day, I feel like I've fallen way behind on
database technology simply because we're fairly locked in, admittedly very
happy with TD. It's helpful to keep up here though, because we do have some
projects we'd like to do that would probably work better outside of an RDBMS.

I'd love to see something similar on the various ETL/data pipeline tools. In
that regard, we still write a ton of SAS code because we have a variety of
sources, which SAS handles well; it's pretty solid if you need to do some data
cleansing on the way into the target, and the code itself is very
batch-friendly and maintainable. It's been a while since I've surveyed
alternatives.

------
sandstrom
Great to see RethinkDB covered (although it would have been better if it was
included in the primary comparison tables too).

It feels like a second-generation NoSQL database where most of the drawbacks
of earlier NoSQL databases have been taken care of!

------
virmundi
I enjoyed this topic while researching New Data: a Field Guide [1]. In my
research I've come to really enjoy ArangoDB. It has good support for
documents, performance, relational data queries and basic graphs. Without
doing the research I don't think I would have ever dug into the new tools. If
you ever get a chance to do a small prototype, look into using these tools.

[1]
[https://leanpub.com/NewDataAFieldGuide](https://leanpub.com/NewDataAFieldGuide)

------
cwmma
I'd point out that there are arguments[1] that 'CA' is not a real choice you
can make when it comes to the CAP theorem.

1: [https://codahale.com/you-cant-sacrifice-partition-
tolerance/](https://codahale.com/you-cant-sacrifice-partition-tolerance/)

------
Feneric
Not even a mention of ZODB. While it's true it's Python-only, it's certainly
one of the most mature NoSQL databases around.

~~~
lgas
The context the author is describing is clustered databases. Does ZODB offer a
clustering option that is equally mature?

~~~
Feneric
You mean like ZEO? Yes, it's probably been around for a couple of decades now.

~~~
lgas
I don't think so. I am referring to something that allows you to scale your
database cluster horizontally by adding more servers. ZEO appears to only
support a single server.

------
abannin
This is a fantastic overview of the issues surrounding DB's. Would love to see
some deeper discussions on graph databases in the next version!

------
DivineTraube
Felix, author of the article and founder of Baqend [1] here. I'm happy to
answer any questions. If you have suggestions for the next version of the
article, we're eager to hear them.

[1] [http://www.baqend.com](http://www.baqend.com)

------
novaleaf
boo. doesn't include a comparison vs Google's NoSQL SaaS offerings (Cloud
Datastore and Bigtable).

i personally use Cloud Datastore and find it a great fit, wish there were some
comparisons against it.

~~~
CydeWeys
Just wait for Cloud Spanner! [http://siliconangle.com/blog/2016/06/27/google-
tools-up-with...](http://siliconangle.com/blog/2016/06/27/google-tools-up-
with-its-spanner-database-looks-for-a-fight-with-aws/)

------
asdr
If you wish an introduction to NoSQL there is a good webinar at
[http://www.prohuddle.com/webinars/nosql/NoSQL_Distilled.php](http://www.prohuddle.com/webinars/nosql/NoSQL_Distilled.php),
which includes discussions, examples and comparisons.

------
dvirsky
> Redis, as the only non-partitioned system in this comparison ...

This is a bit misguided; Redis Cluster has been GA for over a year.

~~~
DivineTraube
I agree, Redis Cluster has been available for a while. From a practical
perspective we did not deem it production-ready yet, due to various
shortcomings, e.g.:

- Redis Cluster being neither AP nor CP, with rather unsatisfying
  justifications [1]
- Strong scalability issues, for instance with PubSub [2]

But you are right, I think Redis Cluster could be added as a separate system.

[1] [https://aphyr.com/posts/283-jepsen-redis](https://aphyr.com/posts/283-jepsen-redis)
[2] [https://github.com/antirez/redis/issues/2672](https://github.com/antirez/redis/issues/2672)

~~~
antirez
A couple of points, just for clarity:

1) It is true that Redis is not AP or CP, but this is not covered by [1], which
covers Redis Sentinel (a previous version with different behavior compared to
Sentinel v2, btw). So, while Redis Cluster is just an eventually consistent
system that guarantees neither strong consistency nor availability (so it is
neither AP nor CP), the linked post is not related.

2) Redis Cluster is actually very scalable, since it is a flatly partitioned
system with no proxy. Pub/Sub is not very scalable in Redis Cluster, but
Pub/Sub is a minor sub-system of Redis; most users look at the ability to scale
the key-value store, which is... 90% of what Redis is, so I think your claim is
not justified.

3) Not being AP or CP does not mean that a system is not useful. It depends on
the business requirements. In fact, most SQL database + failover setups, which
are what keep _a sizable_ portion of the big services of the internet up, are
not AP or CP either.

Database system features are related to use cases. I believe that the Redis
Cluster properties cover a huge set of real-world use cases; in fact it is the
most actively mentioned/requested feature right now, even though we document in
very clear terms the tradeoffs and the actual behavior of the system. So people
know what they want. IMHO excluding systems based on _what you think_ are
acceptable tradeoffs is not a good idea.

Otherwise you should also mention that SQL+
Main_widely_used_failover_solutions systems have the same unacceptable
shortcomings, for instance.

Actually, if the Redis Cluster implementation is made the center of future
Redis development, it could easily become the most used sub-system of
Redis, like Sentinel is already becoming. There are many devs who know what
they are doing and know how to apply things depending on the requirements, even
if those things are not CP or AP.

~~~
DivineTraube
I was oversimplifying of course, Redis Cluster definitely deserves a place in
the comparison. We are actually very satisfied with using Redis in production
but had to work around the PubSub limitations using keyspace notifications,
which is why I'm biased regarding that point.

In any case, it's difficult in terms of presentation, since a "normal"
master-slave-replicated Redis is a totally different system from Redis Cluster,
as the distribution model of Redis Cluster also affects the functional
properties to a large extent, e.g. regarding atomic Multi/Lua blocks and all
types of multi-key operations.
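
To make the multi-key point concrete: per the Redis Cluster specification, each key maps to one of 16384 hash slots via CRC16 (the XMODEM variant), and a multi-key command is only accepted when all of its keys fall into the same slot. Hash tags (a non-empty `{...}` section in the key) let you force that. A rough Python sketch of the mapping; the function names are made up for illustration and are not part of any Redis client:

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16, polynomial 0x1021, initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Map a key to one of 16384 slots, honoring hash tags."""
    k = key.encode()
    start = k.find(b"{")
    if start != -1:
        end = k.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            k = k[start + 1:end]
    return crc16_xmodem(k) % 16384


def same_slot(*keys: str) -> bool:
    """True if a multi-key operation over these keys could be served."""
    return len({hash_slot(k) for k in keys}) == 1
```

For example, `same_slot("{user1000}.following", "{user1000}.followers")` holds because only `user1000` is hashed for both keys, whereas two unrelated keys will usually land in different slots and cannot be used together in one Multi/Lua block.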

The failover solutions for relational database systems are a good point, they
too should be included.

Regarding use cases, I totally agree: every system should be discussed in the
light of the use cases it tries to solve. I personally think that Redis
Cluster with a choice for trading consistency against latency would open up a
whole new range of use cases that are currently not a good fit for Redis
Cluster. For example, we have a Redis-Coordinator project (not open source,
yet) which behaves similarly to a scalable ZooKeeper (BTW also neither AP nor CP
[1]). However, it has weaker guarantees, due to the lack of tuning knobs in
Redis and Redis Cluster regarding consistency.

[1] [https://martin.kleppmann.com/2015/05/11/please-stop-
calling-...](https://martin.kleppmann.com/2015/05/11/please-stop-calling-
databases-cp-or-ap.html)

------
morrbo
serious question, what is to stop everyone (where a K/V store is not enough)
from just chucking some form of SQL server in a VM in a ramdisk and using
that? I've always wondered about that idea

~~~
virmundi
Lost efficiency. If you want a semi-persistent relational DB, you can use
Berkeley DB. There are also embedded DBs for Java, and probably other languages.
If you do this, you're not wasting RAM for the VM and the guest OS. The DB
gets its memory from the large application install.

edited for typo.

------
fizixer
Slightly off topic but: As someone out of touch on the topic of DB utilizing
apps for a while, what's the update on the ORM impedance mismatch? (I mean the
guidelines or best practices).

------
samwyse
The TL;DR was TL;DR. No, seriously. About halfway through, the buzzwords and
jargon got to be too much for me. Let me know if there's anything interesting
Wining.

------
mohanmca
Great work!

