

Why Replication is Awesome - d4vlx
https://cloudant.com/blog/why-replication-is-awesome/

======
rdtsc
CouchDB and by extension Cloudant are really cool.

Change feeds and replication are first class citizens if you wish, not some
side-feature.

This means 3 things for me:

* Can build your own cluster topology. Replicate in a star pattern, in a circle with overlapped neighbors, in a hierarchy. You decide. Replication is explicit and configurable.

* Ideal for real-time applications. Having clients poll for changes can grind things down. CouchDB supports continuous updates through its changes feed. It can go via long polling or EventSource, or you can still do plain GETs with a 'since' parameter. Up to the user.

* Having to handle merge conflicts explicitly. MVCC semantics are not hidden or hand-waved away with a timestamp-last-writer-wins-unless-your-ntp-is-messed-up. Conflicts are first class citizens as well and one needs to know how to handle them. That is good for some cases but bad in others. I like it. For example one can attach an application specific conflict resolver that knows how to solve conflicts for that particular database in an application specific manner. Riak is another database that handles explicit conflicts well.
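
To make the conflict-resolver idea concrete, here is a minimal sketch in Python. It assumes the conflicting revisions of one document have already been fetched, with the current winner first. The merge policy (union the `items` lists, keep the winner's other fields) and the name `resolve_conflict` are made up for illustration; a real resolver would encode whatever rule fits the application. The output is a list of documents in the shape that `_bulk_docs` accepts.

```python
from typing import Any, Dict, List


def resolve_conflict(revisions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge conflicting revisions of one document.

    `revisions` holds the current winning revision first, then the losers.
    Hypothetical policy: union every revision's 'items' list and keep the
    winner's remaining fields.
    """
    if not revisions:
        raise ValueError("need at least one revision")
    winner, losers = revisions[0], revisions[1:]

    merged = dict(winner)  # keeps the winner's _id and _rev
    merged["items"] = sorted(
        {item for doc in revisions for item in doc.get("items", [])}
    )

    # Deleting the losing revisions is what makes the conflict go away.
    tombstones = [
        {"_id": doc["_id"], "_rev": doc["_rev"], "_deleted": True}
        for doc in losers
    ]
    return [merged] + tombstones
```

Posting the returned list as `{"docs": [...]}` to the database's `_bulk_docs` endpoint would write the merged version and remove the losers in one request.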

BTW: Just noticed CouchDB 1.5 was just released.

[http://www.apache.org/dist/couchdb/notes/1.5.0/apache-couchd...](http://www.apache.org/dist/couchdb/notes/1.5.0/apache-couchdb-1.5.0.html)

Will have to play with the new Admin UI and Node.JS query server.

~~~
skrebbel
With regard to your 3rd point, conflicts that happen during replication aren't
actually first class citizens, if memory serves me well.

If you change the same field on two databases at roughly the same time, some
heuristic determines who wins.
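
As far as I understand the documented behaviour, the heuristic is deterministic: a live revision beats a deleted one, then the longer revision history wins, then the revision ids are compared as strings. A rough Python sketch of that rule (the function name and the `(rev, deleted)` tuple shape are mine, not CouchDB's):

```python
def pick_winner(revs):
    """Pick the winning revision among conflicting leaf revisions.

    `revs` is a list of (rev_string, deleted) tuples, where rev_string
    looks like "3-abc123" (position, dash, hash).
    """
    def sort_key(entry):
        rev, deleted = entry
        position, _, rev_hash = rev.partition("-")
        # Live beats deleted, then higher position, then higher hash.
        return (not deleted, int(position), rev_hash)

    return max(revs, key=sort_key)[0]
```

Because the rule only looks at the revisions themselves, every replica picks the same winner without talking to the others.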

~~~
rdtsc
That is true but the conflict is preserved in the database. So a change feed
set up to monitor conflicts will notice it immediately. You do have to set up
the monitor and conflict resolver.
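
A minimal sketch of what such a monitor could look like, in Python. It only builds the `_changes` URL and picks conflicted documents out of a decoded response; the HTTP call itself is left out. The helper names and the localhost URL are placeholders. The query parameters (`feed`, `since`, `include_docs`, `conflicts`) are the ones the changes feed accepts.

```python
from urllib.parse import urlencode


def changes_url(db_url, since="0", feed="longpoll"):
    """Build a _changes URL that includes documents and their conflicts."""
    query = urlencode({
        "feed": feed,            # "normal", "longpoll", "continuous", "eventsource"
        "since": since,          # resume after this sequence
        "include_docs": "true",  # needed for conflicts=true to have any effect
        "conflicts": "true",
    })
    return f"{db_url.rstrip('/')}/_changes?{query}"


def conflicted_ids(changes_response):
    """Return ids of changed documents that carry a _conflicts list."""
    return [
        row["id"]
        for row in changes_response.get("results", [])
        if (row.get("doc") or {}).get("_conflicts")
    ]
```

The monitor would loop: request `changes_url(db, since=last_seq)`, hand every id from `conflicted_ids` to the resolver, then continue from the `last_seq` in the response.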

From the CAP standpoint your system is available but not consistent for a
short period of time: a GET on a conflicted document might initially return
one version, but after the custom merge resolver runs, that version could end
up replaced by another.

That is the price to pay for availability, and it works best if applications
on top are designed to work with it. It is not always possible to do this, so it
is a tool that is good in some cases but not in all cases.

------
tony_landis
I think it was Rich Hickey that said the document database is the worst of
them all, because you are now married to that structure.

Having used Couchdb in production for two years, I have to agree with his
analysis, and offer my own opinion that Couchdb is highly overrated. Not
because it is not a good implementation of a document style database, but
because the document store itself is not a good match for most use cases.

If the only requirement is a replicated JSON document store, it may work OK
for you. But so would Riak, Postgres and some others.

If you need to update the data in those documents or ever need to query the
data in ways you did not initially envision, you will quickly find yourself
missing features which even traditional SQL databases are very good at.
Development is slower.

Writing map/reduce for queries seems particularly cumbersome, particularly if
you prefer not to use Javascript. And you have to plug them into a textarea in
a webpage interface, or manually put them into Couchdb over http using curl or
some library that abstracts this away. Either way it is a degree of separation
that makes the data feel more out of reach than through a console interface
like psql or mysql.

Consider the scenario where you want to update the value in an attribute on
several thousand, or even just several documents that match some criteria. In
SQL, you would simply jump in the console and in a few seconds or minutes
complete that as a transaction with something like:

> update table set col=val where criteria.

There is no such feature in Couchdb. You will need to write code to filter and
fetch each matching document, manipulate it as needed, then write the entire
thing back. All to update a few bits that hopefully were not nested too deep
as that really increases the complexity of the code you will need to write.
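
For comparison, the core of that code might look like the Python sketch below. It assumes the documents were already fetched (for example from a view or `_all_docs` with `include_docs=true`) and only builds the body for a `_bulk_docs` request; `bulk_update_payload`, `matches` and `change` are illustrative names.

```python
def bulk_update_payload(docs, matches, change):
    """Build a _bulk_docs body that updates every matching document.

    `matches(doc)` returns True for documents to update.
    `change(doc)` returns a new dict with the modification applied; it must
    keep _id and _rev so CouchDB treats the write as an update.
    """
    return {"docs": [change(doc) for doc in docs if matches(doc)]}


# Rough equivalent of: update orders set status='closed' where status='open'
orders = [
    {"_id": "o1", "_rev": "1-a", "type": "order", "status": "open"},
    {"_id": "o2", "_rev": "3-b", "type": "order", "status": "shipped"},
]
payload = bulk_update_payload(
    orders,
    matches=lambda d: d.get("status") == "open",
    change=lambda d: {**d, "status": "closed"},
)
```

Unlike the SQL statement, this is not one transaction: by default each document in the batch is accepted or rejected on its own, so a document that changed in the meantime comes back as a conflict and has to be retried.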

As memracom stated, the replication is not perfect. My experience, even on a
low latency network, is that the only safe way to ensure a client can
immediately read back what they just wrote is to pass them through the likes
of haproxy and use a sticky session. Otherwise you have a good probability of
getting a 404 after a POST (create) or stale data after a PUT (update).

So for what it is worth, here is my advice on choosing a database from an ease
of development standpoint:

1) has as many features as you can get, even if you don't need them initially
2) has top notch libraries for your language / framework
3) has relation awareness - do not denormalize unless you must
4) supports consistency
5) supports in place updates - easily filter and change values (doesn't apply to Datomic)
6) has tools that make schema changes / reshaping data easy, and doable online

Maybe 2 years ago Couchdb was a great solution. But with memory and ssd
storage being so cheap and so much innovation with traditional and NoSQL DBs,
I don't foresee myself deploying Couchdb again. If I did need a place to dump
some semi-structured data, I find Amazon's hosted offerings more attractive.

~~~
js4all
When criticising CouchDB, don't forget about its killer features:

Replication: slave, master, multi-master, pull, push, single, continuous over
http(s), you name it.

Update handlers: You don't have to fetch, modify and save in every case.

MVCC semantics: Lock-free write access. Never, ever database dead-locks.

~~~
tony_landis
Here is the reference for update handlers for anyone wanting to check it out:

[http://docs.couchdb.org/en/latest/ddocs.html#update-function...](http://docs.couchdb.org/en/latest/ddocs.html#update-functions)

It still requires writing code, and moving it into the database.

Once that is done, how do you call that function against an arbitrary list of
documents and pass the new values to it without writing even more code
somewhere?
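
As far as I can tell the answer is one HTTP request per document, so the extra code looks roughly like this Python sketch. The design document name `app` and handler name `set_field` in the usage note are hypothetical; the `_design/<ddoc>/_update/<func>/<docid>` path is the documented way to invoke an update handler, and the query parameters are passed through to the function.

```python
from urllib.parse import quote, urlencode


def update_handler_requests(db_url, ddoc, func, doc_ids, params):
    """Return one (method, url) pair per document id.

    Each request runs the update handler `func` from design document `ddoc`
    against a single document; `params` become the query string.
    """
    base = (
        f"{db_url.rstrip('/')}/_design/{quote(ddoc, safe='')}"
        f"/_update/{quote(func, safe='')}"
    )
    query = urlencode(params)
    return [
        ("PUT", f"{base}/{quote(doc_id, safe='')}?{query}")
        for doc_id in doc_ids
    ]
```

Calling `update_handler_requests("http://localhost:5984/mydb", "app", "set_field", ids, {"field": "status", "value": "closed"})` gives the list of requests to send, which still leaves the loop, error handling and retries on the client side.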

The problem with this workflow of putting code/logic in the db is that it
forces developers out of their preferred development environment, workflow,
and most likely language.

Not to mention the fact that debugging all these couchdb functions and
map/reduce calls becomes a nightmare. And testing - not sure how that could be
done efficiently.

All of this slows development.

It is possible to implement some web apps completely in static html, js, and
couchdb, eliminating the need for anything server side. In those cases,
couchdb is one of a kind.

------
memracom
Meh... Replication has been around for a long time. Couchdb was not anywhere
near the first, in fact Couchdb was just leveraging capabilities that were in
Erlang OTP for many years.

And replication is not perfect either. There are tradeoffs and corner cases
that need to be understood. Riak is another cluster datastore that is written
in Erlang OTP, like Couchdb, and the people who build Riak are quite open
about the issues that they have to deal with and the corner cases. They often
present at conferences and write blogs on these topics so if you really want
to UNDERSTAND replication, google for blogs and conference presentations
connected to Riak.

In any case, Couchdb is good, Riak is good, but even traditional RDBMSes like
PostgreSQL are good and can do replication. In all these cases, the developers
have wrestled with the math and computer science behind replication and have
made something that mere mortals can use.

~~~
BigBlueHat
To be clear, "replication" in this case is "multi-master replication."

Riak's Replication:
[http://docs.basho.com/riak/1.3.2/references/appendices/conce...](http://docs.basho.com/riak/1.3.2/references/appendices/concepts/Replication/)

CouchDB's Replication:
[http://docs.couchdb.org/en/latest/replication/intro.html](http://docs.couchdb.org/en/latest/replication/intro.html)

Riak's is for data safety within the cluster.

CouchDB's is multi-master, cross-cluster, cross-device, etc. It can either be
uni-directional (push or pull) or (if you do both) it can serve to synchronize
two distributed master databases--each serving as a primary write point in the
architecture.
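
A small Python sketch of the "do both" case: it builds the two JSON bodies that would each be POSTed to a node's `_replicate` endpoint. The `source`, `target` and `continuous` fields are the documented ones; the function name and the example URLs are placeholders.

```python
def sync_requests(db_a, db_b, continuous=True):
    """Return the two replication request bodies that keep db_a and db_b in sync.

    Each body describes one uni-directional replication; together they make
    the pair behave as two masters.
    """
    return [
        {"source": db_a, "target": db_b, "continuous": continuous},
        {"source": db_b, "target": db_a, "continuous": continuous},
    ]


bodies = sync_requests("http://a.example.com:5984/notes",
                       "http://b.example.com:5984/notes")
```

Star, ring or hierarchy topologies are the same thing with a different list of (source, target) pairs.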

That seems unique.

------
mwnz
When I first read this I thought it said "Why Republican is Awesome". Now that
would have been an entertaining read (not that the actual article isn't
entertaining - I'm sure it is).

