

STM, CouchDB, and writing 5500 documents/second - swannodette
http://dosync.posterous.com/stm-couchdb-and-pushing-5500-documentssecond

======
cloudhead
...except it's not durable. If your server (the Clojure process) crashes, you
lose the data queued in memory.

~~~
damienkatz
It's quite durable. CouchDB has pure tail-append storage, will fsync every
update, and batches concurrent updates into a single commit. It can get fairly
close to a drive's max sustained write speed.
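
For example, a single fully durable write might look like this (a sketch,
assuming the clj-http client; X-Couch-Full-Commit is CouchDB's per-request
override that forces an fsync before the response):

    (require '[clj-http.client :as http])

    ;; CouchDB fsyncs this update before returning 201
    (http/put "http://localhost:5984/mydb/doc-1"
              {:body "{\"msg\": \"hello\"}"
               :content-type :json
               :headers {"X-Couch-Full-Commit" "true"}})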

~~~
saurik
The point of this article is queuing requests at the web tier until you have
enough of them to swamp the HTTP transport overhead to CouchDB. If the website
crashes, you are going to lose that data.

~~~
damienkatz
I'm not sure I understand. CouchDB won't lose any committed (fsynced) data
when it crashes. And when configured to, it won't return a success until the
data is committed. It's completely durable.

edit: Oh I see, the stuff queued on the Clojure side won't be committed. If he
wanted to write each document synchronously to CouchDB, that would work, and
CouchDB would still batch up the concurrent updates while writing.
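
A minimal sketch of that synchronous approach (assuming the clj-http client;
each future blocks until its own write returns, yet CouchDB coalesces the
concurrent commits):

    (require '[clj-http.client :as http])

    ;; 100 concurrent synchronous writers. Each PUT returns only after its
    ;; update is accepted, and CouchDB batches the concurrent updates into
    ;; group commits on disk.
    (def writers
      (doall
        (for [i (range 100)]
          (future
            (http/put (str "http://localhost:5984/mydb/doc-" i)
                      {:body (str "{\"n\": " i "}")
                       :content-type :json})))))

    (doseq [w writers] @w)   ; wait for every write to complete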

~~~
technoweenie
"As requests come in, instead of connecting immediately to the database, why
not queue them up until we have an optimal number and then do a bulk insert?"

Judging by the code, it looks like he stores about 50 documents before sending
them to CouchDB in one bulk insert. I believe the original poster was asking
about the durability of Clojure's STM, not CouchDB.
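
The pattern in the post is roughly the following (a sketch, not the author's
exact code; post-bulk! stands in for a hypothetical helper that POSTs the
batch to CouchDB's _bulk_docs endpoint). Anything still sitting in the ref
when the JVM dies is lost, which is the durability question here:

    (def pending (ref []))   ; in-memory queue; not durable

    (defn post-bulk! [docs]
      ;; hypothetical helper: POST {"docs": [...]} to /mydb/_bulk_docs
      )

    (defn enqueue [doc]
      (let [batch (dosync
                    (alter pending conj doc)
                    (when (>= (count @pending) 50)
                      (let [b @pending]
                        (ref-set pending [])
                        b)))]
        ;; side effects stay outside the transaction, which may retry
        (when batch
          (post-bulk! batch))))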

------
jchrisa
This is a follow-up to the original post:
<http://dosync.posterous.com/22516635>

This whole thing is making Clojure seem more interesting to me. I'd be curious
to see how close Node.js can come to extracting this much write performance
from CouchDB.

~~~
iamwil
For me, the problem has always been that indexed views take too long to
generate on first access when there are a large number of documents. I know
that some people do a bulk update, then hit each indexed view to get it
updated incrementally.

What can be done to speed up the first query of a new indexed view that you
add later on? Do you just have to wait for it?

What about ad-hoc views? Are there ways to speed those up?

~~~
jchrisa
The key is: don't experiment / develop on a giant database. Make a small
subset of the database for testing views on. Then they are plenty fast.
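
One way to carve out that subset (a sketch, assuming clj-http and cheshire;
copies the first 100 docs via _all_docs into a throwaway test database):

    (require '[clj-http.client :as http]
             '[cheshire.core :as json])

    (let [rows (-> (http/get "http://localhost:5984/bigdb/_all_docs"
                             {:query-params {"limit" 100
                                             "include_docs" "true"}
                              :as :json})
                   :body :rows)
          docs (map #(dissoc (:doc %) :_rev) rows)]  ; drop _rev for fresh inserts
      (http/put "http://localhost:5984/testdb" {})   ; create the test db
      (http/post "http://localhost:5984/testdb/_bulk_docs"
                 {:body (json/generate-string {:docs docs})
                  :content-type :json}))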

As far as generation speed, what matters is that the indexer can keep up with
the insert rate. Unless you are doing a big import from an existing dataset,
you'll have to have A LOT of user activity to generate so much data that you
are outrunning the indexer. In that case you probably have a big enough
project that it makes sense to use a cluster solution like CouchDB-Lounge
(which will divide the generation time by roughly the # of hosts you bring
into the cluster).

Someday soon I hope we'll have an integrated clustering solution (Cloudant is
working to contribute theirs back to the project) so you can just deploy to a
set of nodes and get these benefits without much additional operational
complexity.

------
papaf
If the problem is the overhead of making HTTP connections, surely a pool of
open keep-alive connections solves the problem in a much more durable way?
This is a setup I've seen in a telecom environment that sustained good
throughput.
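
Something like this, perhaps (a sketch; assumes clj-http, whose
with-connection-pool macro reuses persistent keep-alive connections, so each
write skips the TCP setup cost):

    (require '[clj-http.client :as http])

    ;; many single-doc writes over a small pool of keep-alive connections
    (http/with-connection-pool {:threads 4 :default-per-route 4}
      (doseq [i (range 1000)]
        (http/put (str "http://localhost:5984/mydb/doc-" i)
                  {:body (str "{\"n\": " i "}")
                   :content-type :json})))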

~~~
mikealrogers
This was part of my initial reaction.

When CouchDB is under concurrent load it does a "group commit" style write
anyway and can batch nearly as efficiently as bulk docs. The only additional
overhead is the HTTP headers, which is nominal when the documents are a bit
larger than "hello world".

In general CouchDB does better under concurrent write load than with a single
writer, and while these kinds of client-side optimizations can be impressive,
it's not actually what CouchDB is tuned for out of the box.

------
cpinto
Although interesting, bulk insert doesn't (didn't?) work across multiple
databases managed by one host, and the ability to have multiple databases is
one of the things that makes CouchDB a lot of fun.

~~~
couchdb
The benefit of a bulk insert is the ability to shove everything into one file
and fsync once, so splitting across files won't really give the same
performance benefits.

------
jlcgull
What about Riak? How does that compare?

~~~
rb2k_
Riak does support automatic sharding and lets you tune your CAP parameters,
and getting nodes in and out of a cluster is way easier with Riak.

While Riak also supports map/reduce, generating something like CouchDB's
"views" will probably end up being a problem, since listing all the keys of a
large bucket (needed for a view over all your data) is a really expensive
operation in Riak. Also: CouchDB supports incremental views, so once a view
is defined, only the changed documents need to be re-indexed.

My main problem with CouchDB was that in systems with a lot of changing data,
compaction was too slow for me: I was actually changing data faster than the
compactor could keep up with. Also: you have to trigger compaction yourself
by sending an explicit HTTP request.
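
For reference, the manual trigger is just a POST to the database's _compact
resource (sketch, assuming clj-http):

    (require '[clj-http.client :as http])

    ;; returns 202 Accepted; compaction then runs in the background
    (http/post "http://localhost:5984/mydb/_compact"
               {:content-type :json})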

~~~
mikealrogers
Would you be willing to write up, in a little more detail, how your writes
were outrunning the compactor, either in a blog post or in an email to the
Apache CouchDB dev list?

We've been talking about possible improvements to the compactor, and having a
solid use case that outruns it would help focus that work.

