
High Performance Erlang – Finding Bottlenecks in a CouchDB Cluster - signa11
http://kowalski.gd/blog/high-performance-erlang-finding-bottlenecks-couchdb-1
======
im_down_w_otp
I would have expected any series attempting to address optimizing Erlang
applications to have introduced and used cprof, eprof, and fprof.

That said, looking forward to additional installments.

Also, I can totally hear somebody saying, "That's premature optimization,
don't worry about it." with respect to that horrible way of getting the
application's version number. Stuff like that drives me nuts, and this
particular issue is a perfect example of why that blanket statement is so
misguided.

Yeah, your broken, slow, locally isolated implementation of a thing is
innocuous in and of itself... but then some functionality in the critical path
uses it and builds on top of it, and then there end up being dozens or
hundreds of similar problems and internal dependencies... before you know it
you've got death by a trillion tiny cuts.

In the pathological case those trillion tiny cuts start to look like noise
rather than signal when profiling because nothing interesting just jumps out
as critically broken, and you're left just assuming, "Well I guess it's slow
just because the language/runtime/machine/whatever is slow, not because I've
ground performance down to a fine dust through accreted questionable
decisions."
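
To make the "tiny cuts" point concrete, here is a minimal Python sketch. It is not the CouchDB code: `LOADED_APPS`, `version_slow`, and `version_cached` are hypothetical names, and the list scan only stands in for whatever expensive lookup sits behind a version query.

```python
import functools

# Hypothetical stand-in for runtime metadata that is costly to search.
LOADED_APPS = [(f"app_{i}", f"1.0.{i}") for i in range(10_000)] + [("myapp", "2.3.1")]


def version_slow(name):
    """Scan the whole list on every call (the locally 'innocuous' version)."""
    for app, vsn in LOADED_APPS:
        if app == name:
            return vsn
    return None


@functools.lru_cache(maxsize=None)
def version_cached(name):
    """Same lookup, but computed once per name and memoized afterwards."""
    return version_slow(name)
```

Called once, the two are indistinguishable; called on every request in a hot path, only the second stays cheap. The cached variant assumes the version cannot change while the process runs.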

------
MCRed
Slightly off topic, but the general hacker community seems to have somehow
missed it: the creators of CouchDB formed a company and merged with the creators of
Memcached, and the new company is called Couchbase. This is the best NoSQL
database going. Memcached built in, CouchDB views, scalable (really actually
scalable, not mongodb "scalable") etc.

I've long thought the hacker culture ignored databases and just picked
something because it was popular (eg mysql) even though there were superior
(objectively superior- we are engineers after all) solutions out there.

Erlang is one of those languages that is objectively superior -- I've yet to
meet another language that does concurrency right-- yet many hackers just
ignore it because it's not got java's syntax. Which is silly.

Don't make the same mistake with next generation databases.

~~~
ddorian43
"This is the best NoSQL database going."

Yeah, right. Most of the things about it ARE wrong:

1\. couchdb views, who likes async map-reduce indexes?

2\. memcached built in is ok (redis built in would be better)

3\. json documents, even mongodb has bson and not json

4\. the new global-indexes-thing IS NOT scalable because you have to hit every
index-node to do a query

5\. when will you be able to modify a document? looks like that's still in beta

~~~
skjhn
Let's see if I can help here.

A lot of people like async map-reduce. If you need to perform aggregation on a
lot of data, it's constantly growing, and you need the results to be current,
async map-reduce is great. In the best case scenario, the results are
precomputed. In a worst case scenario, they are a few seconds out of date.
However, you have the option of forcing an update if need be. Either way, it's
a hell of a lot faster than running the full aggregation every time it's
requested.
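
A rough Python sketch of the idea, not of how CouchDB or Couchbase implement views: the class `IncrementalSumView`, its methods, and the `stale_ok` flag are all made-up names for illustration. Writes are queued, the stored aggregate is updated by folding in only the new documents, and a reader can either accept a slightly stale result or force an update first.

```python
from collections import defaultdict


class IncrementalSumView:
    """Keeps a per-key running sum instead of re-aggregating every document."""

    def __init__(self, map_fn):
        self.map_fn = map_fn            # doc -> iterable of (key, value) pairs
        self.totals = defaultdict(int)  # the precomputed (reduced) result
        self.pending = []               # documents written but not yet indexed

    def insert(self, doc):
        # Writes stay cheap: the document is only queued for indexing.
        self.pending.append(doc)

    def update(self):
        # Fold only the not-yet-indexed documents into the stored totals.
        for doc in self.pending:
            for key, value in self.map_fn(doc):
                self.totals[key] += value
        self.pending.clear()

    def query(self, key, stale_ok=True):
        # stale_ok=False forces an index update before reading.
        if not stale_ok:
            self.update()
        return self.totals.get(key, 0)


view = IncrementalSumView(lambda doc: [(doc["user"], doc["amount"])])
view.insert({"user": "a", "amount": 5})
view.update()                                 # background indexer catches up
view.insert({"user": "a", "amount": 7})       # arrives after the last update
stale = view.query("a")                       # precomputed, slightly out of date
fresh = view.query("a", stale_ok=False)       # forces the update first
```

The sketch only handles inserts; updating or deleting documents would need the old mapped values to be subtracted, which is left out here.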

Redis is great, but a) the memcached protocol is well established and b) Redis
is more than a simple cache.

BSON vs. JSON, what's the point here?

A query doesn't have to hit every index node. That doesn't make any sense. In
fact, it's quite the opposite. With local indexes, you would, in fact, hit
every single node. With global secondary indexes, you hit the index node with
the right index.
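
As a toy illustration of that difference (a sketch under simplifying assumptions, not Couchbase's actual architecture), the hypothetical `Cluster` class below keeps one local index per data node plus a single global index, and counts how many nodes each query style has to contact.

```python
import zlib


class Cluster:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        # One local index per data node, covering only that node's documents.
        self.local_indexes = [{} for _ in range(num_nodes)]
        # One global index on a single (assumed) index node, covering all documents.
        self.global_index = {}

    def put(self, doc_id, city):
        # Documents are spread over data nodes by a hash of their id.
        node = zlib.crc32(doc_id.encode()) % self.num_nodes
        self.local_indexes[node].setdefault(city, []).append(doc_id)
        self.global_index.setdefault(city, []).append(doc_id)

    def query_local(self, city):
        # Any node may hold matches, so every local index must be asked.
        hits, contacted = [], 0
        for index in self.local_indexes:
            contacted += 1
            hits.extend(index.get(city, []))
        return sorted(hits), contacted

    def query_global(self, city):
        # Only the node that holds the index is asked.
        return sorted(self.global_index.get(city, [])), 1
```

Both query styles return the same document ids; what changes is the number of nodes involved per query.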

Are you talking about partial updates? If so, yes, that will be available in
the next developer preview. Stay tuned.

~~~
ddorian43
Hi,

1\. For indexes? But only couchdb has them. If more people liked them,
wouldn't it be more popular?

2\. Yeah. I agree that for distributed-persistent-memcache it's good.

2.5. JSON is inefficient.

3\. Yeah, but you usually shard indexes, say by user_id. So when you're
filtering where user_id=x and column_b=y you hit only 1 node.

4\. Things that don't have partial updates are key-value dbs, right? If yes,
why don't you call yourself that till you actually have partial updates?

~~~
skjhn
There are a handful of databases that implement map-reduce one way or another
- CouchDB, Couchbase, and MongoDB off the top of my head. Views might be a
CouchDB/Couchbase concept, but not incremental map-reduce.

In what way is JSON inefficient? Are we talking about size?

GSI indexes may or may not be partitioned. With GSI, depending on the index
size and resources available, you would most likely NOT partition the index -
that's the recommendation. You can create an index on user_id and column_b,
place it on a specific node, and you'd only be hitting that node for a query.
Especially if it's a covering index. Again, databases without GSI indexes have
an index partition on every single node - that means hitting every single one
for every single query. I'm still not sure what you're trying to get at.

I'm guessing you are referring to MongoDB shards and routers. However, that
example doesn't make sense. If user_id is the shard key, then yes, the router
sends the query to the right node. The same thing happens with Couchbase.
Given the key, you get the document straight from the node that has it.
However, if you have user_id, why are you querying on column_b too? Now, if
user_id is not the shard key, then no, the router does not send the query to
the right node, it sends the query to every single node.
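
The routing rule described above can be sketched in a few lines of Python. This is a simplification, not MongoDB's or Couchbase's real router: `node_for` and `route` are hypothetical helpers, and hashing with CRC32 is just an assumed placement scheme.

```python
import zlib


def node_for(shard_key_value, num_nodes):
    """Map a shard key value to the node that owns it (assumed hash placement)."""
    return zlib.crc32(str(shard_key_value).encode()) % num_nodes


def route(query, shard_key, num_nodes):
    """Return the nodes a router must contact for an equality query (a dict)."""
    if shard_key in query:
        # The shard key pins the query to exactly one node.
        return [node_for(query[shard_key], num_nodes)]
    # Without the shard key, any node may hold matches: scatter to all of them.
    return list(range(num_nodes))


targeted = route({"user_id": 42, "column_b": "y"}, "user_id", 4)
scattered = route({"column_b": "y"}, "user_id", 4)
```

Here `targeted` contains a single node, while `scattered` lists all four.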

I'm generalizing, but key-value databases are best for key-value operations on
arbitrary data. Document databases understand JSON and, as such, can provide
access via queries. With Couchbase, you can choose from views, N1QL (SQL),
geospatial (built on views), or full-text search (preview). Pretty far off
from a key-value store.

That, and it already has support for partial updates via N1QL. However, my
assumption was that you were talking about partial updates via key-value
operations.

------
tiagobraw
My New Year's resolution was to learn Erlang. I am implementing a simple REST
service with it and I'm loving its approach to concurrency and terseness. It
is indeed a beautiful language.

~~~
pmarreck
Obligatory "try Elixir too; it has the same semantics, a more Ruby-like syntax
(which is a matter of preference) and actual macros, while still compiling
down to the same BEAM!"

------
siscia
Just being extremely pedantic here, but for the less statistically minded
among us I believe it is important.

Just having lower numbers in your benchmark doesn't actually mean that your
software runs faster; you should run a statistical test and see.

In this case, if we run a standard test, the confidence for improved
performance is basically 1, so that claim is definitely accepted.

However, for the claim that performance improved by 8%, the confidence is
around 83%, which is quite high, but lower than the 95% usually expected in
peer-reviewed journals.
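
One way to run such a test with nothing but the Python standard library is a one-sided permutation test. This is a sketch, not the test used for the numbers above: `permutation_p_value` is a hypothetical helper, the timings are invented and not taken from the post, and a speedup threshold is modelled by scaling the old timings before comparing means.

```python
import random
import statistics


def permutation_p_value(before, after, min_speedup=0.0, rounds=5000, seed=0):
    """One-sided permutation test on mean timings.

    Null hypothesis: the new version is at most `min_speedup` (a fraction,
    e.g. 0.08 for 8%) faster than the old one. A small p-value is evidence
    that the speedup exceeds `min_speedup`.
    """
    rng = random.Random(seed)
    # Shrink the old timings by the claimed speedup, then ask whether the
    # new timings are still lower than that.
    scaled = [x * (1.0 - min_speedup) for x in before]
    observed = statistics.mean(scaled) - statistics.mean(after)
    pool = scaled + list(after)
    n = len(scaled)
    at_least_as_extreme = 0
    for _ in range(rounds):
        rng.shuffle(pool)
        diff = statistics.mean(pool[:n]) - statistics.mean(pool[n:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (rounds + 1)


# Hypothetical request latencies in milliseconds (NOT from the article).
before = [100, 102, 98, 101, 99, 103, 100, 97]
after = [91, 93, 90, 94, 92, 89, 93, 91]

p_faster = permutation_p_value(before, after)                     # "is it faster at all?"
p_8_percent = permutation_p_value(before, after, min_speedup=0.08)  # "is it more than 8% faster?"
```

With data like this, "it is faster" is supported far more strongly than "it is more than 8% faster", even though the raw means differ by roughly 8%.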

That said, I really enjoyed the post :)

------
rdtsc
Very well done. And I like the format -- "here are the steps I took from start
to end".

Looking forward to more posts like this.

------
abrookewood
When I think of optimisation, my first thought isn't to go and patch the
source code for a large open source application! I'm impressed ... just not
sure it would be my first step.

