

In-memory key-value store in C, Go and Python  - nkurz
http://www.darkcoding.net/software/in-memory-key-value-store-in-c-go-and-python/

======
antirez
In order to perform such a comparison you can't write three small throwaway
programs; you need to optimize each version as best you can for weeks before
the results start to be meaningful, and you need an expert in all three systems.

Otherwise the test is still interesting, but it becomes: "what is the best
language for writing a memcached clone in a few hours, without being an expert
in any of them?" That still says something about how different the three
languages are, but it does not say much about which is the _best_ language for
implementing the system.

Btw, a more serious test would also measure another parameter that you did not
consider much but that is very important: memory usage per key in the three
versions and, in general, memory behavior.

~~~
willvarfar
As the author points out in the article, the neat thing is that Go does this
for you. That was the major takeaway.

~~~
antirez
This is the default programming model in Go, but it's the same for Node.js,
and you can use an easy-to-use event library in Ruby and Python as well.

A key-value store is systems programming. It should be carefully designed; in
the real world you don't simply accept the default I/O model of the language
you are going to use.

p.s. in the case of C it's hard to say what the default I/O model even is. It
supports threads, fork, multiplexing, and blocking and non-blocking I/O all in
basically the same (low-level) way.

~~~
yummyfajitas
In the real world, defaults and simple-to-use libraries (i.e. not twisted)
matter. Just because you are doing system programming doesn't mean you
necessarily have the time/skill to carefully optimize everything.

Systems programming isn't just writing a KV store used by millions. Sometimes
it's writing a dedicated calculation server used in a single company.
Carefully optimizing your I/O model might take longer than doing the project
itself.

~~~
ootachi
There _is no default_ in C, beyond stdio. The C specification makes no
reference to any I/O beyond stdio.

The author was being unfair to C. Use epoll or, better yet, libuv. libuv is a
simple-to-use library.

------
timtadh
Looking at the Go version[1] I am not sure it will work as expected. Maps in
Go are not thread safe.[2] So it could be that the Go version is outperforming
the others due to the lack of synchronisation. There are several ways to fix
this in Go; the easiest might be to use a mutex around the accesses to the
cache, but it would probably be better to use a readers-writer lock.

edit: (that said, I am not sure the C version is thread safe either; I haven't
read the docs for the hash table he is using.)

edit 2: (looks like the C version is not thread safe either).
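
The mutex fix described above, sketched here in Python purely for illustration
(this hypothetical `Cache` class mirrors what a mutex around the Go map would
do):

```python
import threading

class Cache:
    """A dict guarded by a lock, so concurrent readers and writers don't race."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        # Every write goes through the lock.
        with self._lock:
            self._data[key] = value

    def get(self, key):
        # Every read goes through the same lock.
        with self._lock:
            return self._data.get(key)
```

A readers-writer lock (Go's sync.RWMutex) would additionally let concurrent
gets proceed in parallel.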

[1] [https://github.com/grahamking/Key-Value-
Polyglot/blob/master...](https://github.com/grahamking/Key-Value-
Polyglot/blob/master/memg.go)

[2] <http://golang.org/doc/go_faq.html#atomic_maps>

~~~
jbarham
You're correct that updating a map from multiple goroutines without
synchronizing is unsafe. But the overhead of using a RWMutex
(<http://weekly.golang.org/pkg/sync/#RWMutex>) in this case should be
negligible compared to the I/O code.

The big win is that Go allows you to write straightforward concurrent code but
under the hood uses high-performance system calls like epoll.

EDIT: Here's a thread-safe version: [https://github.com/jbarham/Key-Value-
Polyglot/blob/master/me...](https://github.com/jbarham/Key-Value-
Polyglot/blob/master/memg.go). Still plenty fast.

~~~
teejae
FYI, in the Go version the lock can be added more simply, without a separate
lock variable, by embedding it in the struct:

    // Synchronize map access between multiple goroutines.
    type cache struct {
        m map[string]string
        sync.RWMutex
    }

Then you can just use cache.Lock/Unlock/RLock/RUnlock directly. It's clearer.
The beauty of Go's mixins.

------
irahul
I hand coded an epoll implementation for Python.

[https://github.com/rahulkmr/Key-Value-
Polyglot/blob/master/m...](https://github.com/rahulkmr/Key-Value-
Polyglot/blob/master/memg_epoll.py)

As far as raw benchmarks go, it runs faster than the Go version on my
machine:

    
    
        # Go version. Changed test.py to 10000 gets and sets.
        ± $ time python test.py
        python test.py  0.48s user 0.60s system 47% cpu 2.289 total
    
    
        # Python epoll version. Changed test.py to 10000 gets and sets.
        ± $ time python test.py
        python test.py  0.20s user 0.26s system 50% cpu 0.903 total
    

But the Go version is easier to read and write compared to the Python one,
which requires knowledge of epoll.

Standard disclaimer: Please note that this comparison is highly unscientific,
and take the numbers with a grain of salt.
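
For readers who haven't used epoll, the core of the pattern looks roughly like
this; a toy sketch using a socketpair, not the actual memg_epoll.py code:

```python
import select
import socket

# A connected socket pair stands in for a client/server connection.
client, server = socket.socketpair()
server.setblocking(False)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)  # wake when server side is readable

client.sendall(b"get greeting\r\n")

# The event loop: block until some registered fd is ready, then service it.
data = b""
for fd, event in ep.poll(timeout=1):
    if event & select.EPOLLIN:
        data = server.recv(4096)

ep.close()
client.close()
server.close()
```

A real server registers many connections with one epoll instance; the payoff
comes from multiplexing them all in a single loop.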

~~~
jlouis
I wonder how a barebones Erlang implementation would perform on this. It
doesn't look too hard to write :)

------
willbmoss
I added an implementation [1] in diesel [2][3], which uses select.epoll (or
libev on non-Linux systems) and got around a 150x speedup [4]. I only repeated
the tests a few times (but they were all close), and I didn't install the Go
compiler, so I couldn't test against Go (I'd be interested to see how this
stacks up on your machine). Like you say in your post, it's nice to have
something wrap up the bother of epoll for you.

[1] <https://github.com/wmoss/Key-Value-Polyglot>

[2] diesel.io

[3] <https://github.com/jamwt/diesel>

[4] The first run is against the diesel one

    wmoss@wmoss-mba:~/etc/Key-Value-Polyglot$ time python test.py
    real    0m0.134s
    user    0m0.040s
    sys     0m0.020s

    wmoss@wmoss-mba:~/etc/Key-Value-Polyglot$ time python test.py
    real    0m20.164s
    user    0m0.096s
    sys     0m0.072s

~~~
chmike
What is the difference between diesel and gevent? Note that gevent 1.0 uses
libev.

~~~
willbmoss
In a nutshell, gevent monkey-patches the socket library, whereas diesel
doesn't. This means that you can use any (previously) blocking libraries with
gevent, whereas in diesel you have to rewrite them. The upside of the
rewriting is that it creates a more coherent (and opinionated) ecosystem.

------
malkia
The "C" code is very poor.

strtok is not reentrant. And why use it at all when you're only looking for
" "? Use strchr. strlen() is called over and over instead of keeping the
lengths somewhere. Also, the comparison to "set" / "get" could be done char by
char, or made somewhat faster with a perfect hash generator (though even by
hand it can be made very fast). 'get ' and 'set ' can be checked directly
against one uint32_t rather than byte by byte....

And let's not talk about the needless hidden calls to memory allocation,
instead of using slabs, or something more appropriate for the task. (strdup so
many places too).

But that's all heresy. I'm a video game programmer, give me such code and I'll
beat it up, except send/recv. So what? So fucking what?

------
caf
In C you are _not_ necessarily "on your own" - you can do something very
similar to Python, turning your socket into a file with fdopen(). You can then
use functions like fgets() and fread() on it, and stdio will take care of the
buffering for you.

------
spullara
Are we seriously discussing a benchmark that only runs 1000 operations? I
don't even understand how it could take 20s to complete in any language on the
server side and be correct code. Implement the Redis protocol and use the
included redis-benchmark to test your server. On a decent Mac you should be
able to hit 500k/s with pipelining and 25k/s without it.

~~~
axbx
I immediately disregard anyone who doesn't show their benchmarks running for
an amount of time that would reasonably reflect production use.

I have absolutely no problem with letting my stuff fire requests off on 12+
core machines for hours or days on end. And then repeating it. And again.

Production means 24/7, and when I read your less-than-an-hour benchmark on ANY
test - well, that's just not right.

~~~
sausagefeet
Totally; some things you just don't see until the program has run for a long
time. I haven't read the code, but people mentioned the author is using malloc
in a lot of places. You won't see the ill effects of memory fragmentation in a
quick benchmark; it could take hours or days to see how large the holes in
your memory space get.

~~~
ootachi
And in Go, they aren't running long enough to see the GC kick in. I suspect
the overhead of GC will be significant without tuning.

------
comex
Just running (a single instance of) test.py as a benchmark does not make
sense.

epoll is optimized for efficiently handling large numbers of sockets, but here
there is only one socket. There is no reason epoll should be faster at
blocking socket I/O than blocking socket I/O; if it is, I blame the kernel.

(Incidentally, here on OS X where there is no epoll, all the solutions
performed pretty terribly - a few seconds for 50000 iterations.)

~~~
caf
Yes, the explanation given does not seem to pass the sniff test. If all the
wall-time is being taken in recv(), there's no reason why using epoll() would
be any improvement.

~~~
codeape
The reason the Python version is slow is (I believe) that the code is very
inefficient. It uses socket.sendall() instead of sockfile.write/flush. Using
sockfile.write/flush speeds it up from 50 requests/second to 7k
requests/second on my machine.
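
The difference, roughly: a file object made with socket.makefile() buffers
small writes and pushes them out together on flush(), instead of issuing one
syscall (and potentially one TCP segment) per sendall(). A minimal sketch,
with a socketpair standing in for a real client connection:

```python
import socket

a, b = socket.socketpair()
sockfile = a.makefile("wb")  # buffered writable file object over the socket

# Three logical writes accumulate in the user-space buffer...
sockfile.write(b"VALUE greeting 0 5\r\n")
sockfile.write(b"hello\r\n")
sockfile.write(b"END\r\n")
sockfile.flush()  # ...and go out together here

reply = b.recv(4096)

sockfile.close()
a.close()
b.close()
```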

~~~
graham_king_3
I'm the author of the original post. The difference has nothing to do with
epoll, these comments are correct. Thanks particularly to codeape for his pull
request which made me realise this. Sorry everyone. I will fix the post.

Reducing the number of send calls, in both the C and Python versions, makes
them enormously faster. Go is already batching up the writes, hence the
apparent speed advantage.

If you strace the client, you see that the "get" case was replying with two
send calls, one for the "VALUE" line, another with the value and "END". All
the time is consumed with the client waiting to receive that second message.
Depending on the client, and I tried a bunch of ways, it's either in 'poll'
(pylibmc), 'futex' (Go), or 'recv' (basic python socket). That second receive
is about two orders of magnitude slower than the previous recv.

Why does reading that second line take so much longer?

There's more detail here: [https://github.com/grahamking/Key-Value-
Polyglot/pull/5#issu...](https://github.com/grahamking/Key-Value-
Polyglot/pull/5#issuecomment-4667123)

~~~
caf
It could be the Nagle algorithm - setting the TCP_NODELAY socket option on the
sender would be one way to test.

~~~
graham_king_3
Yes, thank you! Adding this line to the two-writes Python version (memg-
slow.py on github) makes it about as fast as the single write version (memg.py
on github):

sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

It's an interaction between delayed ACKS and the Nagle algorithm, mentioned on
the Nagle algorithm wikipedia page.

I'm learning a lot this week. Thanks again.
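
For reference, the fix amounts to setting one socket option; a minimal sketch,
reading the option back just to confirm it stuck:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm: small writes go out immediately instead of
# being held back until earlier unacknowledged data has been ACKed.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```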

------
alexchamberlain
This comparison is rather unfair on C, where you have chosen to use a low
level interface, against Go, where you have chosen to use a high level
interface. It is irrelevant that these are the _default_ interfaces - high
level interfaces for sockets exist in C. You could even integrate into Nginx.

~~~
pjmlp
This is typical in the Go community.

Usually the supposedly Go advantages are presented in a way, as if the same
are not present in other languages.

~~~
drivebyacct2
The author is up front that more optimal and performant designs exist for
Python and C, but remarks that this is a fast, easy, naive implementation. Who
here doesn't appreciate those qualities, or hasn't worked on a team where
their life would have been easier if that _one guy_ had a slightly harder time
screwing things up?

~~~
ootachi
Careful with that argument. The fast, easy, naïve implementation in Go is
quite dangerous (not even memory safe), because he forgot to lock the map. Go
made it easy to write an _incorrect_ implementation of a key-value store...

------
apaprocki
Not everyone has this problem, but Go only works on a portion of platform
configurations that are available in the real world.

C and Python, OTOH, are available pretty much anywhere. Redis builds on, say,
Solaris with no problem because the project is written in C and it is trivial
to add the needed calls. A KV store written in Go can't support Solaris
because Go itself would need to support Solaris first.

Years of tooling centered around C (e.g., autoconf/automake) is what makes
most C programs cross-platform out of the box, with little or no OS-specific
code if you stick to POSIX. Until the same ecosystem develops around a new
language, authors should realize that the choice of language alone can
immediately limit their cross-platform capabilities.

~~~
drivebyacct2
Uh, where is Python available that Go is not, and how portable is that C code?

~~~
apaprocki
I am using Python (as well as Perl, OCaml, etc.) on Solaris/Sparc, AIX/POWER7,
and HP-UX/IA64 on a daily basis. As for C code, to use redis as an example, I
built and installed it on Solaris/Sparc without incident. This is the case for
most POSIX-compliant C apps.

~~~
tptacek
For a long time, Redis didn't even have a configure script; you just typed
"make" and it worked. It's scrupulously portable code.

~~~
apaprocki
Aside: I noticed that redis on Solaris used select() instead of the fancy new
"event completion framework"[1] in Solaris 10. I figured maybe the new API
would be faster and I could contribute that back, so I ported it to that
framework, poll(), and /dev/poll (I suspect the event API just wraps
/dev/poll, given the perf numbers I saw, but it isn't clear). It turned out
that select() is actually faster than any of them, which struck me as a bit
odd.

[1]:
[http://developers.sun.com/solaris/articles/event_completion....](http://developers.sun.com/solaris/articles/event_completion.html)

------
codeape
I changed memg.py to use sockfile.write()/.flush() instead of
socket.sendall().

This makes the memg.py server more than 100x faster. It outperforms a
gevent-based implementation by 10%.

See [https://github.com/codeape2/Key-Value-
Polyglot/commit/cbc53a...](https://github.com/codeape2/Key-Value-
Polyglot/commit/cbc53af2ed32c3d54f265cf056d5680ce7ea3957)

EDIT: It does not outperform the gevent-based implementation. More performance
testing indicates that gevent is around 2x faster. But it outperforms the
original version by an order of magnitude.

~~~
irahul
sock.sendall() isn't the same as `sock.send(); sock.flush()`

sendall will look something like:

    
    
        def sendall(sock, msg):
            totalsent = 0
            MSGLEN = len(msg)
            while totalsent < MSGLEN:
                sent = sock.send(msg[totalsent:])
                if sent == 0:
                    raise RuntimeError("socket connection broken")
                totalsent = totalsent + sent

~~~
codeape
I don't use socket.send/flush. I use sockfile.write/flush.

------
lucian1900
This compares a non-blocking API to two blocking APIs. A fairer comparison
would use Twisted for Python and perhaps libev(ent) for C.

But it's an overly small test anyway.

------
samwilliams
Interesting!

I was actually thinking about writing something very similar as an Erlang C
node just a couple of days ago. I noted that the overhead of storing a mnesia
table of 5 million rows of 3 integers was huge - it would take up 1.6GB in
memory! If you know the size of the struct, it should be pretty easy to make a
fast lookup system (assuming the keys are sequential) too.

I wonder if I could wrap this instead...

~~~
jlouis
Use an ETS table and mark it as compressed :)

------
nikcub
I just happened to start writing a redis clone in Python a few weeks ago. I
used the ioloop from Tornado to take advantage of epoll (you can also use
gevent). I have yet to benchmark it, but I suspect it will bring you closer to
the results you see in Go.

------
SamReidHughes
It's pretty silly to stress-test a program written in C or Go using a client
written in Python. It's very easy for the stress client to be the bottleneck
when benchmarking memcached implementations, even when the client is written
in C.

------
halayli
Consider using lthread for your C version. You'll save time and gain
performance. (<http://github.com/halayli/lthread>)

------
sdfjkl
Uhm, where are the numbers?

