Hacker Newsnew | comments | show | ask | jobs | submit login
In-memory key-value store in C, Go and Python (darkcoding.net)
186 points by nkurz 1163 days ago | 56 comments



In order to perform such a comparison you can't write three small trow-away programs, you need to optimize each version at your best for weeks to start to be meaningful, and you need an expert in the three systems.

Otherwise the test is still interesting but is: "what is the best language to write a memcached clone without being an expert in a given language, using a few hours", that still says something about how different the three languages are, but does not say much about what is the best language to implement the system.

Btw in a more serious test another parameter that you did not considered much is very important, that is, memory usage per-key in the three versions, and in general, memory behavior.

-----


Yeah, a brief look at his buffering code raised alarms all over the place for me. The use of dynamic memory allocation; the use of strlen to find out if the current buffer is empty - seriously? memset()'ing the entire buffer instead of bothering to zero out just the by after the end of the recv(). That's ignoring various bugs and other stuff.

I haven't looked at the rest of the code, but if it's anything like the buffering code this is nothing like what you'd expect to see in a decent C implementation.

It's fairy enough to demonstrate that the Go version makes it easy to write decent performing code, but the C version is atrocious (though to his credit it does at least buffer - I've seen so much C networking code that murders performance by doing small read()'s that I want to cry, including for the longest time the MySQL client library).

Of course part of his criticism is also down to not bothering to look for the plethora of C networking libraries that does this and does it well.

-----


As the author points out in the article, the neat thing is that Go does this for you. That was the major takeaway.

-----


This is the default programming model in Go, but it's the same for Node.js, and you can use an easy-to-use event library in Ruby and Python as well.

A key-value store is system programming. It should be carefully designed, it's not real-world that you accept the default I/O model of the language you are going to use.

p.s. in the case of C it's hard to argue what is the default I/O model. It supports threads, fork, multiplexing, blocking and non blocking I/O, in the same way basically (in a low level way).

-----


In the real world, defaults and simple-to-use libraries (i.e. not twisted) matter. Just because you are doing system programming doesn't mean you necessarily have the time/skill to carefully optimize everything.

System's programming isn't just writing a KV store used by millions. Sometimes it's writing a dedicated calculation server used in a single company. Carefully optimizing your I/O model might take longer than doing the project itself.

-----


There is no default in C, beyond stdio. The C specification makes no reference to any I/O beyond stdio.

The author was being unfair to C. Use epoll or, better yet, libuv. libuv is a simple-to-use library.

-----


Looking at the go version[1] I am not sure if will work as expected. Maps in go are not thread safe.[2] So it could be the go version is out performing the others do to the lack of synchronisation. There are several ways to fix this in go, the easiest might just be to use a mutex around the accesses to the cache. But it would probably be better to use a readers writers lock.

edit: (that said I am not sure if the C version is thread safe either I haven't read the docs for the hash table he is using.)

edit 2: (looks like the C version is not thread safe either).

[1] https://github.com/grahamking/Key-Value-Polyglot/blob/master...

[2] http://golang.org/doc/go_faq.html#atomic_maps

-----


You're correct that updating a map from multiple goroutines without synchronizing is unsafe. But the overhead of using a RWMutex (http://weekly.golang.org/pkg/sync/#RWMutex) in this case should be negligible compared to the I/O code.

The big win is that Go allows you to write straightforward concurrent code but under the hood uses high-performance system calls like epoll.

EDIT: Here's a thread-safe version: https://github.com/jbarham/Key-Value-Polyglot/blob/master/me.... Still plenty fast.

-----


FYI, In the Go version, the lock can be added more simply, without the rw:

// Synchronize map access between multiple goroutines. type cache struct { m map[string]string sync.RWMutex }

Then, you can just use:

cache.Lock/Unlock/RLock/RUnlock directly. It's clearer. The beauty of Go's mixins.

-----


So much for CSP, if you're using threads and shared state.

-----


I hand coded an epoll implementation for Python.

https://github.com/rahulkmr/Key-Value-Polyglot/blob/master/m...

As far as raw benchmark goes, it runs faster than the go version on my machine:

    # Go version. Changed test.py to 10000 gets and sets.
    ± $ time python test.py
    python test.py  0.48s user 0.60s system 47% cpu 2.289 total


    # Python epoll version. Changed test.py to 10000 gets and sets.
    ± $ time python test.py
    python test.py  0.20s user 0.26s system 50% cpu 0.903 total
But go version is easier to read and write, compared to Python which requires the knowledge of epoll.

Standard disclaimer: Please note that this comparison is highly unscientific, and take the numbers with a grain of salt.

-----


I wonder how a barebones Erlang-implementation will perform on this. It doesn't look too hard to write :)

-----


I added an implementation [1] in diesel [2][3], which uses select.epoll (or libev, on non-Linux systems) and got a around 150x speedup [4]. I only repeated the tests a few times (but they were all close) and didn't install the Go compiler so I could test against Go (I'd be interested to see how this stacks up on your machine). Like you say in your post, it's nice to have something wrap up the bother of epoll for you.

[1] https://github.com/wmoss/Key-Value-Polyglot

[2] diesel.io

[3] https://github.com/jamwt/diesel

[4] The first run is against the diesel one

wmoss@wmoss-mba:~/etc/Key-Value-Polyglot$ time python test.py

real 0m0.134s

user 0m0.040s

sys 0m0.020s

wmoss@wmoss-mba:~/etc/Key-Value-Polyglot$ time python test.py

real 0m20.164s

user 0m0.096s

sys 0m0.072s

-----


I lack knowledge on networking/event-based systems on a fundamental level.

Here's what I don't understand:

* test.py is sequential: It first does 500 sets then 500 gets, all in one thread, using a single connection to the server.

* The socket handling function (memg.py:handle_con/memg-diesel.py:handle_con) is called once. There is no parallell execution going on.

* So why is the memg-diesel.py code so much faster? What makes the code for sending and receiving data to/from the socket so much faster?

Could someone please explain to me why an epoll-based solution is so much faster?

-----


What is the difference between diesel and gevent ? Note that gevent 1.0 uses libev.

-----


In a nutshell, gevent monkey patches the socket library, whereas diesel doesn't. This means that you can use any (previously) blocking libraries with gevent, whereas, in diesel you have to write them again. The upside of the rewriting is that it creates a more coherent (and opinionated) ecosystem.

-----


http://diesel.io/faq

-----


also, the diesel one is half the LOC

/shameless plug :-)

(for the record, on my machine the go comparison was 97ms vs. 173ms, so pure python + diesel was 1.78x slower)

-----


The "C" code is very poor.

strtok is not reentrant safe. And why use it, when looking only for " ", use strchr. strlen() is used over and over, instead of keeping lengths somewhere. Also comparison to "set" / "get" could be than char by char, or by using the perfect hash generator somewhat faster code (but even by hand it can be made very fast). 'get ' and 'set ' can directly be checked using one uint32_t rather than byte by byte comparison....

And let's not talk about the needless hidden calls to memory allocation, instead of using slabs, or something more appropriate for the task. (strdup so many places too).

But that's all heresy. I'm a video game programmer, give me such code and I'll beat it up, except send/recv. So what? So fucking what?

-----


In C you are not necessarily "on your own" - you can do something very similar to Python, turning your socket into a file with fdopen(). You can then use functions like fgets() and fread() on it, and stdio will take care of the buffering for you.

-----


Are we seriously discussing a benchmark that only runs 1000 operations? I don't even understand how it could take 20s to complete in any language on the server side and be correct code. Implement the Redis protocol and use the included redis-benchmark to test your server. On a decent Mac you should be able to hit 500k/s with pipelining and 25k/s without it.

-----


I immediately disregard anyone that doesn't show their benchmarks running for any amount of time that would reasonably be used in production.

I have absolutely no problem with letting my stuff fire requests off on 12+ core machines for hours or days on end. And then repeat it. And again.

Production means 24/7 and when I read your less than an hour benchmark on ANY test - well, that's just a not right.

-----


Totally, you just don't see some things after a long time. I haven't read the code but people mentioned the author is using malloc in a lot of places, you wont' see ill effects from memory fragmentation with a quick benchmark, it could take hours or days to see how large of holes you have in your memory space.

-----


And in Go, they aren't running long enough to see the GC kick in. I suspect the overhead of GC will be significant without tuning.

-----


The slowness actually appears to be down to his client implementation - see the comment by Clément Stenac on the original post.

-----


Just running (a single instance of) test.py as a benchmark does not make sense.

epoll is optimized for efficiently handling large numbers of sockets, but here there is only one socket. There is no reason epoll should be faster at blocking socket I/O than blocking socket I/O; if it is, I blame the kernel.

(Incidentally, here on OS X where there is no epoll, all the solutions performed pretty terribly - a few seconds for 50000 iterations.)

-----


Yes, the explanation given does not seem to pass the sniff test. If all the wall-time is being taken in recv(), there's no reason why using epoll() would be any improvement.

-----


The reason the python version is slow is (I believe) that the code is very inefficient. It uses socket.sendall() instead of sockfile.write/flush. Using sockfile.write/flush speeds it up from 50 requests/second to 7k requests/second on my machine.

-----


I'm the author of the original post. The difference has nothing to do with epoll, these comments are correct. Thanks particularly to codeape for his pull request which made me realise this. Sorry everyone. I will fix the post.

Reducing the number of send calls, in both the C and Python versions, makes them enormously faster. Go is already batching up the writes, hence the apparent speed advantage.

If you strace the client, you see that the "get" case was replying with two send calls, one for the "VALUE" line, another with the value and "END". All the time is consumed with the client waiting to receive that second message. Depending on the client, and I tried a bunch of ways, it's either in 'poll' (pylibmc), 'futex' (Go), or 'recv' (basic python socket). That second receive is about two orders of magnitude slower than the previous recv.

Why does reading that second line take so much longer?

There's more detail here: https://github.com/grahamking/Key-Value-Polyglot/pull/5#issu...

-----


It could be the Nagle algorithm - setting the TCP_NODELAY socket option on the sender would be one way to test.

-----


Yes, thank you! Adding this line to the two-writes Python version (memg-slow.py on github) makes it about as fast as the single write version (memg.py on github):

sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

It's an interaction between delayed ACKS and the Nagle algorithm, mentioned on the Nagle algorithm wikipedia page.

I'm learning a lot this week. Thanks again.

-----


This comparison is rather unfair on C, where you have chosen to use a low level interface, against Go, where you have chosen to use a high level interface. It is irrelevant that these are the default interfaces - high level interfaces for sockets exist in C. You could even integrate into Nginx.

-----


This is typical in the Go community.

Usually the supposedly Go advantages are presented in a way, as if the same are not present in other languages.

-----


This is actually typical in most programming language communities that I've observed.

-----


The author is up front about there existing more optimal and performant designs for python and C but remarks that this is a fast, easy, naive implementation. Who here doesn't appreciate those qualities or who here hasn't worked on a team where there life would have been easier if that one guy had a little bit harder time screwing things up?

-----


Careful with that argument. The fast, easy, naïve implementation in Go is quite dangerous (not even memory safe), because he forgot to lock the map. Go made it easy to write an incorrect implementation of a key-value store...

-----


Not everyone has this problem, but Go only works on a portion of platform configurations that are available in the real world.

C and Python, OTOH, are available pretty much anywhere. Redis builds on say, Solaris, with no problem because the project is written in C and it is trivial to add the needed calls. A KV store written in Go can't support Solaris because Go itself would need to support Solaris first.

Years of tooling centered around C (e.g., autoconf/automake) is what makes most C programs cross-platform out of the box with little or no OS-specific code if you are sticking to POSIX. Until the same ecosystem develops around any new language, authors realize that choice of language alone can immediately limit their cross-platform capabilities.

-----


gccgo does support Solaris, and should support any operating system that uses ELF binaries.

The gc Go compiler currently supports FreeBSD, Linux, Mac OS X, and Windows. That's at least 95% of the servers out there (probably more). There is code to support NetBSD, OpenBSD, and Plan 9, but we have held off polishing it for Go 1.

-----


Is it a huge amount of work to support non-ELF formats? I haven't looked at the code, but I guess it is not using bfd? AIX uses XCOFF, not ELF. I'd be interested in finding out what it would take to add support.

-----


Gold is an ELF linker. I don't think it was designed to support other object formats, so it's non-trivial at the least.

-----


gccgo works without gold, albeit not as well, some people have tried it even on Irix. iant told me even Windows might work as is, though nobody tried. In any case, it should be very easy to make Windows work as well.

-----


Uh where is python available that Go is not and how portable is that C code:

-----


I am using Python (as well as Perl, OCaml, etc) on Solaris/Sparc, AIX/POWER7, HP-UX/IA64 on a daily basis. As for C code, If you just want to use redis as an example, I built and installed it on Solaris/Sparc with no incident. This is the case for most C POSIX compliant apps.

-----


For a long time, Redis didn't even have a configure script; you just typed "make" and it worked. It's scrupulously portable code.

-----


Aside: I noticed that redis/Solaris used select() instead of the fancy new "event completion framework"[1] in Solaris 10. I figured maybe the new API would be faster and I could contribute that back, so I ported to both that framework, poll(), and /dev/poll (I feel the event API just wraps /dev/poll due to the perf #s I saw, but it isn't clear) and it turned out that select() is actually faster than any of them, which struck me as a bit odd.

[1]: http://developers.sun.com/solaris/articles/event_completion....

-----


He just said where.

-----


I changed memg.py to use sockfile.write()/.flush() instead of socket.sendall().

This makes the memg.py server > x100 faster. It outperforms a gevent-based implementation by 10%.

See https://github.com/codeape2/Key-Value-Polyglot/commit/cbc53a...

EDIT: It does not outperform the gevent-based implementation. More performance testing indicates that gevent is around 2x faster. But it outperforms the original version by an order of magnitude.

-----


sock.sendall() isn't the same as `sock.send(); sock.flush()`

sendall will look something like:

    def sendall(sock, msg):
        totalsent = 0
        MSGLEN = len(msg)
        while totalsent < MSGLEN:
            sent = sock.send(msg[totalsent:])
            if sent == 0:
                raise RuntimeError("socket connection broken")
            totalsent = totalsent + sent

-----


I don't use socket.send/flush. I use sockfile.write/flush.

-----


Comparing a non-blocking API to two blocking APIs. A more fair comparison would be using Twisted for Python and perhaps libev(ent) for C.

But it's a overly small test anyway.

-----


Interesting!

I was actually thinking about writing something very similar as an erlang C node just a couple of days a go. I noted that the overhead for storing a mnesia table of 5 million rows of 3 integers was huge - it would take up 1.6gb in memory! If you know the size of the struct, it should pretty easy to make a fast lookup system (assuming the keys are sequential) too.

I wonder if I could wrap this instead...

-----


Use an ETS table and mark it as compressed :)

-----


I just happen to start writing a redis clone in python a few weeks ago. I used the ioloop from Tornado to take advantage of epoll (you can also use gevent). Have yet to benchmark it, but I suspect it will bring you closer to the results you see in Go.

-----


It's pretty silly to stress-test a C or Go-written program using a client written in Python. It's very easy for the stress client to be the bottleneck in benchmarking memcached implementations. Even when the client is written in C.

-----


Consider using lthread for your c version. You'll save time and gain performance. (http://github.com/halayli/lthread)

-----


Uhm, where are the numbers?

-----




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: