In order to perform such a comparison you can't just write three small throw-away programs; you need to optimize each version as best you can for weeks before the results start to be meaningful, and you need an expert in each of the three systems.
Otherwise the test is still interesting, but what it measures is: "what is the best language for writing a memcached clone in a few hours, without being an expert in that language?" That still says something about how different the three languages are, but it does not say much about which language is best for implementing the system.
Btw, in a more serious test another parameter that you did not consider much is very important: per-key memory usage in the three versions and, in general, memory behavior.
Yeah, a brief look at his buffering code raised alarms all over the place for me. The use of dynamic memory allocation; the use of strlen to find out if the current buffer is empty - seriously?; memset()'ing the entire buffer instead of just zeroing the byte after the end of the recv(). And that's ignoring various bugs and other stuff.
I haven't looked at the rest of the code, but if it's anything like the buffering code this is nothing like what you'd expect to see in a decent C implementation.
It's fair enough to demonstrate that the Go version makes it easy to write decent-performing code, but the C version is atrocious (though to his credit it does at least buffer - I've seen so much C networking code that murders performance by doing small read()s that I want to cry, including, for the longest time, the MySQL client library).
Of course part of his criticism is also down to not bothering to look for the plethora of C networking libraries that do this and do it well.
In the real world, defaults and simple-to-use libraries (i.e. not twisted) matter. Just because you are doing system programming doesn't mean you necessarily have the time/skill to carefully optimize everything.
Systems programming isn't just writing a KV store used by millions. Sometimes it's writing a dedicated calculation server used in a single company. Carefully optimizing your I/O model might take longer than doing the project itself.
Looking at the Go version, I am not sure it will work as expected. Maps in Go are not thread safe, so it could be that the Go version is outperforming the others due to the lack of synchronisation. There are several ways to fix this in Go; the easiest might be to use a mutex around the accesses to the cache, but it would probably be better to use a readers-writer lock.
edit: (that said I am not sure if the C version is thread safe either I haven't read the docs for the hash table he is using.)
edit 2: (looks like the C version is not thread safe either).
You're correct that updating a map from multiple goroutines without synchronizing is unsafe. But the overhead of using a RWMutex (http://weekly.golang.org/pkg/sync/#RWMutex) in this case should be negligible compared to the I/O code.
The big win is that Go allows you to write straightforward concurrent code but under the hood uses high-performance system calls like epoll.
As far as the raw benchmark goes, the Python epoll version runs faster than the Go version on my machine:
# Go version. Changed test.py to 10000 gets and sets.
± $ time python test.py
python test.py 0.48s user 0.60s system 47% cpu 2.289 total
# Python epoll version. Changed test.py to 10000 gets and sets.
± $ time python test.py
python test.py 0.20s user 0.26s system 50% cpu 0.903 total
But the Go version is easier to read and write, compared to the Python one, which requires knowledge of epoll.
Standard disclaimer: Please note that this comparison is highly unscientific, and take the numbers with a grain of salt.
I added an implementation in diesel, which uses select.epoll (or libev on non-Linux systems), and got around a 150x speedup. I only repeated the tests a few times (but they were all close) and didn't install the Go compiler, so I couldn't test against Go (I'd be interested to see how this stacks up on your machine). Like you say in your post, it's nice to have something wrap up the bother of epoll for you.
In a nutshell, gevent monkey patches the socket library, whereas diesel doesn't. This means that you can use any (previously) blocking libraries with gevent, whereas, in diesel you have to write them again. The upside of the rewriting is that it creates a more coherent (and opinionated) ecosystem.
strtok is not reentrant. And why use it at all when you're only looking for " "? Use strchr.
strlen() is used over and over, instead of keeping lengths somewhere.
Also, the comparison against "set" / "get" could be done char by char, or with a perfect hash generator for somewhat faster code (but even by hand it can be made very fast).
'get ' and 'set ' can be checked directly as a single uint32_t rather than with a byte-by-byte comparison....
And let's not talk about the needless hidden calls to memory allocation, instead of using slabs or something more appropriate for the task (strdup in so many places, too).
But that's all heresy. I'm a video game programmer, give me such code and I'll beat it up, except send/recv. So what? So fucking what?
In C you are not necessarily "on your own" - you can do something very similar to Python, turning your socket into a file with fdopen(). You can then use functions like fgets() and fread() on it, and stdio will take care of the buffering for you.
Are we seriously discussing a benchmark that only runs 1000 operations? I don't even understand how it could take 20s to complete in any language on the server side and be correct code. Implement the Redis protocol and use the included redis-benchmark to test your server. On a decent Mac you should be able to hit 500k/s with pipelining and 25k/s without it.
Totally; some things you just don't see until a long time has passed. I haven't read the code, but people mentioned the author is using malloc in a lot of places. You won't see the ill effects of memory fragmentation in a quick benchmark; it could take hours or days to see how large the holes in your memory space get.
Just running (a single instance of) test.py as a benchmark does not make sense.
epoll is optimized for efficiently handling large numbers of sockets, but here there is only one socket. There is no reason an epoll-based loop should be faster than plain blocking socket I/O in that case; if it is, I blame the kernel.
(Incidentally, here on OS X where there is no epoll, all the solutions performed pretty terribly - a few seconds for 50000 iterations.)
The reason the python version is slow is (I believe) that the code is very inefficient. It uses socket.sendall() instead of sockfile.write/flush. Using sockfile.write/flush speeds it up from 50 requests/second to 7k requests/second on my machine.
I'm the author of the original post. The difference has nothing to do with epoll, these comments are correct. Thanks particularly to codeape for his pull request which made me realise this. Sorry everyone. I will fix the post.
Reducing the number of send calls, in both the C and Python versions, makes them enormously faster. Go is already batching up the writes, hence the apparent speed advantage.
If you strace the client, you see that the "get" case was replying with two send calls, one for the "VALUE" line, another with the value and "END". All the time is consumed with the client waiting to receive that second message. Depending on the client, and I tried a bunch of ways, it's either in 'poll' (pylibmc), 'futex' (Go), or 'recv' (basic python socket). That second receive is about two orders of magnitude slower than the previous recv.
Why does reading that second line take so much longer?
This comparison is rather unfair on C, where you have chosen to use a low level interface, against Go, where you have chosen to use a high level interface. It is irrelevant that these are the default interfaces - high level interfaces for sockets exist in C. You could even integrate into Nginx.
The author is up front about there being more optimal and performant designs for Python and C, but remarks that this is a fast, easy, naive implementation. Who here doesn't appreciate those qualities, or hasn't worked on a team where their life would have been easier if that one guy had had a little bit harder time screwing things up?
Careful with that argument. The fast, easy, naïve implementation in Go is quite dangerous (not even memory safe), because he forgot to lock the map. Go made it easy to write an incorrect implementation of a key-value store...
Not everyone has this problem, but Go only works on a portion of platform configurations that are available in the real world.
C and Python, OTOH, are available pretty much anywhere. Redis builds on say, Solaris, with no problem because the project is written in C and it is trivial to add the needed calls. A KV store written in Go can't support Solaris because Go itself would need to support Solaris first.
Years of tooling centered around C (e.g., autoconf/automake) are what make most C programs cross-platform out of the box, with little or no OS-specific code if you stick to POSIX. Until the same ecosystem develops around a new language, authors should realize that the choice of language alone can immediately limit their cross-platform capabilities.
gccgo does support Solaris, and should support any operating system that uses ELF binaries.
The gc Go compiler currently supports FreeBSD, Linux, Mac OS X, and Windows. That's at least 95% of the servers out there (probably more). There is code to support NetBSD, OpenBSD, and Plan 9, but we have held off polishing it for Go 1.
Is it a huge amount of work to support non-ELF formats? I haven't looked at the code, but I guess it is not using bfd? AIX uses XCOFF, not ELF. I'd be interested in finding out what it would take to add support.
gccgo works without gold, albeit not as well; some people have tried it even on Irix. iant told me even Windows might work as is, though nobody has tried. In any case, it should be very easy to make Windows work as well.
I am using Python (as well as Perl, OCaml, etc.) on Solaris/Sparc, AIX/POWER7, and HP-UX/IA64 on a daily basis. As for C code, if you just want to use redis as an example, I built and installed it on Solaris/Sparc with no incident. This is the case for most POSIX-compliant C apps.
Aside: I noticed that redis on Solaris used select() instead of the fancy new "event completion framework" in Solaris 10. I figured the new API might be faster and I could contribute that back, so I ported it to that framework, to poll(), and to /dev/poll (judging from the perf numbers I saw, I suspect the event API just wraps /dev/poll, but it isn't clear), and it turned out that select() is actually faster than any of them, which struck me as a bit odd.
I was actually thinking about writing something very similar as an Erlang C node just a couple of days ago. I noticed that the overhead of storing a mnesia table of 5 million rows of 3 integers was huge - it would take up 1.6 GB in memory! If you know the size of the struct, it should be pretty easy to make a fast lookup system (assuming the keys are sequential) too.
I just happened to start writing a redis clone in Python a few weeks ago. I used the ioloop from Tornado to take advantage of epoll (you can also use gevent). I have yet to benchmark it, but I suspect it will bring you closer to the results you see in Go.
It's pretty silly to stress-test a C or Go-written program using a client written in Python. It's very easy for the stress client to be the bottleneck in benchmarking memcached implementations. Even when the client is written in C.