As I said earlier on the mailing list, I suspect (without having looked into it) that the cause of the I/O performance slowdown relative to C++ is something related to buffering: perhaps the I/O is not being buffered, or the buffer isn't working correctly. This would be consistent with the serialization-based I/O showing the larger slowdowns, since serialization issues many small writes, and without buffering each one becomes its own call to write(2). If so, this should be fixable.
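For illustration only (this is modern Rust, not the 2013 standard library under discussion, and `send` is a hypothetical helper): the usual fix is to wrap the stream in a `std::io::BufWriter`, so many small serializer writes coalesce into few syscalls.

```rust
use std::io::{BufWriter, Write};
use std::net::TcpStream;

// Without buffering, every small write from the serializer becomes its own
// write(2) syscall. BufWriter coalesces them in memory and flushes roughly
// once per 8 KiB (the default capacity).
fn send(stream: TcpStream, chunks: &[&[u8]]) -> std::io::Result<()> {
    let mut out = BufWriter::new(stream);
    for chunk in chunks {
        out.write_all(chunk)?; // lands in the in-memory buffer, not the kernel
    }
    out.flush() // pushes whatever remains buffered to the kernel
}
```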
I'm pleasantly surprised to see the object mode slightly faster than C++.
I should note that the Rust version currently omits some features of the C++ version, such as read-limiting and the actual counting of the throughput. These things are cheap, but they may explain why Rust is faster in that one case.
We did some benchmarking on IRC today in light of this post, and we found that Rust's stdio is currently quite slow because a flag is not being set properly in libuv, which causes it to punt to a thread pool. A fix is currently in the queue: https://github.com/mozilla/rust/pull/10558
If you get some time, I'd appreciate a blog post or a comment describing what the issue was on the Rust side as well as on the libuv side. Upvoted in advance.
I had thought that serialization formats such as Protobuf, Thrift, BSON, and MessagePack compress their output as much as possible, on the theory that reducing the amount of I/O is the ultimate win for overall performance, more so than fast decoding via memory alignment.
Seeing Cap'n Proto, I'm confused about which approach is right. Alignment is not an exotic technique, so if saving I/O weren't worth it, why didn't those formats align their data?
Or am I totally misunderstanding these implementations?
Well, it depends on the environment. If you are doing interprocess communication, then I/O bandwidth is obviously not a concern at all. On the other hand, over the internet, it clearly is the biggest concern. For intra-datacenter traffic on a 10Gbit NIC, it's harder to say, but it _probably_ isn't the bottleneck. Cap'n Proto supports both cases by making additional packing optional, so you can choose the best trade-off for your application.
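As a sketch of what that choice looks like (using today's capnproto-rust crate API, which may differ from the 2013 code; `encode_both` is a hypothetical helper), the same message builder can be written either word-aligned or with the optional packing:

```rust
use capnp::message::{Builder, HeapAllocator};
use capnp::{serialize, serialize_packed};

// Serialize the same message both ways, assuming `message` was populated
// elsewhere via a generated schema type.
fn encode_both(message: &Builder<HeapAllocator>) -> capnp::Result<(Vec<u8>, Vec<u8>)> {
    let mut raw = Vec::new();
    serialize::write_message(&mut raw, message)?; // word-aligned, CPU-cheap

    let mut packed = Vec::new();
    serialize_packed::write_message(&mut packed, message)?; // zero-byte packing, smaller on the wire

    Ok((raw, packed))
}
```

For loopback IPC you'd pick the raw form; for a WAN link, the packed form usually pays for its extra CPU.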
Regarding the other formats you mention, I think you may be imagining that the designers of these protocols thought about them more carefully than they really did. Protobuf, for example, was designed pretty ad hoc to solve an immediate problem in Google's search infrastructure, and then stuck around mostly because, as more and more things used it, it was easier to keep using it than to start over. The designers readily acknowledge that it is not an ideal format -- in fact, there are other ways they could have done the encoding that would have taken no more space but would have saved significant CPU time.
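To make the CPU-versus-space trade concrete (an illustrative sketch, not code from either project): a protobuf-style varint spends a data-dependent branch on every byte it decodes, whereas Cap'n Proto's fixed-width, word-aligned fields can be read with a single aligned load.

```rust
// LEB128-style varint, as used by protobuf: small values take fewer bytes,
// but decoding must inspect a continuation bit on each byte.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80); // low 7 bits, continuation bit set
        value >>= 7;
    }
    out.push(value as u8); // final byte, continuation bit clear
}

fn decode_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate().take(10) { // a u64 needs at most 10 bytes
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1)); // value plus number of bytes consumed
        }
    }
    None // truncated or over-long input
}
```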
(Disclosure: I was the maintainer of protobufs for a long time, though not the original creator. I am also the author of Cap'n Proto.)
I compiled libcapnp with the latest Clang that ships with Xcode. It would perhaps be fairer to also compile the C++ benchmarks with Clang, instead of the MacPorts gcc 4.8 that I'm using, but unfortunately Clang barfs on some template hackery in the benchmark driver.
That would be quite useful. Otherwise it's very hard to tease apart the differences between the GCC and LLVM optimization back ends versus the Rust and C++ front ends. I feel like I've seen benchmarks showing speed differences of around 5-20% between LLVM and GCC, so it will have an effect.