
Simple Binary Encoding - dekayed
http://mechanical-sympathy.blogspot.com/2014/05/simple-binary-encoding.html
======
jamesaguilar
Curious to see how this stacks up with other generic binary formats. Off the
top of my head, Protocol Buffers can be made to have this kind of wire format
(root fields in front, variable fields in the back), but I don't know if
partial decoding or streaming decoding is possible, and it definitely requires
some user effort to make it this way.

~~~
rch
> ~25X greater throughput than Google Protocol Buffers (GPB) with very low and
> predictable latency

Seems like it might take more than structure tweaking to overcome the quoted
delta. I'd still like to see a general comparison too though, especially to
Cap'n Proto on the one side and HDF5 on the other.

~~~
jdf
Good point, although it's not like the stock protobuf implementation (rather
than protocol) is the best that can be had. For example, this implementation
of protobuf deserialization is quite a bit faster:

[https://github.com/haberman/upb](https://github.com/haberman/upb)

[http://blog.reverberate.org/2011/04/upb-status-and-prelimina...](http://blog.reverberate.org/2011/04/upb-status-and-preliminary-performance.html)

~~~
haberman
Hi there, upb author here, thanks for the shout-out!

The stock protobuf implementation is pretty close to optimal given what it
does. Where upb is faster, it is faster by doing less.

For example, I can beat protobuf if you don't write the parsed data into a
tree structure (or only write a few fields). I can beat it if your input has
unknown fields and you don't care about preserving them. But if what you need
is a tree structure that contains 100% of the input data, protobuf is hard to
beat speed-wise (though there are still a few tricks, like arena allocation,
that can beat it given some usage patterns).

------
haberman
This sounds very similar in concept to Cap'n Proto, and given how much hard-
earned experience Kenton has and how much time he's put into it, I'd put my
bets on Cap'n Proto if you're looking for something in this design space.

I haven't looked at SBE in detail, but given the high-level description I
expect it would have a similar set of trade-offs vs. Protocol Buffers as Cap'n
Proto does: it will have unparalleled serialize/deserialize performance (since
they are basically no-ops), but sparse messages (where only a few of many
defined fields are set) will be arbitrarily larger than protobufs, so a
separate compression step will be necessary to be size-competitive, at which
point some of the speed advantage is lost.

~~~
kasey_junk
Kenton Varda deserves the respect you are giving him when it comes to
serialization libraries. Cap'n Proto is a great project that I'd love to work
with anytime. That said, Martin Thompson whose blog this is a link to is also
very well respected in his particular domain (low latency messaging systems
especially on the JVM and JVM/C++ hybrid space).

All that's to say, the Cap'n Proto and SBE libraries have exceedingly similar
design philosophies and pedigrees. I expect them to have near performance
parity in the shared language of C++. But if you are looking for a JVM/C#
solution I think you can pretty reasonably choose to start working with SBE
today as it provides those language bindings out of the gate and there is
unlikely to be such a performance mismatch between SBE and Cap'n Proto in the
future for it to matter long term.

~~~
haberman
I can totally believe that Martin Thompson is highly capable given what I read
in the article. I guess my bias is against a proliferation of similar-but-
incompatible messaging formats. But I suppose that if Cap'n Proto doesn't
support JVM/C#, then that doesn't leave someone on those platforms a lot of
options.

~~~
kentonv
There have been a lot of false starts on getting Java support implemented (people
saying they'd do it, and then not), and one or two for C# as well. I wish I
had time to just sit down and crank out a Java implementation myself -- I
think it would take a week or two -- but right now I really need to be putting
everything into Sandstorm.io. :/

It will happen eventually, though.

~~~
kasey_junk
I think one problem with this is that a JNI solution would be "trivial" but is
wholly unsatisfying (JNI being somewhat incompatible with the performance
goals). The Unsafe and/or ByteBuffer solution on the other hand would be a lot
of work to create and maintain (so much so that I wouldn't sign up for the
job).

SBE on the other hand is targeting a space dominated by C++/Java/C#. They can
basically ignore every other language binding without losing any of their
target audience. It also doesn't hurt that they have a patron with a financial
motivation to back them...

------
ubercode5
I would be interested to see a binary size comparison along with the speed
comparison. Is there a trade-off of speed for byte size?

~~~
kentonv
Compared to Protobuf, there is definitely going to be a trade-off. Protobuf
uses variable-width integer encoding whereas SBE uses only fixed-width
integers which means there will be a lot of zero-valued high-order bytes and
padding taking up space. But variable-width encoding is costly because it
involves a lot of unpredictable branching -- this is exactly where SBE's speed
advantage comes from. The exact amount of overhead will vary depending on
usage; it could be about the same size, or it could be many times larger.
(This is a general problem with benchmarks of serialization systems: the
difference is highly dependent on use case, so you should never assume your
own application will see the same kind of performance as any particular
benchmark.)
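
A minimal sketch of the branching cost kentonv describes (hypothetical illustration, not protobuf's or SBE's actual code): a protobuf-style varint encoder loops with a data-dependent exit, one branch per byte, while a fixed-width write is a single branch-free store at a known offset.

```java
import java.nio.ByteBuffer;

public class Widths {
    // Protobuf-style unsigned varint: 7 payload bits per byte, high bit set
    // while more bytes follow. The loop's exit depends on the value being
    // encoded, so the branch predictor cannot settle into a pattern when
    // values of mixed sizes stream through. Returns bytes written.
    static int writeVarint(ByteBuffer buf, long value) {
        int n = 0;
        while ((value & ~0x7FL) != 0) {
            buf.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
            n++;
        }
        buf.put((byte) value);
        return n + 1;
    }

    // SBE-style fixed width: one store at a fixed offset, no branches, at
    // the cost of carrying zero-valued high-order bytes on the wire.
    static void writeFixed64(ByteBuffer buf, int offset, long value) {
        buf.putLong(offset, value);
    }
}
```

Small values pay the size cost of fixed width (eight bytes to encode 1) but every value pays the same, predictable CPU cost.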

Cap'n Proto also uses fixed-width encoding, but offers an optional "packing"
step that is basically a really fast compression pass that aims only to remove
zeros. I've found that this usually achieves similar sizes to Protobufs while
still being faster, because the packing algorithm can be implemented in a
tight loop whereas a Protobuf encoder usually involves sprawling generated
code that does all kinds of things. SBE could pretty easily adopt Cap'n
Proto's packing algorithm.
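
The zero-removal idea can be sketched roughly as follows (a simplified illustration of the scheme, not Cap'n Proto's actual implementation; the real format adds run-length special cases for all-zero and all-nonzero words): for each 8-byte word, emit one tag byte whose bit i records whether byte i is nonzero, then emit only the nonzero bytes.

```java
import java.io.ByteArrayOutputStream;

public class Pack {
    // Pack an input whose length is a multiple of 8. Zero-heavy fixed-width
    // messages shrink dramatically: an all-zero word becomes a single 0x00
    // tag byte. The whole pass is one tight, predictable loop.
    static byte[] pack(byte[] words) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int w = 0; w < words.length; w += 8) {
            int tag = 0;
            for (int i = 0; i < 8; i++)
                if (words[w + i] != 0) tag |= 1 << i;
            out.write(tag);
            for (int i = 0; i < 8; i++)
                if (words[w + i] != 0) out.write(words[w + i]);
        }
        return out.toByteArray();
    }
}
```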

SBE is probably slightly leaner than Cap'n Proto because they don't use
pointers. Instead, SBE outputs the message tree in preorder. Only fixed-width
elements are found at fixed offsets; the rest have to be found by traversing
the tree. The downside of this is that if you want to find one element in a
list, you must first traverse all of the elements before it (and all of their
transitive children). Cap'n Proto's approach using pointers allows you to
process input in the order that is most convenient for you (rather than in the
exact order in which it was written) and makes it easy to seek directly to one
particular element (which is awesome when you're mmap()ing a huge database),
but does have a slight cost in terms of message size.
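
The two access patterns can be contrasted in a hypothetical sketch (invented layout for illustration, not SBE's or Cap'n Proto's actual wire format): with variable-length elements laid out in order, reaching element i means skipping everything before it; with an up-front offset table, one lookup seeks straight there.

```java
import java.nio.ByteBuffer;

public class Access {
    // Preorder style: each element is a 4-byte length prefix followed by
    // its payload. Finding element `index` costs O(index) skips.
    static int offsetOfElement(ByteBuffer buf, int start, int index) {
        int pos = start;
        for (int i = 0; i < index; i++)
            pos += 4 + buf.getInt(pos);
        return pos;
    }

    // Pointer style: a table of absolute offsets gives an O(1) seek,
    // at the cost of storing the table itself in the message.
    static int offsetViaTable(ByteBuffer buf, int tableStart, int index) {
        return buf.getInt(tableStart + 4 * index);
    }
}
```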

~~~
kasey_junk
I'd add that SBE is targeting the FIX community, where there are many message
types but a bounded number of them, and where lists (or lists of lists) are
rare and already segregated to the parts of systems that people associate with
lower performance requirements.

Cap'n Proto does not seem to have this focus (forgive me for assumptions about
intentions) and therefore is more designed around supporting the general data
structure case.

I'd say that in the cases where the performance differences between these two
systems really matter, you would be unable to make a fair judgement without
implementing your real-world problem in both and testing the differences. That
said, I had an argument today with a colleague about stop-bit encoding vs.
fixed-width encoding, and without testing real-world solutions I was hesitant
to make any claims, even though my experience and bias lean toward fixed-width
encoding.

As for Protobuf, it has been a while since I've worked with it, but when I did
I had two specific performance bottlenecks (on the JVM) other than the
variability: 1) encoding non-primitive fields, and 2) not being able to do
allocation-free programming with it.
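
The allocation-free style that SBE's generated codecs enable on the JVM can be sketched as a flyweight over a ByteBuffer (field names and offsets invented for illustration, not SBE's actual generated API): the decoder is a reusable view, so reading each incoming message allocates nothing on the hot path.

```java
import java.nio.ByteBuffer;

public class QuoteFlyweight {
    private ByteBuffer buffer;
    private int offset;

    // Re-point the same flyweight at each incoming message; no new objects
    // are created per message, so the hot path stays GC-free.
    public QuoteFlyweight wrap(ByteBuffer buffer, int offset) {
        this.buffer = buffer;
        this.offset = offset;
        return this;
    }

    // Fields are read directly at fixed offsets into the buffer.
    public long price()    { return buffer.getLong(offset); }     // bytes 0-7
    public int  quantity() { return buffer.getInt(offset + 8); }  // bytes 8-11
}
```

Protobuf's generated Java classes, by contrast, materialize a fresh message object (and child objects) per parse, which is the second bottleneck described above.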

