
Protocol Buffers, Avro, Thrift & MessagePack - igrigorik
http://www.igvita.com/2011/08/01/protocol-buffers-avro-thrift-messagepack/
======
hesdeadjim
I work at an indie game shop and I have pushed us to use protobuf (almost)
everywhere we need to define a file format. Originally we were using JSON for
everything, which is great for the quick-and-dirty approach -- but as our code
base (primarily C++) has grown, I've come to absolutely love the guarantees I
get with protobufs:

\- Strongly typed: no boilerplate error checking in case someone sets my "foo"
field to an integer instead of a string.

\- Easy to version and upgrade: just create new fields, deprecate the old
ones, and move on with life.

\- Protobuf IDLs are both the documentation and the implementation of my file
format -- no docs to write about which fields belong in which object, and no
issues with out-of-sync documentation/code.

\- Reflection support: I don't use this a lot, but when I need it, it's awesome.

\- Variety of storage options. For instance, the level editor I wrote recently
uses the human-readable text format when it saves out levels, but when I'm
ready to ship, I can trivially convert these level files to binary and
immediately improve the performance of my app (see the sketch after this list).

\- Tons of language bindings. Our engine code base is C++, but any build
scripts I write are in Python, and if a script needs to touch protobuf files I
don't have to rewrite my file-parsing routines -- it just works.
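
To give a flavor of the text-to-binary conversion: here's a rough Python
sketch, assuming a hypothetical level_pb2 module generated by protoc from our
level schema (the names are illustrative):

    from google.protobuf import text_format
    import level_pb2  # hypothetical module from: protoc --python_out=. level.proto
    
    level = level_pb2.Level()
    with open("level.textproto") as f:
        text_format.Merge(f.read(), level)  # parse the editor's human-readable output
    with open("level.bin", "wb") as f:
        f.write(level.SerializeToString())  # compact binary for shipping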

I looked into using Apache Thrift as well, but its text-based format is not
human-readable, so it was a non-starter for us.

------
jleader
There's a way to encode a protobuf schema in a protobuf message, making it
possible to send self-describing messages (i.e. include a serialized schema
before each message). I'm not sure if anyone actually does this. See
<http://code.google.com/apis/protocolbuffers/docs/techniques.html#self-description>
for details.

~~~
sigil
Indeed. Here's how from the command line:

    protoc -o schema.pb schema.proto

The meta-schema for compiled .proto files can be found here [1]. In fact, I
know of several protobuf implementations [2] [3] that avoid parsing .proto
files by working with this compiled descriptor form instead.
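
For example, here's a quick way to inspect that compiled form from Python,
using the descriptor_pb2 module that ships with the official bindings:

    from google.protobuf import descriptor_pb2
    
    # schema.pb is the FileDescriptorSet written by `protoc -o schema.pb schema.proto`
    fds = descriptor_pb2.FileDescriptorSet()
    with open("schema.pb", "rb") as f:
        fds.ParseFromString(f.read())
    for file_proto in fds.file:
        print(file_proto.name, [m.name for m in file_proto.message_type])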

I use protobuf as a storage format for record streams. It's nice for binary
data (documents, images, thumbnails, etc.) where JSON string escaping would be
wasteful. Each protobuf record in the file is preceded by a length and a magic
number specifying the record type, and the first record in the file embeds the
compiled .proto schema.

This means it's possible to read and work with a record stream _without_
specifying the schema separately, and without generating any code.
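
Roughly, the framing looks like this (a sketch only -- the magic values here
are made up, and lwpb's actual on-disk format may differ):

    import struct
    
    MAGIC_SCHEMA = 0x53434d41  # hypothetical record-type tags
    MAGIC_RECORD = 0x5245434f
    
    def write_record(f, magic, payload):
        # 4-byte little-endian length, 4-byte magic, then the serialized protobuf
        f.write(struct.pack("<II", len(payload), magic))
        f.write(payload)
    
    def read_records(f):
        while True:
            header = f.read(8)
            if len(header) < 8:
                return  # end of stream
            length, magic = struct.unpack("<II", header)
            yield magic, f.read(length)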

Disclaimer: I'm the author of the lwpb Python protobuf library.

[1] <http://code.google.com/p/protobuf/source/browse/trunk/src/google/protobuf/descriptor.proto>

[2] <https://github.com/acg/lwpb>

[3] <https://github.com/haberman/upb>

------
mikeklaas
> (Word of warning: historically, Thrift has not been consistent in their
> feature support and performance across all the languages, so do some
> research).

Conversely, we chose Thrift over protobuf for this reason. Protobuf's Python
performance was _abysmal_ -- over 10x worse than Thrift's.

~~~
sigil
> Protobuf's Python performance was abysmal -- over 10x worse than Thrift's.

This is still true of the official implementation from Google, which is pure
Python. That's why I wrote the Python half of lwpb [1], and why the Greplin
guys wrote fast-python-pb [2]. Both are about 10x faster than Google's Python
implementation, so I'd guess they're on par with Thrift now.

[1] <https://github.com/acg/lwpb>

[2] <https://github.com/Greplin/fast-python-pb/tree/master/benchmark>

~~~
thadeus_venture
Does anyone happen to know if there is a version of the Riak Python driver
that does not use Google's protobuf implementation? I'm looking into using
Riak, and hearing this does not exactly instill confidence, as Riak uses
Google's protobuf implementation and HTTP as its two communication protocols.

~~~
grncdr
Take this for what it is (anecdotal evidence), but I found that write
performance with Python + Riak was abysmal, while read performance wasn't _as_
bad.

Unfortunately I didn't have time to hack the Riak driver to use a non-Google
protocol buffer implementation before my deadlines, but it _should_ be doable.

------
haberman
> Hence it should not be surprising that PB is strongly typed, has a separate
> schema file, and also requires a compilation step to output the language-
> specific boilerplate to read and serialize messages.

I've spent the last two years working on a Protocol Buffer implementation that
does _not_ have these limitations. Using my implementation, upb
(<https://github.com/haberman/upb/wiki>), you can import schema definitions at
runtime (or even define your own schema using a convenient API _instead_ of
writing a .proto file), with no loss of efficiency compared to pre-compiled
solutions. (My implementation isn't quite usable yet, but I'm working on a Lua
extension as we speak.)
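
upb's API is C, but the same runtime-schema idea can be sketched with the
official Python bindings (this is not upb's actual API, and the factory call
has moved around between protobuf releases):

    from google.protobuf import descriptor_pb2, descriptor_pool, message_factory
    
    # Build a schema by hand instead of writing a .proto file.
    pool = descriptor_pool.DescriptorPool()
    file_proto = descriptor_pb2.FileDescriptorProto(name="point.proto")
    msg = file_proto.message_type.add()
    msg.name = "Point"
    for number, field_name in enumerate(("x", "y"), start=1):
        field = msg.field.add()
        field.name = field_name
        field.number = number
        field.type = descriptor_pb2.FieldDescriptorProto.TYPE_INT32
        field.label = descriptor_pb2.FieldDescriptorProto.LABEL_OPTIONAL
    pool.Add(file_proto)
    
    # Newer releases use message_factory.GetMessageClass instead of GetPrototype.
    Point = message_factory.MessageFactory(pool).GetPrototype(
        pool.FindMessageTypeByName("Point"))
    p = Point(x=3, y=4)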

I'm also working on making it easy to parse JSON into protocol buffer data
structures, so that in cases where JSON is _de facto_ typed (which I think is
quite often the case), you can use Protocol Buffers as your data analysis
platform even if your on-the-wire format is JSON. The benefits are greater
efficiency (protobufs can be stored as structs instead of hash tables) and
convenient type checking / schema validation.
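
As a taste of this, the official Python bindings nowadays ship a json_format
helper that does a version of it (again just a sketch, not upb's JSON path;
this reuses the runtime-built Point class from above):

    from google.protobuf import json_format
    
    p = json_format.Parse('{"x": 3, "y": 4}', Point())
    print(p.x + p.y)  # the JSON fields now live in a typed struct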

The question of how to represent and serialize data in an interoperable way
began with XML, evolved to JSON, and IMO will converge on a mix of JSON and
Protocol Buffers. Having a schema is useful: you can see evidence of this in
the fact that every format eventually develops a schema language to go along
with it (XML Schema, JSON Schema). Protocol Buffers hit a sweet spot between
simplicity and capability with their schema definition.

Once you have defined a data model and a schema language, serialization
formats are commodities. Protocol Buffer binary format and JSON just happen to
be two formats that can both serialize trees of data that conform to a .proto
file. On the wire they have different advantages/disadvantages (size, parsing
speed, human readability) but once you've decoded them into data structures
the differences between them can disappear.

If you take this idea even farther, you can consider column-striped databases
to be just another serialization format for the same data. For example, the
Dremel database described by Google's paper
(<http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf>)
also uses Protocol Buffers as its native schema, so your core analysis code
could be written to iterate over either row-major logfiles or a column-major
database like Dremel _without having to know the difference_, because in both
cases you're just dealing with Protocol Buffer objects.

I think this is an extremely powerful idea, and it is the reason I have put so
much work into upb. To take this one step further, I think that Protocol
Buffers also represent parse trees very well: you can think of a domain-
specific language as a human-friendly serialization of the parse tree for that
DSL. You can really nicely model text-based protocols like HTTP as Protocol
Buffer schemas:

    message HTTPRequest {
      enum Action {
        GET = 0;
        POST = 1;
        // ...
      }
      optional Action action = 1;
      optional string url = 2;
      message Header {
        optional string name = 1;
        optional string value = 2;
      }
      repeated Header header = 3;
      // ...
    }

Everything is just trees of data structures. Protocol Buffers are just a
convenient way of specifying a schema for those data structures. Parsers for
both binary and text formats are just ways of turning a stream of bytes into
trees of structured data.

~~~
jhammerb
Avro defines both a row-major and a column-major layout for files:
<https://issues.apache.org/jira/browse/AVRO-806>.

------
zoowar
This article provides a nice discussion of "protocol buffer misfeatures":

<http://blog.golang.org/2011/03/gobs-of-data.html>

------
angstrom
Having used Thrift and PB, I find myself partial to Avro because it's very
similar to what I've worked with in the past. Key points about Avro:

The schema is separate from the data. Network bandwidth is expensive; there is
no reason to transfer the schema with each request.

The data format can change, allowing the server to support multiple client
versions within a reasonable range, rather than following Thrift's model of
always adding fields and making them null when no longer necessary.
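
A rough sketch of both points, using the fastavro package (not the Avro
reference implementation; the schemas and names here are illustrative):

    import io
    import fastavro
    
    writer_schema = {"type": "record", "name": "User", "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ]}
    
    # Only the record bytes travel; the schema is agreed on out of band.
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, writer_schema, {"id": 1, "name": "ann"})
    
    # A newer reader schema (extra field with a default) still reads old data.
    reader_schema = {"type": "record", "name": "User", "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ]}
    buf.seek(0)
    print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))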

It's possible to layer a higher-level abstraction above a lower-level request
abstraction, supporting more complex objects without generating complex
serialization code.

------
ryanpers
Avro is interesting, but it is VERY complex. The schema negotiation is
something I never felt comfortable with in an RPC context. You want something
simple, and therefore bug-free. This is why we like HTTP and JSON.

My vote goes to Thrift, but there are some interesting features in protobuf,
notably the APIs for dealing with messages you don't know everything about,
and the way unknown fields are preserved when you copy messages from A to B;
this is in part how Dapper works.

------
rlpb
Not requiring a schema isn't necessarily an advantage. If I'm receiving data
from something that isn't in the same security domain, then I _want_ a schema.

------
frsyuki
Technically, there are two important differences:

\- Statically typed vs. dynamically typed

\- The type mapping between the language's type system and the serializer's
type system (note: these serializers are cross-language)

The most visible difference is "statically typed" vs. "dynamically typed",
which affects how you manage compatibility between data and programs.
Statically typed serializers don't store detailed type information about
objects in the serialized data, because it is already described in the source
code or IDL. Dynamically typed serializers store type information alongside
the values.

\- Statically typed: Protocol Buffers, Thrift, XDR

\- Dynamically typed: JSON, Avro, MessagePack, BSON

Generally speaking, statically typed serializers can store objects in fewer
bytes, but they can't detect a mismatch between the data and the IDL: they
must trust that the IDL is correct, since the data doesn't include type
information. This means statically typed serializers are high-performance, but
you must take great care with compatibility between data and programs.

Note that some serializers offer their own mitigations for these problems.
Protocol Buffers store some (coarse) type information in the data, so a
mismatch between the IDL and the data can be detected. MessagePack stores type
information in an efficient format, so its data size can be smaller than
Protocol Buffers or Thrift (depending on the data).
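
For example, with the msgpack-python package the type tags travel with the
data, so no IDL is needed to decode:

    import msgpack
    
    packed = msgpack.packb({"id": 42, "name": "foo"})
    print(len(packed), msgpack.unpackb(packed))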

Type systems are also an important difference. The following list compares the
type systems of Protocol Buffers, Avro, and MessagePack:

\- Protocol Buffers: int32, int64, uint32, uint64, sint32, sint64, fixed32,
fixed64, sfixed32, sfixed64, double, float, bool, string, bytes, repeated,
message [1]

\- Avro: int, long, float, double, boolean, null, bytes, fixed, string, enum,
array, map, record [2]

\- MessagePack: Integer, Float, Boolean, Nil, Raw, Array, Map (the same as
JSON) [3]

Serializers must map these types to and from the language's types to achieve
cross-language compatibility. This means that some types supported by your
favorite language can't be stored by some serializers, while too many types
may cause interoperability problems. For example, Protocol Buffers don't have
a map (dictionary) type. Avro doesn't distinguish unsigned integers from
signed ones, while Protocol Buffers do. Avro has an enum type, while
MessagePack doesn't.

These choices reflect each designer's needs: Protocol Buffers were initially
designed for C++, Avro for Java, and MessagePack aims for interoperability
with JSON.

I'm using MessagePack to develop our new web service. Dynamic typing and JSON
interoperability are requirements for us.

[1] <http://code.google.com/apis/protocolbuffers/docs/proto.html#scalar>

[2] <http://avro.apache.org/docs/1.5.1/spec.html#schema_primitive>

[3] <http://wiki.msgpack.org/display/MSGPACK/Format+specification>

~~~
frsyuki
I've written about the difference between MessagePack and BSON:

"Performant Entity Serialization: BSON vs MessagePack (vs JSON)"
<http://stackoverflow.com/questions/6355497/performant-entity-serialization-bson-vs-messagepack-vs-json>

------
famousactress
I wasn't aware of Avro, but I often find myself wanting something in exactly
that sweet spot. Hopefully it'll be a more successful project than Caucho's
Hessian has been, which I've found pretty poorly stewarded.

------
equark
How aggressively is the Apache stack moving to Avro?

~~~
igrigorik
AFAIK, Hadoop is converging on Avro -- I'm not sure about the state of each
component within Hadoop with respect to Avro, but in the long term I expect it
will be _the_ protocol.

~~~
ryanpers
Hadoop is NOT converging on Avro; there are significant political barriers,
and at the very least Owen is dead set against it. The comments on the latest
JIRAs are rather pessimistic.

~~~
ryanpers
e.g. <https://issues.apache.org/jira/browse/HADOOP-6659> -- still open, with
no movement for over a year.

Patches have been rolled back, commits vetoed, and the direction seems to be
providing a pluggable thrift/protobuf/? RPC option, which seems weird to me.
In any case, there has been a lot of pushback against Avro.

Also, given its complexity, I don't think I can recommend it for RPC.

