Protocol Buffers, Avro, Thrift & MessagePack

hesdeadjim · on Aug 1, 2011

I work at an indie game shop and I have pushed us to use it (almost) everywhere we need to define a file format. Originally we were using JSON for everything, which is great for the quick and dirty approach -- but as our code base (primarily C++) has grown I absolutely love the guarantees I get with protobufs:

- Strongly typed, no boilerplate error checking if someone set my "foo" field on my object to an integer instead of a string

- Easy to version and upgrade, just create new fields, deprecate the old ones, and move on with life.

- Protobuf IDLs are the documentation and implementation of my file format -- no docs to write about what fields belong in what object and no issues with out of sync documentation/code.

- Reflection support, don't use this a lot, but when I need it it's awesome.

- Variety of storage options. For instance the level editor I wrote recently uses the human-readable text format when it saves out levels. But when I am ready to ship, I can trivially convert these level files to binary and immediately improve the performance of my app.

- Tons of language bindings. Our engine code base is C++, but any build scripts I write are done in Python and if my script needs to touch protobuf files I don't have to rewrite my file parsing routines -- it just works.

I looked into using Apache Thrift as well, but their text-based format is not human readable so it was a non-starter for us.

jleader · on Aug 1, 2011

There's a way to encode a protobuf schema in a protobuf message, making it possible to send self-describing messages (i.e. include a serialized schema before each message). I'm not sure if anyone actually does this. See http://code.google.com/apis/protocolbuffers/docs/techniques.... for details.

sigil · on Aug 1, 2011

Indeed. Here's how from the command line:

    protoc -o schema.pb schema.proto

The meta-schema for compiled .proto files can be found here [1]. In fact, I know of several protobuf implementations [2] [3] that avoid parsing .proto files by working with their compiled message form.

I use protobuf as a storage format for record streams. It's nice for binary data (documents, images, thumbnails, etc) where JSON string escaping would be wasteful. Each protobuf record in the file is preceded by a length and a magic number specifying record type, and the first record in the file embeds the compiled .proto schema.

This means it's possible to read and work with a record stream without specifying the schema separately, and without generating any code.

Disclaimer: I'm the author of the lwpb Python protobuf library.

[1] http://code.google.com/p/protobuf/source/browse/trunk/src/go...

[2] https://github.com/acg/lwpb

[3] https://github.com/haberman/upb

mikeklaas · on Aug 1, 2011

> (Word of warning: historically, Thrift has not been consistent in their feature support and performance across all the languages, so do some research).

Conversely, we chose Thrift over protobuf for this reason. Protobuf's python performance was abysmal — over 10x worse than Thrift.

sigil · on Aug 2, 2011

> Protobuf's python performance was abysmal — over 10x worse than Thrift.

This is still true of the official implementation from Google, which is pure Python. That's why I wrote the Python half of lwpb [1], and why the Greplin guys wrote fast-python-pb [2]. Both are about 10x faster than the Google Python implementation, so I'd guess on par with Thrift now.

[1] https://github.com/acg/lwpb

[2] https://github.com/Greplin/fast-python-pb/tree/master/benchm...

thadeus_venture · on Aug 2, 2011

Does anyone by any chance know if there is a version of the riak python driver that does not use google's protobuf implementation? I'm looking into using riak and hearing this does not exactly instill confidence, as riak uses google's protobuf implementation and http as its two communication protocols.

grncdr · on Aug 2, 2011

Take this for what it is (anectdotal evidence) but I found that write performance with python+riak was abysmal, while read performance wasn't as bad.

Unfortunately I didn't have time to hack the riak driver to use a non-google protocol buffer implementation before my deadlines, but it should be do-able.

X-Istence · on Aug 2, 2011

Google has an experimental but in my experimenting very stable version of PB that is a C++ extension for Python so you get pretty much the same speed as C++.

nikcub · on Aug 1, 2011

which is interesting since appengine uses protobuf in python throughout the api

you'd think they have nailed the implementation

haberman · on Aug 1, 2011

> Hence it should not be surprising that PB is strongly typed, has a separate schema file, and also requires a compilation step to output the language-specific boilerplate to read and serialize messages.

I've spent the last two years working on a Protocol Buffer implementation that does not have these limitations. Using my implementation upb (https://github.com/haberman/upb/wiki) you can import schema definitions at runtime (or even define your own schema using a convenient API instead of writing a .proto file), with no loss of efficiency compared to pre-compiled solutions. (My implementation isn't quite usable yet, but I'm working on a Lua extension as we speak).

I'm also working on making it easy to parse JSON into protocol buffer data structures, so that in cases where JSON is de facto typed (which I think is quite often the case) you can use Protocol Buffers as your data analysis platform even if your on-the-wire data is JSON. The benefits are greater efficiency (protobufs can be stored as structs instead of hash tables) and convenient type checking / schema validation.

This question of how to represent and serialize data in an interoperable way is a path that began with XML, evolved to JSON, and IMO will converge on a mix of JSON and Protocol Buffers. Having a schema is useful: you get evidence of this from the fact that every format eventually develops a schema language to go along with it (XML Schema, JSON Schema). Protocol Buffers hit a sweet spot between simplicity and capability with its schema definition.

Once you have defined a data model and a schema language, serialization formats are commodities. Protocol Buffer binary format and JSON just happen to be two formats that can both serialize trees of data that conform to a .proto file. On the wire they have different advantages/disadvantages (size, parsing speed, human readability) but once you've decoded them into data structures the differences between them can disappear.

If you take this idea even farther, you can consider column-striped databases to be just another serialization format for the same data. For example, the Dremel database described by Google's paper (http://static.googleusercontent.com/external_content/untrust...) also uses Protocol Buffers as its native schema, so your core analysis code could be written to iterate over either row-major logfiles or a column-major database like Dremel without having to know the difference, because in both cases you're just dealing with Protocol Buffer objects.

I think this is an extremely powerful idea, and it is the reason I have put so much work into upb. To take this one step further, I think that Protocol Buffers also represent parse trees very well: you can think of a domain-specific language as a human-friendly serialization of the parse tree for that DSL. You can really nicely model text-based protocols like HTTP as Protocol Buffer schemas:

    message HTTPRequest {
      enum Action {
        GET = 0;
        POST = 1;
        // ...
      }
      optional Action action = 1;
      optional string url = 2;
      message Header {
        optional string name = 1;
        optional string value = 2;
      }
      repeated Header header = 3;
      // ...
    }

Everything is just trees of data structures. Protocol Buffers are just a convenient way of specifying a schema for those data structures. Parsers for both binary and text formats are just ways of turning a stream of bytes into trees of structured data.

jhammerb · on Aug 3, 2011

Avro defines both a row-major and a column-major layout for files: https://issues.apache.org/jira/browse/AVRO-806.

zoowar · on Aug 2, 2011

This article provides a nice discussion on "protocol buffer misfeatures"

http://blog.golang.org/2011/03/gobs-of-data.html

angstrom · on Aug 1, 2011

Having used Thrift and PB I find myself partial to Avro because it's very similar to what I worked with in the past. Key points about Avro:

Schema is separate from data. Network bandwidth is expensive, there is no reason to transfer the schema with each request.

Data format can change allowing server to support multiple client versions in a reasonable range. Rather than using Thrift's model of always adding fields and making them Null when no longer necessary.

It's possible to layer a higher layer abstraction above a lower layer request abstraction to support more complex objects without generating complex serialization.

ryanpers · on Aug 2, 2011

Avro is interesting but it is VERY complex. The schema negotiation is something I never felt comfortable with in an RPC context. You want something simple, and therefore bug free. This is why we like HTTP and JSON.

My vote goes for Thrift, but there are some features of protobuf which are interesting, notably the APIs for dealing with messages you dont know everything about, and ways to preserve unknown fields when you copy messages from A->B, this is in part how dapper works.

rlpb · on Aug 1, 2011

Not requiring a schema isn't necessarily an advantage. If I'm receiving data from something that isn't in the same security domain, then I want a schema.

frsyuki · on Aug 2, 2011

Technically, there are 2 important differences:

- Statically typed or dynamically typed

- Type mapping between language's type system and serializer's type system (Note: these serializers are cross-language)

The most understandable difference is "statically typed" vs "dynamically typed". It affects that how to manage compatibility of data and programs. Statically typed serializers don't store detailed type information of objects into the serialized data, because it is explained in source codes or IDL. Dynamically typed serializers store type information by the side of values.

- Statically typed: Protocol Buffers, Thrift, XDR

- Dynamically typed: JSON, Avro, MessagePack, BSON

Generally speaking, statically typed serializers can store objects in fewer bytes. But they they can't detect errors in the IDL (=mismatch of data and IDL). They must believe IDL is correct since data don't include type information. It means statically typed serializers are high-performance but you must strongly care about compatibility of data and programs.

Note that some serializers have original improvements for the problems. Protocol Buffers store some (not detailed) type information into data. Thus it can detect mismatch of IDL and data. MessagePack stores type information in effective format. Thus its data size becomes smaller than Protocol Buffers or Thrift (depends on data).

Type systems are also important difference. Following list compares type systems of Protocol Buffers, Avro and MessagePack:

- Protocol Buffers: int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64, double, float, bool, string, bytes, repeated, message [1]

- Avro: int, long, float, double, boolean, null, float, double, bytes, fixed, string, enum, array, map, record [2]

- MessagePack: Integer, Float, Boolean, Nil, Raw, Array, Map (=same as JSON) [3]

Serializers must map these types into/from language's types to achieve cross-language compatibility. It means that some types supported by your favorite language can't be stored by some serializers. Or too many types may cause interoperability problems. For example, Protocol Buffers doesn't have map (dictionary) type. Avro doesn't tell unsigned integers from signed integers, while Protocol Buffers does. Avro has enum type, while Protocol Buffers and MessagePack don't have.

It was necessary for their designers. Protocol Buffers are initially designed for C++ while Avro for Java. MessagePack aims interoperability with JSON.

I'm using MessagePack to develop our new web service. Dynamically typed and JSON interoperability are required for us.

[1] http://code.google.com/apis/protocolbuffers/docs/proto.html#...

[2] http://avro.apache.org/docs/1.5.1/spec.html#schema_primitive

[3] http://wiki.msgpack.org/display/MSGPACK/Format+specification

frsyuki · on Aug 2, 2011

I've described about the difference between MessagePack and BSON:

"Performant Entity Serialization: BSON vs MessagePack (vs JSON)" http://stackoverflow.com/questions/6355497/performant-entity...

frsyuki · on Aug 2, 2011

(Note: I'm author of MessagePack)

famousactress · on Aug 1, 2011

Wasn't aware of Avro, but I often find myself wanting something in exactly that sweet-spot. Hopefully it'll be a more successful project than Caucho's Hessian has been.. which I've found pretty poorly stewarded.

equark · on Aug 1, 2011

How aggressively is the Apache stack moving to Avro?

igrigorik · on Aug 1, 2011

AFAIK, Hadoop is converging on Avro -- not sure about the state of each component within Hadoop with respect to avro, but in long term, I expect it will be the protocol.

ryanpers · on Aug 2, 2011

Hadoop is NOT converging on avro, there are significant political barriers, and at the very least Owen is dead set against it. The comments on the latest JIRAs are rather pessimistic.

ryanpers · on Aug 2, 2011

eg: https://issues.apache.org/jira/browse/HADOOP-6659 still open and no movement for over a year

Patches have been rolled back, commits vetoed, and the direction seems to be providing a pluggable thrift/protobuf/? RPC option, which seems weird to me. In any case, there was a lot of back pressure to avro.

Also given it's complexity I dont think I can recommend it in the RPC area.

zoowar · on Aug 2, 2011

Cassandra recently stripped out Avro support.