

Protocol Buffers by Google - koenigdavidmj
http://code.google.com/p/protobuf/

======
JoelPM
Someone's bound to say "But it doesn't support RPC!"

No, it doesn't, and that's a good thing. Use ZeroMQ as your transport with a
protobuf payload and do remote message passing* instead of RPC - it'll scale
more efficiently than straight RPC and you won't end up having to pool
connections or queue requests in your client.

*Give each message request an identifier and have each response identify what message it's answering.
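The correlation scheme in the footnote can be sketched in a few lines. This is a hypothetical envelope of my own (the struct and function names are not from ZeroMQ or protobuf): each request carries an id, and each response echoes the id of the request it answers, so the client can match replies to pending requests without blocking.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical envelope -- names are illustrative, not from any library.
 * Requests carry an id; responses echo it in in_reply_to. */
struct envelope {
    uint32_t id;           /* correlation id, chosen by the requester */
    uint32_t in_reply_to;  /* 0 for requests; the request's id for responses */
    char     payload[64];  /* serialized protobuf bytes would go here */
};

/* Build a response that answers the given request. */
static struct envelope make_reply(const struct envelope *req,
                                  const char *payload)
{
    struct envelope resp = {0};
    resp.in_reply_to = req->id;
    strncpy(resp.payload, payload, sizeof resp.payload - 1);
    return resp;
}

/* Client side: does this response answer this request? */
static int answers(const struct envelope *resp, const struct envelope *req)
{
    return resp->in_reply_to == req->id;
}
```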

~~~
alexgartrell
A research project with which I am involved recently moved from google
protocol buffers + TCP to thrift RPC, because they found that thrift was much
faster. Google Protocol Buffers encode the keys and field types with varints,
which are variable-length byte sequences representing ints: each byte carries
7 bits of the value, and if the 8th (high) bit is set, another byte follows.
It's pretty straightforward:

    
    
      int parse(unsigned char *msg) {
        int x = 0, shift = 0;
        unsigned char b;
        do {
          b = *msg++;                /* consume one byte */
          x |= (b & 0x7f) << shift;  /* low 7 bits carry the value */
          shift += 7;                /* varints are little-endian base 128 */
        } while (b & 0x80);          /* high bit set: more bytes follow */
        return x;
      }
    

Thrift, on the other hand, uses vanilla 4-byte ints for the same purpose. This
has the effect of limiting the number of fields to 2^29 (I think they use the
bottom 3 bits to specify the field type, but I don't remember off the top of
my head). On the flip side, it's essentially instant to parse, as opposed to
the byte-by-byte loop.
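For comparison, the fixed-width decode described above can be sketched like this (plain C, no Thrift headers; bytes assumed big-endian on the wire, as in Thrift's binary protocol):

```c
#include <stdint.h>

/* Decode a 4-byte big-endian int in one pass -- no loop and no
 * continuation bits, at the cost of always spending four bytes. */
static int32_t parse_fixed32(const unsigned char *msg)
{
    return (int32_t)((uint32_t)msg[0] << 24 |
                     (uint32_t)msg[1] << 16 |
                     (uint32_t)msg[2] <<  8 |
                     (uint32_t)msg[3]);
}
```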

Further, thrift makes it pretty trivial to write a thrift server/client in any
language.

In short, thrift > protocol buffers.

Edit: One caveat -- Async calls WITH return values are more or less a no-go in
thrift.

~~~
nostrademons
I suspect that this is heavily dependent upon the relative speeds of network
vs. CPU, as well as the usage that these encoded messages will get afterwards.
Protocol buffers are making a speed vs. space tradeoff, minimizing the size of
a message at the cost of some processing power. That tradeoff probably made a
lot more sense when Google was young and 100BaseT was new than it does now, in
an era of ubiquitous gigabit Ethernet.

Also remember that for most production services, you'll want to log requests &
responses. As soon as you start hitting the disk, increased space usage gets
expensive pretty fast. You'd probably want to convert your 32-bit ints to
varints then anyway, and then you've lost all the CPU that you saved in the
first place.

In short, YMMV. Consider your particular use-case carefully before making any
blanket judgments of thrift > protobufs or protobufs > thrift.

~~~
alexgartrell
You're talking about saving 3 bytes per int. I'd imagine it's pretty hard to
find a workload where you can achieve 75% compression by using exclusively
varints that are less than 128 in magnitude. In fact, I'd imagine that most of
the data shipped in most workloads is mostly strings/binary, which have very
little if any savings.

So let's consider the ideal situation, where every field is a varint less than
128. Each field then takes 2 bytes (1 for the tag + 1 for the varint itself,
both under 128 in magnitude). How many such fields can we have? Well, if the
bottom 3 bits of the tag specify the wire type (a conservative estimate) and
the high bit marks varint continuation, that leaves only 4 bits to identify
the field, which means 16. So if a message has 16 fields and we save 6 bytes
per field, we're talking a savings of 96 bytes. For the next 2048 fields,
we're saving 5 bytes per field (roughly 5kb vs 15kb), and beyond that 4 bytes
per field (and eventually none). This is also an incredibly arbitrary
situation, where you expect ALL of your fields to be small ints.
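To make the per-int arithmetic above concrete, here is a small sketch (my own, not from either library) of the varint size rule: one byte per 7 bits of magnitude, so the saving over a fixed 4-byte int is just 4 minus this.

```c
#include <stdint.h>

/* Bytes a value occupies as a base-128 varint: one byte per 7 bits
 * of magnitude, so values under 128 fit in a single byte. */
static int varint_size(uint32_t v)
{
    int n = 1;
    while (v >= 0x80) {
        v >>= 7;
        n++;
    }
    return n;
}
```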

The same goes for disk: it's unlikely that your workloads will let you save
much by using varints.

So, you're right that, for some workloads, this may be better, but if someone
put a gun to my head and said "protocol buffers or thrift?" I'd have to go
with thrift.

~~~
124816
Going to reply to all your comments at this rate. ;-)

Even with single-byte tags, the overhead can be quite significant; thus the
packed = true option. Unfortunately I don't have any public protocols or data
to show you, but by switching to packed = true (combined with some extra
tricks) I achieved ~40-60% savings on some of our datasets. We use protobufs
for our stable storage, and this change certainly saved us more than my yearly
compensation package.

> For the next 2048 entries

Yikes! I hope you never have to work with a message type with anywhere near
2048 fields! :-)

In practice, more than half of message types have fewer than 16 fields, which
is why the "incredibly arbitrary situation" is justified, and really does end
up saving a ton of storage. But, for simplicity, nothing can beat just ntohl
and a cast.
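For context, packed = true stores a repeated numeric field as one length-delimited blob of varints laid end to end, with no per-element tag bytes. A decode sketch (my own, not the protobuf library's actual code):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a packed repeated varint field: the blob is just varints
 * back to back, with no per-element tag -- that missing tag byte
 * per element is where the packed = true savings come from. */
static size_t decode_packed(const unsigned char *buf, size_t len,
                            uint32_t *out, size_t max)
{
    const unsigned char *p = buf, *end = buf + len;
    size_t n = 0;
    while (p < end && n < max) {
        uint32_t v = 0;
        int shift = 0;
        unsigned char b;
        do {
            b = *p++;                       /* consume one byte */
            v |= (uint32_t)(b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) && p < end);    /* high bit: more bytes */
        out[n++] = v;
    }
    return n;
}
```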

------
joshstaiger
Protocol buffers are great, but the Python implementation provided by Google
is written in pure Python and is very slow at serializing/deserializing
protobufs (see <http://news.ycombinator.com/item?id=767882>).

I've heard you can get around this by SWIG-wrapping the C++ implementation,
but I haven't gotten around to trying it yet.

------
nathanmarz
A great use case for Protocol Buffers or Thrift is adding schema to
technologies that otherwise just use raw byte arrays, like Hadoop files,
Cassandra columns, and Gearman parameters.

------
scrame
I've recently moved to protocol buffers and am quite happy with the results.
Before that, we had a few Tomcat apps with ad-hoc URLs that returned XML
responses. Since we already had a good HTTP request/response channel, I was
able to create a Java bridge for the message objects, as well as a simple
Python library that can communicate with all our servers.

While there was a little overhead in getting it integrated into the build
system, it works great and gives a much better interface and much finer
control for messaging individual servers.

The project itself is also very well put together. Hell, even the autoconf
generated makefile has targets like "uninstall", which seems to be a rarity
when building packages from source.

------
catch23
Another good alternative is msgpack at <http://msgpack.org/>

It has its own RPC mechanism as well that competes with Thrift's, although it
is not IDL-based (however, an IDL is coming soon).

------
superk
See also Apache Thrift (from Facebook) and Caucho Hessian

~~~
yurisagalov
There's a (pretty) good comparison of Thrift to Protobuf on
<http://wiki.github.com/eishay/jvm-serializers/>

I think Thrift supports a lot more languages, but based on those benchmarks,
I believe it performs (sometimes significantly) slower.

I've never used Hessian though, will have to check it out at some point.

------
amalcon
What sort of versioning capability does this give? That's the main reason to
bother using something like this.

~~~
124816
Both protocol buffers and thrift are primarily designed to handle older or
newer messages (than your program was compiled with) gracefully. Speed and
compactness appear to be secondary goals. (Especially for Thrift, which does
not seem concerned with compactness.)

Rather than versioning your protocol, the generated code is able to parse
newer messages. If you modify a field on such a message and serialize it
again, even the parts of the message your binary is unaware of will be
retained (as opposed to dropped). This means you don't need to push all
changes at once, and binaries can talk with both older and newer binaries
without problems.

The basic wire format is [(tag, payload bytes)*]. The tag identifies both the
field "number", and the wire-type of the field. Wire types include fixed 32
and 64, length-delimited ranges, and, for protos, variable length integers.
The parser knows how to skip fields of all wire types, so any unknown tags are
skipped (and retained in memory) during parsing.
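The tag packing described above can be shown in a couple of lines (wire types per the protobuf encoding docs: 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit):

```c
#include <stdint.h>

/* A protobuf tag is (field_number << 3) | wire_type, itself encoded
 * as a varint; a single byte covers fields 1 through 15. */
static uint32_t field_number(uint32_t tag) { return tag >> 3; }
static uint32_t wire_type(uint32_t tag)   { return tag & 0x07; }
```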

Some caveats apply -- you can modify your protocol in several backwards-
incompatible ways. For example: add a new required field. Your new binary will
reject all serialized messages from your older binaries. Alternatively, change
the type of a field. These are pretty easy to avoid, but given enough people
on your team, it is bound to happen once or twice.

------
revoltingx
I use RabbitMQ for my Android app. It's just a matter of sending a byte array
between the client and the server, then reading the bytes back into the
appropriate fields on both ends. This may be a bit of work, but it's really
fast and efficient.

~~~
124816
After several releases of your app, you'll have several wire formats in the
wild. One of the main features of libraries like Protocol Buffers is that they
can parse old messages, so you don't need to maintain an increasingly hairy
parsing routine.

~~~
revoltingx
Well, I deny access to old clients. I consider this an advantage, because it
forces users to keep up to date. If you're careful, you can just add items to
the end of the stream and old clients will keep working - I don't officially
support that, but I do it in practice.

But yeah, protobuf wins in that domain, especially if it avoids a high
serialization/deserialization overhead. It'd be cool to use RabbitMQ to route
protobuf objects around with a cleaner interface.

Shameless plug for my dev blog: <http://developingthedream.blogspot.com/>

