

Greplin (YC W10) open sources 10-15x faster protocol buffers for Python - rwalker
https://github.com/Greplin/fast-python-pb

======
haberman
For a long time (much longer than I expected it would take) I've been working
on a protobuf implementation in C that does _not_ use Google's C++
implementation at all. I've been through about three rewrites and I finally
have the interface right. I'm hoping it will be usable with Python soon
(weeks).

<https://github.com/haberman/upb/wiki>

(if anyone's looking at the code, I'm working on the src-refactoring branch at
the moment)

The benefits of my approach are:

* you can avoid depending on a 1MB C++ library. upb is more like 30k compiled.

* you can avoid doing any code generation. instead you just load the .proto schema at runtime, so you don't have to get a C++ compiler involved.

* Google's protobuf library does have a dynamic/reflection option that avoids my previous point, but it is ~10x slower than generating C++ code. My library, last time I benchmarked it, was 70-90% of the speed of generated C++.
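
The no-codegen point is worth unpacking: the protobuf wire format is just self-describing tag/value pairs, so a parser can be driven entirely by a schema loaded at runtime. A rough pure-Python sketch of walking the wire format (hand-rolled for illustration; upb's real API is C and differs):

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            return result, pos
        shift += 7

def parse_fields(buf):
    """Yield (field_number, wire_type, raw_value) with no generated code."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire_type = key >> 3, key & 7
        if wire_type == 0:                    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:                  # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise NotImplementedError(f"wire type {wire_type}")
        yield field, wire_type, value

# field 1 = varint 150, field 2 = string "hi"
msg = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi"
print(list(parse_fields(msg)))
# [(1, 0, 150), (2, 2, b'hi')]
```

A real implementation would then map field numbers back to names via the loaded .proto descriptor.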

~~~
jsarch
Can you clarify what you mean by "70-90% of the speed of generated C++"?

Suppose that the generated C++ takes 1.0 seconds. Does your implementation
take 0.7-0.9s or 1.7-1.9s or something else?

~~~
haberman
If the generated C++ can parse 1MB/s, I can parse at 700-900kB/s. 70-90% of
the speed, not the time. So 1.1-1.4 seconds, in your example.
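
Spelled out with the numbers from the question:

```python
# Converting relative speed to relative time: time = baseline_time / speed_ratio.
baseline_time = 1.0  # seconds for the generated C++
for speed_ratio in (0.9, 0.7):
    print(f"{speed_ratio:.0%} speed -> {baseline_time / speed_ratio:.2f} s")
# 90% speed -> 1.11 s
# 70% speed -> 1.43 s
```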

------
sigil
I too have a speedy Protocol Buffer implementation in Python:

<https://github.com/acg/lwpb>

It clocks in at 11x faster than JSON, the same speedup reported by fast-pb.
But with lwpb:

* There's no codegen step -- which is a disgusting thing in a dynamic language, if you ask me.

* You're not forced into object-oriented programming; with lwpb you can decode and encode dicts.

Most of haberman's remarks apply to lwpb as well, i.e. it's fast, small, and
doesn't pull in huge dependencies. The lwpb C code was originally written by
Simon Kallweit and is similar in intent to upb.
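
The dict-based style is easy to picture. Here's a rough pure-Python sketch of encoding a plain dict against a runtime schema; the schema layout and function names are illustrative, not lwpb's actual API:

```python
# Hypothetical runtime schema: field name -> (field number, kind)
SCHEMA = {"id": (1, "varint"), "name": (2, "bytes")}

def write_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_dict(msg, schema=SCHEMA):
    """Encode a plain dict to protobuf wire format -- no generated classes."""
    out = bytearray()
    for name, value in msg.items():
        number, kind = schema[name]
        if kind == "varint":
            out += write_varint(number << 3 | 0) + write_varint(value)
        else:  # length-delimited: strings, bytes, nested messages
            data = value.encode() if isinstance(value, str) else value
            out += write_varint(number << 3 | 2) + write_varint(len(data)) + data
    return bytes(out)

print(encode_dict({"id": 150, "name": "hi"}).hex())
# 08960112026869
```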

~~~
ssnot
fast and small footprint, as components should be

------
atamyrat
We (<http://connex.io/>) use Protocol Buffers quite heavily, and the Python
implementation was the performance bottleneck in many places.

I was working on the same thing, CyPB, which is 17 times faster than Google's
Python implementation. <https://github.com/connexio/cypb>

This one seems more complete at the moment, though. I might just mark the
ticket in our tracker as closed and switch to fastpb :-/

------
nostrademons
Nifty. I've passed it along to the appropriate folks.

Google uses SWIG-wrapped C++ proto bindings in Python pretty extensively, so
I'm not sure how much this gains over that approach. I checked out the source;
it's basically using Jinja templates to autogen Python/C API calls. Like SWIG,
but without using SWIG.

~~~
slewis
When I was at Google I worked with very large structured protocol buffers in
Python at one point. A single piece of data could be hundreds of MB in total,
consisting of millions of smaller protocol buffers. I was doing a pass over
the whole structure so needed to access each smaller PB from Python.

One day I decided my program was too slow so I profiled it and saw that the
hot spots were in the Python protocol buffer implementation. "Easy", I
thought, "I'll used SWIGed c++ PBs instead." Made some changes and ran the
program again. Almost the exact same run time as before! I profiled again and
found that this time the hot spots were in the SWIG layer. I was making so
many calls through SWIG to C++ (because I was walking millions of objects)
that using SWIGed PBs vs. native Python PBs made no difference to my run time.
Maybe I could have done some more custom SWIG work to lower the call overhead,
but I remember being convinced at the time that SWIG wasn't going to do the
trick.

So I ended up writing a 30-line Python extension that processed the protocol
buffers in C++ and put the data into Python data structures. Run time was
reduced by a factor of 10, hooray!

------
apotheon
It doesn't appear to actually be open source:

> # Copyright 2010 Greplin, Inc. All Rights Reserved.

Where's the license?

I think the term you want is "publishes", and not "open sources".

~~~
rwalker
Good catch - updating now. It'll be under Apache 2.0. (edit: done)

~~~
apotheon
Excellent! I mean, Apache is a somewhat overly complex license for what it
does, but what it does is pretty good.

Thanks.

------
peterlai
I hope to see these changes incorporated within Google's official
implementation.

As of right now, deserialization of JSON and XML is way faster in Python:
<http://stackoverflow.com/questions/499593/whats-the-best-serialization-method-for-objects-in-memcached>

------
dirtae
This is very welcome, but I hope Google fixes this problem in the official
protobuf distribution.

It looks like protobuf 2.4.0 has experimental support for backing Python
protocol buffers with C++ via the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION
environment variable:

<http://protobuf.googlecode.com/svn/trunk/CHANGES.txt>
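
If the changelog is accurate, opting in looks something like this (the generated module name below is made up for illustration):

```python
import os

# The switch has to be in the environment before any generated *_pb2
# module is imported, since the implementation is chosen at import time.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "cpp"

# import addressbook_pb2  # hypothetical generated module, now C++-backed
```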

------
traviscline
Is this really a better approach than using Cython to wrap a c++ or c
implementation?

------
sigil
You should add cPickle to the benchmark as well -- I bet fast-pb still comes
out ahead, and that may be an eye opener for many Python devs.

------
andrewvc
MessagePack is up to 4x faster* than protobuf, and IMHO easier to work with.

<http://msgpack.org/>

I used it as the native format for DripDrop
(<https://github.com/andrewvc/dripdrop>)

* In some tests

~~~
haberman
> * In Some tests

In that test protobuf is forced to _copy_ the 512-byte string 200,000 times,
while it appears that MessagePack is _referencing_ it.

Granted it's a bummer that protobuf can't do this easily (my protobuf library
upb can -- see above post), but I think it's dishonest not to mention that a
large portion of the difference (if not all of it) is just memcpy() that
protobuf is doing but MessagePack is not.
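
The copy-vs-reference gap is easy to demonstrate in plain Python, no protobuf required:

```python
# Slicing a bytes object copies the payload, while a memoryview slice
# just points into the original buffer.
message = b"\x00" * 64 + b"x" * 512   # pretend a 512-byte string field at offset 64

field_copy = message[64:576]              # allocates + memcpy()s 512 bytes
field_view = memoryview(message)[64:576]  # zero-copy reference

assert field_copy == field_view.tobytes()
assert field_view.obj is message          # still backed by the original buffer
```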

It reminds me of when I worked at Amazon and we had a developer conference
with several speakers. One speaker was plugging Erlang and showed a graph
comparing C++ processes with Erlang processes, and the graph showed C++ being
much slower or bigger. Scott Meyers was in the audience and raised his hand to
ask "what are the Erlang processes not doing, to explain the difference?" The
guy couldn't answer that question directly.

After a bit of digging, you realize that an Erlang "process" is a lightweight,
interpreter-level abstraction that is implemented inside a regular OS process.
So naturally it doesn't have any of the overhead that is associated with an OS
process, and you don't have to make a system call to perform IPC.

So when you're posting benchmark comparisons, I think it's only right to
mention any inherent differences in how much work you're doing.

------
sigil
Has anyone managed to run the fast-pb tests in benchmark.py? I'm not sure
where this switch is coming from:

      protoc --fastpython_out

~~~
rwalker
Have you installed both protocol buffers and the fast-python-pb module? Feel
free to email me: robbyw@(the-company-mentioned-in-the-title).com

~~~
sigil
Thanks Robby, got the benchmark working, was trying to do a homedir install of
fast-python-pb earlier.

I added a couple more tests to the benchmark; here are the results:

      JSON                                3.57209396362
      Protocol Buffer (fast-python-pb)    0.325706005096
      Protocol Buffer (native)            4.83730196953
      Protocol Buffer (lwpb)              0.32919216156
      cPickle                             0.837985038757

As you can see, lwpb and fast-python-pb are neck and neck. And I should point
out that lwpb isn't using C++ codegen at all, just the schema compiled from
the .proto file. Of course, if completeness of implementation were the
critical thing, you'd probably want to stay closer to Google's official
implementation. There's a lot of the Google implementation that I never use,
though, like the RPC stuff.

Also notable that both lwpb and fast-python-pb outperform cPickle by almost
3x. It would be interesting to know why a portable, cross-language
serialization format beats out the language-specific one.

Here's a fork with the patched benchmark code:
<https://github.com/acg/fast-python-pb>

~~~
ot
> It would be interesting to know why a portable, cross-language serialization
> format beats out the language-specific one.

Because protobuf parses messages with a fixed schema in a very structured
format. Pickle, OTOH, is an interpreted bytecode microlanguage used to
describe arbitrary Python objects (for instance, pickle can call Python
functions: <http://nadiana.com/python-pickle-insecure>).

Also, pickle supports references (so if an object is referenced twice in the
same serialization stream, it is serialized only once), and this has a cost
at serialization time (you need to keep a set of seen references).
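
That reference tracking is easy to observe:

```python
import pickle

shared = {"user": "sigil"}
doc = [shared, shared]   # one dict, referenced twice

restored = pickle.loads(pickle.dumps(doc))

# The dict was serialized once plus a back-reference (pickle's memo),
# so object identity survives the round trip:
assert restored[0] is restored[1]
```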

~~~
sigil
Perl's Storable can also serialize code references which get eval'ed during
deserialization, and can also serialize multiple references once, though you
have to be more explicit about that. And Storable is still 2x-3x faster than
the already quite fast JSON::XS.

It would seem the bytecode interpreter architecture in Pickle is the limiting
factor. If anybody has some good profiling data on Pickle though, I'd love to
see it.

