
Pyrobuf: A Cython Alternative to Google's Python Protobuf Library - rileyberton
https://github.com/appnexus/pyrobuf
======
pixelmonkey
Thrift is a message-serialization system very similar to Protobuf.

The thriftpy project implements a pure-Python parser for Thrift files that
dynamically generates Python modules _without_ a build/codegen step. It also
has Py2, Py3, and PyPy support.

Awesome library for this sort of thing. Includes a Cython impl of Thrift
parsing/serialization, too:

[https://github.com/eleme/thriftpy](https://github.com/eleme/thriftpy)

~~~
moonchrome
What is the benefit of using something like Thrift over something like
CapNProto or FlatBuffers?

~~~
pixelmonkey
I suspect it comes down to the details of the serialization format.

I know that Thrift and Protobuf were developed / publicly released around the
same time (~2008). They both have serialization and RPC approaches. They both
have an IDL format and a compiler for codegen in static languages like C++ and
Java.

Thrift was adopted by the Apache O/SS community and so it can be found in some
Apache projects, e.g. Thrift is used as an interop format in Apache Storm and
as a serialization backend supported by Apache Parquet.

Thrift has had a lot of client libraries for different languages come out over
the years, so it tends to work everywhere.

IIRC, CapNProto and FlatBuffers were each built as fresh takes on protobuf. So
I imagine they are similar, with IDLs and serialization approaches of their
own. I took a quick look at CapNProto's Python module documentation and it
looks very similar to thriftpy in philosophy, but I don't know its intricate
details as I haven't used it.

This guide to Thrift covers all the important details of its schema approach
and is meant as a cross-language guide. It should be the official docs, but
alas, it's not!

[https://diwakergupta.github.io/thrift-missing-guide/](https://diwakergupta.github.io/thrift-missing-guide/)

------
aeroevan
If this gets Python 3 support before Google's official implementation, I'd be
much more inclined to give it a shot, but it doesn't look like it's compatible
yet either.

~~~
tburmeister
Try now - I just merged in some changes to make it Python 3 compatible.

------
vvanders
If you're looking for a speedy alternative to protobuf, then you owe it to
yourself to try FlatBuffers.

[https://google.github.io/flatbuffers/](https://google.github.io/flatbuffers/)

~~~
rkwasny
Don't bother, go directly to cap'n proto:
[https://capnproto.org](https://capnproto.org)

~~~
moonchrome
There are a couple of reasons I chose FB over Cap'n Proto. First, Cap'n
Proto's Python support and full C++ implementation don't work on MSVC/Windows
(this might have changed recently, especially given the Clang compiler
integration into the MS backend).

The second problem is that Cap'n Proto doesn't have inline structs, i.e. if
you want something like struct Vec3 {x: float, y: float, z: float}, in Cap'n
Proto you have to either make Vec3 a ref type (+ pointer size and a lookup
indirection for each instance) or inline it manually (especially annoying
given that you must manually number Cap'n Proto fields; maybe this is a
benefit in some use cases, but it's of no use to me and only creates work).
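For illustration, the inline-struct distinction looks like this in a
FlatBuffers schema (a hypothetical Vec3/Particle example, not taken from
either project's docs):

```
// Hypothetical .fbs schema: a `struct` is a fixed-size value type,
// stored inline with no per-instance pointer and no manual field ids.
struct Vec3 {
  x: float;
  y: float;
  z: float;
}

table Particle {
  position: Vec3;  // embedded directly in the parent table's data
}
```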

Cap'n Proto is much bigger as well: it includes its own RPC/distributed-object
protocol and library. The serialization part isn't separate, they are in one
library, so that's taking in _a lot_ of code you don't need, which you would
have to figure out if you decided to maintain it.

Also worth noting: FlatBuffers on Python is actually not that fast, because
the implementation is pure Python. There are Cython versions in a PR (at least
last time I checked). This doesn't matter in my use case, but it's worth
pointing out. And the Python API is very, very unpythonic (a C-style API with
even the naming completely off from PEP8), compared to CapnProto's, which is
excellent!

I needed a cross language typed serialization format for binary files and
FlatBuffers is better for that than CapnProto IMO.

~~~
dwrensha
> the serialization part isn't separate they are in one library

Although Cap'n Proto's C++ implementation is hosted in a single git
repository, it does compile to several distinct libraries. You can use just
the core serialization/deserialization part, libcapnp, if that's all you need.
There are separate components, libcapnpc and libcapnp-rpc, for dynamic
reflection and the object-capability remote procedure call system.

------
JulianWasTaken
More protobuf implementations are certainly good; Google's has been really
lacking for us.

It would be _really_ nice if there was a pure Python implementation that
didn't use tons of unnecessary metaprogramming.

~~~
haberman
It's really interesting to me that the top two comments on this story both
wish Protocol Buffers were different, but seem to be asking for opposite
things.

Your comment wishes that there was a pure-Python implementation that doesn't
do metaprogramming. That would imply doing code generation, but having the
generated code be more "concrete", so-to-speak (containing all of the actual
implementation).

However, a different comment
([https://news.ycombinator.com/item?id=10762101](https://news.ycombinator.com/item?id=10762101))
admires the thriftpy project which doesn't have a codegen step at all. This is
in some sense the opposite of what you want: _everything_ becomes
metaprogramming.

I work on Protocol Buffers at Google, and I frequently observe these kinds of
opposing feelings about how protobufs ought to work. One thing I've learned is
how incredibly difficult it is to please everybody. Another good example of
this is whether to be pure-Python: some people (like you) want that. But it's
nearly impossible to make it very fast, and lots of users are sensitive to
speed (you can find various comments and blog articles complaining about the
speed of Python protobuf).

~~~
JulianWasTaken
I've certainly observed the same, I think :) And yeah, pleasing everyone is
impossible -- though I think there's a somewhat coherent way to put at least
part of those two things together.

Not wanting code generation might come from people who don't like the
additional build step and artifact deployment. I don't like that either, but
what I mean by "less metaprogramming" is that the code that's _generated_
currently makes copious use of descriptors, metaclasses, and other "complex"
things in a way that makes it extremely JIT-unfriendly on PyPy. We had to
completely abandon it in favor of using protoc-c wrapped via CFFI. So I care
about speed (almost entirely -- protobuf powers systems for us that do ~350K
messages per second).

(I do though think it would be possible to satisfy _both_ complaints
simultaneously. I spent about 5 minutes trying to write a codegen-less, pure
Python implementation:
[https://github.com/Julian/pb/tree/master/pb](https://github.com/Julian/pb/tree/master/pb)
but I didn't find the Proto3 docs mature enough to figure out the binary
protocol at the time. Not sure there's enough there to see what direction I
was trying to go in, but I have given this a shot before).
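For reference, the core of that binary protocol is actually quite small; here
is a minimal sketch of protobuf's varint and tag encoding (this is the
standard wire format as documented, not code from the linked repo):

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, MSB set on all but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field's key is the varint of (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)

# The canonical protobuf example: int32 field #1 set to 150
# encodes as the three bytes 08 96 01 (wire type 0 = varint).
message = encode_tag(1, 0) + encode_varint(150)
```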

And thanks for the reply; I do agree that there are _some_ conflicting
concerns involved here.

~~~
haberman
Ah, that's very interesting -- your desire for more "pure" Python comes from a
desire for speed. (Some people who wish the generated code was more concrete
want this so the generated code is more readable).

I am curious what led you to the conclusion that descriptors and metaclasses
were making Python protobuf JIT-unfriendly. I ask because most of the "meta"
stuff happens at startup, when you first import the module. It uses a
metaclass to generate a bunch of Python methods, but after that they are just
regular Python methods, and should be as easy for PyPy to optimize as anything
else. There is a bit of reflection happening in some code-paths (like the
__init__ method does loop over a list of fields to decide how to initialize
them), but the metaclass at least is pretty much gone after import.

For this reason, I suspect that even if we ditched all the metaclass stuff,
you'd see PyPy performance pretty close to what you're seeing now. The only
thing I could see potentially making a more significant difference is if there
were generated parsing code that switches on field number, instead of looking
up fields by number in a dictionary. But given that Python doesn't have an
actual switch statement, this might not be an improvement at all.
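To make that concrete, here is a toy sketch of the pattern described above
(not the actual protobuf implementation): the metaclass runs once at class
creation, and what it leaves behind are ordinary methods.

```python
class MessageMeta(type):
    """Runs once, at class-creation (i.e. import) time."""
    def __new__(mcs, name, bases, namespace):
        for field in namespace.get("_fields", ()):
            # The generated accessors are plain Python functions; after
            # import, no metaclass machinery is involved in calling them.
            def getter(self, _f=field):
                return self._values.get(_f)
            def setter(self, value, _f=field):
                self._values[_f] = value
            namespace["get_" + field] = getter
            namespace["set_" + field] = setter
        return super().__new__(mcs, name, bases, namespace)

class Point(metaclass=MessageMeta):
    _fields = ("x", "y")
    def __init__(self):
        self._values = {}
```

After import, `Point().set_x(3)` and `get_x()` are regular method calls that a
JIT sees like any other Python code.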

~~~
fijal
Hi

The Python code that's generated is OK, but the problem is that everything is
very dynamic. If you want this to work fast on PyPy, it should really generate
a bunch of getters and setters that use plain attribute access rather than
going through a generic __getattr__ that does dictionary lookups.
Additionally, a lot of the code is written in a way that creates many
temporary lists, iterating over all fields and double-checking things; this is
bound to be way slower than the simple C/Java approach of doing the very
simplest "check type, access attribute" sort of operation.
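A toy illustration of that difference (hypothetical classes, not the generated
protobuf code): the first style funnels every field access through generic
dictionary machinery, while the second gives the JIT plain attributes it can
specialize on.

```python
# Dynamic style: field access falls through to __getattr__ and a dict lookup.
class DynamicVec:
    _fields = {"x": 0.0, "y": 0.0}

    def __init__(self):
        self._values = dict(self._fields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so every
        # field read pays for a dictionary lookup here.
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name)

# Concrete style: generated getters/setters become plain attribute access,
# which a tracing JIT like PyPy's can compile down to a fixed-offset read.
class ConcreteVec:
    __slots__ = ("x", "y")

    def __init__(self):
        self.x = 0.0
        self.y = 0.0
```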

------
Dowwie
Reduce your objects and validate using Marshmallow; encode/decode however you
like.

------
dacompton
Is support for Python worse than for Go or Clojure? That seems unimaginable.

------
diffraction
why do we keep reinventing the wheel and not use asn.1?

~~~
asn1argh
The spec is too big, and it's too hard to implement correctly (let alone
efficiently).

~~~
diffraction
you are using it all the time without realizing it:
[http://www.marben-products.com/asn.1/market.html](http://www.marben-products.com/asn.1/market.html)

protobuf, swift, etc. happened when an advertising company decided it was an
engineering company

~~~
slavik81
Perhaps they should do more marketing, because speaking as a C++ developer: I
picked between protobufs, FlatBuffers, and Cap'n Proto because they were easy
to use, had active communities, and had websites which explained how and why I
should use their compiler/library/protocol.

When I search the web for information about asn.1, I find very little that is
of practical use. What library should I use? Why should I use it? How does it
benchmark in comparison to the other tools? I've seen a few asn.1 library
webpages and they all seem to take it as a given that I'm just looking for
some way to deal with asn.1 data. They don't bother to try to convince me that
their tool is efficient, or even that asn.1 is the right choice for my data in
the first place.

~~~
diffraction
you are dismissing an old and proven technology, the backend of the global
telecom system (SS7) for the past two decades, because you feel it is not
marketed to you effectively? why is there a movement in computer science to
throw out old things that work? this is confusing to me, like NoSQL. but "good
marketing" is something that is called Cap'n Proto? sounds like a joke.

if you are curious here is a good blog post
[https://ttsiodras.github.io/asn1.html](https://ttsiodras.github.io/asn1.html)

i am working with asn.1 via the MMS protocol

~~~
detaro
Well, the only place many programmers stumble over ASN.1 nowadays is in
certificates. And ASN.1 parsing there has a history of massive security
issues, resulting in massive warnings of "do not touch!"

And yes, "good marketing" in the form of readily available and well-documented
libraries for the languages we use is a very important factor.

I bet the telecom sector has their battle-tested libraries for ASN.1, or at
least for the parts they use. Are they open-source? Are they available for all
languages wanted? No? Then why would I use ASN.1, just to use a "standard", if
it means using worse code or writing it myself?

