Demystifying the protobuf wire format (kreya.app)



"Demystifying" is a big word for what the original docs document quite well, and is also not like you couldn't read and understand that in few hours, if you are not totally foreign to protocol design and serialization? This post gives even much less information?!


Yeah, this page is quite clear: https://protobuf.dev/programming-guides/encoding/

Nothing mystical about it
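
The whole thing boils down to: each field is a tag varint (field_number << 3 | wire_type) followed by a payload. A quick hand-rolled sketch of the encoding side, purely for illustration (this is what the generated code does for you):

    def encode_varint(value: int) -> bytes:
        # base-128 varint: 7 bits per byte, MSB set means "more bytes follow"
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def encode_field(field_number: int, wire_type: int, payload: bytes) -> bytes:
        # tag = field number and wire type packed into a single varint
        return encode_varint((field_number << 3) | wire_type) + payload

    # the classic example from the encoding guide: field 1 (varint) = 150
    print(encode_field(1, 0, encode_varint(150)).hex())  # -> "089601"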


I did a lot of research on binary serialization at the University of Oxford. One of the papers I published is a comprehensive review of existing JSON-compatible serialization formats (https://arxiv.org/abs/2201.02089). It touches on Protocol Buffers (and >10 other formats), and I analyze the resulting hex dumps much like the OP does.

I also published a space-efficiency benchmark of those same formats (https://arxiv.org/abs/2201.03051) and ended up creating https://jsonbinpack.sourcemeta.com as a proposed technology that does binary serialization of JSON using JSON Schema.


I like the notion of fig. 3, but it doesn't seem to capture the evolution of usage over time.


As a counterpoint to the horror stories, I've had a few relatively good experiences with protocol buffers (not gRPC). On one project, we had messages that needed to be used across multiple applications, on a microcontroller, on an SBC running Python, in an Android app, in a web service, and on web UI frontend. Being able to update a message definition in one place, and have it spit out updated code in half a dozen languages while allowing for incremental rollout to the various pieces was very handy.

Sure - it wasn't all guns and roses, but overall it rocked.


To be fair, if that's what you need ProtoBuf isn't the only option. Cap'n Proto[1], JSON Schema[2], or any other well supported message-definition language could probably achieve that as well, each with their own positives and negatives.

[1]: https://capnproto.org/

[2]: https://json-schema.org/


Big fan of Cap'n Proto here, but to be fair it doesn't support as many languages as Protobuf/gRPC yet.


I'm currently building a Protocol Buffers alternative that uses JSON Schema instead: https://jsonbinpack.sourcemeta.com/. Research showed it to be as space-efficient as, or more space-efficient than, every alternative considered (https://arxiv.org/abs/2211.12799).

However, it is still heavily under development and not ready for production use. Definitely looking for GitHub Sponsors or other types of funding to support it :)


This is what I used it for, and it was great. Especially if you were feeding data to a third party and had to agree on an exchange format anyway.


We built a backend heavily using protobufs/grpc and I highly regret it.

It adds an extra layer of complexity most people don't need.

You need to compile the protobufs and update all services that use them.

It's extra software for security scans.

Regular old http 1 rest calls should be the default.

Only if you are having scaling problems should you consider moving to gRPC.

And even then I would first consider other simpler options.


Personally I'll never go back to REST because you lose important typing information. The complexity doesn't go away. In the RPC/codegen paradigm the complexity is in the tooling and is handled by machines. In REST, it's in the minds of your programmers and gets sprinkled throughout the code.

> You need to compile the protobufs and update all services that use them.

You need to update all the services when you change your REST API too right? At least protobufs generates your code automatically for you, and it can do it as part of your build process as soon as you change your proto. Changes are backwards compatible so you don't even need to change your services until they need to change.
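
The wire-level reason that works: every field is tagged with its number and wire type, so an old reader can skip field numbers it doesn't know. A rough hand-rolled sketch to illustrate (the generated code does this for you):

    def decode_varint(data: bytes, pos: int):
        result, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result, pos
            shift += 7

    def decode_known_fields(data: bytes, known: set):
        # keep fields the reader knows about, silently skip everything else
        fields, pos = {}, 0
        while pos < len(data):
            tag, pos = decode_varint(data, pos)
            field_number, wire_type = tag >> 3, tag & 0x07
            if wire_type == 0:      # varint
                value, pos = decode_varint(data, pos)
            elif wire_type == 1:    # fixed64
                value, pos = data[pos:pos + 8], pos + 8
            elif wire_type == 2:    # length-delimited
                length, pos = decode_varint(data, pos)
                value, pos = data[pos:pos + length], pos + length
            elif wire_type == 5:    # fixed32
                value, pos = data[pos:pos + 4], pos + 4
            else:
                raise ValueError("unsupported wire type %d" % wire_type)
            if field_number in known:
                fields[field_number] = value
        return fields

    # "new" message: field 1 = 150 (varint), field 2 = "hello" (added later)
    payload = bytes.fromhex("089601") + bytes.fromhex("120568656c6c6f")
    print(decode_known_fields(payload, known={1}))  # -> {1: 150}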


It's silly to think protobuf's code gen is an advantage. I can take a JSON object/XML/CSV from any API and plug it into a website that will spit out models in any language I want.

The only real advantages gRPC and protobufs have are speed and reduced data transmission.

And hey, fair enough man, if those are your bottlenecks.


How will you know the generated model actually represents everything that can be in the API if you don't have a schema?


It really sounds like you used them for the wrong use case. If you need a compact binary serialization, they are not perfect (is anything?), but they're fair enough.


We ended up wrapping them in Envoy so that our UI can consume the gRPC as regular old HTTP/1. And that's where they get the most use.

And by doing that we've added extra layers, and it ended up slower than it would have been had we just used regular REST.

Furthermore, now we need to keep Envoy up to date.

Occasionally they break their API on major versions. Their config files are complicated and confusing.

So, IMO, gRPC should only be used for service-to-service communication where you don't want to share the code with a UI and speed and throughput are very, very important.

And the speed of HTTP/1 is rarely the bottleneck.


But... Why? How is envoy, a proxy, related to your choice to use protobuf? How are your gripes with envoy relevant? This just smells like bad design


Yes, using envoy was bad design in our situation. It was premature optimisation.

We could either maintain a gRPC API and a REST API, or a gRPC API plus Envoy, or one REST API.

I am saying we should have picked one REST API and only switched to gRPC if and when we ran into scaling problems.

That would have avoided having to maintain gRPC compilers and Envoy in our security updates.


gRPC isn't just for scaling though. It's a schema format that comes with code generation; you can technically avoid the code generation if you so please.

Idk I think people are expecting either too much or too little of these tools.


So why did you choose to use grpc if you were just going to have to convert it?


We use a few of the endpoints in the backend. Service to service communication.

Those same endpoints are used with envoy. By our UI.

That choice was made to reduce code bloat.

Rather than maintain gRPC and Envoy, it's easier to just maintain one REST API.

The service-to-service communication was never a bottleneck.

So it was highly prematurely optimized.

We spend way more time keeping Envoy and our gRPC compilers up to date and free of security issues than I would like.

It's just extra software, and thus extra attack surface we didn't need, in retrospect.


You don't need to compile the protobufs. The alternative, for all serialization formats, is to either load the schema dynamically, or write the handling logic manually yourself (or write your own generator/compiler).

gRPC supports HTTPv1 and can be mapped to a RESTful API (e.g. https://google.aip.dev/131).


It was also used for Farsight's tunnelled SIE called NMSG. I wrote a pure python protobuf dissector implementation for use with Scapy (https://scapy.readthedocs.io/en/latest/introduction.html) for dissecting / tasting random protobuf traffic. I packaged it with an NMSG definition (https://github.com/m3047/tahoma_nmsg).

I re-used the dissector for my Dnstap fu, which has since been refactored to a simple composable agent (https://github.com/m3047/shodohflo/tree/master/agents) based on what was originally a demo program (https://github.com/m3047/shodohflo/blob/master/examples/dnst...) because "the people have spoken".

Notice that the demo program (and by extension dnstap_agent) converts protobuf to JSON: the demo program is "dnstap2json". It's puzzlingly shortsighted to me that the BIND implementation is not network-aware; it only outputs to files or unix sockets.

The moment I start thinking about network traffic / messaging the first question in my mind is "network or application", or "datagram or stream"? DNS data is emblematic of this in the sense that the protocol itself supports both datagrams and streams, recognizing that there are different use cases for distributed key-value store. JSON seems punctuation and metadata-heavy for very large amounts of streaming data, but a lot of use cases for DNS data only need a few fields of the DNS request or response so in practice cherry picking fields to pack into a JSON datagram works for a lot of classes of problems. In my experience protobuf suffers from a lack of "living off the land" options for casual consumption, especially in networked situations.


Why not just use cap’n’proto? It seems superior on every metric and has very impressive vision.

Honestly the biggest failing for those guys was not making a good JavaScript implementation. Seems C++ ain't enough these days. Maybe Emscripten works? Anyone tried it?

https://news.ycombinator.com/item?id=25585844

kenton - if you’re reading this - learn the latest ECMAScript or Typescript and just go for it!


> kenton - if you’re reading this - learn the latest ECMAScript or Typescript and just go for it!

I mean, if I had infinite time, I'd love to! (Among infinite other projects.)

But keep in mind Cap'n Proto is not something I put out as a product. This confuses people a bit, but I don't actually care about driving Cap'n Proto adoption. Rather, Cap'n Proto is a thing I built initially as an experiment, and then have continued to develop because it has been really useful inside my other projects. But that means I only work on the features that are needed by said other projects. I welcome other people contributing the things they need (including supporting other languages) but my time is focused on my needs.

My main project (for the past 7 years and foreseeable future) is Cloudflare Workers, which I started and am the lead engineer of. To be blunt, Workers' success pays me money, Cap'n Proto's doesn't. So I primarily care about Cap'n Proto only to the extent it helps Cloudflare Workers.

Now, the Workers Runtime uses Cap'n Proto heavily under the hood, and Workers primarily hosts JavaScript applications. But, the runtime itself is written in C++ (and some Rust), and exposing capnp directly to applications hasn't seemed like the right product move, at least so far. We did recently introduce an RPC system, and again it's built on Cap'n Proto under the hood, but the API exposed to JavaScript is schemaless, so Cap'n Proto is invisible to the app:

https://blog.cloudflare.com/javascript-native-rpc

We've toyed with the idea of exposing schemaful Cap'n Proto as part of the Workers platform, perhaps as a way to communicate with external servers or with WebAssembly. But, so far it hasn't seemed like the most important thing to be building. Maybe that will change someday, and then it'll become in Cloudflare's interest to have really good Cap'n Proto libraries in many languages, but not today.


>Maybe Emscripten works?

It does, with minor hacks. I have a C++ application compiled with Emscripten using Cap'n Proto RPC over WebSockets. That is, if you are mad enough to write webapps in C++...

My gripe with Cap'n Proto is that it is inconvenient to use for internal application structures: either you write boilerplate to convert to/from application objects, or you deal with clunky Readers, Builders, Orphanages, etc. But then again, I probably went too far by storing Cap'n Proto objects inside a database.


Reddit moved to gRPC and protobuf from Thrift a couple of years ago. I wonder how it is going for them. https://old.reddit.com/r/RedditEng/comments/xivl8d/leveling_...


For those looking for a minimal and conservative binary format, there is BARE [1]. It is in the process of standardization.

[1] https://baremessages.org/


Why would one use BARE over the handful of more well known serialization libraries at this point? For example, what makes it stand out over protobuf? It's not jumping out to me.


I wish DevTools had an API to let extensions display content in the network tab in formats other than JSON or XML, or at least add a few formats like protobuf.


I thought this was going to be about physically storing memory in wires, i.e. core memory.


Core didn't store memory in the wires, though. You're thinking of delay lines.


Yes, it's delay-line memory.


Eh, I struggle to say that pb has a "wire" format. A binary encoding sure.

To me wire format implies framing etc, enough stuff to actually get it across a stream in a reasonable way. For pb this usually means some sort of length delimited framing you come up with yourself.

Similarly, pb doesn't have a canonical file format for multiple encoded buffers.

For these reasons I rarely use pb as an interchange format. It's great for internal stuff and fine if you want to do your own framing or file format, but if you want to store data and eventually process it with other tools, you are better off with something like Avro, which does define things like the Object Container Format.


There’s a method to write the object with the size at the front. That’s all you need. I’ve been on teams streaming terabytes of protobuf every day. Is fine.
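
Roughly, it's just a varint length prefix per message (what writeDelimitedTo does in the Java library). A hand-rolled sketch over any byte stream, for illustration only:

    import io

    def encode_varint(value: int) -> bytes:
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            out.append(byte | (0x80 if value else 0))
            if not value:
                return bytes(out)

    def write_delimited(stream, payload: bytes) -> None:
        # length prefix first, then the serialized message bytes
        stream.write(encode_varint(len(payload)))
        stream.write(payload)

    def read_delimited(stream) -> bytes:
        length, shift = 0, 0
        while True:
            byte = stream.read(1)[0]
            length |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        return stream.read(length)

    # stream two pre-serialized protobuf payloads back to back
    buf = io.BytesIO()
    for msg in (bytes.fromhex("089601"), bytes.fromhex("0801")):
        write_delimited(buf, msg)
    buf.seek(0)
    print(read_delimited(buf).hex(), read_delimited(buf).hex())  # 089601 0801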


I find it interesting that the folks running away screaming from protobuf are using it in conjunction with gRPC. Is the problem really with the wire format or is it a problem with all of the stuff above?

I've been using protobuf for a (non-web) hobbyist project for some time now and find it fairly straightforward to use, especially when working across multiple implementation languages. For me, it seems to be a nice middle-ground between the ease of JSON and the efficiency of a hand-rolled serialization format.


We've been using GRPC to talk between a Java UI and a C# backend for several years now. Apart from some upgrade issues tying it more heavily to ASP.NET, and some around connection management, it's been completely fine.

Mind you, I can see why people used to weakly typed languages would prefer to just slam everything into JSON.


Generally agree. I know of at least two protocols using protobuf payloads over raw UDP, and it seems pretty good that way (dnstap and some networking ASICs' in-band telemetry). My biggest gripe was protobuf v3 changing things like not including default values and not being able to detect field presence, which I found very annoying.


Fortunately, you can continue to use proto2, if you decide it's superior, or use a more recent version of proto3 that supports field presence.


Protobuf has a JSON serialization format. You can use it to help define and validate JSON schemas, which is quite nice. And of course, clients don't need protobuf bindings to read the resulting JSON objects (although you lose the automatic type checking/field unpacking, etc).
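
A quick sketch of what that looks like in Python, assuming a hypothetical user_pb2 module generated by protoc from a User message with name and id fields (json_format is the official library helper):

    from google.protobuf import json_format
    import user_pb2  # hypothetical module generated by `protoc --python_out=.`

    user = user_pb2.User(name="ada", id=42)
    print(json_format.MessageToJson(user))  # -> JSON text with "name"/"id" keys

    # round-tripping validates field names and types against the schema
    parsed = json_format.Parse('{"name": "ada", "id": 42}', user_pb2.User())
    assert parsed.name == "ada" and parsed.id == 42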


Context: I started with proto2 in C++ internal to Google and still use proto3 from some Go and Rust projects over a decade later.

I can't say the wire format has ever been a problem for me directly. Newer formats have reduced some CPU overheads, but haven't pulled it all together the way official protobuf and gRPC ecosystems have.

From what I've seen the biggest problem with the wire format is that the framing for a nested message requires a varint size. You don't know how many bytes to set aside for that integer until you know how many bytes the nested message will serialize to, and this applies recursively. Without a hacky size cache [1], you get exponential runtime. Even BSON did better here; its nested document framing is fixed-size so it can always go back and patch it later with just an external size stack, no need for an intrusive size cache.
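
Roughly the shape of the workaround, as I understand it: compute sizes bottom-up once and cache them, so the write pass never has to re-derive a child's size. A toy sketch, not the real implementation:

    def encode_varint(value: int) -> bytes:
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            out.append(byte | (0x80 if value else 0))
            if not value:
                return bytes(out)

    def varint_len(value: int) -> int:
        return len(encode_varint(value))

    class Node:
        """Toy message: field 1 = bytes payload, field 2 = optional nested Node."""
        def __init__(self, payload: bytes, child=None):
            self.payload, self.child = payload, child
            self._cached_size = None  # the "size cache"

        def byte_size(self) -> int:
            # pass 1: sizes computed bottom-up exactly once, then cached
            if self._cached_size is None:
                size = 1 + varint_len(len(self.payload)) + len(self.payload)
                if self.child is not None:
                    inner = self.child.byte_size()
                    size += 1 + varint_len(inner) + inner
                self._cached_size = size
            return self._cached_size

        def write(self, out: bytearray) -> None:
            # pass 2: the cached size is what gets written as the nested frame
            out += b"\x0a" + encode_varint(len(self.payload)) + self.payload
            if self.child is not None:
                out += b"\x12" + encode_varint(self.child.byte_size())
                self.child.write(out)

    msg = Node(b"leaf")
    for _ in range(100):
        msg = Node(b"node", msg)
    buf = bytearray()
    msg.write(buf)  # each nested size is computed once, not once per level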

There are still benefits to the wire format, especially over JSON. For example, you get real 64-bit unsigned integers, and you can disable varint encoding for them (fixed64). It gives you a lot of opportunities for both accuracy and efficiency.
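
A tiny illustration of that tradeoff: a large 64-bit value costs up to 10 bytes as a varint but always exactly 8 as fixed64, while small values go the other way.

    import struct

    def varint_len(value: int) -> int:
        # number of base-128 bytes needed for an unsigned value
        return max(1, (value.bit_length() + 6) // 7)

    big, small = 2**64 - 1, 7
    print(varint_len(big), len(struct.pack("<Q", big)))      # 10 8
    print(varint_len(small), len(struct.pack("<Q", small)))  # 1 8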

The bad news is that normal idiomatic use of protobuf and gRPC infect your code. It's designed for the code to be generated in very predictable standard ways, but those aren't necessarily the ways you actually want. Even if you decide to isolate proto to a corner of your project and use your own model the rest of the time, the transformation between proto types and your own types can cost you more memory allocations and copy more memory instead of sharing existing memory. So if you care about performance, you often have to design a whole project around protobuf end-to-end, infecting your code even more than usual.

With JSON in either Go or Rust, you can make your own custom types that serialize to JSON and these types instantly feel and work first-class. You own how the schema is mapped to code; often the only living schema is defined in code anyway. In most cases you can use your own types throughout your project and serialize them to JSON as needed without JSON itself infecting your code. This helps even more if you also involve formats like BSON because they can all coexist just fine for the same types, unlike protobuf which insists on its own types and generated code.

Even if you fully embrace protobuf throughout your project, there are other problems and limitations with the generated code. For example, in official protobuf for Go, there's no way to avoid heap allocating each individual nested or repeated message, and there's no way to avoid an absurd 5-machine-word overhead in every single message. (Hopefully will be brought down to 3 soon, but it's been 5 for years).

If you're designing a proto schema around these problems, you can seriously compromise readability and maintainability just to work around poor implementation decisions in Go protobuf itself. I'm guilty of that kind of optimization, but when my team saw the benchmark numbers they agreed it was worth it. This is not the kind of decision you want to have to make in a technical project, but I emphasize that it can still be the right choice in many circumstances.

The prost crate for Rust is not official but already gives you more control over your schema. You can technically use your own Rust code instead of the generated code, though I don't see anyone actually do this and it doesn't seem to be encouraged. In any case, my biggest issue with prost is that it makes it difficult (or perhaps currently impossible) to share sub-messages with Arc [2], which on the other hand is trivial with Go just using *. While prost avoids allocations in more cases than Go protobuf, my experience has been that it avoids copies less, and some of those copies require allocations anyway.

I'm encouraged that Google's upcoming Rust library seems to be modeled after the C++ one and not the Go one. I haven't seen the latest work on it but I trust it's in a good place given how many collective decades of experience in protobuf implementation are going into it.

In summary, for a project that was explicitly designed for efficiency, in practice it can limit your code in ways that hurt efficiency more than they ever helped. And while generating code for many languages is a handy feature, that generated code is unlikely to be what you want, and the more you embrace proto throughout your project the more places you pay efficiency and maintainability tolls.

[1] https://github.com/protocolbuffers/protobuf-go/blob/1d4293e0...

[2] Not within one message to be sent to a client, because that information would be redundant. More for sharing some nested messages across bigger messages sent to multiple clients.


The sizes are computed by traversing the object tree before writing it. If you wanted a super fast implementation you could reserve space for, say, a three-byte varint and then write it as three bytes no matter what. A varint that is four bytes can still validly represent the number 5, even though that would normally take one byte.
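
To make that concrete (a throwaway sketch): a decoder accepts the padded, non-minimal encoding just fine, as long as the value actually fits in the reserved width.

    def encode_varint_padded(value: int, width: int) -> bytes:
        # pads with continuation bytes; value must fit in 7 * width bits
        out = bytearray()
        for _ in range(width - 1):
            out.append((value & 0x7F) | 0x80)
            value >>= 7
        out.append(value & 0x7F)
        return bytes(out)

    def decode_varint(data: bytes) -> int:
        result = 0
        for shift, byte in enumerate(data):
            result |= (byte & 0x7F) << (shift * 7)
            if not byte & 0x80:
                break
        return result

    padded = encode_varint_padded(5, 3)
    print(padded.hex(), decode_varint(padded))  # 858000 5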

The thing that pisses me off about protobuf is that the wire format doesn't distinguish between different types of binary data: if it said "this is a nested message" then we could decompose it, even if we didn't have the protocol definition. As it is, it could be a string or an array or an object.


On overhead: being modeled after Go also unfortunately hurts the otherwise excellent .NET gRPC tooling. A lot of types could have been based on top of standard library types rather than protobuf's own - RepeatedField<T> is just List<T> except it forces buffer reallocation. It could have been T[] too and just require the users to procure it themselves. Or a pooled type underneath.

For short-lived RepeatedFields this is a non-issue - they die in the Gen0 heap and the GC just shrugs them off, but the cost is definitely there and is felt with all these new vector DB libraries passing 1536-long buffers of F32s.


Good point about Gen0, because Go's GC isn't even generational or compacting so you really do pay for all of these nested and repeated messages in full every time. At best the allocation is serviced from the thread-local cache, but the deallocation will almost always be done by a separate GC thread that now has to do a ton of fenced loads from main memory.

If the GC fully keeps up, it's all outside your critical path and you don't really notice it. But if it doesn't keep up, then the routines doing the allocation are tasked to assist the GC to catch up, adding up to a 10ms pause in the critical path of the routines actually serving requests. This is quite an exception to the claims that Go's GC is good for low-latency applications, and it's yet another reason to contort the entire schema to minimize heap allocations.


Interesting, TIL!

.NET's GC is similar in that allocations are serviced from a thread-local allocation context and only involve the GC when they can't be. When that happens, most workloads using SRV GC go into a short stop-the-world pause to collect Gen 0. While STW does sound scary to many, such pauses can easily be sub-millisecond in reality given sufficiently GC-friendly allocation patterns, even under full allocation-throughput saturation[0].

[0] A short example demonstrating that the GC just frees up Gen0 as soon as it fills up, which is very cheap, even if it has to be done very frequently: https://gist.github.com/neon-sunset/62115b5d9aa5027b22fa00f8...

(The GC used here is the latest SRV GC + DATAS mode, which is planned to become the default later on. Practically speaking, it has little impact under saturation and is more interesting under moderate to light allocation rates, as it solves the historical issue of SRV GC being quite happy to hoard memory pages from the kernel for a long time even if the actual heap size was very small.)


Yes, I wonder if anyone uses Protobuf encoded payloads over plain old HTTP REST calls.


I've used Twitch's Twirp before and it does that. It is a great middleground between gRPC and plain-HTTP services. :)

https://github.com/twitchtv/twirp


Yes! It's amazing; it really should be the default people reach for. The vast majority of people don't need or want the complexity of gRPC, they just want protos.


We do flatbuffers (super similar) over websockets/http rest. Works beautifully. gRPC is the culprit here.


Yeah we stream a bunch of telemetry and GPS/attitude data in protobufs over a websocket and it works beautifully.


The Connect RPC protocol is pretty much that: https://connectrpc.com/docs/protocol


The one really annoying decision in gRPC's design is tying it to HTTP/2 without an official alternative. It optimizes for complicated high-throughput and bidirectional cases, while making it slightly harder to use for simple one-at-a-time "client server unary" RPC calls.


> plain old HTTP REST calls

Please don't add to the confusion around the term REST. These days most people just mean they use the GET/POST/PUT/DELETE verbs specifically, which is just using the HTTP protocol itself, no REST about it.

https://htmx.org/essays/how-did-rest-come-to-mean-the-opposi...


I have experience doing so with .NET, simply because writing the RPC/contract definition in a .proto file once and then having the gRPC tooling generate all the boilerplate associated with it is much better than dealing with existing OpenAPI generators - this slots right into the .csproj, where you just add a package reference and a reference to the .proto file and you are ready to go.



Yes! Twirp - the last two companies I've worked at have used it to great success. Protos over plain ol' HTTP, without all the weird bespoke network stuff of gRPC. We just use regular DNS, load balancers, etc. It should be way more popular IMHO.


I've done this in a lot of projects -- it works great. Protobuf is nice in a lot of ways and pretty simple, while gRPC is overkill (imo) for a simple web server that doesn't see tons of traffic.
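
The whole thing is basically just the below. The module, message fields, endpoint, and the application/x-protobuf content type are made-up conventions for illustration; only SerializeToString/ParseFromString are the standard generated-class calls.

    import requests
    import user_pb2  # hypothetical module generated by protoc

    req = user_pb2.User(name="ada", id=42)
    resp = requests.post(
        "https://api.example.com/users",           # hypothetical endpoint
        data=req.SerializeToString(),
        headers={"Content-Type": "application/x-protobuf"},
    )

    reply = user_pb2.User()
    reply.ParseFromString(resp.content)            # decode the protobuf body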


We use it in a client-facing application to keep state of a complex configuration, primarily as a means of having a URL-safe way to encode that configuration. It works great, very happy with it.


Google’s AdX was entirely protobuf, but now they offer json too.


Performance and the promise of seamless type safety?


Yeah protobuf is a good IDL and encoding. Unfortunately gRPC makes some choices that make sense for internal RPCs in a large engineering org, but it's not good for external clients IMO.


We had a client choose protobufs/gRPC, which totally stalled the developers and created a lot of problems and complexity. The client insisted for whatever reason and eventually ran out of money. Their unfinished code is sitting in some GitHub repository somewhere.

Run very fast from it, unless you have a VERY good reason to use it.


Could you explain more? I'm using protobuf (without gRPC) in a project right now. Yes, getting your toolchain set up to compile and update everything when you make edits to your .proto files takes a little bit of work, and you need to do some planning ahead when it comes to your data model since you're not quite as flexible as with JSON, but on the whole it has been not much trouble at all.

And on the performance side, parsing our data takes ~1 second in JSON vs ~0.03 seconds with protobuf (in python).


Sure that happens if the people working on your project are incompetent, and then start blaming you.


I've never used gRPC, but my experience with protobuf is that it's quite easy to write and integrate. Easier than XML or JSON, at least.


What was the source of the complexity? I haven't used protobuf for any production code, but the concept seems pretty straightforward.

I do see how it could be premature optimization, as JSON is even quicker to get up and running, and the overhead of bigger payloads and parsing costs isn't relevant until you've achieved some scale.


It's a great question -- our developers had never used it before and found it counterintuitive and hard to find good information on how to use it. Projects are under time pressure, so it's sometimes hard to have the mental space to properly grok a new framework, especially if it's quite different.

I'm sure it works well for Google.

Also, the support tooling is generally lacking compared to, say, JSON or SQL or Python or any other technology.

Bugs were hard to diagnose -- I'm assuming if you were a gRPC pro this would be OK.


> Bugs were hard to diagnose

What sort of bugs - bugs in code using GRPC, or in the GRPC client/server code itself?

At least in the languages I've used with it, you can dump gRPC messages to JSON if you need to get aggressive with logging detail to find something.


> our developers had never used it before and found it counterintuitive and hard to find good information on how to use it.

This heavily suggests your developers might be the problem…


I've used it for marshalling data and it's worked quite well for us. I did not use grpc, though.

Perhaps my use case was far simpler.


Protobuf is rarely the issue; it's almost always gRPC. If you see someone complaining about protobuf, it's almost always because they used gRPC. You can see it plainly in this thread. Anyone who used it on its own is happy with it.


Yeah, I've used protobufs in production for at least a decade but haven't ever used gRPC in anger. It's come up a few times over the years and each time I look at it I just shudder and think "nope, that's not a level of complexity this project needs"




