Protobuffers Are Wrong (reasonablypolymorphic.com)
229 points by based2 on Dec 24, 2019 | 210 comments

This was like 2 or 3 sentences worth of good points sprinkled in between a bunch of stuff that I'm pretty sure isn't an actual problem for anyone. If anything, the things pointed out are issues for people trying to write code that manipulates protos generically, which is not what most people spend their time writing and is probably exactly the wrong thing to optimize.

The main good point: Google's problems are probably not your problems, don't just blindly adopt Google tech for no reason.

Also: calling people amateurs without really substantiating is a huge smell IMO. The average Google engineer isn't a genius or particularly amazing by any stretch, but especially for something as core/foundational as protobuf, the answer is much more likely something like "these decisions made more sense for Google internally, especially when weighing against the cost of significantly re-architecting how proto works". The ad-hominem at the beginning reeks of someone who had an email chain that went like:

"You guys are doing proto wrong, don't you realize protos should obviously be like XYZ?"

"Well actually we'd like to do X but it would've been too hard, I'm not actually sure Y is a net-win, ..."

(omg what amateurs...)

> calling people amateurs without really substantiating is a huge smell IMO. The average Google engineer isn't a genius or particularly amazing by any stretch

Note that Jeff Dean and Sanjay Ghemawat -- the original creators of Protobuf -- aren't average Google engineers. They are literally the highest-ranked engineers at Google (Level 11, "Senior Fellow", a title assigned only to the two of them last I heard), and they basically invented MapReduce, BigTable, Spanner, and a variety of other foundational distributed systems technologies. Jeff now leads the AI division while Sanjay continues to focus on systems infrastructure.

So yeah, "amateurs".

(Disclosure: I wrote Protobuf v2, but it was just a fresh implementation of the same design.)

While I agree with alecbenzer's original point, I don't think the fact that Dean and Sanjay created protobuf is a signal that they made the best decisions. Everyone has strengths and weaknesses even within the same job family.

I've encountered plenty of genius-level engineers when it comes to algorithms, architectures, and distributed system design that wrote GOD-AWFUL unreadable and unmaintainable code. I wouldn't trust them to design an API or a common framework optimized for usability.

In fact I almost wonder if intelligence is a hindrance in such cases. When you're too much like Cypher, and "don't even see the code", all code and all library choices feel equivalent.

Sure, but you didn't address any of the points in TFA.

Whether Dean and Ghemawat are high ranking or not, do the points stand? Is the design of protobuf solid?

It's possible to criticize one part of the article (calling people amateurs) without criticizing other parts of the article.


We've banned your accounts in this thread. Please don't create accounts to break HN's guidelines with; it eventually gets your main account banned as well.


Also, please don't create accounts for every few comments you post. We ban accounts that do that. This is in the site guidelines too.

HN is a community. Users needn't use their real name, but do need some identity for others to relate to. Otherwise we may as well have no usernames and no community, and that would be a different kind of forum. https://hn.algolia.com/?query=by:dang%20community%20identity...

You want HN to be a community, start obeying the rules yourself.

HN is not a community, it’s a cargo cult where only certain thoughts are allowed and you ban people who think differently, no matter how well they present their argument or how polite they are.

Further, you engage in ad hominem and dishonest attacks, name calling, and violations of every other rule in the rule book you capriciously enforce against everyone else.

Because of this, you have no standing to expect anyone to ever respect your rules.

This is why this site is considered a joke to the rest of the world and the tech community.

It’s the epitome of Bay Area smelling-ones-own-farts, and you’re oblivious.

You need to separate publicity statements or interview answers from technical contributions. They are so incompatible as to be meaningless comparisons and are intended for totally different uses. I doubt he or any senior engineer at any company would make a statement like that in a design discussion or product review, but they'd all turn around and wave hands at the future like that over beers with a journalist.

What is the point of your justification? That it's ok to spout nonsense during interviews? If so that's a weird position to take.

Alright, I'll condense: High level statements about vague possibilities are perfectly fine in some settings, but not in others. They sure smell like bullshit in a highly technical setting, or to people who always take the most technical angle.

You don't get to lead thousands of people by making publicly offensive statements, even accidentally. So yes, that does lead to more boring answers to newspapers.

Politicians don't swear in public, but they sure do in private. I'm sure a smaller conversation with Jeff would be fascinating.

> You don't get to lead thousands of people by making publicly offensive statements ...

That's not correct. There are (current) examples of people leading thousands of people (or more) by making publicly offensive statements.

It's just that the statements are polarising (on purpose), to create feelings of inclusion among the supporters and to blame [social problems?] on the people being pointed at.

If you're going to be an edgelord and call Jeff Dean an "idiot", have the guts not to do it on a throwaway account.

Why? So you can tar, feather and throw him into the fire for crimethink?

TIL filling time in an interview instead of being as concise as possible makes you an idiot. I thought it was just called 'being a good interview subject'.

Does that retroactively make protobuffers stupid too?

It doesn't but I don't get why saying someone is smart makes their work better when smart people obviously make stupid decisions and give stupid answers. I've used protobuffs, they're fine. I was riffing on Kenton's justification.

My only point was that it's absurd to call them "amateurs", and that the author damages his own credibility by doing so.

Because being smart is related to making less stupid decisions? Think about it in terms of probability.

Smarter -> less probability of making stupid decisions and giving stupid answers.

Dumber -> more probability of making stupid decisions and giving stupid answers.

That's like 100 words to say "Yes."

As a Googler who has seen gRPC and Protobufs continually rammed into places where they clearly don't belong (ahem, embedded systems), I actually have a lot of sympathy with this article. I wouldn't call the authors of protobufs amateurs, but I do think there are flaws there, many of them around the type system, as this article points out, but also a lot around the language APIs.

My biggest concern with protobuf though is that it ends up becoming the proverbial "I have a hammer, now everything looks like a nail" scenario. Every Googler goes through orientation with protobufs and gRPC, and then they proceed to stamp it everywhere... including places it may not belong. And then take it with them when they leave Google.

I think the article tone is inflammatory but most of the points are solid.

> many of them around the type system, as this article points out, but also a lot around the language APIs.

I think there's a lot not to love about gRPC/protobufs (used them at Google and again now at a startup) but I don't feel like this did a good job highlighting those issues.

I wrote my own serialization library for embedded systems recently. It serializes data that can be defined and initialized natively in C. The intention is for data section only data, not heap data. So for example, you can have variable length arrays, but you must have a maximum length to match the preallocated C array declaration.

The serialized format mimics JSON, except that it supports tables (arrays of structs). This saves space compared with actual JSON, since the column names only have to be given once. Also, tables where the columns are primitive types can be loaded directly into common spreadsheet applications. This is useful for my specific application.

The application has a CLI that allows you to browse the data with xpath like expressions. You can set or get any field of the hierarchy (using the serialization format), and since the type is known it can parse and schema check the user input.

I use the serialized format for storing a copy of the data in flash memory, so that it can be restored on boot up.

The metadata with the type information is all marked as const. In embedded systems, this data is placed in flash memory instead of precious RAM.

Sounds cool, is it open source?

> sprinkled in between a bunch of stuff that I'm pretty sure isn't an actual problem for anyone

I'd push back at this. After starting to use protos (even at Google) these issues smack pretty much everybody right in the face.

Also worth calling out: protos have evolved a lot since they were originally created, and there are very few people that are actually aware of the deep, dark corners of protos (think extensions, JS, Android...). (Got a complaint about what Google's open sourced for protos? You don't even want to imagine some of the batshit crazy stuff we did with it before that...)

> extensions

As someone who has done extensive work with Protobuf extension fields in Python, I can only echo that the API is nightmarish.

Using the word "amateur" to mean "inexperienced" or "unskilled" is a smell, or, more precisely, an equivocation: It lumps not being paid to do something with doing that thing poorly, which definitely aren't the same thing.

>This was like 2 or 3 sentences worth of good points sprinkled in between a bunch of stuff that I'm pretty sure isn't an actual problem for anyone. If anything, the things pointed out are issues for people trying to write code that manipulates protos generically, which is not what most people spend their time writing and is probably exactly the wrong thing to optimize.

Handling a serialization scheme "generically" is the "wrong thing to optimize"?

Sounds like the #1 thing anybody would want from it...

Have you ever worked in a polyglot ecosystem with rapidly evolving schemas?

Tools like protobuf and thrift were designed to facilitate schema evolution since interfaces in these ecosystems evolve quickly and independently. Generics undermine this by creating strict dependencies on a few types, making it difficult to evolve a single type without breaking things.

Poorly implemented generics would undermine one of the design goals of this project. In addition, there aren't nearly as many opportunities for generics in an IDL as in a programming language, so what would the upside even be?

Obviously you need certain core pieces of infra that handle protos generically (like serializing and deserializing) but

a) total SLOC for that logic is much, much less than SLOC for code that works with _specific_ protocol buffers

b) "consumers" of protocol buffers as a tool/technology mostly don't worry about what's going on under the hood of the generic serialization, etc. code

So it will often make sense to make that core generic logic even significantly more complicated if it means making the stuff that everyone has to write over and over again even a little bit easier.

The author lost me at “make all fields required” for some nice type system properties.

`required` was a mistake, in my opinion and in the proto 3 spec’s opinion. Cap’n Proto has a nice write-up here too, with essentially the same points I would make but written better: https://capnproto.org/faq.html#how-do-i-make-a-field-require...

I think protos might just be being used for the wrong thing in the author’s example. You shouldn’t replace your application’s data structures with protos everywhere. In my experience, protobufs are for when you want to serialize and would otherwise write a bunch of backwards-compatible serialization code by hand. That code is hard to generate because it encapsulates all the changing requirements needed to work across different versions, so the lack of general type system tools doesn’t really offer an opportunity to cut down on the schlep. If you don’t have these problems, and don’t think you will have them, evaluate whether the tech is right for you. I’ve worked on projects at Google that made this mistake and threw away a nice data model expressible in the language in favor of proto interfaces where there was no need for serialization. I don’t think the solution is to expand protos until they are comparable to that in every language.

Disclaimer: Googler who is forced to use a lot of protos, my opinions are my own and I didn’t design or ever work on them directly. Probably also just an amateur :D

> I think protos might just be being used for the wrong thing in the author’s example. You shouldn’t replace your application’s data structures with protos everywhere

Yeah, I really scratched my head at that part. Even in my small line-of-business TypeScript apps, I often have separate interfaces for the "API models" (what I get as JSON from the API) and the "app models" (the models my app uses to represent the internal state of things).

In the big .NET platform I work on, we have multiple namespaces of Models and Dto's to represent different things at different interface boundaries.

Why on earth would you try to re-use a network-layer interface in levels several layers higher? That's crazy talk!

> Why on earth would you try to re-use a network-layer interface in levels several layers higher? That's crazy talk!

It's theoretically ugly but in practice it can be really convenient. Translating large structures between different formats is tedious and error-prone... sometimes it's just a lot faster and easier to leave it as a Protobuf.

Yes but proto 3 was also a mistake. Throwing away presence entirely was wrong and not preserving unknowns was also wrong. Proto 2 forever, imho.

I agree. Preserving unknowns fortunately was recognized as a mistake, and fixed in some more recent version, and presence can be worked around: it is still preserved for message types, so you only need to wrap your primitives into objects[1], Java style.

[1] - https://developers.google.com/protocol-buffers/docs/referenc...

That's not "agree" with "not preserving unknowns was also wrong."

Sorry, I meant not preserving unknowns was also wrong. Proto2 did preserve them, then proto3 didn’t, and then they realized it was a bad idea and went back to preserving them.

This gets to the article's issue but from the dynamic typing angle as well.

By not allowing presence checks, one has to use convention _in every single class_ to determine basic things like PATCH semantics (https://github.com/protocolbuffers/protobuf/issues/359). This makes it impossible to treat protobuf as a general data format and requires object-specific logic to properly composite data structures. In some cases it's impossible to even do PATCH correctly without excluding sentinel values from the allowed range and having the application developer know about it.
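A minimal Python sketch of the PATCH ambiguity described above. It uses plain dicts rather than any real Protobuf API, and every name in it (`DEFAULTS`, `decode_without_presence`, the two `apply_patch_*` functions) is hypothetical. The point is only that once an unset field and a field set to its default value decode identically, a "reset to zero" patch cannot be told apart from "leave unchanged".

```python
# Hypothetical illustration; plain dicts stand in for decoded messages.
DEFAULTS = {"name": "", "retries": 0}

def decode_without_presence(wire_fields):
    """Mimic a decoder with no presence tracking: unset fields read as defaults."""
    return {key: wire_fields.get(key, default) for key, default in DEFAULTS.items()}

def apply_patch_without_presence(stored, patch_wire_fields):
    """Convention-based PATCH: a default value is taken to mean 'leave unchanged'."""
    patch = decode_without_presence(patch_wire_fields)
    merged = dict(stored)
    for key, value in patch.items():
        if value != DEFAULTS[key]:
            merged[key] = value
    return merged

def apply_patch_with_presence(stored, patch_wire_fields):
    """Presence-aware PATCH: only the fields actually sent are applied."""
    merged = dict(stored)
    merged.update(patch_wire_fields)
    return merged

stored = {"name": "job-a", "retries": 3}
patch = {"retries": 0}  # the caller wants to reset retries to zero

lossy = apply_patch_without_presence(stored, patch)  # retries stays 3
exact = apply_patch_with_presence(stored, patch)     # retries becomes 0
```

Under the convention-based version, zero has effectively become a sentinel that callers can never write, which is the "excluding sentinel values from the allowed range" problem.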

There are so many other problems with Proto v3, but this one is glaring.

It can be approximated with a single-item `oneof` field. It's ugly and boilerplatey, but at least it's binary-compatible with proto2 and gives the original behaviour.

My main problem with proto2 these days is that I needed to interface with some C# code, and there is no proto2 library for C#!

Personally I even like this more since this feels much more explicit. The idea of depending on being set to the null value or unset at all gets my spider sense shivering and shuddering; someone is going to miss this, probably me.

Dude, there are tons of packages for doing Protobuf in C#, I've been writing systems in C# that used proto for over three years I think at this point. Here's one for starters: https://www.nuget.org/packages/protobuf-net

Worth calling out that Google literally compiled a list of outages caused by `required` fields before killing that field parameter.

This is the wrong argument. Who cares about the type system of a binary packaging format? The joy is how these messages can be used as rows in storage systems as well as RPC. Complicating the type system limits the domain applicability and increases the porting cost. No.

Protobuffers are shit coz they don't support zero copy and you have to deserialize the whole thing even if you are interested in one field or an outer envelope, causing memory churn in your JVMs. Cap'n'proto and flat buffers attack this real problem. The expressivity of the type system is a minor issue, hence no credible competition.

Note grpc abandoned required fields! Nothing is required over a decade, backward compatibility is important! Required should be enforced at the application layer, not the binary packing layer. It is a property of the version of the code processing the blob, not the blob representation itself.

Ex googler and equally happy openAPI spec user.

> Who cares about the type system of a binary packaging format?

The people who have to write them and map them to actual domain data structures. Monomorphizing by hand and working around oneof+repeated's crap is an absolute joke. Tools in widespread use should do better.

This is going to sound sarcastic but it's not: Can we get back to just putting the members of C structures into network byte order and sending that over the wire in binary, à la 1995?

This is more or less what https://capnproto.org/ does.

I think this is also what Rust's bincode crate does. It's very sane and you can just open up files in a hex editor and see what's there.


From the capnproto docs:

>Isn’t this all horribly insecure?

>No no no! To be clear, we’re NOT just casting a buffer pointer to a struct pointer and calling it a day.

Isn't this a direct contradiction to your claim? Or have I misunderstood them?

IIRC: capnproto generates messages that you could deserialize by casting them to the right struct, but refrains from actually doing it that way. Instead it generates a bunch of accessor methods that parse the data, as if you were reading something that's not basically a c-struct, like a protobuff.

That's basically correct. Cap'n Proto generates classes with inline accessor methods that do roughly the same pointer arithmetic that the compiler would generate for struct access.

There are a couple of subtle differences:

* The struct is allowed to be shorter than expected, in which case fields past the end are assumed to have their schema-defined default values. This is what allows you to add new fields over time while remaining forwards- and backwards-compatible.

* Pointers are in a non-native format. They are offset-based (rather than absolute) and contain some extra type information (such as the size of the target, needed for the previous point). Following a pointer requires validating it for security.

(Disclosure: I'm the author of Cap'n Proto.)
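The first bullet (a struct that is shorter than expected reads as defaults) can be sketched in a few lines of Python. This is not Cap'n Proto's real encoding or API. The `LAYOUT` table, the field names, and `read_struct` are made up for illustration, and the real format carries the struct size in the pointer rather than inferring it from the buffer length.

```python
import struct

# Hypothetical fixed layout: (field name, struct format, default value).
# This is NOT Cap'n Proto's real encoding, only the "short struct" idea.
LAYOUT = [
    ("id", "<I", 0),
    ("flags", "<H", 0),
    ("priority", "<H", 5),  # imagined as added in a later schema version
]

def read_struct(buf):
    """Read fields at fixed offsets; fields past the end of buf get defaults."""
    result = {}
    offset = 0
    for name, fmt, default in LAYOUT:
        size = struct.calcsize(fmt)
        if offset + size <= len(buf):
            (result[name],) = struct.unpack_from(fmt, buf, offset)
        else:
            result[name] = default
        offset += size
    return result

old_message = struct.pack("<IH", 7, 1)      # written by an old sender
new_message = struct.pack("<IHH", 7, 1, 9)  # written by a new sender

old_view = read_struct(old_message)  # priority falls back to 5
new_view = read_struct(new_message)  # priority is 9
```

Because new fields are only ever appended, an old reader simply ignores the tail it does not know about and a new reader fills in defaults for the tail that is missing.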

Re-read the comment I think. It doesn't say casting a struct pointer. It says putting the members of the struct into network byte order over the wire. I read that as individually serializing each member in a portable, safe way.

Anyway even if you do choose the struct pointer hack (which I do not see advocated here) it can be done relatively well albeit requiring language extensions and a bit of care. Pragmas and attributes to ensure zero padding and alignment between members. No pointer members. Checking sizes and offsets after a read (the hardest part).

"As of this writing, Cap’n Proto has not undergone a security review, therefore we suggest caution when handling messages from untrusted sources."

Something like that has to be rigorously tested or proven to be free of buffer overflows. It's so easy to attack with malformed messages. Parsers for remote messages are a classic source of vulnerabilities. It's hard to test this, because it's a code generator.

This looks promising as an attack vector for a big system built on microservices. If you can find an exploit in this that lets you overwrite memory, and can break into some service of a set of microservices by other means, you can leverage that into a break-in of other services that thought their input was a trusted source.

The "zero overhead" claim goes away as soon as you send variable length items. Then there has to be some marshaling.

> As of this writing, Cap’n Proto has not undergone a security review

This is outdated, I should remove it. Cap'n Proto has been reviewed by multiple security experts, though not in a strictly formal setting. I trust it enough to rely on it for security in my own projects, but yeah, I am cautious about making promises to others...

> Something like that has to be rigorously tested or proven to be free of buffer overflows.

I've done a bunch of fuzz testing with AFL and by hand. I've also employed static analysis via template metaprogramming to catch some bugs. See:


(That was... almost five years ago.)

> The "zero overhead" claim goes away as soon as you send variable length items. Then there has to be some marshaling.

Space for messages is allocated in large blocks. The contents of the message are allocated sequentially in that space and constructed in-place. So once built, the message is already composed of a small number of contiguous memory segments (usually, one segment), which can then be written out easily. Or, if you're mmaping a file, you can have the blocks point directly into the memory-mapped space and avoid copying at all -- hence, zero-copy.

So no, there is no marshaling.
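A rough Python sketch of the "construct in place, then write the segment out" idea. It is only an analogy: the segment size, the `allocate` helper, and the field layout are all assumptions, and none of this is Cap'n Proto's actual layout.

```python
import struct

SEGMENT_SIZE = 64  # hypothetical segment size
segment = bytearray(SEGMENT_SIZE)
used = 0

def allocate(fmt, *values):
    """Construct values in place at the next free offset; return that offset."""
    global used
    offset = used
    struct.pack_into(fmt, segment, offset, *values)
    used += struct.calcsize(fmt)
    return offset

header_at = allocate("<II", 42, 2)  # id and item count
items_at = allocate("<2H", 10, 20)  # variable-length part, placed right after

# "Sending" is just a view of the used prefix; there is no per-field
# marshaling pass and no copy of the data.
wire = memoryview(segment)[:used]

# A reader can pull fields straight out of the same bytes.
items = struct.unpack_from("<2H", wire, items_at)
```

The variable-length part costs an allocation at a later offset in the same segment, not a separate serialization step.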

Cap'n Proto is better than C at struct layout. We are not, under any circumstances, going back to the 90s. We are moving forward and learning from mistakes.

I would like to submit Apple's archaic “Rez”[1] as a great language for declaring binary formats. It was designed to be able to describe C and Pascal structures.

[1] http://preserve.mactech.com/articles/mactech/Vol.14/14.09/Re...

The wire encoding for protos is much more compact than the in-memory representation, especially for sparsely populated messages (very common especially in mature systems).

You'd still have to figure out some way to serialize nested messages. Note that you can have recursive message definitions.

> Can we get back to just putting the members of C structures into network byte order and sending that over the wire in binary, à la 1995?

I hope not (I know you are being sarcastic). We should use something that's trivial to implement correctly, as well as easy to read and to debug.

Is that less of a configuration mess than WCF was? JSON isn't "The Magical Elixir" of data exchange and I'm more than open to something better but at least we (in the .NET community) have moved past the WCF configuration nightmares.

WCF is an unmitigated dumpster fire. We have actually written a non-WCF client that uses a raw HttpClient implementation with StringBuilder to compose SOAP envelopes around cached XMLSerializers in order to talk to other WCF services. First request delay went from 1-2 seconds down to a few milliseconds. Memory overhead is negligible now. Prior, you could watch task manager and immediately recognize when WCF is "warming up". Additionally, the XML serializer in .NET seems almost pathologically determined to ruin everything you seek to accomplish.

By comparison, JSON contracts are an absolute joy to work with. We still practice strong-typing on both sides of the wire (we control both ends), and have pretty much nothing to complain about. If you are concerned with space overhead w/ JSON, simply passing it through gzip can get you down to a very reasonable place for 99% of use cases. I understand that there are arguments to be made against JSON for extremely performance sensitive applications, but I would counter-argue that these are extremely rare in practice.

Isn’t the problem with WCF rather than XML?

Protobufs have a pretty nice variable encoding integer wire format. This gives you the flexibility of saving space without doing compression.

While zero copy is nice, you cannot make it work when using compression.

That variable integer encoding is however slow for encoding/decoding. The space savings are of questionable worth.

They are the same size as UTF-8 numbers but much slower to decode. I think the more-bit format is the only glaring mistake in proto that can never be fixed.
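For readers who have not seen the format being argued about, here is a small Python sketch of a base-128 varint as described in the Protobuf encoding documentation: seven payload bits per byte, least-significant group first, with the high bit meaning "more bytes follow". The function names are mine, and this handles only non-negative integers (no zigzag, no 64-bit limit).

```python
def encode_varint(value):
    """Encode a non-negative int as a base-128 varint, low 7 bits first."""
    if value < 0:
        raise ValueError("this sketch handles only non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the "more bytes follow" bit
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos=0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]  # raises IndexError on truncated input
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

small = encode_varint(1)    # one byte
medium = encode_varint(300) # two bytes: 0xAC 0x02
```

The decoder cannot know how long the value is until it has inspected the continuation bit of each byte in turn, which is the data-dependent loop the comments above are complaining about.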

This gets hairy the moment you want to add new fields.

Both protobuf and plain C structs are append only formats if you put the message-type and size at the start of the C-struct.

C structs do not compose extensively. Protobufs do. You can't put variable length data into a struct, and hence you can't put extensible structs into it either.

You can put variable length data into a struct:


Only one field can be variable length, and it must be last. I'll pass.

You definitely can but it's not as obvious, make a separate message type for list elements and append them on the wire. If you only have one list at the tail, you can use a flexible array[] at the end but it's finicky to deal with if you need more than one.

You can build large hierarchical structures of messages with lists contained therein. It's pretty much how .mov/.mp4/many, many media container formats work. The technique dates back to the Amiga days.
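A sketch of that chunked approach in Python, under assumptions of my own: an 8-byte header made of a 4-byte type tag and a 32-bit big-endian payload length, with made-up tags `NAME`, `LIST`, and `ITEM`. Real container formats differ in the details (field order, whether the length includes the header, alignment).

```python
import struct

HEADER = struct.Struct("!4sI")  # hypothetical header: type tag, payload length

def write_chunk(tag, payload):
    """Prefix a payload with its type tag and length."""
    return HEADER.pack(tag, len(payload)) + payload

def read_chunks(buf):
    """Yield (tag, payload) pairs from a buffer of back-to-back chunks."""
    pos = 0
    while pos < len(buf):
        if pos + HEADER.size > len(buf):
            raise ValueError("truncated chunk header")
        tag, length = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        if pos + length > len(buf):
            raise ValueError("chunk length runs past end of buffer")
        yield tag, buf[pos:pos + length]
        pos += length

# A list is a container chunk whose payload is itself a run of chunks.
items = write_chunk(b"ITEM", b"first") + write_chunk(b"ITEM", b"second")
message = write_chunk(b"NAME", b"demo") + write_chunk(b"LIST", items)

top = list(read_chunks(message))
nested = list(read_chunks(top[1][1]))
```

Since every chunk carries its own length, a reader that does not recognize a tag can skip it, which is what lets any number of variable-length lists coexist in one message.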

This is practically exactly what Protobuffers are. Except that they actually are defined clearly enough that multiple services written in multiple languages can work with them.

Definitely not, protobuf's strange wire format becomes apparent if you ever look at the hexdump of one or the profiler output of your favourite protobuffer-decoding C/C++ application.

They're actually kind of performance heavy for no benefit.

I have once looked at a benchmark that compared protobuffer, message pack, json and a variety of other serialization formats. In terms of reducing bytes per message gzipped json was ahead of all of them at the cost of increased CPU time for gzip. Protobuffer did pretty poorly, the only benefit was decreased CPU usage. I'm sure you could use some other compression algorithm like LZMA to get both good compression and good performance for JSON messages.

> In terms of reducing bytes per message gzipped json was ahead of all of them

Try gzipping the protobuf. Binary encoding and compression are different things which can be stacked. Gzipped protobuf should be smaller and faster than gzipped json in basically all cases.
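A small Python sketch of the "encoding and compression stack" point. It does not use Protobuf at all: a fixed-width `struct` layout stands in for "some binary encoding", the record shape is invented, and `zlib` stands in for gzip. Which combination ends up smallest depends heavily on the data, so the sketch collects the sizes rather than asserting a winner.

```python
import json
import struct
import zlib

# Hypothetical data set: 1000 (id, value) records.
records = [(i, i * 3) for i in range(1000)]

as_json = json.dumps([{"id": i, "value": v} for i, v in records]).encode()
as_binary = b"".join(struct.pack("<II", i, v) for i, v in records)

# Compression is applied on top of either encoding, independently of it.
sizes = {
    "json": len(as_json),
    "binary": len(as_binary),
    "json+zlib": len(zlib.compress(as_json)),
    "binary+zlib": len(zlib.compress(as_binary)),
}

for name, size in sizes.items():
    print(f"{name:12} {size} bytes")
```

A fair benchmark compares all four cells of that table, not compressed JSON against an uncompressed binary format.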

I use LZ4 (with "best" compression) for packet captures and replay with great results.

I get about a 37% compression ratio with extremely fast decoding, like 10 million packets per second off an SSD.

It was better than snappy, gzip, and bz2 for the trade-off of compression time, decompression time and file size.

As for protobuf: flatbuffers, Cap'n Proto, HDF5, and plain C structs all deliver much, much faster decoding time. It's really not the best answer for any serialization at this point but it's still inexplicably popular.

I thought that's what Cap'n Proto was, not protobuffers.

> This is practically exactly what Protobuffers are.

Not really. Encoding/decoding protobufs is straightforward, but not nearly as simple as you’re suggesting.

Sure, but anything you're trying to transport between languages which don't even agree on endianness will end up like this.

Dumping a struct on a wire is just a wishful dream that turns into a nightmare as soon as you need to send that to a service written in another language or running on another architecture.

Don't get me wrong - there's plenty of insanity in protobufs. But trying to cover the same use-case will not create a simple protocol.

Cap'n proto didn't "end up like this" and works across languages.

Cap’n Proto isn’t well supported apart from C or Rust.

The Python library is an absolute nightmare. Their tests used to catch Exception, and what they ended up testing was basically whether their tests tried to access nonexistent attributes.

The issue is that capnproto is relatively more complex, and as such is harder to implement well.

XDR and ONC/RPC for the win!

The memory layout of a C struct is ABI and compiler dependent.

Some compilers conform to the same ABI on the same or similar systems and work almost exactly the same, so you may grow old thinking that's how it is until it's too late. I think gcc, clang and Intel work almost the same on Linux and OSX.

Indeed, that's why I specified putting the members of the C structure on the wire, not the structure as a whole, so it's just basic types in network byte order (i.e. consistent endian-ness) being sent.

I've worked on an application where that was the standard data transfer scheme. Later, while working with protobuf on another project and looking under its covers, I felt it was doing something very similar but wrapping an entire API around it.
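A minimal sketch of "each member in network byte order", using Python's `struct` module in place of C. The record shape (a 32-bit id, a 16-bit port, a 64-bit float) is an assumption for illustration. The `!` prefix selects big-endian with standard sizes and no padding, so the in-memory struct layout of either side never reaches the wire.

```python
import struct

# Hypothetical record: a 32-bit id, a 16-bit port, and a 64-bit float reading.
# "!" selects network (big-endian) byte order with no padding between members.
RECORD = struct.Struct("!IHd")

def to_wire(record_id, port, reading):
    """Serialize each member individually in network byte order."""
    return RECORD.pack(record_id, port, reading)

def from_wire(data):
    """Inverse of to_wire; raises struct.error if data has the wrong length."""
    return RECORD.unpack(data)

wire = to_wire(1, 8080, 0.5)
roundtrip = from_wire(wire)
```

This works well until the record needs a new field or a variable-length member, which is where the rest of the thread picks up.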

No, not really. #pragma pack and/or __attribute__((packed)) have been supported for eons now and guarantee the alignment of struct members between compilers.

In newer C++ specs, you can also static assert that the struct is a POD type to statically ensure that there's no accidental vtable pointer.

This argument pops up every time someone mentions this and every time it's completely uninformed.

Though it should be noted that packed structures cause compilers to produce absolutely garbage code when accessing them (because most of the accesses become unaligned) and it becomes incredibly memory-unsafe (as in "your program may crash or corrupt memory") to take pointers of fields inside the struct because they are (usually) presumed to be aligned by the compiler.

Explicit alignment doesn't suffer from this problem nearly as badly (yeah, you might have to add some padding but that's hardly the end of the world -- and if you have explicit padding fields you can reuse them in the future).
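The size difference between the three layouts can be checked from Python, whose `struct` module exposes both native alignment (`@`) and unpadded layout (`=`). The one-byte-then-four-byte struct here is a made-up example, and the native size depends on the platform ABI, though it is 8 on common x86 and ARM targets.

```python
import struct

# Hypothetical struct: one uint8 followed by one uint32.
natural = struct.calcsize("@BI")     # native alignment, like a plain C struct
packed = struct.calcsize("=BI")      # no padding, like a packed struct
explicit = struct.calcsize("=B3xI")  # padding spelled out as 3 reserved bytes

print(natural, packed, explicit)
```

The explicit version has a layout that does not depend on the compiler, keeps the uint32 on a 4-byte boundary, and leaves three reserved bytes that a later version of the format could reuse.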


Why even put them in network byte order? Every modern system is little endian, if you standardize on that, only exotic systems would have to deserialize anything.

If you force the most common system to translate byte order, then you'll have some confidence that your code is performing the translation correctly. If instead you rely on hoping that everyone added the correct no-op translation calls everywhere, you'll find your code doesn't work as soon as you port it to another CPU.

This is a nice side effect of network byte order being the opposite of the dominant cpu order, though obviously it was never intended.

Because when someone builds a hugely popular exotic system in the future, because it is one (1) cent cheaper, you'd end up with code that has to check to see if it's running on such a system.

This doesn't make any sense for multiple reasons, but especially because you wouldn't be checking anything in the first place. A big endian system would reorder bytes and a little endian system would just use it directly from memory without another copy or reordering anything.

There's not a library pattern for host to little endian, or little endian to host, like we have with hton and ntoh. Which makes it more likely to be messed up.
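The "no-op conversion hides bugs" argument can be seen in a few lines of Python. This is an illustration of the reasoning, not a recommendation either way, and the choice of a little-endian wire format here is just the hypothetical under discussion.

```python
import struct
import sys

value = 0x11223344

little = struct.pack("<I", value)  # the wire format chosen in this sketch
native = struct.pack("=I", value)  # whatever byte order the host CPU uses

# On a little-endian host the two are identical, so code that forgets the
# conversion still passes every local test; on a big-endian host it breaks.
conversion_is_noop = little == native

# Decoding with an explicit byte order gives the same answer on any host.
decoded = int.from_bytes(little, "little")
```

With a big-endian wire format the situation is reversed: the conversion does real work on the most common hosts, so a missing call is caught immediately.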

Ha - I just had this exact argument yesterday. Why indeed?

I maintain a couple of protobuf-based libraries, and the issue I've seen with its anemic type system is that it inevitably creates an impedance mismatch between itself and the host language's type system. To make the library usable, you end up having to wrap the autogenerated code in a bunch of boilerplate, which defeats one of the major selling points of gRPC/protobuf in the first place.

Exactly. I use Typescript with strict null checks. I would love to directly use the Proto objects I get from gRPC calls, but since the type of every string field is actually string|null, I have to do some validation and then turn it into an object with a regular string field (or else check that it’s not null every time I use it).

I get that forcing this validation is a good thing, especially for bigger/distributed teams. But I’m the guy who wrote the backend and I know it will return an error if it cannot set some value in that string field. Therefore I consider it “required”, and I resent the Protobuf authors’ insistence that I am wrong to ever use that concept.

That sounds like a quirk (I'd argue, a misfeature) of the specific Protobuf implementation you are using.

In most implementations, the getter methods for optional fields will return a default value if the field isn't set. If you want to explicitly check for presence, you call a separate "has" method.

There are several reasons for this design:

* Convenience of avoiding null checks when you know the field is always set.

* (Sometimes) Easy backwards-compatibility -- new fields can be declared with a default value that is appropriate when dealing with older senders.

* Security: It shouldn't be trivially easy for a malicious client to omit a field that the server is expecting will be there, causing the server to crash or throw an exception.

The C++ and Java implementations of Protobuf, at least, have always worked this way. It sounds like the TypeScript implementation you are using does not, unfortunately.
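
A toy illustration of the getter-plus-"has" pattern described above. This is a hand-written stand-in, not the output of any real Protobuf code generator; the class name, field names, and defaults are all invented.

```python
class Person:
    """Hand-written stand-in for a generated message class (hypothetical)."""

    _DEFAULTS = {"name": "", "age": 0}

    def __init__(self, **fields):
        unknown = set(fields) - set(self._DEFAULTS)
        if unknown:
            raise TypeError(f"unknown fields: {sorted(unknown)}")
        self._set = dict(fields)  # only the fields the sender actually set

    def has(self, field: str) -> bool:
        """Explicit presence check, kept separate from the getter."""
        return field in self._set

    def get(self, field: str):
        """Never returns None: falls back to the declared default."""
        return self._set.get(field, self._DEFAULTS[field])


msg = Person(name="Ada")
print(msg.get("name"), msg.has("name"))  # Ada True
print(msg.get("age"), msg.has("age"))    # 0 False
```

Callers that know a field is always set just call the getter; callers that care about presence ask `has` first.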

(Disclosure: I wrote the C++ and Java implementations of proto2.)

If you are looking for 0 copy alternative, look at https://google.github.io/flatbuffers/

Do you know if flatbuffers can be used as data structures in applications? That seems to be a shortcoming of protobuf, all the ser/de code makes it suboptimal for in-app data transport.

Zero-copy formats usually turn out to be worse for use as in-app data structures, because they have to carefully control memory allocation and layout in a way that makes it hard to mutate an already-constructed message.

E.g. in Cap'n Proto, all objects in a message are allocated sequentially within the larger message buffer. If you remove an object or change its size (e.g. overwrite a string field with a new value of different length), the new value needs to be allocated on the end and the memory space for the old value cannot be reused -- it is wasted.
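
A toy model of that allocation behaviour, to show where the waste comes from. This is an append-only arena invented for illustration, not Cap'n Proto's actual layout or API.

```python
class Arena:
    """Append-only message buffer (a toy model, not Cap'n Proto's real format)."""

    def __init__(self):
        self.buf = bytearray()
        self.live = 0  # bytes still referenced by the message

    def alloc(self, data: bytes):
        """Append data and return a (offset, length) reference to it."""
        offset = len(self.buf)
        self.buf += data
        self.live += len(data)
        return (offset, len(data))

    def replace(self, ref, data: bytes):
        """Overwrite in place if the size matches, otherwise allocate at the end."""
        offset, length = ref
        if len(data) == length:
            self.buf[offset:offset + length] = data
            return ref
        self.live -= length  # the old bytes stay in the buffer, unreachable
        return self.alloc(data)

    def wasted(self) -> int:
        return len(self.buf) - self.live


arena = Arena()
ref = arena.alloc(b"hello")
ref = arena.replace(ref, b"hello world")
print(ref, len(arena.buf), arena.wasted())  # (5, 11) 16 5
```

Every size-changing edit grows the buffer, which is why this style of format is awkward as a long-lived mutable in-app structure.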

I'm not super-familiar with FlatBuffers, but I believe it uses a model where messages must strictly be constructed in bottom-up order, such that all pointers point in the same direction. This seems to imply that you can't modify a message at all after construction, but I haven't actually played with it so I could be mistaken.

(I'm the author of Cap'n Proto.)

> Zero-copy formats usually turn out to be worse for use as in-app data structures, because they have to carefully control memory allocation and layout in a way that makes it hard to mutate an already-constructed message.

Oh, I never thought of it that way, but that makes perfect sense. I guess I just assumed messages would be collections of pointers and buffers living "somewhere" in memory, but of course the actual layout can make a ton of difference.

I guess there is an implicit rule: if you are dealing with inbound structures in a read-only fashion, passing around the serialized structure is OK when the field access cost is minor compared to the copy cost, but if you want to mutate it, or are doing lots of access operations where that cost isn't trivial, it makes sense to copy into your own data structure.

> Protobuffers are shit coz they don't support zero copy and you have to deserialize the whole thing even if you are interested in one field or an outer envelope, causing memory churn in your JVMs. Cap'n'proto and flat buffers attack this real problem. The expressivity of the type system is a minor issue, hence no credible competition.

I'm not terribly familiar with protobuffers, but I'm a little surprised by this. Some ASN.1 encodings (BER, CER, DER), by contrast, use nested tag-length-value triads. This allows you to skip parts of the message that aren't interesting. (This is, by the way, not that uncommon.)

> Required should be enforced at application layer not the binary packing layer. It is a property of the version of the code processing the blob, not the blob representation itself.

This might depend a bit. Suppose you wanted to make sure that messages could be read (if not exactly decoded) without a schema. The structure of the message could, in principle, include this information in the form of bit-field preambles that indicate the number of fields, extensibility, and so forth.

I don't suppose that's strictly necessary for most applications: embedding message structure in message content seems like a bit of an anti-pattern, but I bet you could come up with a use case that makes sense in some context.

> nested tag-length-value triads. This allows you to skip parts of the message that aren't interesting.

Protobuf does that too. But in order to seek horizontally through an array, you need to inspect the tag/length of each element in order to skip to the next element. So you can't rapidly seek through a massive array.

In order to allow such seeking, you either need statically-sized elements (which is often too restrictive), or you need pointers to data represented out-of-line. Cap'n Proto uses pointers. Using pointers is pretty uncommon among serialization formats, which is strange considering how ubiquitous they are for in-memory data structures.
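
A small sketch of the difference. The encoding here (one-byte length prefixes, plus a separate table of start offsets standing in for pointers) is invented for illustration and is neither Protobuf's nor Cap'n Proto's real wire format.

```python
def encode(items):
    """Concatenate items, each preceded by a one-byte length."""
    return b"".join(bytes([len(item)]) + item for item in items)


def nth_length_prefixed(buf: bytes, n: int) -> bytes:
    """Reach element n by reading every earlier length; cost grows with n."""
    pos = 0
    for _ in range(n):
        pos += 1 + buf[pos]
    return buf[pos + 1:pos + 1 + buf[pos]]


def build_offsets(items):
    """Start offset of each element; plays the role of out-of-line pointers."""
    offsets, pos = [], 0
    for item in items:
        offsets.append(pos)
        pos += 1 + len(item)
    return offsets


def nth_with_offsets(buf: bytes, offsets, n: int) -> bytes:
    """Jump straight to element n without touching the others."""
    start = offsets[n]
    return buf[start + 1:start + 1 + buf[start]]


items = [b"a", b"bcd", b"ef"]
buf = encode(items)
offsets = build_offsets(items)
print(nth_length_prefixed(buf, 2), nth_with_offsets(buf, offsets, 2))  # b'ef' b'ef'
```

Both lookups return the same bytes; only the second avoids inspecting every earlier element.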

(Disclosure: I'm the author of Cap'n Proto, and also Protobuf v2.)

"ASN.1/DER sucks -- everyone knows that. Let's build a new, better thing. How shall we encode things? Oh oh oh! I know! Let's have a type tag, a length, and a value encoding. Perfect! So so much better than ASN.1/DER!!!"

Those who don't study the past...

Look, ASN.1 is a crappy (but not awful!) schema language, but it supports many many encodings, and some of them are dumb, stupid, and bad, like BER, DER, and CER, and some are clever (PER), and some are awesome (OER), and it even supports things like XML (XER) and even JSON. So what's so bad about ASN.1? Not much, really, just that the first generation of encoding rules (BER/DER/CER) for it were bad.

But no, people don't look. They jump without looking, and then they re-create things, and do it badly.

The only thing since ASN.1 that doesn't suck is XDR. That's because XDR very much resembles ASN.1's PER/OER, but with 4-byte units, so XDR is very ergonomic. EDIT: I should also mention flatbuffers as not sucking.

> Look, ASN.1 is a crappy (but not awful!) schema language,

TBH I think it's awful. The type system is woefully overcomplicated, and the syntax is totally unrelated to any popular programming language, making it hard to learn.

The fact that there are so many encodings creates confusion for the average developer who frankly cares much more about the schema language accessibility, tooling, API, language support, and documentation than the actual encoding. Protobuf is way, way ahead on all of those compared to ASN.1.

> The only thing since ASN.1 that doesn't suck is XDR.

Sun XDR, from the 80's? Is that actually newer than ASN.1?

> > Look, ASN.1 is a crappy (but not awful!) schema language,

> TBH I think it's awful. The type system is woefully overcomplicated, and the syntax is totally unrelated to any popular programming language, making it hard to learn.

Is it? How? What is an alternative you like better?

> > The only thing since ASN.1 that doesn't suck is XDR.

> Sun XDR, from the 80's? Is that actually newer than ASN.1?

The first ASN.1 specs are from 1984 (I guess it goes back a bit further). XDR/NFS are from 1986-1987. ASN.1 probably did not inform XDR in the least. Other RPC technologies from that time probably did (thinking of Apollo and such). What's interesting is that even if XDR is completely uninformed by ASN.1, it's essentially a subset of ASN.1 with different syntax and a PER-like encoding with 4-octet unit size and alignment -- the similarities are striking! And even more interesting is that the ASN.1 crowd in 1984 felt that TLV encodings were easy and non-TLV encodings like PER difficult, but XDR shows that non-TLV is not very difficult at all.

The lesson of the ASN.1 experience is that TLV == bad, and PER-like == good, though flatbuffers is probably the best. And more than that, the real lesson is that open source tooling is essential. It took too many decades for ASN.1 to have decent open source tooling.

Another lesson of the ASN.1 experience is that non-free standards really suck for pervasive and essential technologies. It took way too long for the ITU-T to make the ASN.1 specs available for free downloads. Now, I do understand that the ASN.1 specs are extremely well-written -- it's clear that it cost quite a lot of money to produce them, and somehow that has to be paid for, but the IETF model is much more accessible, and that is much more important than the high quality of specifications that the ITU-T is able to produce (much better than IETF RFCs, IMO).

> Is it? How?

There's like a million built-in types and too many unnecessary options for specifying constraints.

> What is an alternative you like better?

I think Protobuf has a simpler, more practical type system and more accessible syntax compared to ASN.1.

Cap'n Proto is very close to Protobuf in terms of type system and syntax, but adds some polish on both. (But Cap'n Proto is my own design, so obviously I think it's the best.)

In terms of encoding, Cap'n Proto is, of course, completely different from Protobuf. I guess it is closer to PER and OER... but not particularly close.

Protobufs is a TLV encoding no better than DER. If it's just the syntax, then it's not much of an upgrade. Syntax is not a big deal, but semantics is, and there hasn't been much, if anything at all that's new semantics-wise since ASN.1.

(Of course, that the ASN.1 syntax is difficult to parse with LALR(1) parsers is a problem. But the syntax doesn't have to change much to be easier to parse.)

You don't have to know about and use all the built-in universal types in ASN.1 -- they're there if you need them.

Protobufs is just a history-repetition disaster.

IDK about Cap'n Proto, but I'm glad it's closer to PER/OER, if it is.

I'm saying both the syntax and semantics (i.e. type system design) of protobuf are much superior to ASN.1 (mostly, by being much simpler and easier to understand).

I believe syntax and type system are, in fact, much more important than the encoding details. To most application developers, there is no difference whatsoever between protobuf encoding, BER, DER, PER, OER, etc., because they never see the encoding. The library and tools handle that part. As long as the data gets through end-to-end with acceptable performance, nobody cares how it is represented in the interim.

Cap'n Proto's encoding is different in that by being zero-copy it actually enables new use cases, like mmap()ing a very large file for random access. Still, I'd certainly choose protobuf's syntax and encoding over ASN.1's syntax paired with Cap'n Proto's encoding.

You can just skip messages, strings, and byte arrays you don’t need or can’t decode. They are length-prefixed. Also there’s nothing about proto that prevents it from being aliased to its network buffers (zero copy). The GPs complaint stems from their own ignorance, nothing to do with the nature of protos.
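
For the skipping part, a minimal hand-rolled reader over the Protobuf wire format shows the idea: each key is a varint holding `(field_number << 3) | wire_type`, and length-delimited fields (wire type 2) carry their byte length up front. The function names are made up, and this sketch ignores the deprecated group wire types.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def find_field(buf: bytes, wanted: int):
    """Return the payload of the first length-delimited field numbered
    `wanted`, skipping every other field without decoding its contents."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:        # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            pos += 8
        elif wire_type == 5:      # fixed 32-bit
            pos += 4
        elif wire_type == 2:      # length-delimited: string, bytes, sub-message
            length, pos = read_varint(buf, pos)
            if field_number == wanted:
                return buf[pos:pos + length]
            pos += length         # skip the payload unread
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return None


# field 1 = varint 150, field 2 = "testing", field 3 = "hi"
message = bytes.fromhex("089601" "120774657374696e67" "1a026869")
print(find_field(message, 3))  # b'hi'
```

The slice at the end copies; wrapping `buf` in a `memoryview` before slicing would alias the original buffer instead.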


Ok some of the things I said couldn't be done, can be done, but it's situational

> these messages can be used as rows in storage systems

Storage needs a strong schema language, much more so than RPC does, because every mistake becomes permanent. A schema should only allow messages that make sense within the problem domain, and Maguire is right that protobuf is not good at this.

I agree with a lot of this post, although the tone isn’t great. The problems we ran into with protobufs at my job include:

1. The schema evolution claims don’t really hold water for our systems.

2. The type system isn’t very expressive (e.g. no generics means you have to write the same error wrapper for all your endpoints) and lots of our devs found it unintuitive, especially oneofs.

3. The “default value”/nullable field feature turns out to be a recipe for postmortems and data quality degradation. Making everything nullable isn’t good.

4. The python library doesn’t have mypy typing and the generated objects aren’t... super pythonic.

I (along with some colleagues) built a library to paper over protobuf and address these issues. Notably, it includes a very well-specified algorithm to automatically assign version numbers to schemas during development, as well as provide operational instructions to avoid bumping a version without causing downtime if possible. And all the codegenned models have mypy types!

You can read more about it here:


It’s so far turned out really really well for us.

In particular, “schema evolution” is a property of a particular distributed system and there aren’t universally safe rules; schemas for historical machine learning datasets and rpc services, say, have to evolve differently cos the data flow is different. Also, there’s no version bumping algorithm built in, and nullable/optional fields are a pain to program against for data scientists and client devs alike.

You can generate stubs for mypy with this tool but yes this should be something supported out of the box.


re: (3) - nullability is more or less required for backwards compatibility. If you have existing data and add a new field going forward, your options are to make the old data invalid until you backfill, or give your code a way to detect "this field doesn't exist" and deal with it accordingly.

I opted to go for “pinning” based on the version number, so if you make a breaking change, like adding a required field, IDOL copies your schema into a v2 (say) namespace and then applies the change, leaving v1 untouched.

At this point we just have separate types for separate versions and tools in the host language can help you deal with that.

This turns out to be much better for data quality and client code than adding lots of nullable fields, at the cost of making breaking changes to APIs a bit more work. It seems to have been worth it so far.

Going forward, the service author has to support the “old” versions until we can determine that there’s no old data sitting around (so all clients are on the new version, all serialized data has been backfilled or dropped, or whatever’s appropriate), at which point they can delete the old schema. And we have some simple tools to verify this, since we stick the version number onto the models / serialized data.
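
IDOL itself isn't public, so the following is only a guess at the general shape of version pinning, with invented names (`UserV1`, `UserV2`, `SCHEMAS`): every serialized blob carries its schema version, and each version stays a separate type with no nullable fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class UserV1:
    name: str


@dataclass
class UserV2:
    name: str
    email: str  # the required field whose addition forced a new version


# Hypothetical registry: version number -> pinned schema type.
SCHEMAS = {1: UserV1, 2: UserV2}


def dump(obj) -> str:
    """Serialize with the version number stuck onto the data."""
    version = next(v for v, cls in SCHEMAS.items() if type(obj) is cls)
    return json.dumps({"version": version, "data": asdict(obj)})


def load(text: str):
    """Decode with the schema the data was written under."""
    envelope = json.loads(text)
    return SCHEMAS[envelope["version"]](**envelope["data"])


old = dump(UserV1("ada"))
print(load(old))  # UserV1(name='ada')
```

Old data keeps decoding as `UserV1` until it is backfilled or dropped, and deleting the `1` entry from the registry is the moment the old version stops being supported.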

Couldn’t find IDOL on Affirm’s github. Is any of it open source? I’d love to take a look if so.

It isn’t... yet!

it would be a bit of work to open source it (rip out any lingering affirm bits, spruce it up some) and my team is real pressed for time at the moment.

but I definitely want to do it sooner rather than later!

A good research paper would first explain what the protobuffer design goals are before explaining why they are misguided, inapplicable, or aren't achieved. But I guess this is just a blog post.

As it is, it's unclear whether the author of the blog post even understood the reasons behind protobuffer design decisions.

100%. But it's easier to just say "it was designed by amateurs" than actually explore why design decisions were made in a particular way, so shrug.

I finally feel safe to suggest that I think the cargo-culting of gRPC on to projects these days is also wrong. One of the best (and to be fair, worst) parts about http is its flexibility, and it's like people just completely skipped over `Content-Type` and other simple options.

Throwing out standards-compliant HTTP (whether 1,2 or 3) with the bathwater that is JSON decoding was a mistake. JSON + jsonschema + swagger/hyperschema should be good enough for most projects, and for those where it isn't good enough, swap out the content type (but keep the right annotations) and call it a day! Use Avro, use capnproto, use whatever without tying yourself into the grpc ecosystem.

Maybe gRPC's biggest contribution is the more polished and coherent tooling -- in combining three solutions (schema enforcement, binary representation and convenient client/server code generation), they've created something that's just easier to use. I personally would have preferred this effort to go towards tools that work on standards-compliant HTTP1/2/3.

I'm not necessarily saying gRPC is the solution to everything, but I don't see why HTTP is so great? It's a protocol for transferring, primarily text over networks. Most backend systems operate in binary, so serializing binary data into a text format seems to be unnecessary overhead.

One pro of HTTP is that the methods are barebones and error codes standardized, while there are plenty of battle tested front ends for your tx/rx endpoints that might touch the service. Basically works everywhere.

The con is that you can do that with the protocol of your choice directly and you don't need to bolt HTTP to whatever you're building.

gRPC works over HTTP.

That said, the http body and response are perfectly fine being binary. It's only the headers that are text based (in http 1. Http 2 turns those headers into binary as well.)

It is great because it has quality implementations in every language. Much like protobuffs.

HTTP also has a vast range of proxies, transport encodings, cryptographic layers, solutions for client/server clock skew, tracing and a whole bunch of other things like rerouting and aliasing baked in.

The processor usage of serialization is almost never the bottleneck, usually it's bandwidth. Despite that, unless you're sending floats or large integers over the wire, the difference probably isn't usually worth the engineering investment over gzipped json until you're "web-scale".

>It's a protocol for transferring, primarily text over networks.

Who told you that? You can specify an arbitrary content type. Not just text.

Whilst I tend to agree, the fact that a gRPC service is very unlikely to be designed to be ‘RESTful’ to the point of obtuseness is a huge plus. It might not be the best tool for the job but it’s a lot better than the other most cargo-culted option.

The biggest problem with HTTP is the way developers tie themselves into knots with their HTTP clients. I've seen a lot of bad decisions, including nonsensical timeout and retry logic, nonstandard use of headers, bodies on GET requests, query strings over a megabyte in size, and performance bottlenecks caused by manual management of HTTP connections and threads.

The biggest advantage of an RPC is that it takes most of that out of the hands of the developers. Developers can just focus on business logic and leave the connection and request management to the standard library.

gRPC is literally just calling conventions with HTTP2/3

For people who prefer JSON to protobuf, gRPC is serialization-agnostic. For folks who prefer REST verbs to gRPC methods, proto3 has native support for encoding REST mappings and tools like envoy and grpc web can do the REST <-> gRPC proxy translation automatically

I'm a bit confused about the type system rant - and someone correct me if I'm wrong: The whole point of protobuffs is that they're easily usable in multiple programming languages, so it seems to me that they kinda have to end up being the smallest common subset of typing features. If you try to do it strongly, they'll be hard to use in some languages (e.g. Java, the favorite beating horse of the OP and other language purists) or they'd have to restrict the amount of programming language targets.

Where am I wrong?

>The whole point of protobuffs is that they're easily usable in multiple programming languages,

While that's true in theory there are issues in practice. It's especially true if you've had the misfortune of working with the Python or PHP compilers. Documentation isn't particularly great and I do recall a time when the Python compiler was generating code that was broken and required manual tweaks. Again in the case of Python things go even further downhill if you're trying to get everything working in a Docker container.

Things are of course significantly better if you're working in Go or Java.

Python3 relative import syntax was flat out broken for years [0][1]. It's workable finally (not sure which PR fixed it) but it's still a monster to try to get protoc/grpc plugin to emit _pb2.py files which a) have correct import syntax b) have a top-level package name c) are readily packageable for pip install d) do all the above in a reproducible and uninstallable manner e) also be able to import 3rd party protos f) all the above without any post-processing.

Like, yes python constrains package names to the folder names. But why not check a flag so I can let the emitted python structure dominate the folder structure? Or, easier, just let me ignore the folder structure and specify a dang top-level-package in the emitted _pb2.py and let me wrangle it with setup.py?

[0] - https://github.com/protocolbuffers/protobuf/issues/90

[1] - https://github.com/protocolbuffers/protobuf/issues/1491

I had the displeasure of working with gRPC and Python as an intern. I ended up writing a makefile that would generate the files and then immediately run sed on them to fix their imports. It felt like a terrible hack and I hated having to do it.

What’s worse is my overall task was to come up with ways of doing this type of thing reproducibly in a bunch of different languages that had poor gRPC support so that the team could distribute consistent (and verifiably working) API bindings to other teams. At that point it felt like we probably should have conceded that gRPC sucks and not used it at all. I’m 99% sure it was just resume driven development by the lead dev.

Yep, that make + sed combination is exactly what I do every time I use protobuf in Python.

You could have the type model mentioned in this post in nearly all languages as well.

Yes, you could and that would certainly improve things (I'm not a fan of these restrictions either).

But you'd still have a rather Java-ish type system right?

Don't forget also the Protobuf C++ compiler's failure to properly namespace user-level identifiers vs. library-level identifiers.

For example, if your Protobuf has both a "foo" and "has_foo" field (which is perfectly legal by the Protobuf language definition! and works fine with e.g. the Python binding!), you will get a C++ compiler error due to a "has_foo()" method being generated on behalf of both "foo" and "has_foo".

This naming clash could have been avoided simply by prepending all generated method names with a defined prefix, but the implementors either didn't recognize this issue, or chose not to do anything about it.
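
The clash is easy to reproduce mechanically. Below is a simplified sketch of a check that a schema compiler could run; it models only a getter and a `has_` accessor per field, which is far fewer than the real C++ generator emits, and the function names are invented.

```python
def cpp_accessors(field: str):
    """Accessor names a getter/has_ scheme would emit for one field (simplified)."""
    return [field, "has_" + field]


def find_clashes(fields):
    """Return (accessor, first_owner, second_owner) for every name emitted twice."""
    owner = {}
    clashes = []
    for field in fields:
        for name in cpp_accessors(field):
            if name in owner and owner[name] != field:
                clashes.append((name, owner[name], field))
            owner.setdefault(name, field)
    return clashes


print(find_clashes(["foo", "has_foo"]))  # [('has_foo', 'foo', 'has_foo')]
print(find_clashes(["foo", "bar"]))      # []
```

Running a check like this at schema-compile time, for every supported target language, would turn the late C++ compiler error into an early schema error.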

(Everything else in the article rings true for me. I've been hoping years for someone to write this article.)

Yes, we were very much aware of this, and chose not to do anything about it.

This problem almost never manifests in practice. The issue is raised all the time, but it's basically always observed only as a theoretical problem (by someone who invariably thinks they are sooooo smart for discovering it), not as a real problem preventing compilation of a real schema.

Prepending all generated method names with a prefix would be a rather extreme solution that no one would like. Have you ever tried to read libstdc++'s STL implementation, where absolutely everything is prefixed with __? It's really quite awful. I wouldn't want to use a serialization framework that did that.

The right solution, in my opinion, is to provide annotations that allow the developer to rename a particular field for the purpose of a particular target language, so that e.g. you can say that "has_foo" should be renamed to "has_foo_" (or whatever) in C++ generated code. Yeah, it's an ugly hack, but it gets the job done.

I can't remember if this ever got implemented in Protobuf, because, again, it's almost never actually needed. Cap'n Proto does have such annotations, though.

(Disclosure: I'm the author of Protobuf v2 and Cap'n Proto.)

> (by someone who invariably thinks they are sooooo smart for discovering it), not as a real problem preventing compilation of a real schema

That's a pretty dismissive view of your users. This has actually bitten me in practice, so consider their foresight vindicated.

(Notably, it was actually the inability to easily distinguish between a missing and empty array, which caused us to resort to using "has_foo" fields, only later to hit the issue with the C++ compiler.)

If you dismiss this as a valid concern, how can I be confident that there are not other similar issues you simply dismissed as unimportant?

Say what you will about STL, but the level of attention to detail there assures me that I'm not likely to get bitten by some weird issue the developers chose to turn a blind eye to.

> you can say that "has_foo" should be renamed to "has_foo_" (or whatever) in C++ generated code.

This is fine, even if it's a transformation predetermined by the language.

Sorry for the snark. This issue is a sore spot for me because so many people have reported it without having actually been affected by it, and because they tend to assume the designers were stupidly unaware of the issue, rather than that the issue is actually rather hard to solve in a satisfying way.

However, if you actually were affected by it, then you are right to be annoyed by it.

The particular case where someone developed a protocol mostly in one language and then later on started targeting a new language is indeed a case that I do worry about. The idea of language-specific annotations defining language-specific renames was designed for that use case.

I haven't worked on Protobuf in almost a decade, but Cap'n Proto does address this issue as I said -- without making everyone's code horribly ugly.

> This naming clash could have been avoided simply by prepending all generated method names with a defined prefix, but the implementors either didn't recognize this issue, or chose not to do anything about it.

I'd much rather live with not being able to have fields named "has_foo" in my protobuf than to have to prepend a prefix to every single access method.

While arbitrary, that restriction would be fine if it were part of the language definition, and enforced by the Protobuf compiler.

(I say "arbitrary" because "has_" is just the nomenclature the C++ bindings happen to use. The syntactic peculiarities of other host languages may dictate a different prefix. Which then forces the question of whether to amend the list of "prohibited" field names, and potentially break existing code.)

Regardless, the restriction you propose is not currently (to my knowledge) part of the language, so you can get into the situation where you've been developing with Protobufs in Python for years, and then decide to add some C++ code, and everything breaks because now you have naming clashes which force you to rename the field and go through and edit all the existing use sites of said field.

"The solution is as follows:

Make all fields in a message required. This makes messages product types."

Except it also breaks backwards compatibility, one of the most powerful and sought-after features of protobufs.

> Except it also breaks backwards compatibility, one of the most powerful and sought-after features of protobufs.

It doesn't have to. Just add row types to handle unknown content, ie. if an intermediary knows only of fields foo and bar, then they can process any data with such fields if given a type like "type SomeRecord = { foo : int, bar : string | r }", where 'r' represents the remainder of the record.
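
Python can't check row polymorphism statically, so take this only as a runtime approximation of the idea: the known fields are required and typed, and everything else rides along in a remainder (`rest` here plays the role of `r`). The `parse` and `emit` helpers are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SomeRecord:
    foo: int
    bar: str
    rest: dict = field(default_factory=dict)  # the remainder 'r'


def parse(raw: dict) -> SomeRecord:
    """Require foo and bar; keep every other field untouched in rest."""
    known = {"foo", "bar"}
    missing = known - raw.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if not isinstance(raw["foo"], int) or not isinstance(raw["bar"], str):
        raise TypeError("foo must be int and bar must be str")
    rest = {key: value for key, value in raw.items() if key not in known}
    return SomeRecord(raw["foo"], raw["bar"], rest)


def emit(record: SomeRecord) -> dict:
    """Re-serialize, passing the remainder through unchanged."""
    return {"foo": record.foo, "bar": record.bar, **record.rest}


incoming = {"foo": 1, "bar": "x", "baz": [1, 2]}  # baz is unknown to this intermediary
print(emit(parse(incoming)) == incoming)  # True
```

An intermediary written against `foo` and `bar` forwards `baz` without ever knowing about it.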

The article's criticisms are valid and there are typed solutions to most of the objections that have been raised against it.

I'm not sure that's simple enough to be a "just", but in any case the primary problem is the other direction. If I add `required baz: int` to my service's definition of a protobuf, all protobufs that have ever been generated before become invalid because they don't contain a value for baz.
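
That direction of breakage is simple to demonstrate with plain dicts standing in for stored messages. The field names follow the example above; `validate` is a made-up helper, not part of any Protobuf library.

```python
def validate(message: dict, required: set) -> list:
    """Return the required fields that the message lacks."""
    return sorted(required - message.keys())


stored = {"foo": 1, "bar": "x"}          # written before baz existed

old_required = {"foo", "bar"}
new_required = old_required | {"baz"}    # after adding `required baz: int`

print(validate(stored, old_required))    # []
print(validate(stored, new_required))    # ['baz']

# With a declared default instead of `required`, old data stays readable.
with_default = {"baz": 0, **stored}
print(validate(with_default, new_required))  # []
```

Nothing about the stored bytes changed; only the reader's definition did, which is why every previously written message becomes invalid at once.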

That fact doesn't change if you eschew types. Backward-compatible schema evolution has rules.

Right, that's the point. The article's suggestion to "make all fields in a message required" fundamentally misunderstands the issues at hand, because no matter how appealing it is from a type theory perspective, following that suggestion would make it impossible to ever add a field in a backwards compatible manner.

> The article's suggestion to "make all fields in a message required" fundamentally misunderstands the issues at hand, because no matter how appealing it is from a type theory perspective, following that suggestion would make it impossible to ever add a field in a backwards compatible manner.

You absolutely could in multiple ways:

1. You make every accepted product type have a row type at your service interface if you expect schema evolution.

2. If you have to add a field unexpectedly, ie. where you did not have a row type, then you must deprecate the old API. If this seems onerous to you, then your service infrastructure is probably insufficiently flexible.

Option 1 seems like it defeats the point. If you're going to declare a field with a more permissive type than currently allowed, aren't you just hacking weak types back into your strong type system?

Option 2... look. I've seen a lot of API deprecations, across multiple teams in multiple companies, and every one of them was very onerous in ways that had little to do with the service infrastructure. If you've done easy API deprecations, more power to you, but I don't think your experience is representative.

Protocol buffers already do that; serialized fields that are not recognized by an older message definition are parsed and can be accessed via the "unknown fields" API, exactly as "r" above. Intermediaries can pass these through trivially, or inspect them to see what they didn't understand.

The problem with making fields required is that older serialized protocol buffers parsed by newer message definitions may be missing newly added required fields, which will break things.

Protobuf does not do this via a typed interface, but via runtime checking.

You can't statically typecheck deserialized data. You must validate that deserialized value matches the schema, and you can only do so at runtime.

In other words, proto has a typed interface, but you must runtime check that a given bag of bytes conforms to that typed interface.

This is true for any io.

> You must validate that deserialized value matches the schema, and you can only do so at runtime

I assume you mean serialised data, not deserialized. And yes, deserializing includes type checking. The point is that this happens once, so a separate API for dynamic data shouldn't be needed.

What do you mean by a separate api for dynamic data?

The data under discussion isn't "dynamic", it's still static, it just isn't known to the schema in question at runtime (since it's only known to a different schema). That means you can't access it by name, since the field names aren't known.

The lesson is: when you start wrong, you stay wrong.

Unfortunately I was turned off by the angry and obnoxious tone. Seems to be getting more common to get traction on HN homepage. But yeah, even though author makes some good points, the argument loses effectiveness in my book because of things like calling people amateurs.

The angry, pissed-off coder rant is occasionally pulled off well, but in general it grew tiresome for me fifteen years ago. Not everyone is Hunter S Thompson (well, no one is, now), and not every technical annoyance is the Kentucky Derby and thus worthy of such treatment.

To this day, I’ll still forgive a well-crafted MongoDB rant, though.

Previous discussion, leading with a comment from one of the protobuf authors: https://news.ycombinator.com/item?id=18188519

The way dweis never responded to batmansmk was disappointing to say the least.

kentonv was willing to engage on the points and came out much more reasonable in the whole thing.

dweis isn't a designer, so he'd be the wrong person to answer those things.

Sanjay, Jeff, and Kenton are probably the three best to answer such questions.

Presumably the top few concerns for protos are wire performance (decode/encode speed and cost, wire size), compatibility for changes (what this suggestion just totally breaks), and cross language usability.

Some other tradeoffs might be non-wire perf (I believe protos beat flatbuffers here, at the cost of worse on wire perf), but it's not clear that that was intentional.

I guess I'll copy/paste the comment I made last time this was posted: https://news.ycombinator.com/item?id=18190005


Hello. I didn't invent Protocol Buffers, but I did write version 2 and was responsible for open sourcing it. I believe I am the author of the "manifesto" entitled "required considered harmful" mentioned in the footnote. Note that I mostly haven't touched Protobufs since I left Google in early 2013, but I have created Cap'n Proto since then, which I imagine this guy would criticize in similar ways.

This article appears to be written by a programming language design theorist who, unfortunately, does not understand (or, perhaps, does not value) practical software engineering. Type theory is a lot of fun to think about, but being simple and elegant from a type theory perspective does not necessarily translate to real value in real systems. Protobuf has undoubtedly, empirically proven its real value in real systems, despite its admittedly large number of warts.

The main thing that the author of this article does not seem to understand -- and, indeed, many PL theorists seem to miss -- is that the main challenge in real-world software engineering is not writing code but changing code once it is written and deployed. In general, type systems can be both helpful and harmful when it comes to changing code -- type systems are invaluable for detecting problems introduced by a change, but an overly-rigid type system can be a hindrance if it means common types of changes are difficult to make.

This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.

I don't have time to address all the author's points, so let me choose a few that I think are representative of the misunderstanding.

> Make all fields in a message required. This makes messages product types.

> Promote oneof fields to instead be standalone data types. These are coproduct types.

This seems to miss the point of optional fields. Optional fields are not primarily about nullability but about compatibility. Protobuf's single most important feature is the ability to add new fields over time while maintaining compatibility. This has proven -- in real practice, not in theory -- to be an extremely powerful way to allow protocol evolution. It allows developers to build new features with minimal work.

Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto. In practice, you want to declare all fields optional to give yourself maximum flexibility for change.

The author dismisses this later on:

> What protobuffers are is permissive. They manage to not shit the bed when receiving messages from the past or from the future because they make absolutely no promises about what your data will look like. Everything is optional! But if you need it anyway, protobuffers will happily cook up and serve you something that typechecks, regardless of whether or not it's meaningful.

In real world practice, the permissiveness of Protocol Buffers has proven to be a powerful way to allow for protocols to change over time.

Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but not on the basis of perceived elegance and simplicity, but rather in real-world use.

> oneof fields can't be repeated.

(background: A "oneof" is essentially a tagged union -- a "sum type" for type theorists. A "repeated field" is an array.)

Two things:

1. It's that way because the "oneof" pattern long-predates the "oneof" language construct. A "oneof" is actually syntax sugar for a bunch of "optional" fields where exactly one is expected to be filled in. Lots of protocols used this pattern before I added "oneof" to the language, and I wanted those protocols to be able to upgrade to the new construct without breaking compatibility.

You might argue that this is a side-effect of a system evolving over time rather than being designed, and you'd be right. However, there is no such thing as a successful system which was designed perfectly upfront. All successful systems become successful by evolving, and thus you will always see this kind of wart in anything that works well. You should want a system that thinks about its existing users when creating new features, because once you adopt it, you'll be an existing user.

2. You actually do not want a oneof field to be repeated!

Here's the problem: Say you have your repeated "oneof" representing an array of values where each value can be one of 10 different types. For a concrete example, let's say you're writing a parser and they represent tokens (number, identifier, string, operator, etc.).

Now, at some point later on, you realize there's some additional piece of data you want to attach to every element. In our example, it could be that you now want to record the original source location (line and column number) where the token appeared.

How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.

In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.
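The token example above can be sketched in Python, with dataclasses standing in for .proto messages. This is only an illustration of the shape of the data: the names `Token`, `SourceLocation`, `Number`, `Identifier`, and `Operator` are hypothetical and not taken from any real schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Number:
    value: float


@dataclass
class Identifier:
    name: str


@dataclass
class Operator:
    symbol: str


# Stand-in for the "oneof": exactly one of these alternatives per token.
TokenKind = Union[Number, Identifier, Operator]


@dataclass
class SourceLocation:
    line: int
    column: int


@dataclass
class Token:
    # The wrapper message: the repeated thing is Token, not the oneof itself.
    kind: TokenKind
    # Added in a later schema revision. Data written before the change
    # simply leaves it unset, so old and new tokens share one array.
    location: Optional[SourceLocation] = None


def describe(token: Token) -> str:
    """Render a token, tolerating tokens that predate the location field."""
    where = (
        f"{token.location.line}:{token.location.column}"
        if token.location is not None
        else "unknown location"
    )
    return f"{type(token.kind).__name__} at {where}"


old_tokens: List[Token] = [Token(Identifier("x")), Token(Operator("+"))]
new_tokens: List[Token] = [Token(Number(1.0), SourceLocation(line=3, column=7))]
descriptions = [describe(t) for t in old_tokens + new_tokens]
```

Because the array element is a message rather than a bare union, the new per-element field lands in one place and no parallel array is needed.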

The author's complaints about several other features have similar stories.

> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?

> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.

OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.

> But I've never once seen an application that will actually preserve that property.

I wonder if author uses Chrome, which depends heavily on this in its Sync feature.

Yeah, most big Google services -- including Search -- rely pretty heavily on unknown field retention. Google has been building large services out of microservices since a decade before anyone ever said the word "microservice". When one service is updated to emit a new field, and another service is updated to consume it, it's important that the feature can then work, without updating all the middlemen.
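As a rough illustration of why unknown field retention matters for middlemen, here is a minimal Python sketch that hand-parses the wire format instead of using the real protobuf library. Everything named here is an assumption made for the example: the helper names, the idea that field 1 is a varint hop counter the middleman knows about, and field 7 as a newer field it has never heard of.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def split_fields(buf):
    """Split a message into (field_number, raw_bytes_including_tag) pairs."""
    fields = []
    pos = 0
    while pos < len(buf):
        start = pos
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:    # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 8 bytes
            pos += 8
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 5:  # fixed 4 bytes
            pos += 4
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((number, buf[start:pos]))
    return fields


def forward_with_bumped_hops(message):
    """Hypothetical middleman: increments field 1, copies all else verbatim."""
    out = bytearray()
    for number, raw in split_fields(message):
        if number == 1:  # assumed to be a varint in this sketch
            _, value_pos = read_varint(raw, 0)  # skip over the tag
            hops, _ = read_varint(raw, value_pos)
            out += write_varint((1 << 3) | 0) + write_varint(hops + 1)
        else:
            out += raw  # unknown to this service: pass the bytes through
    return bytes(out)


# Field 1 = 3 (varint), field 7 = "hello" (length-delimited, unknown here).
incoming = bytes([0x08, 0x03, 0x3A, 0x05]) + b"hello"
outgoing = forward_with_bumped_hops(incoming)
```

The downstream consumer that does know field 7 still receives it, even though the middleman was never redeployed with the newer schema.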

When I worked on Chrome Sync, we spent some time making sure that unknown fields were preserved properly. Glad to see that someone noticed, cheers!

I did notice that when I was an owner of protobuf in Chromium :) Custom patches to support unknown field preservation in lite mode sure brought me some hassle when updating to version 3 of the library.

"Make all fields in a message required" would defeat one of the main benefits of protobufs: The ability to retroactively add/remove fields while still keeping the message compatible with implementations using the previous version of the proto definition.

The other issues (e.g. that you cannot make a repeated oneof) are annoying, but many of them are consequences of upgrading the "language" (if you want to call it that) without introducing incompatibilities and/or changing the wire format. Having a new, incompatible version would likely be a lot more annoying. Simply not having these features at all and having to write your own ugly hack as a workaround would definitely be a lot more annoying.

I'd be interested to hear their thoughts on capnproto.

I would expect he has the same issues with Cap'n Proto. Aside from some aesthetic cleanups, Cap'n Proto's type system is extremely similar to Protobuf -- because, frankly, Protobuf got that part right. Cap'n Proto's main difference from Protobuf is the encoding, which it doesn't seem like this guy cares too much about.

(I'm the author of Cap'n Proto, and Protobuf v2, though I did not design Protobuf's type system.)

The author isn't wrong about protobuf's shortcomings, but to say:

> and solve a problem that nobody but Google really has

is pretty absurd. There are plenty of projects that serialize a LOT of data between different runtimes/platforms (e.g. Go and Java) such that built-in serialization is not possible and JSON/XML is 3-10 times slower.

> The dynamic typing guys complain about it being too stifling, while the static typing guys like me complain about it being too stifling without giving you any of the things you actually want in a type-system. Lose lose.

Type system purists are blinded by their commitment to purity. All context is thrown out the window — it’s purism or bust.

The absurdity here is profound; it’s “Lose Lose” unless you go all typing or none.

And yet I completely understand the lament here. I think what the (smarter) type purists realize is that if they lose the purism position, static types do become much less of a tyrant tool and more like any other tool in our toolkit: a nominally useful one to be applied judiciously.

Then they’d have to turn their attention to the unforgivingly dynamic outside world and market.

> Fields with scalar types are always present. Even if you don’t set them. Did I mention that (at least in proto3) all protobuffers can be zero-initialized with absolutely no data in them? Scalar fields get false-y values—uint32 is initialized to 0 for example, and string is initialized as "".

> It’s impossible to differentiate a field that was missing in a protobuffer from one that was assigned to the default value. Presumably this decision is in place in order to allow for an optimization of not needing to send default scalar values over the wire.

I believe there’s a trick you can do if you mark it as a “oneof” with only one field.

> It’s impossible to differentiate a field that was missing in a protobuffer from one that was assigned to the default value. Presumably this decision is in place in order to allow for an optimization of not needing to send default scalar values over the wire.

Isn't this just flat incorrect? You can tell the difference between set-to-default and not-set with buffer.has_some_field().

Not for things like ints and strings.

It depends on which version you're using.

In proto1 and proto2, every field had a "has" method and "explicitly set to default value" was different from "absent".

In proto3, they tried to remove this feature, and instead said that for basic types, "set to default" and "absent" are the same thing.
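The difference can be shown with a toy model in Python. This is not the protobuf API: a plain dict stands in for the bytes on the wire, and the field names, `DEFAULTS` table, and helper functions are all made up for the illustration.

```python
# Hypothetical schema defaults for two scalar fields.
DEFAULTS = {"count": 0, "name": ""}


def serialize_explicit(fields):
    """proto1/proto2-like: every explicitly set field is written,
    even when its value equals the default."""
    return dict(fields)


def serialize_implicit(fields):
    """proto3-like for basic types: default-valued fields are not written."""
    return {k: v for k, v in fields.items() if v != DEFAULTS[k]}


def get(wire, name):
    """Reading a field always yields a value, falling back to the default."""
    return wire.get(name, DEFAULTS[name])


def has(wire, name):
    """Presence check: was the field actually on the wire?"""
    return name in wire


# The sender explicitly sets count to its default value.
sent = {"count": 0}

explicit_wire = serialize_explicit(sent)
implicit_wire = serialize_implicit(sent)

# Both readers see the same value...
same_value = get(explicit_wire, "count") == get(implicit_wire, "count") == 0
# ...but only the explicit-presence model can tell it was set on purpose.
explicit_knows = has(explicit_wire, "count")
implicit_knows = has(implicit_wire, "count")
```

In the implicit model, "set to 0" and "never set" produce identical bytes, so no reader-side API can recover the distinction.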

(I wrote proto2. I left Google before proto3 came about.)

Thanks. I must be only used to proto2.

It's 2019 and the protobuf JS compiler still only supports CommonJS modules and Google-developed Closure imports. No AMD/UMD and no ES6 modules.

How are we supposed to use it in a browser environment if we are not using browserify or webpack?

The sad thing is that, rather than forward this to the small "decision team" at work, where we can ponder the merits of the author's points...

... I'm going to just close my browser tab due to the puerile ranting at the beginning (and sprinkled throughout). A few good points, and perhaps a great basis for "proto4" or whatnot, but with all that "OMG they're so dumb" ranting?

If that was a peer-reviewed paper, I'd have rejected it after reading the first paragraph, if I even made it that far. That's just not how you make a technical argument or win people over.

One important thing missing from the current criticism is Protobuf’s lack of a facility for serializing a sequence of messages to a file. There’s RecordIO internally at Google, yet they markedly declined to open-source the C++ lib for it. There’s hints of it in Protobuf Java and then Amazon has open-sourced their own implementation of it with the same name.

Lack of public RecordIO is partially to blame for creation of TFRecords, which are in many ways inferior to (for example) tar archives of string-serialized protobuf messages. (tar supports index, streaming, compression, etc).

I requested that the RecordIO format (bytes on the disk and code implementation) be open-sourced (for ease of interoperability between Google datasets and open source/scientific work). It wasn't, because there were some 'flaws' in the design, but it was pointed out that leveldb open-sourced a format very similar to it (but which never got used outside of leveldb).

I only have experience with flatbuffers in C++ (it seemed easier to be integrated in a project back then). Can anyone comment on the pros and cons of flatbuffers vs protobuffers?

I worked on trying to make flatbuffers work at google and it just never was as fast as proto2/c++. I guess the author of this piece would describe me as an amateur because like the authors of protocol buffers I only have about thirty years of industry experience. AMA.

I'd be really interested in hearing why it wasn't faster! I expect the answer is along the lines of: "Well theoretically the zero-copy design should be faster, but in practice factors X and Y dominate performance and Protobuf wins on those." I'd love to know exactly what X and Y are...

(I'm the original author of proto2/c++, but I'm mostly interested for any lessons that might apply to my work on Cap'n Proto...)

The C++ proto implementation is just already tuned to an absurd degree and it is hard to beat. Any place where copying was an important problem has already been eradicated with aliasing (ctype declarations) so flatbuffers' supposed advantage isn't there to begin with. It's much more important to eliminate branches, stores, and other costs in generated code.

I'm guessing you were trying to use it with Stubby?

Admittedly the networked-RPC use case is not a particularly compelling one for zero-copy (the mmaped-file case is much more so, and maybe even shared-memory RPC).

Still, I'd expect that not having to parse varints nor allocate memory would count for something. Wish I could see the test setup.

"but unfortunately, literally nobody considers Java to have a well-designed type-system". What? That is mildly put a lie.

Indeed. I imagine some people do think Java has a well-designed type system. However, you probably don't consider those people to be authorities on the subject.

What would be an appropriate replacement for embedded systems? I've looked at the "tiny" versions of protobuf (nanopb, etc), but haven't tried them yet.

Are protobuf competitors (flatbuffers, capnproto) appropriate for small embedded systems (microcontrollers, mostly <64K RAM)?

I think an implementation of Cap'n Proto that's actually optimized for embedded systems would likely be smaller than any implementation of Protobuf could be. However, I'd have to admit that the current Cap'n Proto C++ library is not so optimized.

Here's a GitHub comment where I outlined what we'd need to do to fix that, FWIW: https://github.com/capnproto/capnproto/issues/844#issuecomme...

Caveat: I don't have any personal experience with embedded systems.

My main problem with protobuf isn't the actual serialization or the proto files. It's the use case. They actually pitted this against REST. REST is slowly going out of favor, so of course it makes sense to start gap filling. But when we look at the two major competing technologies, GraphQL and Protobuf, to fill that "I don't want to use REST anymore" feeling, GraphQL actually solved something useful and moved things forward. Protobuf really just said, hmmm, let's put TONS of constraints down on top of REST to make it more reliable and faster. Basically Swagger 4.0 maybe?

I keep seeing people saying things like well, protobuf can be used to make your GraphQL faster, etc... So you're actually trying to argue for "some" usefulness for protobuf for someone that made it to the next level. That might last, what, 1 month? The only thing that should be responsible for adding a binary encapsulation format would be something built into Http specs, not some kind of custom Rest->graphql->protobuf stack.

/rant over

It is either Protobuf or Protocol Buffers, not Protobuffers. This kinda upsets my OCD.

It's an interesting article. I was hoping for some alternative suggestions, because proto is "just good enough" at structure and wire to become the one tool a project will reach for so it doesn't need two tools.

I've been working on a project which requires writing ~40 different packet types in a custom protocol, but always thought something like protobuf would be a great fit for standardizing the packet serialization routines.

You could look at ASN.1, which was pretty much created for that purpose.

No, protobuffers are pretty good exactly because they don't try to solve everything. Their lack of expressiveness is actually a very good thing when designing communication between processes. Narrow is good.

Why doesn't he fork the project and crank out a few patches?

When you make a rant like this and don't actually solve it or offer an alternative you just come off as a jerk.

The solution offered is to write every field and also an isSet bit... Wouldn't this balloon message sizes, throwing away the major reason to use protobuff?

If I opened my comment with personal attacks on the author's competency I hope people would downvote me. This has 137 points right now and I don’t even think it makes much sense; it sounds as though they stopped short of understanding the reasoning behind many of the limitations and just assume they are mistakes, when I’d argue it makes a ton of sense from a PoV of how protos work.

Like why can’t you repeat a oneof? Imo because it stops acting like a oneof. A oneof is treated like a union in the generated code, and you can expect that only one of the oneof message tags will appear in binary for that message. If you want a repeated oneof, it’s actually no different than if you had all of the fields be repeated and outside of a oneof. It gives a different interface in generated code but it’s the exact same thing you’d want in the underlying proto binaries: multiple of whichever message tag. The distinction of oneof is not useful here.

I think the proto design is quite smart, OTOH. Like the format is designed to allow backwards and forwards compatibility provided you follow some rules that you can easily enforce via linting.

Yes, there are some slightly odd side effects. Many things in proto are special cased. Like Map can’t be repeated because Map is already repeated; maps are sugar for repeated pairs, and you can’t have a repeated repeated field. You can of course just make a quick submessage with a map and repeat that. It doesn’t seem like that big of a deal.

OTOH, for how simple protobuf's wire format is, the sugar features like Map help make it feel a bit richer from the PoV of the generated code, whereas the simple wire format makes it more predictable, easier to understand under the hood, and helps to future proof for new features.

Seriously, binary protobuf is so simple anyone can parse it trivially. It’s just a flat sequence of pairs, of a message tag and the corresponding data, with 6 different wire types that do not specify any typing but instead only how to interpret the wire data (IIRC: variable integer, 8 byte integer, variable length w length prefix, group start and group end, and 4 byte integer, where ‘group’ refers to an older, deprecated way of delimiting nested messages; nested messages normally use the length-prefixed type.) The message tag is written as a base128 variable length quantity, which is just an integer where each byte has a high bit specifying if there are more bytes, and low bits specifying binary data in least significant first order. The wire type is encoded in the lowest 3 bits of the message tag; the remainder of the bits are just the tag of the field from the proto file. The length prefix variant uses a second base128 vlq to specify length.
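A minimal Python sketch of the tag and varint layout described above, written from the description rather than copied from any protobuf implementation. The function names are made up for the example, and it handles non-negative integers only.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint,
    least significant 7-bit group first."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_tag(field_number, wire_type):
    """Pack the field number and the 3-bit wire type into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def decode_tag(buf, pos=0):
    """Return (field_number, wire_type, next_pos)."""
    key, pos = decode_varint(buf, pos)
    return key >> 3, key & 0x7, pos


WIRE_VARINT = 0  # variable-length integer
WIRE_LEN = 2     # length-prefixed bytes

# Field 1 holding the integer 150, then field 2 holding the bytes "testing".
payload = b"testing"
message = (
    encode_tag(1, WIRE_VARINT) + encode_varint(150)
    + encode_tag(2, WIRE_LEN) + encode_varint(len(payload)) + payload
)
```

Here `message` comes out as the bytes `08 96 01 12 07` followed by the ASCII text `testing`, which shows how little framing the format adds around the actual data.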

Protos are clever: the protobuf compiler itself compiles protobuf definitions into protobuf messages called descriptors that can be passed to a language's own protobuf compiler through standard in. These descriptors also get encoded into the resulting output because they can then be used to perform reflection.

Speaking of reflection, you can also have a bunch of metadata in the form of extensions and message/field/enum/etc. extension options. You can use these at runtime, or you can write custom protobuf plugins. I am doing both of these things simultaneously for different purposes in some projects; it helps me organize schema information and couple it with metadata.

I don’t think protos are perfect, but I do think they are clever and useful for what they are used for at Google. I actually personally suspect they are a bit underrated because outside of Google it’s not always completely obvious how to use protobufs to their fullest. That said, nothing’s perfect and protos are certainly full of weird quirks. But if you embrace them, I think there’s a lot of elegance to be found lurking beneath.

That is all. Disclaimer: I do work for Google, and admittedly I did not like protobuf until I started working here. But, now I sincerely like protobuf.

The author's tone may be rude but they are absolutely right. The design of data description languages is a well researched field and deviating from standard technique without explaining the motivation behind that deviation is a huge smell.

Has the author heard about capnproto?

(2018) Original thread here: https://news.ycombinator.com/item?id=18188519

Protobufs are the worst (de)serialization format, except for all the others.

My chief complaints are:

- Protoc is very obtuse and tricky to use for anything where you want packaging, especially with python

- The gRPC compiler plugin is even more frustrating in this regard

- It's very optimized for compactness on the wire, at the expense of serving as a useful structure within programs (I can't find the source, I think it's somewhere in the protobuf dev docs, but I've had multiple coworkers tell me this)

- The gRPC server python implementation does weird things with multiprocessing under the hood that I do not understand, which interferes with other modules trying to use multiprocessing.

- I still have not found an ideal way to organize files to work well with importing and still compile correctly with protoc/grpc plugin, and generate python files with correct import syntax. If anyone knows the "correct" way to do this that doesn't require too much setup.py hackery, please let me know.

External schema '.proto' files is a feature, not a bug.

The complaints in the article about the type system are pretty silly to me. I mean, they are great features, but they are not really in the sphere of the engineering goals when Google set out to make pb/gRPC.

Here's what I love about it though:

- Support, in particular gRPC's support across so many languages.

- Language agnostic data structure contracts

- Shallow learning curve to get a smoke test hello world put together - I found it a lot easier than Thrift to get up and start playing

To quote hardwaresofon:

> in combining three solutions (schema enforcement, binary representation and convenient client/server code generation), they've created something that's just easier to use

Specific comparisons:

Cap'n proto looks great on paper but at the time (about a year ago) it had some issues with python2.7 and 3.6, which made it a nonstarter for the application at the time.

MsgPack-RPC might work well but I'm a bit dissuaded by the unhealthy looking repo of the python/go/cpp implementations.

Anything over HTTP - you have the binary-to-text issue. Which, if there are better solutions for this nowadays, let me know.

I believe that XML is dead/dying as a ser/de format (outside of the markup domains it has already demonstrated to be very proficient at). Similar lack of binary support.

That leaves Thrift and Avro, which have juuust enough of a barrier to entry, with my lack of time to dig into alternatives, that I have not been able to research thoroughly yet.

Can't Rust's Serde simply solve the forward and backward compatibility issue, independent of the serialization format, by ignoring non-existing members in serialized types and setting defaults for the other way around?

There's also https://github.com/TimelyDataflow/abomonation , which is very fast, but ugly.

Why would it? Serde does not enforce any schema.

Ey well so is that url!

“This is a good website name, only I am smart enough to think of this, I am an SEO genius with the number one search result for responsible polymorphism”

Protobuffers are shitty asn.1
