Hello. I didn't invent Protocol Buffers, but I did write version 2 and was responsible for open sourcing it. I believe I am the author of the "manifesto" entitled "required considered harmful" mentioned in the footnote. Note that I mostly haven't touched Protobufs since I left Google in early 2013, but I have created Cap'n Proto since then, which I imagine this guy would criticize in similar ways.
This article appears to be written by a programming language design theorist who, unfortunately, does not understand (or, perhaps, does not value) practical software engineering. Type theory is a lot of fun to think about, but being simple and elegant from a type theory perspective does not necessarily translate to real value in real systems. Protobuf has undoubtedly, empirically proven its real value in real systems, despite its admittedly large number of warts.
The main thing that the author of this article does not seem to understand -- and, indeed, many PL theorists seem to miss -- is that the main challenge in real-world software engineering is not writing code but changing code once it is written and deployed. In general, type systems can be both helpful and harmful when it comes to changing code -- type systems are invaluable for detecting problems introduced by a change, but an overly-rigid type system can be a hindrance if it means common types of changes are difficult to make.
This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.
I don't have time to address all the author's points, so let me choose a few that I think are representative of the misunderstanding.
> Make all fields in a message required. This makes messages product types.
> Promote oneof fields to instead be standalone data types. These are coproduct types.
This seems to miss the point of optional fields. Optional fields are not primarily about nullability but about compatibility. Protobuf's single most important feature is the ability to add new fields over time while maintaining compatibility. This has proven -- in real practice, not in theory -- to be an extremely powerful way to allow protocol evolution. It allows developers to build new features with minimal work.
Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto. In practice, you want to declare all fields optional to give yourself maximum flexibility for change.
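As a concrete sketch of that kind of evolution (hypothetical message, not from the article; proto2-style syntax):

// (proto2 syntax; hypothetical message)
message UserProfile {
  optional string name  = 1;
  optional string email = 2;
  // Added in a later release under a fresh tag number. Old binaries simply
  // ignore field 3 (or pass it through unchanged); old messages still parse,
  // with avatar_url left unset.
  optional string avatar_url = 3;
}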
The author dismisses this later on:
> What protobuffers are is permissive. They manage to not shit the bed when receiving messages from the past or from the future because they make absolutely no promises about what your data will look like. Everything is optional! But if you need it anyway, protobuffers will happily cook up and serve you something that typechecks, regardless of whether or not it's meaningful.
In real world practice, the permissiveness of Protocol Buffers has proven to be a powerful way to allow for protocols to change over time.
Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but not on the basis of perceived elegance and simplicity, but rather in real-world use.
> oneof fields can't be repeated.
(background: A "oneof" is essentially a tagged union -- a "sum type" for type theorists. A "repeated field" is an array.)
Two things:
1. It's that way because the "oneof" pattern long-predates the "oneof" language construct. A "oneof" is actually syntax sugar for a bunch of "optional" fields where exactly one is expected to be filled in. Lots of protocols used this pattern before I added "oneof" to the language, and I wanted those protocols to be able to upgrade to the new construct without breaking compatibility.
You might argue that this is a side-effect of a system evolving over time rather than being designed, and you'd be right. However, there is no such thing as a successful system which was designed perfectly upfront. All successful systems become successful by evolving, and thus you will always see this kind of wart in anything that works well. You should want a system that thinks about its existing users when creating new features, because once you adopt it, you'll be an existing user.
2. You actually do not want a oneof field to be repeated!
Here's the problem: Say you have your repeated "oneof" representing an array of values where each value can be one of 10 different types. For a concrete example, let's say you're writing a parser and they represent tokens (number, identifier, string, operator, etc.).
Now, at some point later on, you realize there's some additional piece of data you want to attach to every element. In our example, it could be that you now want to record the original source location (line and column number) where the token appeared.
How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.
In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.
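A sketch of that wrapping pattern for the parser example (hypothetical names; per point 1 above, the oneof is equivalent on the wire to a group of optional fields):

// (proto3 syntax)
message Token {
  oneof kind {
    double number     = 1;
    string identifier = 2;
    string text       = 3;
    string op         = 4;
  }
  // Because each element is a message, new per-token fields can be added
  // later without breaking compatibility, e.g. the source location:
  int32 line   = 5;
  int32 column = 6;
}

message TokenStream {
  repeated Token tokens = 1;  // repeat the wrapper message, not the oneof
}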
The author's complaints about several other features have similar stories.
> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?
> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.
OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.
Indeed, I was waiting for the punchline where he reveals a superior alternative. Given the fundamental nature of his criticism, it would need to be a fundamentally superior concept to justify the rhetoric -- something that gives you an a-ha moment, not just iterative improvements. When you open sourced protobuf (thank you for doing so), we at the company I was with had implemented a functionally similar but not language-agnostic system (as I believe many companies had), and it was very exciting to get a cross-language implementation. Also, to your last point, one of the major drivers for us was the forward/backward compatibility and the preservation of messages across intermediate systems that only required knowledge of their portion of the message, so the author's assertion that this never comes up also surprised me.
Of course I am all ears for the next revolution in high-performance message formats that support schema evolution and so on, but the author did not provide one, and I think if one is going to unleash that level of negative rhetoric, one takes on the responsibility to also offer at least a glimpse of an alternative.
> OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.
Here's an actual real-life example: Chrome uses this in the Chrome Sync feature, which syncs your browser configuration and state across devices. The feature is implemented basically like this: Chrome sends its version of the state to the server in a proto, and the Sync server reconciles it with the one it has saved, updating both according to which one is more recent. The feature fundamentally depends on the client not dropping fields it doesn't know about: if an older client dropped these unknown fields, that would mean data loss for some other, more recent client that knows those fields and synced them to the server -- it would be equivalent to syncing an empty value for each of them.
The original designers of proto3 (the most recent Protocol Buffers definition language and semantics) actually decided to drop unknown-field preservation, for simplicity reasons. This made Googlers so unhappy that an internal doc was created listing many internal use cases for the feature, and after discussion, it was added back to proto3.
It's also useful the other way around: if you add a field which has significance only to the client (and these account for most fields), you don't have to count on the server knowing about the field immediately, which simplifies your testing and deployment.
> Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but not on the basis of perceived elegance and simplicity, but rather in real-world use.
Care to elaborate on what these usual proposals are and why they're backwards?
Right now I'm thinking of 'row polymorphism', which, to my understanding, is just permissiveness cooked into the type system -- you get to specify types that say "I may or may not have other fields, but I definitely have a 'name' and 'email' field" for example.
"Row polymorphism" sounds like exactly what protobuf does, which I think is the right answer.
As I mentioned, I've heard people argue that the client and server should pre-negotiate exactly which version of the protocol they are using, and then that version should have a rigidly-defined schema. It may have been unfair to suggest that this is a common opinion among type theorists -- I don't have that many data points.
Row polymorphism is a more principled approach that makes the type theorists happy.
The point of row polymorphism is not just that you can say "I may or may not have other fields, but I definitely have a 'name' and 'email' field", but that the extra unknown fields have a name that can be used to constrain other types. For example:
{name: string, email: string, &rest} -> {name: string, email: string, &rest}
vs
{name: string, email: string, &rest} -> {name: string, email: string}
The first type can act as a forward-compatible passthrough because it says that whatever extra fields are in the input are also in the output. The second type promises that its output has only name and email fields.
The same applies to variant types: you can say that a variant has option A, option B, and other options rest:
(A | B | &rest) -> (A | B | &rest)
vs
(A | B | &rest) -> (A | B)
Functions of these types are polymorphic in the schema. For instance, the type (A | B | &rest) -> (A | B) can be instantiated with rest = (C | D) to get (A | B | C | D) -> (A | B).
So row types are fundamentally different in that it's the functions that explicitly deal with multiple possible schemas. At the end of the day, each rest variable gets instantiated with an explicit type, giving a concrete version of the function in which all the rest variables have been replaced by concrete types.
A serialisation library could use this to do version negotiation automatically. After it has negotiated a version, the library instantiates the functions so that the rest type variables get instantiated with the actual concrete version of the schema.
Different languages implement row types in different ways. Some compile row types in a way akin to Java's type erasure. They compile a single version of each polymorphic function that can be used regardless of how the rest parameters get instantiated. Some compile row types in a way akin to C#'s generics or C++ templates: they compile a separate version for each instantiation.
The advantage of the latter is that the data representation can be optimised with full knowledge of the concrete schema. If we have a function of type {name: string, email: string, &rest} -> {name: string, email: string, &rest} instantiated with rest = {age: int} then that compiles to a version of type {name: string, email: string, age: int} -> {name: string, email: string, age: int}. This compiles to faster code because the compiler statically knows the size of the thing.
In a client-server situation you wouldn't know the schema of rest until run time, so you'd either need a JIT compiler that can compile new versions at run time, or you'd have to specify a fixed number of options for rest at compile time. To update a client-server application you'd need to recompile both the client and server with support for the new version. That's not nice, but it doesn't have a chicken-and-egg problem, because both the client and server still support the old version too.
TL;DR: with row types schemas are always rigidly defined, it's the functions that can handle multiple schemas.
> In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.
But I could use the same argument for repeated anything. Why even allow repeated primitives at all, if your argument is convincing?
There is a protobuf best practice that suggests being conservative here: if you think you might need to attach more data per element in the future, use a repeated message rather than a repeated primitive from the start.
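A minimal sketch of the two options (hypothetical message and field names, not from the thread):

// (proto3 syntax)
// Risky: no way to attach more per-element data later.
message OrderV1 {
  repeated int64 quantities = 1;
}

// Conservative: each element is a message, so per-element fields can be
// added later without breaking old readers.
message ItemQuantity {
  int64 value = 1;
  // e.g. a later addition: string unit = 2;
}
message OrderV2 {
  repeated ItemQuantity quantities = 1;
}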
> Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is.
Required and optional are just an encoding of nullability in the type system. This is a common feature in most modern languages (Go excepted). Clearly Google got a very long way with proto1, whose designers felt strongly enough that this was important to put it into what is otherwise a very feature-lite system.
The crux of your argument is not fundamental to serialisation or messaging systems but rather is specific to Google's environment, business and early design choices. It is wrong for virtually all users of serialisation mechanisms:
> Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto
There are many, many examples where a field does start required and stay required for the lifetime of a system. But even if over time a required field stops being used and you'd like to clean up the data structures or protocols by removing it, what proto3 does is completely wrong for any normal business.
Obviously, you can always remove a required field by changing it to optional and updating the software that uses that field to handle the case where it's missing (perhaps by doing nothing). The "required considered harmful" movement that took hold at Google whilst I was there directly contradicts all modern statically typed language design -- except for Go, which also originated at Google and, to put it bluntly, almost revels in its reputation for having ignored any PL thinking from the past 20 years. But if you look at, say, Kotlin, which the Android team have adopted as their new practical working language for practical working programmers, it has required/optional as a deeply integrated feature of the language. And this is not controversial. I do not see long-running flamewars saying that Maybe types or Kotlin-style integrated null checking are a bad idea. Basically everyone who tries it says: this is great!
So why did Google have this feature and remove it?
The stated rationale of the "optional everywhere" movement was something like this: if we make a required field optional and stop setting it, some server, somewhere, might barf and cause a production outage. And nothing is worse than an outage; therefore, it is better to process data that's wrong (some default zero value cast to whatever that field is meant to mean) than to crash.
This set of priorities is absurd for basically any company that isn't Google or in a closely related business, i.e. consumer services in which huge sums of money can be made by serving data even if it's "wrong" (or in which correctness is unknowable, like a search result page). But for most firms correctness does matter. They cannot afford to corrupt data by interpreting a version skew bug as whatever zero might have meant.
Arguably even Google can't afford to do this, which is why Jeff Dean made "required" a feature of the proto1 language.
There's a simple and very bad reason why protocol buffers were changed to make all fields optional: the protocol buffer wire format was set very, very early in the company's lifetime, when there were only a few servers and Google was very much in startup mode, writing everything from scratch in C++ ... and without using any other frameworks. If memory serves, they're called "protocol buffers" because they started as a utility class on top of the indexserver protocol. The class provided basic PushTaggedInt() type methods which wrote a buffer representing a request. Over time the IDL and compiler were added on top. But ultimately the wire format has never changed, except for re-interpreting some fields from being "byte array of unknown encoding" to being UTF-8 encoded strings (and even that subtlety caused data corruption in prod).
There is no metadata anywhere. A protobuf with a single tagged int32 field encodes to just three bytes. There's no version header. There's no embedded schema. There's nothing that can be used to even mark a message as using an upgraded version of the format. There is however a rather complex varint encoding.
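For instance, using the standard tag/varint encoding (field number and value chosen purely for illustration):

08       -> tag byte: field number 1, wire type 0 (varint)
96 01    -> varint encoding of the value 150
(3 bytes total; no header, no schema reference, no version marker)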
This makes a ton of sense given it was refactored out of a protocol for inter-server communication in a web search engine (Google search being largely a machine that spends its time decoding varints). But having come into existence this way, protobufs quickly proliferated everywhere including things like log files which may have to be read years later. So almost immediately the format became permanently fixed.
And what can't you do if the format is fixed? You can't add any sort of schema metadata that might help you analyse where data is going or what versions of a data structure are deployed in production.
As a consequence, when I was at Google, nobody had any way to know whether there was something running in production still producing or consuming a protobuf that they were trying to remove a field from. With lack of visibility comes fear of change. Combined with a business that makes bazillions of dollars a minute and in which end users can't complain if the served results are wrong, you get protocol buffers.
This is not fundamental. Instead it reflects an understandable but unfortunate lack of forward planning when protobufs were first created. Other companies are not compelled to repeat Google's mistake.
You have misunderstood the "required considered harmful" argument. It's not fundamentally about the abstract concept of required vs. optional but about the specific implementation in Protocol Buffers, which turns out to have had unintended consequences.
Specifically: As implemented, required field checking occurred every time a message was serialized, not just when it was produced or consumed. Many systems involve middlemen that receive a message and then send it on to another system without looking at the content -- except that they would still check that all required fields were set, because that check is baked into the protobuf implementation.
What happened over and over again is that some obscure project would redefine a "required" field as "optional", update both the producer and the consumer of the message, and then push it to production. But, once in production, the middlemen would start rejecting any message where these fields were missing. Often, the middleman servers were operating on "omnibus" messages containing bits of data originating from many different servers and projects -- for example, a set of search results might contain annotations from dozens of Search Quality algorithms. Normally, those annotations are considered non-essential, and Google's systems are carefully architected so that the failure of one back-end doesn't lead to overall failure of the system. However, when one of these optional back-ends sent a message missing required fields, the entire omnibus message would be rejected, leading to a total production outage. This problem repeatedly affected the search engine, Gmail, and many other projects.
The fundamental lesson here is: A piece of data should be validated by the consumer, but should not be validated by pass-through middlemen. However, because required fields are baked into the protobuf implementation, it was unclear how they could follow this principle. Hence, the solution: Don't use required fields. Validate your data in application code, at consumption time, where you can handle errors gracefully.
Could you design some other version of "required" that doesn't have this particular problem? Probably. But would it actually be valuable? People who don't have a lot of experience here -- including Jeff and Sanjay when they first designed protobufs -- think that the idea of declaring a field "required" is obvious. But the surprising result that could only come from real-world experience is that this kind of validation is an application concern which does not belong in the serialization layer.
> There is no metadata anywhere.
Specifically, you mean there is no header / container around a protobuf. This is one of the best properties of protobufs, because it makes them compose nicely with other systems that already have their own metadata, or where metadata is irrelevant. Adding required metadata wastes bytes and creates ugly redundancy. For example, if you're sending a protobuf in an HTTP response, the obvious place to put metadata is in the headers -- encoding metadata in the protobuf body would be redundant and wasteful.
From what you wrote it sounds like you think that if Protobufs had metadata, it would have been somehow easier to migrate to a new encoding later, and Google would have done it. This is naive. If we wanted to add any kind of metadata, we could have done so at any time by using reserved field numbers. For example, the field number zero has always been reserved. So at any time, we could have said: Protobuf serialization version B is indicated by prefixing the message with 0x00 0x01 -- an otherwise invalid byte sequence to start a protobuf.
The reason no one ever did this is not because the format was impossible to change, but because the benefits of introducing a whole new encoding were never shown to be worth the inevitable cost involved: implementation in many languages and tools, code bloat of supporting two encodings at once (code bloat is a HUGE problem for protobuf!), etc.
> Validate your data in application code, at consumption time, where you can handle errors gracefully.
Honest question: how can I validate data in application code when optional fields decode to a necessarily-valid value by design?
Suppose I'm an application author and I have an integer field called "quantity" which decoded to a 0. How can I tell whether that 0 meant "the quantity was 0 in the database" or "the quantity field was missing" instead?
(One answer is that I should opt into a different default value, like -1, which the application can know indicates failure. If that's what I should always do, then why not help me gracefully recover by requiring that I always specify my fallback value explicitly, rather than silently defaulting to a potentially misinterpretable valid value like `0`?)
I understand that required fields break message buses that only need to decode the envelope, but if I am working on a client/server application where message buses are not involved (as almost all client/server programmers in the world are), I don't follow how "everything is optional, and optional means always succeed with a valid default value" facilitates graceful recovery in the application layer. In order to gracefully recover, the application has to be informed that something went wrong!
It seems to me that this design more directly facilitates bugs in the application layer that are difficult to detect because the information that something unexpected happened during decoding is intentionally discarded by default. It makes the resulting bugs "not the protocol layer's fault" by definition, but that is not a compelling pitch to me as an application author.
> Suppose I'm an application author and I have an integer field called "quantity" which decoded to a 0. How can I tell whether that 0 meant "the quantity was 0 in the database" or "the quantity field was missing" instead?
First, this is clear at the level of the wire encoding: either the field has an encoded 0 value, or it is simply missing from the encoding.
Second, in proto2 you actually have a has_quantity() method on the proto message, which will tell you whether quantity is missing or set to 0.
In proto3, the design decision was that the has_foo() methods are available only on embedded message fields, and not on primitive fields, so you'd have to wrap your int64 in a message wrapper, e.g. one of the ones available in google/protobuf/wrappers.proto.
The point here (and a common pattern inside google3) is that in your handling code you simply check the presence of all required fields manually: if (!foo.has_quantity()) { return FailedPreconditionError("missing quantity"); }. It is a bit of a hassle, but the benefit is that you have control over where the bug originates and how it is handled in your application layer, as opposed to silently dropping the whole proto message on the floor.
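A minimal proto3-style sketch of that wrapper pattern (the wrapper type is the real google.protobuf.Int64Value; the message and field names here are made up for illustration):

syntax = "proto3";
import "google/protobuf/wrappers.proto";

message Order {
  // Because the wrapper is a message field, has_quantity() is generated,
  // so "unset" can be distinguished from an explicit 0.
  google.protobuf.Int64Value quantity = 1;
}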
In proto2, you could use `has_foo()` to check if `foo` is present, even for integer types. You could also specify what the default value should be, so you could specify e.g. a default of -1 or some other invalid value, if zero is valid for your app.
Unfortunately, proto3 removed both of these features (`has_` and non-zero defaults). I personally think that was a mistake. I'm not sure what proto3 considers idiomatic here. Proto3 is after my time.
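For comparison, a proto2-style sketch of those two features (hypothetical message and field names):

syntax = "proto2";

message Order {
  // proto2 generates has_quantity() even for scalar fields, and the explicit
  // default means a missing field reads back as -1 rather than 0.
  optional int64 quantity = 1 [default = -1];
}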
Cap'n Proto also doesn't support `has_` due to the nature of the encoding, but it does support defaults. So you can set a default of -1 or whatever. Alternatively, you can declare a union like:
# (Cap'n Proto syntax)
foo :union {
  unset @0 :Void;
  value @1 :Int32;
}
This will take an extra 16 bits on the wire to store the tag, but gets the job done. `unset` will be the default state of the union because it has the lowest ordinal number.
I suppose in proto3, you ought to be able to use a `oneof` in a similar way.
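Something like this, as a hedged proto3 sketch (hypothetical names):

// (proto3 syntax)
message Item {
  // The oneof restores field presence: in the C++ API, for example, the
  // generated quantity_case() accessor reports QUANTITY_NOT_SET when
  // nothing was filled in, even though 0 is a valid value.
  oneof quantity {
    int64 value = 1;
  }
}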
My point about metadata is that if protobufs had even a small amount of self-description in them, the middlemen that weren't being updated could all have been found automatically, and version skew would have been much less of an issue. Like how Dapper can follow RPCs around, but for data and running binaries.
Google doesn't do that because, for its specific domain, it needs a very tight encoding, amongst other reasons (and the legacy issues). It could have fixed the validating-but-not-updated-middleman issue in other ways, but instead it made the schema type system less rigorous rather than more rigorous. That seems like the wrong direction.