I have some very unfortunate news to share with the Cap'n Proto and Sandstorm communities.
Ian Denhardt (zenhack on HN), a lead contributor to the Go implementation, suddenly and unexpectedly passed away a few weeks ago. Before making a request to the community, I want to express how deeply saddened I am by this loss. Ian and I collaborated extensively over the past three years, and we had become friends.
As the de facto project lead, it now falls to me to fill Ian's very big shoes. Please, if you're able to contribute to the project, I could really use the help. And if you're a contributor or maintainer of some other implementation (C++, Rust, etc.), I would *REALLY* appreciate it if we could connect. I'm going to need to surround myself with very smart people if I am to continue Ian's work.
RIP Ian, and thank you. I learned so much working with you.
I've had a couple people suddenly taken from me, and it is soul crushing. Every time it happens it reminds me of how fragile life is, and how quickly things can change. I've started trying to enjoy the small things in life more, and while I don't neglect the future, I also try to enjoy the present.
He has left an amazing legacy that has touched a lot of people. RIP Ian.
I’m so sad to hear this. I didn’t know him but hugely admired the work he did on Tempest (his recent fork of Sandstorm, attempting to revive the project). Thank you for letting us know.
I find it surprising how few protocols (besides Cap'n Proto) have promise pipelining. The only other example I can think of is 9p, but that's not a general purpose protocol.
As neat as it is, I guess it's hard to optimize the backend for it compared to explicitly grouping the queries. I imagine a bespoke RPC call that results in a single SQL query is better than several pipelined but separate RPC calls, for example.
But even still, you would think it would be more popular.
If you're thinking strictly about stateless backends that just convert every request into a SQL query, then yeah, promise pipelining might not be very helpful.
I think where it shines is when interacting with stateful services. I think part of the reason everyone tries to make everything stateless is because we don't have good protocols for managing state. Cap'n Proto RPC is actually quite good at it.
The issue is that having per-session/transaction state on the server makes load balancing requests more difficult, especially when that state is long-lived.
While it's true that load-balancing long-lived resources is harder than short-lived, a lot of the difficulty of load balancing with stateful servers is actually in the protocol, because you somehow have to make sure subsequent requests for the same state land on the correct server.
Cap'n Proto actually does really well with this basic difficulty, because it treats object references as a first-class thing. When you create some state, you receive back a reference to the state, and you can make subsequent requests on that reference. The load balancer can see that this has happened, even if it doesn't know the details of the application, because object references are marked as such at the RPC layer independent of schema. Whereas in a system that returns some sort of "object ID" as a string, and expects you to pass that ID back to the server on subsequent requests, the load balancer is not going to have any idea what's going on, unless you do extra work to teach the load balancer about your protocol.
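Roughly, here is what that looks like in the C++ implementation. This is only a sketch: the UserDirectory/Session schema, the server address, and user.capnp.h are hypothetical stand-ins, not anything from Cap'n Proto itself or from this thread.

    // Sketch only: UserDirectory/Session come from a hypothetical schema, e.g.
    //   interface UserDirectory { openSession @0 (userId :Text) -> (session :Session); }
    //   interface Session       { query @0 (text :Text) -> (answer :Text); }
    #include <capnp/ez-rpc.h>
    #include "user.capnp.h"  // generated from the hypothetical schema above

    int main() {
      capnp::EzRpcClient client("sessions.example.com:1234");
      UserDirectory::Client dir = client.getMain<UserDirectory>();
      auto& waitScope = client.getWaitScope();

      // openSession() returns a capability, not a string ID. The RPC layer --
      // and any proxy or load balancer speaking the protocol -- can see that a
      // new object reference was introduced here, without knowing the schema.
      auto req = dir.openSessionRequest();
      req.setUserId("joe");
      Session::Client session = req.send().getSession();

      // Subsequent calls are made on the reference itself, so they follow the
      // reference to wherever the session state actually lives.
      auto q = session.queryRequest();
      q.setText("hello");
      auto response = q.send().wait(waitScope);
      // response.getAnswer() holds the reply.
      return 0;
    }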
If it's the same backend handling multiple chained requests that happen to use the same database, it could in turn build a single big SQL query to generate the result(s). I used to write stuff like this all the time (not specifically Cap'n Proto servers, but general functional expression -> SQL -> here's your answer engines).
> I find it surprising how few protocols (besides Cap'n Proto) have promise pipelining.
Pipelining is a bad idea. It reifies object instances, and thus makes robust implementation much harder. You no longer make stateless calls, but you are running functions with particular object instances.
And you immediately start getting problems. Basically, Client Joe calls Service A and then passes the promised result of the call to Service B, so Service B will have to make a remote call to Service A to retrieve the result of the promise.
This creates immediate complications with security boundaries (what is your delegation model?). But what's even worse, it removes the backpressure. Client Joe can make thousands of calls to Service A, and then pass the not-yet-materialized results to Service B. Which will then time out because Service A is being DDoS-ed.
Or more specifically, it seems to have client-chosen file descriptors, so the client can open a file, then immediately send a read on that file, and if the open fails, the read will also fail (with EBADF). Awesome!
This is great, but "promise pipelining" also needs support in the client.
Are there 9p clients which support promise pipelining? For example, if the user issues several walks, they're all sent before waiting for the reply to the first walk?
Also, it only has promise pipelining for file descriptors. That gives you a lot, definitely, but if for example you wanted to read every file in a directory, you'd want to be able to issue a read and then walk to the result of that read. Which 9p doesn't seem to support. (I actually support this in my own remote syscall protocol library thing, rsyscall :) )
There is also CapnP’s moral ancestor CapTP[1]/VatTP aka Pluribus developed to accompany Mark Miller’s E language (yes, it’s a pun, there is also a gadget called an “unum” in there). For deeper genealogy—including a reference to Barbara Liskov for promise pipelining and a number of other relevant ideas in the CLU extension Argus—see his thesis[2].
(If I’m not misremembering, Mark Miller later wrote the promise proposal for JavaScript, except the planned extension for RPC never materialized and instead we got async/await, which don’t seem compatible with pipelining.)
The more recent attempts to make a distributed capability system in the image of E, like Spritely Goblins[3] and the OCapN effort[4], also try for pipelining, so maybe if you hang out on cap-talk[5] you’ll hear about a couple of other protocols that do it, if not ones with any real-world usage.
(And I again reiterate that, neat as it is, promise pipelining seems to require programming with actual explicit promises, and at this point it’s well-established how gnarly that can get.)
One idea that I find interesting and little-known from the other side—event loops and cooperatively concurrent “active objects”—is “causality IDs”[6] from DCOM/COM+ as a means of controlling reentrancy, see CoGetCurrentLogicalThreadId[7] in the Microsoft documentation and the discussion of CALLTYPE_TOPLEVEL_CALLPENDING in Effective COM[8]—I think they later tried to sell this as a new feature in Win8/UWP’s ASTAs[9]?
Without knowing how exactly capnproto promise pipelining works, when I thought about it, I was concerned about cases like reading a directory and stating everything in it, or getting back two response values and wanting to pass only one to the next call. The latter could be made to work, I guess, but the former depends on eg the number of values in the result list.
In the actual implementation, when making a pipelined call on a result X, you actually say something like "Call X.foo.bar.baz()" -- that is, you can specify a nested property of the results which names the object that you actually want to operate on.
At present, the only operation allowed here is reading a nested property, and that seems to solve 99% of use cases. But one could imagine allowing other operations, like "take the Nth element of this array" or even "apply this call to all elements in the array, returning an array of results".
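For a concrete picture, here is roughly what a pipelined call looks like in the C++ API, based on the calculator schema that ships in Cap'n Proto's samples directory (the address is just an example):

    #include <capnp/ez-rpc.h>
    #include "calculator.capnp.h"  // from Cap'n Proto's samples/ directory
    #include <iostream>

    int main() {
      capnp::EzRpcClient client("localhost:5923");
      Calculator::Client calculator = client.getMain<Calculator>();
      auto& waitScope = client.getWaitScope();

      // evaluate(literal 123) -> (value :Value)
      auto request = calculator.evaluateRequest();
      request.getExpression().setLiteral(123);
      auto evalPromise = request.send();

      // Pipelined: getValue() names a field of the not-yet-received results
      // (the "X.foo" part), and read() is called on it immediately. Both
      // calls travel in a single network round trip.
      auto readPromise = evalPromise.getValue().readRequest().send();

      std::cout << readPromise.wait(waitScope).getValue() << std::endl;  // 123
      return 0;
    }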
That assumes you know & can generate the complete pipeline ahead of time. The elegance of promise pipelining is that your pipeline can also be asynchronously grown.
Congrats on the release! It must be very exciting after 10 years :)
If you don't mind the question: will there be more work on implementations for other languages in the future? I really like the idea of the format, but the main languages in our stack aren't supported in a way I'd use in a product.
This is indeed the main weakness of Cap'n Proto. I only really maintain the C++ implementation. Other implementations come from various contributors which can lead to varying levels of completeness and quality.
Unfortunately I can't really promise anything new here. My work on Cap'n Proto is driven by the needs of my main project, the Cloudflare Workers runtime, which is primarily C++. We do interact with Go and Rust services, and the respective implementations seem to get the job done there.
Put another way, Cap'n Proto is an open source project, and I hope it is useful to people, but it is not a product I'm trying to sell, so I am not particularly focused on trying to get everyone to adopt it. As always, contributions are welcome.
The one case where I might foresee a big change is if we (Cloudflare) decided to make Cap'n Proto be a public-facing feature of the Workers platform. Then we'd have a direct need to really polish it in many languages. That is certainly something we discuss from time to time but there are no plans at present.
Hmm, the main weakness of Cap'n Proto is that you have to already know so much stuff in order to understand why it makes all the great decisions it does. The weakness you're talking about matters to me, sure: I don't use Cap'n Proto because it lacks the tooling gRPC has, but it is better than gRPC from an ideas point of view.
I am not going to write those language implementations, I have other stuff I need to do, and gRPC is good enough. But the people who love writing language implementations might not understand why Cap'n Proto is great, or at least not understand as well as they understand Golang and Rust, so they will rewrite X in Golang and Rust instead.
Anyway, the great ideas haven't changed in the, what, 10-15 years you've been working on this; they've been right all along. So it is really about communication.
A comment on HN that really stuck with me was like: "Man dude, this is great, but try to explain to my team that it's Not React. They won't care."
I'm just a guy, I don't know how to distill how good Cap'n Proto is. But "The Unreasonable Effectiveness of Recurrent Neural Networks" is the prototype. What is the unreasonable effectiveness of Cap'n Proto? In games, which I'm familiar with, entity component systems, user generated content and their tooling have a lot in common with Cap'n Proto. "The Unreasonable Effectiveness of ECS" is deterministic multiplayer, but that is also really poorly communicated, and thus limits adoption. Maybe you are already facing the same obstacles with Cloudflare Workers. It's all very communications related and I hope you get more adoption.
Yeah, this has been the struggle with Sandstorm and self-hosting too. Ten years on, I'm still confident it's the best way to self-host, but to convince someone of that I have to sit them down and figure out how to get them to understand capability-based security, and most people lose interest about... immediately. :P
I suspect a lot of things will eventually look more like Cap'n Proto and Sandstorm, but it will take a lot of time for everyone else to get there.
I’m sold on Sandstorm, but the company folded before I could do anything with it. If someone makes another push at it, I have a lot of security-focused enterprise folks who want more robust & streamlined ways to self host stuff.
I have been pretty invested into Sandstorm for like ten years now, so I am very interested in getting that new push moving. If you know people interested in contributing to the project in any way, we are definitely interested in looking at ways to make it happen.
We will probably put out a blog post sometime soon with an update.
That's completely understandable, thank you for the answer! I'd love to try and help with at least one implementation for those languages, but there's a good chance that it would end up like the existing implementations due to lack of time.
Anyway, thank you for making it open source and for working on it all this time!
> if we (Cloudflare) decided to make Cap'n Proto be a public-facing feature of the Workers platform.
How likely is this? What would be the benefits and use-cases of doing this? Would it be a standardized JS offering, or something specific to Workers that is deserialized before it hits the runtime?
This really hasn't been fleshed out at all, it's more like: "Well, we're built on Cap'n Proto, it'd be really easy to expose it for applications to use. But is it useful?"
Arguably Cap'n Proto RPC might be an interesting way for a Worker running on Cloudflare to talk to a back-end service, or to a service running in a container (if/when we support containers). Today you mostly have to use HTTP for this (which has its drawbacks) or raw TCP (which requires bringing your own protocol parser to run in "userspace").
That said there's obviously a much stronger case to make for supporting gRPC or other protocols that are more widely used.
There are people who have tried to write the RPC layer without it simply being a wrapper around the C++ implementation, but it's a LOT of code to rewrite for not a lot of direct benefit.
Feel free to take a crack at it. People would likely be rather cooperative about it. However, know that it's just simply a lot of work.
If any cloudflare employees end up here who helped decide on Capn Proto over other stuff (e.g. protobuf), what considerations went into that choice? I'm curious if the reasons will be things important to me, or things that you don't need to worry about unless you deal with huge scale.
To this day, Cloudflare's data pipeline (which produces logs and analytics from the edge) is largely based on Cap'n Proto serialization. I haven't personally been much involved with that project.
As for Cloudflare Workers, of course, I started the project, so I used my stuff. Probably not the justification you're looking for. :)
That said, I would argue the extreme expressiveness of Cap'n Proto's RPC protocol compared to alternatives has been a big help in implementing sandboxing in the Workers Runtime, as well as distributed systems features like Durable Objects. https://blog.cloudflare.com/introducing-workers-durable-obje...
I don't work at Cloudflare but follow their work and occasionally work on performance sensitive projects.
If I had to guess, they looked at the landscape a bit like I do and regarded Cap'n Proto, flatbuffers, SBE, etc. as being in one category apart from other data formats like Avro, protobuf, and the like.
So once you're committed to record'ish shaped (rather than columnar like Parquet) data that has an upfront parse time of zero (nominally, there could be marshalling if you transmogrify the field values on read), the list gets pretty short.
Cap'n Proto was originally made for https://sandstorm.io/. That work (which Kenton has presumably done at Cloudflare since he's been employed there) eventually turned into Cloudflare workers.
To summarize something from a little over a year after I joined there: Cloudflare was building out a way to ship logs from its edge to a central point for customer analytics and serving logs to enterprise customers. As I understood it, the primary engineer who built all of that out, Albert Strasheim, benchmarked the most likely serialization options available and found Cap'n Proto to be appreciably faster than protobuf. It had a great C++ implementation (which we could use from nginx, IIRC with some lua involved) and while the Go implementation, which we used on the consuming side, had its warts, folks were able to fix the key parts that needed attention.
Anyway. Cloudflare's always been pretty cost efficient machine wise, so it was a natural choice given the performance needs we had. In my time in the data team there, Cap'n Proto was always pretty easy to work with, and sharing proto definitions from a central schema repo worked pretty well, too. Thanks for your work, Kenton!
Albert here :) Decision basically came down to the following: we needed to build a 1 million events/sec log processing pipeline, and had only 5 servers (more were on the way, but would take many months to arrive at the data center). So v1 of this pipeline was 5 servers, each running Kafka 0.8, a service to receive events from the edge and put it into Kafka, and a consumer to aggregate the data. To squeeze all of this onto this hardware footprint, I spent about a week looking for a format that optimized for deserialization speed, since we had a few thousand edge servers serializing data, but only 5 deserializing. Capnproto was a good fit :)
The article says they were using it before hiring him though, so there must have been some prior motivation:
> In fact, you are using Cap’n Proto right now, to view this site, which is served by Cloudflare, which uses Cap’n Proto extensively (and is also my employer, although they used Cap’n Proto before they hired me)
I'm excited by Cap'n Proto's participation in the OCAPN standardization effort. Can you speak to if that's going to be part of the Cap'n Proto 2.0 work?
Kenton, I'm @lthibault on GitHub. I was working closely with Ian on the Go capnp implementation, and I would be happy to take over this initiative. Can you point me in the right direction?
Also, are you on Matrix or Telegram or something of the sort? I was hoping I could ping you with the occasional question as I continue work on go-capnp.
The guy making Tempest? That is tragic and deserves an HN front page. Is there an obituary or good link you can submit and we can upvote? Or would Ian have preferred not to have such?
Yes, he was the most active developer over the last few years, although it wasn't a huge amount of activity. And that activity had dropped off this year as Ian shifted his focus to Tempest, a mostly-from-scratch rewrite. https://github.com/zenhack/tempest
For my part I stopped pushing monthly Sandstorm updates this year as there hasn't really been anything to push. Unfortunately Sandstorm's biggest dependencies can't even be updated anymore because of breaking changes that would take significant effort to work around.
I have thought a bit on this, but I have also pretty much just been recovering lately. We will probably need to assemble a blog post at some point soonish, but we need to talk to a few people first.
Looking at https://sandstorm.io/news/2014-08-19-why-not-run-docker-apps from 9 years ago, it seems you think that Docker was/is poorly suited for individual user administration. Since the Internet has started enshittifying in the last few years I've been turning to self-hosting apps, but this requires researching distro-specific instructions for each new app I install, and I got burned when Arch updated their Postgres package which broke compatibility with the old database formats until I ran a manual upgrade script which required troubleshooting several errors along the way. (In hindsight, I should've picked a fixed-release distro like Debian or something.)
Would you say that there have been user-friendly Docker/etc. wrappers for self-hosting LAN services? Someone has recommended casaOS (or the proprietary Umbrel), though I haven't tried either yet.
While I never used Cap'n Proto, I want to thank kentonv for the extremely informative FAQ answer [1] on why required fields are problematic in a protocol.
I link it to people all the time, especially when they ask why protobuf 3 doesn't have required fields.
This is some very valuable perspective. Personally, I previously also struggled to understand why. For me, the thing that clicked was to understand protobuf and Cap'n Proto as serialization formats that need to work across API boundaries and need to work with different versions of their schema in a backwards- and forwards-compatible way; do not treat them as in-memory data structures that represent the world from the perspective of a single process running a single version with no compatibility concerns. Thus, the widely repeated mantra of "making illegal states unrepresentable" does not apply.
First, being valid or invalid with respect to a static type system is a GLOBAL property of a program -- writing a type checker will convince you of that. And big distributed systems don't have such global properties: https://news.ycombinator.com/item?id=36590799
If they did, they'd be small :) Namely you could just reboot the whole thing at once. You can't reboot say the entire Internet at once, and this also holds for smaller systems, like the ones at say Google (and I'm sure Cloudflare, etc.).
So the idea is that the shape/schema is a GLOBAL property -- you never want two messages called foo.RequestX or two fields called "num_bar" with different types -- ever, anywhere.
But optional/required is a LOCAL property. It depends on what version of a schema is deployed in a particular binary. Inherently, you need to be able to handle a mix of inconsistent versions running simultaneously.
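A minimal sketch of what handling that mix looks like in code, assuming a hypothetical proto2 message Request where a newer schema revision added "optional uint32 deadline_ms = 4;":

    #include "request.pb.h"  // hypothetical generated code for message Request

    void Handle(const Request& req) {
      // A binary built against the older schema never sets deadline_ms.
      // Presence is a per-message, per-deployment fact, so the consumer
      // decides locally what "absent" means; the parser doesn't reject it.
      const uint32_t deadline_ms =
          req.has_deadline_ms() ? req.deadline_ms() : 30000;  // default: 30s
      // ... proceed using deadline_ms ...
    }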
---
To be pedantic, I would say "making illegal states unrepresentable" DOES apply, but you can't do it in a STATIC type system. [1] Your Maybe<T> type is not useful for data that crosses process boundaries.
A distributed system isn't a state machine.
1. Lamport showed us one important reason why: the relative order of messages means that there is no globally coherent state. You need something like Paxos to turn a distributed system back into a state machine (and this is very expensive in general)
2. The second reason is probably a consequence of the first. You can think of deploying a binary to a node as a message to that node. So you don't have a completely consistent state -- you always have an in-between state, a mix of versions. And presumably you want your system to keep working during this time period :)
And that coarse-grained problem (code versioning and deployment) implies the fine-grained problem (whether a specific field in a message is present). This is because protobufs generate parsers with validation for you -- or they used to!
---
tl;dr Think of the shape of data (aka schema) and field presence (optional/required) as different dimensions of data modeling. Maybe<T> mixes those up, which is fine in a single process, but doesn't work across processes.
---
[1] A very specific example of making illegal states unrepresentable without static types - my Oils project uses a DSL for algebraic data types, borrowed from CPython. The funny thing is that in CPython, it generates C code, which doesn't have any static notion of Maybe<T>. It has tagged unions.
And in Oils we generated dynamically typed Python at first. Somewhat surprisingly, algebraic data types are STILL useful there.
Now the generated code is statically typed with MyPy (and with C++), and we do pleasant type-driven refactorings. But algebraic data types were still extremely useful before static typing. They made illegal states unrepresentable -- but you would get the error at runtime.
I wonder about how to make this play nicely with systems that have different perspectives. Yes, a message bus is written to deal with any possible message and it can do that because it doesn't care what's in the message. Incomplete messages are useful to have, too.
This is sort of like the difference between a text editor and a compiler. An editor has to deal with code that doesn't even parse, which is easiest if just treats it as a plain text file, but then you're missing a lot of language-specific features that we take for granted these days. Meanwhile, a compiler can require all errors to be fixed before it emits a binary, but it has to be good at reporting what the errors are, because they will certainly happen.
It's unclear to me how the type of the field can be a global property in a large system. From a text editor's point of view, you can just edit the type. How can anyone guarantee that a type is always the same?
Also, SQL tables actually do make closed-world assumptions; every record meets the schema, or it can't be stored there. If you change the schema, there is a migration step where all the old rows in the production database get upgraded. This doesn't seem unrealistic?
I guess it's unrealistic that you only have one production database, and not also a staging database, and every developer having their own database? And they will be on different versions. As soon as you have lots of databases, things get complicated.
Yes, databases and persisted data are an even bigger problem -- it's not enough to "reboot the Internet"; you would also have to migrate all the data it stores to a different format!
That is: type systems are interior; network protocols and persisted data are exterior.
SWEs tend to reason about the interior; SREs tend to reason about the exterior. Every problem an SRE deals with has passed type checks.
I see many fallacies where programmers want to think about the interior ONLY. They want the type system to ensure correctness. But that can be taken too far -- there are some things the interior view can't (or doesn't currently) handle, like mixed versions of binaries, schema migrations, etc.
The key point with databases is that your schema is LITERALLY dynamic -- it lives in a process outside your program, outside your type system (unless your program consists entirely of stored procedures, etc.)
Of course most people have some kind of synchronization or ORM (with varying degrees of success). But the point is that the exterior reality is the thing that matters; the interior is just a guide. "When models and reality collide, reality wins"
---
On the other hand, I think there can be more static tools in the future -- if they are built to understand more of the system; if they're not so parochially limited to a single process.
But I think these issues are never going away -- quite the contrary they will increase, because software tends to get more heterogeneous as it gets bigger. It's tempting to think that someday there's going to be a "unified platform" that solves it all, but I think the trend is actually the opposite.
The other issue is that while type systems can get better, they're mostly applicable when you control both sides of the wire.
Most orgs writing software do not control both sides of the wire -- they are reusing services like Google Maps or Stripe. In the last 10 years it seems like every org is now relying on 50 different third party integrations (which btw makes the web super slow ...)
As I mentioned in the previous comment, even if you can somehow go into the repo of Google Maps or Stripe, download their schema, and compile it into your binary, that STILL doesn't give you any guarantees.
They can upgrade or roll back their binaries whenever. They might switch from free-form JSON to JSON schema to Swagger, etc. You don't control that decision.
The people on the other side of the wire may have started using protobufs 10 years ago, and don't feel like switching to whatever you think is great right now. There's a lot of heterogeneity across time too, not just across different tech right now.
So fundamentally the problems are with "the world" the type system is modeling, not about or within the type system itself!
It seems like protobufs are sort of a mixed interior / exterior model? In an organization that uses centralized protobuf schemas and generates serialization code from that, you can assume every protobuf record of that type came from a serializer that obeys some version of the schema.
Why can this guarantee a field's type but not whether it's required? Because fields have a lifecycle. They get added to the schema, used for a while, and later become obsolete and are deleted.
The maintainers of a protobuf schema can guarantee that a field number is always associated with a certain type, and that when the field is deleted, that field number is never reused. This makes (field number, type) tuples timeless. But they can't say whether a field will be needed forever, and they can't control which versions of the schema are still being used.
Effectively, "required" means "possibly required forever." As long as there is some parser out there that treats the field as required, deserialization will break if you leave out the field.
Changing a field's type is easier, because you don't really do it. You add a new field number with a new type and maybe eventually stop emitting the old field.
This suggests a strategy: fields can be required as long as they're not top-level. You can define coordinates(42) to always be (x,y) where both x and y are required. If you change it to be a 3D coordinate, then that will be coordinates(43) = (x,y,z), and you never reuse field number 42 in that schema.
The strategy illustrated at the end still ignores the lesson of the FAQ. What if, let's say, one microservice is responsible for calculating the x coordinate but another microservice is responsible for calculating the y coordinate? The former microservice could rightly fill in only the x coordinate and ask the latter microservice to fill in the y coordinate.
Then you start to make compromises such as defining an IncompleteCoordinate message that requires x. But what if later you calculate y first and use that to calculate x?
Etc. Etc. I find it more convenient to make everything optional at the wire format level and explicitly validate what is required at the point of use, not at the point of parse.
>To help you safely add and remove required fields, Typical offers an intermediate state between optional and required: asymmetric. An asymmetric field in a struct is considered required for the writer, but optional for the reader. Unlike optional fields, an asymmetric field can safely be promoted to required and vice versa.
Can't we extend this argument to eliminating basically all static typing? And frankly that'd not even be wrong, and is why Alan Kay defined OOP as one that's dynamically typed and late bound, and we went against it anyway to keep relearning the same lessons over and over.
The argument is really more like: Always defer validation until the point where the data is actually consumed, because only the consumer actually knows what is valid.
Which is definitely a counterpoint to the oft-stated argument that you should validate all data upfront.
Either way though, you can still have types, the question is just when and where (in a distributed system, especially) they should be checked.
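As a sketch of what validating at the consumer can look like (hypothetical all-optional Payment message, not a schema from the thread):

    #include "payment.pb.h"  // hypothetical generated code for message Payment
    #include <stdexcept>

    // The parser accepts any structurally valid Payment. Only the code that
    // actually charges the card knows -- and checks -- what it needs.
    void ChargeCard(const Payment& p) {
      if (!p.has_amount_cents() || p.amount_cents() <= 0) {
        throw std::invalid_argument("payment missing a positive amount");
      }
      if (!p.has_card_token()) {
        throw std::invalid_argument("payment missing a card token");
      }
      // ... charge p.amount_cents() against p.card_token() ...
    }

    // A middleman that only routes on, say, p.region() never looks at the
    // fields above, so it keeps working as they are added or retired.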
The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.
I was at Google when the "let's get rid of optional" crusade started. It didn't make sense to me then and over a decade later it still doesn't. If a program expects a field to be there then it has to be there, removing the protobuf level checking just meant that programs could now read garbage (some default value) instead of immediately crashing. But the whole reason we have types, assertions, bounds checking and so on is because, almost always, we'd like our software to NOT just blindly plough on into undefined territory when it doesn't understand something properly, so in reality it just means everyone ends up coding those very same required-ness assertions by hand.
Now, Google's business model is remarkably robust to generating and processing corrupt data, so you can argue that in the specific case of this specific company, it is actually better to silently serve garbage than to crash. This argument was made explicitly in other forms, like when they deleted all the assertions from the HTTP load balancers. But in every case where I examined an anti-required argument carefully the actual problem would turn out to be elsewhere, and removing assertions was just covering things up. The fact that so much of Google's code is written in C++ that not only starts up slowly but also just immediately aborts the entire process when something goes wrong also contributes to the brittleness that encourages this kind of thing. If Google had been built on a language with usable exceptions right from the start it'd have been easier to limit the blast radius of data structure versioning errors to only the requests where that data structure turned up, instead of causing them to nuke the entire server (and then the RPC stack will helpfully retry because it doesn't know why the server died, promptly killing all of them).
But this tolerance to undefined behavior is not true for almost any other business (except maybe video games?). In those businesses it's better to be stopped than wrong. If you don't then you can lose money, lose data, lose customers or in the worst cases even lose your life. I don't think people appreciate the extent to which the unique oddities of Google's business model and infrastructure choices have leaked out into the libraries their staffers/ex-staffers release.
> The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.
The middleman software in question often needed to process some part of the message but not others. It wasn't realistic to define a boundary between what each middleman might need and what they wouldn't need, and somehow push the "not needed" part into nested encoded blobs.
I'm not sure the rest of your comment is really addressing the issue here. The argument doesn't have anything to do with proceeding forward in the face of corrupt data or undefined behavior. The argument is that validation needs to happen at the consumer. There should still be validation.
> It wasn't realistic to define a boundary between what each middleman might need and what they wouldn't need, and somehow push the "not needed" part into nested encoded blobs.
This is an interesting argument that I would like to see more elaboration on, because that's the obvious solution. Effectively you're building a pipeline of data processors and each stage in the pipeline reads its own information and then passes along a payload with the rest of the information to the next stage. This would preserve full static typing with required fields, but I can see how it might inhibit some forms of dynamic instrumentation, eg. turning verbose logging on/off might dynamically reconfigure the pipeline, which would affect all upstream producers if they're wrapping messages for downstream consumers.
If this were a programming language I would immediately think of row typing to specify the parts that each stage depends on while being agnostic about the rest of the content, but I'm not sure how that might work for a serialization format. Effectively, you're pulling out a typed "view" over the underlying data that contains offsets to the underlying fields (this is the dictionary-passing transform as found in Haskell).
The particular piece of infrastructure I worked on sat in the middle of the search pipeline, between the front-end that served HTML web pages, and the back-end index. This middle piece would request search results from the back-end, tweak them in a bunch of ways, and forward them on.
These "tweaks" could be just about anything. Like: "You searched for Jaguar, but I don't know if you meant the cat or the car. The index decided that pages about the car rank higher so the first three pages of results are about the car. I'm going to pull some results about the cat from page 4 and put them near the top so just in case that's what you really wanted, you'll find it."
Google Search, at least when I worked on it, was composed of a huge number of such tweaks. People were constantly proposing them, testing if they led to an improvement, and shipping them if they did. For a variety of reasons, our middleman server was a great place to implement certain kinds of tweaks.
But what kinds of information are needed for these "tweaks"? Could be anything! It's a general-purpose platform. Search results were annotated with all kinds of crazy information, and any piece of information might be useful in implementing some sort of middleman tweak at some point.
So you couldn't very well say upfront "OK, we're going to put all the info that is only for the frontend into the special 'frontend blob' that doesn't get parsed by the middleman", because you have no idea what fields are only needed by the frontend. In fact, that set would change over time.
> If this were a programming language I would immediately think of row typing to specify the parts that each stage depends on while being agnostic about the rest of the content
Indeed, perhaps one could develop an elaborate system where in the schemas, we could annotate certain fields as being relevant to certain servers. Anywhere else, those fields would be unavailable (but passed through without modification or validation). If you needed the fields in a new place, you change the annotations.
But that sounds... complicated to design and cumbersome to maintain the annotations. Simply banning required fields solved the problem for us, and everything else just worked.
> Indeed, perhaps one could develop an elaborate system where in the schemas, we could annotate certain fields as being relevant to certain servers. Anywhere else, those fields would be unavailable (but passed through without modification or validation). If you needed the fields in a new place, you change the annotations.
I don't think it has to be elaborate. What I was thinking was something more like, in pseudo-C#:
    // the framework's general channel type from which messages are read
    public interface IChannel
    {
        // C# has no "where T : interface" constraint; T is expected to be an
        // interface describing the fields this stage needs, and the structural
        // check against the underlying message happens at runtime.
        T Read<T>() where T : class;
    }

    // clients declare the interface they operate on:
    public interface IClientFields
    {
        public int Foo { get; set; }
        public string? Name { get; set; }
    }

    ...

    // client middleware function
    Task MiddlewareFn(IChannel chan)
    {
        var client = chan.Read<IClientFields>();
        ... // do something with client before resuming at next stage
    }
The client's interface type T must simply be a structural subtype of the underlying message type. As long as the underlying format is somewhat self-descriptive with a name and type map, you can perform the necessary checking that only applies locally to the client. Nothing fancy, and the required fields that client cares about are still there and the rest are ignored because they're never referenced. This could return an interface that contains a series of offsets into the data stream, which I believe is how capnproto already works.
Are you saying that each service would need to declare, separately, the subset of fields they operate on, and make sure that those fields are always a strict subset of the overall set of fields the protocol contains?
This essentially means declaring the same protocol multiple times, which seems like a big pain.
Assuming the service is not operating on the whole format, then it's already implicitly depending on the presence of those fields and also implicitly depending on the fact that they are a strict subset of the overall set of fields in the protocol. I'm not sure why making this fact slightly more explicit by having the client add a single interface would be a big pain.
In principle, this also enables type checking the whole pipeline before deployments since the interfaces can be known upfront rather than latent in the code.
I said a single interface per client, where I'm using "client" as a stage in the pipeline. Each piece of middleware that plugged into this pipeline already implicitly depends on a schema, so why not describe the schema explicitly as some subset of the underlying message?
It's easier to understand in context - some services (iirc web search but it might have been ads or something else very core) had kept adding fields to some core protobufs for years and years. It made sense, was the path of least resistance etc, but inevitably some of those fields became obsolete and they wanted to remove them but found it was hard, because every program that did anything with web search was deserializing these structures.
Truly generic middleware like RPC balancers did what you are saying, but there were also a lot of service specific "middlemen" which did need to look at parts of these mega-structures.
Now due to how protobufs work, you can do what you suggest and "cast" a byte stream to multiple different types, so they could have defined subsets of the overall structures and maybe they did, I don't remember, but the issue then is code duplication. You end up defining the same structures multiple times, just as subsets. With a more advanced type system you can eliminate the duplication, but there was a strong reluctance to add features to protobufs.
Honestly, I wonder how big the performance win from static types really is here, because this sounds so terribly well suited to dynamic types (of which optionality by default is in fact a limited example). Such an odd choice to calcify a spec in the places where it changes all the time. "Static" optimizations should be local, not distributed.
I think you're defining consumer as the literal line of code where the field is read, whereas a more natural definition would be something like "the moment the data structure is deserialized". After all it's usually better to abort early than half way through an operation.
It was quite realistic to improve protobufs to help dig web search out of their "everything+dog consumes an enormous monolithic datastructure" problem, assuming that's what you're thinking of (my memory of the details of this time is getting fuzzy).
A simple brute-force fix for their situation would have been to make validation of required fields toggle-able on a per-parse level, so they could disable validation for their own parts of the stack without taking it away for everyone else (none of the projects I worked on had problems with required fields that I can recall).
A better fix would have been for protobufs to support composition. They could then have started breaking down the mega-struct into overlapping protos, with the original being defined as a recursive merge of them. That'd have let them start narrowing down semantically meaningful views over what the programs really needed.
The worst fix was to remove validation features from the language, thus forcing everyone to manually re-add them without the help of the compiler.
Really, the protobuf type system was too simple for Google even in 2006. I recall during training wondering why it didn't have a URL type given that this was a web-centric company. Shortly after I discovered a very simple and obvious bug in web search in which some local business results were 404s even though the URL existed. It had been there for months, maybe years, and I found it by reading the user support forums (nobody else did this, my manager considered me way out of my lane for doing so).
The bug was that nothing anywhere in the pipeline checked that the website address entered by the business owner started with https://, so when the result was stuffed into an <a> tag it turned into <a href="www.business.com"> and so the user ended up at https://www.google.com/www.business.com. Oops. These bad strings made it all the way from the business owner, through the LBC frontend, the data pipeline, the intermediate microservices and the web search frontends to the user's browser. The URL did pass crawl validation because when loaded into a URL type, the missing protocol was being added.
SREs were trained to do post-mortems, so after it got fixed and the database was patched up, I naively asked whether there was a systematic fix for this, like maybe adding a URL type to protobufs so data would be validated right at the start. The answer was "it sounds like you're asking how to not write bugs" and nothing was done, sigh. It's entirely possible that similar bugs reoccurred dozens of times without being detected.
Those are just a couple of cases where the simplicity (or primitivity) of the protobuf type system led to avoidable problems. Sure, there are complexity limits too, but the actual languages Googlers were using all had more sophisticated type systems than protobuf and bugs at the edges weren't uncommon.
> I think you're defining consumer as the literal line of code where the field is read
I am.
> After all it's usually better to abort early than half way through an operation.
I realize this goes against common wisdom, but I actually disagree.
It's simply unrealistic to imagine that we can fully determine whether an operation will succeed by examining the inputs upfront. Even if the inputs are fully valid, all sorts of things can go wrong at runtime. Maybe a database connection is randomly dropped. Maybe you run out of memory. Maybe the power goes out.
So we already have to design our code to be tolerant to random failures in the middle. This is why we try to group our state changes into a single transaction, or design things to be idempotent.
Given we already have to do all that, I think trying to validate input upfront creates more trouble than it solves. When your validation code is far away from the code that actually processes the data, it is easier to miss things and harder to keep in sync.
To be clear, though, this does not mean I like dynamic typing. Static types are great. But the reason I like them is more because they make programming easier, letting you understand the structure of the data you're dealing with, letting the IDE implement auto-complete, jump-to-definition, and error checking, etc.
Consider TypeScript, which implements static typing on JavaScript, but explicitly does not perform any runtime checks whatsoever validating types. It's absolutely possible that a value at runtime does not match the type that TypeScript assigned to it. The result is a runtime exception when you try to access the value in a way that it doesn't support (even though its type says it should have). And yet, people love TypeScript, it clearly provides value despite this.
This stuff makes a programming language theorist's head explode, but in practice it works. Look, anything can be invalid in ways you never thought of, and no type system can fully defend you from that. You gotta get comfortable with the idea that exceptions might be thrown from anywhere, and design systems to accommodate failure.
1. The advantage of having it in the type system is the compiler can't forget.
2. It's quite hard to unwind operations in C++. I think delaying validation to the last moment is easier when you have robust exceptions. At the top level the frameworks can reject RPCs or return a 400 or whatever it is you want to do, if it's found out 20 frames deep into some massive chunk of code then you're very likely to lose useful context as the error gets unwound (and worse error messages).
On forgetting, the risky situation is something like this:
The intention is: in v1 of the message there's some default information returned, but in v2 the client is given more control including the ability to return less information as well as more. In proto2 you can query if options is set, and if not, select the right default value. In proto3 you can't tell the difference between an old client and a client that wants no extra information returned. That's a bug waiting to happen: the difference between "not set" and "default value" is important. Other variants are things like adding "int32 timeout" where it defaults to zero, or even just having a client that forgets to set a required field by mistake.
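A sketch of that hazard, assuming a hypothetical GetFooRequest that gained "optional FooOptions options = 1;" and "optional int32 timeout = 2;" in v2, under proto2 semantics:

    #include "get_foo.pb.h"  // hypothetical generated code

    constexpr int kDefaultTimeoutMs = 30000;

    void HandleGetFoo(const GetFooRequest& req) {
      // proto2 (or proto3 with an explicit "optional" keyword) tracks presence,
      // so an old client that has never heard of these fields is distinguishable
      // from a new client that deliberately asked for "no extra info" or timeout = 0.
      bool useV1Defaults = !req.has_options();
      int timeoutMs = req.has_timeout() ? req.timeout() : kDefaultTimeoutMs;

      // Under proto3 implicit presence there is no has_timeout(): req.timeout()
      // returns 0 either way, and "not set" silently collides with "set to 0".
      // ... handle the request using useV1Defaults and timeoutMs ...
      (void)useV1Defaults; (void)timeoutMs;
    }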
TypeScript does indeed not do validation of type casts up front, but that's more because it's specifically designed to be compatible with JavaScript and the runtime doesn't do strong typing. People like it compared to raw JS.
> Consider TypeScript, which implements static typing on JavaScript, but explicitly does not perform any runtime checks whatsoever validating types. It's absolutely possible that a value at runtime does not match the type that TypeScript assigned to it. The result is a runtime exception when you try to access the value in a way that it doesn't support (even though its type says it should have). And yet, people love TypeScript, it clearly provides value despite this.
> This stuff makes a programming language theorist's head explode but it practice it works. Look, anything can be invalid in ways you never thought of, and no type system can fully defend you from that. You gotta get comfortable with the idea that exceptions might be thrown from anywhere, and design systems to accommodate failure.
It's only possible if you're doing something wrong type-wise. In particular, when ingesting an object you're supposed to validate it before/as you assign the type to it. Delaying the error until the particular field is accessed is bad TypeScript! Those kinds of exceptions aren't supposed to be thrown from anywhere.
I think this comes from everyone wanting to use the same schema and parser. For example, a text editor and a compiler have obvious differences in how to deal with invalid programs.
Maybe there need to be levels of validation, like "it's a text file" versus "it parses" versus "it type checks."
Sure, that would also have been a fine solution. There are lots of ways to tackle it really and some of it is just very subjective. There's a lot of similarities here between the NoSQL vs SQL debates. Do you want a schemaless collection of JSON documents or do you want enforced schemas, people can debate this stuff for a long time.
You can also see it as a version control and awareness problem rather than a schema or serialization problem. The issues don't occur if you always have full awareness of what code is running and what's consuming what data, but that's hard especially when you take into account batch jobs.
> The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.
> I was at Google when the "let's get rid of optional" crusade started. It didn't make sense to me then and over a decade later it still doesn't. If a program expects a field to be there then it has to be there, removing the protobuf level checking just meant that programs could now read garbage (some default value) instead of immediately crashing. But the whole reason we have types, assertions, bounds checking and so on is because, almost always, we'd like our software to NOT just blindly plough on into undefined territory when it doesn't understand something properly, so in reality it just means everyone ends up coding those very same required-ness assertions by hand.
Yeah, that's what stuck out to me from the linked explanation as well; the issue wasn't that the field was required, it was that the message bus was not doing what was originally claimed. It sounds like either having the message bus _just_ process the header and not the entire message, or having the header carry a version number indicating which fields are required (with version numbers newer than the latest the bus was aware of treated as having no required fields), would have solved it. I don't claim that it's never correct to design a protocol optimized for robustness when consumed by poorly written clients, but I struggle to see why making that the only possible way to implement a protocol is the right call. Maybe the goal of Cap'n Proto is to be prescriptive about this sort of thing, in which case it wouldn't be a good choice for uses where there's more rigor in the implementation of the services using the protocol; but if it's intended for more general usage, I don't understand this design decision at all.
What you say is valuable, and it's kinda odd that some people here discard practical experience in favor of their subjective flavor of theoretical correctness.
The distributed part shifts the problem from "find types that represent your solution" to "find a system of types that enable evolution of your solution over time." I think this is why bad things like json or xml do so well: they work fine with a client dev saying, "I need this extra data" and the server dev adding it, and then the client dev consuming it.
The more modern approaches, like protobuf or Cap'n Proto, are designed with the experience of mutating protocols over time.
It works pretty well too unless the new field changes the semantics of old field values, e.g. adding a field "payment_is_reversal_if_set" to a payment info type, which would change the meaning of the signs of the amounts. In that case, you have to reason more explicitly about when to roll out the protocol readers and when to roll out the protocol writers. Or version it, etc.
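A sketch of why that particular change bites, with hypothetical field names taken from the example above and proto2-style presence accessors assumed:

    #include "payment_info.pb.h"  // hypothetical generated code
    #include <cstdint>

    int64_t SignedAmountCents(const PaymentInfo& p) {
      // A reader built against the new schema knows the flag flips the sign.
      if (p.has_payment_is_reversal_if_set() && p.payment_is_reversal_if_set()) {
        return -p.amount_cents();
      }
      return p.amount_cents();
      // A reader built before the field existed returns p.amount_cents()
      // unconditionally and silently gets reversals wrong -- which is why the
      // writer/reader rollout order suddenly matters for this kind of change.
    }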
> Can't we extend this argument to eliminating basically all static typing?
No, because static typing exists in all sorts of places. This argument is primarily about cases where you're exchanging data, which is a very specific use case.
Static type systems in programming languages are designed to break at compilation-time. The reason this works is because all users are within the same “program unit”, on the same version.
In other words, static typing allows more validation to be automated, and removes the need for multiple simultaneous versions, but assumes that the developer has access and ability to change all other users at the same “time” of their own change.
I find this whole topic fascinating. It seems like programmers are limited to an implicit understanding of these differences but it’s never formalized (or even properly conceptualized). Thus, our intuition often fails with complex systems (eg multiple simultaneous versions, etc). Case in point: even mighty Google distinguished engineers made this “billion-dollar mistake” with required fields, even though they had near-perfect up-front knowledge of their planned use-cases.
It's actually the opposite. The billion dollar mistake is to have pervasive implicit nullability, not to have the concept of optionality in your type system. Encoding optionality in the type system and making things required by default is usually given as the fix for the billion dollar mistake.
Huh? Did you read the link, from the guy who was there during the major failure at Google that led to proto3 being redesigned without that flaw?
The whole lesson is that you can’t apply the lessons from static type systems in PLs when you have multiple versions and fragmented validation across different subsystems. Counter-intuitively! Everyone thought it was a good idea, and it turned out to be a disaster.
I did read the link and I was at Google at the time people started arguing for that. With respect, I think the argument was and still is incorrect, that the wrong lessons were drawn and that proto3 is worse than proto2.
You reject the message in the framework? And if the client is aware it's required, they fail to send?
The bigger challenge with proto3 is that people use it both for RPC and storage, in some cases directly serializing RPC payloads. Disregarding how awful a choice that is, for stored data you likely want to trade rigidity and conformance for flexible deserialization of old data.
It remains a big asterisk to me: why was some random middleware validating an end-to-end message between two systems, instead of just treating it as an opaque message?
Why are we not having this "everything must be optional" debate about IP packets, for example? Because the payload is just opaque bytes. If you want to ensure integrity, you checksum the payload.
Things like distributed tracing, auth data, metrics, error logging messages, and other “meta-subsystems” are certainly typical use cases. Reverse proxies and other http middleware do exactly this with http headers all the time.
No one has near-perfect up-front knowledge of a software system designed to change and expand. The solution space is too large and the efficient delivery methods are a search thru this space.
I may have phrased it poorly. What I should have said is that Google absolutely could have “anticipated” that many of their subsystems would deal with partial messages and multiple versions, because they most certainly already did. The designers would have maintained, developed and debugged exactly such systems for years.
Static types are a partial application/reduction when certain mutable or unknown variables become constants (e.g. "I for sure only need integers between 0 and 255 here").
I'm not rejecting static types entirely, and yes I was discussing exchanging data here, as Alan Kay's OOP is inherently distributed. It's much closer to Erlang than it is to Java.
I'm not the person you're responding to, but I interpreted their comment as, "doesn't the argument against having protobuf check for required fields also apply to all of protobuf's other checks?"
From the linked article: "The right answer is for applications to do validation as-needed in application-level code. If you want to detect when a client fails to set a particular field, give the field an invalid default value and then check for that value on the server. Low-level infrastructure that doesn’t care about message content should not validate it at all."
(I agree that "static typing" isn't exactly the right term here. But protobuf dynamic validation allows the programmer to then rely on static types, vs having to dynamically check those properties with hand-written code, so I can see why someone might use that term.)
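To make that advice concrete, here is a minimal Python sketch of the sentinel-default pattern. The Order message, its amount field, and the choice of 0 as the "invalid" sentinel are all hypothetical, not taken from the article:

    # Hypothetical proto3-style message where amount's default (0) doubles as "unset".
    class Order:
        def __init__(self, amount=0):
            self.amount = amount  # 0 means the client never set it

    def handle_order(order):
        # Application-level check, instead of a transport-level "required" label.
        if order.amount == 0:
            raise ValueError("client did not set 'amount'")
        return order.amount * 2  # stand-in for real business logic

    handle_order(Order(amount=5))   # ok
    # handle_order(Order())         # would raise: client did not set 'amount'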
Sorry, I see how I'm vague. The idea is you have no "pre-burned" static types, but dynamic types. And static types then become a disposable optimization compiled out of more dynamic code, in the same way JIT works in V8 and JVM for example (where type specialization is in fact part of the optimization strategy).
No. "Types only known at runtime" are dynamic types. "And also you can optimize by examining the types at runtime" is just dynamic types. And it does not have the benefit of static types because it is dynamic types.
This is devolving into a "word definition war" so I'll leave aside what you call static types and dynamic types and get down to specifics. Type info is available in these flavors, relative to runtime:
1. Type info which is available before runtime, but not at runtime (compiled away).
2. Type info which is available at runtime, but not at compile time (input, statistics, etc.).
3. Type info which is available both at compile time and runtime (say like a Java class).
When you have a JIT optimizer that can turn [3] and [2] into [1], there's no longer a reason to have [1], except if you're micro-optimizing embedded code for some device with 64kb RAM or whatever. We've carried these legacy practices through without even questioning them, and we try to push them way out of their league into large-scale distributed software.
When I say we don't need [1], this doesn't mean I deny [3], which is still statically analyzable type information. It's static types, but without throwing away runtime flexibility and data that don't need to be thrown away.
Short of time travel, one cannot turn (3) or (2) into (1). I'm not sure where the confusion here is or what you're advocating for, because this isn't making sense to me.
> there's no longer a reason to have [1]
I guess if you're assuming the value of static types is just performance? But it's not, not by a long shot - hence 'mypy', a static typechecker that in no way impacts runtime.
I think this conversation is a bit too confusing for me so I'm gonna respectfully walk away :)
The confusion is to assume "runtime" is statically defined. JIT generates code which omits type information that's determined not to be needed in the context of the compiled method/trace/class/module. That code still "runs"; it's "runtime".
It's easy to imagine any statically typed language having a general-purpose JSON type. You could imagine all functions accepting and returning such objects.
Now it's your turn to implement the sum(a,b) function. Would you like to allow the caller to pass anything in as a and b?
This is like when people use protobuf to send a list of key-value mappings and call that a protocol. (I've seen that same design in many protocol description arenas, even SQL database schemas that are just (entityId INT, key CLOB, value BLOB).)
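As a toy illustration of why that hurts, here is a hedged Python sketch (the names are made up): once everything is a general-purpose "JSON value", every function has to re-validate its inputs at runtime.

    from typing import Any

    def sum_values(a: Any, b: Any) -> int:
        # With a catch-all "JSON value" parameter type, the type system can no
        # longer rule anything out, so the checks move into the function body.
        if not isinstance(a, int) or not isinstance(b, int):
            raise TypeError("sum_values expects two integers")
        return a + b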
Do you need to make different versions of a program exchange information even though they do not agree on the types? No? Then this argument cannot be extended this way.
> The right answer is for applications to do validation as-needed in application-level code.
It would've been nice to include a parameter to switch "required message validation" on and off, instead of relying on application code. Internally in an application, we can turn this off, the message bus can turn it off, but in general, developers would really benefit from this being on.
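A hedged sketch of what such a switch could look like, as a hypothetical helper that edge services enable and internal hops disable (the message type and field names are invented):

    # Hypothetical required-field check with an on/off switch.
    REQUIRED = {"Order": ("id", "amount")}

    def check_required(msg_type, msg, validate=True):
        if validate:
            missing = [f for f in REQUIRED.get(msg_type, ()) if f not in msg]
            if missing:
                raise ValueError(f"{msg_type} is missing required fields: {missing}")
        return msg

    check_required("Order", {"id": 1, "amount": 5})        # edge: validation on
    check_required("Order", {"id": 1}, validate=False)     # internal hop: off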
A gotcha along the same path: with generated clients, you end up deserializing things you don't even need. That's an aspect of interfaces in Go I really like: declare types only for what I use remotely and skip the rest. It's not fun to have incidents caused by changes to a part of a contract that a service doesn't even use, and those are also hard to track down.
Avro solves this problem completely, and more elegantly, with its schema resolution mechanism. Exchanging schemas at the beginning of a connection handshake is hardly burdensome.
If by "solving" you mean "refuse to do anything at all unless you have the exact schema version of the message you're trying to read" then yes. In a RPC context that might even be fine, but in a message queue...
I will never use Avro again on a MQ. I also found the schema resolution mechanism anemic.
Avro was (is?) popular on Kafka, but it is such a bad fit that Confluent created a whole additional piece of infra called Schema Registry [1] to make it work. For Protobuf and JSON schema, it's 90% useless and sometimes actively harmful.
I think you can also embed the schema in an Avro message to solve this, but then you add a massive amount of overhead if you send individual messages.
> but it is such a bad fit that Confluent created a whole additional piece of infra called Schema Registry [1] to make it work.
That seems like a weird way to describe it. It is assumed that a schema registry would be present for something like Avro. It's just how it's designed - the assumption with Avro is that you can share your schemas. If you can't abide by that don't use it.
I do not think it's unfair at all. Schema Registry needs to add a wrapper and UUID to an Avro payload for it to work, so at the very least Avro as-is is unsuitable for an MQ like Kafka, since you cannot use it efficiently without some out-of-band communication channel.
Everyone knows you need an out of band channel for it, I don't know why you're putting this out there like it's a fault instead of how it's designed. Whether it's RPC where you can deploy your services or a schema registry, that is literally just how it works.
Wrapping a message with its schema version so that you can look up that version is a really sensible way to go. A uuid is way more than what's needed since they could have just used a serial integer but whatever, that's on Kafka for building it that way, not Avro.
Understanding the data now depends not just on having a schema to be found in a registry, but on your specific schema registry, with the schemata registered in the same order you registered them in. If you want to port some data from prod back to staging, you need to rewrite the IDs. If you merge with some other company using serial IDs and want to share data, you need to rewrite the IDs. Etc.
They're saying that for a sequential number, you must have a central authority managing that number. If two environments have different authorities, they have a different mapping of these sequential numbers, so now they can't share data.
Oh. Like, if you have two different schema registries that you simultaneously deploy new schemas to while also wanting to synchronize the schemas across them.
Having the schema for a data format I'm decoding has never been a problem in my line of work, and I've dealt with dozens of data formats. Evolution, versioning and deprecating fields, on the other hand, are always a pain in the butt.
If an n+1 version producer sends a message to the message queue with a new optional field, how do the n version consumers have the right schema without relying on some external store?
In Protobuf or JSON this is not a problem at all, the new field is ignored. With Avro you cannot read the message.
I mean, a schema registry solves this problem, and you just put the schema into the registry before the software is released.
A simpler option is to just publish the schema into the queue periodically, say every 30 seconds; then receivers can cache schemas for message types they are interested in.
Disagree. Avro makes messages slightly smaller by removing tags, but it makes individual messages completely incomprehensible without the writer schema. For serializing data on disk it's fine and a reasonable tradeoff to save space, but for communication on the wire tagged formats allow for more flexibility on the receiver end.
The spec for evolving schemas is also full of ambiguity and relies on the canonical Java implementation. I've built an Avro decoder from scratch and some of the evolution behaviour is counter-intuitive.
Protobufs are also not really decodable without the schema. You can't even properly log unknown tags, because the wire format is ambiguous and doesn't encode the actual data type (just a coarse wire type and size).
> Exchanging schemas at the beginning of a connection handshake is hardly burdensome.
I dunno, that sounds extremely burdensome to me, especially if the actual payload is small.
And how exactly does exchanging schemas solve the problem? If my version of the schema says this field is required but yours says it is optional, and so you don't send it, what am I supposed to do?
Avro makes that case slightly better because you can set a default value for the missing field in one of the two schemas, and then it works.
It's not worth the boatload of problems it brings in all the other, normal use cases, though. Having the default value in the app or specified by the protocol is good enough.
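For what it's worth, here's a small sketch of that resolution behaviour using the fastavro Python library (the Event record and its fields are invented for the example): the reader passes both the writer's schema and its own, and the field missing from the written data is filled in from the reader-schema default.

    import io
    import fastavro

    writer_schema = fastavro.parse_schema({
        "type": "record", "name": "Event", "fields": [
            {"name": "id", "type": "long"},
        ],
    })
    reader_schema = fastavro.parse_schema({
        "type": "record", "name": "Event", "fields": [
            {"name": "id", "type": "long"},
            {"name": "source", "type": "string", "default": "unknown"},  # newer field
        ],
    })

    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, writer_schema, {"id": 42})
    buf.seek(0)
    # Resolution needs *both* schemas; without the writer schema the bytes are opaque.
    print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
    # {'id': 42, 'source': 'unknown'}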
Great achievement. To be honest I wouldn't recommend Capnp. The C++ API is very awkward.
The zero copy parsing is less of a benefit than you'd expect - pretty unlikely you're going to want to keep your data as a Capnp data structure because of how awkward it is to use. 99% of the time you'll just copy it into your own data structures anyway.
There's also more friction with the rest of the world which has more or less settled on Protobuf as the most popular binary implementation of this sort of idea.
I only used it for serialisation. Maybe the RPC stuff is more compelling.
I really wish Thrift had taken off instead of Protobuf/gRPC. It was so much better designed and more flexible than anything I've seen before or since. I think it died mainly due to terrible documentation. I guess it also didn't have a big name behind it.
I do agree that the API required for zero-copy turns out a bit awkward, particularly on the writing side. The reading side doesn't look much different. Meanwhile zero-copy is really only a paradigm shift in certain scenarios, like when used with mmap(). For network communications it doesn't change much unless you are doing something hardcore like RDMA. I've always wanted to add an optional alternative API to Cap'n Proto that uses "plain old C structures" (or something close to it) with one-copy serialization (just like protobuf) for the use cases where zero-copy doesn't really matter. But haven't gotten around to it yet...
That said I personally have always been much more excited about the RPC protocol than the serialization. I think the RPC protocol is actually a paradigm shift for almost any non-trivial use case.
I've always been excited about zero copy messages in the context of its potential in database systems; the thought of tuples working their way all the way from btree nodes in a pager, to query results on the network without copies seems fantastic.
But every time I've tried to prototype or implement around this model I've run into conceptual blocks. It's a tricky paradigm to fully wrap one's head around, and to squeeze into existing toolsets.
One thing about google proto is that, at least in many languages, every message throws off a ton of garbage that stresses the GC. On the send side, you can obviously re-use objects, but on the receive side no.
More and more languages are being built on top of the "upb" C library for protobuf (https://github.com/protocolbuffers/upb) which is designed around arenas to avoid this very problem.
Currently Ruby, PHP, and Python are backed by upb.
Disclosure: I work on the protobuf team, and created the upb library.
This is also because Google's Protobuf implementations aren't doing a very good job with avoiding unnecessary allocations. Gogoproto is better and it is possible to do even better, here is an example prototype I have put together for Go (even if you do not use the laziness part it is still much faster than Google's implementation): https://github.com/splunk/exp-lazyproto
What part of the industry are you in where flatbuffers is seen as the de facto standard? Personally I've never randomly encountered a project using flatbuffers. I see protobuf all the time.
(I've randomly run into Cap'n Proto maybe 2-3 times but to be fair I'm probably more likely to notice that.)
Flatbuffers seems to have penetration in the games industry. And it sounds like from other posters that Facebook uses it.
I recently started a job doing work on autonomy systems that run in tractors, and was surprised to see we use it (flatbuffers) in the messaging layer (in both C++ and Rust)
As of the last time I was close, flatbuffer usage is or was near ubiquitous for use in FB's (ha ha, go figure) mobile apps, across Android and iOS at least.
I find MessagePack to be pretty great if you don't need a schema. JSON serialization is unreasonably fast in V8, though, and even MessagePack can't beat it; but MessagePack is often faster in other languages, and it saves on bytes.
It depends on your data. We ran comparisons on objects with lots of numbers and arrays (think GeoJSON) and messagepack came out way ahead. Of course, something like Arrow may have fared even better with its focus on columnar data, but we didn't want to venture that far afield just yet.
Encoding JSON or MessagePack will be about the same speed, although I would expect MessagePack to be marginally faster from what I’ve seen over the years. It’s easy to encode data in most formats, compression excluded.
Parsing is the real problem with JSON, and no, it isn’t even close. MessagePack knows the length of every field, so it is extremely fast to parse, an advantage that grows rapidly when large strings are a common part of the data in question. I love the simple visual explanation of how MessagePack works here: https://msgpack.org/
Anyone who has written parsing code can instantly recognize what makes a format like this efficient to parse compared to JSON.
With some seriously wild SIMD JSON parsing libraries, you can get closer to the parsing performance of a format like MessagePack, but I think it is physically impossible for JSON to be faster. You simply have to read every byte of JSON one way or another, which takes time. You also don’t have any ability to pre-allocate for JSON unless you do two passes, which would be expensive to do too. You have no idea how many objects are in an array, you have no idea how long a string will be.
MessagePack objects are certainly smaller than JSON but larger than compressed JSON. Even compressed MessagePack objects are larger than the equivalent compressed JSON, in my experience, likely because the field length indicators add a randomness to the data that makes compression less effective.
For applications where you need to handle terabytes of data flowing through a pipeline every hour, MessagePack can be a huge win in terms of cost due to the increased CPU efficiency, and it’s a much smaller lift to switch to MessagePack from JSON than to switch to something statically typed like Protobuf or CapnProto, just due to how closely MessagePack matches JSON. (But, if you can switch to Protobuf or CapnProto, those should yield similar and perhaps even modestly better benefits.)
Compute costs are much higher than storage costs, so I would happily take a small size penalty if it reduced my CPU utilization by a large amount, which MessagePack easily does for applications that are very data-heavy. I’m sure there is at least one terribly slow implementation of MessagePack out there somewhere, but most of them seem quite fast compared to JSON.
Also take note of the “ShamatonGen” results, which use codegen before compile time to do things even more efficiently for types known ahead of time, compared to the normal reflection-based implementation. The “Array” results are a weird version that isn’t strictly comparable: the encoding and decoding steps assume the fields are in a fixed order, so the encoded data is just arrays of values with no field names. It can be faster and more compact, but it’s not “normal” MessagePack.
I’ve personally seen crazy differences in performance vs JSON.
If you’re not handling a minimum of terabytes of JSON per day, then the compute costs from JSON are probably irrelevant and not worth thinking too hard about, but there can be other benefits to switching away from JSON.
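As a rough, hedged illustration of the round-trip and raw-size behaviour (the exact numbers will vary by payload), using the msgpack Python package:

    import json
    import msgpack  # pip install msgpack

    doc = {"name": "sensor-7", "readings": [1.5, 2.25, 3.0], "ok": True}

    as_json = json.dumps(doc).encode()
    as_msgpack = msgpack.packb(doc)

    # MessagePack prefixes strings, arrays and maps with their lengths, so the
    # decoder never has to scan for delimiters the way a JSON parser must.
    print(len(as_json), len(as_msgpack))   # msgpack is typically smaller uncompressed
    assert msgpack.unpackb(as_msgpack) == doc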
Size savings depends I guess on the workload. That home page example gets larger gzip'd so raw msgpack is smaller. Another comment says their data was considerably smaller vs json.
Sometimes you can't gzip for various reasons. There were permessage-deflate bugs in Safari and Brave somewhat recently. Microsoft is obsessed with the decades-old CRIME/BREACH attacks for some reason (I've never heard any other company or individual even mention them), so SignalR still doesn't have a compression option yet.
I always liked the idea of capnp, but it bothers me that what is ultimately a message encoding protocol has an opinion on how I should architect my server.
FWIW, gRPC certainly has this problem too, but it’s very clearly distinct from protobuf, although pb has gRPC-related features.
That entanglement makes me lean towards flatbuffers or even protobuf every time I weigh them against capnp, especially since it means that fb and pb have much simpler implementations, and I place great value on simplicity for both security and maintenance reasons.
I think the lack of good third-party language implementations speaks directly to the reasonability of that assessment. It also makes the bus factor and longevity story very poor. Simplicity rules.
Part of the problem with cap'n'proto whenever I've approached it is that not only does it have an opinion on how to architect your server (fine, whatever) but in C++ it ends up shipping with its own very opinionated alternative to the STL ("KJ") and when I played with it some years ago it really ended up getting its fingers everywhere and was hard to work into an existing codebase.
The Rust version also comes with its own normative lifestyle assumptions; many of which make sense in the context of its zero-copy world but still make a lot of things hard to express, and the documentation was hard to parse.
I tend to reach for flatbuffers instead, for this reason alone.
Still, I hope someday to have a need and a use for cap'n'proto, or at least to finish one of the several hobby projects I've forked off over the years to try to use it. There's some high quality engineering there.
Yes, it's true, the C++ implementation has become extremely opinionated.
I didn't initially intend for KJ to become as all-encompassing as it has. I guess I kept running into things that didn't work well about the standard library, so I'd make an alternative that worked well, but then other parts of the standard library would not play nicely with my alternative, so it snowballed a bit.
At the time the project started, C++11 -- which completely changed the language -- was brand new, and the standard library hadn't been updated to really work well with the new features.
The KJ Promise library in particular, which made asynchronous programming much nicer using the newly-introduced lambdas, predated any equivalent landing in the standard library by quite a bit. This is probably the most opinionated part of KJ, and the hardest to integrate with other systems. (Though KJ's event loop does actually have the ability to sit on top of other event loops, with some effort.)
And then I ended up with a complete ecosystem of libraries on top of Promises, like KJ HTTP.
With the Workers Runtime being built entirely in that ecosystem, it ends up making sense for me to keep improving that ecosystem, rather than try to make things work better across ecosystems... so here we are.
FWIW, one of the biggest obstacles to adoption of RPC components in our C++ codebase at work (not FB/Meta) is clean integration with our preexisting Folly + libevent scheduler. IMO, this has become a common theme across most C++ libraries that involve async IO. Rust has the same problem to a lesser extent. I think it's going to become a fundamental problem in all languages that don't have a strong opinion on, or built-in, async and schedulers.
Oh I understand completely how that would happen. I believe the first time I played with your work was not long after the C++11 transition, and so I could see why it happened.
This is why these days I just work in Rust :-) Less heterogeneous of an environment (so far).
Cap'N'Proto comes with a (quite good) RPC facility, based on asynchronous promises and grounded in capabilities.
You don't have to use it. You could use it just as a 'serialization' layer, but if you're writing services you could be missing half the advantage, really. And if you're writing in C++ you'll end up having to use their KJ library anyways.
If you take the whole package the zero copy, capability-security, and asynchrouny (a word I just coined!) all fit together nicely.
Yeah I'm aware of all of that. What I'm saying is that I don't see what about the Schema Definition Language pushes you towards the RPC other than that they obviously go well together, just like gRPC is almost always used with protobuf, or http with JSON.
> but it bothers me that what is ultimately a message encoding protocol has an opinion on how I should architect my server.
To me, this is like saying "Using JSON is unfortunate because it has an opinion that I should use HTTP" when I don't think anyone would argue that at all, and I don't see the argument for capnp much either.
The main thing that Cap'n Proto RPC really requires of the serialization is that object references are a first-class type. That is, when you make an RPC, the parameters or results can contain references to new, remote RPC objects. Upon receiving such a reference, that object is now callable.
Making this work nicely requires some integration between the serialization layer and the RPC layer, though it's certainly possible to imagine Protobuf being extended with some sort of hooks for this.
Huh, that reference actually never occurred to me.
The name Cap'n Proto actually originally meant "Capabilities and Protobufs" -- it was a capability-based RPC protocol based on Protocol Buffers. However, early on I decided I wanted to try a whole different serialization format instead. "Proto" still makes sense, since it is a protocol, so I kept the name.
The pun "cerealization protocol" is actually something someone else had to point out to me, but I promptly added it to the logo. :)
We have a great plethora of binary serialization libraries now, but I've noticed none of them offer the following:
* Specification of the number of bits I want to cap out a field at during serialization, ie: `int` that only uses 3 bits.
* Delta encoding for serialization and deserialization, this would further decrease the size of each message if there is an older message that I can use as the initial message to delta encode/decode from.
Most formats use varints, so you can't have a 3-bit int but they will store a 64-bit int in one byte if it fits. Going to smaller than a byte isn't worth the extra complexity and slowness. If you're that space sensitive you need to add proper compression.
By delta compression you mean across messages? Yeah I've never seen that but it's hard to imagine a scenario where it would be useful and worth the insane complexity.
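For anyone curious what "they will store a 64-bit int in one byte if it fits" looks like concretely, here is a small sketch of protobuf-style base-128 varint encoding (the helper name is made up; unsigned values only):

    def encode_varint(n: int) -> bytes:
        # Base-128 varint: 7 bits of payload per byte, high bit = "more bytes follow".
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    print(encode_varint(5).hex())     # '05'   -> one byte
    print(encode_varint(300).hex())   # 'ac02' -> two bytes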
CBOR approximates this, since it has several different widths for integers.
> an older message that I can use as the initial message to delta encode/decode from.
General-purpose compression on the encoded stream would do something toward this goal, but some protocol buffers library implementations offer merge functions. The question is what semantics of "merge" you expect. For repeated fields do you want to append or clobber?
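On the append-or-clobber question: in the official Python protobuf library, MergeFrom concatenates repeated fields and overwrites singular ones. A quick sketch using the ListValue well-known type (chosen only because it ships with the library, not because it's special):

    from google.protobuf import struct_pb2

    a = struct_pb2.ListValue()
    a.values.add().number_value = 1.0

    b = struct_pb2.ListValue()
    b.values.add().number_value = 2.0

    a.MergeFrom(b)
    print(len(a.values))  # 2 -- the repeated 'values' field was appended, not clobbered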
Take a look at FAST protocol [1]. It has been around for a while. Was created for market/trading data. There appears to be some open source implementations, but I don't think in general they'd be maintained well since trading is, well, secretive.
One thing I liked about Ada, the small amount I used it, is it has actual subtypes: you could define a variable as an integer within a specific range, and the compiler would (presumably) choose an appropriate underlying storage type for it.
zserio [1] has the former at least. It isn't intended for the same use cases as protobuf/capnproto/flatbuffers though; in particular it has no backward or forward compatibility. But it's great for situations where you know exactly what software is used on both ends and you need small data and fast en-/decoding.
This changed how I think about computing on the web.
If only there were a JS library as good as the Lua one, so you could directly send capnp messages to workerd instead of always going through files. I guess one day I'll have to relearn C++ and understand how the internals actually work.
If you are interested in learning more about binary serialization (for Cap'n Proto and others), I wrote two free-to-read papers extensively looking at the scene from a space-efficiency point of view that people here might be interested in:
Not something that aims to compete with Cap'n Proto (different use cases), but I've been also working on a binary serialization format that is pure JSON-compatible with a focus on space-efficiency: https://www.jsonbinpack.org.
It’s a testament to the subtlety of software engineering that even after four tries (protobuf 1-3, capn proto 1) there are still breaking changes that need to be made to the solution of what on the surface appears to be a relatively constrained problem.
I assume you are talking about the cancellation change. This is interesting, actually. When originally designing Cap'n Proto, I was convinced by a capabilities expert I talked to that cancellation should be considered dangerous, because software that isn't expecting it might be vulnerable to attacks if cancellation occurs at an unexpected place. Especially in a language like C++, which lacks garbage collection or borrow checking, you might expect use-after-free to be a big issue. I found the argument compelling.
In practice, though, I've found the opposite: In a language with explicit lifetimes, and with KJ's particular approach to Promises (used to handle async tasks in Cap'n Proto's C++ implementation), cancellation safety is a natural side-effect of writing code to have correct lifetimes. You have to make cancellation safe because you have to cancel tasks all the time when the objects they depend on are going to be destroyed. Moreover, in a fault-tolerant distributed system, you have to assume any code might not complete, e.g. due to a power outage or maybe just throwing an unexpected exception in the middle, and you have to program defensively for that anyway. This all becomes second-nature pretty quick.
So all our code ends up cancellation-safe by default. We end up with way more problems from cancellation unexpectedly being prevented when we need it, than happening when we didn't expect it.
EDIT: Re-reading, maybe you were referring to the breaking changes slated for 2.0. But those are primarily changes to the KJ toolkit library, not Cap'n Proto, and are all about API design... I'd say API design is not a constrained problem.
Since then, we have extended and improved the support: we added it for export (initially it was import-only) and improved the performance.
As for the strange stuff in the library: it uses a non-obvious approach to exception handling, and their C++ code feels like it focuses too much on unorthodox approaches.
Some of these “high perf” RPC libraries never get one key point: if I really need something to be fast, the most important aspect is that it must be simple.
Congrats on 10 years! Question: can Cap'n Proto be used as an alternative to the Python Pickle library for serializing and de-serializing Python object structures?
If your goal is to serialize an arbitrary Python object, Pickle is the way to go. Cap'n Proto requires you to define a schema, in Cap'n Proto schema language, for whatever you want to serialize. It can't just take an arbitrary Python value.
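A hedged sketch of the difference using pycapnp; the person.capnp file and its Person struct are hypothetical (a real schema file also needs a unique file ID line), while the pickle side needs no schema at all:

    import pickle
    import capnp  # pycapnp

    # Pickle: serializes (nearly) arbitrary Python objects with no schema.
    blob = pickle.dumps({"name": "Alice", "scores": [1, 2, 3]})

    # Cap'n Proto: a schema comes first. Suppose person.capnp declares:
    #   struct Person { name @0 :Text; scores @1 :List(Int32); }
    person_capnp = capnp.load("person.capnp")
    msg = person_capnp.Person.new_message(name="Alice")
    scores = msg.init("scores", 3)
    scores[0], scores[1], scores[2] = 1, 2, 3
    data = msg.to_bytes()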
After exploring a few constant access serialization formats, I had to pass on Capn Proto in favor of Apache Avro. Capn has a great experience for C++ users, but Java codegen ended up being too annoying to get started with. If Capn Proto improved the developer experience for the other languages people write, I think it would really help a lot.
I used capnproto years ago as the network serialization format for a multiplayer RTS game. Although the API can be quite awkward, it was overall a joy to use and I wish I was able to use it in more projects.
I'm using Cap'N Proto for serialization in a message broker application (LcuidMQ) I'm building. It has allowed me to create client applications rather quickly. There are some quirks that can be difficult to wrap your head around, but once you understand them it is really solid.
There are some differences between the language libraries, and documentation can be lacking around those language-specific solutions. I'm hoping to add blog articles and/or contribute back to the examples in these repositories to help future users who want to dabble.
Ian Denhardt (zenhack on HN), a lead contributor to the Go implementation, suddenly and unexpectedly passed away a few weeks ago. Before making a request to the community, I want to express how deeply saddened I am by this loss. Ian and I collaborated extensively over the past three years, and we had become friends.
As the de facto project lead, it now befalls me to fill Ian's very big shoes. Please, if you're able to contribute to the project, I could really use the help. And if you're a contributor or maintainer of some other implementation (C++, Rust, etc.), I would *REALLY* appreciate it if we could connect. I'm going to need to surround myself with very smart people if I am to continue Ian's work.
RIP Ian, and thank you. I learned so much working with you.
------
P.S: I can be reached in the following places
- https://github.com/lthibault
- https://matrix.to/#/#go-capnp:matrix.org
- Telegram: @lthibault
- gmail: louist87