Why does gRPC insist on trailers? (carlmastrangelo.com)
324 points by strzalek on Aug 7, 2022 | 125 comments



> Whether it’s because I was wrong, or failed to make the argument [for HTTP trailers support], I strongly suspect organizational boundaries had a substantial effect. The Area Tech Leads of Cloud also failed to convince their peers in Chrome, and as a result, trailers were ripped out [from the WHATWG fetch specification].

FWIW, I personally think it's a good thing that other teams within Google don't have too much of an "advantage" for getting features into Chrome compared to other web developers. However, I also think it's very unfortunate that a single Chrome engineer gets to decide not only that it shouldn't be implemented in Chromium, but also, in effect, that it should be removed from the specification. (The linked issue [1] was also opened by a Google employee.)

Of course, you might reasonably argue that, without consensus among the browsers to implement a feature, having it in the spec is useless. But nevertheless, with Chromium being an open source project, I think it would be better if it had a more democratic process of deciding which features should be supported (without, of course, requiring Google specifically to implement them, but also without, ideally, giving Google the power to veto them).

[1]: https://github.com/whatwg/fetch/issues/772


It's clear that the "single engineer" thing is a lie. Many engineers commented on the Chrome issue with opposing viewpoints, and even the original post describes it being escalated to tech leads on both sides, getting more people involved. I guarantee if it was only one person standing alone opposed to trailers then they would have been overruled. As you say, it's a good thing that Chrome resists adding the pet features of every other Google team to the web.


Sure, that's fair enough. But I'm not sure if characterizing this feature as a pet feature of the gRPC team is accurate either - after all, it's simply exposing an HTTP 1.1 & H2 feature, and it was already in the WHATWG fetch spec. There, the security concerns were apparently discussed as well [1], and adding the trailer headers as a separate object was deemed safe. I haven't read the entire discussion and don't have a vested interest in it, but the WHATWG spec seems like a better place to have this discussion, and come to a conclusion, than the Chromium issue tracker.

Apparently, there is a new issue for it, so that might yet happen: [2].

[1]: https://github.com/whatwg/fetch/issues/34

[2]: https://github.com/whatwg/fetch/issues/981


No other browser has implemented it either. So it's not as though Chrome alone is preventing this feature from existing. It seems likely that other browsers might also take issue with it. It probably shouldn't have been finalized in the spec before being implemented at all anywhere. It may not have been well thought out.


Well, anything is possible, but the original WHATWG fetch issue mentions:

> We discussed this at the HTTP workshop:

> 1. It's okay to expose trailer headers sans semantics in the API and browsers are okay with that.

So it was discussed, and presumably at least well thought out enough that browsers were on board with it. Then Chrome changed their mind, it was removed from the spec, and yeah, then other browsers didn't implement it either. But I wouldn't call it "likely" that they would have taken issue with it, as there was evidence to the contrary.


This could be solved without trailers by piggybacking on server-sent events rather than requests with trailers, and terminating the stream with status metadata as a final event.


The underlying data is binary (protobufs), though. It would defeat the purpose to re-encode it as text to send over SSE.

Also, IIRC gRPC supports full-duplex, which SSE does not.

WebSockets may have been a better fit, as the Chromium developers hinted on the WontFix issue linked in the post.


In the next web, https://web.dev/webtransport/ would be the thing to move to. It's a bit nicer under the hood than websockets, though it's also brand new.

My comment was less about a full duplex replacement though, and more "how to get something equivalent to trailers, that works in browsers today", and this would enable that.


Server-sent events are just a potential payload format for response bodies, one which allows encoding a potentially unbounded number of distinct messages. Trailers are a different concept, since they exist outside of bodies.

If the team had decided to use only HTTP body streams for all payloads, they could totally have done that without directly using the SSE encoding.


SSE adds another thing besides the body encoding, which is a standard content type that implies unbuffered streaming.


Jeebus. Just because it's not true doesn't mean it's a lie.


It's a lie if OP knows it isn't true. And OP knows it isn't true because they explicitly describe a lot more than one person being involved on both sides.


Why should the decisionmaking process for Chromium be “democratic” simply because it is open source?

Anyone who wants to pay can implement whatever they want in the codebase. That’s in a way as democratic as it gets: equality of opportunity [to invest money and time].

If Google is paying for the implementors’ time, Google should have 100% say in what code they write. You and everyone else are free (thanks to Google’s generosity) to fork it at any point in the commit history and individually veto any specific change.


> Anyone who wants to pay can implement whatever they want in the codebase. That’s in a way as democratic as it gets: equality of opportunity [to invest money and time].

Leaving aside whether that's how it should work, I'm not sure if that's in fact how it works for Chromium today. If I write a high-quality patch adding support for trailers, will it get accepted? As I understand it, the answer is no. (But I would be happy to be wrong.)

So that's my main point: it would be good to have a democratic decision-making process, not for what code Googlers should write, but for what patches would get accepted into Chromium. Not just because it's open source, but also because it's the basis not just of Google's browser, but of a bunch of other browsers as well.

(And note that https://www.chromium.org/ seemingly aims to give the project an air of independence from Google. Thus, I'm merely questioning whether it is, in fact, independent, and arguing that it should be, if it isn't.)


The democratic process is that anyone who wants to pay for the ad campaign can try to convince everyone else that it's a good spec everyone should adopt, not merely pay a developer to code it.

If everyone else is not convinced then it should not become a thing no matter how much one party with money wants it.


I wanted my road to swoop up and down like a roller coaster across the gorge but a single structural engineer on the bridge team overruled my obvious benefit.


I had never heard of HTTP trailers. So FYI

> The Trailer response header allows the sender to include additional fields at the end of chunked messages in order to supply metadata that might be dynamically generated while the message body is sent, such as a message integrity check, digital signature, or post-processing status.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Tr...


I suppose even fewer people have heard that Transfer-Encoding: chunked supports chunk extensions, which allows one to supply arbitrary metadata without trailers.

https://datatracker.ietf.org/doc/html/rfc2616#section-3.6.1

Ever gone to some site that generates compressed downloads or database exports on the fly, gotten no progress bar as a result, and been severely annoyed by that lack of feedback? I was, so I used chunk extensions in a draft I submitted for emitting progress information dynamically:

https://datatracker.ietf.org/doc/html/draft-lnageleisen-http...

As noted at the end of the draft, this could be generalized and extended to have additional capabilities such as in flight integrity checks, or whatever you can think of.
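To make that concrete, here's roughly what a chunked response carrying progress metadata could look like on the wire; the "progress" extension name and its values are purely illustrative, not necessarily what the draft specifies (chunk sizes are hex, so 400 is 1024 bytes):

  HTTP/1.1 200 OK
  Transfer-Encoding: chunked
  Content-Type: application/octet-stream

  400;progress=25
  <1024 bytes of data>
  400;progress=50
  <1024 bytes of data>
  ...
  0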


Haven't headers and trailers been renamed in the recent past?


Yes, slightly - RFC 9110 ("HTTP Semantics") calls them "header fields" and "trailer fields," and it calls "headers" and "trailers" colloquialisms. In a nod to gRPC-style usage, the section on trailer fields even says, "Trailer fields can be useful for supplying...post-processing status information."

https://www.rfc-editor.org/rfc/rfc9110.html#header.fields

https://www.rfc-editor.org/rfc/rfc9110.html#trailer.fields


A few years ago I worked on a service that had to stream data out using protobuf messages, in a single request that could potentially transfer several gigabytes of data. At the HTTP level it was chunked, but above that I used a protobuf message that contained data plus a checksum of that data, with the last message of the stream containing no data but a checksum of the entire dataset (a flag was included to differentiate between the message types).

This simple design led us to find several bugs in clients of this API (e.g. messages dropped or processed twice), and gave us a way to avoid some of the issues mentioned in this article. Even if you don't use HTTP trailers, you can still use them one layer above and benefit from similar guarantees.
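A rough sketch of that message shape, with hypothetical field names rather than the actual schema we used:

  syntax = "proto3";

  // One frame of the export stream. Data frames carry a chunk plus its own
  // checksum; the final frame carries no data, only a checksum of the whole
  // dataset.
  message ExportChunk {
    bytes data = 1;      // empty in the final frame
    bytes checksum = 2;  // checksum of `data`, or of the entire dataset
    bool is_final = 3;   // flag differentiating the two message types
  }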


Inserting metadata in the protobuf itself seems like the obvious, simple solution to avoid having to depend on what the transport layer supports. Just defining a message to provide the metadata they wanted to insert in trailers would have avoided a whole lot of pain.


> As an aside, HTTP/2 is technically superior to WebSockets. HTTP/2 keeps the semantics of the web, while WS does not.

WTF is this? Those are different layer protocols. WebSocket can run on top of HTTP/2.

It's like saying TLS is technically superior to TCP, or IP is superior to copper cables.

Reference: https://www.rfc-editor.org/rfc/rfc8441.html


The idea here is that http provides something like request/response semantics, methods, path, status code, etc - which are all also useful for gRPC. Websockets provide none of that - they are just message streams.

Websockets over HTTP/2 are a new thing, and weren't even available at the time gRPC was conceived.


I'm not sure if websockets over HTTP/2 are actually a new thing. Firefox implemented support years ago, but it was disabled almost immediately afterwards because it doesn't work with proxies, and it has remained disabled since. I think the only engine implementing it is Chrome.

As far as I know for HTTP/3 there is no way to use websockets yet.


That is their whole point. That is why they exist. Dumb reliable pipe, please keep your silly semantics away.


Hardly dumb. They include variable-length framing, payload masking, their own layer of fragmentation, different message types, validation (text vs. binary), and an extension framework.


For sure there are good reasons to abandon or find alternative resourceful protocols!

But in general, http is & could be the de-facto really good resourceful protocol. It's already 90% there.

Alas, the browser has been a major point of obstruction & difficulty & tension in making the most obvious, most successful, most clearly winning resourceful protocol at all better. The browser has sat on its haunches & pissed around & prevented obvious & straightforward incremental growth that has happened everywhere else except the browser, such as with HTTP trailers, such as with HTTP/2+ push. The browser has kept the de-facto resourceful protocol from developing.

You don't have to believe in HTTP as the way to see what an oppressive & stupid shitshow this is. Being able to make the de-facto protocol of the web better should be in everyone's interest, in a way that doesn't prevent alternatives/offshoots. But right now only alternatives & offshoots have any traction, because the browsers have all shot down & rejected doing anything to support modern HTTP/2+ in any real capacity. Their HTTP implementations are all frozen in time.


HTTP/2 provides features that websockets don't. Even if you were to use websockets over HTTP/2, you'd lose features like being able to multiplex requests _because_ it's a higher-level protocol. Why is it wrong to say it's better to use a more featureful, lower-level protocol?


They provide different APIs. Websockets provide a bidirectional stream of messages. The messages in a given direction are always delivered in order. If they were to be suddenly reordered that would cause a lot of headaches.

While the individual messages can't be multiplexed, different websocket streams over a single HTTP/2 connection can be multiplexed.

I think websockets also provides a feature that HTTP/2 doesn't: the ability to easily push data from the server to browser javascript.


> I think websockets also provides a feature that HTTP/2 doesn't: the ability to easily push data from the server to browser javascript.

I will never stop beating this drum: Server-Sent Events and EventSource! Simple to implement, scale, support.

https://developer.mozilla.org/en-US/docs/Web/API/Server-sent... https://developer.mozilla.org/en-US/docs/Web/API/EventSource
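For anyone who hasn't tried it, an SSE endpoint is just a long-lived response that writes "data:" lines and flushes. A minimal Go sketch (the path and payload here are made up); on the browser side it's consumed with new EventSource("/events"):

  package main

  import (
      "fmt"
      "net/http"
      "time"
  )

  func events(w http.ResponseWriter, r *http.Request) {
      w.Header().Set("Content-Type", "text/event-stream")
      w.Header().Set("Cache-Control", "no-cache")
      flusher, ok := w.(http.Flusher)
      if !ok {
          http.Error(w, "streaming unsupported", http.StatusInternalServerError)
          return
      }
      for i := 0; ; i++ {
          // Each event is one or more "data:" lines followed by a blank line.
          fmt.Fprintf(w, "data: tick %d\n\n", i)
          flusher.Flush()
          select {
          case <-r.Context().Done(): // client disconnected
              return
          case <-time.After(time.Second):
          }
      }
  }

  func main() {
      http.HandleFunc("/events", events)
      http.ListenAndServe(":8080", nil)
  }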


I've used SSE and websockets a lot. I like SSE. But it has two key weaknesses compared to websockets.

Firstly, it's unidirectional. The server sends a stream of events to the client, but the client can't send anything back. That might not be a problem at a purely semantic level (maybe the client doesn't have anything to send back), but it means you can't implement application-level heartbeating or flow control in-band; there is no way for a server to know if a client is reading messages, and if it is, whether it is keeping up with the stream. You can use TCP-level state management and flow control for this, which is inadequate, because there are all sorts of failure modes it doesn't catch, or doesn't catch quickly. Or you can implement the back-channel via separate HTTP requests, which is pretty ugly.

Secondly, SSE connections are subject to the browser's HTTP connection limit - I think most browsers these days have a limit of six connections to any given origin. If you have one connection per page, the user can open six tabs, and then any further attempts to access your site, using SSE or not, will mysteriously fail, because they can't open a connection. Websockets are not subject to this limit; they have a separate per-origin limit, which is usually much larger - 200 for Firefox, 255 for Chrome (although Chrome also has a global limit of 255 websockets?). For SSE, you may be able to work around this by port or hostname sharding, or by sharing a single connection using a web worker, but all of these are complicated and intrusive. Now, this conversation started with talking about HTTP/2, and HTTP/2 does solve this, because it can multiplex many SSE connections over a single TCP connection (at the cost of head-of-line blocking on packet drops). But now you're making HTTP/2 a hard requirement for deployment - fallback to HTTP/1.1 just won't work reliably.


Valid points, the request limit in particular. On the other hand, there is some smart resource management in the user agent for EventSource, IIRC, which I'm not sure exists for websockets. I feel HTTP/2 as a requirement is acceptable but that obviously depends on the use case.

That said, I still feel SSE is both underutilized and sadly unknown.


Both downsides are fixed by using http/2, right? Heartbeat seems trivial to implement using HTTP/2 PING frame. And you already mentioned the stream sharing, which IMO is much better than the state of each tab on the same website opening its own TCP socket, pretty much a resource exhaustion hell (browsers didn't implement websockets on top of h2 last time I checked).

Websockets do have an advantage over SSE: they get binary streams instead of text.


HTTP/2 doesn't fix the former downside. It doesn't give you a general bidirectional stream, so you can't implement flow control. I'm not aware of a browser API for sending (or inhibiting) HTTP/2 pings - is there one? In the absence of one, you also can't use HTTP/2 to implement an application-level heartbeat.

To expand on this a bit, a problem I have run into with SSE is that the socket is open, some part of the client (the browser or HTTP client) is reading data, but it is not being processed, because the actual application code on the client is stuck or broken or something. To deal with that, you have to be able to drive feedback from the application code on the client to the server.


Pardon my ignorance: Are messages sent through this guaranteed to arrive in order?


Yes. SSEs are just an infinite-length HTTP body formatted in a particular way. And like all HTTP bodies, they are guaranteed to arrive in the same order the bytes were sent.


Server-to-client, yes. The moment you write a little chat app that sends every line you type as separate POSTs -- which can be reordered -- you get to keep both pieces.

I wish browsers would support streaming request bodies in the fetch API. That would obsolete WebSockets practically immediately.


TCP can't multiplex. HTTP/2 runs over TCP and does multiplexing.

WebSocket can't multiplex. Nothing prevents gRPC over WebSocket from implementing multiplexing itself.


Except it wouldn't work with a gRPC agnostic proxy. HTTP/2 can be split up by, say, nginx and requests can be handled separately. Any custom protocol probably won't have that level of support.


Again, you're saying fiber optics provides features that telephony doesn't. The fact that telephony usually runs over copper cables, and telephony over fiber optics is a recent thing, doesn't change the fact that the comparison makes no sense.

Telephony (websockets) runs over the copper cables or fiber optics (HTTP/1 or HTTP/2).


I’m not a web developer, but that RFC, which talks about bootstrapping, talks about using the CONNECT method to “transition” to the WebSockets protocol. Which matches what I thought the CONNECT method does: Switch to a protocol that is not HTTP?

But I only skimmed the introduction, did I miss something?


It normally uses GET and the Upgrade header, not CONNECT.


That’s for HTTP/1.1, where WebSockets are really a completely different protocol which “hijacks” the underlying TCP or TCP+TLS stream from HTTP via the Upgrade request.

HTTP/2 has its own concept of streams, so WebSockets can run over a single HTTP/2 stream, and the linked RFC describes extending the CONNECT method to take over a single stream.
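Concretely, the RFC 8441 handshake is just a HEADERS frame on one HTTP/2 stream, roughly like this (host, path, and subprotocol are placeholders modeled on the RFC's own example):

  :method = CONNECT
  :protocol = websocket
  :scheme = https
  :path = /chat
  :authority = server.example.com
  sec-websocket-protocol = chat
  sec-websocket-version = 13
  origin = https://www.example.com

The server answers with :status = 200, and from then on WebSocket frames flow as DATA frames on that single stream, multiplexed alongside ordinary requests on the same connection.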


That's in a section specifically about picking the right transport. Per your example, it's like saying "TLS is technically superior to TCP, because it means our protocol can offload encryption and authentication to it".


> WebSocket can run on top of HTTP/2

Isn't "websocket" just a standard tcp socket, whose specification to instantiate it was born in a comparatively ephemeral HTTP (of whatever version) request, and which outlives the request, so isn't on top of anything other than tcp?


No, despite the name WebSockets are not plain TCP.


They're an http protocol for setting one up, you mean?

First sentence at this article:

https://en.wikipedia.org/wiki/WebSocket


No, they're more than that. After you upgrade to WebSocket you still have to speak the WebSocket protocol over TCP. It includes message framing — with different defined message types, masking, ping/pong, an extension system (including for example optional per-message compression), ...


TCP is a stream of bytes. WebSockets are message-based.


From my perspective, I think the biggest issue with gRPC is its use of HTTP/2. I understand that there's a lot of reasons to say "No, HTTP/2 is far superior to HTTP/1.1." However, in terms of proxying _outside Google_, HTTP/2 has lagged, and continues to lag, at the L7 proxy layer. I recently performed a lot of high-throughput proxying comparing HAProxy, Traefik, and Envoy. HTTP/1.1 outperformed HTTP/2 (even H2C) by a pretty fair margin. Enough that if gRPC used HTTP/1.1 we could use noticeably less hardware. I could see this holding true even with a service mesh.


Also, http/2 over cleartext is not very well supported by a lot of things. Which is probably a good thing when going over the open internet. But it means you have to deal with setting up certificates even if just developing locally, and makes it more difficult to use for IPC on a single host.


My preferred setup is to have an unencrypted service running on 127.0.0.1 (so not publicly available), and then have nginx in front to handle certificates. Lets me do all certificate stuff across all virtual hosts in one place. HTTP/2 makes this impossible due to its ridiculous TLS requirement, so I, and everyone who does it the way I do, must keep using HTTP/1.1 forever.
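That setup looks roughly like the following (ports and certificate paths are placeholders, not my actual config): the backend stays plain HTTP/1.1 on loopback while nginx terminates TLS for every virtual host in one place.

  server {
      listen 443 ssl;
      server_name example.com;

      ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
      ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

      location / {
          # Backend listens only on loopback, unencrypted.
          proxy_pass http://127.0.0.1:8080;
          proxy_set_header Host $host;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      }
  }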

It's my belief that requiring TLS for HTTP/2 is what killed the protocol. It just causes too much friction during both development and deployment, for little to no (or negative) performance gain.


> My preferred setup is to have an unencrypted service running on 127.0.0.1 (so not publicly available),

Don't forget that JS from any webpage can access your 127.0.0.1 to various degrees. Depending on what types of requests exactly the server accepts, it may be somewhat unsafe for a machine with a browser.


Oh, that was for a server. So the process which serves e.g. my pastebin (https://sr.ht/~mort/coffeepaste/) runs an unencrypted HTTP server on 127.0.0.1 on some high port, then an nginx reverse proxy handles HTTPS on port 443.

On a machine with a browser, local servers are dangerous, HTTPS or not.


A few years back it seemed the ecosystem had the tools needed for h2c (HTTP/2 minus TLS) to work out. I was able to get the proto service set up in Golang and working with a couple of different proxy options.


I agree. HTTP/2 is a huge requirement to push everywhere you want to use RPC. Want to do RPC to a microcontroller? Tough luck. Want to make RPC calls from a web page? Yeah have fun figuring out the two incompatible gRPC-web systems, setting up complicated proxies and actually finding a gRPC library that actually supports them fully.

Thrift has a much more sane design where everything is pluggable including the transport layer.

Bit of a shame that Thrift never became more popular.


Yup, and this is also why many end up proxying gRPC over HTTP/1.1 after giving up on making HTTP/2 work with systems that don’t support it…


> In this flow, what was the length of the /data resource? Since we don’t have a Content-Length, we are not sure the entire response came back. If the connection was closed, does it mean it succeeded or failed? We aren’t sure.

I don’t get that argument. GRPC uses length-prefixed protobuf messages. It is obvious to the peer whether a complete message (inside a stream or single response) has been received - with or without trailers.

The only thing that trailer support adds is the ability to send an additional late response code. That could have been added also without trailers. Just put another length prefixed block inside the body stream, and add a flag before that differentiates trailers from a message. Essentially protobuf (application messages) in protobuf (definition of the response body stream).
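A minimal Go sketch of that idea (the flag values and function name are made up, though gRPC-Web's actual framing is similar in spirit):

  package framing

  import (
      "encoding/binary"
      "io"
  )

  // writeFrame emits one flag byte, a 4-byte big-endian length, then the
  // payload. Flag 0x00 marks a regular message, 0x80 a trailers block
  // (hypothetical values for illustration).
  func writeFrame(w io.Writer, flag byte, payload []byte) error {
      var hdr [5]byte
      hdr[0] = flag
      binary.BigEndian.PutUint32(hdr[1:], uint32(len(payload)))
      if _, err := w.Write(hdr[:]); err != nil {
          return err
      }
      _, err := w.Write(payload)
      return err
  }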

I assume someone thought trailers would be a neat thing that is already part of the spec and can do the job. But the bet didn’t work out, since browsers and most other HTTP libraries didn’t find them worthwhile enough to fully support.


He offers two facts that I think explain this well enough:

> Additionally, they chose to keep Protobuf as the default wire format, but allow other encodings too.

And:

> Since streaming is a primary feature of gRPC, we often will not know the length of the response ahead of time.

These make sense; you'd enable servers to start streaming back the responses directly as they were generating them, before the length of the response could be known. Not requiring servers to hold the entire response can have drastic latency and memory/performance impact for large responses.


This doesn't match what I see in the gRPC spec. It says every message must be length-prefixed.

https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2....

Disclaimer: I don't know much about gRPC.


I've spent a fair bit of time working with gRPC, and you're correct - gRPC's length-prefixing makes it easy to detect when individual messages are terminated early. You do still need some way to detect streams that terminate unexpectedly on message boundaries - perhaps you could rely on HTTP/2 EOS bits as evidence of an application-level success, but you need some equivalent of trailers to communicate the details of any errors that occur midway through the response stream anyways.


First, we need to clarify. The problem is that you cannot use GRPC from javascript (yes, there are unofficial and sketchily supported hacks, but read on for why they're required).

He explains the problem that caused this, in his opinion, in the article, but not very obviously. The problem is protobuf encoding. It's key-length-value (key identifies the field that follows, length is the length of the value, value is the value). A message is thus "KLVKLVKLVKLVKLV".

The second thing is that repeating a key is always valid (not just when it's declared as an array of values). If a field is repeated but it's not an array, then the newer value overrides the previous one.

(this was done so that if you have 20, or 2000000, massive protobuf files listing, say, urls visitors used, and you need to combine them, you can just concatenate the bytes together and read in the result as a valid protobuf. Also it means you do streaming reads of any protobufs coming in)

So:

1) you don't know when the message ends because you don't have a length field for the entire message (only for individual fields)

2) you don't know the message ends after a given number of fields, for 2 reasons. A, some fields are optional, and may or may not be present. B, even if all fields were sent, a newer version of an already-sent field might be sent to override the previously sent value.

You're right, of course, that the problem can be fixed by only allowing a single top-level message that is decoded in a nonstandard way (and frankly support for this could be added to the official protobuf libraries and it can be made to work by making this requirement optional ... Even concatenating can still be possible that way)

The problem here is rigid, immovable adherence to their own standard (ie. not incorporating a change like demanding KLV on the top level message in an RPC connection, or having a special concat-compatible, one-field-at-a-time decoding, because an architectural decision made 15 years ago said not to do this. Then make this mode required for the HTTP case)

This was an organisational problem. Not that they don't trust each other (obviously a browser developer doesn't trust many people, security really demands they don't. This is not where the problem is). GRPC failed to consider the fallout of them choosing the nuclear option of just dropping out of browsers in order to satisfy an old internal requirement without modifying their design. They chose this outcome, because they couldn't deal with achieving 99.9% of their aims as opposed to 100% ... and missed one of their main aims, let's say they achieved 50% instead.


I understand the points you're raising; I'm saying that gRPC enveloping solves them. Once messages are enveloped, you _do_ have a length field for the entire message.

The article is, IMO, somewhat misleading - it discusses the issue without mentioning gRPC envelopes at all, and it seems pretty clear that envelopes were designed (in part) to address this exact issue.


Another option is simply to reserve a key (or multiple) for providing stream/transport specific metadata which should be stripped out before handoff to the client, such as allowing you to send an "end" marker. Now you're not depending on the transport layer cooperating. It's not a particularly hard problem.


That has the downside that you're now limiting what protobuf payloads you can send. You need to have the inner protocol (protobuf) cooperate with the outer protocol. That also makes it difficult to switch to a different type of payload besides protobuf, since even if you could convince the protobuf standard to reserve a specific key for the outer protocol, you might not be able to convince the standard for a different protocol to reserve that for you. The article says "gRPC is definitely not Protobuf specific".

It's like an intrusive linked list vs std::list. Sure you can do intrusive linked lists, but it means you have to clutter up your object with info about how it's stored. It mixes the layers.


I think most people would say that not being able to use GRPC at all on the web is not exactly the superior option/outcome ... I'm not sure why a "pure design" (non-mixed layers) matters when the result is non-functional.

Just open a second websocket for the second protocol. Or use WebRTC. Or ... most protocols have these channels, which means mixing payloads is not really that useful. It doesn't buy you anything over the situation where you're not mixing protocols.


I'm not saying the current situation is better than your suggestion. I'm saying there are better ways to fix the current situation than your suggestion.

My idea of a better way would be:

* If you just need a boolean of success vs failure: use END_STREAM vs RST_STREAM.

* If you additionally need metadata of failure reason: use the existing length prefixes that gRPC has, and additionally add a bit to indicate data vs trailer. Then implement your own trailers inside the HTTP2 stream to indicate success vs failure and failure reason. Sure these trailers won't get HTTP2 compression like native HTTP2 trailers, but that shouldn't be a big problem.

Using 2 websockets would be confusing because things could arrive out of order from how they were sent. And one websocket could fail but not the other, leading to a confusing mixed state. The whole reason for trailers was to make failures less confusing by having error messages in them.

Also, using websockets goes against the whole gRPC design idea. They wanted native HTTP2. We don't need websockets to fix the problem, we just need to implement trailers inside the stream instead of using native HTTP2 trailers. Implementing trailers inside the stream can be done with native HTTP2 streams or with websockets inside HTTP2 streams. It's a smaller change from the current protocol to put the trailers inside the native HTTP2 stream than to add websockets to the mix then implement trailers inside that.


> It is obvious for the peer if a complete message (inside a stream or single response) is received

If I'm reading [1] correctly, you can't distinguish between [repeated element X is empty] and [message truncated before repeated element X was received] because "A packed repeated field containing zero elements does not appear in the encoded message." You'd need X to be the last part of the message but that's not a problem because "When a message is serialized, there is no guaranteed order [...] parsers must be able to parse fields in any order".

[1] https://developers.google.com/protocol-buffers/docs/encoding...


Yes, the protobuf format makes the end ambiguous, meaning the end needs to be indicated by the protocol containing the protobuf.

But it looks to me like the gRPC spec says that everything must be prefixed by a length at the gRPC layer. So then it doesn't matter that protobuf doesn't internally indicate the end, since the gRPC transport will indicate the end.

https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2....

Disclaimer: I don't know much about gRPC.


Yes - that’s the length prefixing I was talking about. It should prevent any ambiguities.

But even without that, peers should be able to distinguish between truncated and full messages: HTTP/2 allows streams to be either finished cleanly or reset. In case of a clean close (FIN), no truncation should be expected. In case of an HTTP/2 reset or a TCP connection breakdown, the stream must be treated as potentially truncated.


I use GRPC between micro-services instead of REST and for that it is really great; all the deficiencies of REST - no versioning, no types - go away with GRPC, and the protobuf is the official interface for all micro-services. No problems with this approach for over two years now, and also multi-language support - we have Go and Java and Python and TypeScript micro-services now happily talking and getting new features and new interface methods updated. Maybe it met its demise in the web space, but it's a jewel in the micro-service space.


This is more or less what stubby is/was for Google and so the original driving force in implementing it. Now, if you add a catch all service that translates the requests from the outside to Protobuffers and then forwards the translated requests to the correct service you have a GFE (Google Front-End) equivalent.

Should you do it? Probably not, as it's not just a dumb translation layer and it is extremely complex (e.g. it needs to support streams, which is non-trivial in this situation). For Google it's worth it, because this way you only have to handle protobuffers beyond the GFE layer.


With GRPC, you lose the ability to introspect the data on-the-wire. You lose the ability to create optimized data formats for YOUR application (who said you have to use JSON?) Most people can’t implement REST correctly, so it has been a shitshow for the last 20 or so years, GRPC isn’t a magic bullet, it just forces you to solve problems (or helps you to solve them) that you should have been doing in the first place. You can do all of these things without GRPC, there is no power it grants you that can’t be done better or at all in your own libraries and specs.


I suppose you mean "inspect" the data on-the-wire

https://grpc.io/blog/wireshark/

Wireshark can load proto files and decode the data for you.

BTW, "The Internet is running in debug mode".


Personal opinion: RPC is a failed architectural style, independent of what serialization/marshalling of arguments is used. it failed with CORBA, it failed with ONC-RPC, it failed with Java RMI.

Remote Procedure Calls attempt to abstract away the networked nature of the function and make it "look like" a local function call. That's Just Wrong. When two networked services are communicating, the network must be considered.

REST relies on the media type, links and the limited verb set to define the resource and the state transfer operations to change the state of the resource.

HTTP explicitly incorporates the networked nature of the server/client relationship, independent of, and irrespective of, the underlying server or client implementation.

Media types, separated from the HTTP networking, define the format and serialization of the resource representation independent of the network.

HTTP/REST doesn't really support streaming.


That's true of CORBA, for sure. I'm not familiar with ONC-RPC or Java RMI.

It's not true of gRPC. It's not "RPC" in any traditional sense - it's just a particular HTTP convention, and the clients reflect that. They're asynchronous, make deadlines and transport errors first-class concepts, and make it easy to work with HTTP headers (and trailers, as the article explains). Calling a gRPC API with a generated client often doesn't feel too different from using a client library for a REST API.

It's definitely a verb-oriented style, as opposed to REST's noun orientation. That's sometimes a plus, and sometimes a minus; it's the same "Kingdom of Nouns" debate [0] that's been going on about Java-style OOP for years.

0: http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom...


The verb-oriented style is part of the problem. Too many verbs is the problem. Java/OOP problems are not the same as the REST style, which is entirely about Nouns. There's none of the Java ManagerFactoryManager problems.

The generated client from an IDL that wraps the network protocol with a function call is also part of the problem.

REST APIs that have function calls that aren't "Send Request and wait for Response" aren't REST. ("wait for response" doesn't imply synchronous implementation, but HTTP is a request/response oriented protocol).


gRPC has no opinion on what you name things.


CORBA and RMI are quite different from gRPC. (I have not used ONC-RPC.)

Both of those are explicitly centered on objects and locality transparency. The idea is that you get back references to objects, not mere copies of data. Those references act as if they're local, but the method calls actually translate into RPC calls, as the local reference is just a "stub". Objects can also contain references, meaning that you are working on entire object graphs, which can of course be cyclical, too.

These technologies (as well as Microsoft's DCOM) failed for many reasons, but it was in part because pretending remote objects are local leads to awful performance. Pretty magical and neat, but not fast. I built a whole clustered system on DCOM back in the late 1990s, and it was rather amazing, but we were also careful to not fall into the traps. One of the annoying bugs you can create is to accidentally hold on to a reference for too long in the client (by mis-implementing ref counting, for example); as long as you have a connection open to the server, this creates a server memory leak, because the server has to keep the object alive for as long as clients have references to them.

Ultimately, gRPC and other "simple" RPC technologies like Thrift are much easier to reason about precisely because they don't do this. An RPC call is just passing data as input and getting data back as output. It maps exactly to an HTTP request and response.

As for REST, almost nobody actually implements REST as originally envisioned, and APIs today are merely "RESTful", which just means they try to use HTTP verbs as intended and use URL paths that map as cleanly to the nouns as possible. But I would argue that this is just RPC. Without the resource-orientation and self-describability that comes with REST, you're just doing RPC without calling it RPC.

I don't believe in REST myself (and very few people appear to, otherwise we'd have actual APIs), so I lament the fact that we haven't been able to figure out a standard RPC mechanism for the web yet. gRPC is great between non-browser programs, mind you.


I don't "believe" in REST, except that when you do it "properly" with media types and links, and proper thought about the resources you identify and what the state transitions are, it all works very nicely as a request/response API style.

The difference is during the design phase, where you focus on those resources and their state, instead of the process for changing that state.


Something that’s always bugged me about streaming protocols of this type is that they prevent processing pipelining.

If trailers are used for things such as checksums, then the client must wait patiently for potentially gigabytes of data to stream to it before it can verify the data integrity and start processing it safely.

If the data is sent chunked, then this is not an issue. The client can start decoding chunks as they arrive, each one with a separate checksum.


This comment is mixing up a few concerns.

1. When transferring large amounts of data, the checksum for the full transfer can't be verified until all the data is received. If you want to (for example) download an Ubuntu ISO and verify its checksum before installing it, you'll have to buffer the data somewhere until the download finishes.

2. When transferring small amounts of data, such as individual chunks, the data integrity is (/ should be) automatically verified by the encryption layer[0] of the underlying transport. There's no point in putting a shasum into each chunk because if a bit gets flipped in transit then that chunk will never even arrive in your message handler.

3. In gRPC, chunking large data transfers is mandatory because the library will reject Protobuf messages larger than a few megabytes[1]. As the chunks of data arrive, they can be passed into a hash function at the same time as you're buffering them to disk (a sketch follows after the footnotes).

[0] gRPC supports running without encryption for local development, but obviously for real workloads you'd do end-to-end TLS.

[1] IIRC the default for C++ and Go implementations of gRPC is 4 MiB, which can be overridden when the client/server is being initialized. For bulk data transfer there's also the Protobuf hard limit of 2GB[2] for variable-length data.

[2] https://developers.google.com/protocol-buffers/docs/encoding
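As a sketch of point 3 in Go (the file path and hash choice are arbitrary here): feed the arriving bytes through the hash while buffering them to disk, so the checksum is ready the moment the last chunk lands.

  package download

  import (
      "crypto/sha256"
      "encoding/hex"
      "io"
      "os"
  )

  // saveAndHash writes the incoming stream to disk while hashing the same
  // bytes, returning the hex digest once the stream ends.
  func saveAndHash(r io.Reader, path string) (string, error) {
      f, err := os.Create(path)
      if err != nil {
          return "", err
      }
      defer f.Close()

      h := sha256.New()
      if _, err := io.Copy(io.MultiWriter(f, h), r); err != nil {
          return "", err
      }
      return hex.EncodeToString(h.Sum(nil)), nil
  }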


I would argue that RPC systems shouldn't be burdened down with features like this. If you want to exchange more data than either endpoint can comfortably buffer in memory, split it up into several RPC messages, e.g. create large object, define byte range, seal object. If the producer can precompute the length and hash, you can simplify things by using content addressing to reference and replicate such objects. If you can afford the overhead, breaking them up into Merkle DAGs with reasonably sized leaves is a good idea, to allow validating and resuming partial transfers. It matters whether your devices are connected via PCIe in a single chassis or via mobile networks spread over half the world; a datacenter-optimised protocol won't be optimal for expensive, slow, unreliable links.


Interesting read.

  As an aside, HTTP/2 is technically superior to WebSockets. HTTP/2 keeps the semantics of the web, while WS does not. Additionally, WebSockets suffers from the same head-of-line blocking problem HTTP/1.1 does.
Not really a fair comparison. WebSockets is essentially a bidirectional stream of bytes without any verbs[0] or anything fancy. WebSockets is more like a fancy CONNECT.

And speaking of bidirectional streams of bytes... HTTP/2 suffers from head-of-line blocking as well, since it uses TCP as its substrate, after all. QUIC, however, despite sharing some ideas with TCP, seems to ameliorate this by resorting to multipath[1]. It remains to be seen whether this is indeed going to be beneficial, however.

[0] - unless you count its opcode field as something similar to HTTP verbs but if so it'd resemble more TCP than HTTP, I think

[1] - https://datatracker.ietf.org/doc/html/draft-ietf-quic-multip...


It seems like a lot of other technologies in this space have solved the listed problems while remaining compatible with browsers, load balancers, reverse proxies, etc.

It was a product choice not to offer a fallback path when HTTP/2 was unavailable. That choice made gRPC impossible to deploy in a lot of real-world environments.

What motivated that choice?


Can you name one or more?



> Why Do We Need Trailers At All?

The author is convinced they're needed. But I wonder if some sort of error signaling should have been baked into `Transfer-Encoding: chunked` instead. It wouldn't have made sense in HTTP/1.1 since you can just close the connection. But in later HTTP versions with pipelined requests, I can see the use for bailing on one request while keeping the rest alive.


> The author is convinced they're needed. But I wonder if some sort of error signaling should have been baked into `Transfer-Encoding: chunked` instead. It wouldn't have made sense in HTTP/1.1 since you can just close the connection. But in later HTTP versions with pipelined requests, I can see the use for bailing on one request while keeping the rest alive.

It did not convince me.

I would rephrase the argumentation as:

- In HTTP/2 we thought we were being smart by multiplexing multiple HTTP transactions over a single TCP connection.

- Shit we realized later that HTTP/1.1 did not necessitate trailers because they could abort the connection and we can not afford to do that anymore, we are multiplexed now.... Shit, we do need trailers now.

->

- That is currently a good example of good intentions inducing complexity. And complexity inducing even more complexity for free.

- HTTP/2 has now been rolled out across the world and everybody has to deal with that.

- Still, HTTP/2 suffers from several problems completely ignored by the blog post (like the head-of-line blocking problem: it is not solved in HTTP/2). The result is now QUIC + HTTP/3 and we all start over again.


>Shit we realized later that HTTP/1.1 did not necessitate trailers because they could abort the connection and we can not afford to do that anymore

This is an interesting point, but I don't think it's correct. The HTTP/2 spec allows you to send a RST_STREAM frame to indicate that an individual stream had a problem, to be contrasted with the END_STREAM flag that indicates an individual stream ended successfully.

https://www.rfc-editor.org/rfc/rfc7540.html


Right, the author seems to be unfamiliar with H2, which is bizarre for someone who worked on gRPC.


> Like the Head-of-Line blocking problem: it is not solved in HTTP/2

How so? It is not solved on the network level (due to its use of TCP), but it is solved on application layer: slow response to one request is not blocking other requests.


> How so? It is not solved on the network level (due to its use of TCP), but it is solved on application layer: slow response to one request is not blocking other requests.

Like you said, it does not solve the network level:

- In HTTP/2, a packet drop will slow down every HTTP transaction.

- In HTTP/1 with a session pool (the usual web browser way of doing things), it will slow down only one out of X (X being the size of the pool).

This has been found to be a big problem over (unreliable) mobile networks, and it makes HTTP/2 sometimes even worse than HTTP/1.

http://nl.cs.montana.edu/lab/publications/Goel_H2_extended.p...


Offtopic, but:

> However, Google is not one single company, but a collection of independent and distrusting companies.

This is an important thing to keep in mind when considering the behavior of any large company.


As a Googler, it's worse. Even within one of the "companies" there is distrust in the chain of command.


Googler here. Yeah, it goes my boss, another guy I'm somewhat familiar with, then some cloud of VPs or something and then Sundar.

I have no idea what those VPs are up to and not much faith in their decisions.


When I joined I found it funny you can't DM Sundar. But you can DM SVPs. At least the SVP of my division is open on the chat app.


Relevant post from a few days ago:

Connect-Web: TypeScript library for calling RPC servers from web browsers

https://news.ycombinator.com/item?id=32345670

I’m curious if anyone knows how Google internally works around the lack of support for gRPC in the browser? Perhaps gRPC is not used for public APIs?

The lack of browser support in the protobuf and gRPC ecosystem was quite surprising and one of the biggest drawbacks noted by my team while evaluating various solutions.


Back in the day, it wasn't used for private API's either. Different teams had come up with different ways of encoding protobuf-style messages as JSON for web apps.

For the best browser-side performance, usually you want to use browser's native JSON.parse() API call and this doesn't really let you use unmodified protobufs. In particular, you can't use 64-bit ints since that's not a native JavaScript type. Meanwhile, server-side folks will use 64-bit ints routinely. So if the server-side folks decided on 64-bit integer ID's, you need workarounds like encoding them as strings on the wire.

JavaScript has BigInt now, but still doesn't natively support decoding 64-bit integers from JSON.

It didn't seem like the gRPC folks understood the needs of web developers very well.


Is decoding performance typically a problem for web UIs? The lackluster performance of binary protobuf decoding in browsers (and unmarshaling BigInts from JSON) seems much less problematic than (1) using a 200 for unary error responses, (2) choosing a wire format that's _always_ opaque to the network inspector tab, and (3) having really poor generated code.

> It didn't seem like the gRPC folks understood the needs of web developers very well.

Agreed. Being fair to the team that designed the protocol, though, it seems like browsers weren't in scope at the time.


Looks like they have a separate protocol[0] for web compat, and they use a proxy to translate.

[0]: https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-WEB.md


Google internally doesn't have browser grpc clients. It's for service to service rpc and also exposed on googleapis.com apis to third party callers.


Is it still the case that Google Chrome can't support Google gRPC?


To be fair, it's also true that Firefox doesn't expose trailers to the fetch API. To the Chrome and gRPC team's credit, the public Chrome issue actually contains a somewhat substantive back-and-forth; the corresponding Firefox issue has virtually no discussion.

https://bugzilla.mozilla.org/show_bug.cgi?id=1339096


It requires a translation proxy: https://github.com/grpc/grpc-web


Trailers would be theoretically useful in a variety of HTML streaming-related cases if they actually had widespread support (but they don't):

- sending down Server-Timing values for processing done after the headers are sent

- updating the response status or redirecting after the headers are sent

- deciding whether a response is cacheable after you've finished generating it

All of these except the first one obviously break assumptions about HTTP and I'm not surprised they're unsupported. Firefox [1] actually supports the first case. The rest have workarounds, you can do a meta-refresh or a JS redirect, and you could simply not stream cacheable pages (assuming they'd generally be served from cache anyway).

But it's still the case that frontend code generally likes to throw errors and trigger redirects in the course of rendering, rather than performing all that validation up front. That's sensible when you're rendering in a browser, but makes it hard to stream stuff with meaningful status codes.


Great article. I really like the points in "Lessons for Designers" section. Applicable for software engineering in general as well.


The author also posted an interesting Twitter thread a few months ago [0], on the day my coworkers and I posted here about our gRPC-compatible RPC framework [1]. I was a bit afraid to read this post, but I shouldn't have been - the author's a class act, and he never called us out explicitly. There's not much written about what the gRPC team was _thinking_ when they wrote up the protocol, and this was a nice window into how contemporaneous changes to HTTP and the fetch API shaped their approach. Given my current work, the final section ("Lessons for Designers") really hit home.

That said, I didn't follow the central argument - that you need HTTP trailers to detect incomplete protobuf messages. What's not mentioned in the blog post is that gRPC wraps every protobuf message in a 5-byte envelope, and the bulk of the envelope is devoted to specifying the length of the enclosed message. It's easy to detect prematurely terminated messages, because they don't contain the promised number of bytes. The author says, "[i]t’s not hard to imagine that trailers would be less of an issue, if the default encoding was JSON," because JSON objects are explicitly terminated by a closing } - but it seems to me that envelopes solve that problem neatly.

With incomplete message detection handled, we're left looking for some mechanism to detect streams that prematurely terminate at a message boundary. (This is more likely than you might expect, since servers often crash at message boundaries.) In practice, gRPC implementations already buffer responses to unary RPCs. It's therefore easy to use the standard HTTP Content-Length header for unary responses. This covers the vast majority of RPCs with a simple, uncontroversial approach. Streaming responses do need some trailer-like mechanism, but not to detect premature termination - as long as we're restricting ourselves to HTTP/2, cleanly terminated streams always end with a frame with the end of stream bit set. Streaming does need some trailer-like mechanism to send the details of any errors that occur mid-stream, but there's no need to use HTTP trailers. As the author hints, there's some unused space in the message envelope - we can use one bit to flag the last message in the stream and use it for the end-of-stream metadata. This is, more or less, what the gRPC-Web protocol is. (Admittedly, it's probably a bad idea to rely on _every_ HTTP server and proxy on the internet handling premature termination correctly. We need some sort of trailer-like construct anyways, and the fact that it also improves robustness is a nice extra benefit.)

So from the outside, it doesn't seem like trailers improve the robustness of most RPCs. Instead, it seems like the gRPC protocol prioritizes some abstract notion of cleanliness over simplicity in practice: by using the same wire protocol for unary and streaming RPCs, everyday request-response workloads take on all the complexity of streaming. Even for streaming responses, the practical difficulties of working with HTTP trailers have also been apparent for years; I'm shocked that more of the gRPC ecosystem hasn't followed .NET's lead and integrated gRPC-Web support into servers. (If I had to guess, it's difficult because many of Google's gRPC implementations include their own HTTP/2 transport - adding HTTP/1.1 support is a tremendous expansion in scope. Presumably the same applies to HTTP/3, once it's finalized.)

Again, though, I appreciated the inside look into the gRPC team's thinking. It takes courage to discuss the imperfections of your own work, especially when your former coworkers are still supporting the project. gRPC is far from perfect, but the engineers working on it are clearly skilled, experienced, and generally decent people. Hats off to the author - personally, I hope to someday write code influential enough that a retrospective makes the front page of HN :)

0: https://twitter.com/CarlMastrangelo/status/15322565762742435...

1: https://news.ycombinator.com/item?id=31584555


This is a fantastic take; also, you're right about server-side gRPC-Web.

Java had an experimental implementation that was abandoned.

If Google were using gRPC-Web internally, TypeScript and Java support would be first class.


To this day it's still not clear to me - even when asked on their GitHub issues there is no definite answer:

Can one use nginx in front of a gRPC-serving backend if the client is a JS client in the broadest sense?

This unanswered question is the main reason I'm still doing RESTful JSON.


Very nice post. HTTP/2 did not solve the TCP HOL problems though. Not sure about the WS statement. On the other hand, vanilla WS has never ended up in prod on any of my projects, even though it has been implemented several times.


I'm not getting it. Why is HTTP so inadequate for gRPC?

A service app for example can open 1000 sockets with a server and simply multiplex that way.


The author doesn't support the case for gRPC being a "failure". I wonder by what measure. It's certainly pretty popular.


Being able to use gRPC in browsers was an explicit goal of the project, and that is impossible. gRPC-Web has to use a modified version of the protocol that has a limited feature set and performs worse for the reasons described in the article.


The lack of trailers support is not why gRPC-Web needs to exist. It would be trivial for gRPC to support trailers-in-body like gRPC-Web does, as a workaround. The main problem for gRPC in the browser is the lack of HTTP/2 framing for messages, which means gRPC-Web has to invent its own framing format to make streams work.

My experience in the early days of gRPC is that they seemed fairly unwilling to consider any need for an easy upgrade path for existing people using HTTP/1.1 at all.

The author touches on this at the end:

> Focus on customers. Despite locking horns with other orgs, our team had a more critical problem: we didn’t listen to early customer feedback.

I'm glad they realise it now because lots of us warned them about this at the time.


Let's call the next version gRPC Send. Then the VP can "leave" after a year or so, the project can get scrapped, and we can go back to something decent XD


I was going to post a similar comment, but looking back at the post I realised that the author is upfront about what they consider the original point of gRPC to be:

> gRPC was reared by two parents trying to solve similar problems:

> 1. The Stubby team. They had just begun the next iteration of their RPC system...

> 2. The API team. ... serving (all) public APIs at Google ... [this is not said explicitly but presumably the vast majority of API clients are web based]

As you say, gRPC is very popular for server messaging, but I suppose it can never be an API solution. So, even if gRPC is successful in general, it was not successful at its original goal (as far as this author is concerned).


gRPC: protobuf and stubby for performance reasons, we’ve spared no expense.


Application-layer encoding should not interfere with the protocol's transport layer.


This is more about whether it is acceptable to push error checking to the application layer or not. Seems like the gRPC designers agreed with you and so they need trailers.


I was so excited for gRPC when it came out because it meant having strongly typed APIs and auto-generated clients, but two things made it horrible to use: requiring http/2 (so you couldn’t use most load balancers at the time) and the generated clients were unpleasant to use (you couldn’t just return an object to serialize, you had to conform to their streaming model).


Check out Twirp: you get the good parts, protobufs and generated code, but it supports regular HTTP by not including streaming.

https://github.com/twitchtv/twirp


[flagged]


This is very much not it. The Stubby team decided early on to use trailers and was combative and obstinate when the Chrome and GFE (reverse proxy) teams tried to explain why doing so would be a bad idea. The use of trailers in gRPC originates in the hubris of one team, not a conspiracy by Google.


If you read the article you’d find out the opposite. That Chrome continues to lack support for trailers.



