
You know, articles like this make me wish an OS would actually have a built-in fast, reliable, fully-featured HTTP parser. I've written a couple of (very strict) HTTP parsers on my own, and this whole "request smuggling" is possible only because HTTP messages have rather delicate and fragile, totally non-robust framing and structure. Miss a letter in one of the relevant RFCs (IIRC, you need to read at least the first 3 RFCs to learn the complete wire format of an HTTP message) and you'll end up with a subtly non-compliant and vulnerable parser.

And yet, every single programming language/platform builds its own HTTP-handling library, usually several, of very varying quality and feature support. Again, it would not be as bad if HTTP were a robust format where you could skip recognizing and correctly dealing with half of the features you don't intend to support, but it is not: even if you don't want to accept e.g. trailers, you still have to be aware of them. We have OpenSSL, why not also have OpenHTTP (in sans-io style)?
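A minimal sketch of what the sans-io style mentioned above could look like in Python: the parser never touches a socket, it only takes bytes and returns a result once the request head is complete. The names (`HeadParser`, `feed`, `FramingError`) are made up for illustration, and the checks cover only a few framing conflicts, not the full rules from the RFCs.

```python
class FramingError(ValueError):
    """The request head is malformed or its body framing is ambiguous."""


class HeadParser:
    """Hypothetical sans-io parser: bytes in, parsed head out, no sockets."""

    def __init__(self):
        self._buf = b""

    def feed(self, data: bytes):
        """Return (method, target, headers, framing), or None if more bytes are needed."""
        self._buf += data
        end = self._buf.find(b"\r\n\r\n")
        if end < 0:
            return None
        lines = self._buf[:end].split(b"\r\n")
        parts = lines[0].split(b" ")
        if len(parts) != 3:
            raise FramingError("bad request line")
        headers = []
        for line in lines[1:]:
            name, sep, value = line.partition(b":")
            # Strict: reject empty names and any whitespace in the field name.
            if not sep or not name or b" " in name or b"\t" in name:
                raise FramingError("bad header line")
            headers.append((name.lower(), value.strip(b" \t")))
        return parts[0], parts[1], headers, self._framing(headers)

    @staticmethod
    def _framing(headers):
        lengths = {v for n, v in headers if n == b"content-length"}
        codings = [v for n, v in headers if n == b"transfer-encoding"]
        if codings and lengths:
            raise FramingError("both Content-Length and Transfer-Encoding")
        if codings:
            final = codings[-1].split(b",")[-1].strip(b" \t").lower()
            if final != b"chunked":
                raise FramingError("final transfer coding is not chunked")
            return ("chunked", None)
        if len(lengths) > 1:
            raise FramingError("conflicting Content-Length values")
        if lengths:
            (value,) = lengths
            if not value.isdigit():
                raise FramingError("non-numeric Content-Length")
            return ("length", int(value))
        return ("none", 0)
```

The caller owns all I/O: it reads from wherever it likes, calls `feed()` with each piece, and treats `FramingError` as a reason to close the connection rather than guess.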

I've implemented a few things from RFCs and I always wish that for each RFC there was a library of test cases to test your implementation.

Does anyone know if there is anything like this for HTTP or associated RFCs?

Eg, for HTTP header parameter, names can have a * to change the character encoding of the parameter value. How many implementations test this? Or tests for decoding of URI paths that contain escaped / characters to make sure they're not confused with the /s that are the path separators.
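To illustrate the escaped-slash case: a rough sketch, assuming the path arrives as a raw, still percent-encoded string. Splitting on the literal `/` before decoding keeps `%2F` inside its segment; decoding first loses that distinction. `path_segments` is a hypothetical helper name.

```python
from urllib.parse import unquote


def path_segments(raw_path: str) -> list[str]:
    """Split on literal '/' first, then percent-decode each segment."""
    if not raw_path.startswith("/"):
        raise ValueError("expected an absolute path")
    return [unquote(segment) for segment in raw_path.split("/")[1:]]


# Decoding first and splitting afterwards confuses data with separators.
wrong = unquote("/a%2Fb/c").split("/")[1:]
right = path_segments("/a%2Fb/c")
```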

Or at least a bunch of examples in the RFC itself. Don't you just love reading a long description of a convoluted data format with literally zero examples of how the full thing looks and what it is supposed to mean? Sadly, leaving the validation undocumented is pretty common across formats/protocol descriptions, and RFCs actually seem to generally be on the "more specific" end of the scale, thanks to the ubiquitous use of MUST/SHOULD language. But I recently wrote a toy ELF parser and it's amazing how many things in its spec are left implicit: e.g. you probably should check that calculating a segment's end (base+size) doesn't overflow and wrap over zero... should you? Maybe you're supposed to support segments that span over the MAX_MEMORY_ADDRESS into the lower memory, who knows? The spec does not say.
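For the segment-end question, here is one possible strict policy as a sketch; it is an assumption, not something the ELF spec states. Python integers do not wrap, so the check has to be explicit. `segment_end` and `ADDRESS_BITS` are illustrative names, and allowing a segment to end exactly at the top of the address space is an arbitrary choice.

```python
ADDRESS_BITS = 64  # assumption: a 64-bit ELF file


def segment_end(base: int, size: int, bits: int = ADDRESS_BITS) -> int:
    """Return base + size, refusing segments that wrap past the top of the address space."""
    limit = 1 << bits
    if not (0 <= base < limit and 0 <= size < limit):
        raise ValueError("field does not fit in the address width")
    end = base + size
    if end > limit:
        raise ValueError("segment wraps around the address space")
    return end
```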

You know, the IETF is an open group, and you can write some examples and submit them as a pull request:


>Eg, for HTTP header parameter, names can have a * to change the character encoding of the parameter value

Where did you read this? HTTP header fields may contain MIME-encoded values using the encoding scheme outlined in RFC 2047, but I haven't heard of the asterisk having any special meaning...

I believe he refers to RFC 8187.
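For context, RFC 8187 values have the shape `charset'language'percent-encoded-bytes` and are attached to a parameter whose name ends in `*` (e.g. `title*=`). A minimal decoding sketch, assuming only UTF-8 needs to be supported; `decode_ext_value` is a hypothetical helper and it does not validate which characters were left unescaped.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value: str) -> str:
    """Decode charset'language'pct-encoded-value; only UTF-8 is accepted here."""
    charset, _language, encoded = ext_value.split("'", 2)
    if charset.lower() != "utf-8":
        raise ValueError(f"unsupported charset: {charset!r}")
    return unquote(encoded, encoding="utf-8", errors="strict")


# The value part of: title*=UTF-8''%e2%82%ac%20rates
decoded = decode_ext_value("UTF-8''%e2%82%ac%20rates")
```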

Fuchsia has an http client component [0][1] which is part of the platform and, given Fuchsia's component architecture, it's accessed through a message-passing protocol [2] which is programming language agnostic.

[0]: https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl...

[1]: https://cs.opensource.google/fuchsia/fuchsia/+/main:src/conn...

[2]: https://fuchsia.dev/fuchsia-src/reference/fidl/language/lang...

When I started to build my browser [1] I realized that there's literally no standard test suite to test your HTTP implementation against.

There are test suites for _some_ subsets of the spec, and there are implementation-specific test suites (e.g. in chromium) ... but there's not a single HTTP 1.1 all-in-one test server that you can test your client or server implementation against - over the wire.

Add to that the lack of tests for hop-by-hop networking changes (e.g. the Transfer-Encoding parts of the spec in 1.1) and you have a disaster waiting to happen.

Combine that with 206 Partial Content and say, some byte ranges a server cannot process...and you've got a simple way to crash a lot of server implementations.

There's not a single web server implementation out there that correctly implements multiple byte range requests and responses, especially not when chunked encoding can be requested as well. Don't get me started on the ";q=x.y" value headers, they are buggy everywhere, too.
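As a concrete reference point for the ";q=x.y" part: the grammar allows at most three decimal digits and nothing above 1. A strict sketch for simple weighted lists such as Accept-Encoding follows; `parse_weighted` is a hypothetical name, and the code deliberately does not handle Accept, where media types carry their own parameters.

```python
import re

# qvalue = ( "0" [ "." 0*3DIGIT ] ) / ( "1" [ "." 0*3("0") ] )
_QVALUE = re.compile(r"(?:0(?:\.[0-9]{0,3})?|1(?:\.0{0,3})?)")


def parse_weighted(header: str) -> list[tuple[str, float]]:
    """Parse e.g. 'gzip;q=0.8, br' into (value, weight) pairs, highest weight first."""
    items = []
    for element in header.split(","):
        element = element.strip()
        if not element:
            continue
        value, sep, param = element.partition(";")
        weight = 1.0  # a missing weight means q=1
        if sep:
            key, _, raw = param.strip().partition("=")
            if key.lower() != "q" or not _QVALUE.fullmatch(raw):
                raise ValueError(f"bad weight in {element!r}")
            weight = float(raw)
        items.append((value.strip(), weight))
    # sorted() is stable, so equal weights keep their original order.
    return sorted(items, key=lambda item: item[1], reverse=True)
```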

For my browser project, I had to build a pcap (tcpdump) based test runner [2] that can set up temporary local networks with throttling and fragmentation behaviour so that I have reproducible tests that I can analyze later when they fail. Otherwise it would be a useless network protocol test that's implementation-specific like all the others.

I think the web heavily needs a standard HTTP test suite, similar to the Acid tests back then...but for malicious and spec compliant HTTP payloads combined.

[1] https://github.com/tholian-network/stealth

[2] https://github.com/tholian-network/stealth/tree/X0/covert

So from that observation, why don't you put your browser project on hold for a while and start that http test server project?

I think there currently is no such thing because writing test cases for protocols is an uphill start. You simply don't have any constraints on how to start. Write the tests in plain text? How to encode the behavior? Write the tests in a programming language? How to execute the tested client? It's not impossible to have a client/server-agnostic test library, but it's non-trivial to design the framework.

"And the next thing you know, you’re at the zoo, shaving a yak, all so you can wax your car."

That said, it depends on your goals. Writing pragmatic, limited test cases for protocols is super hard, due as you say to the lack of constraints.

But if your goal from the outset is to write a definitive, exhaustive test suite then it's a far more mechanical task (much the way that writing a chess AI is hard if you want it to run on a desktop computer, but writing a program to play perfect chess only requires a simple understanding of graph searching if you don't care how fast it is.) Just start from the start of the protocol and work your way through one statement at a time, enumerate all the different ways that an implementation could cock it up, and write a test for each. Of course there are still engineering decisions to be made but you don't have to pick the perfect solution to each. A solution is enough, you (or someone else) can always improve it later.
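One way to pick "a solution" for encoding such tests, sketched under assumptions: each case is raw request bytes plus the verdict a strict server should give, and the implementation under test is any callable. The vectors, the expected verdicts (a strict policy, not normative text) and `toy_accepts` are all invented for illustration.

```python
# Each vector: (name, raw request bytes, whether a strict server should accept it).
VECTORS = [
    ("plain GET", b"GET / HTTP/1.1\r\nHost: a\r\n\r\n", True),
    (
        "CL and TE together",
        b"POST / HTTP/1.1\r\nHost: a\r\nContent-Length: 3\r\n"
        b"Transfer-Encoding: chunked\r\n\r\n",
        False,
    ),
    ("space before colon", b"GET / HTTP/1.1\r\nHost : a\r\n\r\n", False),
    ("bare LF line endings", b"GET / HTTP/1.1\nHost: a\n\n", False),
]


def run_vectors(accepts, vectors=VECTORS):
    """Return the names of vectors where `accepts(raw)` disagrees with the expectation."""
    return [name for name, raw, expected in vectors if bool(accepts(raw)) != expected]


def toy_accepts(raw: bytes) -> bool:
    """Deliberately naive implementation under test (hypothetical)."""
    return raw.startswith((b"GET ", b"POST ")) and raw.endswith(b"\r\n\r\n")


failures = run_vectors(toy_accepts)
```

Because the vectors are plain bytes, the same table could be replayed over a socket against a real server; only the `accepts` adapter would change.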

> articles like this make me wish an OS would actually have a built-in fast, reliable, fully-featured HTTP parser.

You mean, like Windows?

HTTP is application layer, it should have nothing to do with an operating system. In fact, the OS should likely have no access to the HTTP frame at all if the connection uses TLS.

Can we please stop with this OSI nonsense already? HTTP is a transport-level protocol today. If something uses TCP, chances are pretty good it also uses HTTP on top of that, and some sort of homegrown RPC on top of that.

And the OS absolutely has access to the HTTP frame: it manages the process's network buffers and its whole memory mapping, it locates and loads OpenSSL at the process's startup... a process is really not a black box from the OS point of view.

I think FFI, as you mention with OpenSSL, would be the better approach. And I think this would be a good idea in general. But most languages don't make FFI easy on either side.

The best solution would be to make an http version 4 with a non-fragile format, e.g. json.

Otherwise we will keep chasing bugs forever.

Configuring your webserver/reverse proxy to talk HTTP/2 to backend appservers is a good improvement against request smuggling. (If they support it, sadly not guaranteed). The binary format is much less ambiguous.

There is a talk by James Kettle about request smuggling with HTTP/2, but it is largely about attacks when the frontend talks HTTP/2 and then converts to HTTP/1.1 to talk to backend servers [1]. That said, it does also highlight some HTTP/2-only quirks, so it’s not completely perfect, but it’s so much better than HTTP/1.1.

[1]: https://portswigger.net/kb/papers/rfekn2Uv/HTTP2whitepaper.p...
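To show why the binary framing is less ambiguous: every HTTP/2 frame starts with a fixed 9-byte header (24-bit payload length, 8-bit type, 8-bit flags, one reserved bit, 31-bit stream identifier), so there is no delimiter to disagree about. A small sketch; `parse_frame_header` is an illustrative name and the type table is partial.

```python
FRAME_TYPES = {0x0: "DATA", 0x1: "HEADERS", 0x3: "RST_STREAM", 0x4: "SETTINGS"}


def parse_frame_header(data: bytes):
    """Return (payload_length, frame_type, flags, stream_id) from the first 9 bytes."""
    if len(data) < 9:
        raise ValueError("need at least 9 bytes for a frame header")
    length = int.from_bytes(data[0:3], "big")
    frame_type = data[3]
    flags = data[4]
    # The top bit of the last four bytes is reserved and must be ignored.
    stream_id = int.from_bytes(data[5:9], "big") & 0x7FFFFFFF
    return length, frame_type, flags, stream_id


# A HEADERS frame header: 16-byte payload, END_STREAM | END_HEADERS, stream 1.
header = parse_frame_header(b"\x00\x00\x10\x01\x05\x00\x00\x00\x01")
```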

A lot of the new http bugs aren’t caused by ambiguities in http1 headers, or ambiguities in http2 headers. They happen when an http2 message gets rewritten into http1 and “valid” http header characters (like new lines) show up as header separators in http1.

The problem isn’t that we don’t have a good header format. The problem is we have too many.
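A sketch of the kind of check a downgrading proxy needs at that rewrite step, assuming header fields arrive as byte strings from the HTTP/2 side. `to_http1_line` is a hypothetical helper; pseudo-headers such as `:path` are rejected here as invalid names and would have to be mapped to the request line separately.

```python
import string

# Lowercase token characters only: HTTP/2 field names must be lowercase.
_TOKEN = frozenset((string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~").encode())
_BAD_VALUE = frozenset(b"\r\n\x00")


def to_http1_line(name: bytes, value: bytes) -> bytes:
    """Serialize one header field as an HTTP/1.1 header line, or refuse."""
    if not name or not all(byte in _TOKEN for byte in name):
        raise ValueError("invalid field name")
    if any(byte in _BAD_VALUE for byte in value):
        # A CR or LF here would become a header separator after rewriting.
        raise ValueError("CR, LF or NUL in field value")
    return name + b": " + value + b"\r\n"
```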

> a non-fragile format, e.g. json.

JSON is a terrible format. Especially for streaming data.

Here's a super simple shell script for generating invalid JSON that will blow Python's stack:

    n="$(python3 -c 'import math; import sys; sys.stdout.write(str(math.floor(sys.getrecursionlimit() - 4)))')"
    left="$(yes [ | head -n "$n" | tr -d '\n')"
    echo "$left" | python3 -c 'import json; print(json.loads(input()))'

It is invalid JSON. But you cannot tell if it's invalid until either:

a) You run out of memory.

b) The connection ends.

Because JSON is terrible for streaming data.
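A smaller illustration of the same point, kept well below the recursion limit: for a run of unclosed brackets, the standard `json` module reports the first error at the very end of the input, so every shorter prefix looked like a document that might still become valid. `first_error_position` is a helper name made up for this sketch.

```python
import json


def first_error_position(text: str):
    """Return the offset where json.loads gives up, or None if the text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc.pos
    return None


unclosed = "[" * 50
position = first_error_position(unclosed)
```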

http 2 and 3 have much stricter defined binary formats. JSON would be a step back in terms of spec and performance.

JSON has the problem that different parsers handle multiple occurrences of an object element differently. You need to watch out that header names are only ASCII, otherwise you could run into string comparisons being different on different platforms.
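For instance, Python's `json.loads` silently keeps the last occurrence of a repeated key. A sketch of making that an error instead, using the module's `object_pairs_hook` parameter; `reject_duplicates` and `strict_loads` are illustrative names.

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that refuses objects with repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


def strict_loads(text: str):
    return json.loads(text, object_pairs_hook=reject_duplicates)


lenient = json.loads('{"a": 1, "a": 2}')  # the last value wins
```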

Why is HTTP so complex? The base use case (hypermedia request-response) sounds really simple.

HTTP 1.1 is an old protocol, over time new requirements made modifications necessary, some things fell out of use, and some changes turned out to be mistakes. That it's text-based without using doesn't help

The basis is simple, but then add Cookies, HTTPS, Authentication, Redirecting, Host headers, caching, chunked encoding, WebDAV, CORS, etc etc. All justifiable but all adding complexity.

HTTP/0.9 is pretty simple, but for a fast web we need more complexity.

More parsing and data processing = faster web. It all makes sense, really!

Joking aside, some "features" in HTTP/1.1 are really questionable. Trailing headers? 1xx responses? Comments in chunked encoding? The headers that proxies must cut out in addition to those specified in the "Connection" header, except the complete list of those is specified nowhere? The methods that prohibit the request/response from having a body, but again, the full list is nowhere to be found?

All these features have justifications but the end result is a protocol with rather baroque syntax and semantics.

P.S. By the way, the HTTP/1.1 specs allow a GET request to have a body in chunked encoding — guess how many existing servers support that.
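To make the chunked-encoding features above concrete, here is a sketch of a decoder for a complete chunked body that tolerates chunk extensions (the "comments" after `;`) and collects trailer lines. `decode_chunked` is a hypothetical name; the strictness choices (no whitespace around the size, CRLF only) are assumptions of this sketch.

```python
_HEX = b"0123456789abcdefABCDEF"


def decode_chunked(body: bytes):
    """Return (payload, trailer_lines) for a complete chunked body, or raise ValueError."""
    payload = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol < 0:
            raise ValueError("missing chunk-size line")
        # Everything after ';' is a chunk extension and is ignored here.
        size_field = body[pos:eol].split(b";", 1)[0]
        if not size_field or any(byte not in _HEX for byte in size_field):
            raise ValueError("bad chunk size")
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            break
        if body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk data not followed by CRLF")
        payload += body[pos:pos + size]
        pos += size + 2
    trailers = []
    while True:
        eol = body.find(b"\r\n", pos)
        if eol < 0:
            raise ValueError("unterminated trailer section")
        if eol == pos:  # empty line ends the message
            return bytes(payload), trailers
        trailers.append(body[pos:eol])
        pos = eol + 2


result = decode_chunked(b"5;note=hi\r\nhello\r\n6\r\n world\r\n0\r\nX-Sum: abc\r\n\r\n")
```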

Nearly all L7 protocols and their parsers are complex. HTTP is kind of simple, relatively speaking.
