Surprising things about HTTP (webdevguild.com)
92 points by vinnyglennon on June 9, 2022 | 61 comments




There are quirkier things:

• Content-Length is defined to be a number, but it actually may be a comma-separated list. This is due to a rule that repeated headers can be merged into one header with a list, and buggy software that happened to output multiple Content-Length headers that subsequently got merged.

• If you want to implement a crawler or proxy that works with all real-world HTTP servers, then no matter how broken and invalid the responses you get are, if Internet Explorer accepted them (and it accepted pretty much everything), you have to make sense of them too.

• HTTP/1 is vulnerable to request smuggling. There can be multiple HTTP messages on the same TCP/IP connection sent one after another. If the sender and recipient don't parse the length of the first message in exactly the same way, they'll get out of sync and the recipient will start interpreting the (often attacker-controlled) message body as subsequent HTTP requests/responses. (See the sketch after this list.)

• Expect + 100 Continue lets a client start uploading a file only if the server gives it a green light (the client sends headers and waits for a status 100 response). However, this part of the protocol is kinda buggy, because the client has no way of knowing whether the server supports status 100. If the client gives up waiting for status 100 and falls back to sending the body anyway, there's a risk of a race condition where the server denied the upload but still received the body. In that case the server will parse the body as a new HTTP request.
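
To make the smuggling point concrete, here's a minimal sketch of the classic CL.TE desync (hostname hypothetical): a front-end proxy that honours Content-Length and a back-end that honours Transfer-Encoding will disagree about where the body ends.

    POST / HTTP/1.1
    Host: example.com
    Content-Length: 13
    Transfer-Encoding: chunked

    0

    SMUGGLED

The front-end forwards all 13 body bytes ("0\r\n\r\nSMUGGLED"); the back-end sees the chunked body end at the zero-length chunk and treats "SMUGGLED" as the start of the next request on the connection.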


As much as I'll miss being able to type raw HTTP into an nc/telnet/openssl s_client when H/1.1 support is dropped, these are all caused by HTTP originally being a simple, textual protocol that got thrust into supporting connection reuse, "flow control" and a bunch of other stuff. All solved by H/2 and beyond.


One surprising thing about HTTP/3 is that QPACK (the compression scheme used to transmit HTTP fields) has a static table of the most common header fields baked into the spec, plus a static Huffman code for literals [1]!

The other day I was implementing an HTTP/3 client and I wanted to test the header encoding, so I just sent an example value "content-type: text/plain" and checked the output. The entire header was just 3 bytes! I then spent an hour looking for bugs in my code, assuming there must be a problem with it, since no regular compression algorithm can compress text that much. Then I finally opened the IETF draft and discovered the table there. So yeah, it's a compression algorithm that's really specifically geared towards HTTP.

[1]: https://greenbytes.de/tech/webdav/draft-ietf-quic-qpack-15.h...
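
If you want to see the mechanics, here's a minimal Python sketch of how a fully-indexed QPACK field line is encoded: a 2-byte field section prefix plus 1 byte for the indexed entry is exactly the 3 bytes mentioned above. The concrete index below is hypothetical; the real entries are in the static table appendix of the spec.

    # Minimal sketch of QPACK's "Indexed Field Line" encoding, static table only.
    # STATIC_INDEX is hypothetical; look up the real entry in the spec's appendix.

    def encode_prefix_int(value: int, prefix_bits: int, flags: int) -> bytes:
        """HPACK/QPACK prefix-integer encoding (RFC 7541, section 5.1)."""
        max_prefix = (1 << prefix_bits) - 1
        if value < max_prefix:
            return bytes([flags | value])
        out = [flags | max_prefix]
        value -= max_prefix
        while value >= 128:
            out.append((value % 128) | 0x80)
            value //= 128
        out.append(value)
        return bytes(out)

    def encode_indexed_static(index: int) -> bytes:
        # Pattern 0b11xxxxxx: indexed field line, static table, 6-bit prefix.
        return encode_prefix_int(index, 6, 0xC0)

    STATIC_INDEX = 53  # hypothetical index for a content-type entry
    # 2-byte prefix (Required Insert Count = 0, Base = 0) + 1-byte field line:
    field_section = bytes([0x00, 0x00]) + encode_indexed_static(STATIC_INDEX)
    print(field_section.hex(), len(field_section))  # 3 bytes total

The whole header fits in one byte of payload because it's a table lookup, not general-purpose compression; only values not found in a table fall back to Huffman-coded literals.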


As the link you provided implies (it's in the Security Considerations section of the document), this decision also prevents QPACK from becoming an effective oracle.

Attacks like CRIME or BREACH abuse compression with dynamic behaviour as an Oracle. The attacker can't see the secret token, but they can send arbitrary data over the wire with the secret token and the dynamic compression will adjust based on the attacker's inputs and the secret token, changing the compressed data size. Over many iterations they can learn what the token must be this way, but QPACK effectively defeats that, for HTTP/3 headers, while still delivering compression (you can of course also defeat these Oracle attacks by never using compression).
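
The mechanism is easy to demo with any LZ-based compressor. A toy sketch (secret and guesses made up; at these sizes the signal may be a single byte or absent, which is why real attacks average over many samples):

    import zlib

    SECRET = "sessionid=7f3a91c2"  # hypothetical secret the attacker can't read

    def oracle(attacker_input: str) -> int:
        # The attacker controls part of the plaintext that gets compressed
        # together with the secret, and can observe the compressed size.
        return len(zlib.compress((attacker_input + SECRET).encode()))

    # A guess sharing a longer prefix with the secret tends to compress
    # equal-or-smaller, because LZ77 can emit a longer back-reference.
    print(oracle("sessionid=7"))  # right prefix
    print(oracle("sessionid=X"))  # wrong prefix

Iterated one character at a time, that size difference recovers the whole token.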


Does this kind of header compression provide anything of value? Asking seriously. I'm not an HTTP person, but typically I'd expect the body to dwarf the header, and if you are doing so many small requests that the header is larger than the rest, then I wonder if maybe that's not the right way.

If header compression is worth much, then OK; if it isn't, then why is it being done?


I think so. It really saves a lot of bandwidth, as almost all HTTP requests include several of the common headers encoded in the static table.


I've been doing web development since 1996. It catches my attention that these topics are considered "surprising" by newer developers.

Some people see this as a sign that younger developers are weaker. Not at all.

There seem to be certain limits to the amount of knowledge we manage to get access to. Past a certain point, you start to get into an area full of "unknown unknowns" and, even after turning them into "known unknowns", it's very hard to find further information to clarify them. When I started my career, some topics were extremely hard for me to grasp. I couldn't even articulate my questions about those subjects. And when I started to be able to, it was hard to find information, either in the form of articles or in the form of knowledgeable people.

On the other hand, older developers, who had had access to earlier computers and programming languages, had no problem at all mastering those topics.

Nowadays I feel the same about other fields I've entered more recently. Some older ham radio operators I know seem to have an intuition about topics I started to understand only after months of research -- I finally found the answers in books from the 70s and 80s. I have friends in different branches of engineering who feel the same about their fields.

It seems that we have an inclination to abstract away the fundamentals. And this is terrible in all sorts of ways.


Is it?

If you don't abstract away the fundamentals, doesn't that mean you have to understand the internals of everything?

That would be quite a lot! (physics, chemistry, mathematics, etc.)

I would contend that 99.99% of everything we deal with (if you visually look around your surroundings) is "abstracted away". Perhaps, for software engineers, this is only 99% when it comes to software.


If your day job involves a particular field of maths or physics then I'd argue you're definitely supposed to have an understanding of the fundamentals. You can't build a bridge if you don't know how torque works or how environmental factors interact with the forces at play.

You don't need to know all of the frontend stuff if you're doing backend work, you don't need to know all the backend stuff if you're doing frontend work, and you don't need to know either if you're doing kernel development. In turn, you can leave details like memory management, execution modes, endianness and concurrent locking out if you're writing code for the browser. Browsers can do a lot, but the most challenging parts of programming have already been taken care of by the moment your HTML gets loaded.

There are fields within programming (often overlapping) and if you work in your field, you should probably understand the fundamentals of that field.

It's fine for a web dev to know only that "the response to a DNS query appears somewhere between 1ms and 2s" without having to know how DNS works. You don't have to know _everything_. But, unless you're developing HTML files on people's desktops, as a web dev you do need to know how HTTP works. It's not magic; the base protocol is actually quite simple, as it was originally text based for human readability.

If you can fight your way through Javascript or Typescript with its absurd quirks and problems, you can understand HTTP if you try. Most of web dev is memorising the billion different options that browsers allow you to tweak these days, from CSS to HTML components, to get the design you want. Learning 50 or so other options and a few very basic protocols is hardly difficult in comparison.


Yes, it is.

You're not supposed to understand the internals of everything. But of the things underlying your field of work/study? You should.

We shouldn't think about the fundamentals the whole time. The cognitive load would be unbearable. But we should be aware of them and be able to think about them more deeply if/when needed.

And like I said, this phenomenon is not restricted to software development. An example that comes to mind now: in the crash of Air France Flight 447, the flying pilot's weak grasp of the fundamentals of flight -- he was young and used to Airbus's extremely high level of automation -- has been pointed out as a contributing factor.


I think the article is just poorly titled. Almost all of these are unequivocally basic elements of HTTP. Maybe new web developers aren’t learning how HTTP works, but that’s no excuse to be surprised when they do learn about it.


Most of these are pretty basic but I've been doing web development since around that same time and I had never heard of a Trailer response header.


I've met a number of "web developers" who were at a fairly fundamental level, incapable of architecting web applications due to not knowing how HTTP works.

It's so important in any engineering to understand what primitives you're working with and building on. With web development, everything is built on an HTTP-shaped pipe, and so anything that isn't in HTTP does not fit down that pipe. This becomes an important concern when dealing with security – any security measures not tied to HTTP in some way (a cookie, a header, etc) are just UI, and can be circumvented.


On the topic of "HTTP is just text," I was amazed the first time I made a raw request by hand with netcat. It's not surprising if you're familiar with HTTP/1.1, but at the time, it felt like magic to be able to write out the request by hand.

Just install netcat, run an HTTP server, write

    nc localhost 80
    GET / HTTP/1.1
    Host: localhost
and hit enter twice.

With the transition to HTTP/2, we gained a lot, but binary frames and TLS take away that eureka moment when you see a Wireshark trace and suddenly understand more of what your computer is doing.
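
For what it's worth, the eureka moment still works over TLS as long as the server speaks HTTP/1.1; openssl s_client gives you the same raw pipe (hostname is a placeholder):

    openssl s_client -connect example.com:443 -quiet
    GET / HTTP/1.1
    Host: example.com
    Connection: close

(-quiet suppresses the certificate and session chatter so your typed request isn't drowned out.)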


> Others, like 409 Conflict have such a narrow use-case that I’ve never seen it used in practice.

Huh, I've often seen this used to reject creation of a resource whose unique name or id already exists, e.g. POST:

    POST /things
    { "name": "myUniqueName" }


I actually have a CRUD-type interface for a database, where everything is done with GET and POST (or PUT or whatever), and I use 409 Conflict to mean "There was a conflict, someone else edited this, I rejected your edit because your edit was based on an assumption that is no longer true".

I guess if I used AJAX for everything, I could return a 200 OK and a JSON object meaning "actually everything is not okay, there was a conflict" but I did it the simple way first.


I've seen 412 "precondition failed" returned when there is that sort of conflict. For example, Google Cloud Storage returns 412 if you try to replace exactly one resource (defined as data + generation or etag) with new data: https://cloud.google.com/storage/docs/request-preconditions
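
The standard way to express that kind of conflict check is conditional requests. A sketch with a made-up resource and ETag:

    PUT /things/42 HTTP/1.1
    Host: api.example.com
    If-Match: "v7"
    Content-Type: application/json

    { "name": "newName" }

and, if the stored ETag no longer matches:

    HTTP/1.1 412 Precondition Failed

If someone else edited the resource since you fetched it, its current ETag no longer matches "v7", so the server refuses the write instead of silently overwriting.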


Yeah, like >>409 an account with this email already exists<<


That usage leaks sensitive information.


Leaking that particular bit of information is almost always accepted, because not doing so would create very confusing UX.


Number one, “HTTP is text based”, was true, but with the advent of HTTP/2 it's a binary protocol now.


Yeah, and in fact even for HTTP/1.1 only the start of a request/response is plain text, as the content is transmitted in its raw, binary form. If it's textual content, or none at all, it's indeed all text, but if it's, say, an image being uploaded, then the content is binary. I'm glad that HTTP was so pragmatic about this compared to purist e-mail with its base64.

As for HTTP/2 changes, header names are now all lowercase (the article still uses 1.x casing). A stupid decision to lowercase them, if you ask me, but that's what the HTTP/2 authors went with. Lastly, the article mentions server push, but Chrome announced its deprecation two years ago.


> I'm glad that HTTP was so pragmatic about this compared to purist e-mail with its base64.

The issue with e-mail is that it's store-and-forward, and the server which receives your message might have to forward it to another server which understands only 7-bit characters (using the 8th bit for parity). This worry might not make much sense nowadays, but remember that e-mail is old, back from before "all bytes are 8 bits" won over all other variants. As for HTTP, it was specified over TCP, which has always used 8-bit bytes, so it doesn't have to worry about systems which mangle or reject data with a high bit set.


Mandating lowercase is good because it makes parsing faster.

Of course, there’s going to be some poorly coded client out there that ignores this rule and so everybody else will have to accommodate this half-assedness too.


I get the arguments why lowercasing is better, and it's certainly more beautiful than making everything caps, but the transition from the old system to everything being lower case headers leads to a lot of bugs.

https://github.com/cockpit-project/bots/commit/3cdbaa6d6765a...

https://github.com/edwardspec/mediawiki-aws-s3/issues/49

https://github.com/firecracker-microvm/firecracker/pull/3006

https://github.com/snapview/tungstenite-rs/pull/267


Yes, libs that don’t adhere to the spec need to be fixed or deprecated.


It strikes me as misleading at best even regarding HTTP/1.1, which was of course still capable of transferring binary files. "Just text" might be taken to indicate that it only supports text.


Yeah, but you have to convert the binary into a text representation. Same thing with email, you have to base64 encode binary to transform it into something capable of being represented by a string of ASCII characters.


No, you don't have to - you mustn't. Binary is transferred as binary, except in cases where the actual data already contains encoded values - www-form-urlencoded or application/json, ...


“ The client opens a random port to receive the request on, sends a SYN packet to the server, waits for a SYN-ACK packet from the server…”

The client “connects” to the port, while the server opens/listens on that port.

I should check if I can submit corrections to the OP. Great article, btw!


There’s a lot of things wrong with the article, but that’s not really one. A connection is started by binding a local port. You can’t receive the SYN-ACK if there’s no open socket to receive it.
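
You can watch this happen from Python. A tiny sketch (host is a placeholder):

    import socket

    # connect() implicitly binds an ephemeral local port before sending
    # the SYN; getsockname() reveals which one the OS picked.
    s = socket.create_connection(("example.com", 80))
    print("local: ", s.getsockname())   # e.g. ('192.0.2.10', 54321)
    print("remote:", s.getpeername())   # ('<server ip>', 80)
    s.close()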


What's wrong exactly?


I was expecting something like: sending a payload with GET requests is not prohibited by the standard...


Also interesting: You're allowed to use the same query parameter multiple times in one URL and the server is expected to merge it into an array. But I don't really know which implementations do this.


Yes, I learned this when writing server-side code: it's very easy to extract query params dictionary-style, and then learn that, to be pedantically correct, they should be handled list-style.
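
Python's standard library is one implementation that does handle it list-style, for what it's worth:

    from urllib.parse import parse_qs

    # Repeated keys are collected into a list per parameter name.
    print(parse_qs("tag=http&tag=web&limit=10"))
    # {'tag': ['http', 'web'], 'limit': ['10']}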


IIRC this isn’t actually allowed, but it isn’t disallowed either


This article is pretty bad. Almost every number in this list (ugh, top ten lists on HN) is either not surprising and/or not even related to HTTP in any meaningful way. Like "HTTP is just text" is not surprising in the least, likewise "HTTP is TCP but could go over any delivery protocol" has virtually nothing to do with HTTP.


The whole article gave me the impression HTTP is something the author learned about last weekend. Good for him, I guess.

The article should be titled “10 things I didn’t know about HTTP”.

In his Twitter bio he describes himself as a “TypeScript & React Code Poet”.

Alright.


Meta: I find it strange that this post, which surely was quite interesting even if I actually did know quite a few of these and I'm far from being a web developer, uses Markdown-style backticks for things that I would expect to be in a monospace font.

It could, being styled web content, just use a monospace font, but it doesn't. Strange, and (to me) oddly distracting from the content.


On both my laptop and my phone I'm seeing a monospace font in addition to the backticks.


I see both as well, but I think they were talking about just the non-block code samples. "cURL" in this example: https://live.staticflickr.com/65535/52133970834_47a88e642e_o...


That also looks like a monospace font.


Good stuff. I wholeheartedly concur. Web dev should START with its underpinnings, not gloss over them as complicated, thereby leaving the dev with no sense of the true costs of various decisions.


> Web dev should START with its underpinnings, not gloss over them as complicated

I disagree here. We don't even start by teaching car mechanics about fuel injection cycles; we start with "take the wheels off and put them back on again".


If we're going by car metaphors, in this case HTTP would be the wheels of the car, though. It's what makes the web pages appear in your browser.

HTTP is the wheels, HTML is the chassis, CSS is the paint and JS is the instrument panel. APIs are your roads and JS/CSS web frameworks are the upholstery and infotainment panel, nice pretty features that are easy to modify and customise, carried along by the rest of the stack.

I wouldn't expect web devs to read the browser source code to see how the sausage actually gets made, the fuel injection cycle so to say. However, if you can't tell a HEAD from a GET or learn about Content-Type, Content-Security-Policy and all the other headers that will make or break your application in practice, you're severely limiting yourself.


Yep; but it's worse. A lack of knowledge doesn't just stop you from building some applications. If you don't know what you're doing, it's really easy to accidentally end up with security vulnerabilities in the software you make. I don't want my data ending up in yet more have-I-been-pwned dumps because beginners don't learn the basics.

E.g., lots of Stack Overflow answers recommend setting "Access-Control-Allow-Origin: *" to "fix" certain errors. Cross-origin reads are blocked by default in modern browsers for a good reason. No matter how convenient it is, you should understand the implications before opening floodgates like that.
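
A framework-agnostic sketch of the safer alternative (origin list hypothetical): echo back only origins you explicitly trust, never the wildcard.

    # Safer CORS: an explicit allow-list instead of "*".
    ALLOWED_ORIGINS = {"https://app.example.com"}

    def cors_headers(request_origin: str) -> dict:
        if request_origin in ALLOWED_ORIGINS:
            return {
                "Access-Control-Allow-Origin": request_origin,
                "Vary": "Origin",  # keep shared caches from mixing origins
            }
        return {}  # no CORS headers: the browser blocks cross-origin reads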


> We don't even start by teaching car mechanics about that fuel injection cycles, we start with "take the wheels off and put them back on again".

I imagine that changes depending on whether you're training to be a car mechanic, or studying to be a mechanical engineer.

Similarly some people have learned to program by starting out with assembly, although this approach isn't common today.


I get a feeling this is one of the reasons why the world of programming is such a shitshow.

There isn't a proper distinction between the mechanical-engineer and car-mechanic equivalents. Everyone and their mom is a "software engineer", regardless of whether they're just tinkering (the car-mechanic equivalent) or are among the maybe sub-1% of people out there who do anything resembling engineering (i.e. validating their designs against scientific knowledge and industry/legislation requirements).

The discipline is in its infancy, so there's a hodgepodge of needs all being mixed together.


I'd argue that learning some basics about how a web browser locates and communicates with a server is pretty essential; it is the user experience.

It’s how the pedal connects to a throttle cable which rotates a plate in the throttle body and allows more air into the manifold, which results in more fuel injected and bigger combustion which means higher RPM.

Not overly complicated, but fundamental and should be learnt pretty early on.


I get the sentiment, but it's not just HTTP - it's DNS, TLS, TCP, IP, etc. It really depends on what type of web dev you want to be.


It totally depends on whether you want to teach existing programmers how the web works, or teach programming from scratch with the web as the first technology.


RFC 2616 is not relevant any more, hasn’t been for many years. It was obsoleted by RFCs 7230–7235 in 2014, which were mostly obsoleted just a few days ago by RFCs 9110–9114. Do not refer to RFC 2616 except as a historical document.

On the particular points:

> 1. It’s just text.

Only partly true. HTTP/1 is a textual protocol so that the raw stream is readable, but HTTP is semantics, not text. (That phrase is the key here.) Case in point: HTTP/2 and HTTP/3, which are binary protocols. In fact, even curl -v is not emitting the raw HTTP text, but a reserialisation of what it has already parsed.

Some dislike HTTP/2 and later because they're no longer textual so that you can't just read the stream, but I say this concern is highly overrated: even HTTP/1 was complex enough in places that treating it as text tended to blow up in your face; better rather to deal with protocol-aware tools (libraries, debuggers, &c.) so that you can interact with it all semantically rather than in serialised form.

I will note that this advice would not hold in all domains: not requiring special tooling is a virtue in general (though it comes at a cost), but HTTP is a place where binary encoding is reasonable because it's so foundational and common that good support for such binary encodings will be widespread. But at the API level, JSON will often be more convenient than protobuf or similar schemes.

So overall I’d say that HTTP is not text, and if, after being aware of HTTP/2 and HTTP/3’s binary encoding (which the author seems to be), one still claimed that HTTP was text, then I’d respond that by that definition just about everything is text.

> Or the complicated 451 Unavailable For Legal Reasons?

It’s not all that complicated, really. There’s an extra related link relation (blocked-by), but beyond that 451 is just a number that indicates intention. See https://www.rfc-editor.org/rfc/rfc7725.html which defines it all.
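
For reference, a 451 response is just a status line plus the optional blocked-by link relation, along the lines of the example in RFC 7725:

    HTTP/1.1 451 Unavailable For Legal Reasons
    Link: <https://spqr.example.org/legislatione>; rel="blocked-by"
    Content-Type: text/html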

> 418 I'm a teapot is an actual status code that was added to HTTP in 1998 as an April Fools joke. There’s now a movement to keep this status code in the spec.

That movement was around five years ago, and it was very quickly resolved in a way that could be seen as either victory or loss: the process began to reserve it, unused. This was finally completed a few days ago in RFC 9110.

> Remember the Robustness Principle, which says servers should “be conservative in what they send and be liberal in what they accept”.

In the context, this is a misapplication of Postel’s law. Postel’s law was about handling out-of-spec behaviour by the other end, whereas the immediate context here is not implementing in-spec behaviour. I also have to note that Postel was wrong, and his so-called robustness principle tends to actually lead to a lack of robustness, and security vulnerabilities. See https://datatracker.ietf.org/doc/html/draft-iab-protocol-mai... (including the delightful name the draft was first given when it was a personal submission: “postel-was-wrong”).

—⁂—

The title of the article is outrageous clickbait and the author should be ashamed of that fact (since it was clearly done deliberately), but I did discover one thing that I didn’t know about: the Clear-Site-Data response header <https://w3c.github.io/webappsec-clear-site-data/>.


> The title of the article is outrageous clickbait and the author should be ashamed of that fact

Agreed. A big part of this blog post is things that every backend, web and even mobile developer should know (except if you are really deep into HTTP like yourself, I presume). However, from the job interviews I've conducted, roughly 90% of backend/web/mobile devs don't know 90% of the facts in this article, which is a shame. And that ironically makes the title not so clickbaity anymore.


I used to ask a lot of questions about HTTP during interviews. It's really surprising how hard backend developers found it to answer them. Even the semantics of the methods, even though REST APIs were already commonly used.


Me here, who knows HTTP and parts of the underlying stack, but doesn't like to use it as-is (I mean in the /api/ realm). In my personal programming I'd rather abstract away from HTTP[S] and use it as yet another communication layer that carries RPC over SSL in a browser, where there is no alternative. Aka the 200-json guy, except for true files/content or lower-level errors. REST's "let's imagine all our calls are bureaucratic statements and directives with vaguely predefined status codes" just doesn't click for me.


The spec allows you to put the full domain in the request line (absolute-form), like:

    GET http://website/path HTTP/1.1
    Host: website2

Some servers, like NGINX, will key on the request line domain (website) to find the logical server block rather than the host header (website2).

This can lead to interesting outcomes.


What I find surprising is:

> Others, like 409 Conflict have such a narrow use-case that I’ve never seen it used in practice.

Very interesting. Almost every API I worked with uses it. The most obvious example: S3.


>Others, like 409 Conflict have such a narrow use-case that I’ve never seen it used in practice.

I use 409 when users of the API try to create a resource that already exists.


(11) HTTP is a highly inefficient protocol and abandoning it would help a lot against global warming


this article is terrible lmfao



