Show HN: Micro HTTP server in 22 lines of C (twitter.com/ilyakurdyukov)
223 points by jpegqs 59 days ago | hide | past | favorite | 100 comments

Reminds me of 2001/cheong [1].

    #include <stdio.h>
    int l;int main(int o,char **O,
    int I){char c,*D=O[1];if(o>0){
    for(l=0;D[l              ];D[l
    ++]-=10){D   [l++]-=120;D[l]-=
    110;while   (!main(0,O,l))D[l]
    +=   20;   putchar((D[l]+1032)
    /20   )   ;}putchar(10);}else{
    c=o+     (D[I]+82)%10-(I>l/2)*
    (D[I-l+I]+72)/10-9;D[I]+=I<0?0
    :!(o=main(c/10,O,I-1))*((c+999
    )%10-(D[I]+92)%10);}return o;}
[1] https://www.ioccc.org/2001/cheong.hint

`curl -s https://www.ioccc.org/2001/cheong.hint | nc termbin.com 9999`


In short: for a 2n-digit input, returns the integer part of its square root (n-digits)

Is there a tool that automatically generates minified code like this, but with ASCII art? Or are these code snippets done by hand?

There are some tools, but in general it is easier to do so manually---you can alter the code when the picture doesn't seem to fit in the template, while tools generally can't.

That's the beauty of the original HTTP: simplicity. Same as parsing HTML (in the 90s). With HTTPS (the S as in Secure) it's a whole different story, and most programmers use some library.

No, parsing HTTP/1.x is a nightmare and definitely not simple. It wasn't even particularly well defined until 2014 when the original RFCs were modernized, and even now there are bugs reported in HTTP parsers all the time.

Node.js came out in 2009, a full ten years after HTTP/1.1 (RFC 2068), and its original http-parser is rather hard to follow, doesn't conform to the RFCs for performance reasons, and is considered unmaintainable by the author of its replacement[0]

As for parsing HTML, well, go look at how Cloudflare has stumbled[1]

[0] https://github.com/nodejs/llhttp

[1] https://blog.cloudflare.com/incident-report-on-memory-leak-c...

I'm the author of the fastest open source HTTP server. Parsing HTTP 0.9, 1.0, and 1.1 is trivial. It's a walk in the park. It only takes about a hundred lines of code to create a proper O(n) parser. https://github.com/jart/cosmopolitan/blob/0b317523a0875d83d6...

The Joyent HTTP parser used by Node is very good, but it's implemented in a way that makes the problem much more complicated than it needs to be. The biggest obstacle with high-performance HTTP message parsing is the case-insensitive string comparison of header field names. Some servers like thttpd do the naive thing and just use a long sequence of strcasecmp() statements. Joyent goes "fast" because it uses callbacks, which effectively punts the problem to the caller, and, for a few select headers which it handles itself, like Content-Length, it uses this really complicated internal "h_matching" thing for doing painstakingly written-out hardcoded character compares. Redbean solves the problem by using better computer science: perfect hash tables, thanks to the gperf command. That makes the API itself much more elegant, since the parser can not only go faster but also return a hash-table-like structure where individual headers can be indexed without performing string comparisons.
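The perfect-hash idea can be sketched without gperf itself. Everything below is illustrative, not redbean's actual code: the header subset, the `GetHeaderId` name, and the hash formula (length plus uppercased first byte, which just happens to be collision-free modulo 8 for this particular set) are all assumptions for the sketch.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strncasecmp */

/* Hypothetical header IDs, for this sketch only. */
enum { kHeaderUnknown, kHeaderHost, kHeaderConnection,
       kHeaderAccept, kHeaderContentLength };

static const struct { const char *name; int id; } kSlots[8] = {
    [1] = {"Content-Length", kHeaderContentLength},
    [4] = {"Host", kHeaderHost},
    [5] = {"Connection", kHeaderConnection},
    [7] = {"Accept", kHeaderAccept},
};

/* Toy perfect hash: (length + uppercased first byte) mod 8 is
   collision-free for this particular header set. A single
   case-insensitive compare then confirms the match, so unknown
   headers cost O(1) instead of a strcasecmp() cascade. */
int GetHeaderId(const char *s, size_t n) {
  unsigned h = (unsigned)(n + (s[0] & ~32)) % 8;
  if (kSlots[h].name && strlen(kSlots[h].name) == n &&
      !strncasecmp(kSlots[h].name, s, n)) {
    return kSlots[h].id;
  }
  return kHeaderUnknown;
}
```

A real gperf-generated table does the same thing with a tuned keyword-derived hash, but the shape of the lookup (hash, then one confirming compare) is the same.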

>I'm the author of the fastest open source HTTP server

Okay, I'll bite: What are you talking about? Fastest under what conditions, using what measurements? What is the project, and where is the analysis? EDIT: you probably mean redbean (https://redbean.dev/) - the source of which is, oddly, embedded within cosmopolitan. However the question about analysis stands.

Messages per second, latency, footprint, you name it. Particularly if it's gzip encoded. See https://youtu.be/1ZTRb-2DZGs?t=717 I'm working on giving redbean the fastest https serving too. Recently I've been helping to make the strongest elliptic curves go 3x faster.

You seem to have written a glorious series of interlocking hacks which yields a small, tight, simple, fast, portable, useful nugget of functionality in a single file. Bravo.

I really like it and will attempt to use it for something real. However, I fear that even this bit of magic doesn't address the central problem of our time, which is software distribution. I believe that the web has solved that problem, and although the web is currently abused by central power, and webapps tend to be thin, animated protocol viewers, it doesn't have to be that way. You've created/discovered a local (maybe global, given real-world limits) minimum of what a binary executable can be, but this only minimizes the pain of traditional software distribution; it doesn't eliminate it.

The real path forward, if I might be so bold, is to make a browser on top of cosmopolitan/redbean, and bring TBL's original dream of a singular client+server http/html runtime to modern fruition - but with additional superpowers that cosmo brings which I don't think TBL anticipated. No doubt some enterprising souls are already working to get Bellard's QuickJS into redbean to mimic node. Then you need window/drawing context, and the rest of the browser, including layout, could be done in (presumably equally tight) JS. Have you given any thought to exposing those drawing syscalls directly instead of delegating to the browser? And if you haven't and are interested, may I suggest Java's AWT v. SWT as an interesting case study in "where the indirection should go".

QuickJS is already ported to Cosmo so adding it to redbean is only a matter of time. If we wanted a browser we could always port the one the SerenityOS guy built. It'd be the best thing since OpenSSH. Don't look to me to do it though. I don't do anything unless I can do it better than all the existing alternatives out there. I can build a better web server. I can build a better executable format. I don't think I can build a better Chrome. Most of the platforms I target don't even have GUIs.

HTTP/1 is deceptive, because you think "oh, I'll just read RFC 2616, make a state machine, and I'm done!". But then writing a browser-compatible HTTP/1 client is a whole other bag of problems that weren't even documented until recently.

For example, Content-Length isn't just a single header with an integer, like the spec says. You need to support responses with multiple Content-Length headers and comma-separated lists of potentially contradictory lengths, and then perform garbage error-recovery the way Internet Explorer or Chrome did. Getting this wrong will make your client hang, consume garbage, or allow response stuffing.
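One defensible recovery policy can be sketched like this: accept a comma-separated Content-Length value only when every member is the same non-negative integer, and treat anything else as malformed. The function name and exact policy are assumptions for the sketch, not any particular browser's behavior.

```c
#include <stdlib.h>

/* Return the length if every comma-separated member of a Content-Length
   value agrees (e.g. "5, 5"), or -1 for anything contradictory or
   malformed (e.g. "5, 6" or "abc"), which should fail the message. */
long ParseContentLength(const char *v) {
  long n = -1;
  while (*v) {
    char *end;
    long x = strtol(v, &end, 10);
    if (end == v || x < 0) return -1; /* not a number */
    if (n != -1 && x != n) return -1; /* contradictory lengths */
    n = x;
    v = end;
    while (*v == ' ' || *v == '\t') v++;
    if (*v == ',') {
      v++;
    } else if (*v) {
      return -1; /* trailing junk */
    }
  }
  return n;
}
```

Note that the failure cases matter as much as the happy path: silently picking the first or last value is exactly how request-smuggling and response-stuffing bugs happen.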


This problem does not exist in HTTP/2.

^ This.

Multiple responses, especially with chunked encodings, are really really hard to get right. Even more so when you have to be able to resume downloads due to socket instability.

I think that implementing a proper state machine for the header parsing with ragel would give a more comprehensive result than using gperf or even the handmade one from your code.

I think there are already some versions of the ragel code online, but they might be for other target programming languages.

I'm one of the authors of Ragel and I disagree with you. HTTP is trivial enough that you'd be better served writing the state machine yourself using a switch statement. See my GitHub link above for an example. The code easily ports to other languages, like Java. Lastly when it comes to Ragel and gperf, they both do two completely different things. Ragel would generate a prefix trie search in generated code which would have enormous code size compared to what gperf is doing, which is much faster. With gperf, you only need to consider exactly O(3) octets total to tell which header it is. After that, it does a single quick string compare to confirm it's one of the predetermined headers rather than some unknowable value.

I apparently never paid enough attention to how the spec has a clearly defined list; I always assumed that a "parser" should handle all valid header-ish-looking pairs.

Based on that assumption, I was thinking that the ragel state machine would generate faster code for the non-happy path (invalid non-ASCII, or other types of error), at least in the GOTO version.

When working on the full list it makes perfect sense to check the minimum amount of bytes for identifying headers, so thank you for the clarifications, very informative. :)

Are you handling the obs-fold? Because this is something the joyent parser falls over at (a streaming parser must buffer the obs-fold to disambiguate). I don't see it in your code. If you handle it your parser can't be O(1) space.

See the remark at the bottom of the Boost Beast parser docs for a hint at the trade off here:


I think the implementation is gold standard:


We are no longer required to support line folding, so redbean doesn't (see RFC 7230 §3.2.4). Do you know what, if anything, needs it?

From the link:

> Line folding is forbidden.

Building an HTTP library myself, I have to disagree.

I'll drop two things here to counter the claim that HTTP is easy to parse: 206 multiple ranges (requests will not always be responded to, and ranges from servers are almost always invalid when resumed) and Transfer-Encodings (br inside gzip inside deflate inside chunked, anyone?)

Both headers are implemented in a spec-incompliant way by every single web server I've seen, including nginx, Apache and others.

Source: building a peer to peer web browser that shares its bandwidth (and cache and downloads) with trusted peers. [1]

[1] https://github.com/tholian-network/stealth

I consider header value parsing and URL parsing part of HTTP; those are certainly not trivial.

The charset problems alone are a nightmare.

Parsing the wire format is pretty breezy. (Don't forget trailers!)

Trailers can be parsed by invoking the function using something along the lines of ParseHttpMessage((struct HttpMessage){.t = kHttpStateName}, p, n) where you just tell the parser to skip the first-line states. Charset isn't a nightmare either. Headers are ISO-8859-1, so you just say {0300 | c >> 6, 0200 | c & 077} to turn them into UTF-8. It's not difficult. It might be if you want to support MIME. But this is HTTP we're talking about. It was made to be simple! We're talking Internet engineering on the lowest difficulty setting. Implement a TCP or SIP stack if you want hard.
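The octal expansion described above can be written out as a small routine. This is a sketch; the function name is mine, and it assumes the input really is Latin-1 header bytes:

```c
#include <stddef.h>

/* Expand ISO-8859-1 bytes to UTF-8: bytes < 0x80 pass through, and each
   byte c >= 0x80 becomes the two-byte pair {0300 | c >> 6, 0200 | (c & 077)}.
   out must have room for up to 2*n+1 bytes; returns the output length. */
size_t Latin1ToUtf8(const unsigned char *in, size_t n, char *out) {
  size_t j = 0;
  for (size_t i = 0; i < n; i++) {
    unsigned c = in[i];
    if (c < 0200) {
      out[j++] = (char)c;
    } else {
      out[j++] = (char)(0300 | c >> 6);
      out[j++] = (char)(0200 | (c & 077));
    }
  }
  out[j] = '\0';
  return j;
}
```

This works because the first 256 Unicode code points are exactly Latin-1, and every code point below U+0800 encodes as two UTF-8 bytes with precisely those bit patterns.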

I think ysleepy might be referring to HTML charsets, which can be set in the HTTP header. They are a nightmare, I have come across HTML documents with so many “if lt IE 9” comments that they move the charset declaration out of the first 1000 bytes, causing browsers to ignore it and interpret as Latin 1. But yes, not actually anything to do with parsing HTTP.

> Node.js came out in 2009, a full ten years after HTTP/1.1 (RFC 2068), and its original http-parser is full-on spaghetti code, doesn't conform to the RFCs for performance reasons, and is considered unmaintainable by the author of its replacement

That's because of the way the parser is written. There are other simpler parsers that are much more readable.

The fact that someone wrote a parser that's hard to follow doesn't mean that parsing HTTP/1.x is extremely difficult. What is really hard is to construct a parser that is at the same time (1) fast, (2) complete, (3) secure. It is much easier to choose just two, compare e.g. the one based on Nginx[0] vs picohttpparser [1].

[0] https://github.com/Samsung/http-parser/blob/master/http_pars...

[1] https://github.com/h2o/picohttpparser/blob/master/picohttppa...

for the basics, however http/1.x is pretty simple. you can test webserver health by literally typing in the request.

i suspect the complexity you speak of is similar to MIME. where SMTP/POP/IMAP are pretty simple, things got pretty hairy with the introduction of MIME, SASL and friends.

i think, though, that most of the complicated stuff in http is optional, is it not? like if you don't send a header that compression is supported, the server won't compress... or am I misremembering?

either way, simpler to understand from a packet capture than a grpc stream or spdy/http2 stream.

Pretty much everything is optional if you stick to http/1.0. If you implement http/1.1 then you're required to do a lot of non-essential stuff like chunk encoding, pipelining, and provisionals which themselves are reasonably trivial too but they make the server code less elegant. If you want a protocol that's actually hard, implement SIP.
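For a flavor of why chunk encoding is "reasonably trivial but less elegant", here's a sketch of decoding a complete in-memory chunked body. Names are mine; trailer fields and chunk extensions are deliberately skipped, and the input is assumed NUL-terminated:

```c
#include <stdlib.h>
#include <string.h>

/* Decode a complete chunked transfer coding held in memory into out.
   Returns the decoded length, or -1 on malformed input. Skips the
   trailer section for brevity; out must hold at least n bytes and
   in must be NUL-terminated. */
long DecodeChunked(const char *in, size_t n, char *out) {
  size_t i = 0, j = 0;
  for (;;) {
    char *end;
    long len = strtol(in + i, &end, 16); /* chunk-size in hex */
    if (end == in + i || len < 0) return -1;
    i = (size_t)(end - in);
    if (i + 2 > n || in[i] != '\r' || in[i + 1] != '\n') return -1;
    i += 2;
    if (!len) return (long)j; /* last-chunk: size zero */
    if (i + (size_t)len + 2 > n) return -1; /* truncated body */
    memcpy(out + j, in + i, (size_t)len);
    j += (size_t)len;
    i += (size_t)len;
    if (in[i] != '\r' || in[i + 1] != '\n') return -1;
    i += 2;
  }
}
```

A streaming parser has to turn this loop inside out into a resumable state machine, which is where the elegance goes to die.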

I once heard that it’s impossible to build a “spec compliant” IMAP4 library as the spec itself is contradictory. Don’t have a reference to prove it, so I could be wrong.

Somewhat related, but in the Python space of things: I love that Python has a standard for web frameworks, so much so that you can build your own web framework that targets said standard, and it can be deployed anywhere without getting lost in the weeds of parsing HTTP. For example, FastAPI is directly an ASGI-compliant framework, and it is known as one of the fastest Python web frameworks out there. Bottle, I think, is also a raw WSGI framework, and it's all in one file. (ASGI is the natural progression of WSGI; think of it like the http package Rust wants to standardize.)

The whole idea behind Node.js was to write a super-efficient completely nonblocking http server in C, while keeping all the business logic in a simple scripting language.

You should not expect the Node.js parser to be simple.

Seems like it's yet another example of the node ecosystem being amateur hour, rather than a problem with HTTP.

HTTP/1.x is anything but simple. It was underdefined and overly complex in many ways. The original RFC was so complex that, when reworked, it was split into six documents.

I've worked heavily on some HTTP implementations and it's ridiculously hard to get them right.

Not to mention, this "server" only responds to a simple, well-formed GET request, without handling about 90% of what the HTTP specifications cover. It's a nice project, but it doesn't speak to the simplicity of HTTP.

I agree. As with many things, it's only simple as long as you ignore the complexities. As they say, the devil's in the details.

> this "server" only responds to a simple well formed GET request.

And not even that. The Request-URI in a Simple-Request line (inherited from HTTP/0.9) may contain escape characters. (e.g. `GET /my%20file.txt` to get `my file.txt`) HTTP/1.0 states "The origin server must decode the Request-URI in order to properly interpret the request."[1] This server does not.

Which is not to say that this server isn't interesting. Just that it's not a demonstration of how easy HTTP/1 is to parse.

[1]: https://www.w3.org/Protocols/HTTP/1.0/spec.html#Request-URI
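A minimal in-place decoder for the requirement quoted above might look like this (a sketch; the function names are mine):

```c
/* Hex digit to value, or -1 if c is not a hex digit. */
static int HexVal(int c) {
  if (c >= '0' && c <= '9') return c - '0';
  c |= 32; /* lowercase */
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}

/* Percent-decode a Request-URI path in place, e.g. "/my%20file.txt"
   becomes "/my file.txt". Returns 0 on success, -1 on a malformed
   escape like "%zz" or a truncated "%2". */
int PercentDecode(char *s) {
  char *w = s;
  while (*s) {
    if (*s == '%') {
      int hi = HexVal((unsigned char)s[1]);
      int lo = s[1] ? HexVal((unsigned char)s[2]) : -1;
      if (hi < 0 || lo < 0) return -1;
      *w++ = (char)(hi * 16 + lo);
      s += 3;
    } else {
      *w++ = *s++;
    }
  }
  *w = '\0';
  return 0;
}
```

Note that decoding has security consequences too: it must happen before any path filtering, or `%2e%2e` sails past a literal ".." check.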

But HTTPS just adds TLS. You can use "some library" to do the TLS handshake and subsequent encryption, and end up with a readable-writable stream that you can then parse HTTP from yourself. Your code is the same as when it was dealing with a TCP stream directly.

There's nothing simple about HTTP. It looks like it should be simple, but it isn't.

The ASCII-art formatted version is pretty nice looking.

I was going to say that I don't, however, get why the "almost readable version" is weirdly formatted. But then I ran it through clang-format and it looked the same, and I saw that it's because the code does lots of things on the same line; it is not for lack of whitespace that it looks so messy.

In conclusion, the “almost readable version” is exactly what it should be in this case.

I know this is just for fun and not intended for production use. But what could be potential exploits and vulnerabilities in this server?

Regarding availability: it only handles a single connection, and has no timeouts. If someone just connects and does nothing else, the server will be unavailable.

For the interested: this is called (or at least similar to) a slow loris attack: https://en.wikipedia.org/wiki/Slowloris_(computer_security)

I tried to make it secure and protect from such things. If someone finds vulnerabilities, please let me know.

use strncpy() instead of strcpy()

It's calculated that strcpy() should never cause a buffer overflow here.

You should still never use functions which have well known security flaws if there is a widely available alternative which avoids the flaws. Secure programming isn't just about calculating whether your current code has a bug, it's also about writing code that avoids bugs.

If I look at the code as posted then "it uses strcpy instead of strncpy" is very low on the list of "problems".

"Problems" in quotes because, you know, this is an IOCCC entry. You're taking a joke way too seriously.

The author literally asked for security advice, and then ignored it. I'm trying to explain why one should not just ignore it. There's a lot of novice programmers who read these threads and might think it's perfectly fine to use strcpy (outside of IOCCC submissions). And by the way, who the hell cares about security vulns in IOCCC submissions anyway? It's not supposed to be secure, it's supposed to be obfuscated.

I don't think anyone asked for free advice from a foul-mouthed anonymous throwaway on how to secure their computer. If I was building a website I'd want to secure it from you not with you.

so every secure programming C book ever written that tells people to never use strcpy() is wrong? please explain...


strncpy() isn't dangerous. People have their heads so twisted around muh security that they don't even know what the function was intended to do. The purpose of strncpy() is to prepare a static search buffer so you can do things like perform binary search:

    static const struct People {
      char name[8];
      int age;
    } kPeople[] = {
        {"alice", 29},  //
        {"bob", 42},    //
    };

    int GetAge(const char *name) {
      char k[8];
      int m, l, r;
      l = 0;
      r = ARRAYLEN(kPeople) - 1;
      strncpy(k, name, 8);
      while (l <= r) {
        m = (l + r) >> 1;
        if (READ64BE(kPeople[m].name) < READ64BE(k)) {
          l = m + 1;
        } else if (READ64BE(kPeople[m].name) > READ64BE(k)) {
          r = m - 1;
        } else {
          return kPeople[m].age;
        }
      }
      return -1;
    }
It was a really common practice back in the 70's and 80's when the function was designed for databases to use string fields of a specific fixed length.

> strncpy() isn't dangerous

Suppose the C specification said that string constants are automatically null terminated unless they are a certain size that is platform-dependent. At that given size the null is not added. (And let's say above that size there's a compiler error. Let's also say there's a pragma for telling the compiler you want a bigger limit on the maximum string constant size.)

Would that behavior be dangerous in your opinion?

like what exactly?

Like not always making an ASCIIZ string even though it appears to belong to a family of functions that operate on and return ASCIIZ strings.

Returning an object of species A in one case and an object of species B in other cases is just a borderline buggy behavior, let's be real.

GET ../../../etc/passwd

I have provided protection against this.

I don’t think it compiles on windows (netdb.h doesn’t exist there, I think), so you’re fine there, too, from a security viewpoint.

However, if somebody did a quick and dirty “make it compile” port (include winsock2.h instead and, possibly, replace some functions/argument types), I think that would create security vulnerabilities because the fopen on Windows might support using backslashes as path separators.

Even if it doesn’t, there’s UNC paths (https://en.wikipedia.org/wiki/Path_(computing)#Universal_Nam...) to worry about.

That made me wonder whether other OSes might have similar features. Reading https://pubs.opengroup.org/onlinepubs/007904975/basedefs/xbd..., I’m not sure that forbids Unix from doing something similar. It says

“A pathname that begins with two successive slashes may be interpreted in an implementation-defined manner, although more than two leading slashes shall be treated as a single slash.”

That opens the door for doing special things for paths that start with //, for example by supporting “//machine:foo/bar/baz” on clusters.

I don't think anyone can blame someone for having security vulnerabilities on a platform that they explicitly do not support.

I made a Windows port on github; you can try to exploit that.

".//machine:foo/bar/baz" - will it work if the dot is added at the beginning?

GET //etc/passwd HTTP/1.0

I noticed the check for "/." in the path and I chuckled.

I think if you strip all leading forward slashes and check for "/." then you may have it be "secure" on linux at least.

Your best bet for fixing this on windows is making sure the code never compiles on windows because god knows what on earth the path handling is like on windows.
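The strip-leading-slashes suggestion above can be sketched like so. This is Unix-only reasoning, and `SanitizePath` is a name invented for the sketch:

```c
#include <string.h>

/* Strip leading '/' so "//etc/passwd" can't escape the serving root,
   then reject any "/." so ".." traversal and dotfiles are refused.
   Returns the relative path to serve, or NULL to send a 404. */
const char *SanitizePath(const char *p) {
  while (*p == '/') p++;            /* "//etc/passwd" -> "etc/passwd" */
  if (*p == '.') return NULL;       /* ".." or a hidden file up front */
  if (strstr(p, "/.")) return NULL; /* "a/../b", "a/.git", ... */
  return p;
}
```

The resulting path is relative, so the OS resolves it beneath the server's working directory. This says nothing about Windows backslashes, UNC paths, or percent-encoded dots, which is the point made elsewhere in the thread.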

Good point. I need to add something like b[3]='.', and then use b+3 as the path, not b+5, to fix that vulnerability.

Update: I fixed this on github.

And now it compiles on Windows, at least with Msys2. I'm updating the code on github.

Looks like you didn't even bother to test your claim.

GET %C0%AE%C0%AE/%C0%AE%C0%AE/%C0%AE%C0%AE/etc/passwd

It won't be a problem in this code because "%" is not handled, but it could be a vulnerability for more complete HTTP server implementations. I had the same idea a few days ago.

sandbox it? e.g. docker, OpenBSD chroot?

That's an environmental thing, the program itself can't protect against the class of attacks those sorts of environmental setups protect against.

that's not cross-platform tho. it should still be secure even if it was running on MS-DOS 5.0

On the line after the printf, where it looks like they're getting status strings for returns… it looks like there are three ternary options. Is that right? How does that work?


I can explain it:

  m = n
    ? /* if (n != 0) */
      /* adds index.html if path ends with "/" (means the filename is omitted), otherwise copies zero */
      /* log the requested filename to stdout */
      /* if "/." is in the path or an error occurred while opening the file */
        ? "404 Not Found"
        : "200 OK"
    : "501 Not Implemented"; /* if (n == 0) */
By filtering filenames with "/." I prevent exploits with ".." and also don't allow to read files starting with a dot, these are hidden files in Unix-like OS.

What about "GET //etc/passwd"?

Just fired up the server and that does indeed break it. I suppose openat2 with RESOLVE_BENEATH and AT_FDCWD would be a bullet-proof fix, but that's not very codegolf.

Yes, that's a vulnerability, I have fixed it on github.

Actually, it's the job of the operating system to handle file system authorizations. It's just that we have shitty default configurations for operating systems, which allow a lot of ambient authority.

Huh, any reason to use printf("%s\n",...) instead of puts?

Thanks, I forgot about that. It would be very helpful to have some extra space for fixes.

Ah. I see, figured it might be that but it was tough to read.

Ok, so sometimes I think I know C pretty well, then I’ll see lunatic code like this and realize I Do Not! Thanks for the answer and reformat.

I was surprised to read that this is actually a totally valid HTTP/1.1 response, according to the RFC. The only thing you need is the status line (HTTP version, status code, status message, CRLF) and then the message body.

Things sure have come a long way.
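Spelled out, the minimal well-formed response is just a status line plus the CRLF that ends the (empty) header block, then the body. A sketch (the function name is mine):

```c
#include <stdio.h>

/* Build the minimal well-formed HTTP/1.1 response into buf: status
   line, the CRLF that terminates the empty header section, then the
   body. Returns the number of bytes written. */
int MinimalResponse(char *buf, const char *body) {
  return sprintf(buf, "HTTP/1.1 200 OK\r\n\r\n%s", body);
}
```

Note there are two CRLF pairs in total: one ending the status line, one ending the (empty) header list.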

It's neat, but I don't believe it is a compliant implementation of HTTP/1.1 (or 1.0). For example, it does not handle percent-encoded characters in the request URI.[1][2]

[1]: https://datatracker.ietf.org/doc/html/rfc7230#section-3.1.1

[2]: https://www.w3.org/Protocols/HTTP/1.0/spec.html#Request-URI

Two CRLF pairs (one to terminate the status line, one to terminate the (empty) headers), which this is one CR short of. Trivially fixable, though it'd mess up the P slightly…

Thanks for noticing this, I fixed it on github. However, it seems that browsers simply ignore CRs, so "\n\n" is enough.

Yeah, I would expect browsers to do a number of non-standard things. It is possible with most of them to construct & send malformed requests.

It's only that short because they've shoved a bunch of statements onto the same line.

It says "22 lines of C", not "22 statements of C".

For this type of exercise it is assumed that some readability is going to be lost... just look at Perl golf competitions. Those tend to be written in a single line, and it's not always a given that you'll even be able to tell where statements start.

Yes, and bragging about the line count is completely meaningless when you're arbitrarily merging lines.

That's the whole point of IOCCC. The way code is formatted is as important as the way it functions.


Come on man, it's Saturday. It's fine.

And it's a "Show HN" post.

Pretty nice obfuscated C too. It's art, not serious.

Sure, IOCCC entries are exercises in futility, in the same way that breaking a speedrunning record achieves nothing real-world useful. But that doesn't mean it isn't spectacular and damn impressive.

Aside from using small variable names and odd whitespace, it isn't particularly obfuscated.

This is what I am arguing about with another IOCCC winner: what can be called obfuscation, and where its boundaries are.

That's what I was going to say. If nondescript variable names and poor use of whitespace is obfuscation, a few of my friends could submit code they write every day.

Come on there is no obfuscation here, you can literally read the code without issue. The only attempt seems to be 80*101 for 8080.

It's cool if you can read code like this without issue. I'm chasing Kolmogorov complexity, rather than obfuscation. I add things like this to fill gaps in a specific shape.

Everything is just dirt.

And anyone who ever played a part

Oh, they wouldn’t turn around and hate it

Beauty is in the eye of the beholder

Also, if you're willing to #include <microhttpd.h>, you can write a useful, safe (well-tested) HTTP server in a similar number of lines.

Why not #include<stdlib.h> and just run system("apache2")

Tried it, didn’t work. (Now I want to try system(argv[0]) for some reason…)

That would defeat the whole point of the post!

But microhttpd is fine if you want a minimal server; its way of handling POST bodies is weird, though.
