
Fast Finite State Machine for HTTP Parsing (2014) - userbinator
http://natsys-lab.blogspot.com/2014/11/the-fast-finite-state-machine-for-http.html
======
userbinator
State machines are really the ideal application of goto: the semantics map
directly (go to this state), which makes the flow of the code very clear. The
performance benefits are nice too, so in that sense I think it's beneficial
for both the machine and the programmer.

It'd be funny to see the results of an implementation with "proper OOP design
using the State design pattern"; I have a feeling it could be even worse than
the original switch/loop.
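
For illustration, here is a rough sketch of the idea (not the article's
parser; the function and its names are made up): each label is a state and
each goto is a transition, so the control flow reads like the state diagram.

    /* Skips one token and the run of spaces after it. */
    static const char *skip_token_and_spaces(const char *p, const char *end)
    {
    in_token:
        if (p == end) return p;
        if (*p == ' ') { p++; goto in_spaces; }
        p++;
        goto in_token;

    in_spaces:
        if (p == end) return p;
        if (*p == ' ') { p++; goto in_spaces; }
        return p;   /* p now points at the start of the next token */
    }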

~~~
chubot
I've been using re2c lately and it's really great. It's this gem from the 90's
that not a lot of people know about; I encountered it in the Ninja build
system, where they were concerned about parsing megabytes of text very
quickly.

It basically generates a bunch of gotos from readable regexes.

[http://re2c.org/](http://re2c.org/)
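
For anyone curious, the input is ordinary C with a special comment block that
re2c expands into the goto-based scanner. A minimal sketch (function name made
up; assumes NUL-terminated input with bounds checking disabled for brevity):

    /* scan.re -- generate C with: re2c scan.re -o scan.c */
    static int lex_method(const char *YYCURSOR)
    {
        const char *YYMARKER;
        /*!re2c
            re2c:define:YYCTYPE = char;
            re2c:yyfill:enable  = 0;

            "GET"  { return 1; }
            "POST" { return 2; }
            *      { return 0; }
        */
    }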

One thing I haven't explored is "push vs pull" scanners. It's basically the
same distinction as threads vs. async, so it could be really useful for async
servers. A normal scanner pulls data in when it needs it, but if you pass -f,
it can apparently have data pushed into it instead.
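
Roughly, the difference looks like this (a hypothetical interface for
illustration, not re2c's actual -f output): a push scanner keeps its state in
a struct and resumes when the caller feeds it the next chunk, which is exactly
what an async server wants.

    #include <stddef.h>

    /* Looks for the blank line ("\r\n\r\n") that ends the HTTP headers. */
    struct hdr_end_scanner { int state; };   /* chars of "\r\n\r\n" matched so far */

    /* Feed one chunk; returns 1 once the end of the headers has been seen. */
    static int hdr_end_push(struct hdr_end_scanner *s, const char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            char c = buf[i];
            switch (s->state) {
            case 0: s->state = (c == '\r') ? 1 : 0; break;
            case 1: s->state = (c == '\n') ? 2 : (c == '\r') ? 1 : 0; break;
            case 2: s->state = (c == '\r') ? 3 : 0; break;
            case 3: s->state = (c == '\n') ? 4 : (c == '\r') ? 1 : 0; break;
            }
            if (s->state == 4) return 1;
        }
        return 0;   /* need more data: caller pushes the next chunk later */
    }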

I guess Ragel can be used easily like this since Mongrel is async (?)

------
zzzcpan
> There was successful attempt to generate simple HTTP parsers using Ragel,
> but the parsers are limited in functionality.

They are not limited in any way. You are free to combine as many state
machines as you want, with logic as convoluted as you want. It just takes some
time to learn Ragel.
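
For instance, a minimal sketch of how small machines compose in Ragel
(illustrative only; names are made up, and actions and error handling are
omitted):

    /* request_line.rl -- generate C with: ragel request_line.rl */
    #include <stddef.h>

    %%{
        machine request_line;

        crlf    = "\r\n";
        method  = upper+;
        uri     = (any - space)+;
        version = "HTTP/" digit "." digit;

        # smaller machines combine with ordinary regex-like operators
        main := method ' ' uri ' ' version crlf;

        write data;
    }%%

    /* Returns 1 if buf[0..len) is a well-formed request line. */
    static int parse_request_line(const char *buf, size_t len)
    {
        const char *p = buf, *pe = buf + len;
        int cs;

        %% write init;
        %% write exec;

        return cs >= request_line_first_final;
    }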

------
js8
I think that [http://www.ucw.cz/holmes/](http://www.ucw.cz/holmes/) has (or
had) an HTTP parser based on perfect hashing of strings, going back maybe 15
years. It was written by some very smart people, and there are a lot of tricks
like that in those sources.

------
MichaelGG
This is another reason these text-style protocols are poorly designed. On top
of ending up ambiguous in practice and not really being standard because of
"liberal implementations", we get this terrible performance to boot.

SIP has the same issue, though most VoIP companies don't have much bandwidth,
so you can just saturate them. But if they manage to filter that, overloading
the actual system is not a difficult task.

~~~
chubot
Design has many dimensions; performance is only one of them.

Text protocols tend to grow ambiguities, but that's because they're evolvable.
That's a feature, not a bug. A binary protocol would likely have already
collapsed under its own weight. (Name a binary application-level protocol
that has achieved adoption as wide as HTTP's.)

HTTP also wasn't designed from scratch. The message format is shared with
e-mail (RFC 2822, RFC 822), presumably so people could reuse existing code.
HTML wasn't designed from scratch either; it was based on SGML. There are lots
of good reasons that e-mail was text.

You're basically arguing against nature. Humans ostensibly design protocols,
but it's really the laws of evolution and economics (nature) that determine
which ones survive. It's like complaining that human knees and backs are
fragile... yeah that's true, but there's nothing you could have done about it.

~~~
magila
> That's a feature, not a bug.

No, it's definitely a bug. You can have extensibility without ambiguity; see
the examples below...

> Name a binary application level protocol that has achieved adoption as wide
> as HTTP

TLS, SSH, DHCP, DNS

~~~
chubot
Why does SSH have v1 and v2? Honest question, I don't know the protocol. I
would guess it's because v1 was too brittle to be upgraded.

SSH and TLS have very few implementations... and there also seem to be
incompatible versions there, at least with SSL 2.0 and 3.0. That approach
might work for the ecological niche of those protocols, but I don't think it
would have worked for HTTP.

There are orders of magnitude more HTTP implementations than DNS, DHCP, TLS,
SSH, etc. That's one reason for its wide adoption.

~~~
magila
The different versions of SSH and TLS came about because the protocols'
semantics needed to be revised in non-backwards compatible ways to resolve
security issues. This has nothing to do with the format of the protocol. I
would remind you that HTTP has also had similar revisions (0.9, 1.0, 1.1).

Hyperbole about "orders of magnitude" aside, I'm not sure having a very large
number of implementations should be seen as a plus. Protocols like SSH, DNS,
and DHCP have relatively few implementations because the vast majority of use
cases are covered perfectly well by a handful of existing libraries and tools.
By comparison HTTP implementations tend to be ad-hoc and tightly coupled.

~~~
chubot
HTTP 1.1 is backward compatible with 1.0, and I'm pretty sure 0.9 is too. I
think with SSH and SSL they basically have an (if version) statement at the
top level and then enter a completely different codebase for each version.
They can get away with this because there are so few implementations.

With HTTP you had a more graceful upgrade. They made "mistakes" but were able
to accommodate them within the existing structure.

------
nickpsecurity
Two other benefits to the FSM-based methods:

1. Easier to apply formal verification, or at least model checking, to the
parsers.

2. Synthesis methods are available to directly generate hardware logic from
several types of FSMs.

So, using the FSM methods gives an immediate benefit plus potential for future
benefits.

------
amelius
Of course, it is way more elegant to use coroutines for parsing stream data.

