
Network protocols, sans I/O - kawera
https://sans-io.readthedocs.io/
======
nv-vn
This idea is the basis for one of the popular HTTP libraries in OCaml, cohttp
[1]. The core library (see the lib/ directory in the source code) implements
the HTTP protocol, but makes no attempt to perform IO. The file lib/s.mli
describes the interface that may be used to perform the IO for the library,
making use of OCaml's module system to work generically over multiple
interfaces. What's interesting about this approach is that it almost
"generates the code" for IO once you plug it in to the implementation code
(i.e. all you have to do is provide a handful of very basic functions and
types and it will make use of those to construct the rest of the library).
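For readers who don't speak OCaml, the shape of the idea can be sketched in Python (everything below is a hypothetical illustration, not cohttp's actual API): the protocol core only transforms bytes, and a generic loop "generates" the IO once you hand it any object providing `read()`/`write()`.

```python
# Hypothetical sketch: the protocol core knows nothing about sockets or
# files -- it just turns incoming bytes into events (here, complete lines).

class EchoProtocol:
    def __init__(self):
        self._buffer = b""

    def receive_data(self, data: bytes) -> list:
        """Feed raw bytes in; get complete lines (events) out."""
        self._buffer += data
        lines = self._buffer.split(b"\n")
        self._buffer = lines.pop()  # keep any trailing partial line
        return lines


def run(protocol, io):
    """The 'generated' IO loop: works with anything that has read()/write()."""
    while True:
        data = io.read(4096)
        if not data:
            break
        for line in protocol.receive_data(data):
            io.write(line + b"\n")
```

The same `run` loop would work unchanged over sockets, files, or in-memory buffers, which is the point of keeping IO out of the core.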

[1] - [https://github.com/mirage/ocaml-cohttp](https://github.com/mirage/ocaml-cohttp)

------
allan_s
So I guess it will integrate well with uvloop [http://magic.io/blog/uvloop-blazing-fast-python-networking/](http://magic.io/blog/uvloop-blazing-fast-python-networking/)? (which is lacking an http parser :) )

May I dream that with this + uvloop + asyncpg [0] we can start to have something for building simple web services à la Tornado that leverages Python 3.5?

[0] [http://magic.io/blog/asyncpg-1m-rows-from-postgres-to-python...](http://magic.io/blog/asyncpg-1m-rows-from-postgres-to-python/)

~~~
RubyPinch
[https://github.com/MagicStack/httptools](https://github.com/MagicStack/httptools)
magicstack already did their own http parser

------
dividuum
I used that approach for my minecraft protocol parsing library
([https://github.com/dividuum/fastmc](https://github.com/dividuum/fastmc)).
It's indeed very useful to decouple parsing a protocol from the underlying
transport. I used my library both for socket communication as well as to read
and write files. I'd love to see more libraries implemented that way. So
thanks for posting that link.

I imagine it might be difficult sometimes depending on the protocol. IIRC TLS
might sometimes require writes while reading.

~~~
oxymoron
On the topic of IO. I'm actually a fan of java.io.Reader/Writer and
java.io.InputStream/OutputStream. It seems to me that most java protocols and
parsers are quite composable by default due to these simple little
abstractions. Granted, they might not be a good fit for async IO, but I'd
still argue that they've held up quite well during the past fifteen years.
I'll go out on a limb and claim that standard C++ IO hasn't worked out quite
as well, although it might just be my skip-the-oo-and-template-everything
skepticism. Rust seems closer to Java, although I feel like the jury's still
out. I'm not enthusiastic about doing everything asynchronously as in Node.
Haskell does seem to get the composability right.

Python has some of Java's thinking through a standard convention on the method
names for reading and writing (an interface!), but departures from those
simple rules seem common when people get creative. From that
perspective, an initiative like this is quite understandable and convenient.
Nevertheless, I don't seem to feel the need in the same way in certain other
languages.

~~~
paulddraper
A little off-topic but IMO one of the more unfortunate decisions by Java was
Reader/Writer and InputStream/OutputStream, instead of CharReader/CharWriter
and ByteReader/ByteWriter (or similar).

------
wtbob
From the description, it _sounds_ like the necessity is the lack of a single
way to do I/O in Python. If that's right, then this shows the real advantage
of Go's io.Reader/Writer/&c. interfaces, which enable this sort of
composability.

From my Python years, I'd think that something built around mandatory
read()/write() methods could do much the same.

~~~
Lukasa
I think that's a tempting conclusion to draw, but it's not quite right. The
standard interfaces of Go reduce the pain, but the design principle ("Don't do
I/O in parsers or state machines") remains solid.

The reason for this is basically that I/O and protocol logic are separate
concerns, and whenever they start influencing each other too much they impose
costs on each other.

The best example is actually testing. If your protocol code includes calls to
Golang's `Reader`/`Writer` interface methods, it causes a few problems.

The easiest thing to see is that it causes testing problems. For example, for
each call to `Reader`/`Writer` methods, in addition to testing all possible
reads/writes (a protocol concern), you need to test all possible I/O failures
(timeouts, closed connections, weird kernel problems) in order to actually
cover the complete failure space.

However, if your code doesn't have reads/writes mixed in with protocol logic,
your testing scenarios are much easier. Bytes just come in and go out.
Reading/writing problems aren't an issue.

This is just basic separation of concerns stuff, but it really does help, even
in languages with "blessed" I/O mechanisms.
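To make the contrast concrete, here is a hedged sketch (invented for illustration, not from any real library) of the same small task, assuming a newline-delimited protocol, written both ways:

```python
import socket

# Version A (hypothetical): parsing entangled with I/O. Every recv() can
# time out, return b"" on close, or raise, so tests must simulate all of
# those failures at every call site.
def read_line_io(sock: socket.socket) -> bytes:
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("peer closed mid-line")
        buf += chunk
    return buf


# Version B: pure protocol logic. A test just passes bytes and checks
# bytes; timeouts and closed connections belong to the I/O layer, which
# gets tested once, separately.
def split_line(buf: bytes):
    """Return (complete line or None, remaining buffer)."""
    line, sep, rest = buf.partition(b"\n")
    return (line, rest) if sep else (None, buf)
```

Version B's entire failure space is "not enough bytes yet", which is just another return value.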

~~~
jjnoakes
I'm not sure I follow.

Why not test with something that implements the reader or writer interface,
but shuffles bytes around in memory? That should alleviate the testing
explosion.

And whatever interface you do have must be putting bytes in or getting bytes
out of the protocol layers. Why not call that interface reader or writer?

I'm not seeing the distinction.

~~~
Lukasa
Sorry, let me be clearer.

The reason that just having an in-memory Reader or Writer doesn't solve the
problem is that the failure modes don't match up. An in-memory reader/writer
has basically no failure modes beyond ENOMEM. That's why in the no-I/O
implementation, this is exactly what we use: write to an in-memory buffer.

Real I/O, on the other hand, has many failure modes. For example, consider
timeouts. If your parser does I/O, you need to test timeouts at _every_
location that your parser does I/O. You need to confirm it handles those
timeouts appropriately. And you need to decide what "appropriately" means
here: do you retry? Do you abort? Do you attempt to unwind that state
transition?

All of these are expansions of your state space. This means your protocol
parser has to handle this combinatorial explosion of possible outcomes: at
every point you have a Read/Write you need to be ready and prepared to handle
all possible error conditions that can come out of that.

If your parser does no I/O, though, and only writes to buffers, this problem
does not exist. That allows you to have two totally isolated sections of code:
one part manipulates bytes in memory (the parser), and another bit is
responsible for getting those bytes to and from the network. Each can be
tested separately. If we need `n` tests for the no-I/O parser and `m` tests
for the I/O layer without the parser, the combined code requires `n * m`
tests to achieve equivalent logical coverage of the possibility space.

Small, isolated components are good.

~~~
jjnoakes
Oh. So it is less about I/O vs no-I/O and more about push parsing vs pull
parsing.

Because testing with a reader and writer interface lets you test errors too,
but now you are talking about error recovery strategies (pull has to pass
through or have smarts, push can know nothing).

I agree in many cases push has the advantages being discussed. I just wouldn't
have called it no-I/O since that doesn't really have the right connotation.

------
clemensley
Can someone point to a good resource for best practices in protocol design? Am
currently faced with this problem and would like to avoid having to reinvent
the wheel...

~~~
jfoutz
Gosh, I don't know of anything off the top of my head.

The only real tip I have is to byte-length-encode variable-length requests,
because scanning strings sucks.

Having a frame for messages is nice: stick your fixed-width stuff up front,
and create fixed-width length indicators. It sucks that you waste a little
space with 0's, but avoiding the scan is pretty wonderful.

    
    
        GET /foo HTTP/1.1
    
        GET HTTP 01.10 0004/foo
    

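The fixed-width frame above (`GET HTTP 01.10 0004/foo`) could look roughly like this in Python; the field widths here are invented for illustration:

```python
# Invented layout: method padded to 4 bytes, version padded to 5, then a
# 4-digit zero-padded payload length -- no delimiter scanning anywhere.

def encode(method: str, version: str, payload: bytes) -> bytes:
    header = b"%-4s%-5s%04d" % (method.encode(), version.encode(), len(payload))
    return header + payload


def decode(frame: bytes):
    method = frame[0:4].decode().rstrip()
    version = frame[4:9].decode().rstrip()
    length = int(frame[9:13])          # fixed offset: no scan needed
    return method, version, frame[13:13 + length]
```

Decoding is pure slicing at known offsets, which is exactly the "avoid the scan" win described above.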
Some things need to happen in order (you need to be authenticated before being
granted access to resources), but that doesn't necessarily require the client
knowing the state at each step.

Traditionally you'd have something like

    
    
        HELO jfoutz hunter2
            OK
        REQUEST 1
            OK <bytes for resource 1>
        REQUEST 2
            OK <bytes for resource 2>
    

but really, the server knows if you're authenticated, so I could send

    
    
        HELO jfoutz hunter2
        REQUEST 1
        REQUEST 2
    

and then get back either

    
    
        OK
        OK <bytes for resource 1>
        OK <bytes for resource 2>
    

or

    
    
        NO
        NO
        NO
    

Or whatever. Generally it's just going to be a bunch of request/response
pairs. Sometimes you can't make later requests without actually looking at a
response; you don't know what css or images to request until the html is
parsed, for example. Usually, you can request a bunch of stuff at once, and
it'll work out however it works out.

~~~
userbinator
_The only real tip i have is to byte length encode variable length requests,
because scanning strings sucks._

This is really, _really_ good advice to use whenever possible: it means
clients can determine a priori how much data they need to read, and perhaps
decide whether the length is valid and/or allocate sufficient memory. One of
my annoyances with the traditional "almost-text-based" protocols like HTTP,
FTP, SMTP, etc. is that parsing them is not trivial and often you need to keep
reading until you hit the delimiter or reach an internal limit. In contrast,
"read a length, then read _length_ bytes" makes for simple and efficient
implementation.

There are certainly cases when the amount of data cannot be determined ahead
of time, and in those cases I'd suggest chunked-length-prefixing; delimiters
are really a method of last resort.
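A hedged sketch of the "read a length, then read _length_ bytes" pattern, written incrementally so partial reads are fine. The 4-byte big-endian prefix and the sanity limit are assumptions for illustration, not any particular protocol:

```python
import struct

class LengthPrefixedReader:
    """Feed arbitrary chunks in; get complete length-prefixed messages out."""

    def __init__(self, max_length: int = 1 << 20):
        self._buf = b""
        self._max = max_length   # reject bogus lengths before buffering them

    def feed(self, data: bytes) -> list:
        self._buf += data
        messages = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack_from(">I", self._buf)
            if length > self._max:
                raise ValueError("declared length %d exceeds limit" % length)
            if len(self._buf) < 4 + length:
                break                      # wait for more bytes
            messages.append(self._buf[4:4 + length])
            self._buf = self._buf[4 + length:]
        return messages
```

Note that validating the declared length up front is exactly the "decide whether the length is valid" advantage mentioned above.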

------
JoshTriplett
This nicely documents the "why", as does the linked PyCon talk. But neither
goes into much detail on the "how", making it difficult to follow suit.

~~~
Lukasa
Hi, I'm the linked PyCon speaker!

It'd be good to know what additional information you'd like on the "how". I
think at a high level I addressed that in my talk, but if it didn't make it
across or you needed more information I'd love to know what you need. Ideally
I'll turn this into a blog post at some point so it'd be great to have an idea
of what extra info is needed.

~~~
JoshTriplett
You definitely addressed it at a high level (I saw your talk at PyCon), but
I'd like to see the low-level details as well.

This might just be a symptom of so few libraries following this pattern. I'd
love to see some specific examples of handling various kinds of protocols (not
just HTTP) with this approach, to see how it addresses various kinds of
protocol components. For instance: variable-length data structures with length
prefixes, variable-length data structures with "number of elements in the
following array" in the middle, variable-length data structures with a
terminator, text structures that require parsing tokens, and so on. Right now,
the main documentation for those kinds of patterns seems to be "hope HTTP has
a similar pattern and read the corresponding code in h2 or h11".

I'd also love to see some reusable components that make it easier to build
such protocol libraries.

~~~
Lukasa
Yeah, so that's very reasonable.

In Python-land this is all pretty easy. For example, HTTP/2 is a protocol of
the first kind ("variable-length data structures with length prefixes") at the
framing layer, which is implemented in a Python package called hyperframe.
This uses a combination of the `struct` module and bytestring operations to
achieve its results. A similar approach works for the second kind as well.
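As an illustration of the `struct`-plus-bytestrings approach (a from-scratch sketch, not hyperframe's actual code): the HTTP/2 frame header per RFC 7540 is 9 bytes, holding a 24-bit length, an 8-bit type, an 8-bit flags field, and a reserved bit plus a 31-bit stream ID.

```python
import struct

def parse_frame_header(data: bytes):
    """Parse a 9-byte HTTP/2 frame header into (length, type, flags, stream_id)."""
    if len(data) < 9:
        raise ValueError("need 9 bytes for a frame header")
    # ">HBBBL": struct has no 3-byte integer code, so the 24-bit length is
    # read as a 16-bit piece plus an 8-bit piece and recombined.
    hi, lo, ftype, flags, stream_id = struct.unpack(">HBBBL", data[:9])
    length = (hi << 8) | lo
    return length, ftype, flags, stream_id & 0x7FFFFFFF  # mask the reserved bit
```

No I/O anywhere: the caller is responsible for getting 9 bytes here, by whatever transport it likes.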

Basically, in Python this is almost always _much_ easier because struct sizing
and memory allocation isn't a concern like it is in a C-like language (though
even there, dynamically sized structures and pointers are your friends).

But I agree, there is a lack of good discussion about "how do I actually do
this?" I'd like to elaborate on that at some point for sure, because the
reality is that it's remarkably simple.

~~~
JoshTriplett
> Basically, in Python this is almost always much easier because struct sizing
> and memory allocation isn't a concern like it is in a C-like language
> (though even there, dynamically sized structures and pointers are your
> friends).

I definitely don't want C anywhere near parsers for untrusted data, for so
many reasons, this among them.

> But I agree, there is a lack of good discussion about "how do I actually do
> this?" I'd like to elaborate on that at some point for sure, because the
> reality is that it's remarkably simple.

Perhaps it would help to have some worked examples for some additional
protocols?

Would you be interested in collaborating on a Python parser for some non-
trivial data structures? I have a collection of such parsers as part of BITS
([https://biosbits.org/](https://biosbits.org/)) that really need reworking to
decouple them from I/O, and I suspect the result would make a good article
and/or conference talk.

~~~
posborne
I am the author of one such library for the problem of writing parsers
(particularly for binary protocols). The declaration of the protocol
structures is separate from anything involving I/O. Not trying to push it too
hard, but it is one approach:
[https://github.com/digidotcom/python-suitcase](https://github.com/digidotcom/python-suitcase)

There is also Construct which has a different syntax but is similar in many
ways:
[http://construct.readthedocs.io/en/latest/index.html](http://construct.readthedocs.io/en/latest/index.html)

Both suitcase and Construct are definitely better suited for parsing binary
protocols; in my line of work, that limitation hasn't been a deal breaker.
With suitcase, at least, I haven't done much work to optimize performance
(mostly because if I cared, I wouldn't be using Python).

~~~
JoshTriplett
Both of those look great; thanks for the pointer to them!

------
olalonde
Node.js has pretty elegant read/write stream abstractions for this
([https://nodejs.org/api/stream.html#stream_stream](https://nodejs.org/api/stream.html#stream_stream)).

------
voltagex_
Has anyone got any examples of this kind of design in C#?

------
knite
Could we get "Python" in the title?

