

Linguistics, Turing Completeness, and teh lulz - Hitchhiker
http://boingboing.net/2011/12/28/linguistics-turing-completene.html

======
Groxx
It sounds like, with the no-length-field part, they're saying a protocol
should define _all_ metadata _and_ data formats? I get this from the packet
portion; if skipping over the data part of a packet is bad, but churning
through it to get to the close-delimiter is good, then you must be validating
the data somehow.

Well sure, validate your data. But why is that part of the _packet_ protocol?
I would think it should only verify that the data is properly escaped (if it
is indeed escaped - with a length field it doesn't need to be). Shouldn't the
contents be validated by what's using it, and not the TCP/IP packet
specification?

Adding a length field certainly doesn't make the protocol Turing-complete; a
pushdown state machine should handle it just fine: push the length onto the
stack, then iterate and decrement until it's exhausted. That even keeps it
context-free (if the protocol is regular and you just added a length field),
which they claim is a safe zone.

\--

They also claim that escaping is a flawed art, citing SQL injection. Is it? It
seems easy enough to write a trivial escape that's not flawed (not claiming CSV
here, just double-quotes or something), and such an escaping act _must_ be performed
if you read open and close delimiters... unless your protocol is also
specifying and validating the contents it's transferring, down to the very
last bit, and it doesn't allow those delimiters in the data.
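For what it's worth, here's the kind of trivial double-quote escape I mean (a minimal sketch of my own, not anything from the talk): double every delimiter inside the data, so a lone delimiter can only mean "end of field":

```python
# Minimal quote-doubling escape (hypothetical illustration).
# Inside the field, "" means a literal quote; a single " ends the field.
DELIM = '"'

def escape(s: str) -> str:
    return DELIM + s.replace(DELIM, DELIM * 2) + DELIM

def unescape(s: str) -> str:
    if len(s) < 2 or s[0] != DELIM or s[-1] != DELIM:
        raise ValueError("not a quoted field")
    return s[1:-1].replace(DELIM * 2, DELIM)

assert unescape(escape('say "hi"')) == 'say "hi"'
```

The round-trip property is the whole correctness claim, and it's small enough to test exhaustively over short strings.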

Doesn't this mean they've just claimed that arbitrary data transport protocols
are all inherently flawed? And that any nested specs (ASCII in a TCP/IP
packet, for instance) are inherently flawed, but single-depth ones (which
could very well specify ASCII as the only data format for TCP/IP packets)
aren't?

\--

All in all an interesting talk, and the core message is a good one - simple,
non-Turing-complete protocols are best, and perhaps should be the only
protocols, because they can be validated. But I don't think I buy some of the
reasoning, though I'd love to be convinced otherwise if it is indeed valid.

~~~
jiggy2011
The point about encapsulating arbitrary data formats is a good one, especially
if you invent some protocol that may have to receive programming-language code
or, worse, images/mp3s etc, especially compressed/encrypted ones.

Unless your protocol will have very limited use cases, it's almost impossible
to know what data will be sent over it, so you can't really guarantee any
single delimiter character. Even if you do make the use case limited, someone
will want to use it in a way you didn't intend (look at HTTP, for example).

I don't think they were talking about validating everything at the TCP/IP
level, especially since it's unlikely you or I could effect meaningful change
to the TCP/IP spec anyway. I think it means application-layer protocols that
get invented for specific programs, where what you are doing in fact is
creating another type of packet at a higher abstraction level.

I think the length field example was probably a bad one, but I think the tl;dr
is to KISS when defining a protocol. If somebody else re-implements your
protocol and you can't easily run their implementation through a series of
tests to know whether it is correct, then you run the risk of creating a
so-called "weird machine".

~~~
Groxx
KISS seems to demand a length field on the receiving side of a protocol,
though - you don't need to restrict, validate, (un)escape, or even _look_ at
the data you're transferring, so you can easily transfer anything. Testing
length is far easier than escaping - do the bits after skipping match the spec
for the next chunk of metadata? You can therefore create tests which _only_
test the metadata (your actual spec), and completely ignore the data.

The only thing left is ensuring your tests fail when you specify the wrong
length, which _should_ be assured by your spec (either a master-length value
or a protocol-terminating character like \0 (where you touch on the escaping
problem again)). This is the exact same thing as ensuring start/end delimited
tests pass/fail when you are adding/missing bits of your packet, or if the
bits/bytes are misaligned, which you should be doing regardless.

Given this, using lengths means you can avoid creating or testing any
(un)escaping code, and the bit-alignment tests are identical between the two -
lengths are simpler and more easily testable. KISS favors lengths.

On the sending side, length is still easier than escaping, unless you're
streaming data of unknown length. In that case, you're probably _still_
breaking it up into packets that can be individually sent and validated with a
known length, or you have to resort to escaping and delimiters (not that it's
a bad option, just that it's the only remaining one AFAIK).
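That "break the stream into individually-length-prefixed packets" option can be sketched too. This is a hypothetical chunked framing of my own (much like HTTP's chunked transfer coding): each chunk carries its own length, and a zero-length chunk terminates the stream, so data of unknown total length never needs escaping:

```python
# Hypothetical sketch: stream of unknown total length, framed as
# [u32 length][payload] chunks, terminated by a zero-length chunk.
import io
import struct

CHUNK = 4  # deliberately tiny, just for illustration

def write_chunked(out: io.BytesIO, data: bytes) -> None:
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        out.write(struct.pack(">I", len(piece)) + piece)
    out.write(struct.pack(">I", 0))  # terminator frame

def read_chunked(inp: io.BytesIO) -> bytes:
    parts = []
    while True:
        (n,) = struct.unpack(">I", inp.read(4))
        if n == 0:
            return b"".join(parts)
        parts.append(inp.read(n))

buf = io.BytesIO()
write_chunked(buf, b"hello world")
buf.seek(0)
assert read_chunked(buf) == b"hello world"
```

Each chunk is validated with a known length, yet the sender never needed to know the total length up front.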

~~~
jiggy2011
_Testing length is far easier than escaping - do the bits after skipping match
the spec for the next chunk of metadata?_

What happens if the wrong length is supplied but something inside the actual
data portion happens to match the spec for the next part of the metadata
(either by accident or design)? This could result in too little of the data
being read as input. You will then probably end up with some data/metadata
further along the line being read by a piece of code it wasn't supposed to be,
since everything is out of whack.

This _shouldn't_ have security implications as far as I can think, but it could
lead to data that should be valid causing errors in some cases.

Could be fixed maybe by having a hash right after the data portion and
checking this after reading it. That would fix "by accident" but not "by
design". Although if you're purposely crafting packets to confuse the parser
at the other end, you deserve to have them dropped on the floor.
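A minimal sketch of that hash-after-data idea (the frame layout and the choice of SHA-256 are my assumptions): the receiver recomputes the digest over exactly the bytes the length field claimed, so an accidental length/data mismatch fails loudly:

```python
# Hypothetical sketch: [u32 length][payload][sha256(payload)].
# Catches accidental mismatches; a deliberate attacker can of course
# compute a matching digest for whatever they send.
import hashlib
import struct

def frame(payload: bytes) -> bytes:
    digest = hashlib.sha256(payload).digest()
    return struct.pack(">I", len(payload)) + payload + digest

def unframe(buf: bytes) -> bytes:
    (n,) = struct.unpack(">I", buf[:4])
    payload, digest = buf[4:4 + n], buf[4 + n:4 + n + 32]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("length/payload mismatch")  # drop on the floor
    return payload

assert unframe(frame(b"data")) == b"data"
```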

This is really Pascal strings vs. C strings all over again.

~~~
Groxx
If something 'happens to match':

Assuming only lengths, the packet length won't match - you'll reach the end of
the structure before the end of the data. The point of all this is to validate
before interpreting, so it fails validation, and no harm is done.

Assuming a \0 termination, which requires that it does not exist in the data
(an escaping problem): there won't be a \0 after the should-be-last piece of
metadata, so it fails validation, and no harm is done.

And _all_ of this assumes you have a correctly-transmitted packet to begin
with. What if you don't? If your spec+validator can't detect it, then you are
introducing possible attack vectors - your tests _must_ test wholly-invalid
structures, which your validator _must_ fail, to be 'safe'. I don't see how
delimiters protect this any more than lengths, and they add complexity due to
escaping (which I do think can be done safely + correctly, but we're also
assuming they cannot be because that was essentially a claim in the video),
and you _must_ escape if you use delimiters and allow arbitrary data.

------
kabdib
One of my favorite bits from Vernor Vinge's _A Fire Upon the Deep_ -

>>> WARNING! The site identifying itself as Arbitration Arts is now controlled
by the Straumli Perversion. The Arts' recent advertisement of communications
services is a deadly trick. In fact we have good evidence that the Perversion
used sapient Net packets to invade and disable the Arts' defenses. Large
portions of the Arts now appear to be under direct control of the Straumli
Power...

"Sapient net packets" seem like a really, really bad idea . . .

------
jiggy2011
Interesting talk in the video.

What I'm interested in is the performance implications of context-free parsing
of some arbitrary data without a length field.

Someone asked this in the talk and a paper was referenced; she thought
context-free could be _faster_, but I don't really have time to read it.

Let's say you get delimited data over a network socket and no length field.
How do you know how big to make your receive buffer without losing efficiency
(i.e. scanning the data twice or allocating more than you need) or
reallocating and copying the buffer at some interval?

