
The Bugs We Have to Kill [pdf] - vezzy-fnord
https://www.usenix.org/system/files/login/articles/login_aug15_02_bratus.pdf
======
munin
Some things about this are a mess.

For one, they claim that CompCert doesn't have a formally verified C parser,
but it does, as of 2014. Instead of citing the CompCert release page, they
cite a 2011 paper which says that CompCert does not have one.

The concerns about proofs in Hoare logic (they also don't get the notation
for Hoare triples right...) are misguided as well. Given how type checking
works in strongly, statically typed languages, the conditions they worry
about don't arise, by construction. They say that isn't possible, but it
totally is possible and is done in projects like CompCert, Bedrock, and some
Haskell frameworks.

Also, they say that Heartbleed is a parser error. I don't see how: the
heartbeat message is perfectly well formed; it's just that the size of the
data requested is larger than the size of the buffer. How is that a parser
error? The patch wasn't a change in parser behavior either, so...

~~~
jbangert
Verifying length fields is absolutely a parser issue. Heartbleed and similar
bugs arise from the fact that the length is encoded (at least) twice: once in
the explicit length field, and once implicitly in the length of the
transmitted data (i.e. the TCP packet length).

If multiple copies of the same redundant information are not identical, then
that is definitely a case of invalid input.
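
A minimal sketch of that cross-check in C (a hypothetical handler, not
OpenSSL's actual code; the layout assumed is one type byte followed by a
two-byte payload length):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical heartbeat handler, not OpenSSL's actual code. The payload
     * length appears twice: in the message's explicit length field and
     * implicitly in how many bytes actually arrived. The vulnerable pattern
     * trusts the former; the safe pattern cross-checks the two. */
    int handle_heartbeat(const uint8_t *msg, size_t received_len,
                         uint8_t *out, size_t out_cap)
    {
        if (received_len < 3)
            return -1;                      /* 1 type byte + 2-byte length */

        uint16_t claimed_len = (uint16_t)((msg[1] << 8) | msg[2]);

        /* Reject input whose two encodings of the same length disagree.
         * Skipping this check and copying claimed_len bytes anyway is the
         * Heartbleed-style over-read. */
        if ((size_t)claimed_len > received_len - 3 || claimed_len > out_cap)
            return -1;

        for (size_t i = 0; i < claimed_len; i++)  /* echo the payload back */
            out[i] = msg[3 + i];
        return (int)claimed_len;
    }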

I try to address this class of parser vulnerability with my Nail parser
generator (OSDI '14, github.com/jbangert/nail), which is inspired by
Meredith's hammer.

~~~
heinrich5991
To the application, there's no such thing as TCP packet length. Is there a TLS
packet length?

~~~
richm44
There's a length field in the TLS record and also one in the heartbeat
message itself. Heartbleed happened when the length field of the heartbeat
message claimed more data than the TLS record actually contained.
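
In other words, the two lengths are nested, and everything the heartbeat
message claims to contain must fit inside the record. A minimal sketch of
that consistency check (field names are illustrative, not OpenSSL's):

    #include <stdbool.h>
    #include <stdint.h>

    /* TLS record:  type (1) | version (2) | record_length (2) | fragment...
     * Heartbeat:   type (1) | payload_length (2) | payload | padding (>=16)
     * Heartbleed violated the condition checked below. */
    static bool heartbeat_lengths_consistent(uint16_t record_length,
                                             uint16_t payload_length,
                                             uint16_t padding_length)
    {
        return (uint32_t)1 + 2 + payload_length + padding_length
               <= record_length;
    }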

------
hyperion2010
Something, something, the LISP guys have known this forever something,
something.

edit: This article almost perfectly articulates why I'm so furious at the
entire web-as-a-platform movement. If you want general computation, let's
develop a platform for it that is isolated and doesn't try to turn what
should be nothing more than text into a full-blown programming language. The
browser should be (good luck getting it back there) for viewing and
retrieving data, not for executing that data.

~~~
Zigurd
This is why, on mobile devices, where you have a new kind of UX, almost
everything needed a native UI implemented as an app. The Web is a nice
programmable hypertext system. Sometimes it's worth pushing the limits of
that. But not for everything.

------
qznc
If I understand it correctly, it demands that we use simpler input
languages, namely languages you can parse with a context-free grammar (in CS
speak, LL(k) or even better SLL(k)). To clarify, it is not so much about how
the programmer implements it (parser generator, combinators, or hand-coded),
but more about which languages we specify.

Well, how many languages (or data formats) are SLL(k)? SQL is not, for
example. A context-free grammar for SQL accepts "INSERT INTO example (a, b)
VALUES ('test');". You need an additional check to notice that the statement
specifies two columns to insert but provides only one value.

This is common practice: have a context-free grammar that accepts a superset
of the actual language and filter afterwards. Maybe you can fit the grammar
more closely to the language, but then it becomes quite big and ugly. I don't
think it is a good idea to avoid the filter-afterwards step in favor of big
grammars, which we then implement with shiny verification tools.
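
For the INSERT example above, the filter-afterwards step might look something
like this minimal sketch (hypothetical types, not taken from any real SQL
engine):

    #include <stdbool.h>
    #include <stddef.h>

    /* The context-free grammar accepts any number of columns and any number
     * of values; matching the two counts is a semantic check that runs after
     * parsing. */
    typedef struct {
        size_t num_columns;   /* 2 for "(a, b)" */
        size_t num_values;    /* 1 for "('test')" */
    } insert_stmt;

    static bool insert_is_semantically_valid(const insert_stmt *stmt)
    {
        /* reject what the grammar over-accepts */
        return stmt->num_columns == stmt->num_values;
    }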

~~~
TheLoneWolfling
Hence: languages with a lexer pass followed by a parser pass.

~~~
qznc
I was mostly talking about the semantic analysis pass after the parser pass.
Type checking, name binding, etc.

You are correct that the lexer pass allows you to expand a language despite
the use of a context-free grammar in the parser. Significant whitespace lexed
as indent/dedent tokens is a common example.
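
A minimal sketch of that lexing step, assuming Python-like rules with spaces
only (a real lexer also handles tabs, blank lines, mismatched dedents, and a
depth limit):

    #include <stddef.h>
    #include <stdio.h>

    enum { MAX_DEPTH = 64 };

    /* Turn a change in leading whitespace into INDENT/DEDENT tokens, so the
     * parser downstream can stay context-free. Sketch only: no overflow or
     * error handling. */
    static void lex_line(const char *line, int *stack, int *depth)
    {
        int indent = 0;
        while (line[indent] == ' ')
            indent++;

        if (indent > stack[*depth]) {              /* deeper: one INDENT */
            stack[++(*depth)] = indent;
            puts("INDENT");
        } else {
            while (*depth > 0 && indent < stack[*depth]) {  /* shallower */
                (*depth)--;
                puts("DEDENT");
            }
        }
        printf("LINE: %s\n", line + indent);
    }

    int main(void)
    {
        int stack[MAX_DEPTH] = {0};   /* indentation levels seen so far */
        int depth = 0;
        const char *program[] = {"if x:", "    y()", "    z()", "done()"};
        for (size_t i = 0; i < sizeof program / sizeof *program; i++)
            lex_line(program[i], stack, &depth);
        return 0;
    }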

------
comex
Getting parsers right will always be important, since they're often "on the
front lines", exposed directly to untrusted input and thus the easiest code
to manipulate. That includes making sure they can't be made to crash, eat up
unbounded amounts of memory, or return semantically invalid data to the next
layer of the program. But in most cases, it _shouldn't_ require ensuring they
can't be made to execute arbitrary code, because we should be writing them in
languages that do not allow for such vulnerabilities. Over the years, the
arguments against doing so are getting weaker as the costs of not doing so
get bigger. Soon there will be no excuse.

------
m1el
There is another bug I think we must get rid of: generating HTML using
printf techniques.

Not only should we have verified parsers, but also verified serializers.

[http://m1el.github.io/printf-antipattern/](http://m1el.github.io/printf-antipattern/)

(EDIT: I know it is badly written and there are errors, but I'm looking
forward to rewriting it)
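
To make the contrast concrete, here is a minimal sketch of the two approaches
(the escaper is hypothetical and only covers text content; a real serializer
also has to handle attribute, URL, and script contexts):

    #include <stdio.h>

    /* Encode the characters that change HTML structure. */
    static void html_escape(const char *in, FILE *out)
    {
        for (; *in; in++) {
            switch (*in) {
            case '<':  fputs("&lt;", out);   break;
            case '>':  fputs("&gt;", out);   break;
            case '&':  fputs("&amp;", out);  break;
            case '"':  fputs("&quot;", out); break;
            default:   fputc(*in, out);
            }
        }
    }

    int main(void)
    {
        const char *name = "<script>alert(1)</script>";

        /* printf anti-pattern: untrusted data pasted straight into markup */
        printf("<p>Hello, %s!</p>\n", name);

        /* serializer: structure-changing characters are encoded */
        fputs("<p>Hello, ", stdout);
        html_escape(name, stdout);
        fputs("!</p>\n", stdout);
        return 0;
    }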

------
vu3rdd
The paper seems closely related to another paper by all three authors and
the late Len Sassaman, titled "The Halting Problems of Network Stack
Insecurity".

[http://langsec.org/papers/Sassaman.pdf](http://langsec.org/papers/Sassaman.pdf)

------
arielby
Why the focus on context-sensitivity? You can perfectly well validate a
(restricted subset of) HTML with a regex - regexes are in fact very good at
input sanitization. Parsers are dangerous because formats have exponentially
many edges for you to get cut on. This is the case for regular, context-free,
and worse formats.

~~~
rspeer
What subset of HTML are you talking about here? HTML is not regular. Regexes
can't do recursive things, such as matching opening and closing tags, and
abusing regexes to sort of match HTML is generally considered a terrible idea.

~~~
vezzy-fnord
_Regexes can't do recursive things_

Basic regular expressions cannot, but regexes actually can. PCRE pioneered the
technique AFAIK, and it later spread to Perl, Python, Ruby and other runtimes.
Perl has this feature called lazy regular subexpressions which can be used to
evaluate Perl expressions upon matching a subexpression, thus giving you the
ability to recurse.
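
For the recursion part specifically (as opposed to embedded code), a minimal
sketch using PCRE2's (?R) construct, which matches arbitrarily nested
balanced parentheses, something a classical regular expression cannot
express (the subject string here is just an example):

    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* (?R) recurses into the whole pattern. Typically linked with
         * -lpcre2-8. */
        PCRE2_SPTR pattern = (PCRE2_SPTR)"\\((?:[^()]|(?R))*\\)";
        PCRE2_SPTR subject = (PCRE2_SPTR)"(a(b)(c(d)))";

        int errcode;
        PCRE2_SIZE erroffset;
        pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
                                       &errcode, &erroffset, NULL);
        if (re == NULL)
            return 1;

        pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
        int rc = pcre2_match(re, subject, strlen((const char *)subject),
                             0, 0, md, NULL);
        printf("rc = %d\n", rc);   /* > 0 means the nested string matched */

        pcre2_match_data_free(md);
        pcre2_code_free(re);
        return 0;
    }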

~~~
rspeer
So, yes, some implementations of regular expressions are Turing complete
because they run arbitrary code, but that is rather the opposite of a way to
make parsers safer.

On that note, once you can run arbitrary functions on your matches, you could
match /.*/ and then the function you run is html5lib.parse. Is that still a
regular expression?

