
Systematic Parsing of X.509: Eradicating Security Issues with a Parse Tree - snaky
https://arxiv.org/abs/1812.04959
======
jcranmer
There's one thing that gives me pause here:

The single most common error is listed as DNS/URI/email format violations.
There is absolutely no discussion as to what kinds of violations these break
into, nor is there even a discussion as to what the paper thinks the correct
formats ought to be. This is unfortunate because the format of these
parameters is one thing where specifications often have a view of the world
which is completely incongruent with reality. As a simple case, you will
sometimes come across documentation that thinks that DNS names cannot start
with a digit, which does not match reality at all.

~~~
bluejekyll
> As a simple case, you will sometimes come across documentation that thinks
> that DNS names cannot start with a digit, which does not match reality at
> all.

I wish that were true. It’s seriously annoying that DNS names are nearly
indistinguishable from IPv4 addresses.
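A minimal sketch of why they're so hard to tell apart: every dotted quad like "1.1.1.1" is also a syntactically valid DNS name (labels may be all digits), so syntax alone can't distinguish them. The function name and the tie-breaking rule below are my own; the common convention is to try the IP-literal reading first.

```python
import ipaddress

def classify_host(name: str) -> str:
    """Classify a host string, letting the IPv4 reading win.

    "1.1.1.1" parses as an IPv4 address, but it is *also* a valid
    DNS name, so this classification is a convention, not a fact
    recoverable from the syntax alone.
    """
    try:
        ipaddress.IPv4Address(name)
        return "ipv4"
    except ValueError:
        return "dns-name"

print(classify_host("1.1.1.1"))    # ipv4
print(classify_host("9gag.com"))   # dns-name: a leading digit is fine
```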

~~~
technion
To complicate this further, there are certificates issued with IP addresses as
names. One of the early bugs in CT Advisor[0] involved not knowing what to do
with such a thing. I'd be interested in whether the authors of this report
considered these valid URIs.

An obvious example: [https://1.1.1.1/](https://1.1.1.1/)

[0] [https://ctadvisor.lolware.net/](https://ctadvisor.lolware.net/)

~~~
XMPPwocky
I'm seeing the SAN for the cert served by 1.1.1.1 to contain the IP 1.1.1.1,
not the domain 1.1.1.1. Or am I misunderstanding?

~~~
technion
Yes, the name is listed in the SAN on that cert. That said, that's one of the
fields that's parsed, and potentially a source of issues in this paper.

------
lsh
see also langsec.org

Non-Turing-complete languages, formal grammars, and context-free parsing are
fascinating, and the current state of tooling is really sophisticated but
sparse. So much boilerplate code and ad hoc parsing exists in my code, and I
never really appreciated how much until I asked myself whether I really needed
a Turing-complete language to tell me if an input is an integer or a string of
4 chars, etc.
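Checks like those only need a regular language. A sketch of the langsec-style alternative to an ad hoc scanning loop, using Python's `re` as a stand-in for a proper recognizer (the exact integer/string rules here are my own illustrative choices):

```python
import re

# Regular expressions suffice: no Turing-complete logic, no hand-rolled
# loops. fullmatch anchors at both ends, so no partial-match surprises.
INTEGER = re.compile(r"-?(0|[1-9][0-9]*)")   # rejects leading zeros
FOUR_CHARS = re.compile(r"[A-Za-z]{4}")      # exactly four letters

def is_integer(s: str) -> bool:
    return INTEGER.fullmatch(s) is not None

def is_four_chars(s: str) -> bool:
    return FOUR_CHARS.fullmatch(s) is not None
```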

I'm terrified what would happen if a fuzzer ever went to town on my python
code.

~~~
zvrba
The last couple of times I had to parse/generate strings according to something
that can be described with a grammar, I resisted the temptation to implement
an ad-hoc parser just because it was "simple". So I took the time and started
to use Boost Qi/Karma (C++) instead. Precisely BECAUSE the formats are simple,
they are the perfect opportunity to start learning more powerful tools.

------
sneak
It constantly underscores how early we are in the development of reliable
tooling that such basic errors as these are still regularly being made. I am
glad this sort of research is being done to uncover and identify our societal
technical debt.

~~~
nailer
There's a lot of debt from just ASN1 and X509 parsing itself. The formats are
only popular because of their popularity: their payloads are what matters.

~~~
wahern
The formats are popular because they were popular in closed-source software.
And they were popular in closed-source software because there were (and still
are) good commercial parser generators for ASN.1.

ASN.1 has been a failure in open source because there weren't any good parser
generators. The only open source ASN.1 generator for C code I'm familiar with
is asn1c[1], which was published long after OpenSSL and other projects added
their ad hoc certificate parsing code. I think there may be one or two for
Java, but that's about it.

Moreover, open source projects have historically disfavored using parser
generators. They don't like the dependency, and there's still the sense that
good protocols shouldn't need parser generators--contrast commercial protocols
like X.whatever with SMTP, HTTP, etc.

ASN.1-based formats were _never_ intended to be parsed using hand-written
code. Abstract Syntax Notation refers to the grammar for specifying the
wire-line formats.
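For a concrete sense of what that grammar looks like, this is the top-level Certificate type from the ASN.1 module in RFC 5280; a generator consumes definitions like this and emits the encoder/decoder:

```asn1
Certificate  ::=  SEQUENCE  {
    tbsCertificate       TBSCertificate,
    signatureAlgorithm   AlgorithmIdentifier,
    signatureValue       BIT STRING  }
```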

ASN.1 is solid technology. The technical debt exists because open source
tooling never developed around it. At first the community thought it was too
complicated and unnecessary. Then when the need arose the community simply
reinvented the wheel (Protocol Buffers, etc).

[1] asn1c is amazing, BTW. Not only will it generate encoders and decoders
given the ASN.1 specification, but it can generate _streaming_ encoders and
decoders, something that most open source alternatives (e.g. Protocol Buffers)
can't do. (And by streaming I mean streaming a _single_ message, which is
important in low-memory environments, whether due to minimal hardware
resources, a performance optimization, or a security constraint.)

~~~
cryptonick
I agree with your claim that ASN.1 was not intended to be parsed with
hand-written code. ASN.1 is indeed quite close to a grammar specification, as
also shown in the paper. However, I believe a major source of its parsing
complexity is the binary encoding generally used, either BER or DER, both of
which employ length fields. While length fields are nearly ubiquitous in
formats for communication protocols, they are quite annoying to handle from a
grammar-design perspective. A length field requires counting the bytes of the
payload: that operation is tedious to express in a grammar but extremely easy
in hand-written code, which in turn makes grammar-based automatic parser
generators a less common choice for these formats.
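To illustrate how easy length fields are in hand-written code (which is exactly the temptation described above), here is a minimal sketch of reading one definite-length DER TLV. This is a simplification, not a full DER parser: it handles single-byte tags only, and a real parser must also reject non-minimal long-form lengths.

```python
def read_tlv(data: bytes, offset: int = 0):
    """Read one definite-length DER TLV starting at offset.

    Returns (tag, value, next_offset).
    """
    tag = data[offset]
    first = data[offset + 1]
    if first < 0x80:                      # short form: length fits in one byte
        length, header = first, 2
    else:                                 # long form: low 7 bits = octet count
        n = first & 0x7F
        length = int.from_bytes(data[offset + 2:offset + 2 + n], "big")
        header = 2 + n
    start = offset + header
    return tag, data[start:start + length], start + length

# DER encoding of INTEGER 5: tag 0x02, length 1, value 0x05
tag, value, end = read_tlv(b"\x02\x01\x05")
```

A grammar, by contrast, has no natural way to say "the next N bytes", which is the annoyance the comment above describes.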

From a grammar-design standpoint, a delimiter-based structure would be
preferable. For instance, in the context of X.509, we proposed a new format[1]
that replaces DER with an encoding in which there are no length fields;
instead, the payload is terminated by a fixed delimiter. The grammar for this
format was far simpler than for the length-field-based encoding, and required
no hand-written code.
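A generic sketch of the delimiter-terminated idea (the delimiter and escape bytes here are hypothetical choices of mine, not the paper's actual encoding). The trade-off is visible in the code: decoding needs no byte counting, but payload bytes that collide with the delimiter must be escaped.

```python
DELIM, ESC = 0x00, 0xFF  # hypothetical byte values, not the paper's

def encode(payload: bytes) -> bytes:
    """Terminate the payload with DELIM, escaping collisions inside it."""
    out = bytearray()
    for b in payload:
        if b in (DELIM, ESC):
            out.append(ESC)
        out.append(b)
    out.append(DELIM)
    return bytes(out)

def decode(encoded: bytes) -> bytes:
    """Scan to the unescaped delimiter; no length field to count against."""
    out = bytearray()
    i = 0
    while True:
        b = encoded[i]
        if b == ESC:
            out.append(encoded[i + 1])
            i += 2
        elif b == DELIM:
            return bytes(out)
        else:
            out.append(b)
            i += 1
```

Because "everything up to the delimiter" is a regular language, this is the kind of structure a grammar expresses directly.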

[1] A novel regular format for X.509 digital certificates,
[https://link.springer.com/chapter/10.1007/978-3-319-54978-1_...](https://link.springer.com/chapter/10.1007/978-3-319-54978-1_18)

~~~
wahern
I believe I read that paper recently :) Ultimately I ended up using PEGs. LPeg
in particular, using LPeg's match time captures to recursively invoke PEGs for
length-encoded objects. (In addition to the match time capture extension,
what's especially nice about LPeg--missing from every other PEG library I've
seen--is that you can build _and_ transform the AST in one shot.)

I've also tentatively rejected translation to a format like that proposed in
the paper. In a secure enclave-like environment I'd rather be dealing with
statically defined C-like structs with stronger invariants--i.e. no optional
or sum types, no variable-length fields; basically, no need for any kind of
parsing whatsoever. If I have to transform, I'd like to transform the message
both syntactically _and_ semantically into the simplest possible form. Parsing
complexity is only part of the equation. The other part is semantic
complexity, which is a different kind of problem that better formats and
parsers can't fix.

In another timeline things could have been different, but we don't live in
that timeline :( We can't let perfect be the enemy of good. Even if we could
move away from DER or even ASN.1 in the open source world, the entire
telecommunications industry (and specifically the cell industry) is built
around ASN.1 and DER/PER/XER. AFAICT the biggest users of asn1c are people
working with 3GPP and similar standards. No matter how sane and secure we can
make our open source ecosystems, ASN.1 and similar older tech will still lurk
in the background, remaining the weakest link in the chain. If we want real
security we have no choice but to develop better tooling in that regard. I
appreciate your proposal is very much of that mindset, I'm just not sold on
the practical utility.

------
nly
My default position on parsing anything more complex than a couple of comma-
separated non-string values these days is to write a grammar and pick up a
tool. I wish more programmers felt this way.

~~~
AllegedAlec
Even that is often highly annoying. How do you deal with (for example) Dutch
decimal numbers, which use a comma as the decimal separator, or numbers with a
comma as a thousands separator?
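The ambiguity is real: the same digit string can be valid under both conventions with different values. A minimal sketch (deliberately avoiding the `locale` module, whose behavior depends on which system locales are installed; the function names are mine):

```python
def parse_dutch(s: str) -> float:
    # Dutch convention: "." groups thousands, "," is the decimal separator
    return float(s.replace(".", "").replace(",", "."))

def parse_us(s: str) -> float:
    # US convention: "," groups thousands, "." is the decimal separator
    return float(s.replace(",", ""))

# One string, two readings:
parse_dutch("1.234")  # 1234.0
parse_us("1.234")     # 1.234
```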

~~~
LeonM
'Dutch' notation can usually be distinguished with enough context (i.e. if the
number has decimals), or by comparing it to other numbers found in a file.

But don't get me started on people mixing American date notation and ISO
notation. Especially mixing the separators.

If you ever have to work with date notations, use a hyphen as a separator for
ISO, and use a forward slash (/) when using American notation. It's the only
way to distinguish dates before the 13th day of the month.
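The suggested convention above can be sketched directly: dispatch on the separator, hyphen meaning ISO 8601 and slash meaning American order (the function name is mine).

```python
from datetime import date, datetime

def parse_date(s: str) -> date:
    """Hyphen means ISO (year-month-day); slash means American
    (month/day/year), per the convention suggested above."""
    if "-" in s:
        return datetime.strptime(s, "%Y-%m-%d").date()
    return datetime.strptime(s, "%m/%d/%Y").date()
```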

~~~
riffraff
I'm sorry to tell you that the rest of the world also uses slashes with dates;
it's not just a US thing. But with the same field order.

~~~
mattashii
Please note that in The Netherlands, the most commonly used date format using
'/' is day/month/year, not month/day/year as seen in the US.

See also
[https://en.wikipedia.org/wiki/Date_format_by_country](https://en.wikipedia.org/wiki/Date_format_by_country)

------
userbinator
Interesting how Apple's SecureTransport seems to be the most permissive of
them all: it rejected 0 of the certificates in the dataset that all the others
flagged for syntactic errors.

------
k-ian
[comment about x509 being bad]

~~~
Dylan16807
Sucks that you're being downvoted, since at the time you posted, "X.509"
wasn't in the title, and it was reasonable context on _why_ "20% of HTTPS
server cert are incorrect, half considered valid by libs".

