
Nobody expects CDATA sections in XML - dmit
http://lcamtuf.blogspot.com/2014/11/afl-fuzz-nobody-expects-cdata-sections.html
======
NelsonMinar
The irony is that CDATA isn't even very useful; there's no way to escape the
]]> closing delimiter, so you still have to invent some special escaping
mechanism to use it.
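(The usual workaround is to end the CDATA section right after the "]]" and
immediately open a new one for the ">". A minimal sketch with Python's stdlib
ElementTree, just to show the trick round-trips; the helper name is mine, not
from any spec:

```python
import xml.etree.ElementTree as ET

def cdata_escape(text):
    # A CDATA section cannot contain a literal "]]>", but we can close
    # the section after "]]" and immediately open a new one for ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

payload = "a ]]> b"
doc = "<root>" + cdata_escape(payload) + "</root>"

# Adjacent CDATA sections merge into one text node on parse,
# so the original payload survives intact.
assert ET.fromstring(doc).text == payload
```

Which is exactly the "special escaping mechanism" complaint: you need it even
though CDATA's whole job was to avoid escaping.)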

Nobody expects entity definitions in XML either, and yet about once a year
some new service or software is found vulnerable to XXE attacks. (Summary: a
lot of XML parsers can be made to open arbitrary files or network sockets and
sometimes return the content.)
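(The textbook payload looks something like this; the entity name and file path
are the standard illustration, not anything from the article:

```xml
<?xml version="1.0"?>
<!DOCTYPE foo [
  <!-- External entity: a parser that resolves it substitutes &xxe;
       with the contents of the referenced file. -->
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<foo>&xxe;</foo>
```

If the service echoes any part of the parsed document back, the file contents
come with it.)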

XML is a ridiculously complex document format designed for editing text
documents. It is not a suitable data interchange format. Fortunately we have
JSON now.

~~~
adamtulinius
"Fortunately we have JSON now." Which doesn't support big ints. JSON isn't a
silver bullet.

~~~
kazinator
Sure it does:

    { "mybigint" : -123434580239458203948203982345723458 }

You can throw the spec out the window and put as many digits as you want into
a JSON integer. Any half-decent parser in a half-decent language will
accumulate the token and spit out an integer object.

There is no reason to write a JSON parser that doesn't accept bignum integers,
in a language that has them.
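Python's stdlib `json` is an existence proof: integer tokens are handed to the
language's arbitrary-precision `int`, so every digit survives. A quick check:

```python
import json

# Far outside the 2^53 range that IEEE doubles represent exactly.
doc = '{ "mybigint" : -123434580239458203948203982345723458 }'

value = json.loads(doc)["mybigint"]
assert value == -123434580239458203948203982345723458

# Serializing it back preserves every digit, too.
assert json.loads(json.dumps(value)) == value
```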

There are only two reasons some JSON implementation doesn't accept them. One
is that the underlying software doesn't handle them. In that case JSON is
moot; it's a limitation of that system. You cannot input bignums into that
system through _any_ means; they are not representable. The other is that the
JSON implementation was written by obtuse blockheads: bignums could easily be
supported since the underlying language has them, but aren't simply for
compliance with the JSON specification. Compliance can be taken to
counterproductive extremes.

~~~
cbsmith
> There is no reason to write a JSON parser that doesn't accept bignum
> integers, in a language that has them.

Couldn't you write that about almost any data type in JSON? The entire point
of having a standard for such things is so that you don't have to worry about
the differences between individual parsers/serializers.

~~~
kazinator
> _Couldn't you write that about almost any data type in JSON?_

Yes, you could.

> _The entire point of having a standard for such things is so that you don't
> have to worry about the differences between individual parsers/serializers._

Things like the maximum size of strings or integers are implementation limits.
These should be kept separate from the language definition _per se_.

There are _de facto_ different levels of portability of JSON data. A highly
portable piece of JSON data confines itself not to exceeding certain limits,
like ranges of integers. We could call that "strictly conforming".

JSON data which exceeds some limits is not strictly conforming, but it is
still well-formed and useful.

Limits are different from other discrepancies among implementations because it
is very clear what the behavior should be _if_ the extended limit is
supported. If an implementation handles integers beyond the JSON specified
range, there is an overwhelming expectation that those representations keep
denoting integers.

This is different from situations where you hit some unspecified or undefined
behavior, where an implementation could conceivably do _anything_, including
numerous choices that meet the definition of a useful extension.

~~~
cbsmith
> Things like the maximum size of strings or integers are implementation
> limits. These should be kept separate from the language definition per se.

At least with integers, I think protobuf's ability to specify 32-bit and
64-bit integers has been quite helpful.

> There are de facto different levels of portability of JSON data. A highly
> portable piece of JSON data confines itself not to exceeding certain limits,
> like ranges of integers. We could call that "strictly conforming".

Yeah... and that's how you get yourself into problems. Life is simpler with
one standard that either works or doesn't. Easily the most annoying thing with
protocol buffers is using unsigned integers because of Java's signed integer
foolishness. Yes, you can argue that's a reason to have the "strictly
conforming" concept, but I'd argue quite the opposite.

> Limits are different from other discrepancies among implementations because
> it is very clear what the behavior should be if the extended limit is
> supported. If an implementation handles integers beyond the JSON specified
> range, there is an overwhelming expectation that those representations keep
> denoting integers.

Hmm... I don't think that is clear at all. In fact, it isn't clear to me when
I have to worry about floating point rounding potentially kicking in.
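The rounding concern is concrete: any decoder that maps JSON numbers to IEEE
doubles (as JavaScript's JSON.parse does) silently loses integer precision
past 2^53. A sketch with Python's `json`, emulating a double-based decoder via
the `parse_int` hook:

```python
import json

n = 2**53 + 1  # 9007199254740993, not representable as a double

# Python's default decoder keeps it exact...
assert json.loads(str(n)) == n

# ...but a double-based decoder (emulated here with parse_int=float,
# which is how many JSON libraries effectively behave) rounds it
# down to 2**53. Same bytes on the wire, different value out.
assert json.loads(str(n), parse_int=float) == 2**53
```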

> This is different from situations where you hit some unspecified or
> undefined behavior, where an implementation could conceivably do anything,
> including numerous choices that meet the definition of a useful extension.

In practice, there seems to be little difference. While there might be some
_idealized_ behaviour that is expected, there appears to be plenty of wiggle
room for a variety of behaviours for these "not strictly conforming" cases.

------
0x0
I've been following posts about this tool for a few weeks and it is really
remarkable how many interesting results are already popping out, particularly
since static analyzers have been around for years and years.

I'm assuming afl-fuzz is particularly CPU-bound, and it would be interesting
to see some numbers about how many CPU years are being dedicated to it at the
moment - and if we would see even more interesting stuff if a larger compute
cluster was made available.

It's also super scary how "effortlessly" these bugs appear to be uncovered,
even in "well-aged" software like "strings".

~~~
hueving
It would be pretty cool to have a public cluster that anyone can submit jobs
to that are prioritized based on amount of donated CPU cycles. Instead of
"Seti at home" it would be "fuzz at home".

~~~
dalke
I don't think that would be an efficient use of computing resources. Each
instance explores the same instruction space: it keeps track of where it has
explored, and uses various techniques to reach different parts of that space.

It's very likely that multiple instances, if run in parallel and with no data
sharing, will explore a lot of the same space.

Also, making a _public_ cluster would be a security challenge. It runs
arbitrary C/C++ code, and can trigger code paths that the developers didn't
even realize. How would your box stand up to multiple grabs of 4GB of memory?

~~~
hueving
Don't have instances explore the same programs starting from the same state.
Have state reported back in intervals so other workers can be scheduled to
resume the work if that one goes down.

Security would be an issue. A VM would probably be a hard requirement. That
can bound the memory usage, hardware calls, etc.

~~~
dalke
Even starting from different states doesn't help, because afl uses semi-random
search techniques. It's very likely that different start locations will still
have a large overlap.

Nor is it obvious that state reporting is useful. I ran afl for about 4 days.
It ran my test code about 1,000 times per second, for a total of nearly 1/2
billion test cases.

That's a lot of data exchange for each program to report and resynchronize.

I'm not saying it's impossible. I'm suggesting that it's likely not
worthwhile. It would be better to support multiprocessing first, before
looking towards distributed computing.

~~~
f-
afl-fuzz can be parallelized fairly easily. The exchanged data amounts to
newly-discovered, interesting inputs that then seed the subsequent fuzzing
work.

~~~
dalke
Great work with afl. I tried it out last week, and found two segfaults and a
stack smash detection in one program. I tried it on another program, only to
have gcc crash with an internal error. :(

By parallelized, do you mean on the same machine or across a distributed
cluster? If they only share the same set of interesting inputs, won't the
different nodes also end up searching much of the same space? .... Hmm, no I
see how I could be wrong. With interesting seeds, boring space is easy to re-
identify, so there's a trivial amount of duplicate work, and the rest is spent
just trying to find something new and interesting.

------
xendo
Recently I find it harder and harder to believe that lcamtuf is just one
person.

~~~
comboy
He just started running afl-fuzz a few years ago and redirects fitting
outputs as blog posts.

~~~
schoen
Running it against an AI simulation of a human programmer, pattern-matching
for the output "Wow that's awesome", no doubt.

------
al2o3cr
Heads-up to the "comment without reading the article" crowd: the title is
_not_ bemoaning a lack of handling for CDATA in existing parsers. It's
discussing an interesting behavior of the AFL fuzzer when used with formats
that require fixed strings in particular places...

Related: NOBODY EXPECTS THE SPANISH INQUISITION, either. :)

------
adnam
This is completely tangential, but I'm waiting for someone to create a
breakfast cereal called Funroll Loops. You know, for the kids.

------
seba_dos1
How long till afl-fuzz reaches consciousness?

~~~
jschwartzi
About 13 years, if all goes as planned. Then another 5 after that until we are
all running afl-fuzz.

------
serve_yay
Wow, what an enjoyable read. I recommend the story about randomly generating
JPG files too.

------
mikeknoop
This thread reminded me of a draft post I've been sitting on for a while,
related to ENTITY tags in XML and XXE exploits.

Basically, it's really easy to leave default XML parsing settings (for things
like consuming RSS feeds) and accidentally open yourself up to reading files
off the filesystem.
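(The linked write-up is about lxml, but the mechanics can be sketched with the
stdlib: DTD entity expansion is the machinery an XXE payload leans on. Python's
`xml.etree` expands internal entities but refuses external ones; parsers that
do resolve external entities by default, as the write-up shows for lxml at the
time, return the file's contents instead. Illustrative sketch, not lxml's API:

```python
import xml.etree.ElementTree as ET

# Internal entities are expanded by default -- the same DTD machinery
# an XXE payload abuses.
internal = '<!DOCTYPE r [<!ENTITY greet "hello">]><r>&greet;</r>'
assert ET.fromstring(internal).text == "hello"

# An external (SYSTEM) entity pointing at a local file. The stdlib
# parser refuses to resolve it and raises ParseError; a parser that
# resolves external entities by default would hand back the file
# contents as the element's text here.
external = ('<!DOCTYPE r [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
            '<r>&xxe;</r>')
try:
    leaked = ET.fromstring(external).text
except ET.ParseError:
    leaked = None
assert not leaked  # nothing read off the filesystem
```

The fix, whatever the library, is the same: turn off entity resolution before
feeding the parser untrusted input.)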

I did a full write-up and POC here: [http://mikeknoop.com/lxml-xxe-
exploit](http://mikeknoop.com/lxml-xxe-exploit)

------
userbinator
I'm actually not so surprised, given what the fuzzer does - mutating input to
make forward progress in the code. Incremental string comparisons definitely
fall under this category since they have a very straightforward definition of
"forward progress"; either the byte is correct and we can enter a previously
unvisited state, or it's incorrect and execution flows down the unsuccessful
path. It's somewhat like the infinite monkey theorem, except the random stream
is being filtered such that only a correct subsequence is needed to advance.
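A toy model of why byte-at-a-time comparisons fall so quickly: treat "number
of leading bytes matched" as coverage (each matched byte is a distinct branch)
and keep any mutated input that reaches a new branch. This is a drastic
simplification of what AFL actually does, just the core feedback loop:

```python
import random

TOKEN = b"<![CDATA["

def coverage(data):
    """Branches taken: one per leading byte that matches the token."""
    n = 0
    for got, want in zip(data, TOKEN):
        if got != want:
            break
        n += 1
    return n

def fuzz(seed=1, rounds=200_000):
    rng = random.Random(seed)
    best = bytearray(len(TOKEN))  # start from all-zero bytes
    best_cov = coverage(best)
    for _ in range(rounds):
        cand = bytearray(best)
        cand[rng.randrange(len(cand))] = rng.randrange(256)  # flip one byte
        cov = coverage(cand)
        if cov > best_cov:  # new branch reached: keep this input
            best, best_cov = cand, cov
            if best_cov == len(TOKEN):
                break
    return bytes(best)

assert fuzz() == TOKEN  # the fixed string falls one byte at a time
```

Each byte costs on the order of 256 * len(TOKEN) random tries instead of
256^len(TOKEN) for a blind search, which is the whole point.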

On the other hand, I'd be astonished if it managed to fuzz its way through a
hash-based comparison (especially one involving crypto like SHA1 or MD5.)

~~~
Houshalter
It's kind of like breaking a password if you only have to guess 1 letter at a
time until you get it right. Reminds me of the Weasel program:
[https://en.wikipedia.org/wiki/Weasel_program](https://en.wikipedia.org/wiki/Weasel_program)

It's just the simplest possible demonstration of evolution, where characters
of a string are randomly changed, and kept if more of the characters match. In
a short amount of time you get Shakespeare quotes.

Obviously hashes are designed to be difficult to break. Although I've never
heard of anyone trying a method like this before. I've heard of people using
things like SAT solvers to try to reason backwards what the solution should
be. But this is the reverse, it's trying random solutions and propagating
forward to see how far they get.

I doubt it would work, I'm just curious to know if this has been tried before
and how well it does.

~~~
gliptic
The problem is that a good, side-channel resistant implementation would always
do the same amount of computation and fail at the same place. You wouldn't get
any information out of your attempts.
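For reference, such a comparison does a full pass regardless of where the
first mismatch is, so coverage and timing look identical for every wrong
guess. The stdlib ships one (hmac.compare_digest); a hand-rolled sketch of the
idea:

```python
import hmac

def constant_time_eq(a: bytes, b: bytes) -> bool:
    # Accumulate differences instead of returning at the first mismatch:
    # same-length inputs always execute the same instructions, so a
    # coverage-guided fuzzer (or a timing attack) gets no per-byte signal.
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0

assert constant_time_eq(b"secret", b"secret")
assert not constant_time_eq(b"secret", b"seXret")
# The stdlib version, preferred in real code:
assert hmac.compare_digest(b"secret", b"secret")
```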

------
backspaces
But of course no one uses either when there's Atom/GitHub's favorite: CSON.
[https://github.com/bevry/cson](https://github.com/bevry/cson)

------
nickbauman
Tell the people that created the webservice I have to consume this!

------
bostonpete
I didn't expect a kind of Spanish Inquisition...

------
pjmlp
Maybe C-based XML parsers don't, but JVM- and .NET-based XML parsers don't
have any issues with CDATA sections.

Time to upgrade to more modern tools?

~~~
pascal_cuoq
The reasons why we are still relying a lot on software written in low-level
languages have been discussed to death, and are quite orthogonal to the
insight in the article, which is that seemingly lo-tech techniques can
discover much about an opaque, potentially vulnerable piece of software. And
even some seemingly insurmountable difficulties (“the algorithm wouldn't be
able to get past atomic, large-search-space checks such as …”) may simply,
with a bit of luck, fail to materialise.

Still, quoting from a sentence a few lines down in the article:

“this particular example is often used as a benchmark - and often the most
significant accomplishment - for multiple remarkably complex static analysis
or symbolic execution frameworks”

The author is thinking of backwards-propagation static analysis or symbolic
execution frameworks, for which it is indeed a feat to reverse-engineer the
condition that leads to exploring the possibility that there is a “CDATA” in
the input. Forwards-propagation static analysis needs no special trick to
assume that the complex condition must be taken some of the times and to visit
the statements in that part of the code. The drawback of static analysis
(especially with respect to fuzzing) is then with the false positives that can
result from the fact that a condition was partially, or not at all,
understood.

------
brabbit
I'm not sure I follow, but what is the actual harm of it?

