Hacker News
Pulling JPEGs out of thin air (lcamtuf.blogspot.com)
444 points by atulagarwal on Nov 7, 2014 | 83 comments


If it's smart enough to learn how to build a JPEG in a day, use it with netcat and it could probably send quite a lot of things down in flames.

Who needs static analysis :) ?

Yeah, netcat would be fun. Although, it seems that it also straces (or similar?) the app it tests.

> it triggers a slightly different internal code path in the tested app

This would be impossible on the network.

Mock it locally, exploit it globally. One more reminder why it's useful to turn off your server signatures, especially if they spew out version information.

Oh ho ho no you don't. That's security through obscurity, and that's never ever OK for anybody.

I never understood this attitude. It has always been my experience that obscurity is in fact an important part of security. It's a weakness when mistaken for security, not when understood as part of it.

Sadly, I do actually have signatures (with version information) to mute.

People knee-jerk say that because they assume it's the only thing being done to secure an asset, when obviously it's a valid defense in depth measure, one with very low marginal cost (setting a few variables in conf files).

This really depends. The marginal cost of "what version is this box running, why doesn't this work, oh we don't have that tool?" could be very high on something like that.

I mean, remember that the net is full of slow brute-forcers and the like. Just because it takes a few days to run through all the exploits doesn't mean that someone won't do it - that's thinking of security in human, individual terms, as though the threat is targeted rather than general.

But here is where it comes in handy: If you have your version numbers out there in some database and a 0 day for that particular version hits then you're hacked. If not then you might be able to patch your system before a breach happens. It's no guarantee but since it costs very little and gives you possibly a bit more time when you need it badly it does not hurt. Obviously you need to cross all your other t's and dot the i's too.

And if you need your headers to tell you what versions are running and what tools are installed you are doing something else very wrong.

I don't really understand your comment; it reads as if you're setting up a strange counterargument (one I very much disagree with), and then you agree with the original comment right after that.

>Yeah, netcat would be fun. Although, it seems that it also straces (or similar?) the app it tests.

Ah yes, I forgot this little detail. I wonder if you can get it to work on the local machine first, but talking through a socket instead of stdin.

Then, pipe the result through netcat!

Conceptually the output generation and the trace collection aren't coupled. As long as you can (a) instrument the target to collect traces and (b) programmatically feed it variant inputs, the same technique will work.

(This isn't a new concept, although afl is a particularly tight implementation of it; you can look up the paper for "autodafe" for a (much) earlier version).

> autodafe

That's got to be the funniest and most appropriate name for a piece of software ever.

It's what you oughtn't to do but you do anyway :)

Combine it with a timing attack and possibly it could ;)

Why impossible? (Hint: 127/8 is a network, too.)

This is a no-fun zone :) I thought the comment author meant to release (the kraken) out in the open - to test publicly (or at least on LAN) available services. Or am I missing something?

If you find problems in a local Apache, it's a good bet those may also be problems with some remote Apaches somewhere on the Internet.

And this is how Skynet was born.

The overheads of executing a binary and of doing network traffic are orders of magnitude apart. This synthesized a JPEG in a day; the equivalent operation over a network connection, assuming an always-available fast server, would probably take years.

If you are trying to fuzz a protocol, there is no reason not to test it on a local machine. And it would probably end up _faster_ than the jpeg example because a network request has less overhead than execvp.

Only if you're capable of running the server on the local machine. So yeah, you could fuzz open-source software this way, but that's only going to test the underlying transport protocol, e.g. testing HTTP for nginx. When talking about bringing down services, you presumably need to attack the service itself, and that typically means attacking a server whose code you don't have access to. Open-source services that are run as-is, e.g. databases and the like, usually aren't exposed to the world.

Doesn't have to be open source. Just because you don't have the source or it's not free doesn't mean you can't still fuzz it. Though if some company is running some home-brew solution, then yes. But plenty of people run services based upon technologies that are semi-widely available, even if not open source.

True, you don't need the source, but you do need the service. And my point was that the software you can run is the software providing the underlying support for the service (e.g. HTTP handling with nginx, databases with MySQL, etc.), not the service directly.

At the risk of sounding really stupid: can someone ELI5 what's going on here and why everyone thinks it's so amazing?

It starts with an invalid .jpg (literally a text file containing "hello"), and by trying over and over, changing random bytes and tracing the execution of the decoder program as it is fed the corrupted input, it will drill deeper and deeper into the program until it has gotten far enough that the input is actually a valid .jpg, without any human input.

Fuzzing like this is a very effective technique for finding (security) bugs in programs that parse input, because you will quickly end up with "impossible" input nobody thought to check for (but is close enough that it won't be rejected outright), and whoops there's your buffer overflow.

In this particular case, the fuzzer is going beyond just throwing random input, as it considers which changes to the input trigger new code paths in the target binary, and therefore should have a higher success rate in triggering bugs compared to just trying random stuff. And don't forget, this will work with any type of program and file type, not just .jpgs and the djpeg binary.
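The loop just described can be sketched in a few lines. This is a toy illustration only, not afl's actual algorithm - the parser, the coverage labels, and the mutation strategy are all invented for the example; a real fuzzer gets coverage from compiled-in instrumentation rather than a returned set:

```python
import random

def parse(data):
    """Toy stand-in for djpeg: returns the set of code-path labels the
    input exercised (a real fuzzer observes this via instrumentation)."""
    paths = {"start"}
    if len(data) < 2:
        paths.add("too_short")
    elif data[0] != 0xff:
        paths.add("bad_first_byte")
    elif data[1] != 0xd8:
        paths.add("first_byte_ok")       # got halfway into the magic
    else:
        paths.add("magic_ok")            # "valid", as far as the toy goes
    return paths

def fuzz(seed, rounds=60000, rng=random.Random(1)):
    seen, queue = set(), [bytes(seed)]
    for _ in range(rounds):
        data = bytearray(rng.choice(queue))
        data[rng.randrange(len(data))] = rng.randrange(256)  # flip one byte
        if rng.random() < 0.2:
            data.append(rng.randrange(256))                  # sometimes grow
        paths = parse(bytes(data))
        if paths - seen:                 # new coverage -> recycle as a seed
            seen |= paths
            queue.append(bytes(data))
    return queue, seen

queue, seen = fuzz(b"hello")             # "discovers" the \xff\xd8 magic
```

The key point is the `paths - seen` test: a mutant is kept only if it reached code no earlier input reached, which is what lets single-byte flips climb a multi-byte magic number one step at a time.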

>In this particular case, the fuzzer is going beyond just throwing random input, as it considers which changes to the input trigger new code paths in the target binary, and therefore should have a higher success rate in triggering bugs compared to just trying random stuff.

To expand on this, techniques like this are called whitebox fuzzing (or maybe graybox in afl's case). At their extreme, whitebox fuzzers even incorporate constraint solvers to directly solve for inputs that take the program down previously unexplored paths. One very impressive project is the SAGE whitebox fuzzer [1,2,3] that's in production use at Microsoft (sadly an internal project). I work in the related field of automated test generation, but all my tools are very much research-grade. In SAGE, however, they've done all the work of figuring out how 24/7 whitebox fuzzing can be integrated into the development process. I am somewhat envious of the researchers getting to work in an environment where that is possible. If you're interested, I very much recommend reading the papers on SAGE.

[1] Poster about SAGE: http://research.microsoft.com/en-us/um/people/pg/public_psfi...

[2] An approachable article on SAGE: http://research.microsoft.com/en-us/um/people/pg/public_psfi...

[3] The paper with all the details: http://research.microsoft.com/en-us/projects/atg/ndss2008.pd...

The main problem with SAGE is that at least outside Microsoft, it exists just as a series of (very enthusiastic) papers :-)

So, while I suspect it's very cool, it's also a bit of a no-op for everybody else. It's also impossible to independently evaluate the benefits: for example, its performance cost, the amount of fine-tuning and configuration required for each target, the relative gains compared to less sophisticated instrumented fuzzing strategies, etc.

Great links, thanks.

Why does it use fuzzed input in the first place? Couldn’t one just use random input from the beginning instead? It would be effectively equivalent but fuzzing of a "hello" string seems to be roundabout.

Well, "hello" is pretty random. :) It was probably just used for dramatic effect in the demo, and you have to start with something - of course even a 0 byte file would be enough.

You could also have started with a valid .jpg with lots of complicated embedded exif metadata sections etc., and have a good chance of triggering bugs in those code paths without having to "discover exif" first.

From the article: it works without any special preparation: there is nothing special about the "hello" string.

He said it took a day to find good jpg images. If you started the program with a valid input, then it would take much less time to explore the other code paths.

In this case "hello" was just a pseudorandom starter to seed the fuzzer.

This is how I understand it:

1) JPEG file structure is complex (much more so than a BMP file, for example).
2) Imagine you didn't know what it looked like, but had a tool that could "read" them - djpeg in this case.
3) What the fuzzer does is "peek" at how the djpeg tool reacts to random "fuzzed" data. It's smart about it, understanding when a bit of data makes the tool do something new (a new code path).
4) After millions of iterations it "learns" how the JPEG file structure works and can generate valid JPEGs.

This might not be very relevant for JPEGs, given that you can just Google their structure. This is cool because it shows how the tool could figure out (or find a weakness in) something where you don't know how it works.

It doesn't do #4; it just finds one input that makes the program exit with exit code zero (for success)

It is better to look at it as a maze solver that tries to generate instructions for a robot that will lead that robot through the maze, while that robot has its own weird way of interpreting the instructions.

This describes a program (afl) which is given another program (djpeg, some kind of JPEG parser - takes arbitrary data and decides if its JPEG or not) to play with.

afl starts feeding it random inputs gradually gaining limited understanding about what it accepts and what it does not.

Eventually, it gathers enough info (based on strace'ing, google it, and its output) to start pulling JPEG images out of thin air like it's nothing.

The catch is, that this afl program learns this all by itself, and it can do so with any other parser program (like ELF parsers, GIF parsers, whatever). It is a general purpose tool.

Also, similar concept can be applied to things like web servers, or any other servers (since they parse what clients send them).

It works kind of like http://reverseocr.tumblr.com/ does, but it monitors code paths in the "recognizer" until the software doing the recognition goes to where you want it. It's like a genetic-algorithm-meets-code-path-testing way of exercising your code.

>> Can someone ELI5 what's going on here...


I remember a very similar technique being used successfully for automatically cracking software (registration keys/keyfiles, serial numbers) before Internet-based validation and stronger crypto became common; the difference is that method didn't require having access to any source code or recompiling the target, as it just traced execution and "evolved" itself toward inputs producing longer and wider (i.e. more locations in the binary) traces.
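That "evolve toward longer traces" idea can be sketched in miniature. This is a toy illustration with an invented serial format, not any actual cracking tool from the era; here "trace length" is simply the number of comparisons the check executes before bailing out, standing in for the number of locations an execution tracer would see:

```python
import random

def check_serial(serial, secret="AB12-CD34"):
    """Toy registration check. Returns (steps, ok), where steps is the
    number of comparisons executed - the 'trace length' an external
    tracer would observe growing as more of the serial is right."""
    steps = 0
    for got, want in zip(serial, secret):
        steps += 1
        if got != want:
            return steps, False
    return steps, len(serial) == len(secret)

def evolve(alphabet="ABCD1234-", rng=random.Random(0)):
    """Hill-climb: keep any single-character mutation that makes the
    observed trace longer, until the check accepts."""
    best = "A" * 9
    best_steps, ok = check_serial(best)
    while not ok:
        cand = list(best)
        cand[rng.randrange(len(cand))] = rng.choice(alphabet)
        cand = "".join(cand)
        steps, ok = check_serial(cand)
        if steps > best_steps or ok:
            best, best_steps = cand, steps
    return best
```

Because the check bails at the first wrong character, trace length leaks how much of the prefix is correct, and the search recovers the serial character by character instead of brute-forcing the whole space.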

That sounds interesting, do you have any more information on this topic?

The author of this article is a hacker from the time when the word hacker meant something different than it does today. I remember his website from my early teens, when I started using the internet via a dial-up connection back in 1998. Lcamtuf, glad to see you're still around. Your fellow countryman.

Potential instructions for trying this on Mac (I was unable to make it work, perhaps we can build upon this):

curl -LO http://lcamtuf.coredump.cx/afl.tgz

tar zxvf afl.tgz

rm afl.tgz

cd afl*

make afl-gcc

make afl-fuzz

mkdir in_dir

echo 'hello' >in_dir/hello

# there is a glitch with the libjpeg-turbo-1.3.1 configure file that makes it difficult to compile on Mac, so I tried regular libjpeg:

curl -LO http://www.ijg.org/files/jpegsrc.v8c.tar.gz

tar zxvf jpegsrc.v8c.tar.gz

cd jpeg-8c/

CC=../afl-gcc ./configure


# error: C compiler cannot create executables

# if the above command worked to build an instrumented djpeg, then this should work

cd ..

./afl-fuzz -i in_dir -o out_dir ./jpeg-8c/djpeg


Install homebrew if you don't have it already, then

   brew install gcc
Then in the afl* folder:

   CC=gcc-4.9 make clean all
Fixes this so that jpeg-8c will compile.

However, we then get stuck, as djpeg is a shell file (and .libs/djpeg exits with error 5), and I got a bit distracted before continuing. Good luck!



>if (strcmp(header.magic_password, "h4ck3d by p1gZ")) goto terminate_now;

How impossible would it be to look at the branching instruction, perform a taint analysis on its input, and see if there is any part of the input we can tweak to make it branch/not branch? Like: we jumped because the zero flag was set. And the zero flag was set because these two bytes were equal. Hmm, that byte is hardcoded. This other byte was mov'd here from that memory address. That memory address was set by this call to fread... hey, it came from this byte in the input file.
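The chain of reasoning above can be shown in miniature. This is illustrative only - real taint engines work at the instruction level (e.g. on top of pin or Valgrind); the class and function names here are invented for the example:

```python
class Tainted:
    """A byte value that remembers which input-file offset it came from."""
    def __init__(self, value, offset):
        self.value, self.offset = value, offset

def read_input(data):
    # Like fread, but every byte is labeled with its source offset.
    return [Tainted(b, i) for i, b in enumerate(data)]

def diagnose_magic(tainted, expected=b"\xff\xd8"):
    """Compare tainted bytes against a hardcoded constant; on mismatch,
    report exactly which input offset to patch, and with what value."""
    return [(t.offset, want)
            for t, want in zip(tainted, expected)
            if t.value != want]

data = bytearray(b"hello")
for offset, want in diagnose_magic(read_input(data)):
    data[offset] = want          # patch the bytes the branch depends on
# data now begins with \xff\xd8 and would take the other branch
```

Instead of randomly flipping bytes and hoping, the taint labels point straight at the input offsets the comparison actually consumed.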


Quite possible. More commonly done with higher-level languages rather than machine code, but certainly possible with machine code. A good fuzzer could do this too.

The fuzzer from the article, american-fuzzy-lop (https://code.google.com/p/american-fuzzy-lop/), does something similar to this as it moves forward in execution, trying to find interesting inputs that cause the program to take a different code path. Symbolic execution could accelerate that process, allowing afl to immediately identify the relevant things to fuzz, rather than randomly mutating and looking for interestingness. On the other hand, unless the program in question runs very slowly, or uses many complex compound instructions before a single conditional branch, random mutation seems likely to produce results rapidly from sheer speed.

Symbolic execution does seem like it would work well if you want to reach a specific point in the program, and you have rather complex conditionals required to get there. But it would still have trouble with complex cases. Consider a file format with a SHA256 hash in it, where the header must have a valid hash to parse. Symbolic execution would have a very hard time figuring out the input relationship required to get past that hash check.
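To make that wall concrete, here is a toy format invented for the example: a 32-byte SHA-256 digest followed by the body. Fixing the body breaks the digest and vice versa, so no sequence of local mutations ever gets past the check:

```python
import hashlib
import random

def parse(blob):
    """Toy format: 32-byte SHA-256 of the body, then the body.
    The digest comparison is the wall symbolic execution hits."""
    digest, body = blob[:32], blob[32:]
    if hashlib.sha256(body).digest() != digest:
        return "rejected"
    return "parsed"

body = b"payload"
good = hashlib.sha256(body).digest() + body
assert parse(good) == "parsed"

# Any single-byte change fails: touch the digest and it no longer
# matches the body; touch the body and its digest changes instead.
rng = random.Random(0)
for _ in range(1000):
    mutant = bytearray(good)
    mutant[rng.randrange(len(mutant))] ^= 1 + rng.randrange(255)
    assert parse(bytes(mutant)) == "rejected"
```

A solver asked to satisfy `sha256(body) == digest` for a chosen body is effectively being asked to invert the hash, which is exactly what it cannot do.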

Yea I thought of hashes too. Because there are hashes proven (?) to be secure, it follows that it's impossible to make a universally efficient fuzzer (i.e. one that necessarily spends much less than ~exp(parser size) time).

There are no hashes that are proven to be secure. And we aren't likely to get such a proof any time soon: secure hashes can only exist if P != NP.

> But it would still have trouble with complex cases.

It seems to me that all these methods would eventually run into the Halting Problem. Trying to fuzz through a hash (or other crypto) this way would essentially involve having to break it by a slightly more "intelligent" version of bruteforce.

I guess one solution could be to give up if the computation becomes too complicated.

>The left side was computed by summing this and this. That was in turn computed by xoring that and... Screw it. The left side can not be controlled. Now, the right side was loaded from this part of the file. Aha! Let's just change that part instead.

I'm no expert but perhaps this symbolic engine could be something to build as a valgrind module.

...or just preload a special implementation of strcmp that makes notes of its inputs.
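That trick, sketched in miniature: wrap the comparison, record its operands, then replay the constant the program wanted. Illustrative Python only - a real interposer would be an LD_PRELOAD'ed C implementation of strcmp:

```python
observed = []

def logged_strcmp(a, b):
    """Stand-in for a preloaded strcmp that makes notes of its inputs."""
    observed.append((a, b))
    return (a > b) - (a < b)     # strcmp-style: 0 means equal

def target(header):
    # The check from the quoted snippet, via the logging comparison:
    if logged_strcmp(header, "h4ck3d by p1gZ") != 0:
        return "terminate_now"
    return "accepted"

target("hello")                  # first attempt fails...
magic = observed[-1][1]          # ...but the note reveals the constant
assert target(magic) == "accepted"
```

This is essentially how comparison-logging shortcuts magic-string discovery: the fuzzer never has to guess the string byte by byte, because the program hands it over on the first failed compare.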

See also: Microsoft Code Digger [1], which generates inputs using symbolic execution for .net code, and EvoSuite, which uses a genetic algorithm to do the same for Java [2].

[1] : http://blogs.msdn.com/b/nikolait/archive/2013/04/23/introduc...

[2] : http://www.evosuite.org

I like to imagine that given enough time it eventually generates the Lenna [1] jpeg and exits

[1] : https://en.wikipedia.org/wiki/Lenna

I had a brief 'It's alive :O' moment when reading this; imagine seeing a face looking at you in one of those pics :)

Nice article - the concept of fuzzers was new to me.

I'm not familiar with how the fuzzer was monitoring the executed code path. Would this be thwarted by address space layout randomization?

You build the target binary using a special version of gcc:

    3) Instrumenting programs for use with AFL

    Instrumentation is injected by a companion tool called afl-gcc. It is meant to
    be used as a drop-in replacement for GCC, directly pluggable into the standard
    build process for any third-party code.


    The correct way to recompile the target program will vary depending on the
    specifics of the build process, but a common approach may be:

    $ CC=/path/to/afl/afl-gcc ./configure
    $ make clean all


No. afl requires an instrumented (compiled with extra information) executable, and watches the code paths. When fuzzing a seed finds a new code path, it will recycle that fuzzed version as a new seed.

ASLR can help prevent successful exploitation of bugs that afl might find, but it won't prevent the program from crashing in the first place.

(Plus, since afl requires compiling the binary, I doubt it bothers to enable ASLR. There's no benefit for fuzzing purposes.)

Can you fuzz an uninstrumented executable to add instrumentation?

You can add instrumentation to binaries using DynamoRIO or pin. This isn't currently supported by afl-fuzz out of the box, although there's nothing that makes it fundamentally difficult.

It's probably using OS debugging APIs for tracing/code coverage and/or instrumenting the code during load directly (there's a mention of "lightweight assembly-level instrumentation"). ASLR is unlikely to be an issue for debuggers.

Sounds very much like a genetic algorithm/evolutionary computation.

Sure, it's even called that on the project page :-) It just uses an interesting fitness function that knows nothing about the underlying data format - essentially, "improve the edge coverage in this black-box binary".

Wow, two awesome ideas in a week. Reminds me of this posted just a couple days ago http://reverseocr.tumblr.com/

Now to try this with midi...

But what to feed it into? I could make some musical analysis stuff, but do I need to write it in C to avoid accidentally fuzzing my interpreter?

I'm running it now on an mp3 encoder. So far, no results, but I'll update if I get anything out of it.

Encoder? You'd probably want to try a decoder as the target binary if you want to make MP3s.

Clojure repl + Overtone?

OT: Is there a simple, little fuzzer that just uses grammars as templates for their outputs?

Yes, lots and lots and lots of them. Peach is the most popular example.

"a simple, little fuzzer"

The download for Peach is 190MB??

> You can throw afl-fuzz at many other types of parsers with similar results: with bash, it will write valid scripts;

^ that seems fun, I just don't think I would run it on my machine for fear of what it might create (oh.. rm -rf * ok!)

The beginning of this article reads eerily similar to the beginning of Greg Egan's Diaspora, though in a much more limited context.

What if we feed afl a program that checks whether a number is prime? Will it slowly discover a way to make primes?

This is awesome .. "Go Away or I will Replace You With a Fuzz" seems like my next t-shirt order ..

This what a hacker be.

This is totally amazing! Wondering if it would be possible to go the other way around: from a generated JPG to a string. If yes, what a cool way to send your password as a... JPG over email.

The original string is not in any way part of the image that's generated. The fuzzer notices that the initial codepath being triggered with the "hello" file would be different if the first byte were 0xff, instead of 0x68. So it changes the file and tries it. The 'h' has gone - it wouldn't matter what it was originally, the fuzzer would always have chosen 0xff.

All the fuzzer is doing is exploring the possible codepaths through the application trying to exercise all the code; many of the codepaths end up with the executable outputting an error message and terminating. Some maybe put it into an infinite loop. Some end up with it completing a JPEG data parse and terminating - so in amongst all the possible paths it explores, of course it will eventually seek out input sequences which bring that about.

Thank you very much for explaining that! So it is really a quasi-random image generator with the initial string being a seed?

No, the original string is not really a seed, and it's not really quasi-random. It's highly deterministic based on the structure of the program under test, and to a far lesser extent the original seed. In this case, I would be very surprised if the original bytes of 'hello' had any impact whatsoever on the first valid JPEG image it finds.

Along similar lines, I wonder if this fuzzer can be used to bruteforce passwords for applications. Would it do any better than standard "try all the combinations" method?

If the password is hashed with a sane KDF - bcrypt/scrypt/pbkdf2 - no.

If it's not, better attacks exist than trying every single password.

If you're trying to crack the application - not the password - maybe, but I kinda doubt it.

Not really, because it depends on collecting traces from the target, and if you can do that you can usually just read the password out of memory.

On the flip side, it could probably be used as a really slow universal keygen for naive license-key implementations :)
