

File Content Histograms - gus_massa
http://www.cutawaysecurity.com/blog/file-content-histograms

======
wrwetzel
I did something similar many years ago for identifying bit order and sense in
mu-law encoded audio files as part of a larger audio-file display program.
There were four combinations possible (big-endian, little-endian combined with
bit-normal, bit-inverted) and no consistent standard was adhered to by the
various organizations supplying data.

Although developed specifically for mu-law data I found it quite helpful for
identifying other file types as well, just as the author here described.

Bill

------
sheffield
English vs. Russian vs. Japanese text (UTF-8)

<http://imgur.com/a/vwoIF>
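
For anyone who wants to reproduce the effect without plotting anything, here is a minimal sketch. The sample sentences and the `top_bytes` helper are my own illustration, not taken from the linked images. ASCII English stays below 0x80, basic Russian letters encode with lead bytes 0xD0/0xD1, and hiragana encode with lead byte 0xE3, so the most frequent bytes already separate the three.

```python
from collections import Counter

def top_bytes(text: str, n: int = 3) -> list:
    """Return the n most frequent byte values of text encoded as UTF-8."""
    counts = Counter(text.encode("utf-8"))
    return [value for value, _ in counts.most_common(n)]

# Illustrative sample sentences (assumptions, not from the linked post).
samples = {
    "English": "The quick brown fox jumps over the lazy dog",
    "Russian": "Съешь ещё этих мягких французских булок",
    "Japanese": "いろはにほへと ちりぬるを",
}

for name, text in samples.items():
    print(name, [f"0x{b:02X}" for b in top_bytes(text)])
```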

------
nantes
I am fascinated by the TrueCrypt file histogram. I'd be interested to see a
histogram of an encrypted partition or whole file system.

Hmm, may not get any work done this afternoon.

~~~
jannes
All binary file formats should be randomly distributed along the spectrum if
their file size is large enough.

Also, I am not sure that this sort of diagram would be suitable for telling
certain file types apart. There is probably always a better way to do that.

You should notice that the histogram of a zip file, for example, should level
out (like the truecrypt histogram does) if the file were as big as the
truecrypt file.

The look of these histograms is largely influenced by the file size, so I
think only the peaks are worth interpreting.
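
A rough way to check the file-size effect, assuming `os.urandom` output as a stand-in for encrypted data (the `relative_spread` helper is hypothetical, just for illustration): the gap between the fullest and emptiest histogram bins, relative to the mean bin height, should shrink as the sample grows.

```python
import os
from collections import Counter

def relative_spread(data: bytes) -> float:
    """(max count - min count) / mean count over all 256 byte values.

    Assumes data is non-empty.
    """
    counts = Counter(data)
    per_value = [counts.get(b, 0) for b in range(256)]
    mean = len(data) / 256
    return (max(per_value) - min(per_value)) / mean

# Larger random samples should give a visibly flatter histogram.
for size in (1 << 10, 1 << 14, 1 << 20):
    print(size, round(relative_spread(os.urandom(size)), 3))
```

This only covers uniformly random bytes. It says nothing about whether a large zip file would level out the same way.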

~~~
wladimir
_All binary file formats should be randomly distributed along the spectrum if
their file size is large enough_

Why would that be the case?

\- Executables: some opcodes are more common than others, and executables
usually contain text/messages

\- Zip files / compressed images / mp3s: compression might have a distinct
"signature" that has peaks, for example block headers

\- Database files: not all values are equally common in records (probably
peaks around zero, because of fixed-size integers and string padding)

And so on...

~~~
jannes
When I wrote that previous comment I assumed that most binary data would
probably use an encoding whose units are longer or shorter than 8 bits (the
unit examined here). But thinking about it, that might not be true. I don't know.

------
mrspeaker
Very interesting... but what does it mean? Is it just a method of determining
a file's type from its histogram?

~~~
wladimir
Yes... for example, it shows how easy it is to recognize encrypted (or random)
data, as it is too uniformly distributed to be anything else.
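
One way to put a number on "uniformly distributed" is the Shannon entropy of the byte histogram, in bits per byte. This is a minimal sketch of that idea, not necessarily what the article does; `byte_entropy` is a name I made up, and `os.urandom` stands in for encrypted data.

```python
import math
import os
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, from 0 to 8 bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

random_like = os.urandom(1 << 16)  # stand-in for encrypted data
text_like = b"the quick brown fox jumps over the lazy dog " * 1500

print("random:", byte_entropy(random_like))
print("text:  ", byte_entropy(text_like))
```

Values close to 8 mean the histogram is nearly flat. Well-compressed data also tends to score high, so a high value alone does not prove encryption.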

------
iwwr
The link is dead, anyone got a mirror?

~~~
nantes
It was up for me, but here is a mirror:

[http://www.cutawaysecurity.com.nyud.net/blog/file-content-
hi...](http://www.cutawaysecurity.com.nyud.net/blog/file-content-histograms)

~~~
iwwr
Can you post it somewhere else or pastebin?

------
etherealG
does anyone know if this is reliable for guessing file type? how would you
code a file type guesser using this technique?

perhaps relative distributions would be close enough to the "general pattern"
for a particular file type to be guessed?
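
Something like this is what I have in mind, as a rough sketch only. The helper names and the toy profiles are hypothetical; a real guesser would need reference histograms built from many sample files per type, and I have no idea how well it would hold up.

```python
from collections import Counter

def relative_histogram(data: bytes) -> list:
    """Fraction of the data taken up by each of the 256 byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid dividing by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]

def distance(h1: list, h2: list) -> float:
    """L1 distance between two relative histograms: 0 (identical) to 2 (disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def guess_type(data: bytes, profiles: dict) -> str:
    """Return the name of the reference profile closest to the data's histogram."""
    h = relative_histogram(data)
    return min(profiles, key=lambda name: distance(h, profiles[name]))

# Toy reference profiles (assumptions for illustration only).
profiles = {
    "text": relative_histogram(b"hello world " * 100),
    "zero-padded": relative_histogram(bytes(1000)),
}

print(guess_type(b"some english words here", profiles))
print(guess_type(bytes(50), profiles))
```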

~~~
sp332
It's not reliable, but it could work for certain filetypes. Notepad uses it
for Unicode detection. Try putting "this app can break" into Notepad, save it
and reopen it :)
[http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235...](http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx)

