Hacker News
This file's a Win Executable, PDF, Java executable (or Python script), and HTML (code.google.com)
198 points by Swizec on Aug 1, 2012 | 20 comments

Anyone more knowledgeable in assembly and file formats care to expand on this:

>It serves no purpose, except proving that files format not starting at offset 0 are a bad idea

What exactly does it mean to start at offset 0 and why don't these file formats do that? Is there an advantage in not starting at offset 0 or is it simply oversight/indifference? Any kind of background on the problem would be appreciated, I'm really quite intrigued.

Every major file type (or nearly every, anyway) has a set of signature bytes, a "magic number" or something equivalent that identifies it as being of that type. This lets programs identify what kind of object a file represents without requiring this information to be supplied by the user.

Most file types have this magic signature as the initial few bytes of the file. For example, a Windows executable always begins with the ASCII characters "MZ".

The point is that when the magic signatures of two formats don't have to occupy the same bytes, a single file can be simultaneously identified as more than one type.

File format trivia:

"MZ" are the initials of Mark Zbikowski, one of the developers of MS-DOS. :)


I'm not an expert on file formats so I looked into Wikipedia. Here's what it says on PNG[1]:

  A PNG file starts with an 8-byte signature.
  The hexadecimal byte values are 89 50 4E 47 0D 0A 1A 0A;
  the decimal values are 137 80 78 71 13 10 26 10. 
So if a file starts with 89 50 4E 47 0D 0A 1A 0A, you know it may be a valid PNG, otherwise you know it's not.

GIF starts with another marker at zero offset, so no valid GIF is a valid PNG, and vice versa.
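To make the "fighting for the first bytes" idea concrete, here is a minimal Python sketch. The PNG signature is the one quoted above; the GIF signatures ("GIF87a" / "GIF89a") and the function name sniff_image_type are my additions, not something from the thread. A match only means the file may be that format, not that it is valid.

```python
# PNG signature as quoted from Wikipedia above.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
# GIF signatures are an addition for illustration, not taken from the thread.
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def sniff_image_type(data: bytes) -> str:
    """Guess the type from the bytes at offset 0 (hypothetical helper)."""
    if data.startswith(PNG_SIGNATURE):
        return "png"   # may be a valid PNG; the rest still has to parse
    if data.startswith(GIF_SIGNATURES):
        return "gif"
    return "unknown"


png_like = PNG_SIGNATURE + b"...rest of file..."
gif_like = b"GIF89a" + b"...rest of file..."
print(sniff_image_type(png_like), sniff_image_type(gif_like))
```

Since both checks look at the same bytes at offset 0, no input can satisfy both at once.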

Some formats are mutually exclusive because they “fight” for contents of first several bytes.

Some formats are more relaxed, which opens up the possibility (exploited here) of carefully engineered ambiguity.

edit: removed a section that was utterly wrong

[1]: http://en.wikipedia.org/wiki/Portable_Network_Graphics

It's a little more complicated than that, actually. Any given application of a file format may use various obfuscation techniques on the file's header or contents that render the file invalid from the perspective of the published standard (if there is one; it is also common in these cases to change the file extension to further disguise what format the file actually uses). Programs that do this may or may not de-obfuscate the file prior to use, depending largely on how and why the file was obfuscated.

For instance, a common obfuscation method is simply removing the magic number from the file; in this case, the program may simply try to use the file as the given format and return an error (or crash; we are talking largely about proprietary software in these cases after all) if the file can't be read.

When a file format starts at offset 0, it simply means that it starts at the first byte of the file.

Other than that, I can't provide any information on file formats allowed to start at offsets other than 0, or why this may or may not be a good idea (I suppose maybe it would allow an enterprising programmer to hide a malicious file by embedding it in an otherwise-innocuous format?), though I am certainly curious as well.

I think you're on to the right answer (though I don't know for sure myself).

It seems to me that if all file format identifiers started at the zero offset, it would be impossible for a single file to identify as more than one format. However, when different formats use different offsets to identify themselves, it is possible to construct the file in such a way that it validly identifies as more than one format.
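A rough Python sketch of that reasoning, under assumptions that are mine rather than the thread's: that a PDF reader accepts "%PDF-" anywhere in the first 1024 bytes, and that a ZIP reader looks for the end-of-central-directory marker "PK\x05\x06" near the end of the file. The function name candidate_formats and the sample bytes are made up for illustration, and the sample is not a working file of any of these types.

```python
def candidate_formats(data: bytes) -> list:
    """List formats whose signature checks pass (hypothetical helper)."""
    found = []
    # EXE: signature must sit at offset 0.
    if data[:2] == b"MZ":
        found.append("exe")
    # PDF: assumed here to be accepted anywhere in the first 1024 bytes.
    if b"%PDF-" in data[:1024]:
        found.append("pdf")
    # ZIP: end-of-central-directory record is 22 bytes plus a comment of
    # up to 65535 bytes, so only the tail of the file is searched.
    tail_start = max(0, len(data) - 22 - 65535)
    if data.rfind(b"PK\x05\x06", tail_start) != -1:
        found.append("zip")
    return found


# Fake bytes that merely satisfy all three signature checks.
sample = b"MZ" + b"\x00" * 62 + b"%PDF-1.4\n" + b"PK\x05\x06" + b"\x00" * 18
print(candidate_formats(sample))
```

Because each check looks at a different region of the file, nothing stops all three from passing at once.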

I've seen files distributed on 4chan before via a .rar file embedded in an image.

That's kind of a different issue, though. My understanding is that .jpeg allows an unlimited-size footer and .rar allows an unlimited-size header. It gets similar results, though.

A lot of archive formats start at the end because you don't know what is going to be written beforehand. But there is very little reason not to have magic bytes at either the very start or end of a file.

Did you test how different antivirus programs respond to this?

Here you go https://www.virustotal.com/file/1fc14ab461828afd34f92c69e34d...

Edit: someone posted results for .exe file inside the .zip, which are a bit different (it seems like some antiviruses don't try to unpack it?), but then deleted the comment. Here's the link for .exe: https://www.virustotal.com/file/2a9c7a16cdb3c3f2285afaf61072...

Given what it's doing and how it's doing it, the virus alerts listed are understandable, and if anything I'd have to say kudos to Panda AV for being the most honest about it. Breaking the PE and CRC checksum aspects would probably get it flagged, as it has been by some, and the HTML/EXE flagging also makes sense after reading through how it works.

Still, impressive stuff, and given the use of undocumented opcodes and x86 foo, it does raise a new question:

Given that some VMs will fail on some of these instructions where bare metal would run them, is it possible to have a virus that will only trigger on bare metal, or only on VMs, through the use of undocumented opcodes and the like?

Nonetheless, a wonderful definition of hacking in its truest sense, and educational on undocumented opcodes and on how, for some things, you can't beat pure assembly for fun and jollies.

My corporate proxy chokes on it too.

An error occurred while performing an ICAP operation: File decompression/decode error; File: CorkaMIX.zip; Sub File: No file name available; Vendor: Kaspersky Labs; Engine error code: 0x00050000; Engine version:; Pattern version: 120801.124000.8311194; Pattern date: 2012.08.01 12:40:00

It being an .exe and a JAR file doesn't surprise me at all. JAR files follow the ZIP format, and self-extracting ZIP files have always worked by being simultaneously a valid EXE and ZIP file.
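A small Python sketch of why that works, as far as the ZIP side goes: the standard zipfile module finds the archive from the end of the file, so bytes prepended at the front don't stop it from being read. The 64-byte "MZ" prefix below is a fake placeholder and not a runnable EXE stub, and build_prefixed_zip is a made-up helper name.

```python
import io
import zipfile


def build_prefixed_zip(prefix: bytes) -> bytes:
    """Build a tiny ZIP in memory and glue arbitrary bytes in front of it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "howdy")
    return prefix + buf.getvalue()


# Placeholder prefix that only mimics the start of an EXE (assumption:
# a real self-extractor would put an actual executable stub here).
blob = build_prefixed_zip(b"MZ" + b"\x00" * 62)

# The same bytes start with "MZ" and still open as a ZIP archive.
with zipfile.ZipFile(io.BytesIO(blob)) as zf:
    names = zf.namelist()
    content = zf.read("hello.txt")

print(blob[:2], names, content)
```

A JAR is read the same way, which is why one file can pass as both.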

You could make this a valid Ruby script without the "extra byte" problem that comes with making it a Python script.

Why is this a bad thing and not a good thing?

Below is a valid program in:

* perl
* ruby
* python2
* python3
* lua

In fact, they all return the same result![1]

== the program ==

print ("howdy")

[1] visually. If you ignore the newline.

.rar also has this issue.

What other formats don't need to start at offset 0?

It's not that RAR doesn't need to start at offset 0; it's that a self-extracting RAR can also be an EXE, and WinRAR accepts files like these.
