It's a trick I stole from ext2, and simplified. In that filesystem there are three bitsets: one for reading, one for writing, one for fsck. If you don't understand a bit you can't do that action.
For most protocols there's only reading and writing, so you can use odd bits to mean "backwards-compatible feature, you can still read even if you don't understand it" and even bits to mean "stop, we broke compat".
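The ext2-style scheme described above could be sketched like this (the flag values and names here are made up for illustration, not ext2's real feature bits):

```python
# Feature masks this implementation understands (illustrative values).
COMPAT_KNOWN    = 0b0011  # safe to ignore if unknown
RO_COMPAT_KNOWN = 0b0001  # unknown bits here: read-only access is still safe
INCOMPAT_KNOWN  = 0b0001  # unknown bits here: refuse to touch the data at all

def can_mount(compat, ro_compat, incompat, read_only=False):
    if incompat & ~INCOMPAT_KNOWN:
        return False          # unknown incompat feature: hands off entirely
    if not read_only and (ro_compat & ~RO_COMPAT_KNOWN):
        return False          # unknown ro_compat feature: only read-only is safe
    return True               # unknown compat features never block access
```

The point of the three-way split is that old implementations degrade gracefully: they ignore what's harmless, fall back to read-only when writing would be dangerous, and refuse entirely only when reading itself would be wrong.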
That's a good idea for filesystems. But OpenTimestamps Proofs aren't really "written to". They're created, and then later validated. Also, being cryptographic proofs, my philosophy is the validator should almost always understand them 100%, or not at all, to avoid any false proofs.
That's also why I picked a binary encoding: it's difficult to parse an OTS proof incorrectly. An incorrect implementation will almost always fail to parse the proof at all, with a clear error, rather than silently parse the proof incorrectly.
I think the idea of all of these is to make sure the file isn't recognised as text (which doesn't allow nulls), ASCII (which doesn't use the high bit), or UTF-8 (which doesn't allow invalid UTF-8 sequences).
Basically so that no valid file in this binary format will be incorrectly misidentified as a text file.
I think preventing the opposite is more pressing. Imagine creating a text file and it just so happens that the first 8 characters match a magic number of an image format. Now when you go back to edit your text file it is suddenly recognized as an image file by your file browser.
Git interprets a zero byte as an unconditional sign that a file is a binary file [0]. With other “nonprintable” characters (including the high-bit ones) it depends on their frequency. Other tools look for high bits, or whether it’s valid UTF-8. PDF files usually have a comment with high-bit characters on the second line for similar reasons.
These recommended rules cover various common ways to check for text vs. binary, while also aiming to ensure that no genuine text file would ever accidentally match the magic number. The zero-byte recommendation largely achieves the latter (if one ignores double/quad-byte encodings like UTF-16/32).
The high-bit rule is pretty ancient by now. I don't think we have transmission methods that aren't 8-bit clean anymore. And if your file detector detects "generic text" before any more specialized detections (like "GIF87a"), and thus treats everything that starts with ASCII bytes as "generic text", then sorry, but your detector is badly broken.
There's no reason for the high-bit "rule" in 2025.
I would argue the same goes for the 0-byte rule. If you use strcmp() in your magic byte detector, then you're doing it wrong
The zero byte rule has nothing to do with strcmp(). Text files never contain 0-bytes, so having one is a strong sign the file is binary. Many detectors check for this.
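The heuristic Git and many other tools use boils down to something like this (a simplification: Git's real check also counts other non-printable bytes, as mentioned above):

```python
def looks_binary(data: bytes, sniff_len: int = 8000) -> bool:
    """Sketch of the NUL-byte heuristic: any zero byte in the
    first chunk of the file is a strong sign it's binary."""
    return b'\x00' in data[:sniff_len]
```

This is exactly why putting a zero byte early in your magic number is so effective: every NUL-based sniffer immediately classifies the file as binary.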
that might be true for ASCII but there are other text encodings out there
And again, if a detector doesn't check for the more specific matches first, before falling back to "ah, that seems to be text", then the detector is broken
Wikipedia has a good explanation why the PNG magic number is 89 50 4e 47 0d 0a 1a 0a. It has some good features, such as the end-of-file character for DOS and detection of line ending conversions. https://en.wikipedia.org/wiki/PNG#File_header
That is unfortunate. Not enough standards have rationale or intent sections.
On the one hand, I sort of understand why they don't: "If it's not critical and load-bearing to the standard, why is it in there? It's just noise that will confuse the issue."
On the other hand, it can provide very important clues as to the why of the standard, not just the what. While the standards authors understood why they did things the way they did, many years later when we read it often we are left with more questions than answers.
At first I wasn't sure why it contained a separate Unix line feed when you would already be able to detect a Unix to DOS conversion from the DOS line ending:
0D 0A 1A 0A -> 0D 0D 0A 1A 0D 0A
But of course this isn't to try and detect a Unix-to-DOS conversion, it's to detect a roundtrip DOS-to-Unix-to-DOS conversion:
Unix2dos is idempotent on CRLF, it doesn’t change it to CRCRLF. Therefore converted singular LFs elsewhere in the file wouldn’t be recognized by the magic-number check if it only contained CRLF. This isn’t about roundtrip conversion.
It's also detecting when a file on DOS/Windows is opened in "ASCII mode" rather than binary mode. When opened in ASCII mode, "\r\n" is automatically converted to "\n" upon reading the data.
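The checks described above are easy to sketch: compare the first bytes of the file against the intact PNG signature and against the two mangled variants (unix2dos leaves the CRLF alone but expands the trailing lone LF; an "ASCII mode" read collapses CRLF to LF). The diagnostic strings here are my own, not from any spec:

```python
PNG_SIG = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

def diagnose(header: bytes) -> str:
    if header.startswith(PNG_SIG):
        return "ok"
    # unix2dos-style: lone trailing LF became CRLF, existing CRLF untouched
    if header.startswith(b'\x89PNG\r\n\x1a\r\n'):
        return "LF -> CRLF conversion"
    # dos2unix / ASCII-mode read: CRLF collapsed to LF
    if header.startswith(b'\x89PNG\n\x1a\n'):
        return "CRLF -> LF conversion"
    return "not a PNG"
```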
I can count the number of times I've had binary file corruption due to line ending conversion on zero hands. And I'm old enough to have used FTP extensively. Seems kind of unnecessary.
“Modern” FTP clients would auto-detect whether you were sending text or binary files and thus disable line-ending conversion for binary.
But go back to the 90s and before, and you’d have to manually select whether you were sending text or binary data. Often these clients defaulted to text and so you’d end up accidentally corrupting files if you weren’t careful.
And, if you were using a Windows client talking to a Unix server, you didn't want to get a text file in binary mode, since most programs at the time couldn't handle Unix line endings. This is much better nowadays, to the point that it rarely matters on either side of the platform divide which type of line endings you use.
> I don't understand why the default would be anything but "commit the file as is"
Because it’s not uncommon for dev tools on Windows to generate DOS line endings when modifying files (for example, when adding an element to an XML configuration file, all line endings of the file may be converted when it is rewritten out from its parsed form), and if those were committed as-is, you’d get a lot of gratuitous changes in the commit, and also complaints from the Unix users.
For Git, the important thing is to have a .gitattributes file in the repository with “* text=auto” in it (plus more specific settings as desired). The text/binary auto-detection works mostly fine.
Up until just a few years ago, Notepad on Windows could not handle Unix-style line endings. It probably makes sense now to adopt the as-is convention, but for a while, it made more sense to convert when checking out, and then to prevent spurious diffs, convert back when committing.
Line endings between Windows and Unix-like systems were so painful that when I started development on my shell scripting language, I wrote a bunch of code to allow Linux to handle Windows files and vice versa.
Though this has nothing to do with FTP. I’d already abandoned that protocol by then.
The magic file (man magic / man file) is a neat one to read. On my Mac, this is located in /usr/share/file/magic/ while I recall on a unix distribution I worked on it was /etc/magic
The file itself has a format that can test a file and identify it (and possibly more useful information) that is read by the file command.
# Various dictionary images used by OpenFirmware FORTH environment
0 lelong 0xe1a00000
>8 lelong 0xe1a00000
# skip raspberry pi kernel image kernel7.img by checking for positive text length
>>24 lelong >0 ARM OpenFirmware FORTH Dictionary,
>>>24 lelong x Text length: %d bytes,
>>>28 lelong x Data length: %d bytes,
>>>32 lelong x Text Relocation Table length: %d bytes,
>>>36 lelong x Data Relocation Table length: %d bytes,
>>>40 lelong x Entry Point: %#08X,
>>>44 lelong x BSS length: %d bytes
Unpopular opinion: this is all needless pedantry. At best this gives parsers like file managers a cleaner path to recognizing the specific version of the specific format you're designing. Your successors won't evolve the format with the same rigor you think you're applying now. They just won't. They'll make a "compatible" change at some point in the future which will (1) be actually backwards compatible! yet (2) need to be detected in some affirmative way. Which it won't be. And your magic number will just end up being a wart like all the rest.
This isn't a solvable problem. File formats evolve in messy ways, they always have and always will, and "magic numbers" just aren't an important enough part of the solution to be worth freaking out about.
Just make it unique; read some bytes out of /dev/random, whatever. Arguments like the one here about making them a safe nul-terminated string that is guaranteed to be utf-8 invalid are not going to help anyone in the long term.
> Magic numbers aren't for parsing files, they're for identifying file formats.
Unpopular corollary: thinking those are two separate actions is a terribly bad design smell. What are you going to do with that file you "identified" if not read it to get something out of it, or hand it to something that will.
If your file manager wants to turn that path into a thumbnail, you have already gone beyond anything the magic number can have helped you with.
Again, needless pedantry. Put a random number in the front and be done with it. Anything else needs a parser anyway.
> What are you going to do with that file you "identified" if not read it to get something out of it
Anything where there's a decision based on the file format but not its contents. This happens all the time.
* Telling the user what file type it is.
* Choosing a parser to load a file with.
* Restricting file types on upload forms.
* Identifying files for linters.
* Associating files with programs.
Ok some of those use file extensions because it's a lot easier and text based formats often don't have a magic number, but it's still a valid use case.
Speaking of needless pedantry, why do you start with this?
> Unpopular corollary:
To me at least it comes across as condescending and unnecessarily edgy or contrarian for the sake of it. I am perfectly able to gauge whether a comment is popular or not.
The magic number isn't about recognizing specific versions. That's just an added benefit if you choose to add that to the magic number.
It is to solve the problem of how to build a file manager that can efficiently recognize all the file types in a large folder without relying on file name extensions.
If you don't include a magic number a file manager would need to attempt to parse the file format before it can determine which file type it is.
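That efficiency argument is worth spelling out: with magic numbers, a file manager only has to read a handful of bytes per file and match them against a prefix table, roughly like this (a toy table; real ones, like file(1)'s magic database, are far larger and support offsets and masks):

```python
# Hypothetical prefix table mapping magic bytes to MIME types.
MAGIC_PREFIXES = [
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'GIF87a',            'image/gif'),
    (b'GIF89a',            'image/gif'),
    (b'%PDF-',             'application/pdf'),
]

def sniff(path):
    with open(path, 'rb') as f:
        head = f.read(16)   # a few bytes suffice; no full parse needed
    for prefix, mime in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return mime
    return 'application/octet-stream'
```

Without a magic number, the fallback is to run each candidate parser until one succeeds, which is dramatically slower across a large folder.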
Filename extensions are pretty useful. They were adopted for very good reasons, and every attempt to hide them, pretend they don't matter, or otherwise make them go away has only made things worse.
You still need a way to make it hard to fool people with deceptive extensions, though, and that's where the magic numbers come in.
> The magic number isn't about recognizing specific versions
Yes it is, though. Does your file manager want to display Excel files differently from .jar files? They're both different "versions" of the same file format! Phil Katz in 1988 or whatever could have followed the pedantry in the linked article to the letter (he didn't). And it wouldn't have helped the problem at hand one bit.
I agree. The author could go after msgpack, which doesn't have a magic number but supports using the .msgpack extension for storing data in files. Since a magic number isn't required at all, it shouldn't be required to be good.
Then there is mkv/webm, where strictly speaking you need to implement at least part of an EBML parser to distinguish them. Possibly why no other file format adopts EBML; everything just recognizes it as either mkv or webm based on dodgy heuristics.
Many modern file formats are based on generic container formats (zip, riff, json, toml, xml without namespaces, ..). Identifying those files requires reading the entire file and then guessing the format from the contents. Magic numbers are becoming rare, which is a shame.
I think what the author likes is the fact that the first 4 bytes are defined as 0x7F followed by the file extension "ELF" in ASCII, which makes it a quite robust identifier.
And to be fair, including the bytes following the magic number makes the ELF format satisfy at least 3 out of the 4 'MUST' requirements:
- 0x00–0x03: 7F 45 4C 46 (the magic number itself)
- 0x04: either 01 or 02 (32-bit or 64-bit)
- 0x05: either 01 or 02 (little-endian or big-endian)
Maybe, yes. There are certainly worse offenders than ELF, but I still don't see how it satisfies 3 out of the 4 MUSTs. There is no byte with the high bit set and it is a valid ASCII sequence and therefore also valid UTF-8.
When it comes to the "eight is better" requirement, at least Linux does not care what comes after the fourth byte for identification purposes, so I think that does not count either.
I agree that it isn’t a particularly good example, especially with reference to the stated rules. Many binary-detection routines will treat DEL as a regular ASCII character.
Most of those make intuitive sense, except this one:
> MUST include a byte sequence that is invalid UTF-8
Making the magic number UTF-8 (or ASCII, which would still break the rule) would effectively turn it into a "magic string". Isn't that the better method for distinguishability? It's easier to pick unique memorable strings than unique memorable numbers, and you can also read it in a hex editor.
What would be the downsides?
Or is the idea of the requirement to distinguish the format from plaintext files? I'd think that the version number or the rest of the format already likely contained some invalid UTF-8 to ensure that.
The key part of magic numbers is that they appear early in the file. You shouldn't rely on something that will probably appear at some point because that requires reading the entire file to detect its type. A single 0x00 byte, ideally the first byte, should be enough to indicate the file is binary and thus make the question of encoding moot. However, 0x00 is technically valid UTF-8 corresponding to U+0000 and ASCII NUL. So, throwing something like 0xFF in there also helps to throw off UTF-8 detection as well as adding high-bit-stripping detection.
If you really wanted to go the extra mile, you could also include an impossible sequence of UTF-16 code units, but I think you'd need to dedicate 8 bytes to that: two invalid surrogate sequences, one in little-endian and the other in big-endian. You could possibly get by with just 6 bytes if you used a BOM in place of one of the surrogate pairs, or even just 4 with a BOM and an isolated surrogate if you can guarantee that nothing after it can be confused for the other half of a surrogate pair. However, throwing off UTF-16 detection doesn't seem that common or useful; many UTF-16 decoders don't even reject these invalid sequences.
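As a sketch of the idea, an isolated low surrogate in each byte order can actually be packed fairly tightly; the four bytes below (my own construction, not the exact scheme described above) are rejected by strict UTF-8 and by strict UTF-16 decoders of both endiannesses, no matter what follows them:

```python
# LE view: first unit is 0xDC00, an unpaired low surrogate -> invalid.
# BE view: first unit 0x00DC is fine, but 0xDC00 follows unpaired -> invalid.
# UTF-8 view: 0xDC starts a 2-byte sequence but 0xDC is not a continuation byte.
POISON = b'\x00\xdc\xdc\x00'

def rejected_by(encoding: str) -> bool:
    try:
        POISON.decode(encoding)
        return False
    except UnicodeDecodeError:
        return True
```

As a bonus, the sequence contains a NUL, so plain text/binary sniffers flag it as binary too. The caveat stands, though: many real-world UTF-16 decoders are lenient about unpaired surrogates, so this only helps against strict ones.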
If there is any foreseeable need that the format will benefit from being executable, I would make the magic bytes look like this:
#!/usr/bin/whatever^@^@^@^@^@[HDR]
A hash bang path terminated by a null, followed by some (aligned) binary material with version information and whatnot, all fitting into around 32 bytes.
The header format could allow for variability in the path; the #! and [HDR] parts could be enough to identify it.
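Laid out concretely, such a header might look like this (the interpreter path and the id/version bytes are hypothetical, just to show the shape):

```python
# Hypothetical 32-byte executable-friendly magic:
# shebang line, NUL padding to an aligned offset, then binary header material.
SHEBANG = b'#!/usr/bin/myfmt'            # made-up interpreter path
PAD     = b'\x00' * (24 - len(SHEBANG))  # NUL terminator doubles as alignment padding
HDR     = b'\xffMYF\x01\x00\x1a\x1a'     # 8 bytes of binary id + version word
MAGIC   = SHEBANG + PAD + HDR            # 32 bytes total
```

The NUL padding conveniently also satisfies the "contains a zero byte" rule, and the 0xFF in the binary part breaks ASCII/UTF-8 validity.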
> For every system to be able to parse it without loading the entire file
It also solves the ambiguity problem, zip files have the magic numbers at the end, and most other files like pdf have the magic numbers at the beginning, so you can have a file that is both a pdf and a zip file.
Any time two parts of a system disagree on how to interpret a given input, there's an exploit waiting to happen. One of the more famous examples of this is HTTP request smuggling.
As a more concrete example of how file type confusion can bite you, you can imagine a hypothetical photo sharing service that lets users upload both individual images and zip files containing images; The basic structure of the server looks something like
    function user_upload_hook(file):
        if is_zipfile(file):
            extract(file, tempdir)
        else:
            move(file, tempdir/file)
        for image in tempdir:
            create_thumbnail(image)
        ...
The developers are aware that zip files can contain zip bombs, so they decide to place some off the shelf ZipCop middleware in front of their application. ZipCop rejects all "bad" zip files, including files that aren't zip files at all. That's almost what they want, so they glue it all together with a shell script that first runs `file` (the POSIX command) on the user-supplied files and only feeds them through ZipCop if the file type isn't on a whitelist of image files. ZipCop rejects bad zip files, and image files are treated properly. All is well and there is much rejo- BANG! A zip bomb blows up in production.
A malicious user has concatenated a JPEG of a cute kitten with a zip bomb. `file` reports that the uploaded file is a JPEG, so it's fed through unchecked to the server. The application's `is_zipfile()` correctly identifies that the file is a valid zip file, so the application extracts it and DOSes the server. The two different layers of the stack disagreeing on how to classify the offending file directly lead to an exploitable vulnerability.
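The kitten-plus-zip trick is trivially reproducible; here's a runnable sketch using Python's zipfile module (the tiny archive stands in for the bomb, and the JPEG bytes are just the usual FF D8 FF prefix):

```python
import io
import zipfile

# Build a tiny but valid zip in memory (stand-in for the "bomb").
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('cat.txt', 'meow')

# Prepend JPEG magic bytes: prefix-based sniffers call this a JPEG...
polyglot = b'\xff\xd8\xff\xe0' + b'\x00' * 16 + buf.getvalue()

# ...while zipfile happily finds the central directory at the end.
print(zipfile.is_zipfile(io.BytesIO(polyglot)))  # True
```

Zip tolerates arbitrary prepended data by design (that's how self-extracting archives work), so both classifications are "correct"; the vulnerability lives entirely in the disagreement between the two checkers.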
Don't know their rationale, but back in the day it was really popular to cat a zip file onto the end of a JPEG or PNG and upload it onto 4chan, to be able to smuggle zip files where only images were supposed to be allowed. I remember back in the day seeing the "mods are asleep, post sinks" threads where people would share pictures of sinks, and I thought it was people being goofy, but later somebody told me that people were sharing zips of CP in those threads. I don't know if it's true or not, and never cared to find out.
If your security depends on no format with undesirable properties existing then you have no security. The problem here is not the zip format but insufficient validation for the images you accept - the hidden data could be any ad-hoc format. Message smuggling in image files in particular is only something you can prevent if you re-encode the image -- and even then it's possible to hide messages in the image data in ways that will survive re-encodes.
0xDC 0xDF are bytes with the high bit set. Together with the next two bytes, they form a four-byte sequence that cannot appear in any valid ASCII, UTF-8, Corrected UTF-8, or UTF-16 (regardless of endianness) text document. This is not a perfectly bulletproof declaration that the file does not contain text, but it should be strong enough except maybe for formats like PDF that can't decide if they're structured text or binary.
X X x x: Four ASCII alphanumeric characters naming your file format. Make them clearly related to your recommended file name extension. I'm giving you four characters because we're running out of three-letter acronyms. If you don't need four characters, pad at the end with 0x1A (aka ^Z).
The first two of these (the uppercase Xes) must not have their high bits set, lest the "this is not text" declaration be weakened. For the other two (lowercase xes), use of ASCII alphanumerics is just a strong recommendation.
0x01 0x00 or 0x00 0x01: This is to be understood as a 16-bit unsigned integer in your choice of little- or big-endian order. It serves three functions. In descending order of importance:
It includes a zero byte, reinforcing the declaration that this is not a text file.
It demonstrates which byte ordering will be used throughout the file. It does not matter which order you choose, but you need to consciously choose either big- or little-endian and then use that byte order consistently throughout the file. Yes, I have seen cases where people didn't do that.
It's an escape hatch. If one day you discover that you need to alter the structure of the rest of the file in a totally incompatible way, and yet it is still meaningfully the same format, so you don't want to change the name characters, you can change the 0x01 to 0x02. We both hope that day will never come, but we both know it might.
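Putting the pieces above together, the full 8-byte magic can be built and sanity-checked like this ("MYfm" is a made-up format name following the X X x x convention):

```python
# 0xDC 0xDF + four ASCII name chars + 16-bit little-endian version word.
MAGIC = b'\xdc\xdf' + b'MYfm' + b'\x01\x00'

# Verify the header is rejected by every common text decoding.
for enc in ('ascii', 'utf-8', 'utf-16-le', 'utf-16-be'):
    try:
        MAGIC.decode(enc)
        raise AssertionError(f'unexpectedly decoded as {enc}')
    except UnicodeDecodeError:
        pass  # good: no strict text decoder accepts this header
```

Every design goal is checkable: high-bit bytes up front, an embedded zero byte from the version word, and invalidity under ASCII, UTF-8, and both UTF-16 byte orders.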
> When would someone ever want a binary file that's not zip, SQLite, or version controllable text?
It feels like there’s an infinite number of answers to this, but to choose one: when choosing the format to allow memory mapping makes some operations simpler or more performant?
> But first, ask yourself why you are designing a binary format, unless maybe it's a new media container.
> When would someone ever want a binary file that's not zip, SQLite, or version controllable text?
Maybe I'm not getting the humour here, but in case you are being serious binary files do have a few advantages over text formats.
1. Quick detection (say, for dispatching to a handler)
2. Rapid serialisation, both into and out of a running program (program state, in-memory data, etc)
3. Better and safer handling of binary data (no clunky roundtrips of binary blobs to text and back again)
4. Much better checksumming.
Binary files are useful, but binary files that aren't either zip, sqlite, or a media container seem pretty niche.
It makes sense for model weights and media and opaque blobs where you don't need to load just a part of it, but I see a lot of custom binary save files that don't seem to make any sense.
If it's a server, everything is probably in a database, and if it's a desktop app, eventually something is going to make an 8GB file and it's probably going to be slow unless you have indexing.
People are also likely to want to incrementally update the file as well.
If you're sure nobody will ever make a giant file, then VCSability is probably something someone will want.
> But first, ask yourself why you are designing a binary format, unless maybe it's a new media container.
Binary formats are ideal for critical applications where you want to either 1) parse the file correctly, 2) fail to parse the file at all. Non-binary formats (and re-use of existing binary formats) tend to have failure modes where you parse the file incorrectly due to a bug, resulting in something bad happening like a security exploit.
Tagged files are too useful. 4 byte tag name, 4 byte length of the object, then the binary data of the object. You see these all the time. Sometimes you see the size before the tag name.
Occasionally, you also see a file header, followed by a size, and an "x", that often indicates a block of ZLIB compressed data.
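A reader for the tag/length/value layout described above is only a few lines; this sketch assumes 4-byte tags with big-endian 4-byte lengths, though as noted, real formats vary the ordering and endianness:

```python
import struct

def iter_chunks(data: bytes):
    """Yield (tag, payload) pairs from a tag/length/value stream."""
    off = 0
    while off + 8 <= len(data):
        tag = data[off:off + 4]
        (length,) = struct.unpack('>I', data[off + 4:off + 8])  # big-endian size
        yield tag, data[off + 8:off + 8 + length]
        off += 8 + length
```

The nice property of this layout is skippability: a reader that doesn't understand a tag can jump over it using the length field alone, which is exactly why so many evolvable formats (PNG, RIFF, etc.) use it.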
1) Starts with a null byte to make it clear this is binary, not text.
2) Includes a human-readable part to make it easy to figure out what the file is in hex dumps.
3) 8 randomly chosen bytes, all of which are greater than 0x7F to ensure they're not ASCII.
4) Finally, a one-byte major version number.
5) Total length (including the major version) is 32 bytes to fit nicely in a hex dump.
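That list could be sketched like this (the name is a placeholder, and a real format would fix the random bytes once at design time rather than drawing them at runtime as done here for illustration):

```python
import os

magic = (
    b'\x00'                                    # 1) leading NUL: clearly binary
    + b'MYFORMAT'                              # 2) human-readable in hex dumps
    + bytes(0x80 | b for b in os.urandom(8))   # 3) 8 random non-ASCII bytes
)
magic += b'\x00' * (31 - len(magic)) + b'\x01' # pad, then 4) major version; 5) 32 bytes
```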