
Designing File Formats (2005) - vector_spaces
https://www.fadden.com/tech/file-formats.html
======
lucb1e
I really don't like checksums. The author gives two reasons for them: detecting
things like CRLF conversions, and detecting bit rot.

For the former, a static magic sequence that includes a CRLF does just fine, and
it allows modifying the file without having to write code to recompute your
annoying checksum.
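
To sketch what I mean (hypothetical format; the "FMT" name and exact bytes are made up, only the PNG-style trick is real):

```python
# A fixed magic signature in the PNG style: the embedded CRLF and LF
# bytes get mangled by any tool that performs newline conversion, so a
# plain prefix comparison detects the damage -- no checksum required.
MAGIC = b"\x89FMT\r\n\x1a\n"  # "FMT" is a made-up format name

def check_magic(path):
    """Return True if the file starts with the intact signature."""
    with open(path, "rb") as f:
        return f.read(len(MAGIC)) == MAGIC
```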

For the latter, I find it questionable whether it's the file format's
responsibility to prevent bit rot. What's it going to do anyway? Reject the
file? Knowing whether the file is undamaged is good, but most of the time
people will want it to do a best effort attempt at reading it anyway, and
knowing whether integrity has been maintained is actually not that important
for most files (if it is, it's usually a cryptographic feature and you'll not
want to use CRC32 or MD5 as the author suggests). To recover from bit rot,
you'll need a backup (or an error correction scheme but that seems out of
scope of this article). If every file format has its own checksum, your backup
solution can't check the files for integrity anyway, so it seems to me like it
makes more sense to have checksumming be a filesystem feature.

I'd like it if checksums could just not be a thing in file headers anymore.
The number of times it helped me is exactly zero, but I did run into issues
when wanting to open an old file or modify a file. I would argue checksums
prevent hacking in the HN sense of the word.

~~~
joosters
Checksums are also a pain for larger files. If a 4GB movie has a checksum, is
the app supposed to read in the entire file to calculate and validate the
checksum before playing the movie? That’s inefficient.

It also doesn’t help programs dealing with bad data. Apps still must handle
invalid files gracefully, even if the checksum is good. Otherwise, hackers can
create malicious invalid files with correct checksums.

Checksums also make small edits to a file painful, since every program must
now know how to update the checksums. Imagine if text files had checksums -
how could you possibly do simple regex replacements, e.g. using ‘sed’?
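
As a toy illustration of that burden (hypothetical layout: a 4-byte CRC32 prefix followed by the body), every editing tool has to recompute the stored value, or readers will reject the result:

```python
import zlib

# Toy format: 4-byte big-endian CRC32 of the body, then the body.
def pack(body: bytes) -> bytes:
    return zlib.crc32(body).to_bytes(4, "big") + body

def edit(blob: bytes, old: bytes, new: bytes) -> bytes:
    """A sed-style replace that also rewrites the checksum."""
    body = blob[4:].replace(old, new)
    return pack(body)  # must recompute, or readers will reject the file

def valid(blob: bytes) -> bool:
    return int.from_bytes(blob[:4], "big") == zlib.crc32(blob[4:])
```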

~~~
pharrington
Regarding your second thought: one obvious way a checksum helps in gracefully
handling invalid files is that a checksum pass/fail decisively rules out whether
a problem occurred during the data transfer. For example, a checksum allows you
to know when it is or isn't appropriate to automatically attempt a redownload,
or to tell the user to try redownloading the file.
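
A sketch of that decision (the function name and the choice of SHA-256 are mine, not from the article):

```python
import hashlib

# If the published digest doesn't match, the transfer -- not the file
# format -- is suspect, so retrying the download is the right response.
def needs_redownload(payload: bytes, expected_sha256: str) -> bool:
    return hashlib.sha256(payload).hexdigest() != expected_sha256
```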

------
makapuf
An interesting choice is RIFF. This generic binary format was used everywhere
from the Amiga to PCX (paint) to WAV files, and those files all share the same
structure and building blocks, made of chunks with different headers. Or just
use SQLite or JSON.
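
A minimal sketch of walking RIFF-style chunks, assuming the little-endian id/length/data layout that WAV uses (not a complete parser):

```python
import struct

# Each chunk: 4-byte ASCII id, 4-byte little-endian length, then the
# data, padded to an even size (chunks are word-aligned).
def iter_chunks(data: bytes):
    pos = 0
    while pos + 8 <= len(data):
        cid, size = struct.unpack_from("<4sI", data, pos)
        yield cid, data[pos + 8 : pos + 8 + size]
        pos += 8 + size + (size & 1)  # skip the pad byte if size is odd
```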

------
teddyh
> _If you create a binary file format, document what every byte means._

If possible, look into using a pre-existing notation like ASN.1 for this. No
need to re-invent the wheel.

~~~
memling
some free tools exist for asn.1, but it may be easier to go with protobuf or
something similar for new development today compared with 2005.

------
colejohnson66
Off topic, but in the section of the best one being PNG, there’s this:

> The second-to-last byte is a Ctrl-Z, used on some systems as an end-of-file
> marker in text files. Not only does it detect improper text handling, it
> also stops you from getting a screen full of garbage if you "type" the file
> under MS-DOS.

Why does MS-DOS stop reading at an EOF marker and not at the true end of the
file? I admit, I’m too young to have used DOS, but didn’t FAT(12/16) have file
sizes?
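
For reference, the full eight-byte PNG signature and what each part guards against (per the PNG specification; note the Ctrl-Z is the second-to-last byte of the signature, which sits at the start of the file):

```python
# The real eight-byte PNG signature, byte by byte:
PNG_SIGNATURE = bytes([
    0x89,              # high bit set: catches 7-bit transmission channels
    0x50, 0x4E, 0x47,  # "PNG" in ASCII, human-readable in a hex dump
    0x0D, 0x0A,        # CRLF: catches CRLF -> LF newline conversion
    0x1A,              # Ctrl-Z: stops MS-DOS "type" from spewing garbage
    0x0A,              # LF: catches LF -> CRLF newline conversion
])
```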

~~~
jolmg
A more interesting question, I think, is why it would stop you from getting a
screen full of garbage if it's at the second-to-last byte. If it were at the
second byte, or if one read files from the end to the beginning, then OK, but I
doubt that's the case.

------
carapace
Rule #1: Don't.

> There are many, many file formats...

And one or more of them almost certainly fits the bill. It's 2019, "someone
else has had this problem."

"Some words of advice on language design" comment on LtU by Frank Atanassow
was mentioned yesterday in re: pg's Bel.
[http://lambda-the-ultimate.org/node/687#comment-18074](http://lambda-the-ultimate.org/node/687#comment-18074)

Everything in it applies to file formats as well, I think, eh?

(If you must make a new format specify a grammar and make it simple.
[https://en.wikipedia.org/wiki/Chomsky_hierarchy](https://en.wikipedia.org/wiki/Chomsky_hierarchy)
)

~~~
ken
I wish!

> It's 2019, "someone else has had this problem."

Sure, and that format only supports 98% of what I need, and its maintainers
aren't interested in adding one new feature, or are no longer available to
maintain the format. Now I have a choice: write my own from scratch, or extend
theirs in an unofficial way.

Or maybe they are willing and able to make a "1.1" version of their file
format, and now 10 other programs that support the 1.0 file format won't
support my files consistently, since they were never designed or tested for
it.

As your LTU link suggests, in many cases:

> your solution belongs in a library, not a language

Sounds great, but what's a "library", in the world of file formats? It would
be fantastic if every file format were extensible in all the ways that were
useful to me, but they're not.

~~~
defanor
> Sounds great, but what's a "library", in the world of file formats?

I don't quite understand the linked programming language design advice (if the
existence of SKI is a sufficient argument to discard languages based on cleaner
models and fewer axioms, it seems any language can be discarded at step 4,
since Turing-complete languages already exist). But I think a sensible course
of action that fits the "language versus library" analogy is to use an existing
serialization format and/or data model, defining only its schema/ontology (if
no suitable one exists). It's similar to the separation of layers in network
protocols (particularly the presentation and application layers, and perhaps
down to the transport layer, given the article's advice to include checksums).

~~~
ken
I like that interpretation, though existing file formats don't tend to be
implemented with such layers. At best it will make documentation a little bit
simpler, but won't help with my implementation.

There's no straightforward way, in any language/library/system I've used, to
say "take the lexer from file format XYZ, and I'm going to change the data
model it produces" -- or vice versa. In theory I thought that's supposed to be
what inheritance is about, but nobody writes that way, and object-oriented
programming isn't magic pixie dust you can sprinkle on any design to make it
automatically reusable.

~~~
defanor
> "take the lexer from file format XYZ, and I'm going to change the data model
> it produces"

Apparently we've imagined slightly different things (and with the sibling
comment too; perhaps I should have provided examples at once). What I had in
mind is rather standard formats and markup languages, which are already used
in practice (though not always): such as XML (SVG, OO XML, XHTML being based
on it) and JSON; BER and others for standard encoding rules; RDF for a generic
data model with various serialization options. That is, not taking a lexer
from another format's implementation, but using a standard one. Likewise with
integrity checks: many formats already rely on integrity being guaranteed by
file systems and network protocols, and don't include checksums; a similar
story with compression, and with encryption (both of which some formats would
still redefine and include, but often they can be applied separately as well,
or at least using an existing standard/algorithm/library).

------
ken
> The all-time coolest identification string goes to the PNG graphics file
> format

HDF5 uses the same format, except of course with "HDF" instead of "PNG". I
wonder if any other file formats have adopted this.

~~~
greyfade
They're just variants on the IFF theme.[1] All of these files are broken into
chunks with a FourCC header (sometimes with letters alternately upper- or
lowercase, depending on features and version) that includes a chunk length and
occasionally a checksum. It's a very simple and flexible format that almost no
one since the '90s seems to use any more for new file formats, the only
exceptions being the likes of PNG, MPEG, and a couple of other things.

[1]
[https://en.wikipedia.org/wiki/Interchange_File_Format](https://en.wikipedia.org/wiki/Interchange_File_Format)

------
rkagerer
Love pretty much all that article, except "If you want to be trendy, use XML."

