

Have experience with PNG/IFF/4CC formats? I want design help. - dalke

I'm designing a new format for a sub-niche of cheminformatics. I've used ideas
common to FourCC formats: blocks with a size, 4-character name, and data chunk.
I have blocks >2GB, so I'm using 64-bit sizes instead of 32, which means I
can't leverage the formats I know about exactly.

I have questions about the design, and would like feedback:

1) PNG ends each chunk with a CRC-32 check value. The other IFF/FourCC formats
don't. How important has the CRC-32 been in practice? How often does a PNG
chunk fail to pass the check?

1b) (Assuming the check value is useful.) Given multi-gigabyte chunks, should
I use CRC-64-ECMA-182 instead? I'm assuming that there will eventually be
10-100TB of data across the world in this format. Is this something I should
worry about?

1c) How often should I validate? I don't want to do it automatically on load,
because it takes cat > /dev/null 43 seconds to process a ~3GB file, while
single-lookup command-line queries are sub-second. Should it be a user-defined
paranoia test? My feeling is that those never get run.

2) The PNG format uses the NUL character as a separator, e.g. the "tEXt" chunk
is "an uncompressed keyword or key phrase, a null (zero) byte, and the actual
text." The end position of the 'actual text' is determined by the end of the
chunk. Wouldn't a NUL terminator make the code easier for C programs to
handle? Is there a reason not to use NUL-terminated fields, other than to save
a byte?

3) The PNG format uses bit 5 (the upper-case/lower-case bit) of the chunk code
to encode things like "safe-to-copy". This seems like a cute idea, but has it
proved useful? Does it work? That is, do people add their own chunk types, and
find that other software mostly tends to follow those bit settings? Or does
most software just ignore/skip chunks that it doesn't know about?
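For concreteness, here's a minimal sketch of the chunk layout I'm considering.
The 64-bit length, the CRC placement, and the names are all provisional, not
part of any existing format:

```python
import struct
import zlib

def write_chunk(out, name: bytes, data: bytes) -> None:
    """Write one chunk: 8-byte big-endian length, 4-byte name, data, CRC-32.

    As in PNG, the CRC covers the name and data but not the length field.
    (The 64-bit length and this exact layout are my assumptions, not PNG's.)
    """
    assert len(name) == 4
    out.write(struct.pack(">Q", len(data)))
    out.write(name)
    out.write(data)
    out.write(struct.pack(">I", zlib.crc32(name + data) & 0xFFFFFFFF))

def read_chunk(inp):
    """Read one chunk back, verifying the check value."""
    (length,) = struct.unpack(">Q", inp.read(8))
    name = inp.read(4)
    data = inp.read(length)
    (crc,) = struct.unpack(">I", inp.read(4))
    if crc != zlib.crc32(name + data) & 0xFFFFFFFF:
        raise ValueError("chunk %r failed CRC-32 check" % name)
    return name, data
```

Round-tripping through an in-memory buffer shows the intent: a reader can skip
any chunk it doesn't understand by seeking length + 8 bytes forward.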
======
vitovito
This is an interesting question, but I don't know how many file format
neckbeards you're going to find on HN; it's an awfully young community. Maybe
Stack Overflow? Maybe the File Formats wiki[1] will have links to format
inventors? Maybe read the specs for the formats and ask the authors, as well
as the authors of the standard libraries for them?

1:
[http://fileformats.archiveteam.org/wiki/Electronic_File_Form...](http://fileformats.archiveteam.org/wiki/Electronic_File_Formats)

I will answer what little I can based on some work I did in 2005-2006 that
resulted in me abusing JPEG restart markers.

1\. Part of the reason PNG is chunked is that it's designed for network use
(that's the N), where you can lose or corrupt a portion of the file and still
have most of it usable. That's what JPEG restart markers are for: you can
repeat the headers throughout the file (rather than a CRC throughout the
file), so if something in the middle gets messed up, you can pick right back
up. Formats expected to sit on a disk, never move, and trust rotational media
don't have those checks, because it's assumed you check the data when you
write it and then you're fine for a reasonable value of forever. So: are you
transferring over a network constantly, like you do with JPEGs and PNGs, or
can you do an MD5 sum (or whatever) once and be done with it?
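For the check-once approach, a streamed digest is enough. A sketch (the
algorithm choice and buffer size here are arbitrary):

```python
import hashlib

def file_digest(path: str, algo: str = "md5", bufsize: int = 1 << 20) -> str:
    """Hash a file in 1 MiB pieces, so a multi-gigabyte file is never
    loaded into memory all at once."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()
```

Compute it once after writing, store the hex string next to the file, and
compare after any copy or transfer you don't trust.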

1b. I don't know what these words mean.

1c. This should be answered by 1.

2\. This is probably explained somewhere in the PNG spec or the PNG mailing
list archives, but specifying the length of a run of data instead of looping
until you find a \0 helps you optimize and prevents buffer overflows.
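To illustrate the difference (these field encodings are hypothetical, not the
actual PNG tEXt layout):

```python
import struct

def read_length_prefixed(buf: bytes, offset: int):
    """Length-prefixed: one fixed-size read says exactly how far to slice,
    and the declared length can be bounds-checked up front."""
    (n,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    if start + n > len(buf):
        raise ValueError("declared length runs past end of buffer")
    return buf[start:start + n], start + n

def read_nul_terminated(buf: bytes, offset: int):
    """NUL-terminated: must scan byte by byte, and the payload itself can
    never contain a zero byte."""
    end = buf.index(b"\x00", offset)  # raises if the terminator is missing
    return buf[offset:end], end + 1
```

Both return the field plus the offset of whatever follows, but only the
length-prefixed version knows that offset without touching the data.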

3\. Yes, lots of things use user-defined chunks and all sorts of crazy stuff
gets stored in them. A video game once stored its character data as PNG files
with custom chunks so you could put them online as images, and then right-
click-save-as and import them into the game. But no, most software doesn't
understand other chunks and ignores them (but preserves them).
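For reference, here's how those case bits decode, per my reading of the PNG
spec (bit 5 is what distinguishes upper- from lower-case ASCII):

```python
def chunk_properties(name: bytes) -> dict:
    """Decode PNG's per-byte case bits: bit 5 of each of the four chunk-name
    bytes carries one property flag; lowercase means the bit is set."""
    bits = [bool(name[i] & 0x20) for i in range(4)]
    return {
        "ancillary":    bits[0],  # lowercase first byte: not critical
        "private":      bits[1],  # lowercase second byte: not a public chunk
        "reserved":     bits[2],  # must be uppercase in valid PNG streams
        "safe_to_copy": bits[3],  # editors may copy it without understanding it
    }
```

So "tEXt" decodes as ancillary, public, and safe to copy, while "IHDR" is
critical on all counts.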

~~~
dalke
Thanks for your response.

I tried programming.stackexchange a year or two back. (I've been waiting for
funding and/or a specific need to work on this format, which I now have.) The
one response then also pointed out PNG was designed to work with the network,
particularly dialup, but could give no numbers as to the failure rates.

Based on what I've read, internet packets have an undetected error rate of
something like 1 in 16 million to 1 in 10 billion packets, which works out to
O(1 TB) of data, and Bram Cohen concurs, saying that BitTorrent sees failures
in the 1-per-10TB range. I wonder when/if I should worry about this sort of
problem.

Most of the data files will be sitting on disk, mmap'ed for use. I'm going to
take your advice that this problem is mostly theoretical for the situation I'm
in, and suggest that users manually md5 if they want to verify a copy, or use
BitTorrent if they want better guarantees on data transfer over a flaky
network.

1b. was a fancy way to say CRC-64 instead of CRC-32. There are several -64s,
so I picked the one in xz.
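For the curious, a table-driven sketch of that variant (CRC-64/XZ: the
ECMA-182 polynomial reflected, with all-ones init and xor-out):

```python
POLY = 0xC96C5795D7870F42  # ECMA-182 polynomial, bit-reversed

TABLE = []
for _b in range(256):
    _crc = _b
    for _ in range(8):
        _crc = (_crc >> 1) ^ POLY if _crc & 1 else _crc >> 1
    TABLE.append(_crc)

def crc64_xz(data: bytes, crc: int = 0) -> int:
    """CRC-64/XZ over `data`.

    Passing a previous result back in as `crc` continues the computation,
    so a multi-gigabyte chunk can be checked incrementally, buffer by buffer.
    """
    crc ^= 0xFFFFFFFFFFFFFFFF
    for byte in data:
        crc = TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFFFFFFFFFF
```

The standard check value for these parameters is
crc64_xz(b"123456789") == 0x995DC9BBDF1939FA.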

1c. Yep. I've given up worrying. Now it's intellectual curiosity.

2\. I should ask the PNG list, yes. The documentation doesn't explain the
logic, and the 15+-year-old mailing list archives are only available as zip
files over FTP, making them a bit harder to trawl than a web search.

I don't believe the logic about length vs. NUL, since the first field is NUL-
delimited, making it susceptible to the same optimization issues and possible
overflow attacks. The iTXt chunk, for example, has three NUL separators
instead of using length parameters.
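The iTXt layout from the spec, as a parsing sketch (no decompression
handling; assumes well-formed input):

```python
def parse_itxt(data: bytes):
    """Split an iTXt chunk body: three NUL-separated fields plus two
    single-byte flags; the text itself just runs to the end of the chunk."""
    keyword, rest = data.split(b"\x00", 1)
    comp_flag, comp_method = rest[0], rest[1]
    lang, rest = rest[2:].split(b"\x00", 1)
    translated, text = rest.split(b"\x00", 1)
    return keyword, comp_flag, comp_method, lang, translated, text
```

So within one chunk, three fields are NUL-delimited and only the final text
gets its length implicitly from the chunk length.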

3) That's neat! I'll need to think more about including this sort of PNG-like
functionality.

