I'm designing a new format for a sub-niche of cheminformatics. I've used ideas common to FourCC formats: blocks with a size, a 4-character name, and a data chunk. Some of my blocks are larger than 2GB, so I'm using 64-bit sizes instead of 32-bit ones, which means I can't reuse the formats I know about as-is.
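To make the block layout concrete, here is a minimal Python sketch of a FourCC-style chunk with a 64-bit size. The field order (size, then name, then data), the big-endian byte order, and the names `write_chunk` and `iter_chunks` are illustrative assumptions, not the actual format.

```python
import io
import struct

# Assumed header layout: unsigned 64-bit big-endian size, then a 4-byte name.
# ">" disables alignment padding, so the header is exactly 12 bytes.
HEADER = struct.Struct(">Q4s")


def write_chunk(out, name: bytes, data: bytes) -> None:
    """Write one chunk: header (size + name) followed by the raw data."""
    if len(name) != 4:
        raise ValueError("chunk name must be exactly 4 bytes")
    out.write(HEADER.pack(len(data), name))
    out.write(data)


def iter_chunks(inp):
    """Yield (name, data_offset, size) for each chunk, skipping over the data.

    Only headers are read, so multi-gigabyte chunks are never loaded.
    """
    while True:
        raw = inp.read(HEADER.size)
        if not raw:
            return  # clean end of file
        if len(raw) != HEADER.size:
            raise ValueError("truncated chunk header")
        size, name = HEADER.unpack(raw)
        offset = inp.tell()
        yield name, offset, size
        inp.seek(offset + size)  # jump to the next chunk header


if __name__ == "__main__":
    buf = io.BytesIO()
    write_chunk(buf, b"MOLS", b"abc")  # hypothetical chunk names
    write_chunk(buf, b"INDX", b"")
    buf.seek(0)
    for name, offset, size in iter_chunks(buf):
        print(name, offset, size)
```

Because the reader seeks past each data block, listing the chunks in a file costs one small read per chunk regardless of chunk size.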
I have questions about the design, and would like feedback:
1) PNG ends the chunk with a CRC-32 check value. The other IFF/FourCC formats don't. How important has the CRC-32 been in practice? How often does a PNG chunk fail to pass the check?
1b) (Assuming the check value is useful): Given multi-gigabyte chunks, should I use CRC-64-ECMA-182 instead? I'm assuming that there will eventually be 10-100TB of data across the world in this format. Is this something I should worry about?
1c) How often should I validate? I don't want to do it automatically on load, because even cat > /dev/null takes 43 seconds to process a ~3GB file, while single-lookup command-line queries are sub-second. Should it be a user-defined paranoia test? My feeling is that those never get run.
2) The PNG format uses the NUL character as a separator, e.g., the "tEXt" chunk is "an uncompressed keyword or key phrase, a null (zero) byte, and the actual text." The end position of the 'actual text' is determined by the end of the chunk. Wouldn't a NUL terminator make the code easier for C programs to handle? Is there a reason not to use NUL-terminated fields, other than to save a byte?
3) The PNG format uses bit 5 (the upper-case/lower-case bit) of each character in the chunk code to encode things like "Safe-to-copy". This seems like a cute idea, but has it proved useful? Does it work? That is, do people add their own chunk types and find that other software mostly follows those bit settings? Or does most software end up just ignoring or skipping chunks that it doesn't know about?
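For reference, the PNG convention mentioned in question 3 can be sketched in a few lines of Python. The function name `chunk_properties` and the dictionary keys are hypothetical. The meaning of each bit follows the PNG specification: bit 5 of the first byte marks an ancillary chunk, of the second a private chunk, the third is reserved, and the fourth marks a chunk as safe to copy.

```python
def chunk_properties(chunk_type: bytes) -> dict:
    """Decode the PNG property bits from a 4-byte chunk type.

    Bit 5 (0x20) is the ASCII case bit: set means lower case.
    """
    if len(chunk_type) != 4:
        raise ValueError("chunk type must be exactly 4 bytes")
    flags = [bool(byte & 0x20) for byte in chunk_type]
    return {
        "ancillary": flags[0],     # lower case: decoder may ignore the chunk
        "private": flags[1],       # lower case: not a publicly registered type
        "reserved": flags[2],      # must be upper case in current PNG files
        "safe_to_copy": flags[3],  # lower case: editors may copy it unchanged
    }


if __name__ == "__main__":
    print(chunk_properties(b"IHDR"))
    print(chunk_properties(b"tEXt"))
```

Software that honors these bits can decide what to do with an unknown chunk from its name alone, for example copying an unknown "safe-to-copy" chunk through an edit while dropping an unknown unsafe one.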