Have experience with PNG/IFF/4CC formats? I want design help.
4 points by dalke 1373 days ago | hide | past | web | 3 comments | favorite
I'm designing a new format for a sub-niche of cheminformatics. I've used ideas common to FourCC formats: blocks with a size, 4-character name, and data chunk. I have blocks >2GB so I'm using 64-bit sizes instead of 32, so I can't leverage the formats I know about exactly.

I have questions about the design, and would like feedback:

1) PNG ends the chunk with a CRC-32 check value. The other IFF/FourCC formats don't. How important has the CRC-32 been in practice? How often does a PNG chunk fail to pass the check?

1b) (Assuming the check value is useful): Given multi-gigabyte chunks, should I use CRC-64-ECMA-182 instead? I'm assuming that there will eventually be 10-100TB of data across the world in this format. Is this something I should worry about?

1c) How often should I validate? I don't want to do it automatically on load, because even `cat file > /dev/null` takes 43 seconds on a ~3GB file, while single-lookup command-line queries are sub-second. Should it be a user-invoked paranoia test? My feeling is that those never get run.

2) The PNG format uses the NUL character as a separator, e.g. the "tEXt" chunk is "an uncompressed keyword or key phrase, a null (zero) byte, and the actual text." The end position of the 'actual text' is determined by the end of the chunk. Wouldn't a NUL terminator make the code easier for C programs to handle? Is there a reason not to use NUL-terminated fields, other than to save a byte?

3) The PNG format uses bit 5 (the upper-case/lower-case bit) of each chunk-code character to encode properties like "safe-to-copy". This seems like a cute idea, but has it proved useful? Does it work? That is, do people add their own chunk types and find that other software mostly follows those bit settings? Or does most software just ignore/skip chunks it doesn't know about?
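For reference, the PNG bit-5 convention from question 3 is easy to decode in code; a quick sketch:

```python
def chunk_properties(name):
    """Decode the PNG chunk-name property bits: bit 5 (0x20) of each
    of the four bytes is the ASCII upper/lower-case bit, where
    lowercase means the bit is set."""
    def lower(b):
        return bool(b & 0x20)
    return {
        "ancillary":    lower(name[0]),  # lowercase: not critical to decoding
        "private":      lower(name[1]),  # lowercase: privately defined chunk
        "reserved":     lower(name[2]),  # must be uppercase per current spec
        "safe_to_copy": lower(name[3]),  # lowercase: editors may copy blindly
    }
```

So `chunk_properties(b"tEXt")` reports ancillary and safe-to-copy, while `chunk_properties(b"IHDR")` reports all four bits clear, i.e. a critical chunk.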

This is an interesting question, but I don't know how many file format neckbeards you're going to find on HN, it's an awfully young community. Maybe Stack Overflow? Maybe the File Formats wiki[1] will have links to format inventors? Maybe read the specs for the formats and ask the authors, as well as the authors of standard libraries for them?

1: http://fileformats.archiveteam.org/wiki/Electronic_File_Form...

I will answer what little I can based on some work I did in 2005-2006 that resulted in me abusing JPEG restart markers.

1. Part of the reason PNG is chunked is that it was designed for network use (that's the N): you can lose or corrupt a portion of the file and still have most of it usable. That's what JPEG restart markers are for: you repeat the headers throughout the file (rather than CRCs) so that if something in the middle gets messed up, the decoder can pick right back up. Formats that are expected to sit on a disk, never move, and trust rotational media don't have those checks, because it's assumed you check the file when you write it and then you're fine for a reasonable value of forever. So: are you transferring over a network constantly, as with JPEGs and PNGs, or can you compute an MD5 sum (or whatever) once and be done with it?

1b. I don't know what these words mean.

1c. This should be answered by 1.

2. This is probably explained somewhere in the PNG spec or the PNG mailing list archives, but specifying the length of a run of data instead of looping until you find a \0 helps you optimize and prevents buffer overflows.

3. Yes, lots of things use user-defined chunks and all sorts of crazy stuff gets stored in them. A video game once stored its character data as PNG files with custom chunks so you could put them online as images, and then right-click-save-as and import them into the game. But no, most software doesn't understand other chunks and ignores them (but preserves them).

Thanks for your response.

I tried programming.stackexchange a year or two back. (I've been waiting for funding and/or a specific need to work on this format, which I now have.) The one response then also pointed out PNG was designed to work with the network, particularly dialup, but could give no numbers as to the failure rates.

Based on what I've read, internet packets have an undetected error rate somewhere between 1 in 16 million and 1 in 10 billion packets, which works out to roughly one error per TB transferred. Bram Cohen concurs, saying that BitTorrent sees failures in the 1-per-10TB range. I wonder when/if I should worry about this sort of problem.

Most of the data files will be sitting on disk, mmap'ed for use. I'm going to take your advice that this problem is mostly theoretical for the situation I'm in, and suggest that users manually md5 if they want to verify a copy, or use BitTorrent if they want better guarantees on data transfer over a flaky network.

1b. was a fancy way of saying CRC-64 instead of CRC-32. There are several CRC-64 variants, so I picked the one used by xz.
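For what it's worth, that xz variant (CRC-64/XZ: the ECMA-182 polynomial, bit-reflected, with all-ones initial value and final XOR) is short to implement. A table-driven sketch:

```python
_POLY = 0xC96C5795D7870F42  # ECMA-182 polynomial, bit-reflected

def _make_table():
    """Precompute the 256-entry table for byte-at-a-time CRC updates."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_table()

def crc64_xz(data, crc=0):
    """CRC-64/XZ over `data`. Pass a previous return value as `crc`
    to checksum a large file incrementally, chunk by chunk."""
    crc ^= 0xFFFFFFFFFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFFFFFFFFFF
```

The incremental form means a multi-gigabyte chunk can be checksummed while streaming it, without a second pass over the data.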

1c. Yep. I've given up worrying. Now it's intellectual curiosity.

2. I should ask the PNG list, yes. The documentation doesn't explain the logic, and the 15+ year old mailing lists are only available via ftp'ed zip files, making them a bit harder to trawl than a web search.

I don't buy the length-vs-NUL logic, though, since the first field is itself NUL-delimited, making it susceptible to the same possible overflow attacks and ineligible for the same optimization. The iTXt chunk, for example, uses NUL separators between its fields instead of length parameters.
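To illustrate the asymmetry: parsing a tEXt chunk body needs a NUL scan for the keyword anyway, while the text simply runs to the end of the chunk. A sketch:

```python
def parse_tEXt(data):
    """Split a PNG tEXt chunk body into (keyword, text). The keyword
    is terminated by a NUL byte; the text runs to the end of the
    chunk with no terminator. Both fields are Latin-1 per the spec."""
    keyword, sep, text = data.partition(b"\x00")
    if not sep:
        raise ValueError("tEXt chunk body has no NUL separator")
    return keyword.decode("latin-1"), text.decode("latin-1")
```

So a C implementation gets the worst of both worlds here: it must bounds-check the NUL scan for the keyword and also track the remaining chunk length for the text.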

3) That's neat! I'll need to think more about including this PNG-like sort of functionality.

I asked on one of the PNG lists, and dug up the Jan.-Feb. 1995 discussion, where the CRC discussion mostly took place.

It seems there was an early advocate for CRC-32 support. This person had experience as the main UnZip developer, and knew that large archival files need this sort of check. Zip is a container format and PNG is a container format, so I see how that reasoning works, but PNG isn't really seen as an archival format.

Still, there was an early push for CRC, and that was taken up. Though I found almost no discussion about failure modes or frequency of failures. One of the few examples I did find argued that uuencoded images on Usenet can be corrupted mid-stream, and PNG was better able to detect those problems.

The original proposal seemed to have put the CRC in the terminal chunk. Someone decided it was more elegant to have it at the end of every chunk. Others agreed. It was put in.

CRC failures don't occur often. I wonder if the failure rate due to incorrectly implemented CRCs is higher, given that Firefox has a switch to ignore CRC check failures in 'ancillary' chunks after the image data, as a workaround for someone's bad chunk.

I've decided not to have the CRC in my own format.
