They definitely did not implement PDF parsing, not even a subset of it. They make assumptions that will result in incorrect parsing. For instance, they assume objects are tightly packed; they aren't required to be (packing them saves space, but it's optional). Moreover, it is possible to place objects inside other objects; it's not advised, but it's not prohibited either. As far as I can tell, this is where their PDF parsing ends. They don't parse the objects themselves (neither regular objects nor stream objects). So they've chosen PDF "because it is the most complicated format to our knowledge" but ended up just (incorrectly) chunking the byte stream by the offset table.
> For instance, they assume objects are tightly packed; they aren't required to be (packing them saves space, but it's optional).
The PDF 2.0 spec says in section 7.5.3, "The body of a PDF file shall consist of a sequence of indirect objects representing the contents of a document." I'd read that as establishing the entire contents of the file body. Of course, real-world PDFs might have all sorts of garbage that a practical parser should be prepared for, but I don't think that it's condoned by the standard.
> Moreover, it is possible to place objects inside other objects; it's not advised, but it's not prohibited either.
I think the standard tokenization would prevent any string "obj" inside of an indirect object from actually being a keyword obj that starts a new indirect object. (And if the file body as a whole weren't tokenized from start to end, then "a sequence of indirect objects" would be nonsensical.)
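A toy sketch of what I mean (not a conforming PDF lexer; real PDF also allows nested parentheses in strings, hex strings, comments, streams, and so on): once the body is tokenized sequentially, an "obj" inside a string literal comes out as part of the string token, never as the keyword.

    import re

    # Very rough PDF body tokenizer: string literals are consumed as single
    # tokens, so the bytes "obj" inside one can't start a new indirect object.
    TOKEN = re.compile(rb"""
          \((?:[^()\\]|\\.)*\)        # string literal (nested parens not handled)
        | <<|>>|\[|\]                 # dictionary/array delimiters
        | /[^\s/<>\[\]()]*            # name
        | [+-]?\d+(?:\.\d+)?          # number
        | [A-Za-z]+                   # keyword: obj, endobj, stream, R, ...
    """, re.VERBOSE)

    body = b"1 0 obj (this obj is only text) endobj"
    print([m.group(0) for m in TOKEN.finditer(body)])
    # [b'1', b'0', b'obj', b'(this obj is only text)', b'endobj']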
Yeah, this work is far from what a real PDF parser requires. It’s not uncommon for the lengths at the beginning of streams to be wrong or the data to be encoded in a format different from the one claimed. The offset table can also be wrong or missing.
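That is why practical parsers carry recovery paths. A sketch of the usual fallback when the xref is wrong or missing (tools like mutool and qpdf do something along these lines, with far more corner cases): rebuild the table by scanning the file for object headers.

    import re

    OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

    def rebuild_xref(data: bytes) -> dict[int, int]:
        """Object number -> byte offset of its header; last definition wins.
        Caveat: a naive scan like this can false-positive on bytes inside
        stream data, which is exactly the kind of mess real parsers handle."""
        return {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(data)}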
Malformed files are a whole other can of worms that a good parser should know how to deal with, but here the parser isn't even format compliant.
I think they wanted to demonstrate that their work can slice a stream by an offset table in a declarative fashion. That is a useful property. But I think they would've done better to pick OTF/TTF to demonstrate this particular feature.
Sounds to me like that's more of an issue with the PDF specification than with the work presented in the paper, in which case that's hardly the metric by which we should measure its merit.
I’m not saying PDF is a good format. I’m pointing out that they’ve made a poor choice going for PDF. There are other formats they could’ve used to demonstrate this specific technique, like OTF/TTF, which is a more traditional binary format with a whole range of layout approaches, including offset tables.
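For comparison, the whole TTF/OTF table directory is a few lines of parsing; this rough sketch follows the sfnt layout from the OpenType spec (error handling and checksum verification omitted):

    import struct

    def read_table_directory(font: bytes):
        # sfnt header: version (uint32), numTables (uint16), then search fields we skip
        sfnt_version, num_tables = struct.unpack_from(">IH", font, 0)
        tables = {}
        for i in range(num_tables):
            # each 16-byte record: tag (4 bytes), checksum, offset, length (uint32 each)
            tag, _checksum, offset, length = struct.unpack_from(">4sIII", font, 12 + 16 * i)
            tables[tag.decode("latin-1")] = (offset, length)
        return sfnt_version, tables

    # e.g. tables["glyf"] gives the byte interval holding the glyph data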
The abstract says "We have used IPGs to specify a number of file formats including ZIP, ELF, GIF, PE, and part of PDF". Sounds to me like they threw PDF in there to test the limits of the technique _after_ using it on a bunch of other unrelated formats.
In fact, the authors state "PDF is picked because it is the most complicated format to our knowledge, which requires some unusual parser behaviors. We did not implement a full PDF parser due to its complexity, but a functional subset to show how IPGs can support some interesting features (...) PDF is a more complicated format. Our IPG grammar for PDF does not support full PDF parsing but focuses on how some interesting features in PDF are supported. As a result, the parser generated from our IPG PDF grammar can parse simple PDF files"
Kinda sounds like you didn’t read what they actually wrote in its entirety but have still taken the first possible chance to jump in and quite aggressively tell them how what they’re doing is wrong.
> To further utilize the idea of intervals, an interval-based, monadic parser combinator library is proposed.
This sounds like a well-behaved variant of attribute grammars. Adding local attribute references simplifies the grammar and can be implemented tractably.
This might support classifying and implementing formats by severability + composability, i.e., whether you can parse one part at the same time as another, or at least find/prioritize precursor structures like indexes.
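To make "interval-based" concrete, a toy sketch (emphatically not the paper's API): every parser takes an explicit [lo, hi) interval into a shared buffer, and an index is just a parser whose result is a list of further intervals that can be handed to independent part-parsers.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Interval:
        lo: int
        hi: int

    def u32le(buf: bytes, iv: Interval) -> int:
        return int.from_bytes(buf[iv.lo:iv.lo + 4], "little")

    def split_by_index(buf: bytes, iv: Interval) -> list[Interval]:
        """Hypothetical index layout: count, then (offset, length) pairs,
        with offsets relative to iv.lo. Each returned interval is severable:
        it can be parsed independently of (and in parallel with) the others."""
        count = u32le(buf, iv)
        parts = []
        for i in range(count):
            off = u32le(buf, Interval(iv.lo + 4 + 8 * i, iv.hi))
            ln = u32le(buf, Interval(iv.lo + 8 + 8 * i, iv.hi))
            parts.append(Interval(iv.lo + off, iv.lo + off + ln))
        return parts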
The yet-unaddressed streaming case is most interesting:
> We can first have an analysis that determines if it is possible to generate a stream parser from an IPG: within each production rule, it checks if the attribute dependency is only from left to right. After this analysis, a stream parser can be generated to parse in a bottom-up way
For parallel composition, you'd want to distinguish the attributes required by the consuming/combining (whole-assembly) operation from those used only in part-parsing, so you can plan the interfaces.
Aside from their mid-level parser-combinators, you might want some binary-specific lowering operations (as they did with Int) specific to your target architecture and binary encodings.
For the overall architecture, it seems wise of flatbuffers et al. to expressly avoid unbounded hierarchy. Perhaps three phases (prelude+split, process, merge+finish) would be more manageable than the fully general dependency stages that arbitrary attribute dependencies allow.
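A hypothetical shape for those three phases, assuming the split phase yields independent intervals like the toy above:

    from concurrent.futures import ThreadPoolExecutor

    def parse_in_three_phases(buf, split, process, merge):
        parts = split(buf)                      # prelude + split: locate intervals
        with ThreadPoolExecutor() as pool:      # process: parts are independent
            results = list(pool.map(lambda iv: process(buf, iv), parts))
        return merge(results)                   # merge + finish: assemble the whole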
I would hate to see a parser technology discounted because it doesn't handle the crap of PDF or even MS xml. I'd be very interested in a language that could constrain/direct us to more performant data formats, particularly for data archives like genomics or semantics where an archive-resident index can avoid full-archive scans in most use-cases.
> ZIP files that are prefixed by random garbage can still be extracted by unzip but fail to be recognized by a parser that conforms to the format specification
To be fair, the ability to stick a ZIP file at the end of any other kind of file enables all sorts of neat tricks (like the old self-extracting zips).
And this is in fact what the spec lays out, contrary to the quote from the paper. The PK header is a convention. Conforming parsers don't require it, but lazy implementations often do. This has led to more than one security incident over the years.
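You can see this in how readers locate the archive: they search backwards from the end of the file for the end-of-central-directory record, so anything sitting in front of the archive is simply ignored. Roughly:

    def find_eocd(data: bytes) -> int:
        """Offset of the End of Central Directory record. The EOCD is 22 bytes
        and may be followed by a ZIP comment of up to 65535 bytes, so only the
        tail of the file needs to be searched."""
        tail = data[-(22 + 0xFFFF):]
        pos = tail.rfind(b"PK\x05\x06")
        if pos < 0:
            raise ValueError("no end-of-central-directory record found")
        return len(data) - len(tail) + pos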
Reminds me of binarylang[0] (a library for binary parsing written in Nim). I used it in a small hobby project to parse ELF binaries in a declarative manner (well just the headers + string table)[1].
Is this really a new thing? It feels like they've just crammed a sliver of the same bog-standard parsing we've been doing for decades back into the CFG.
I guess that's good for preventing off-by-one-based parsing errors, but surely there's prior art from long ago.
DOCX, PPTX, and XLSX Microsoft Office files are actually ZIP archives (which the paper addresses). You can append a ".zip" extension onto the end of them and explore.
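Python's standard zipfile module will also read them directly, extension and all ("report.docx" below is just a placeholder name):

    import zipfile

    with zipfile.ZipFile("report.docx") as z:    # "report.docx": placeholder filename
        print(z.namelist())                      # [Content_Types].xml, word/document.xml, ...
        xml = z.read("word/document.xml")        # main document part, WordprocessingML bytes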
Last time I tried to parse .docx it was full of opaque binary blobs; it might be a ZIP, but parsing the data is like summoning arcane magic. That might have changed in the last decade, but considering Microsoft has no incentive to make the situation better, parsing it is always going to be a "fun" exercise.
I was writing an indexer (ca. 2018), and I don't recall encountering opaque blobs, but parsing the ZIP file and XML (with a small C XPath scanner) was straightforward.
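Something like this with just the standard library is enough to pull out the text (my scanner was a small C XPath thing, but this ElementTree version is the equivalent sketch):

    import xml.etree.ElementTree as ET

    W = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

    root = ET.fromstring(xml)    # `xml` = bytes of word/document.xml from the archive
    text = "".join(t.text or "" for t in root.findall(".//w:t", W))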
The old office binary formats are basically a FAT file system containing streams of unremarkable records. Knowing what those records do is the hard part!