New Ghostscript PDF interpreter (ghostscript.com)
218 points by diskmuncher on July 31, 2022 | 92 comments



Years back, I raised the point that Ghostscript had evolved over a very long time and that, together with the huge complexity of the PDF specs, this made it a potential source of vulnerabilities.

(But maybe it wasn't as much on people's radars, with all the lower-hanging fruit in other technology choices and practices outside of PDF.)

New code for a large spec is also interesting for potential vulns, but maybe easier to get confidence about.

One neat direction they could go is to be considered more trustworthy than the Adobe products. For example, if one is thinking of a PDF engine as (among other purposes) supporting the use case of a PDF viewer that's an agent of the interests of that individual human user, then I suspect you're going to end up with different attention and decisions affecting security (compared to implementations from businesses focused on other goals).

(I say agent of the individual user, but that can also be aligned with enterprise security, as an alternative to risk management approaches that, e.g., ultimately will decide they're relying on gorillas not to make it through the winter.)


Is there any work in this space on some oddball "contamination protocol" type of security? Like you would assume everything is contaminated and you do things that eliminate the potential for cross contamination entirely, like they do in lab settings with aseptic technique. In this case, it could mean printing out the contaminated pdf on a system you don't care about being contaminated, then scanning it with an airgapped scanner to recover a 'sterile' pdf. It seems convoluted but I'm sure for some applications that could be a good solution that requires no improvement to pdf protocol.


I've heard of measures like that, including for the other direction (i.e., redacting documents without leaking information in the effectively opaque PDF format).

IMHO, having well-engineered tools handle data, and being conservative about the trust/privileges given externally-sourced data is at least complementary to the current "zero trust" thinking among networks and nodes.

(Example: Does your spreadsheet really need arbitrary code execution, in an imperfect sandbox, for all your nontechnical users? Should what people might think is a self-contained standalone text document file really phone home, to disclose your activity and location, or have the potential to be remotely memory-holed/disabled, along with the attendant added security risks from that added complexity and the additional requirements it puts on host systems/tools to try to enforce that questionable design?)


There are two relevant computer security ideas here -- "sandboxing" is used to place risky work (such as Chrome decoding some media) into an isolated process which lacks privileges to e.g. abuse access to files or networking, and "taint tracking" is used to reason about what attacker-supplied input can influence.
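As a rough illustration of the sandboxing half (a minimal sketch assuming Linux, not how Chrome or Ghostscript actually do it): hand the risky decoder its input and output file descriptors up front, fork it into a child process, and lock that child into strict seccomp mode so that almost any syscall beyond reading and writing those descriptors kills it.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <linux/seccomp.h>

    /* Child: after this prctl, only read/write/exit/sigreturn are allowed;
       anything else kills the process. */
    static void decode_untrusted(int in_fd, int out_fd) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
            syscall(SYS_exit, 1);
        char buf[4096];
        ssize_t n;
        while ((n = read(in_fd, buf, sizeof buf)) > 0) {
            /* ...parse/decode the untrusted bytes here... */
            if (write(out_fd, buf, (size_t)n) != n)
                syscall(SYS_exit, 1);
        }
        syscall(SYS_exit, 0);
    }

    int main(void) {
        int out[2];
        if (pipe(out) != 0) return 1;
        pid_t pid = fork();
        if (pid == 0) {                /* child: sandboxed decoder */
            close(out[0]);
            decode_untrusted(STDIN_FILENO, out[1]);
        }
        close(out[1]);                 /* parent: consume the output */
        char buf[4096];
        ssize_t n;
        while ((n = read(out[0], buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);
        waitpid(pid, NULL, 0);
        return 0;
    }

The parent never touches the untrusted bytes except through that one pipe, which is essentially the "no cross contamination" boundary the earlier comment was describing, minus the printer and scanner.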


DARPA is funding fundamental research in this space, specifically through programs like SafeDocs[1].

[1]: https://www.darpa.mil/program/safe-documents


Qubes OS can do that. It basically starts a disposable vm just for printing the PDF.


But why is the doc running as our user anyway? I didn't create the document, so it doesn't make sense that it runs with the rights of my user. It can certainly ask for certain permissions.

Zero days will always exist, it seems; even Chrome has these, with hundreds of security researchers' eyes on it.


More trustworthy than Adobe...

Not hard


Not sure why this is being posted now as this is from March...

But anyway - I understand why they have changed their interpreter however the lack of major version bump threw me off. I use ps2pdf to optimize pdfs (long story short - makes their size smaller) and was alarmed when my pdfs suddenly ended up without the jpeg backgrounds. Instead, purely black (although this did result in a very small file size so who knows... :) )

Thankfully you can add `-dNEWPDF=false` to your command to use the old parser. I've yet to submit a bug report, but it would be nice if it were backwards compatible...


Do you mind reporting this over at https://bugs.ghostscript.com/ ? I work on MuPDF myself, but I'm sure my colleagues working on Ghostscript would want to have any differences fixed. Thank you! :)

You can also reach us developers over at our ghostscript Discord channel https://discord.gg/H9GXKwyPvY (https://discord.gg/SnXWzqzjKs for mupdf).


Will this new PDF interpreter also go into MuPDF? For use in e.g. `mutool draw`


Because Acrobat will open these files, there is considerable pressure for Ghostscript to do so as well, though we do try to at least flag warnings to the user when something is found to be incorrect, giving the user a chance to intervene.

Anyone who has done PDF composition for a "print ready" job (what a lie) from a client has run into this so many times. All we have to do is rearrange the pages in the right sorted order, add some barcodes, and print, right? Acrobat can open the file, so why is your printer crashing? Ironically, some of those printers used an Adobe RIP in the toolchain, and this PDF->PS conversion on the printer was where things went wrong (I once tracked down a crash where a font's glyph name definition in the dict was OK in PDF but invalid syntax in PS, due to a // resolving into an immediately evaluated name that doesn't exist), but it's not something a technician could help with.

It was so bad that Ghostscript was one of many tools - we'd throw a PDF through various toolchains to hope one of them saved it in a format that was well behaved. Anyway I'm almost sad I've moved on from that job now so I can't try it out with some real world files. But in the end most of the issues came down to fonts and people using workflows that involve generating single document PDFs and merging them, resulting in things like 1000 subset fonts which are nearly identical and consuming all the printer memory, so I'm not sure how well this would help.


Many years ago I worked in print (mostly RGB to CMYK stuff, small runs) and the very expensive RIP software choked on what seemed like every PDF a customer supplied.

I ended up with a fairly large set of shell scripts over Ghostscript to convert them into high-DPI TIFFs to be able to reliably print them. It worked remarkably well, considering that one was open source and free and the other was thousands per license.


Yeah you just moved the RIP upstream, rasterize before the rasterizer :) We did that for a few jobs that caused trouble.

I haven't worked on the innards of those machines but my suspicion is that it's a combination of 1) Not much RAM, to keep costs down, 2) An inability to handle a large number of resources i.e. no swapping out to slow storage on a least-recently-used principle or similar, and 3) extremely strict conformance to avoid surprises in output.


Ghostscript (well, gv) got me through the 1990s and beyond as part of my TeX -> dvips -> gv workflow.

Kudos and thank you to those who maintain it and the associated packages!


Yep, and I remember the moment of surprise and delight when I once included an eps file into my TeX output with some color and saw the color show up in my gv output.


> As time has gone on, and we have encountered more and more PDF files with ever more unexpected deviations from the specification

Does anyone know of a collection of malformed PDF files? It would be useful for testing PDF processing programs.


Technically not all of these are malformed (sometimes the document is well-formed ISO PDF but the software won't accept it), but this corpus has a dump of all PDFs that were reported problematic in many programs, including Ghostscript, PDF.js (Mozilla) and PDFium (Chromium): https://www.pdfa.org/a-new-stressful-pdf-corpus/

(note that the majority of them are relatively harmless rendering issues, but some of the PDFs here have caused crashes or even RCEs and process takeovers)


There are some here, as test files in the qpdf library: https://github.com/qpdf/qpdf/tree/main/qpdf/qtest/qpdf

(But still, note: A couple of months ago I wrote a low-level PDF parser—just parse the PDF file's bytes into PDF objects, nothing more—and fed it all the PDF files that happened to be present on my laptop, and ran into some files that (some) PDF viewers open, but even qpdf doesn't. I say "even" because qpdf is really good IMO.)
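For flavor, here is roughly what the bottom layer of such a parser looks like (a toy sketch, not the parser mentioned above): split the bytes into whitespace, delimiters and regular tokens using the PDF spec's character classes, with everything interesting (strings, streams, the xref table, object references) layered on top of that.

    #include <stdio.h>
    #include <string.h>

    /* PDF whitespace: NUL, TAB, LF, FF, CR, space.
       PDF delimiters:  ( ) < > [ ] { } / %          */
    static int is_ws(int c)    { return c == 0 || c == 9 || c == 10 ||
                                        c == 12 || c == 13 || c == 32; }
    static int is_delim(int c) { return c != 0 && strchr("()<>[]{}/%", c); }

    static void tokenize(const unsigned char *p, size_t len) {
        size_t i = 0;
        while (i < len) {
            if (is_ws(p[i])) { i++; continue; }
            if (is_delim(p[i])) { printf("delim  %c\n", p[i]); i++; continue; }
            size_t start = i;
            while (i < len && !is_ws(p[i]) && !is_delim(p[i])) i++;
            printf("token  %.*s\n", (int)(i - start), (const char *)(p + start));
        }
    }

    int main(void) {
        const char *sample = "3 0 obj << /Type /Page >> endobj";
        tokenize((const unsigned char *)sample, strlen(sample));
        return 0;
    }

Even at this toy level, the tolerance questions the thread is about show up immediately: what should happen when a token is garbage, or an object never says endobj?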


Artifex has a public suite of PDF files here:

http://git.ghostscript.com/?p=tests.git;a=tree;f=pdf;h=2ce4f...

They're not all malformed, and they're mostly used for snapshot testing, but they cover a wide range of corner cases.


One trick you can do is fuzz PDFs yourself by taking any PDF file and opening it in vi or vim. Then write over anything you see and save it. Crude, but if all you need are some broken PDF files, that will do it.
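The same trick, scriptable (a throwaway sketch, nothing like a real fuzzer such as AFL): copy the PDF, flip a handful of random bits, and see whether your parser survives the result.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s input.pdf mutated.pdf\n", argv[0]);
            return 1;
        }
        FILE *in = fopen(argv[1], "rb");
        if (!in) { perror(argv[1]); return 1; }
        fseek(in, 0, SEEK_END);
        long size = ftell(in);
        rewind(in);
        if (size <= 0) return 1;
        unsigned char *buf = malloc((size_t)size);
        if (!buf || fread(buf, 1, (size_t)size, in) != (size_t)size) return 1;
        fclose(in);

        srand((unsigned)time(NULL));
        for (int i = 0; i < 16; i++)                /* flip 16 random bits */
            buf[rand() % size] ^= 1u << (rand() % 8);

        FILE *out = fopen(argv[2], "wb");
        if (!out) { perror(argv[2]); return 1; }
        fwrite(buf, 1, (size_t)size, out);
        fclose(out);
        free(buf);
        return 0;
    }

Crude, like the vi approach, but run in a loop it gives you an endless supply of slightly-broken files to throw at a parser.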


Fuzzing sounds like a very good idea to employ right from the beginning when writing parsers for complicated file formats.


I wasn't able to readily find any collections, and searching for anything plus the keyword "pdf" just returns links to articles published as PDFs.

That said, this GitHub topic may have some pointers: https://github.com/topics/malware-samples


Most important part of the announcement - you can still revert to the former interpreter by setting the `-dNEWPDF=false` flag.

While progress is always nice to see - I am also pleased that we don't necessarily need to update all the scripts that depend on ghostscript at once but can keep them running in their current state.


It's particularly fun for them to introduce this in a point release. If this didn't warrant a major version bump I'm frankly not sure what would.


In the past when we had to use Ghostscript for PDF processing, we always separated it out into its own process and added a whole lot of error management externally.

Even if the application was fine, you would always encounter PS/PDF files in the wild that kept stress-testing the application's memory safety.


"But Ghostscript’s PDF interpreter was, as noted, written in PostScript, and PostScript is not a great language for handling error conditions and recovering."

Isn't C, their chosen replacement for PostScript, also particularly bad at this?


I also had a slight chuckle at this. However, I'm sure C is still a great step up from Postscript.

It is however quite entertaining to read the predictable comments from Rust/Java/C++ fans who are upset that they didn't choose their favourite language.


I'd say a language is bad at error handling if it doesn't let you check whether a procedure failed or not. What C does is that it compiles even if you ignore this, which is a different issue. Java, Rust, etc. wouldn't compile if you totally ignored it, but that doesn't mean you have to do proper error handling beyond satisfying the compiler/type system.
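A trivial illustration of the difference (just a sketch): both of these compile cleanly in C, and nothing points out that the second one drops the error (and, on success, the FILE handle) on the floor.

    #include <stdio.h>

    void checked(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) {                  /* the caller chose to look at the error */
            perror(path);
            return;
        }
        fclose(f);
    }

    void unchecked(const char *path) {
        fopen(path, "rb");         /* failure (and the handle) just vanish */
    }

    int main(void) {
        checked("does-not-exist.pdf");
        unchecked("does-not-exist.pdf");
        return 0;
    }

For comparison, Rust's File::open returns a Result that the compiler warns about if you never look at it, though, as you say, you can still satisfy it without doing anything sensible (unwrap and move on).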


Are there any languages that are bad at error handling then, according to that definition? That don't let you return values, set global flags, mutate arguments or in any other way communicate back from a procedure?


It mostly depends on API design I guess, but missing language features can certainly make things more complicated than necessary. I guess the Ghostscript authors felt error handling in PostScript was difficult since it is a concatenative programming language (related to functional programming languages) with a dynamic type system, even though it has error handling facilities.


Does anyone know much about the Artifex team? How big it is etc?

They seem to be the kings of working with PDFs. I’ve not really looked at the Ghostscript code (and I’m surprised to hear their interpreter was still in postscript), but I’ve looked through the mupdf code and what I saw was really nice.

In any case, I appreciate the work they’ve done in providing fantastic tools to the world for decades now.


I don't know the current team, but I have met its founder: L. Peter Deutsch [1].

James Gosling, inventor of Java, once described him as the "greatest programmer in the world". They both used to work at Sun Microsystems.

[1] https://en.wikipedia.org/wiki/L._Peter_Deutsch


Three of the greatest programmers I've experienced worked there, Peter, Tor, Raph. Hats off.


Strangely this appears to be a new implementation not based on MuPDF, so Artifex now has two implementations of a PDF interpreter.

I wonder what made them decide to reimplement it instead of reusing their existing code.


Currently working at Artifex.

AFAICT, it's roughly 30 people, mostly seniors.

> but I’ve looked through the mupdf code and what I saw was really nice.

It is! Best onboarding experience I've ever had.


> Since there is no means to ‘verify’ that a PDF file conforms, creators fall back on using Adobe Acrobat, the de facto standard. If Acrobat will open the file then it must be OK! Sadly it turns out that Acrobat is really very tolerant of badly formed PDF files and will always attempt to open them.

I'm grinning widely when reading this.

Until last year I had an opportunity to help maintain a PDF tool written in Go. Cases where a PDF that didn't conform to the standard could be opened in Acrobat but not in other PDF readers (including Ghostscript) came up a lot from our clients, and I had to find a way to read/extract the content with minimal issues because of that.


Funny thing: I remember hand coding Postscript patterns to play around on the first LaserWriter.

PDF became such a weird mess that I'm not surprised Postscript is now just a subset of it (to a degree), but writing an entirely new interpreter must have been a hefty chunk of work...


Given the mention of security issues in their custom PostScript extensions, and that PDF files are often malformed, I wonder why they chose C as the language for the new interpreter. I don't want to write a typical HN comment (cough use Rust for everything :)) but surely there is _some_ better language for entirely new development of a secure and fast parser in 2022.

The post has no explanation of this choice. Does anyone know?


Beyond a lack of memory safety, C has another issue that makes me dislike it for this kind of application: C has a very minimal set of built in data structures. Combined with a lack of generics, this means that using, say, a dictionary means that quite a bit of the implementation gets hard coded into every site that uses the dictionary. This is almost invariably done with lots of pointers (since C has no better-constrained reference type), and the result can be bug-prone and difficult to refactor.

For all of C++’s faults, at least it’s possible to use a map (or unordered_set or whatever) and mostly avoid encoding the fact that it’s anything other than an associative container of some sort at the call sites. This is especially true in C++11 or newer with auto.
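To make the contrast concrete (a hypothetical toy map, not any particular library's API): a "generic" C map stores opaque byte blobs, so every call site restates the key and value types, sizes and lifetimes by hand.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* The container knows nothing about key or value types. */
    struct entry { void *key; size_t klen; void *val; size_t vlen; };
    typedef struct { struct entry e[64]; size_t n; } map_t;

    static int map_set(map_t *m, const void *key, size_t klen,
                       const void *val, size_t vlen) {
        if (m->n == 64) return -1;
        struct entry *it = &m->e[m->n++];
        it->key = malloc(klen); it->val = malloc(vlen);
        if (!it->key || !it->val) return -1;
        memcpy(it->key, key, klen); it->klen = klen;
        memcpy(it->val, val, vlen); it->vlen = vlen;
        return 0;
    }

    static void *map_get(const map_t *m, const void *key, size_t klen) {
        for (size_t i = 0; i < m->n; i++)
            if (m->e[i].klen == klen && memcmp(m->e[i].key, key, klen) == 0)
                return m->e[i].val;
        return NULL;
    }

    int main(void) {
        map_t fonts = { .n = 0 };
        int object_num = 42;
        /* every call site spells out the types and sizes itself */
        map_set(&fonts, "F1", 3, &object_num, sizeof object_num);

        int *obj = map_get(&fonts, "F1", 3);
        /* nothing checks that what we stored really was an int */
        if (obj) printf("F1 -> %d\n", *obj);
        return 0;
    }

With a std::map<std::string, int> in C++ the same call sites are one line each, and the key/value types travel with the container rather than with every caller.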


[WUFFS](https://github.com/google/wuffs) is made for stuff like this, and it has a library available as transpiled C code.


> this means that using, say, a dictionary means that quite a bit of the implementation gets hard coded into every site that uses the dictionary

I don't understand this part of your comment. There's nothing preventing you from designing a nice well-encapsulated map/dictionary data structure in C and I'm sure there are many many libraries that do just that.

I do agree though that having such basic data structures in the standard library, as modern C++ does, is usually preferable.


Lack of generics will do that, unless you consider that blindly casting `void *` all over the place counts as "well-encapsulated". Even with macro soup, designing a good type-agnostic dictionary implementation for C is rather challenging. Linked lists are okay if you use something like the kernel's list.h, but even then it's macro-heavy and has its pitfalls.

In my work as an embedded developer I still use C a lot and it's probably the programming language I know best and have the most experience with but it would never cross my mind to write a PDF interpreter in it unless I had a tremendous reason to do so. There are so many better choices these days.


Type safety and encapsulation are distinct issues. The Linux kernel uses many well-encapsulated interfaces but it's written in C and the typing reflects that limitation.

Personally I haven't used straight C in years and would never choose it over C++ unless platform constraints required it, but a vast amount of very complex software has been and continues to be written in C, including all the widely used OS kernels, so I don't find it very surprising that a new feature in a very old piece of software would be written in it.


Except when you need to build from source; you'll need yet another whole compiler toolchain that may or may not behave well in a specific environment - e.g., do you know how well Rust (or another "modern" language) works on late-nineties MIPS systems? The C compiler is the lowest common denominator.


> There's nothing preventing you from designing a nice well-encapsulated map/dictionary data structure in C

When you write a set function for your map data structure, what type do you make the key parameter?


Code from yalsat (a stochastic SAT solver) [1] taught me something two years ago: I can declare an array of some element type and keep access to the elements statically typed. Same with maps, sets and others. A rough sketch of the idea is below.

[1] https://github.com/msoos/yalsat/blob/main/yals.c#L49
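Roughly, the trick looks like this (a sketch of the general technique, not the actual yalsat code): a macro bakes the element type into the variable's own struct type, so pushes and reads stay statically typed and there is no void * in sight.

    #include <stdlib.h>

    /* Growable stack whose element type is part of the type.
       (realloc error handling omitted to keep the sketch short.) */
    #define STACK(type) struct { type *start, *top, *end; }

    #define PUSH(s, elem) \
        do { \
            if ((s).top == (s).end) { \
                size_t old_ = (s).start ? (size_t)((s).end - (s).start) : 0; \
                size_t cap_ = old_ ? 2 * old_ : 16; \
                (s).start = realloc((s).start, cap_ * sizeof *(s).start); \
                (s).top = (s).start + old_; \
                (s).end = (s).start + cap_; \
            } \
            *(s).top++ = (elem); \
        } while (0)

    int main(void) {
        STACK(int) lits = { 0, 0, 0 };
        PUSH(lits, 7);
        PUSH(lits, 13);
        int first = lits.start[0];   /* statically typed access, no casts */
        free(lits.start);
        return first == 7 ? 0 : 1;
    }

The obvious downsides are the usual macro ones: the code is regenerated at every use, error handling is awkward, and the compiler's diagnostics point into the macro body when something goes wrong.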


C is a pointer-based language, so there are lots of ways to solve that, but you know that already... this is a setup question. Of course it's not useful to re-invent critical, secure functions over and over; yet what if I am not writing critical, secure functions anyway?

I would choose a key type that is natural to the environment and problem... unsigned integers are useful. Which unsigned integer size? There are only a couple of practical answers to that: unless there is some massive dataset, use a 32-bit unsigned integer, like so much software does right now.


size_t key_size, void *key


And then eschew type safety


> nice well-encapsulated

...

> void *


Type safety and encapsulation aren't the same thing. Encapsulation is about hiding implementation details from the user of an API, which is what the comment I originally replied to was claiming you couldn't do in C.


The void * is (should have been!) an implementation detail, and you're leaking it in the interface - that's not encapsulation.

For example if I want to store a __int128 on a 64-bit machine I'll have to deal with stuff like memory allocation and lifetime myself, when the data structure should do that.


It's not an implementation detail, it is part of the interface. The data it points to belongs to the caller and the caller is responsible for managing it, just like anywhere else in C.


> The data it points to belongs to the caller and the caller is responsible for managing it

Going back to the start of the thread where you said you didn't understand this:

> this means that using, say, a dictionary means that quite a bit of the implementation gets hard coded into every site that uses the dictionary

You can see you've just said yourself what you were saying you didn't understand: with data structures in C you're responsible for manually managing things that the data structure itself would handle via generics in most other languages; instead you have to hard-code them at the use site, turning the data into something that can be pointed to and managing its lifetime yourself.


Code reuse is achievable by (mis)using the preprocessor system. It is possible to build a somewhat usable API, even for intrusive data structures. (eg. the linux kernel and klib[1])

I do agree that generics are required for modern programming, but for some, the cost of complexity of modern languages (compared to C) and the importance of compatibility seem to outweigh the benefits.

[1]: http://attractivechaos.github.io/klib


My guess is that since the rest of the project (not in PS itself) is in C, it’s in C. And it may be borrowing from the PS interpreter codebase. I dunno.

Requiring another skillset, toolchain, etc. is onerous and has to be weighed in those decisions. Rust is cool for sure, but difficult to adopt in brownfield projects because of humans more than tech.

Also, it wasn’t written in 2022, just made the default now. GS is a venerable codebase, and jumping on a “new” language bandwagon may have seemed dangerous at the time it was started.

All conjecture. I’m not an expert or involved.


We (Latacora) previously advised clients to encapsulate GhostScript processing in something with a hard security boundary (like a Lambda) and I am not expecting the new implementation to change that.


Is this AWS Lambda or what kind of "Lambda" is this about?


Yep, AWS Lambda.


It looks like it needs to interoperate with the rest of their codebase, which was already written in C:

> The new PDF interpreter is written entirely in C, but interfaces to the same underlying graphics library as the existing PostScript interpreter. So operations in PDF should render exactly the same as they always have (this is affected slightly by differing numerical accuracy), all the same devices that are currently supported by the Ghostscript family, and any new ones in the future should work seamlessly.


That is not an argument against Rust, at least, since it's super easy for Rust to consume and offer a C interface. I think it's more of a shift in mentality that needs to occur.


While it doesn't prevent Rust from being used, it is still a hurdle which must be overcome. Building and maintaining a multi-language build system has significant costs, especially for a project with as much history and as wide use as Ghostscript.


It is so easy and well documented that the first page of Google results for “rust autotools” does not contain anything about how to integrate Rust code into an existing autotools project.

Another issue is the general subtle brokenness of Rust tooling on anything that is not Linux on amd64.


I suspect they need portability more than most projects.


Are you kidding? Many other languages are as portable, if not more portable.[α] Your point would be valid in 1972, not in 2022. I can't believe you're regurgitating the same "portability" from 50 years ago, today (unless you meant it as a joke and forgot to include a /s).

[α] Languages targeting LLVM or supported by GCC are portable to every target machine code / ISA / architecture supported by those toolchains. JVM, JS, etc. are portable to all the platforms they support. You don't need to do any extra work (of recompiling) if you use a bytecode VM / platform (for example, the JVM).


Well, there's portability and then there's portability. Getting LLVM to emit artifacts on a given target is easy. Getting assurance that big, complex interfaces that integrate with the underlying OS in extremely specific ways (i.e. your programming language's IO or concurrency system) behave correctly on that target, and have appropriate testing, community support, and documentation is another thing entirely.

Like, I get it. The claim that "rust isn't portable" is often used as a thought terminating cliche, and is often wrong or irrelevant in context. But the claim "X uses LLVM, LLVM can target environment Y, therefore X is fully compatible with Y" is just as reductive and misleading.


Does an LLVM requirement fit the social and licensing goals of this ecosystem-fundamental project?


I don't even actively code with Rust, but just the fact that it's been packaged as a dependency has been enough of a headache for me. The latest issue is with a Homebrew package that has Rust as a dependency. It turns out that on macOS Mojave, Rust needs to be built from source since there is no bottle. I let it build for a full day and it still didn't finish, so I gave up. Then I installed Rust independently with rustup and successfully linked that install to brew, which nearly worked, but failed with the cryptic "rustup could not choose a version of cargo to run..." error that I can't make any sense of, because the suggested fix - download the latest stable release and set it as your toolchain with 'rustup default stable' - didn't do anything, since that was already done. The real salt in the wound is modern Google search bringing up nothing relevant.


WUFFS seems like a great option for this.


One reason may be that they want to build a high-level wrapper around that C API, something that is well documented in some languages (e.g. Python).


No, not more Rust activism. Please, anything but more of this. Have some shame.


Using C sounds like it will bring a whole new list of exploits with it.

Not good!!


C is not inherently unsafe. Sure, it doesn't have "memory safety" as a feature. But there are loads of applications considered safe written in C. An experienced C programmer (with the help of tooling) can write safe C code. It is not impossible.


That would explain all the vulnerabilities in systemd and Linux. They just aren't experienced enough. Linus needs to get in touch with an expert.


I’m looking forward to your efforts in rewriting it in Rust


So is everyone else! Can't happen soon enough.


SQLite is the most stringently developed C code I'm aware of: the test suite maintains 100% branch coverage, it is routinely run through all of the sanitizers, and it is regularly fuzzed.

It still accumulates CVEs: https://www.sqlite.org/cves.html.


As I recall, one of the advantages of C over Rust is that the SQLite authors have the tooling to do 100% branch coverage testing of the compiled binaries in C. They tried Rust, but Rust inserts code branches they are unable to test.

The tradeoff, then, is the small number of denial-of-service bugs listed vs. not having 100% branch coverage, and they chose to keep the branch coverage.

(The authors also believe Rust isn't portable enough and doesn't handle out-of-memory errors well enough - https://www.sqlite.org/whyc.html#why_isn_t_sqlite_coded_in_a... .)


Are you aware of a way to develop fault-free code? Please share this knowledge, then.


It's easy to develop fault-free code: just redefine all those faults as (undocumented) features!

That's not a helpful answer, but it's basically the same thing you're doing: redefining as programmer faults the memory safety vulnerabilities that would be precluded entirely by writing in memory-safe languages.


He's aware of a way to develop memory-corruption-fault free code, obviously.


I guess "experienced C programmers" must be short supply although they have been writing C for years.


The effort is massive and the experience to do so at scale is very rare.


When people write code which doesn't have memory safety and lacks the compactness needed for a mature product, it is not the C language's fault.

C is a well-tested, compact language - the fact that the Linux kernel, the BSD kernels, device drivers and a whole lot of games and physics engines are written in it for performant systems is a testament to its reliability.

Additionally, I think it's the sane move. A language which is a hot cake today (yes, Rust) may or may not be in fashion 5 years from now when there's a new hot cake. Choices are made keeping 10-15 years of project development in mind.


When Michael Abrash wrote his books, C compilers weren't known for the quality of their blazing machine code.

It also has a proven record that, no matter what, exploits are bound to happen, making the whole industry turn to hardware memory tagging as the ultimate solution to fix C.


Memory exploits in Rust: yes, much rarer. That's where the world is (hopefully) headed.

But then a lot of people would disagree with "Let's jump on the <new-hot-language> bandwagon ASAP". Even the transition of parts of the Linux kernel to Rust is slow and cautious. In that sense, C is still a widespread choice. Plus there are a lot of people actively working in C. Rust is only just picking up. It's more likely people will write buggy code in a new language than in something which has been around for a while.


No need to bring Rust into this talk, Modula-2 from 1978 would already sort out most of C's mistakes, or even JOVIAL from 1958.

C became a forced choice thanks to UNIX's free beer, and like JavaScript with the Web, it tainted us all.


Of course, let's better use a PostScript interpreter also written in C, so your exploits leveraging both at least look like art.


Stop this.


Surprised the decision wasn't made sooner.


How interpreting PDF in Postscript became untenable



