Ask HN: Why is the PDF format so inaccessible?
99 points by shawnfrostx on May 4, 2022 | 104 comments
I am working on some typographical software that is supposed to generate PDFs at the end. It seems like there is no accessible information on how to do this. The PDF ISO specification is behind a paywall and has a dead link to a 2008 spec. There are open source converters like pandoc, but nothing that actually writes to PDF that I can find. Is there any resource that goes over the process of PDF generation?



The PDF spec is officially available here: http://www.adobe.com/go/pdfreference

There’s also this book which provides a good introduction and overview and is useful for understanding how the format works (although the PDF reference itself is pretty decent too, as far as specs go): https://www.oreilly.com/library/view/developing-with-pdf/978... (You can find a PDF copy if you look around.) EDIT: There’s also https://www.oreilly.com/library/view/pdf-explained/978144932... which might be even better.

However, be warned that the PDF format can be quite complex and is not exactly for the faint of heart. It’s best to use an established library to generate PDF output, like PDFBox, iText, PDFSharp, PDFKit, etc. Those tend to have their own tutorials.

For emphasis: Do not generate PDFs “by hand”! You risk inadvertently generating PDFs that do not fully conform to the spec, and not noticing it because PDF readers are quite lenient in what they accept. A lot of PDFs in the wild are not standard-conforming in some way or other, because their generators were not carefully written against the spec, but against “whatever Acrobat Reader accepts”. This is the bane of every piece of software on the receiving end that needs to process PDFs.


I've always found it amusing from a "bootstrapping" perspective that the PDF spec is itself a PDF.

That said, if you're only writing and not reading existing ones, it's as straightforward or as complex as you want to make it. I wrote a little program many years ago to convert plaintext to PDF in <1kLoC of C --- and its output was actually often many times smaller than what the commercial PDF-generators do, because I just used the defaults and bare-minimum necessary. I wrote it after being confronted with a requirement to use PDF, and the output of contemporary generators seemed rather bloated in comparison. The spec has an example of a bare-minimum; it's quite easy to programmatically generate.
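For anyone curious what "bare-minimum" looks like in practice, here is a rough sketch in Python (not the C program mentioned above, which I have not seen). It assumes a single US Letter page, Latin-1 text only, no line wrapping, and the built-in Helvetica font. The function name `build_minimal_pdf` and the layout numbers are made up for illustration.

```python
def build_minimal_pdf(lines):
    """Return the bytes of a one-page PDF showing `lines` in 12pt Helvetica.

    Sketch only: no wrapping, no pagination, Latin-1 text only.
    """
    def esc(s):
        # Backslashes and parentheses must be escaped inside PDF string literals.
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    # Content stream: begin text, pick font, set leading, move to the
    # top-left margin, then show each line and advance to the next one.
    ops = ["BT", "/F1 12 Tf", "14 TL", "72 720 Td"]
    for line in lines:
        ops.append("(%s) Tj T*" % esc(line))
    ops.append("ET")
    content = "\n".join(ops).encode("latin-1")

    # Objects 1..5: catalog, page tree, page, font, content stream.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj", needed by the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    # Cross-reference table: every entry is exactly 20 bytes long.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as f:
        f.write(build_minimal_pdf(["Hello, PDF", "written without a library"]))
```

The only fiddly part is remembering the byte offset of each object for the xref table. I would still run the output through a few different readers before trusting it.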


Please make your code available on github!


Plain text to PDF is easy if you use defaults for the font and for how many words and characters go on each line, and if you exclude control characters, Unicode, and most of the fancy stuff like letterheads.


Just make sure that you follow the spec. Many such programs actually leave out mandatory parts of some objects, making the whole PDF invalid.


It's certainly true that there are a lot of PDF renderers out there with subtle incompatibilities (and bugs), and also a lot of PDF files with subtle nonconformances. However, that doesn't seem like a good reason to not write a new PDF generator!

Instead, write a conformant one. Better, write one that not only conforms, but also isn't affected by any of the bugs in popular PDF renderers, by testing against all of them. Shawn Davis at LevelUp Research, working on the same DARPA project I'm currently on, has written the amazing SPARCLUR https://youtu.be/6I6E1N3CJzQ (no sound) https://github.com/levelupresearch/sparclur https://pypi.org/project/sparclur/ which will feed your test PDF to Ghostscript, MuPDF, PDFium, PDFMiner, Poppler, QPDF, Xpdf, and some other PDF engines, and compare the results. That way you can see not only if any of them produce errors and warnings, but also if they render it differently or extract different text from it. SPARCLUR is Apache-licensed, written in Python, and very well integrated with Jupyter.

(We've developed some other tools for this as well, but they're not as accessible.)

SPARCLUR doesn't test your PDF against Adobe's implementation, or for that matter Foxit, and I don't remember why.

But basically if you're going to generate PDF files you might as well feed them to SPARCLUR's Spotlight and automatically verify that they work identically, or near enough, in half a dozen independently implemented PDF renderers. Hopefully SPARCLUR will dramatically improve the software quality and compatibility of future PDF generators.


> against Adobe's implementation

Hard to automate; the Windows, macOS and Linux versions have differences

> Foxit

No Linux version, even harder automation than Adobe's


Yeah, that would explain it.


Thanks for mentioning SPARCLUR - I didn't know that one and will certainly have a look at it.


It's brand new!


That’s just a sanity check though, not an actual conformance test.


While I agree that it's not an actual conformance test, in the sense that it won't detect deviations from the spec that all of the parsers forgive (for example, using a space instead of a line ending between "startxref" and the offset of the xrefs), it is in many cases more rigorous than an actual conformance test, because there are dark corners of the PDF spec that no PDF renderer implements correctly.
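To make the "startxref" example concrete, here is a small sketch of the kind of strict check that a differential test across lenient parsers will not give you. It assumes the keyword sits in the last 1024 bytes of the file and only checks that a line ending, rather than a space, separates it from the offset. The function name is hypothetical.

```python
import re

# The keyword must be followed directly by an end-of-line marker, then the offset.
STRICT_STARTXREF = re.compile(rb"startxref(\r\n|\r|\n)\d+")


def startxref_is_strict(pdf_bytes):
    """Return True if the last 'startxref' is followed by an EOL and a number."""
    tail = pdf_bytes[-1024:]
    idx = tail.rfind(b"startxref")
    if idx < 0:
        return False
    return STRICT_STARTXREF.match(tail[idx:]) is not None
```

A file ending in `startxref 1234` fails this check even though most readers will open it without complaint.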


The spec is pretty amazing and very readable, for anyone wondering if they should look into it.

My top tip for understanding PDFs is to take one that you have, decompress it, then open it with a text editor.

    mutool clean -d in.pdf out.pdf

What you'll find is very approachable for a developer. It's a tree of nodes of different types. Some are dictionaries, some are streams of other data. All of them are documented in the spec. There's all sorts of wonderful corners, like spot printing colors.
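If you would rather skim the object list than scroll through the file, a few lines of Python can pull out each object number and its /Type. This is a regex sketch, not a parser: it assumes the file has already been decompressed as above, that it uses a classic layout without object streams, and that no stream happens to contain the word "endobj". The function name is made up.

```python
import re

# "N G obj ... endobj" pairs, and the first /Type name inside each one.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")


def list_object_types(pdf_bytes):
    """Return [(object_number, type_name_or_None), ...] in file order."""
    result = []
    for match in OBJ_RE.finditer(pdf_bytes):
        type_match = TYPE_RE.search(match.group(3))
        name = type_match.group(1).decode("ascii") if type_match else None
        result.append((int(match.group(1)), name))
    return result
```

Objects that come back with `None` are usually content streams or plain arrays and numbers, which have no /Type entry.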

PDFs are actually ok, wait till you dig into the fonts. Now there's the real dark art of the ancients.


I still wake up screaming about inconsistent metrics.


mutool is described at https://mupdf.com/docs/


Thank you for the books and the suggestions. I understand the warnings since I did not really provide any information on my project in the post. Essentially, I am writing a batteries included language to act as a replacement for LaTeX (or at the moment, a small subset of it). I cannot use any existing library since they do not work for my own PL and I do not want to go through PostScript -> PDF since the reason I started this project was because the time delay between saving a .tex file and seeing the output on my pdf reader was becoming very long.


That’s fine, and I actually applaud avoiding complex dependencies like LaTeX/PostScript. Using a good PDF library might still make sense for your use case, as they generally provide enough low-level support. In particular for handling the COS object system, which you really don’t want to do by hand if you can avoid it. Or you can at least start with using a library to get familiar with the higher-level PDF peculiarities, and only switch to fully hand-written output later when and if you find that necessary.


Ok! I will look into those libraries. Thanks a lot.


Which programming language are you using to write your LaTeX replacement? There are PDF libraries available in most languages (granted, there is a wide spectrum of them with respect to how much of the PDF specification they implement, and how well).


Why not make it compile to HTML instead? It's much more accessible.


> Do not generate PDFs “by hand”! You risk inadvertently generating PDFs that do not fully conform to the spec, and not noticing it because PDF readers are quite lenient in what they accept.

We have that with websites too. Sometimes it is even hard to spot an error like a missing tag because browsers just assume (often correctly) that it was missing in the first place. But yes, probably also a reason nobody wants to write new browser engines.


I think Skia's PDF render support isn't fully compliant with the specification, and that's a Google product with dozens of Google engineers working on it.

Can you comply with a subsection of the specification?

I guess my point is that nobody will care to create an app with attention to the hundreds/thousands of programming tasks required for full compliance with a 600 page specification...


> The PDF spec is officially available here:

When did this happen? I swear some years ago I looked and it was a $300 standard?


The ISO version costs money, but the virtually identical Adobe version is available freely. I believe that has been the case since at least when version 1.7 was published in 2008.


> The PDF spec is officially available here

I'm afraid that's the old 2008 spec, which specifies versions up to PDF 1.7.

The current standard is ISO 32000-2, released in 2020, which specifies PDF-2.0. I'd love to get my hands on it, but alas it's paywalled.


PDF 2.0 isn’t very relevant currently, because there’s not a lot of software yet that fully supports it. On the other hand, not all that much has changed since PDF 1.7. In any case, for generating PDFs it’s best to stick with 1.7 (or earlier) for the time being.

This page lists the major changes: https://www.loc.gov/preservation/digital/formats/fdd/fdd0004...

The FDIS of PDF 2.0 is available here for those who are curious: https://cdn.standards.iteh.ai/samples/75839/ad216d84afd34f96...

I don’t know if there are any significant changes between the FDIS and the published version, but it’s better to assume there may be some.


As for libraries, it seems PDFKit is the dominant one.

https://github.com/foliojs/pdfkit

As to why it’s so inaccessible…because Adobe created this monstrosity to do just about everything. Text, fonts, vector graphics, raster graphics, forms, color spaces, JavaScript, encryption, signatures, 3D artwork, video, audio, Flash, and probably more. It’s bonkers as to what it can possibly include, and it was developed during a way different time.


When you put it like that, sounds like there's a log4j somewhere in there, easily.


There have been many. Acrobat CVEs were a dime a dozen in the 2010s


You can embed flash in PDF?!?


In the good old days you could embed executable code in a PDF. Interactive flash games. Java applets.

Sanity has prevailed and a lot of that just doesn’t work anymore, but IIRC Adobe wanted PDF to be “the” file interchange format.


And those weren’t even things when the PDF was first created.


We recently had a ticket from a user where an uploaded PDF wouldn't load in browser (Chrome). Turns out the PDF had embedded Flash content. I was blown away that in 2022 some application somewhere was still embedding Flash in their PDFs.


I'd count yourself lucky whatever that spooky shit was couldn't execute!


Yeah, I had forgotten that when I was looking through the version history. I remembered most of those features, but when I saw Flash, I had to sit and slowly blink for a minute.

(A lot of those “interactive” features are exposed in Adobe InDesign, hence my passing, regrettable familiarity.)


You can embed all multimedia types, plus internally attach files like xls, or whatever you want.

This is part of the reason it's such a security nightmare.


Most recently I used ReportLab for direct PDF writing from Python¹, but generating them from PostScript is often easier², depending on what you're doing. https://en.wikipedia.org/wiki/PDF#External_links has a lot of information; also the "Further reading" section has some links which Adobe has broken at the moment, but archive.org versions of them like https://web.archive.org/web/20200127173721/https://www.adobe... work. Also I think Adobe put PDF 1.7 on the Archive themselves: https://archive.org/details/pdf1.7

The ReportLab APIs mirror the PDF file structure relatively closely.

Don't listen to the people who are nattering on about how PDF is proprietary on purpose. I think that may have been the case in its early years but it hasn't been the case this millennium.

PDF 1.7 (the spec from 02008) and even earlier versions are most often used, as you'll see if you run head -1 *.pdf in a directory with a lot of random PDFs. PDF 2.0 is not important and you may want to intentionally write an earlier version for broader compatibility. The big incompatibility is actually PDF 1.4 to 1.5: 1.5 added compressed object streams, and a lot of readers still don't support those.
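A rough Python equivalent of the head -1 survey, for anyone who wants counts instead of eyeballing the output. It only looks at the %PDF-x.y header. From PDF 1.4 on, a /Version entry in the document catalog can override the header, and this sketch ignores that. Searching the first 1024 bytes rather than just byte 0 is my assumption about how lenient readers behave. The function names are made up.

```python
import re
from collections import Counter
from pathlib import Path

HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(first_bytes):
    """Return the version from the %PDF-x.y header as a string, or None."""
    match = HEADER_RE.search(first_bytes[:1024])
    if match is None:
        return None
    return "%s.%s" % (match.group(1).decode("ascii"), match.group(2).decode("ascii"))


def count_versions(directory):
    """Tally header versions across every *.pdf file in `directory`."""
    counts = Counter()
    for path in Path(directory).glob("*.pdf"):
        with open(path, "rb") as f:
            counts[header_version(f.read(1024))] += 1
    return counts
```

Calling `count_versions(".")` returns a Counter keyed by version string, with `None` for files that have no recognizable header.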

______

¹ https://github.com/kragen/dercuano/blob/master/genpdf.py

² http://canonical.org/~kragen/sw/laserboot/Makefile


Hello. Thank you for the suggestions. I will look into ReportLab and the older versions of PDF. I am trying to avoid the whole PostScript / GhostScript route since my primary goal is to generate a PDF as fast as possible.


Usually what's slow is the layout computation, not the PDF text serialization of it, which is pretty efficient. Though I found that ReportLab was adding lots of metadata to all my links, resulting in an overall large filesize, and because it targets a pre-1.6 version the links were all uncompressed (though the page contents were compressed, which has been a feature of PDF for a long time).


It's old, proprietary, modeled after the PostScript printer control language, from an era before XML, and never had the intention of being open.


> It's old, proprietary, modeled after the PostScript printer control language, from an era before XML, and never had the intention of being open.

I have the impression you have no direct contact, experience, or first-hand knowledge with PDF.

If age mattered (which it doesn't) then PDF's latest update was published in 2020, which is far fresher than XML's spec.

Nevertheless, it's absurd to compare a document format with a markup language. At most, you should compare OOXML with PDF, if that comparison mattered. If it did, you'd certainly be surprised to discover that PDF is far simpler, more readable, and easier to reason about than OOXML+XML.

Nevertheless what makes PDF complex is that it has about a dozen versions which support everything and the kitchen sink, including incremental document updates which can also be comprised of ad-hoc version updates.


> If age mattered (which it doesn't) then PDF's latest update was published in 2020, which is far fresher than XML's spec.

Standards are dragged down by their oldest version, not the newest.

EDIT: Sorry, that was a nice punchline but didn't actually explain very well. My point is that unless they actually start over, which almost never happens, newer versions tend to just be more layers of stuff to deal with. It's not just a "who has the most recent revision date", but "who has the least 'interesting' historical baggage baked into the spec".


> Standards are dragged down by their oldest version, not the newest.

Not really, especially if you keep in mind that newer standards have errata and newer versions, like PDF 2.0, deprecate features.


How's that going for JPEG2000? HTTP2? IPv6?

Not saying they don't have adoption, but to this day you are debilitatingly limiting yourself if you ignore the older ones, and their cruft.


My previous startup worked with parsing PDFs, trying to apply NLP to the texts within PDFs - extracting titles, paragraphs, tables, bullet points etc. Oh my that was a nightmare. Sure we were doing difficult things, so that made us unique, but it was a slog. Working with different dimensions, pages upside down, sentences spanning across multiple pages etc etc.

I've also recently worked on a small tool called scholars.io [1] where I had to work with PDFs. I wasn't doing anything like parsing, but I just used existing PDF tools and libraries, which were much more pleasant, but still working on top of PDF is a challenge.

[1] - https://scholars.io (a tool to read & review research papers together with colleagues)


People often forget that PDF is not a "document" format, but a printing format. If you want to work with a document format, that's universally accepted, you work with RTF. DOC/DOCX from Microsoft are monstrosities just like PDF - and just like in PDF, also in DOC/DOCX you can embed anything (movies, pictures, executables, flash, God and the multiverse, etc etc).

A printing format is something finished, that you don't go back from. Or you can try to go back from, but you get a lot of pain in return. Hence your previous startup problems.


Make use of TeX and Friends source code for handling PDF symbols, then it is much easier to check different implementations. For example, TikZ/PGF package has both PS and PDF implementations of the same graphical objects. So you can see how PDF literals or PS specials come into object stream.

Also, it is really not that cryptic, just very laborious, hence many people rely on classical tools to generate PDF instead of handcrafting PDF files from scratch. Here is a nice introduction from a decade ago for you https://blog.idrsolutions.com/2010/09/grow-your-own-pdf-file...


Thank you. I will look into Tikz source code.


I see no mention here of the most straightforward way to generate a pdf - pdfmarks.

Create blank pdf with Adobe as your base, then add what you want to it using pdfmarks and distilling.

I spent a very, very long time diving into the rabbit hole that is pdf to come to this conclusion.

There are lots of libraries out there, but none that I came across met my needs; none would do named destinations, for one example. I think there might be some very expensive ones that might, but pdfmarks will get you sorted.

Here is the manual; if you search around, there are a few other references.

https://opensource.adobe.com/dc-acrobat-sdk-docs/acrobatsdk/...


> There are open source converters like pandoc

I don't think Pandoc knows anything about the PDF format. It can't read it https://github.com/jgm/pandoc/tree/master/src/Text/Pandoc/Re... or write it https://github.com/jgm/pandoc/tree/master/src/Text/Pandoc/Wr.... It uses other tools to do that.


You could have a look at how ghostscript does it: http://git.ghostscript.com/?p=ghostpdl.git;a=tree;f=pdf;hb=r...


The only link I could find was an archive link https://archive.org/details/pdf1.7

The open PDF standard now costs 250 USD. Adobe is supposed to have an archive of the 1.7 spec online but they do not care enough to keep that up it appears.

I am trying to think of a reason why they would do such a blindly dumb thing but the C++ people used to do the same thing.


Not really answering your question but you could consider generating postscript output and then using ghostscript to convert it to pdf. That would let you create and write arbitrary stuff. I think pandoc uses pdflatex to generate a pdf via latex from the internal pandoc representation.

Imagemagick also writes to pdf I believe, but it may only convert raster images. With postscript you can generate a vector pdf


Hello. My project is essentially a replacement for a small subset of LaTeX. I use LaTeX for a lot of my own work but the massive amount of converters going from LaTeX to PDF is the reason I started this project in the first place.


Understood. Is your project open source? I'm curious to see what the solution looks like when you sort it out. I feel like there must be a lot of heavy lifting (but also a lot of cruft if all you care about is a narrow case) being done by pdflatex to get to an output pdf. It will be interesting to see what the minimal solution looks like.


Hello. It will be open source once I release it -- I am hoping by August, since I have the summer off by a stroke of luck.


Imagemagick just shells out to ghostscript.


Most of the best, most comprehensive PDF libraries are written for Java. There are libraries for other languages but they tend to be incomplete or flawed. There’s also some great paid libraries for C#, but if you want free, I’d recommend looking into Apache PDFBox.


I recently had this revelation. I remember how easy it was doing this in Java a decade ago...and how underwhelmed I was when looking for a library to achieve the same in Python. There seems to be no clear winning library in Python, and some haven't been maintained for years.


In Python I just used ReportLab. It seemed okay but I had a lot of problems because I didn't understand the PDF format. Which library did you try and what was your experience?


As others have already written, there is a free version of the PDF 1.7 specification available and using this you are (nearly) able to implement a PDF reader/writer. I wrote "nearly" because of the many malformed PDFs out there and because of some ambiguities in the spec that will have you look at certain parts of the implementation of existing libraries.

That said, implementing a basic PDF reader/writer is not that complex and can easily be done in a few months. However, since you seem to also want to generate PDF pages with content, a whole lot of things have to be considered, like fonts (Type1, TrueType, CFF) and how to actually generate the content.

Adding some straightforward text using some built-in PDF font onto a PDF page is easy. But if you want to use a (subset) TrueType or OpenType font, have ligatures, contextual character substitutions, (LaTeX like) line wrapping, tagging for accessibility, ... you will open a can of worms ;-)

This is certainly also doable but gets quite complex and is the reason many PDF libraries only implement basic typographic features that are easy. You can probably count the PDF libraries supporting advanced OpenType typographic features on one hand...

However, if you are already in the process of writing typographical software, this last part may actually already be done in your case. So if you have, as output from that software, the glyphs and their positions, there is not that much complexity to implement and you could probably use a basic PDF library to do the PDF writing for you.
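To illustrate how little is left once layout is done, here is a sketch that turns already-positioned text runs into a page content stream. It assumes each run is an (x, y, text) tuple in PDF points, a single font already registered in the page resources under the hypothetical name F1, and plain Latin text. A real typesetter would emit glyph IDs for an embedded font instead of literal strings.

```python
def escape_pdf_string(text):
    # Backslashes and parentheses must be escaped inside PDF string literals.
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def runs_to_content_stream(runs, font_name="F1", font_size=10):
    """Build a content stream from (x, y, text) runs placed by the layout engine."""
    ops = ["BT", "/%s %g Tf" % (font_name, font_size)]
    for x, y, text in runs:
        # Tm sets the text matrix; "1 0 0 1 x y" is a pure translation to (x, y).
        ops.append("1 0 0 1 %.2f %.2f Tm" % (x, y))
        ops.append("(%s) Tj" % escape_pdf_string(text))
    ops.append("ET")
    return "\n".join(ops) + "\n"


print(runs_to_content_stream([(72, 720, "Hello"), (72, 706.5, "world")]))
```

The resulting string still has to be wrapped in a stream object and referenced from the page's /Contents entry.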


Doesn't e.g. cairo solve this problem? https://en.m.wikipedia.org/wiki/Cairo_(graphics)


If you're comfortable handling the (typo)graphical aspects of the PDF yourself and have the ability to consume a C++ library, I've had good experiences using the Apache-licensed qpdf[1] library to handle the low-level structural aspects of the PDF standard. It's particularly convenient when your application requires structure-preserving integration of existing PDF content.

Simple example applications, each completed in 2–3 days, both in C#, using C++/CLI to integrate libqpdf:

1. Overlaying fixed-format text on pre-existing blank PDF form pages, ensuring the content of each distinct form page is embedded exactly once, and that all necessary assets (fonts, images, etc.) from the blank form PDF pages are included in the output PDF.

2. Losslessly combining a sequence of PDF, TIFF, and JPEG images into a single PDF with bookmarks pointing to the first page of each source file and existing image compression maintained where possible. In this application, only the source TIFFs were anything other than arbitrary (i.e., the TIFFs were more-or-less baseline images coming from a small number of scanning systems, but the JPEGs and PDFs came from all sorts of different applications).

[1] https://github.com/qpdf/qpdf




Find a library.

Couple years ago I needed to generate PDF reports, relatively complicated ones: headers/footers/backgrounds, page numbers, complex tables, jpeg bitmaps, custom vector graphics in diagrams, etc. This one did the job: https://www.nuget.org/packages/iTextSharp-LGPL


For Node.js there is a nice library called PDFKit (https://pdfkit.org/) which offers a canvas-like API for drawing graphics, text with ttf fonts, and other graphical elements. I would say it's pretty good if you need exact control over the PDF output.


Aspose.PDF is a great library for creating and editing PDF documents:

https://products.aspose.app/pdf

You can also render your PDF documents online for free from many source formats at

https://products.aspose.app/pdf/conversion

For example, this library generates documents from scratch in line with the ISO specification from Adobe:

https://docs.aspose.com/pdf/net/create-document/


Cause it belongs to Adobe and they clearly don’t want to make it easy for developers to work with it.


The PDF specification, hosted by Adobe, free for you to download… (pay attention, this is a big PDF) https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/p... Adobe specifically negotiated to make this freely available.

From the document:

This document you are now reading is a copy of the ISO 32000-1 standard. By agreement with ISO, Adobe Systems is allowed to offer this version of the ISO standard as a free PDF file on it's web site. It is not an official ISO document but the technical content is identical including the section numbering and page numbering.

I include that, so you can Google up a copy when the URL changes again. (Copy is disabled, I had to retype it. The misuse of "it's" is present in the original.)

As far as I know, they have always posted the PDF format for free. ISO’s business model is different, they pay the bills by selling the documents.

For generating PDF, assuming you aren’t some sort of sociopath that wants to embed JavaScript or some custom plugin, then you can just drop back to 1.3 or so and deal with a simpler spec. Use the parts you need, ignore the rest.

Some time around 2000 I wrote a PDF generator to do my typesetting, so that was 1.2 or 1.3. Very straightforward format.


> Very straightforward format.

The spec is 756 pages. For the 2008 spec. How is that very straightforward? My god.

Even the 2003 version clocks in at 696 pages and apparently includes "Interactive Forms", "Movies" and "Sounds".

Here's the link https://web.archive.org/web/20101214132912/http://partners.a...

How is that _straightforward_?


If you are generating documents you ignore all the stuff you don’t need. You are left with a sane dictionary based format with some optional compression. Ignore the compression until you feel the need to optimize. There is a bit of a complicated bit for random access where you need to remember and regurgitate the offset of various dictionaries. Beyond that it is just a bunch of drawing commands. Of course these are 20 year old memories and maybe the horror of part of it has burnt that from my memory.
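On the "optional compression" point, here is a sketch of how small the difference is on the writing side. It assumes you already have the raw stream bytes. The function name `stream_object` is made up, and the result is only the object body, without the surrounding "N 0 obj" and "endobj" lines.

```python
import zlib


def stream_object(data, compress=False):
    """Return the dictionary-plus-stream body for `data`, optionally Flate-compressed."""
    if compress:
        # FlateDecode expects zlib-wrapped deflate data, which zlib.compress produces.
        data = zlib.compress(data)
        header = b"<< /Length %d /Filter /FlateDecode >>" % len(data)
    else:
        header = b"<< /Length %d >>" % len(data)
    # /Length counts only the bytes between "stream\n" and "\nendstream".
    return header + b"\nstream\n" + data + b"\nendstream"
```

Starting with `compress=False` keeps the output readable in a text editor, and flipping the flag later does not change anything else in the file structure apart from the byte offsets.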

I do agree that writing a reader would be brutal.


Thank you. I'll take a look at the older specs. I assumed that the 2.0 version would be the one with the best print quality and I should target that.


No, in general there is no difference in print quality between any PDF versions. I can think of only two exceptions:

· If you're embedding ICC color profiles, you need to be using at least PDF version 1.3, which came out in 01999.

· If you're embedding lossily-compressed raster graphics like JPEGs, while you can always improve the print quality further by switching to losslessly-compressed graphics like PNGs, you may be able to get better quality at a given filesize by using better lossy compression algorithms like JPEG-2000 or (for bilevel images) JBIG2. JPEG-2000 support was added in PDF 1.5 and is excluded from, I think, PDF/A. JBIG2 support was added in PDF 1.4, and also includes a lossless format.

I don't know of any features in 2.0 that would improve print quality over 1.5 in any way.
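If it helps, the version requirements above fit in a tiny lookup that picks the lowest header version a file needs. The feature keys, the function name, and the default floor of 1.3 are my own choices for illustration; the version numbers are the ones listed above.

```python
# Minimum PDF version for each optional feature mentioned above.
MIN_VERSION = {
    "icc_profiles": (1, 3),
    "jbig2": (1, 4),
    "jpeg2000": (1, 5),
}


def lowest_header_version(features, floor=(1, 3)):
    """Return the smallest 'x.y' version string that covers all `features`."""
    needed = max([floor] + [MIN_VERSION[name] for name in features])
    return "%d.%d" % needed
```

For example, a document with ICC profiles and JPEG-2000 images would need a 1.5 header, while plain text and JPEGs can stay at the floor.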


It’s probably worth spending a few minutes looking into PDF/A, which is a subset of PDF designed for archival. While it might not be a perfect fit for your needs (though it almost certainly is), being ISO standardised it might be a good source of documentation.

https://en.m.wikipedia.org/wiki/PDF/A


Note, however, that this is not the current version of the standard. The current standard requires paying money to ISO (who are the big baddies in this story and not Adobe—they have all their standards locked behind a paywall). That said, this version is probably adequate for most needs.


I had to recreate some PDFs at work that were created by "iText by Lowagie" which must have been a Java library at the time.

I redid it with the FPDF library for PHP, and it worked out fine. I tried some new features of TCPDF, and it wasn't much work to convert.

Using Inkscape to make an EPS out of an SVG was also challenging.

I know that PostScript and PDF are based on a Forth stack machine, if I really had to get that low.

http://fpdf.org/

https://tcpdf.org/

https://wiki.c2.com/?ForthPostscriptRelationship


PostScript is based on a stack machine, but a different one from Forth. PDF is not, although it sure looks like it. But no, it's just using syntax from PostScript, but without all the dynamicity.


I did this once. Maybe my small journal will be useful.

https://github.com/jchv/resume/blob/master/journal.md


Thank you. It was helpful!


To everyone thinking about writing their own code to generate PDF, I'm begging you, please either implement tagged PDF support for accessibility, and test it with Adobe Reader and a screen reader, or consider using an existing PDF generator that supports tagged PDF, such as LibreOffice, iText, or a recent version of Chromium. The web already has enough untagged, inaccessible PDFs to provide no shortage of work for multiple document remediation businesses, including my own. But I'm an accessibility advocate first, and as the saying goes, an ounce of prevention is worth a pound of cure.


From the title, I thought you meant inaccessible as in providing little to no affordances for users of assistive technology (you know, accessibility, a11y… alt text, semantic markup, that sort of thing)


Ah, sorry. That's not what I meant -- in hindsight it does feel a bit clickbaity. But it doesn't look like there's any way for me to edit the title on HN.


Oh, no worries. I didn’t think of it as clickbait!


This can be done for free using bash and ghostscript, or even using the free tools from ImageMagick (like convert). But there are decent third-party proprietary solutions, such as the Enfocus PitStop Pro software suite. Don't use Adobe products, please.


Wouldn't that be nice.

Not directly answering your question, but I suppose the solution is to just pick the closest thing and convert. HTML&CSS being the most full-featured/generic. Markdown simplest for basic 'word processing'. Latex good for more advanced such cases. Images good for others. Maybe ePub would suit your 'typographical' needs (I think it's a lot more open than PDF, and itself HTML based)?


> Not directly answering your question, but I suppose the solution is to just pick the closest thing and convert. HTML&CSS being the most full-featured/generic.

PDFs specify fixed format. You need some out-of-band info to generate PDFs from HTML.

> Maybe ePub would suit your 'typographical' needs (I think it's a lot more open than PDF, and itself HTML based)?

EPub is basically a zip with HTML+CSS+some metadata. Newer versions were based on HTML5 while older ones were based on XHTML.


The simplest way to generate a PDF is to generate an XSL-FO document, which is good old XML, and then convert it to PDF using one of the processors, e.g. Apache FOP.


Seconded. Don't waste life wrangling raw PDF.


Years ago I wrote a PDF generator for Passepartout (http://www.stacken.kth.se/project/pptout/) from reading the PDF book. I thought it was a well designed format. It is a binary format though, for the sake of efficiency.


I am amazed how most of the answers suggest reading the specs. Why overcomplicate things when you just need to generate a PDF? The simplest way is to generate HTML files, then use a headless browser to convert them to PDF. Simple!


There's an open standard version of pdf called PDF/A, which Libre Office can write and read.

https://en.wikipedia.org/wiki/PDF%2FA?


Because the spec is terrible and full of corner cases.


I can highly recommend pagedjs.org and CSS paged media. This is used by asciidoctor pdf JS and it is an absolute dream to work with.


I know there is a perl module that can write low level parts of pdf but it only supports 1.4 or 1.5 version


I think it's the same as with MS Office: they make a format, and they are the only ones that make the software to use the files in that format. It's funny, because Adobe Reader is very bad, and they ended up "giving it up"; now ISO handles the spec. As for generating PDFs, maybe check LibreOffice, or some other software that creates PDFs with the source available?


MS Office switched to XML tentatively in 2003 and fully in 2007. The latter formats were standardized (Ecma, ISO) into Office Open XML, but even the previous versions were accessible. I don’t remember whether the spec was publicly available, but you could study the output of some valid documents and figure out what went where.


Why is there no native PDF API on Windows? macOS has PDFKit. Win32 has great foundations for a good PDF library: Direct2D/DirectWrite. But it's so inconvenient to do PDF programming on Windows.


ask Leonard R. -- it stretches back to the pre-Internet "multimedia" competition..


TeX? Troff? PostScript?


I think PDF is no good; I think it is too messy. I think that better formats can be possible, such as maybe PCL, and I have some of my own ideas of making a better format, too.

However, PDF is a commonly used format.

When I wanted to generate PDF (or other formats such as PNG), I just wrote a PostScript program to do it (and then ran it through Ghostscript). (Drivers could also be added to make other output formats too if wanted.)


The fact that you have to pay to see the standard should tell you everything you need to know.


that it's an ISO standard and everyone just implements the most recent draft?


What's even the point of having standards if only rich people can read them?



