Show HN: PDFs from HTML (math.dev)
742 points by abhinav22 on April 4, 2021 | 265 comments



There is something really pleasing about reading PDFs. It's perhaps how static it is: it won't change on me. I can zoom in or "operate" on it without triggering some reorganization. It puts the mind at ease. There is no reflowing. There are no columns shifting. It just is. Like a piece of paper as an analog - the intent of the author and the designer is retained and frozen in time. Fonts are embedded and chosen by the creator. Haters of PDFs do not understand the human aspects of it - they just see it as a specification (which is convoluted).


Isn’t zooming in and having text reflow a feature, not a bug, of HTML? PDFs are pretty much impossible to read on a phone because of the endless amount of zooming in and out and horizontal scrolling (unless they were designed for mobile — and then they’re hard to read on a desktop). Never mind users on a desktop who just like their text large for ease of reading — their screen might not be wide enough to fit the text without horizontally scrolling.

As an author, my intent is that the content be easily readable to all readers. I don’t see why I should want or get to dictate the layout and aesthetics to my readers.


I think there is a spectrum of commodity vs. artistic mediums in all forms, and we often talk past each other when debating the finer points. If your goal is to send out a press release to the public, perhaps layout/aesthetics isn't as important (sometimes it is, though, so it's not a hard-and-fast rule). In artistic media, especially in magazines and mixed-media books, layout and aesthetics are an integral part of print media. It is inseparable. Just as in music you don't want an equalizer ruining the original intent of the artist, books created by artists in 1890 are still with us in print format - exactly how they were intended to be published to readers. But it is entirely different if the "music" is a podcast - I want to use an equalizer to bring up the higher frequencies for better audibility. Similarly, if I am reading a novel on an epaper display and want to increase the font size or typeface, we should allow that, as you said.


I agree. What annoys me is when not-very-artistic mediums like scientific articles force a fixed page layout. It gets even worse when you have to hunt down the relevant figures over the following pages because they couldn't be put on the same page due to lack of space. Also, opening a figure in a separate window isn't much of a thing for PDFs either.


I definitely prefer to read research papers in html. I like to zoom in a lot when reading a long piece on my computer since it helps me read faster and keeps me from getting distracted. I've been thinking about working on a side project where I convert pdfs to html for academic papers.


One benefit of PDFs for research papers is that you can easily save them to your own computer, build up a library of them, highlight lines with functionality built into most PDF readers. I generally prefer HTML for reading, but PDF has some benefits, too. Granted, most of these features are also available for HTML. But for some reason you need to look for browser plugins in order to highlight HTML pages, whereas in PDF you can just use the feature. And PDF is always about the content whereas HTML also typically contains navigation and other distractors.


Regarding a library of HTML documents: https://github.com/gildas-lormeau/SingleFile#install


I have installed an extension named "single file" in my browser, which allows me to save any webpage, as it looks right now in my browser, as a single HTML file. Images and CSS are inlined; javascript (I think) is removed. Quite handy when you prefer a folder- and file-based workflow.


What about epub? Under the hood it is basically html, but viewers know to treat it as a written work.


Epub is ok, but it has no support for math equations (practically all implementations just dump them into raster images) and HTML's typography leaves much to be desired.

There are plenty of good reasons why TeX and LaTeX are still the workhorse of scientific publishing in spite of the emphasis on fixed format layouts.


It's much more restricted than HTML, with even less support for animations than PDF.


I really don't get why mhtml has been discontinued by browsers??


It hasn't. Chrome can still save .mhtml just fine, and will also open them if you copy/paste the path into the URL bar.


What about Firefox?


It never had support for mhtml, only through an extension that's no longer supported.


Hmm, I might be confusing it with Opera?


Somewhat related: I find the Snappy Snippet extension (for Chrome) very interesting. It's supposed to let you make a "live" screenshot of a DOM element. Unfortunately, I've not tried it much, as I only rely on Firefox in day-to-day browsing.

https://github.com/kdzwinel/SnappySnippet


I currently have Adobe's Creative Cloud All Apps, which includes Adobe Acrobat. I've bought tons of books on Java, Javascript, HTML, C++, etc.


Ugh, I really hate it when I stumble upon a research paper only available in HTML and not in a standard two-column PDF - I find it much harder to read in general.


> Just as in music, you don't want to add an equalizer ruining the original intent of the artist

You might not. Does that mean that music should only be distributed in proprietary formats designed to prevent anyone from plugging in an equalizer?

What about the naturally different frequency response curves of different speakers?

What about room acoustics?


What you're describing is enjoying the fine details. However, you can do that only if the details in different parts of the page are not too correlated, so you can zoom into one part of the page without it really depending on some other details somewhere else. In some situations, like comic books, having to constantly zoom in and out totally kills the artistic impression that you'd have with an actual two-page book opened in front of you. Only a bigger screen can really help with that.


Are you open to the argument that PDFs are not trying to solve the issue that you’re describing?


Depends on what the question is. If we're debating pros and cons of document formats, and not solving a particular problem is a con, then not trying to solve that problem doesn't make it a non-con.


What's a feature in one context can be a bug in another. I get where OP is coming from: when I zoom in and the text reflows, it's easy to lose my place. PDFs don't have that problem.

Also, digital, free-flow media lose basically all sense of space. PDFs are much better for finding a piece of content again later, because I can remember the location on the page and roughly how many pages into the document.


Usually, if you have reflow, you can disable it. However, if you don't have reflow, you cannot usually enable it!


Many PDF readers / eBook readers now at least try to reflow PDFs. This often works, though it often doesn't.

Though to be fair, reflowing HTML / displaying it without all the added-on cruft also often fails these days. Tools such as Reader Mode make heroic efforts, but also very frequently fail (or are blocked, or have sites blacklisted by the tools).


Opera does better, or at least it used to.


A5 PDFs are really nice on mobile and especially tablets. If both A4 and A5 PDF renders were available more often, that would be great!


I wonder why browser bookmarks don't save the position.


The answer is likely obvious in that it depends on what content is visible at the time of the bookmark, which will further depend on the content itself (it can change, since this is a bookmark on a live web page), page styles, zoom level/scaling, window size, etc.

Basically, for a bookmark to fully store a position, it would have to store all of the above (and probably more), and it would only be really usable on the same device as long as the underlying content does not change.


The underlying content of most pages that I bookmark does not often change. And storage is so cheap now that I would even be happy to have my browser save the rendered page whenever I save a bookmark so that when I ask to see the bookmarked page I get exactly what I was looking at. If I want to see what the page at that address looks like now I refresh the page.

And I am almost always on the same device.


Browser bookmarks save position when there is a position to be saved, which is represented in the form of IDs being pointed to by links with a fragment.

Any other scenario doesn't even ensure that the content will continue to exist, let alone have the same structure.
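
For example (a minimal sketch with a hypothetical URL): give an element an ID, and a link whose fragment points at that ID will scroll the browser to it - which is the only position a plain bookmark can reliably capture:

    <!-- somewhere in the document -->
    <h2 id="results">Results</h2>

    <!-- a bookmarkable link straight to that position -->
    <a href="https://example.com/paper.html#results">Jump to Results</a>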


> when I zoom in and the text reflows

Zooming in/out should not trigger a reflow. Only things like changing the geometry of a page or font settings require a reflow.

It seems you're blaming a document format for a UX problem created by an implementation.


For many people yes, if you're dealing with straight-up text.

But so much content has images, diagrams, footnotes, sidebars, meaningful indentation, and so forth, that text reflow often mangles, scrambles, or relegates it to the end of the chapter/document. Not to mention that reading on a phone sometimes freezes the zoom, so you can't even zoom into images when necessary.

When I read a published PDF I'm usually getting a presentation that was carefully thought out for legibility, scale, etc. The locations of images, footnotes, sidebars, etc. all make sense.

And I find that reading PDF's on a phone is actually no problem at all, even on my small iPhone SE 2. Just hold your phone in landscape and zoom so the width of the phone is the width of the text column. Generally it works perfectly well.

So as a reader, when an author/publisher takes responsibility for well-organized and legible layout and aesthetics, I appreciate it greatly.


It is a feature if it works, but whenever the text fails to reflow correctly, or an attempt to zoom into a figure causes the entire page to reorganise itself and (worst case) replace the figure, it is really annoying. The fact that reader mode is often the best way to read websites on mobile suggests that either HTML has got this feature badly wrong or has provided features that are being very badly used. In comparison, the fairly common two-column format of scientific papers is quite readable on my phone after zooming into a column.


I think that reflowing text is perfectly fine for reading fiction which lacks graphs and meaningful formatting beyond section titles.

For non-fiction I find reflowed epubs sometimes inferior to a PDF, perhaps more aesthetically than in terms of actual usability, which is harder to quantify. Below a certain size this has exactly the defects you describe; however, I find that on a fairly large, wide screen in landscape orientation it is quite readable.

For example, a screen 6in wide lends itself well to reading without zooming. This is largish for phones or smallish for tablets.

Regarding dictating layout and aesthetics: for practical purposes, most of your users aren't actually dictating much of anything beyond screen size, platform, and zoom level. Just because other settings exist doesn't mean most people use them.

For practical purposes there are small screens where text must be heavily reflowed, because not much fits on the screen, and screens big enough to show a whole document, depending on font size. For most things you want to support the first use case if any portion of your users are going to be on phones, which is nearly always true.

This doesn't mean that there isn't a case for designing a non-mobile version of content, especially if it's mostly consumed outside a limited and limiting screen, or benefits from such.


It's because we all have different expectations and use-cases. A PDF is great if you need to ensure the formatting looks the same on all platforms, while HTML and its derivatives (i.e. ePub) are useful when you have a myriad of display sizes where you want the text to reflow and maximize the screen usage and readability (font size, etc.) without caring too much about the layout.


It would not be hard to read on desktop: you can simply show multiple pages, like in a book where you usually see two pages. The same concept works if you have room for more pages. It's just not something people do (create PDFs for mobile). Almost all PDFs are meant to be read at approximately the size of DIN A4 per page. In a time when everyone is, and should be, discouraged from printing stuff, this is not really needed.


For Android, EBookDroid in landscape with auto-crop pages makes >99% of books readable on any of my phones (with a >= 6" screen).


For anyone trying to read pdfs on phone: landscape mode and vertical scrolling.


It should be noted that the usefulness of landscape mode is limited in some docs, such as those with multiple columns, or docs which include elements like tables and graphics, especially if you need to look at them while following the text.


The necessity of zooming is a shortcoming of the device, not of the text format.


Zooming seems like an inevitable consequence of how screens and eyes work, what am I missing?

Forget text for a second, if I want to see fine details in an enormous image I'm going to have to zoom in. I normally adjust font size rather than zooming text but it's nice to have both available.


> The necessity of zooming is a shortcoming of the device, not of the text format.

Zooming is a very fundamental use case, linked to the need to analyse some parts of a document in more detail (i.e., look at a graph, a section of a table, etc.).

Moreover, accessibility is important. There are plenty of good reasons why even Apple provides magnifying glass apps integrated into the OS, with their own system-wide dedicated keyboard shortcuts.


The great thing about PDFs is that you can open huge documents and page through them quickly. I feel this is under-appreciated in today's world, where scrolling is being forced on us everywhere. Scrolling sucks for reading text. Every time you scroll, you have to pay attention to how much you scrolled, and then find your place in the text again.

As for paging speed, just try using GoodReader or PDF Expert on an iPad. I can flip through thousand-page manuals and datasheets as quickly as if it were a paper book. And a 12" iPad shows an entire A4 page without the need for zooming and panning.

In my experience, people who dislike reading PDFs have only tried doing so in Acrobat Reader (which is hot garbage, and slow), on a small screen that is wider than it is tall, zoomed in so that only half a page is being shown. That is a sub-par experience indeed.


> I feel this is under-appreciated in today's world, where scrolling is being forced on us everywhere. Scrolling sucks for reading text. Every time you scroll, you have to pay attention to how much you scrolled, and then find your place in the text again.

This is incredibly important, and something that dedicated book readers like Kindles get right, but I've never seen done well in long web pages. Discrete "pages" (that correspond to "screens") make it much easier to find your place as you go to the next page. Note that multipart web pages often have you scroll through each "page" separately, and give you the worst of both worlds. Sure, PDF isn't always best for reading on a computer or phone screen, but infinite scrolling is annoying too.


Anchor tags have existed in HTML for thousands of years.


1992 to be precise! I actually made a minimal SSG based on them: https://portable.fyi/#2020-11-27-this-blog-is-a-single-html-...

Printing this page to PDF outputs the whole website, one post per page.


Be it pdfs or html, I find my place through chapters, rather than pages.


Haven't you ever been forced to stop reading in the middle of a chapter? It happens to me all the time.


Depends what you mean.

Temporary interruptions yes. But then the location is kept.

Interruptions for a very long time? I might have to reread the whole chapter anyway...


Maybe it's simply a force of habit at this point, but I can only read/study/memorize technical stuff from a PDF/paginated source - I memorize the overall "picture" of the page and its location along with the bits I actually need, and it's really tough to do with non-paginated sources.


Yes! I totally get this. I think it's akin to the "mind palace" method popularized in media.


That's an apt observation, actually. I used to practice mind palace when I had to memorize lists, and I do feel more comfortable when information is "physically" placed somewhere like a page, I guess it's all connected.


I wonder if source code could be read and understood more effectively if it were paginated.


Source code often does facilitate its own method of paginating, into different files. One could argue that is the whole point of the practice.

Pagination or not, both pagination and files provide some degree of spatial sense just as 'loci' and memory palaces.

Edit addendum:

There are theories about which senses are our dominant ones, and how they affect our learning processes. Some may lean towards visual cues in their mental life, others on kinetic or sound. Personally, I experience my mental models as spatial. Even abstract thoughts become situated "somewhere", if not by themselves, then by contrast with other things on my mind.

"Everything is a Memory palace."

Needless to say, when I'm deep off in a terminal with something, I don't think I'd describe it as text-based.


Actually, the whole idea that different people learn with different senses has been disproved in neuroscience. It's a myth that persists stubbornly in education. There is a very good article about this: https://www.nature.com/articles/nrn3817


Source code has a bunch of other properties that make pagination less useful.

E.g., we strive for short functions, we use indentation heavily, it is commonly rendered in fixed-width fonts (this helps with spatial memory/overview too), etc.


Good points. Also, code can change frequently, so the "visual memory" reinforced by pagination becomes less useful, maybe even a hindrance.


I'm happy that someone came Forth with that idea here.


I think PDFs are also less distracting since they tend to have very minimal navigation elements, very few advertisements if any, rarely ever have animation or video content, etc.


Sounds like scrolling is actually a positive then, by forcing you to focus on understanding the content rather than relying on rote memorization.


Spatial memory is not rote memorisation.

https://en.wikipedia.org/wiki/Method_of_loci


Not sure what you're trying to say, that article pretty much only talks about its applicability to (inherently rote) memorization contests. It doesn't say _anything_ about meaningfully integrating the information as knowledge.

Which also seems to align with the article's description of how it works: you're not trying to figure out the underlying structure of what's going on, you're making up a new structure as you go based on the surface-level patterns.


False.

> The method of loci (loci being Latin for "places") is a strategy of memory enhancement which uses visualizations of familiar spatial environments in order to enhance the recall of information.

Yates's work, cited, expands on this.

The method is useful for both specific itemised memorisation (rote) and more general (holistic, integrative, syncretic, synoptic, networked, dynamic) recall and understanding.


I don't have much experience with PDF as a spec, but I guess I'm a "PDF hater."

It's the things that didn't need to be PDFs, but inexplicably are, that annoy me. Like data dumps from local governments that could have been machine-readable, or announcements that are distributed in print and emailed as PDFs, rather than lifting the content into the message body.


You're right - it's just the wrong format and it isn't intended for that. Gov should be publishing text/csv/parsable-formats not PDFs when it comes to data dumps.


Agreed, but there exist solutions [0] to make even PDF tables machine-readable (optionally making use of machine learning techniques). It's incredibly backwards and much harder than, say, CSV, but it might get the machine-reading job done.

0: https://camelot-py.readthedocs.io/en/master/
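
To sketch what that looks like with Camelot (Python; file name hypothetical, and the exact options are worth checking against the docs linked above):

    import camelot  # pip install "camelot-py[cv]"

    # parse tables found on the first three pages of the PDF
    tables = camelot.read_pdf("budget.pdf", pages="1-3")

    print(tables[0].df)                   # first table as a pandas DataFrame
    tables.export("budget.csv", f="csv")  # write all detected tables to CSV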


> data dumps from local governments that could have been machine-readable

It's annoying, but if they were produced from a database (as opposed to scans), they're still usually machine-readable by converting the PDF to text, and then running a few regexes as needed to convert to something like CSV, if it's tabular in the first place.

In theory the text could be gibberish because of font subsetting that intentionally scrambles the glyphs, but that's rare and generally only implemented when a publisher is intentionally trying to thwart text extraction and/or font extraction, which I wouldn't expect a local government to either intend or to enable accidentally.
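
As a rough sketch of that pipeline (file name hypothetical, and the regex will need tuning per document):

    # -layout preserves the column layout; "-" writes to stdout
    pdftotext -layout minutes.pdf - |
        sed -E 's/ {2,}/,/g' > minutes.csv   # collapse column gaps into commas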


Our city council posts PDF files which consist of scans of documents someone printed out. Sometimes they turn on OCR, but not often.


Sometimes that's on purpose.

I know of a company that was required to send HR data to a union (time clockings over a period of time). They didn't like it. They just printed a badly-organized spreadsheet to a pdf. There, they sent the data, and it was unusable.


> data dumps from local governments that could have been machine-readable, or announcements that are distributed in print and emailed as PDFs, rather than lifting the content into the message body.

You should thank PDF for giving you any useful electronic copies at all.

If it's scanned-in papers, sticking them loosely in an e-mail or web page would be much more difficult to read through.

If it's text data, then perhaps it was primarily composed to be printed, and PDF allows easy creation of readable electronic copies with minimum of effort from any input. Before PDF you might have gotten nothing at all, because most people don't have readers for various obscure proprietary input formats.

And PDF is far easier than other formats to convert into another format for your own consumption. Do you have a command-line tool which will extract the embedded images out of a Microsoft Word document? Or one that will convert it to plain text, preserving formatting? pdfimages and pdftotext -layout are very widely available.
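
For example, with the poppler utilities mentioned above (file names hypothetical):

    pdfimages -all report.pdf figs/img       # embedded images in their native formats (figs/ must exist)
    pdftotext -layout report.pdf report.txt  # plain text, preserving the layout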


> You should thank PDF for giving you any useful electronic copies at all.

I think the point is that data dumps in PDF format are not useful at all.

I object to your statement that it's easier to convert. The only reason there are so many tools to do so is that it's so hard/impossible in the first place.


> The only reason there are so many tools to do so is because it’s so hard

What exactly do you base that on? Have you written any PDF or postscript utilities?

Images are easily located in rather discrete chunks, and they are conveniently stored in standard formats like JPEG. Preserving the layout of text output takes a bit of work, but otherwise extracting text and images is just about a necessary early step in writing any PDF viewer. And I do believe even very early PDF viewers allowed arbitrary copy/paste of text.


I’ve attempted, and given up on, writing and reading PDF files more often than I can count.

Conversely, I’ve only ever tried to write a (new) word file once, since it all worked right away.


Also, PDF has very bad support for animation formats.


That's a good thing, I don't want to see animations in a PDF.


And I would like to have an actual portable document format for electronic documents, but since mhtml support was for some reason dropped except in e-mail clients (as .eml), I'm stuck with this frankenformat that is PDF!


ePub (based on HTML as it happens) seems to come close, though even it has its warts.


mhtml support is there in both Firefox and Chrome. I didn't check other browsers.


How can you save a web page as a single mhtml file in Firefox?


Not mhtml, but you can download a page as a single HTML file with the TagSpaces Web Clipper extension. It converts images to an inline format.

https://addons.mozilla.org/es/firefox/addon/tagspaces/


Interesting, I was using SingleFile myself, will have to look into the differences...


Single File works better IMHO.


I have the opposite experience. To me, opening a PDF puts me on edge. My computer is likely to slow to a crawl. It never renders correctly - sometimes it will slowly render each page, one at a time, flashing the screen as it goes. Or maybe the PDF is using some features that my reader doesn't support, so it renders incompletely and incorrectly.

Links are often hard to pick out. What is a link and what isn't? What happens when I click on something, is it going to stay in the PDF or open a browser or something?

Don't get me started on moving around in PDFs. There are always two sets of page numbers, one for the PDF and one for the document. Extremely confusing.

Searching. Ugh, searching a PDF is a nightmare I don't want to even think about right now. Ctrl+F is broken 99% of the time.

Or at least, that's my experience over the last 20 years. Sure, it's gotten better recently, but, not enough to make my mind 'at ease' exactly. Very stressful to open a PDF still, usually.


> Don't get me started on moving around in PDFs. There are always 2 sets of page numbers, one for the PDF and one of the document. Extremely confusing.

It's only a tiny subset of all PDFs in circulation, but the LaTeX PDFs I produce using appropriate settings (mainly KOMAScript class) always nail this. The current page number always corresponds to what is printed in the PDF. This can be alphanumerical (e.g. page "a" / 300, where 300 is the total number of all pages) or roman, for the frontmatter. The PDF viewer will then literally show e.g. "Page XII / 300".

So in that sense, it's in the hands of the party producing the PDF to get this right, not an inherent limitation in the standard.

But now, new problems arise. If you're on the printed page XII but your viewer displays "Page 22 / 300", you know where you are in total. "Page XII / 300" is "correcter" but can be anything.

> Searching. Ugh, searching a PDF is a nightmare I don't want to even think about right now. Ctrl+F is broken 99% of the time.

I don't share this experience. It's at the same level as in browsers, where Ctrl+F is also quite limited (I'd give a kidney to have regex available everywhere - ripgrep-all gets close on the desktop). The only different thing in PDFs is if hyphenation occurs, which is arguably less common in browsers (simply because of poorer typographical standards/people caring more in proper PDFs). Your search term will indeed be invisible to Ctrl+F. The only other time it breaks down in PDFs is if the PDF is corrupt/poorly produced/bad OCR.


PDFs can be produced by mapping completely unrelated characters to glyphs that appear as actual characters (basically, they'd embed a "compacted" font that only has the glyphs required for the document, and then map them to ASCII or something). This was quite common with pdftex documents not using ASCII characters in the past, thus making text unsearchable (and even more so when going the ps2pdf route). For example, you'd have a Cyrillic document in which, when you selected text, the highlighted text would be some jumbled ASCII characters.

The fact is that PDF can display one thing and have the underlying semantic text be something else entirely (frequently used for OCRing: you show actual scanned images of text, and put the invisible OCRed text underneath as searchable).

It works in the other direction too: you could solve the hyphenation problem in the same way by having PDF include invisible non-hyphenated word in place of the hyphenated one for searching.

Still, PDF is mostly a layout format, and while tools have evolved to provide some "meaning" to rendered content, it is never going to be semantic in the sense markup languages can be (i.e. there is no "emphasis", "quote" or "header" command in PDF; instead, it just uses a different font). To put things into perspective, TeX files can be semantic (if a semantic TeX .fmt like LaTeX is used) like HTML/ePub, but PDF is an output format, just like DVI is.


> To me, opening a PDF puts me on edge. My computer is likely to slow to a crawl. It never renders correctly - sometimes it will slowly render each page, one at a time, flashing the screen as it goes.

That probably isn't the fault of the PDF, but the PDF reader you're using.

> Searching. Ugh, searching a PDF is a nightmare I don't want to even think about right now. Ctrl+F is broken 99% of the time.

Now this is actually the fault of PDF and how it does positioning of stuff within it - but about 50% of the blame lies with whichever software generated a shit PDF.


>Now this is actually the fault of PDF and how it does positioning of stuff within it - but about 50% of the blame lies with whichever software generated a shit PDF.

Arguably, it's still the fault of the horribly overcomplicated PDF spec - HTML manages to do it just fine, with a plain text format, to boot.


Well, it's an interesting competition, but I believe HTML (+CSS, to be fair) would win the complexity-of-the-spec prize hands down.


Show me the HTML document where I can search through a 60-page text. That's actually the thing. HTML is fine if you only read text that would be at most 2 pages in print. But if you read longer documents you want pagination, which destroys the ability to search (unless the author implements a separate search or you use Google).


>Show me the HTML document where I can search through a 60-page text.

epub.

>An EPUB file is an archive that contains, in effect, a website. It includes HTML files, images, CSS style sheets, and other assets.

Search works just fine on ereaders.

The flexibility of html is that you can render it however you want, for whatever viewport or feature. If you want pagination, just render it differently.

>But if you read longer documents you want pagination

For long form works, I do actually prefer having them on my kindle, but that's because I don't want to read long-form text by staring at a screen, and I want a lower line width. PDFs tend to be worst case scenario there, because they often render with the assumption that you're trying to read it on standard letter paper.

Also, we've had "pagination" for text files for decades. It's called "less".


> That probably isn't the fault of the PDF, but the PDF reader you're using.

Completely agree. Try something that isn't Adobe Reader!


The first thing I do when opening a Word document at work is convert it to PDF and reopen it in Sumatra Reader.

It just feels a lot better to me. It opens faster, it opens at the position I left it at, zooming in and out is fast, scrolling is smoother, and even if I wanted to, I couldn't modify it by accident.

It just feels a lot more reified than something that is responsive or editable.

Not great for mobile, but that's not what I care for at work.


I mean, I dislike PDFs for the exact reasons you cite: they don't reflow, and fonts are embedded and chosen by the creator.

Occasionally these choices can be good, but often I want to adjust the text size to make reading more comfortable, which is easy with HTML or an EPUB; with PDF I can only zoom so much before I must pan to actually read the entire line. Similarly, I think the creator's font choice is often the wrong choice. It's very common for me to change fonts on an EPUB, but I can't do that for a PDF, which is frustrating.


It is funny this is being upvoted as the top comment on HN; last time something similar was said, HN was piling on about how awful PDF is and how HTML is going to take over the world of publishing, along with the coming death of PDF.

>Haters of PDFs do not understand the human aspects of it

Unfortunately, UX is not something the HN audience, or tech in general, is good at. (Apart from Apple.)


I know! The first thing I do with an .epub or .mobi is convert it to PDF. That way, I can read it and feel my "place", highlight passages that get saved in the file (not in my reader application), annotate in margins, circle big sections, etc. I feel like the annotation abilities of PDFs get ignored, but they're HUGE.


Same here. It helps to have a solid PDF viewer (I use Okular), but what's also really underappreciated is the vastly better typography in PDFs. I suppose it depends partly on the viewer, but epubs tend to be hot garbage to look at. Fortunately ebook-convert (from the Calibre project) generates great looking PDFs and lets you set a bunch of options like fonts and so on.

This is especially important if I'm on mobile, because I can create an easily readable PDF in portrait mode that works great with the Android default PDF reader, whereas the obvious choice to open epubs (Google Books) is terrible: it's slow and battery heavy, and requires you to upload the epub so that it can be converted into the native format. (Once the conversion process choked and I somehow ended up with a 1 GB file.)


> The first thing I do with an .epub or .mobi is convert it to PDF.

b r u h


"It’s perhaps how static it is and it won’t change on me."

You might be surprised that's not always true.

https://www.pdfscripting.com/public/FreeStuff/PDFSamples/Jav...


The typesetting is better because PDF documents don't have to flow and resize across screens. This means a post-processing step can tweak hyphenation, letter spacing, etc. to give the whole document a pleasing look. On the web it's a lot more difficult, but not impossible, to do the same and work on every possible size screen. There's also growing work to make typesetting-style hyphenation and such work in browsers in the near future: https://developer.mozilla.org/en-US/docs/Web/CSS/hyphens
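
As a small illustration of the CSS involved (assuming the browser support described on that MDN page, and a document whose language is declared):

    /* justify and let the browser hyphenate; requires e.g. <html lang="en"> */
    p {
      text-align: justify;
      hyphens: auto;
    }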


I've been obsessed with dynamicity and having everything adaptable... but yeah, static stuff has this weird feeling of 'right'.


As I discovered by accident, fonts are not automatically embedded in PDFs.

PDFs are great when you want to read them on a large enough screen. They are not great on a Kindle or a phone.

I wish there was a way to have several different layouts in one PDF file, so you could have the same content but with different layouts and then your device could select the most appropriate one.


Do you know if it would be possible to get the same font(s) used on the webpage also used in the PDF?


I don't see why not; the other way around is more difficult, as a PDF might only have the part of the font it uses embedded.


Furthermore, even on mobile devices, everything just works. Ctrl+F is non-invasive, and if I close the document in the middle of 100s of pages and reopen it, I do not need to scroll; unlike many web apps, it preserves the scroll location. It does what it needs to do without any hiccups.


PDFs are designed to be printed and HTML tends to be overrun by ads. The former is nicer for reading and the latter is nicer still with ad-blocking, reader mode and/or a tasteful design.


Also, they can be justified in ways browsers can't (yet) manage. Every web page has a ragged right.


I have been trying to place the justification format I have seen in some print books. In most cases it will add spaces between words to make up for lines with words that can't be split easily, but it also doesn't fully right-justify the last letter of the line. It seems to be something like "add inter-word spacing unless the spacing exceeds X".


The gold standard for computer text justification is the TeX algorithm: https://en.wikipedia.org/wiki/TeX#Hyphenation_and_justificat...


Like this (the demo at the top is interactive)?

https://developer.mozilla.org/en-US/docs/Web/CSS/text-align


No, like ‘text-justify: inter-character’: https://developer.mozilla.org/en-US/docs/Web/CSS/text-justif...


Interesting, there’s absolutely no difference between justify and align left on iOS Safari. I wouldn’t have guessed that’s an unsupported feature. I guess it’s not common enough for me to notice throughout the web.


Justify works in iOS Safari (I just checked). I like to use it in my designs sometimes :)


Works fine for me.


Copy/pasting can be infuriating though. The web is catching up fast on that front, but at least there I can view source/inspect.


Wouldn't an SVG be better, then?


I love "printing" webpages to PDF files. I've been doing this for more than 15 years. I delete most of the images first so that I end up with files of 50-500KB. I then store said file in appropriate labeled directories.

Now 15 years later I have a private stash of websites and wikipedia articles that I can consult by simply pressing command+spacebar (the files are indexed in MacOS search).

To make a PDF file out of a website I currently use Printfriendly.com, but it wasn't always this way.

Back in the day I loved to use Arc90's Readability (a Firefox extension). I don't know what happened to that extension, but there are plenty of old HN articles about that wonderful plug-in:

A post from 2010 - I probably started using it right after finding this post: https://news.ycombinator.com/item?id=1153343

https://news.ycombinator.com/item?id=3246081

https://news.ycombinator.com/item?id=3243097

My joy knows no bounds!!

I actually ducked for "What happened to arc90.com?" and found as the 7th item in the list this website: https://ejucovy.github.io/readability/

It still hosts a working version!!!

Okay kids, use these settings and thank me later:

* Style: Athelas
* Size: small
* Margin: narrow
* Convert hyperlinks to footnotes

Whenever a page is worthy of saving, press the Readability button, then press Ctrl+P and save to PDF... that's it.


That's a really interesting workflow, thanks for sharing!


There's also Bindery, a JavaScript library for book creation which also leverages the print-to-PDF feature built into modern browsers: https://evanbrooks.info/bindery/

On top of that and the in-browser Markdown renderer Markdeep, I've built a tool for typesetting undergraduate theses: https://github.com/doersino/markdeep-thesis/

And, coincidentally, just a few days ago I've written a blog post about controlling the settings in Chrome's "Print" dialogue with CSS (other browsers don't support many of the relevant features): https://excessivelyadequate.com/posts/print.html


Oh these are great, and I hadn't known of markdeep. Thank you!

FWIW, my toolchain is currently markdown files in folders. I prepend 000-format numbers to file/dir names so they're sorted by ls or tree. Rendering is a bash script that runs pp first, since files include others using !include(), producing a single .md in /tmp. Then the mighty pandoc, of course, to produce a Word doc, which is the basis for all further rendering. HTML and plaintext are generated from that with pandoc. I was using pandoc to produce PDFs, but switched to calling libreoffice headless to generate the PDF from the Word doc, since this seems to match formatting most closely (sketched below).

Sounds fussy, but it's a few lines of bash, fairly reliable and reasonably rapid.

One outstanding issue is tables. The Word doc always requires reformatting tables for column width, flow, etc. I can't seem to get pandoc to carry the styles effectively from a custom reference.docx file. I'm looking at ways to render tables separately, format by hand, then include them into the main doc later.
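
A minimal sketch of the pipeline described above (the file names, and the pp invocation, are assumptions on my part):

    #!/usr/bin/env bash
    set -e
    pp main.md > /tmp/book.md                  # resolve the !include() directives
    pandoc /tmp/book.md -o book.docx \
        --reference-doc=reference.docx         # Word doc is the basis for the rest
    pandoc book.docx -o book.html              # HTML from the Word doc
    pandoc book.docx -t plain -o book.txt      # plaintext likewise
    libreoffice --headless --convert-to pdf book.docx  # closest match to Word formatting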


If you'd like to typeset HTML/CSS documents (with optional PDF-specific CSS blocks to handle page layout, numbering, header/footer, etc.), then I'd highly recommend Prince: https://www.princexml.com/

It's a typesetting-specific layout engine that supports HTML, CSS, JS, and even styled XML if that's your thing. It independently developed support for all the latest standards... it's not free but it's very good. The inventor of CSS, Håkon Wium Lie, is one of the product's developers.

I used it on an app a while back to add a PDF export feature to a web app... couldn't speak more highly of Prince.


Wow, this is incredibly expensive. I don't mean to say it is not worth it.

I have a problem at work that requires highly dynamic content to be generated and output to PDF files. Right now, I am using Excel template documents. I would love to use open technologies to do the same, but I haven't been able to find anything so far that is as flexible and user-friendly. The closest alternative is to use OpenOffice Calc documents.


Try the HTML / CSS / Paged.js workflow with headless chrome - I think it works really well (and open source / free with the exception of Chrome, where you are a bit at the mercy of Google, but they are unlikely to make any breaking changes in my opinion)
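
For instance, a minimal sketch of that workflow with Puppeteer (the URL and output path are placeholders):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // the page itself should pull in paged.js and the print CSS
      await page.goto('https://example.com/report.html', { waitUntil: 'networkidle0' });
      await page.pdf({
        path: 'report.pdf',
        printBackground: true,
        preferCSSPageSize: true, // honor the @page size from the stylesheet
      });
      await browser.close();
    })();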


Prince is so good. I wrote everything I did in grad school with Prince. It’s expensive because it’s aimed at audiences that can afford to pay for good typography.


Uh, I don't know many grad students willing to blow US$495 on software like this when they can just use a combination of Overleaf / Jupyter / RMarkdown.


It's free for non-commercial use. https://www.princexml.com/download/


I just used the free version. If you include a blank first page, you can easily cut the watermark out.


LaTeX? Not user-friendly if you aren't already used to it, but if you are, nobody else will have output as beautiful as yours.


Similar to some others here, I have folders full of PDFs. I recently discovered Recoll, which is great for searching them. Fuzzy search of contents as well as filenames.

https://www.lesbonscomptes.com/recoll/

As well as this, I have a script which finds all .pdf files without a corresponding .txt file, then generates one with pdftotext. Really handy; I can then easily grep -ril or ag -til for contents. One gotcha: the text files have line breaks, meaning matches don't always work.
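
Something along these lines (a sketch; pdftotext writes file.txt next to file.pdf by default):

    find . -name '*.pdf' | while read -r f; do
        [ -f "${f%.pdf}.txt" ] || pdftotext "$f"   # only if no .txt exists yet
    done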


There is also dnGrep for Windows which does much more:

https://dngrep.github.io/


Asciidoctor has a web PDF tool that went alpha a little while ago; it uses the same stack as the OP's thingie.

https://github.com/Mogztter/asciidoctor-web-pdf

The content handoff goes like this: Asciidoc (using defined roles) generates HTML5 (Paged.js polyfills the page areas / pagination stuff), CSS styles stuff, and Puppeteer runs a headless Chromium for the PDF render. It's straight from the CSS GCPM W3C spec, a flavor of CSS Paged Media - drafts that have been percolating since frickin' 2006 but have never seen browser implementation.

The beauty of this is that you use the same CSS for web and PDF deliverables. Actually, the even better beauty is that you are using two dirt-common technology stacks - CSS and Javascript - instead of XSL or Prawn or some ancient bespoke layout language. With Asciidoctor, for complex print requirements you're going to be forced into either 1) DocBook-XSL via fopub or 2) DocBook-LaTeX via dblatex. The native Prawn-based PDF tool isn't capable of a whole lot of customization without extensions. So web-pdf is a real shot in the arm for those of us who aren't real keen on going back into XSL-FO.


Amazing news, thank you for sharing.

Prawn's pure-Ruby implementation of image layout makes it too slow for the graphics-heavy technical manuals I write (though I haven't used asciidoctor-pdf in 12 months). I ended up drafting documents with 10dpi images just to get it to render quickly enough for layout, but even then, adding images turned a 100ms render into a 6000ms render.

Hopefully this problem goes away with a fast web based stack.

I've never had much luck with break-after/before: avoid for <h2>. I hope their CSS or paged.js works for avoiding this common fault.


That's super interesting, I am in the same boat. Do you mind me asking what graphics format(s) you're using?

We're using a sort of hybrid Asciidoc/S1000D approach, Asciidoc markup with S1000D architecture (filenamers, publication modules, data module codes, etc). The art is SVG brought over from CGM, with conditional content (applicability) controlling the images via a new module type we call "illustration control files" that toggle the art based on "applicability" aka asciidoc document attributes and ifdef/ifevals.

PDF is via DocBook-XSL, but it's a scheduled process and not "on-click", which I am positive would break things. I am not even sure how to fire fopub from this company's web architecture; they wouldn't let us post html to a network directory (argh?), so any hopes of doing something more advanced are pretty low. In my off time I am looking at Antora pretty hard, and web-pdf is going to be the default pdf tool for that build platform. One thing I am wondering is how Antora's playbook files are going to relate to the "Ascii1000D" Publication Modules, which overlap a wee bit.


also using pagedjs :) awesome!


Yeah, in the Python world there's WeasyPrint for PDF out in the wild as well. It's quite slick, but it's a harder sell because of Python, which corporate types seem to think is bad hacker central.

https://github.com/Kozea/WeasyPrint


(Somewhat) related self-promotion: I've found that converting from HTML(+YAML) to PDF is one of the best ways to create a resume. It's very easy to come up with a good design, you can separate data (YAML) and presentation (HTML and CSS), and also export different views over the underlying data with simple filters (e.g. when generating a resume to apply to a job, only include the experiences that are relevant)

The code is pretty terrible, but you can see an example (my resume) here:

* Data - https://github.com/aviraldg/aviraldg.github.io/blob/master/_...

* HTML - https://github.com/aviraldg/aviraldg.github.io/blob/master/r...


This is one of the hardest problems we face as a financial research platform. We have a lot of financial data in tables along with line, bar and pie charts. Coming up with something sensible and readable is a bit harder than expected. Ultimately we have a "json to latex" converter we built but it's not great...


May I recommend Prince? https://www.princexml.com/

I created a PDF exporter for a manual test tracking app using this -- render to (pretty simple) HTML, pass to the prince executable, and out comes a beautifully typeset PDF.

Prince has its own rendering engine that is purpose-built for PDF rendering. It's actually very good - a lot of professional books and documents have been typeset using Prince.


Anything wrong with XSL-FO? I know it's not the hottest thing on earth, but it works. Apache FOP is still developed and it's easy to add it to a pipeline.


It's a lot harder to work with than an HTML template and a CSS file.


Maybe Vega and its Figures for Papers can help?

https://vega.github.io/vega-lite/tutorials/figures.html


We have been using PDF2XL[1] for this for years (used to be called CogniView).

It's genuinely unbelievable. If the PDF isn't sufficiently structured, it has OCR that seems to "just work".

You can also automate the extraction and integrate it into your pipeline.

The UI is pretty old and ugly-looking, but it is one of the few apps I've used in the last 10 years that made me feel genuine delight.

1. https://pdf2xl.com


I work in financial reporting, doing very similar things to what you mentioned here. Mind dropping me an email at Ashok.khanna@hotmail.com and we could discuss further? Would love to chat to others facing similar dilemmas :)


I don't understand why so many people think "PDF" automatically means an "A4" or "letter size" document. PDFs are size-agnostic. Storage is cheap and so is processing power. A phone-sized PDF will be 20 MB instead of 1 MB, but so what? It's a small price to pay for quality typesetting. Similarly, PDFs can be sized for the now-typical 16:9 laptop screen.


Because 99 percent* of PDFs are A4-sized.

* From my observation and guess.


Shameless plug - https://github.com/zipreport

OSS Python library to generate PDF reports from HTML, using pagedjs. Uses Jinja templates, supports runtime-generated images, client-side JS, and reports are bundled as a single file.


Anything with paged.js makes me an auto fan :)

This could be very interesting for those with Python workflows, thanks for sharing


For anybody working with pandoc you should try Weasyprint [0].

[0]: https://weasyprint.org/


Weasyprint is great, but be prepared to start hacking its layout engine for more complex generations.

It's nowhere near as mature as PrinceXML.


True, it can sometimes get messy to get decent output. But the dev team behind it is currently working on replacing Cairo with a custom Python solution as the layout engine. This will hopefully solve a lot of headaches down the road.

Sure, PrinceXML is unmatched. Same goes for the price tag. I know professional CAD software which costs less. Really not doable for smaller offices, e.g.

It's astonishing that there is no real, great open source alternative for this. Don't get me wrong, Weasyprint is great, but it has _a lot_ of dependencies and is a nightmare to install on Windows. Works decently on WSL, tho.


You need to try pagedjs. Which is kinda what this thread is all about.


Maybe I am missing something very obvious, but how do I get a PDF out of it in an automated way?

That's what I like about pandoc + weasyprint. I just have a plain ol' markdown document and I receive a nice PDF in an instant. Just like that, super easy.


You can run the command-line version. So in your scenario: pandoc -> HTML, then the pagedjs CLI -> PDF.


Ah yes, thanks. I missed that somehow when I took a quick look at paged.js.

Do you have much experience with paged.js? I wonder what the benefits over weasyprint are.


I'm part of the pagedjs team so my outlook is pretty biased :) Perhaps have a look at the docs on the site https://www.pagedjs.org/


True, and it's amazing what can be done with plain CSS, like `-weasy-hyphens: auto;` or page counters. And for LaTeX-style math there are pandoc filters to generate embedded SVGs on the fly.


Actual content is nice! But the ToC ain't it, chief :)

https://i.imgur.com/SRftzxv.png


Thanks!

Yes, I know what you mean - it doesn't work well in certain browsers, devices, etc. It's aimed at desktop users and Chrome in particular (I saw a bug in the Safari version). The aim is less to be a readable PDF in-browser and more to produce a high-quality PDF after exporting. The in-browser print preview is just a nice side effect (but I might actually reuse this for other projects as I like it quite a bit!).

I think the issue with the TOC is that it's dynamically created; so while I was able to use responsive web design for the rest, it didn't work so well for the TOC. I'll have a look at it though :) I think there may be a way to get it to work.


The Achilles heel of PDFs is that they don't have responsive layouts. It's so bad the Adobe team created an AI to resize layouts - yes, an AI in the cloud, available only in the Adobe app. How insanely bad is your file format that you need an AI to resize layouts in 2021? Anyone who has had to handle layouts programmatically would, I think, agree that PDFs are the most outdated, ass-backwards file format in existence.


PDF is an attempt at non-Turing complete, simpler PostScript (PS). It comes from a time of paged media de facto ruling the world. Changing layouts was never the goal, because PDF was the output format.

In case of academic research papers typeset with LaTeX, the source file is something you'd likely want to consider the semantic equivalent of HTML. TeX should be able to render the same document with different output constraints ("responsive layout"), but because of the architecture (TeX itself is fully Turing complete), it is pretty slow at re-rendering an entire document.

Part of the allure of a static document format like PDF is that you can, in theory, fetch just page 454 of a 6000-page document and render that: with HTML, just like with TeX, you'd have to get and render the entire document to be certain that the layout won't change after you've processed the whole file.


I'm aware of the history - in the same sense that it's not a good idea to use a steam engine in a train anymore.


This is a feature.

Specifying the target output size/dimensions at generation time is a reasonable option - ISO A4 / US Letter, perhaps a target for smaller devices (though a 6"-7"+ tablet should be able to present most reasonably-formatted PDF documents reasonably legibly).

For anything smaller, PDF isn't really well-suited, and your better option is to go with a fluid-layout format such as ePub, .mobi, or, yes, HTML.

Having largely switched to bookreaders (eInk tablets), in large format with ~300 DPI grayscale screens, I strongly prefer fixed-layout formats such as PDF and DJVU to fluid-layout formats, for the spatial/cognitive reasons many others have mentioned in this thread.


Rigid formats sound great, but unfortunately it's not 1985. Almost every document across most major industries is a .pdf and would benefit from multi-size output (print, desktop, mobile). Adobe would not have spent millions on their asinine AI if this wasn't a problem. No one in the "insert any industry sector here" is using secure HTML documents to send files, or any other format.


"Secure documents" is a whole 'nother ball of wax, and has no bearing on this specific question. PDFs themselves address that use-case poorly as well.

My (and others') point is and remains that a spatially-fixed layout does serve a useful purpose for some documents. Including the 140 million or so published books and an even larger count of formatted published articles.

Yes, for short texts, dynamic flow within an HTML webpage is useful. Yes, for very small devices, virtually any format sucks and blows (this is a device problem, not an inherent PDF problem).

I'm not a fan of websites that dump what should be Web-formatted content as PDFs. But I'm also not a fan of the notion that everything should be an HTML document either.

(I've used various online document formats for going on 40 years, from raw ASCII (or EBCDIC) through roff/nroff/troff/groff, HTML, LaTeX, various flavours of Markdown, etc. I've hand-typed out several books simply to have a suitable online digital format of them (I hope this serves to indicate my level of obsessiveness, if not sanity, on this topic). I'm a huge fan of Pandoc and its ability to take a standard markup format and produce a wide range of output endpoints - usually PDF, ePub, HTML, and plain ASCII text, though a few others may be included.)

I'm also a recent convert to large-format eBook readers. And from that experience I can make two specific observations:

1. The behaviour of HTML and web browsers on an eInk device really sucks. Pagination and not triggering scroll actions with the merest suggestion of a hint of breathing on the surface is hugely underappreciated.

2. PDFs (or equivalent paginated documents, e.g., DJVU) offer an excellent reading experience on such devices.

I'm not a huge fan of the PDF file format, mind, it's far too variable and has too many surprises and vulnerabilities. As a reading medium, however, it's quite good, especially when produced with competent tools.


It should be the other way around. PDF ought to die and let documents be parsable.


The section numbers in the table of contents are overwriting the text; maybe not so beautiful.


Thanks for flagging. It's aimed at desktop users and a reasonable size screen (e.g. A4 or so); it's not really a responsive design, as it's meant to be a tool to generate PDFs. So it won't work in some browser dimensions (the paged.js code is reasonably complex).

That said, it works on iOS as far as I can test. For some reason page numbers in the table of contents are not working perfectly in Safari, but Chrome works pretty well.

I guess the conclusion is that this is aimed somewhat to desktop Chrome users as a specific tool for pdf generation.


I'm in chrome on a Samsung note 10+. If you want to add contact details somewhere I'll send you a screenshot and do a test when you're ready.


I'm seeing all page numbers as 0 in the TOC on iPadOS.


Yep - looks like only Chrome gets those right. I will speak to the paged.js guys and see where the bug is in the code.

The rest seems to work on Safari - let me know if there are any other issues and I'll fix/update accordingly.


Does anyone know how this compares to PrinceXML / DocRaptor? We pay them big bucks for PDF generation (invoices specifically) and so far we haven't found anything comparable.


Not sure about paged.js, but we use Weasyprint for invoices. It isn’t as advanced as some of the paid options (most annoyingly page headers are somewhat hacky) but it works very well.

My go-to for everything CSS Paged Media is [0] which has a nice comparison of supported features at [1]. They recently added Weasyprint, PagedJS and Typeset.sh

[0]: https://css.paged.media/

[1]: https://css.paged.media/lessons


A few years ago I started an alternative to PrinceXML called ReLaXed.js [1]. It's always been sufficient for my reports, but it may lack some pagination/layout features that Paged.js has, as they seem to have given this much more thought (still wrapping my head around whether paged.js could be "plugged into" ReLaXed).

[1] https://github.com/RelaxedJS/ReLaXed


Have you tried using Chromium print-to-PDF via an API like Puppeteer or Playwright? Combined with something like paged.js and a decent print stylesheet, you can get pretty good quality output.
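For anyone who wants to try that route, a minimal sketch with Puppeteer (the URL and output path are placeholders):

    // Render a page in headless Chromium and save it as a PDF
    const puppeteer = require('puppeteer');

    (async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        // networkidle0: wait until the page has stopped loading resources
        await page.goto('https://example.com/report.html', { waitUntil: 'networkidle0' });
        await page.pdf({ path: 'report.pdf', format: 'A4', printBackground: true });
        await browser.close();
    })();

Playwright's page.pdf() works much the same way if you prefer it.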


I was a huge supporter of PrinceXML / DocRaptor for precisely the same reason - all the alternatives (that I knew of at the time) were not good.

Paged.js was a revelation (thanks to HN for telling me about it!). It is based on the CSS print specifications, like PrinceXML (as is my understanding - I'm about 95% sure), and to me it's even better because it utilizes all the other front-end technologies directly from your browser - I think there are some use cases where PrinceXML won't be able to achieve the same functionality.

For invoices, I think you should be able to easily switch over. Based on what I can see.


I also pay a lot per month to DocRaptor and they’ve been very reliable to date, but unless I’m rendering charts with JS I feel like I’m really overspending


I built, use, and maintain https://github.com/danielwestendorf/breezy-pdf-lite, which uses Chrome to convert HTML to PDFs as a web service. Maybe someone here will find it useful!


I was looking for something similar recently and ended up using jsPDF for my ResumeToPDF site, and it worked pretty well. The only thing I had some pain with was that I had to include SVG icons inline in the HTML/JavaScript and convert them to base64 to be included in the PDF. Same thing for the custom fonts - I had to include base64 of the font TTFs to be embedded in the PDF:

https://resumetopdf.com

https://github.com/MrRio/jsPDF


Can you point me to something that explains how to do what you did (embedding the base64 of the font TTF)? I'm wondering how I can make a PDF of a webpage use the same font I used on the webpage.

Edit: Never mind. I found the instructions on one of your links: https://github.com/MrRio/jsPDF#use-of-unicode-characters--ut...


In case you are still stuck, here's what I do. If you have questions about this, email me using the "Send Feedback" link on my site. I serve the base64 of each weight and style (normal, normal italic, bold, bold italic) inside a JSON. Here's one example of the JSON I serve for the Open Sans family:

https://resumetopdf.com/fonts/OpenSans.json

And here's me embedding it and making it available to jsPDF. NOTE the font.replaceAll to remove white spaces from the font name when embedding; jsPDF has a bug (or maybe expected behavior) where it can't handle white spaces in the font name:

    // Cache of fetched font data, keyed by font name with spaces stripped
    var fontBase64 = {}

    const doc = new jspdf.jsPDF({
        orientation: 'p',
        unit: 'mm',
        format: currentresume.size,
        putOnlyUsedFonts: true,
    })

    const fontsToEmbed = [currentresume.headingfont, currentresume.bodyfont];

    // This loop runs inside an async function, hence the awaits below
    for (const font of fontsToEmbed) {
        const fontName = font.replaceAll(' ', '')
        // doesExist is a site helper, roughly: value !== undefined && value !== null
        if (!doesExist(fontBase64[fontName])) {
            let response = await fetch(`https://resumetopdf.com/fonts/${fontName}.json`)
            let data = await response.json()
            fontBase64[fontName] = data
        }
        // Register each weight/style (normal, italic, bold, bolditalic) with jsPDF
        Object.keys(fontBase64[fontName]).forEach(style => {
            doc.addFileToVFS(`${fontName}-${style}.ttf`, fontBase64[fontName][style])
            doc.addFont(`${fontName}-${style}.ttf`, fontName, style)
        })
    }
Once the fonts are added to jsPDF, you can later set the font size, family, and style:

    doc.setFontSize(originalFontSizeInPt)
    doc.setFont(fontFamily, fontStyle)
Also, the JSON with the base64 of each font family is served over Cloudflare with an infinite cache - this prevents any costs on my end and also speeds up the experience when users return to the page.


PDFs are great and I now publish my website exclusively in PDF/A, not HTML. I have many reasons, but top of the list is that PDFs put the user in control, whereas HTML is now firmly the agent of the publisher.


What do you mean, publish your website in PDF/A? Is your frontpage a PDF document linking to other PDFs?

Ha! Just checked your account, and indeed that's what you do... I'm not sure I like acknowledging what this says about HTML's failure, but your pages look good and there's no JavaScript nonsense in the background.


lol, but not my humour.


This is nice, but it's not a valid Show HN. Please read the rules: https://news.ycombinator.com/showhn.html


Hi dang, hope you are well. May I kindly ask why not? I spent two weeks writing the CSS / HTML / JavaScript and wrote well-documented code - in fact, the output serves as both documentation of the code and output from it (in my own stupid way, I was thinking I was following Donald Knuth's literate programming approach :D).

The repo (https://github.com/ashok-khanna/pdf) contains all the necessary code and is intended for others to reuse in their projects. Some of it isn't straightforward, despite the guide looking easy - I had to figure out how CSS selectors and counters work, for example, and how MathJax interacts with Paged.js.

I think the confusion comes from it being labeled as a "guide"; in fact, it's a full set of code providing the required functionality for high-quality PDFs from HTML using paged.js. The guide is just the self-documentation, as I figured I might as well use the documentation as the sample output. Otherwise, I'd be genuinely curious what constitutes a Show HN vs. a normal post.

I think the repo description and the way the output is presented are confusing / unclear - the primary goal is very much for this to be a code base for people to reuse, as I've noticed that for many programmers, the design side can be a bit more elusive.

Separately, would it be possible to add "beautiful" back to the title? It's not really about producing PDFs from HTML, as browsers can already do that and there are many other tools. The main aim is the functionality to produce very high-quality typeset PDFs from HTML, which until now I felt only PrinceXML did well, and that's a paid solution. Maybe the title could be "High-quality PDFs from HTML using Paged.js"? I know there has been a separate discussion on another thread about the overuse of the word "beautiful" in describing code - my view is that it has its place when it relates to output / UI.

Thanks for reading, and no issues otherwise (no need to reply).


Yes, I thought it was just reading material and didn't realize that you were sharing code. Sorry! Have restored "Show HN" now.


I tried out paged.js recently for a genealogical report exported from Gramps, but I had to use PrinceXML because using counter-reset to start the page count at a given number does not work: https://gitlab.pagedmedia.org/tools/pagedjs/issues/91

Apart from this feature everything worked fine.


Do you have a repo I could play around with? I had the same issue, but for my use case I figured out a workaround - primarily by using classes. I could spend a few minutes and see if I can get it working for you.

But yes, it’s a deficiency in the system currently


I would be happy to see it too.

Resetting the page number using a counter works: https://codepen.io/julientaq/details/MWammZV

The property needs to be set on the element, not on the @page.

But we may have missed a bug, so I'll be happy to check your code.


Sorry, I forgot to check replies! This is the CSS:

  #grampstextdoc {
      counter-reset: page 7;
  }
  @page {
      margin: 2cm 3cm;
      @bottom-center {
          content: counter(page);
      }
  }


Adobe makes it harder and harder to use their PDF reader. I live in Canada, and somehow I'm given forms I legally have to use that I can only print out in Adobe Reader v10. I need to go through the hassle of installing and uninstalling their terrible product a couple of times a year.


This looks great -- well done! I'd love to be able to use it (the CSS in particular) in a number of different projects where creating such nice readable output is a hassle. However I couldn't find a license mentioned anywhere -- either for the associated repo as a whole [0] or the CSS specifically.

Would it be possible to add a license so it's possible to know whether others can use this in other projects without rewriting the CSS from scratch?

[0]: https://github.com/ashok-khanna/pdf


Thanks :) It's all free. I will figure out how to add the license tomorrow, as I don't know which is the right one - basically, I don't want people to have any issues with GPL, so if I understand correctly, MIT is the right one?

Also, paged.js is the property of its team (together with the toc.js script), while MathJax is the property of its team too. I have to figure out how to word that.

But if anyone is reading in the meantime: it is open source, and there's no need to attribute anything back to me for my parts. If you are using the text of the guide, you could mention my name, but don't sweat it either - it wasn't particularly involved in terms of writing (the hard part was choosing which parts to write about so it's not too complex but also not too barebones).


Hi, Adam here from Pagedjs. Pagedjs is MIT. Mathjax is Apache 2.0 and isn't part of our work :) Wonderful as it is...we can't claim credit for it! (We Love MathJax - https://github.com/MathJax/MathJax)


Thanks Adam! I know I come across as one of the biggest fans of paged.js, but it’s such an inspirational project!

As somebody who looked at all the alternatives, I think the idea of polyfilling the required functionality for the CSS print rules is pure genius. It's a great example of thinking outside the box.

Thanks again!!


Also feel free to email me / send me a message on GitHub if you need any help in customizing parts of your CSS. I feel like I have become a guru in this now and can quickly figure out how to use it for specific goals :D

I may write a book on typesetting with CSS, as the quality of what is on the web is not the best, but it seems like a huge time sink at the same time...


For me, the best way to read something is a medium that is limited in the x direction but has no page limits and can grow downward as needed, like most websites. Foldable sections, hyperlinks, and floating TOCs are a bonus too.

This is ideal for reading on a tablet or a desktop, and is somewhat printable too.

Unfortunately, most tools for producing PDFs from HTML assume you want to divide the content into pages, and there is no easily producible "reading format" as widely adopted as PDF. Those page cuts are so annoying when you read from digital media.


PDF is elegant to my eyes. To make it a web-surfing experience, it lacks one thing for me: the TOC on the left or right side, so I can click and jump to various sections quickly.


The table of contents is itself hyperlinked. And at the bottom of each page is a link to return to the table of contents.

Let me know if that works well?


Yes, that helps a lot. A TOC on the side would be even better, but the TOC hyperlink is good enough for now. Thanks!


Thanks!


Are there any report generation libraries built on top of paged.js?

I've been looking for a report generation solution based on frontend technology.

Btw. This is great. Thanks for sharing.


Thanks for your kind words.

Would you be able to expand on what you mean by “report generation libraries”?

For example, I am building (in Common Lisp, but it's trivial and can be done in any software) a tool to read content from a database and auto-generate the HTML markup for producing PDF reports. This allows me to reuse content across reports and also leverage the full power of databases (text search in particular). As another example, I have many monthly financial metrics - I will store these in a database, then use my Lisp markup tool to generate the necessary HTML to produce the PDF report (via paged.js).

In addition, one can use headless Chrome to automate the full workflow so that the reports are generated directly from your program and not via File > Print in your browser.
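For illustration, a rough sketch of the headless part with Puppeteer (it assumes the page loads the paged.js polyfill itself, and the .pagedjs_pages class name is based on my reading of paged.js's rendered output - verify it against your version):

    const puppeteer = require('puppeteer');

    async function renderReport(url, outPath) {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle0' });
        // Wait for paged.js to finish splitting the content into pages
        await page.waitForSelector('.pagedjs_pages');
        // preferCSSPageSize lets the stylesheet's @page size win over the default paper size
        await page.pdf({ path: outPath, preferCSSPageSize: true, printBackground: true });
        await browser.close();
    }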

Was that what you were thinking of?

You can also add charts via Chart.js.

The beauty of paged.js is that you can leverage many of the features of browsers and JavaScript libraries in your report generation.

I wasn't able to get syntax highlighting for code blocks to work, however; I need to dig into that a bit more.


> Was that what you were thinking of?

Yes! This is great. Thanks for the pointers. I'll look into that.


I've built a Python library called ZipReport to manage and generate PDF reports from HTML using paged.js - https://github.com/zipreport

The actual PDF generation component is an Electron application, so it may fit your "frontend technology" requirement.


I started generating PDFs before paged.js existed and use my own Electron-based solution. Search for schild.report on GitHub to see how it works. Basically, reports are created via Svelte templates and then printed via the Electron print API. Works extremely well.


Yup. I use pagedown all the time.

https://pagedown.rbind.io/


In case you missed it, pagedown implements pagedjs afaik


What is your favourite library for printing HTML to PDF?


https://pandoc.org/ + something like pdflatex


Can anyone speak to how this compares with something like wkhtmltopdf? That's what I've been using, and this is my first time hearing about PrinceXML too.

I generate some HTML for the user, they can edit it with a rich text editor like TinyMCE, and then we export to PDF on the server side. Wkhtmltopdf is pretty barebones on the style/feature side though, so this looks worth investigating.


From my limited experience with wkhtmltopdf, a solution based on paged.js should work better.

But maybe that's because I didn't fully learn wkhtmltopdf.

PrinceXML is very good but is a paid solution. I found paged.js, at least for my purposes, on par with PrinceXML.


Interesting, thanks for the input!


Great paged.js tutorial, thank you for publishing it.


Thank you! Paged.js is really such an awesome tool.

I was searching for many weeks for something like this, so I really think the word needs to get out there more. It could significantly improve the workflows of many people who are self-writing / self-publishing, as it opens up the power of CSS and HTML (which lets you define nice formatting templates and use code to automate content generation) to PDF reports (which I think have their place).

I haven't used pandoc, but I think an HTML/CSS/Paged.js workflow could challenge it.

At work I'm already converting many processes to it - I have a database of content, and I use SQL queries to extract data and generate beautiful PDFs through paged.js.

It also works well with mathematical typesetting (via MathJax).


Awesome project.

Consider adding automatic anchors on headers in the demo so one can quickly copy them for sharing. Currently they can only be obtained from the ToC, but you need to scroll to it. On anything larger than a few pages, this is a must. One problem there is that the current automatic ids are generated sequentially and aren't really user-friendly for link sharing.
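Something like this (an untested sketch - the heading selectors are an assumption, and duplicate headings would still need de-duplication) would give friendlier anchors:

    // Derive stable, human-readable ids from heading text
    document.querySelectorAll('h1, h2, h3').forEach(h => {
        const slug = h.textContent
            .trim()
            .toLowerCase()
            .replace(/[^\w\s-]/g, '') // drop punctuation
            .replace(/\s+/g, '-');    // spaces to hyphens
        if (slug) h.id = slug;
    });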


I will look into this and add it in (if possible); please do check back in a couple of weeks (maximum).

Would you be able to expand on what you mean (sorry, I'm being dense)? Otherwise I will Google it tomorrow.

Thanks for the kind words


I mean links like this: https://majkinetor.github.io/mm-docs-template/#quick-start

If you go there, I can grab it by hovering over a header (the ↵ symbol gives the permalink). It also shows nicely in the URL by combining words instead of generating a sequence.


Thanks! I will look into it and revert; please check the repo from time to time - I'll create an open issue for it.


It's absolutely fascinating how much attention "x to PDF" or "PDF to x" still draws. I use and struggle with this on a daily basis and have been trying to build my own solutions, but never really got there. It's still an issue people are willing to pay to solve in the best way possible for their use case.


There's definitely a market for it!


The PDF saved from Chrome does not have a TOC in the side pane of Acrobat Reader. That's a pretty important missing feature.


The bookmarks feature of PDFs doesn't exist in Chrome's PDF engine, which means paged.js doesn't support it either, last I checked.

We solved this by post-processing the PDF generated with paged.js + Puppeteer using iTextSharp (LGPL) to add the bookmarks.

We captured the TOC using the paged.js "after" hook and put it into a variable which our backend could then grab from Puppeteer.
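For anyone wanting to replicate this, a rough sketch of the capture side (the handler class and afterRendered hook are from the paged.js handler API as I understand it; the .pagedjs_page / data-page-number markup is from its rendered output and may differ by version):

    // Collect heading titles and final page numbers once paged.js has rendered
    class TocCapture extends Paged.Handler {
        constructor(chunker, polisher, caller) {
            super(chunker, polisher, caller);
        }
        afterRendered(pages) {
            window.__toc = Array.from(document.querySelectorAll('h1, h2')).map(h => {
                const pageEl = h.closest('.pagedjs_page');
                return {
                    title: h.textContent.trim(),
                    page: pageEl ? pageEl.dataset.pageNumber : null
                };
            });
        }
    }
    Paged.registerHandlers(TocCapture);

The backend can then read window.__toc via Puppeteer's page.evaluate and feed it to the bookmark-writing step.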


Yes, that would be an advanced feature, and I think it's likely out of scope for paged.js. That said, the table of contents page is hyperlinked - you can jump to sections, and I put a return-to-table-of-contents link in the footer to aid with navigation. Hopefully that helps?


My question on these is always how they handle multi-page documents. Most DOM- and CSS-based approaches to HTML-to-PDF require fully rendering everything for placement before they can convert.

If you have to render something big, like a 300-page member directory for example, the approach will blow up.


I used paged.js in production and while it's definitely not fast enough to run during an HTTP request, it can render a 300 page document reasonably quickly, definitely within the 120 second TTL of our worker tasks. It can be quite finicky though and sometimes stalls on things like images that are taller than a single page.


The RAM upper limit is my bigger concern.


The www.booksprints.net people made 700 pages (which means a lot of divs to handle and process) without too many issues.

And you can also check https://villachiragan.saintraymond.toulouse.fr/impression to see HD images.

Both went straight to the printshop.


Add to this a PDF-to-HTML converter, with a focus on official forms (e.g. IRS tax forms), the ability to easily edit fields and add signatures (similar to how the free Android Adobe app does it), and you can charge money for it.


Thanks - indeed, there is likely a market for it. One of the issues is that to get a commercial app, you have to solve for most of the edge cases and make sure it has a good enough UI for unsophisticated users.

I was thinking about doing it, but it would be a lot of work to do right.

By "right", I mean I would want it to be the quality of Sublime or Emacs / Vim :) :)


You'd essentially have a PDF editor that can import a PDF, edit it, and export the HTML back to PDF. Working with official forms is one use case. Another is an iframe that can preview PDFs without resorting to plugins.


Indeed - that makes a lot of sense


Actually, there's one more use case for an HTML-to-PDF converter: making a book-style copy of a multipage website. I'm looking right now at a scientific site with content spread over multiple pages, and it's tiring to find and click all the links to make sure I don't miss anything.


This is cool. Exactly what I was looking for not too long ago (sometimes markdown does not fit all the needs for documentation).

How about images? How do you handle them - their layout, scaling, etc.?


Basically, you can use CSS to manage their layout - dimensions, positioning, scaling, etc. It should work pretty well.
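For example, a small sketch (the selector and values are placeholders to adapt):

  img {
      max-width: 100%;     /* keep images inside the page box */
      height: auto;        /* preserve aspect ratio when scaled down */
      break-inside: avoid; /* avoid splitting an image across a page break */
  }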

If you want floating images (e.g. text on the left, images on the right), it may be a bit more difficult and not perfectly possible. This guide will help: https://www.pagedjs.org/page-floats/

One tricky part is if you want to have text within images at the same size as your main text (e.g. in MS Word, where you can have shapes and text boxes). For that, you can probably get close enough with a simple image load, and more precise by using SVG graphics, but it may take a reasonable amount of complexity to make perfect (if at all).

For charts, use Chart.js in my opinion.


There should be a button on there that converts the site to a PDF.


Some sites have this, notably Wikipedia and Wikisource, with the option to save as HTML, PDF, and ePub, generally. Occasionally other formats.


By using a W3C standard and devolving the layout engine to the browser, this solves a difficult problem the right way.

I would have loved to have something like this for a project years ago.


It really does. The team at paged.js are simply amazing and deserve so much credit.

It's such a big deficiency in the modern web; we really need Chrome / Safari / etc. to implement the W3C standard or something better.


Another great option is https://www.pdfmonkey.io. They're also wired up in zapier.


I've been looking for a way to convert WordPress blogs to PDF. There are WordPress plugins for this, but I have not found any that work well.

Can this be integrated with WordPress?


Should be able to - somebody with intermediate WordPress knowledge (unfortunately I don't know PHP) should be able to integrate it within a day, in my opinion, based on my understanding of web development.


Looks perfect when I saved it to PDF using the File->Print menu in Safari. But it exported as a single long PDF page when doing it via File->Export as PDF.


I'm very confused by the PDF appreciation comments. I have to read lots of PDF textbooks and reference documents for school, and the experience is grating, especially trying to navigate a document with upwards of 1000 pages in the Chrome PDF reader or Adobe Acrobat on a laptop. Trying to manipulate a tiny scrollbar with a laptop touchpad is very frustrating, and gesture scrolling is tedious for a large document when you have to flip around to various pages. Perhaps I've been doing something wrong; any thoughts?


PDFs look and are made to feel like actual books. I see that as the primary reason for people being comfortable with PDFs. Even if their readers are buggy.

Consider the alternatives.

HTML - too much re-rendering and re-formatting.

Word - Oh No. Not in a million lives. Have you seen the atrocity that is the "reading mode" in Word?

Epub / Mobi / Etc - I have never come across good readers.

For what it's worth, PDFs are great for reading on larger screens like iPads. I read them on my mobile too, but that's not good for long reads.


Get a dedicated ebook reader if you can.

The price is pretty reasonable up to about 7-8", and not too high up to 10".

I splurged for 13" as I read a lot and often low-quality scans of small, three-column print.

The pixel density of an ebook reader (200-300 DPI) is far higher than even Retina displays. Monochrome/greyscale gives higher resolution as well (the three-elements-per-pixel aspect of colour displays means you're always left with about 30% of the effective resolution, though subpixel anti-aliasing helps a lot).

Portrait will display a single page well (laptop displays suck for reading text), and for larger devices or larger-print materials, you can often manage a two-page-up display.


Well, imagine reading (and searching) an HTML document of 1000 pages or more (I certainly would not want to scroll through it) and you'll realise why people who read longer texts like PDFs. It seems like you have issues with your reader, but there are lots of other readers which render documents very fast.


Feels very much like a channeling of Churchill's 'Democracy is the worst form of government' ditty, except as applied to HTML and PDF. PDF is horrible. But it's not 'as horrible' as HTML (with the rather loomingly large caveat that this has very little to do with the formats and everything to do with what your average HTML dev ends up making, and only applies to the job of reading significant chunks of in-depth material).

As a format, yes, what the fuck is everybody talking about? PDF is a disaster and should be killed off, HTML is great.


I would prefer it as a single page, like iOS does when you take a screenshot of a webpage and it offers to export a PDF with everything.

