Using Web Technologies to Print a Book (richardmavis.info)
171 points by rmavis 67 days ago | 68 comments

GitBook and PressBooks are also very affordable and great options that are built with web technologies and are slightly less manual.



GitBook is targeted at people creating technical documentation in Markdown. It has the advantage of git integration so it’s possible to branch & merge your way through a book project. Their GitHub account still hosts the legacy open source editor and CLI tools, although the project has moved on to focus on the commercial web-based editor. https://github.com/GitbookIO?tab=repositories

PressBooks is targeted at authors who want to self-publish and has a nice set of book templates (styled by book genre) and a workflow for selling print books on Amazon.

Scrivener (mentioned by dangoor in another comment) is a writing tool with advanced features for different manuscript formats (screenplays, novels, etc.) and has cool features for organizing notes, ideas, and references.

Pressbooks looks like the CDBaby or DistroKid of book publishing. Neat!

Unsolicited code review:

    system("multimarkdown -s ../#{part}/story.md | sed -E 's/_([^_]+)_/<em>\\1<\\/em>/g' | sed -E 's/<h1 .+<\\/h1>//g' | sed 's/<p>%<\\/p>/<p class=\"section-break\">%<\\/p>/g' > output-#{part}.html")

    s_<p>%</p>_<p class="section-break">%</p>_g
is easier to read than:

    s/<p>%<\\/p>/<p class=\"section-break\">%<\\/p>/g
Any reason to keep the %-sign?

Those sed commands could probably be reduced to one invocation of sed(1) instead of three.

    system("cat #{htmls} | #{wkhtmltopdf_cmd} --footer-html footer.html - body.pdf")
No need to use cat.
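For what it's worth, here is a rough sketch of the same three substitutions collected in one place. It swaps the sed pipeline for Python's `re` module, which is my substitution and not what the article uses, and the function name `postprocess` is made up for illustration.

```python
import re


def postprocess(html: str) -> str:
    """Apply the three substitutions from the sed pipeline inside one function."""
    # _text_ -> <em>text</em>; [^_\n] keeps each match on one line, like line-based sed
    html = re.sub(r"_([^_\n]+)_", r"<em>\1</em>", html)
    # Drop any <h1 ...>...</h1> that sits on a single line ('.' does not match newlines)
    html = re.sub(r"<h1 .+</h1>", "", html)
    # Tag the lone-% paragraphs as section breaks (plain string replace, no regex needed)
    return html.replace("<p>%</p>", '<p class="section-break">%</p>')
```

Since the quoting is plain Python here, none of the doubled backslashes from the shell-inside-Ruby version are needed.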

You're right -- good suggestions. Thanks.

This doesn't work for Linux, but for folks writing a novel, I can definitely recommend Scrivener (Mac and Windows). In addition to a powerful writing environment, its "Compile" feature can output flexibly for print and ebooks. I generated hardcover, paperback, mobi, and epub, with enough customizability for me and no scripting required.

Granted this is Hacker News and people here _like_ writing code (myself included), but sometimes you just want to use a tool to make a thing :)

Well, Mac and almost Windows. Windows is still stuck in version 1.x of Scrivener, which is a decent-enough writing environment (bar the now-quite-quaint UI), but it's nowhere near the version 3 that Mac users have access to, especially as regards compilation and output. (I've been waiting for a Windows version update for years and years.)

That's too bad. I did remember that Scrivener for Windows was made by a different developer. Scrivener 3's compilation was a big upgrade.

There's a 2.9 beta for Windows. I wonder how close to parity that is compared to the existing 1.x version.

I've not used it in years but I know there was a Linux version available for download from the support forums. It had a few issues but it worked really well when I tried it ~2 years ago on Ubuntu 14.04

It still exists I believe, although I did have to do some troubleshooting to install it. Sadly I cannot remember the steps.

It hasn't been updated in years and is broken on modern distros.

There's an open source clone called Manuskript though.

It's on iOS too, but sadly can only sync via Dropbox. Scrivener should allow the entire file/database to be opened and transferred in another app, e.g. SSH or SMB client.

Do HTML and CSS rendering engines implement Knuth paragraph layout yet? AIUI, they still use greedy line layout and justification, which can produce various artifacts (like whitespace rivers and orphan lines) that the Knuth-style linebreaking in TeX avoids. Right now, I still don't think I'd consider using anything but TeX or some heavyweight DTP package for printed material.

As I understand it, Prince XML, which is a CSS rendering engine, uses a variant of Knuth-Plass line breaking. They have samples on their homepage, so judge for yourself. It's most certainly not greedy.
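To make the greedy-versus-optimal distinction concrete, here is a toy sketch. It is not Knuth-Plass proper (no glue, stretch or shrink, hyphenation, or penalties): it assumes fixed-width characters, assumes every word fits on a line, and simply minimizes the sum of squared trailing space on every line except the last. The function names and the cost function are my own choices for illustration.

```python
from functools import lru_cache


def greedy_breaks(words, width):
    """Fill each line as far as it will go before starting the next one."""
    lines, current = [], []
    for word in words:
        candidate = current + [word]
        if current and len(" ".join(candidate)) > width:
            lines.append(current)
            current = [word]
        else:
            current = candidate
    if current:
        lines.append(current)
    return [" ".join(line) for line in lines]


def optimal_breaks(words, width):
    """Pick breaks that minimize total squared slack; the last line costs nothing."""
    n = len(words)

    def line_len(i, j):
        # Length of words[i:j] joined by single spaces
        return sum(len(w) for w in words[i:j]) + (j - i - 1)

    @lru_cache(maxsize=None)
    def best(i):
        # (cost, break positions) for the cheapest layout of words[i:]
        if i == n:
            return (0, ())
        options = []
        for j in range(i + 1, n + 1):
            slack = width - line_len(i, j)
            if slack < 0:
                break
            cost = 0 if j == n else slack ** 2
            rest_cost, rest_breaks = best(j)
            options.append((cost + rest_cost, (j,) + rest_breaks))
        return min(options)

    lines, start = [], 0
    for end in best(0)[1]:
        lines.append(" ".join(words[start:end]))
        start = end
    return lines
```

With `["aaa", "bb", "cc", "ddddd"]` and a width of 6, the greedy version leaves "cc" stranded on its own line, while the optimizing version gives up a little space on the first line to even things out.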

No CSS implementation "for the web" does.

Implementing Knuth-Plass justification for print and implementing it in a web-compatible manner are quite different tasks. I think the algorithm's interaction with CSS floats is unclear. As always, there's a bug for it in Firefox's tracker, and no, it's not because browser vendors don't want to or are lazy. https://bugzilla.mozilla.org/show_bug.cgi?id=630181

Apache FOP[1] does[2] if you're willing to jump back in time to XHTML or use an HTML2FO converter.

As always, Unicode is a problem.

1: https://xmlgraphics.apache.org/fop/

2: https://xmlgraphics.apache.org/fop/0.95/hyphenation.html

cf. https://wiki.apache.org/xmlgraphics-fop/HowTo/HtmlToPdf

That's Knuth-Liang hyphenation, which is something different from Knuth-Plass line breaking.

Oh, good catch, sorry about that.

Simon Pepping did some work on Knuth-Plass in FO, but it's old and probably not quite as relevant now:

- https://web.archive.org/web/20070114211331/http://www.leverk...

- https://web.archive.org/web/20070128145517/http://www.leverk...

Do not use wkhtmltopdf for this, use weasyprint instead.

Markdown → Pandoc → HTML → weasyprint → PDF works great, paged media support in weasyprint is good enough and much, much better than in wkhtmltopdf.
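A minimal sketch of that pipeline as a build script, in case it helps. The file names (`print.css`, `book.html`, `book.pdf`) and the helper names are hypothetical; the only external interfaces assumed are pandoc's `--standalone`, `--to`, `--css` and `-o` options and the `weasyprint INPUT OUTPUT` command-line form, so check `--help` for the versions you have installed.

```python
import subprocess


def build_commands(chapters, css, html_out="book.html", pdf_out="book.pdf"):
    """Return the two command lines: Markdown -> HTML, then HTML -> PDF."""
    # pandoc concatenates multiple input files in the order given
    pandoc = ["pandoc", "--standalone", "--to", "html5", "--css", css,
              "-o", html_out, *chapters]
    # WeasyPrint lays out the HTML using the linked stylesheet, @page rules included
    weasyprint = ["weasyprint", html_out, pdf_out]
    return [pandoc, weasyprint]


def build_book(chapters, css):
    """Run both steps, stopping at the first failure."""
    for cmd in build_commands(chapters, css):
        subprocess.run(cmd, check=True)
```

Calling `build_book(["ch1.md", "ch2.md"], "print.css")` would then run both tools in sequence, assuming they are on the PATH.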

I clearly missed something, but why are you not going markdown -> PDF directly? That's what we use extensively in our operations playbooks at work and it gives something that looks like I'd get in a book - even more so after I tweaked the templates a little.

Because the user in question does not know and is unwilling to learn how to design anything if it is not web technology.

That is, you probably want to design PDF output. Some users prefer to do any design work whatsoever using web technology. Therefore, PDF needs to be generated from HTML.

If OP is used to working with HTML/CSS, it will probably save him a lot of time and energy to do basic formatting in HTML/CSS than to learn how to design a professional-looking PDF document. He's a programmer and novelist, not a designer.

In fact, he's not even aiming to design a professional-looking PDF document, just a throwaway printout for proofreaders who don't know how to parse Markdown.

Why not use something like http://github.com/susam/texme as a starting point? Really easy to turn any Markdown document into a rendered HTML with a single line of code in the header.

This rendered HTML could be converted to PDF and printed or the self-rendered HTML itself could be printed directly.

There are a whole bunch of similar tools written in all sorts of languages, and OP just chose what he's most familiar with. It just happened to be different from your favorite toolchain.

It also looks like OP's book was split into several Markdown files, one for each chapter. So he would have needed some sort of build script anyway if he wanted to use texme on the combined document. He would also have needed more than a single line of code in the header, since he wanted some custom styling for blockquotes and code snippets.

They also noted that they (ab)used different markup for different character parts and code blocks would usually be rendered very differently from block quotes. Markdown → PDF may give less control over the output formatting than Markdown → HTML + custom CSS → PDF.

I'd probably use Asciidoctor which gives more flexibility on the markup side already, but if this process works for them, why not use it.

> Because the user in question does not know {fill in the blank}


> and is unwilling to learn how to design anything if it is not web technology.

False and does not follow.

I am not aware of any direct Markdown -> PDF converters. What do you have in mind?

pandoc makes use of Latex, which is extremely bulky and brings with it a lot of complexity and idiosyncrasies. I was not able to get anything as nice as a GitHub rendering out of that.

The only alternative I know is using web technologies (chrome headless) to produce a PDF, which is what the author does.

Glad the author shared his methods. If you're going to repeat this, I'd suggest you give it a try with the first 10 to 15 pages. Something like a small zine. Then use all the tools and make sure you get the output you're expecting.

I've written several booklets that I sell on Amazon and my website. If you're interested in writing but not ready to commit to a full novel, try publishing a small zine or two. It's a lot of fun.

My latest is a series that teaches JavaScript by creating computer art. Readers learn by copying.


Thanks for sharing this link. I've written a few simple art-making Javascripts as well (http://richardmavis.info/squares, http://richardmavis.info/circles, http://richardmavis.info/stars, http://richardmavis.info/snow, http://richardmavis.info/malevich, http://richardmavis.info/whale-shark-skin) but am definitely interested in other techniques.

I really like minimalist abstract art. These are great generative art pieces that fit the bill.

Very off topic, but I couldn't find another easy way to get in touch with you. If you're up for it, I'd be super keen to swap notes on self-publishing. (Or to be more precise, I'm super keen to learn about your experiences with subscription self-publishing, and I'm hoping I can return the value in some way. I've been doing skill-building biz books & a bit of textbook stuff.) I'm rob@robfitz.com if it's relevant.

PS. sorry to everyone else for the off-topic!

You are on linux. You want to typeset a book. Why not use (pdf)LaTeX?

LaTeX is really good if someone needs lots of math, references and so on. It's really bar none when it comes to technical writing. But it's so clunky in some ways that it's more of a precision tool for this particular task than generic layout software. I'm saying this as a person who has, in past lives, written scientific material using LaTeX and done semi-professional graphic design and layout work as well.

Oh god, in my time I tried to tweak LaTeX so I could reach the same aesthetics as using InDesign. While doable (I'm sure), it's not really worth the effort. LaTeX and its ways of working are very specific to that one context. It's the Torx screwdriver for the Torx screws of technical and scientific publishing. But a lot of layout stuff needs a Phillips head, a nail and a hammer, and so on. While a Torx screwdriver surely can be used to pound nails, I would not suggest it as an efficient tool.

Don't feed the LaTeX fetish. Some people like to do everything with it, just like people like doing lots of odd things that bring particular aesthetic joy to them, that would be completely impractical or intolerable to others.

LaTeX in a non-technical or non-scientific context is an eccentric quirk. I love eccentric quirks and people who have them! I have many myself! But I would not push my quirks to other people in any setting.

There's a lot more to LaTeX than just equations though. There are packages for handling lots of things the humanities need too. Egyptian Hieroglyphs? Use HieroTeX. Music? Use Musixtex. Really, think of anything that would be hard to typeset manually and somebody's likely to have already created a LaTeX package for it.

> and somebody's likely to have already created a LaTeX package for it

... or a dozen, each with different features, like tables.

I did that and I didn’t like LaTeX. The error messages for invalid LaTeX code were especially unhelpful, pointing to lines far away from the actual erroneous line. I might be too stupid... but that’s how it felt, and I imagine simple Markdown would be much more usable.

Everyone always recommends LaTeX, but I always end up spending more time googling cryptic errors than actually writing when using it. I would not recommend it to a friend.

Because there is more to a book than typesetting, and anything like putting images where you want them or specific whitespace aesthetics is a major pain to do in LaTeX.

What's so painful about whitespace in LaTeX? hspace is horizontal space. vspace is vertical space. I mean, you should use smallskip, bigskip, etc., but if you want specific length, hspace/vspace work.

Last time I checked (maybe 2 years ago?) there wasn't a good open source html to pdf workflow. Specifically page-numbers and anything else involved with paged media is a nightmare, the CSS standards in that regard are not implemented. There is "Prince", but it isn't OSS and rather expensive.

Phantomjs (and its ilk) are based on browser engines and just don't support this. Also I would love to be able to change layout or content based on where particular elements turn up.

Docraptor https://docraptor.com/ is a monthly subscription service that uses Prince under the hood. It can be a cheaper option (though I wish for a PAYG model).

WeasyPrint supports CSS Paged Media and is open source. Yes, anything based on browser engines does not support CSS Paged Media.

WeasyPrint looks to have progressed a lot since last I looked at it (when there was no way I could use it at all, though I can’t remember the reason), but looks to still be quite limited. A couple of things that spring to my attention immediately: no flexbox, and no CSSOM (so that you can’t adjust the document based on layout at all unless you can do it in straight CSS). Still, in practice probably usable for what I was doing several years ago, with similar limitations to Prince (which also has no CSSOM implementation, though at least it has a JavaScript engine). But anything more advanced in the way of fine page-dependent layout, I get the impression that WeasyPrint won’t be able to do, while Prince is superb at handling such things.

Yup, Prince is great. There are reasons they can get paid when WeasyPrint is free. But my parent post already knew about Prince and specifically wanted to know about an OSS one.

One of the problems with CSSOM and similar things is the problem of iterative layout: Let's say you have a list which should be split automatically across two pages. You also want to add a table header on the second page. But now the second part of the list doesn't fit on the page anymore. So you need to split the list further, and add another header.

Basically every change to the CSS or DOM requires a reflow (or clever optimizations to avoid that).
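Here is a toy model of that loop, just to show why one pass is not enough. It assumes fixed-height rows, a uniform page capacity measured in rows, and a header repeated on every page after the first; the function name and parameters are invented for the example and have nothing to do with how any real engine is structured.

```python
def paginate_with_headers(n_rows, page_capacity, header_rows=1):
    """Repeat the layout until the page count stops changing.

    Returns (pages, passes): the final page count and how many layout
    passes it took to get there.
    """
    if header_rows >= page_capacity:
        raise ValueError("header must be smaller than a page")
    pages, passes = 1, 0
    while True:
        passes += 1
        # Every page after the first carries its own copy of the header
        needed = n_rows + header_rows * (pages - 1)
        required_pages = -(-needed // page_capacity)  # ceiling division
        if required_pages == pages:
            return pages, passes
        pages = required_pages
```

Ten rows on five-row pages look like two pages at first; adding the second page's header pushes a row onto a third page, which needs its own header, and only the third pass confirms that the layout is stable.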

That’s no different from what browsers do at present.

Yes, modifications may cause a reflow. So? That just means that it’s slow. That’s not a problem.

That’s how you implement such things. The initial implementation throws away all layout information as soon as you modify any CSSOM property, and recalculates it. You release that to people saying “it can now do this, but it’ll be extremely slow; let us know what sorts of things you do with it and when you find particularly awful performance cases, and then we’ll look into speeding it up”. Then, as people try using it, you determine where it’s worth putting effort into speeding it up. This is exactly what Michael Day of Prince said they’d do if/when they implemented CSSOM, when I asked him about whether it might come, several years ago. This is an entirely reasonable approach.

I have a theory about how this should be solved in web standards, which in turn has other uses besides printing, but well, I don't feel like I can affect web standards in any meaningful sense.

The basic idea is to add (placeholder names) stopUpdating/resumeUpdating to window, which can be polyfilled as no-ops. The semantics are that, while updating is stopped, CSSOM view methods are allowed to return the value from before updating was stopped, or any later value. That is, the current web standards force you to do things "live"; the new methods give the option to do things in batches.

So long as you don’t read layout information in between, you can already batch your modifications just fine. All it takes is care in how you structure and implement things, and you’re fine. And before you object, your proposed solution would require almost as much care, have more hazards to trip over, and require opting in in a way that few would—and those that would, already know how to be careful. I suspect it would also increase complexity and possibly memory usage in the browser.
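A toy model of what I mean, with a made-up `FakeElement` class standing in for a DOM node. The assumption baked in is that writes only mark layout as dirty and a read forces a reflow when layout is dirty; that is a simplification, not a description of any particular engine.

```python
class FakeElement:
    """Stand-in for a DOM node: writes dirty the layout, reads reflow if dirty."""

    def __init__(self):
        self.width = 0
        self._dirty = False
        self.reflows = 0

    def set_width(self, value):
        self.width = value
        self._dirty = True

    def measured_width(self):
        # Reading layout information forces a reflow only if something changed
        if self._dirty:
            self.reflows += 1
            self._dirty = False
        return self.width


def interleaved(widths):
    """Read layout after every write: one reflow per modification."""
    el = FakeElement()
    for w in widths:
        el.set_width(w)
        el.measured_width()
    return el.reflows


def batched(widths):
    """Do all the writes first, then read once: a single reflow."""
    el = FakeElement()
    for w in widths:
        el.set_width(w)
    el.measured_width()
    return el.reflows
```

Three interleaved write/read pairs cost three reflows in this model, while the batched version costs one, with no new API involved.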

If you were designing something from scratch, such approaches would be worthwhile considering, but I think that boat has sailed, and the architecture would fight against you.

Then again, I believe it was generally accepted that web browsers were stuck with UTF-16, until Simon introduced WTF-8 for Servo.

I think the approach of WeasyPrint and Prince, implementing a dedicated layout engine for paged media, is better than making these things work in Browser engines.

In any case, html/css for paged media should be mostly separate from website code. "Printing out" web pages works in many cases, but it's crappy.

The only two reasons why printing web pages out is lousy are because web developers put little to no effort into it, and because the browser manufacturers put little to no effort into it. I would love the likes of WeasyPrint and Prince to be rendered obsolete by one or more mainstream web browsers. If any of them decided that it was a strategic priority, they’d get a lot done very quickly. It’s just that there’s no compelling reason for them to, while there is for the people behind engines with a specific purpose—and so Prince is pretty safe in its position.

Looks interesting.

In general I think TeX/LaTeX is the way to go in terms of reporting and generation of PDFs. The biggest problem with TeX is that it is so different from HTML, and it gets progressively more different and difficult if you have specific layout or style requirements.

What I wish for is a replacement for LaTeX, based mostly on web standards, extensible in javascript... Unfortunately I don't have the resources to do that.

On the topic of web design rather than book printing:

I would suggest not using element-heavy background designs for as simple a case as this website's. I have a low-grade, student-level laptop, and it makes scrolling noticeably lag, by a few FPS.

A random thought, but you could use a horizontal rule (`<hr/>`) to split sections instead of making a paragraph and filtering. They're part of the Markdown syntax (using 3 or more hyphens, asterisks, or underscores) and you probably wouldn't even have to filter them out. Just use CSS. I'm actually pretty sure it's the semantic use case for them anyway. That's assuming, of course, that you don't use them elsewhere for other reasons.

Other than the unsolicited advice, I really like the workflow. I did something similar for almost all of my papers in college.

Great idea! Thanks.

Last time I seriously tried wkhtmltopdf (three or four years ago, at a guess), it produced results entirely unsuitable for printing in quite a few ways:

• It completely mangled the kerning, like it was ignoring the font’s kerning and then making it even worse by only placing characters to 1pt precision (at 600dpi, one dot is 0.12pt). (To clarify: I never actually measured it; this is just my rough guess as to what may have caused it.)

• It was somewhere between agonisingly difficult and impossible to actually get precise sizes; print A4, for example, with your body carefully set up so the widths add to the right amount, and the appropriate “don’t zoom” command line argument, and it’d still mess it up (and subtle content changes could make it better or worse). A container of `width: 15cm` could end up 15cm wide if you were extremely fortunate, but was more likely to be 14cm, or 17cm, or something like that. And it might vary from page to page.

• Pages didn’t really exist, in layout terms, so that any sort of finesse of where things should appear was just impossible.

• Probably worse, you could end up with the descenders of the bottom line of text on a page at the top of the next page. I have a vague feeling I hit a situation where a line could even be split in half, rather than just the descenders, but that may have been printing from Chrome or Firefox at a similar time.

• Its header/footer stuff was mildly limiting and fairly annoying to get working properly (and made document sizing even more troublesome, too).

It was also very crummy for producing a PDF for screen use, as regards things like links and tables of contents and other annotations.

Had it been just one or two of these things, I would probably have filed bugs; but it didn’t look as though there was much interest in actually fixing things, and it was so very broken for any sort of precise, serious work, that I just gave up.

Have things improved for wkhtmltopdf since then? I’d be interested to hear.

I found the state of the art for web-to-PDF conversion to be Prince (https://www.princexml.com/) by an enormous margin, with it producing absolutely superb results. Nonetheless, it does have some limitations; most notably, in my opinion, the lack of CSSOM, which means JavaScript can’t interact with the layout at all. There were various other CSS and JavaScript niggles that I hit too, but they’ve steadily been fixed over time. Bear in mind that Prince is made by a small team and is the entire web engine.

I would really like a vector graphics pipeline for Servo: https://github.com/servo/servo/issues/3788

Nope, that uses Chromium, which does not support any CSS Paged Media.

FYI: As a (free, open source) alternative there's also octobook [1]. See the Yuki & Moto Press Bookshelf [2], for "real world" live examples using the octobook themes. [1]: https://github.com/octobook [2]: http://yukimotopress.github.io

I'm going to use Prince XML for this and DocRaptor (not free) to finalize without the watermark. Also wrote a small Go tool for creating counters and a TOC.

Well, HTML comes from SGML, so I guess the loop is closed.

Yeah, kind of funny now that you mention it. Some professional page layout software uses XML and XSLT, but the web branch of those technologies developed more rapidly and also seems to be more concise.

I thought Markdown supports line breaks with two trailing spaces at the end of the line. This gist[0] also mentions <br/> or \ at end of line also working.

[0] https://gist.github.com/shaunlebron/746476e6e7a4d698b373

A good question from that gist thread:

>Why does markdown do this? If I want two lines to run consecutively I won’t introduce a newline.

A bit of an educated guess but I believe the answer stems from markdown originally targeting terminals, which would line wrap but not word wrap. So you could get ugly output like this:

    hello, wor
(albeit this is assuming the terminal is only 10 characters wide but I hope you get the point)

So it used to be common for developers and sysadmins to manually wrap to 80 characters (a habit I still regularly catch myself doing even now). Obviously with GUI readers and variable-width characters, you wouldn't want that 80 character manual wrap honoured.

Wrapping sentences into short functional parts (motivated by more or less arbitrary limits like 80 characters) helps having easily editable and readable text.

Apologies but I'm not sure I understand your point. Are you able to elaborate?

As others, I've grown accustomed to wrapping lines to fit in small terminals, and I started avoiding breaking the line randomly. Instead I try to break after self-sufficient parts of sentences so that lines flow naturally after one another. That usually means after a comma, before a connector word, or after an enumeration. It's not always possible, but I find that when I need to rewrite a sentence to be naturally breakable it usually reads better. Furthermore, each line often ends up expressing a whole idea, which means working on that idea usually translates to simple line operations, which fits nicely in an expressive text editor like Vim.

Hitting enter at the end of a line lets you manually wrap paragraphs in a way that will fit into an 80 or 120-char wide terminal.

Ignoring those line breaks is desirable when reflowing the text into e.g. HTML output
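A simplified sketch of that rule for a single paragraph: a newline is treated as a space unless the line ends in two spaces or a backslash, in which case it becomes a `<br />`. The function name is made up, and real Markdown parsers handle many more cases (and differ on whether a soft break is emitted as a space or a newline), so treat this as an illustration only.

```python
def render_paragraph(text: str) -> str:
    """Join manually wrapped lines, keeping only explicit hard breaks."""
    out = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        last = i == len(lines) - 1
        # Two trailing spaces or a trailing backslash request a hard break
        hard = not last and (line.endswith("  ") or line.endswith("\\"))
        content = line[:-1] if hard and line.endswith("\\") else line
        out.append(content.strip())
        if not last:
            out.append("<br />\n" if hard else " ")
    return "<p>" + "".join(out) + "</p>"
```

So text that was hand-wrapped at 80 columns reflows as one paragraph, and only the lines explicitly marked with trailing spaces or a backslash keep their breaks.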

We write everything in Markdown and then convert it to a book with Leanpub (http://leanpub.com). Fantastic tool!
