
Using Web Technologies to Print a Book - rmavis
http://richardmavis.info/using-web-technologies-to-print-a-book
======
mch82
GitBook and PressBooks are also very affordable and great options that are
built with web technologies and are slightly less manual.

[https://www.gitbook.com/](https://www.gitbook.com/)

[https://pressbooks.com/](https://pressbooks.com/)

GitBook is targeted at people creating technical documentation in Markdown. It
has the advantage of git integration so it’s possible to branch & merge your
way through a book project. Their GitHub account still hosts the legacy open
source editor and cli tools, although the project has moved on to focus on the
commercial web-based editor.
[https://github.com/GitbookIO?tab=repositories](https://github.com/GitbookIO?tab=repositories)

PressBooks is targeted at authors who want to self-publish and has a nice set
of book templates (styled by book genre) and a workflow for selling print
books on Amazon.

Scrivener (mentioned by dangoor in another comment) is a writing tool with
advanced features for different manuscript formats (screenplays, novels, etc.)
and has cool features for organizing notes, ideas, and references.

~~~
jedimastert
Pressbooks looks like the CDBaby or DistroKid of book publishing. Neat!

------
sebcat
Unsolicited code review:

    
    
        system("multimarkdown -s ../#{part}/story.md | sed -E 's/_([^_]+)_/<em>\\1<\\/em>/g' | sed -E 's/<h1 .+<\\/h1>//g' | sed 's/<p>%<\\/p>/<p class=\"section-break\">%<\\/p>/g' > output-#{part}.html")
    

IMO:

    
    
        s_<p>%</p>_<p class="section-break">%</p>_g
    

is easier to read than:

    
    
        s/<p>%<\\/p>/<p class=\"section-break\">%<\\/p>/g
    

Any reason to keep the %-sign?

Those sed commands could probably be reduced to one invocation of sed(1)
instead of three.
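
A minimal sketch of the single-invocation idea, in Ruby to match the build script. It assumes a sed that accepts `-E` and several `-e` expressions; `SED_EXPRESSIONS`, `sed_argv`, and `postprocess` are hypothetical names, not taken from the article. Using `#` as the delimiter and passing the arguments as an array avoids both the escaped slashes and the extra layer of shell quoting.

```ruby
require "open3"

# The three substitutions from the original pipeline, written with "#" as the
# sed delimiter so that "/" in closing tags needs no escaping.
SED_EXPRESSIONS = [
  's#_([^_]+)_#<em>\1</em>#g',                    # _text_ -> <em>text</em>
  's#<h1 .+</h1>##g',                             # drop generated <h1 ...> headings
  's#<p>%</p>#<p class="section-break">%</p>#g'   # mark section breaks
].freeze

# Build the argument vector for one sed process (hypothetical helper).
def sed_argv
  ["sed", "-E"] + SED_EXPRESSIONS.flat_map { |expr| ["-e", expr] }
end

# Pipe HTML through that single sed process. No shell is involved, so the
# expressions reach sed exactly as written above.
def postprocess(html)
  output, status = Open3.capture2(*sed_argv, stdin_data: html)
  raise "sed failed" unless status.success?
  output
end
```

The output of multimarkdown could be passed to `postprocess` and the result written to the per-part HTML file, replacing the three chained sed processes.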

    
    
        system("cat #{htmls} | #{wkhtmltopdf_cmd} --footer-html footer.html - body.pdf")
    

No need to use cat.
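
One way to drop `cat`, sketched under the assumption that wkhtmltopdf accepts several input pages ahead of the output path, as its usage synopsis describes. `wkhtmltopdf_argv` is a hypothetical helper and the file names are placeholders. This is not strictly equivalent to concatenating the files onto stdin: each input is loaded as a separate document, so page breaks between parts may differ.

```ruby
# Build a wkhtmltopdf command that reads the HTML files directly instead of
# receiving them concatenated on stdin (hypothetical helper).
def wkhtmltopdf_argv(html_files, output: "body.pdf", footer: "footer.html")
  ["wkhtmltopdf", "--footer-html", footer, *html_files, output]
end

# Passing an array to system() skips the shell entirely:
# system(*wkhtmltopdf_argv(["output-1.html", "output-2.html"]))
```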

~~~
rmavis
You're right -- good suggestions. Thanks.

------
dangoor
This doesn't work for Linux, but for folks writing a novel, I can definitely
recommend Scrivener (Mac and Windows). In addition to a powerful writing
environment, its "Compile" feature can output flexibly for print and ebooks.
I generated hardcover, paperback, mobi, and epub, with enough customizability
for me and no scripting required.

Granted this is Hacker News and people here _like_ writing code (myself
included), but sometimes you just want to use a tool to make a thing :)

~~~
stan_rogers
Well, Mac and _almost_ Windows. Windows is still stuck in version 1.x of
Scrivener, which is a decent-enough writing environment (bar the now-quite-
quaint UI), but it's nowhere near the version 3 that Mac users have access to,
especially as regards compilation and output. (I've been waiting for a Windows
version update for years and years.)

~~~
dangoor
That's too bad. I did remember that Scrivener for Windows was made by a
different developer. Scrivener 3's compilation was a big upgrade.

------
quotemstr
Do HTML and CSS rendering engines implement Knuth paragraph layout yet? AIUI,
they still use greedy line layout and justification, which can produce various
artifacts (like whitespace rivers and orphan lines) that the Knuth-style
linebreaking in TeX avoids. Right now, I still don't think I'd consider using
anything but TeX or some heavyweight DTP package for printed material.

~~~
trynewideas
Apache FOP[1] does[2] if you're willing to jump back in time to XHTML or use
an HTML2FO converter.

As always, Unicode is a problem.

1: [https://xmlgraphics.apache.org/fop/](https://xmlgraphics.apache.org/fop/)

2:
[https://xmlgraphics.apache.org/fop/0.95/hyphenation.html](https://xmlgraphics.apache.org/fop/0.95/hyphenation.html)

cf. [https://wiki.apache.org/xmlgraphics-
fop/HowTo/HtmlToPdf](https://wiki.apache.org/xmlgraphics-fop/HowTo/HtmlToPdf)

~~~
sanxiyn
That's Knuth-Liang hyphenation, which is something different from Knuth-Plass
line breaking.

~~~
trynewideas
Oh, good catch, sorry about that.

Simon Pepping did some work on Knuth-Plass in FO, but it's old and
probably not quite as relevant now:

\-
[https://web.archive.org/web/20070114211331/http://www.leverk...](https://web.archive.org/web/20070114211331/http://www.leverkruid.eu:80/GKPLinebreaking/elements-
xhtml.xml)

\-
[https://web.archive.org/web/20070128145517/http://www.leverk...](https://web.archive.org/web/20070128145517/http://www.leverkruid.eu/GKPLinebreaking/index.html)

------
peclink
Do not use wkhtmltopdf for this, use weasyprint instead.

Markdown → Pandoc → HTML → weasyprint → PDF works great, paged media support
in weasyprint is good enough and much, much better than in wkhtmltopdf.
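
A rough sketch of that pipeline as a Ruby build step, assuming `pandoc` and `weasyprint` are installed and on the PATH. `pipeline_commands` and `build` are hypothetical helper names, and the output file names are simply derived from the input name.

```ruby
# Commands for a Markdown -> HTML -> PDF pipeline (hypothetical helper).
# Output files are written to the current directory.
def pipeline_commands(markdown, css: nil)
  base = File.basename(markdown, File.extname(markdown))
  html = "#{base}.html"
  pdf  = "#{base}.pdf"
  pandoc = ["pandoc", "--standalone", markdown, "-o", html]
  pandoc += ["--css", css] if css   # optional stylesheet linked from the HTML
  [pandoc, ["weasyprint", html, pdf]]
end

# Run each step in order, stopping at the first failure.
def build(markdown, css: nil)
  pipeline_commands(markdown, css: css).all? { |argv| system(*argv) }
end
```

Calling `build("story.md", css: "print.css")` would run pandoc first and weasyprint second, and return false if either step fails or is missing.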

------
yaleman
I clearly missed something, but why are you not going markdown -> PDF
directly? That's what we use extensively in our operations playbooks at work
and it gives something that looks like I'd get in a book - even more so after
I tweaked the templates a little.

~~~
sanxiyn
Because the user in question does not know and is unwilling to learn how to
design anything if it is not web technology.

That is, you probably want to design PDF output. Some users prefer to do any
design work whatsoever using web technology. Therefore, PDF needs to be
generated from HTML.

~~~
kijin
If OP is used to working with HTML/CSS, it will probably take him far less
time and energy to do basic formatting in HTML/CSS than to learn how to design
a professional-looking PDF document. He's a programmer and novelist, not a
designer.

In fact, he's not even aiming to design a professional-looking PDF document,
just a throwaway printout for proofreaders who don't know how to parse
Markdown.

~~~
mcless
Why not use something like
[http://github.com/susam/texme](http://github.com/susam/texme) as a starting
point? Really easy to turn any Markdown document into a rendered HTML with a
single line of code in the header.

This rendered HTML could be converted to PDF and printed or the self-rendered
HTML itself could be printed directly.

~~~
kijin
There are a whole bunch of similar tools written in all sorts of languages,
and OP just chose what he's most familiar with. It just happened to be
different from your favorite toolchain.

It also looks like OP's book was split into several Markdown files, one for
each chapter. So he would have needed some sort of build script anyway if he
wanted to use texme on the combined document. He would also have needed more
than a single line of code in the header, since he wanted some custom styling
for blockquotes and code snippets.

------
codazoda
Glad the author shared his methods. If you're going to repeat this, I'd
suggest you give it a try with the first 10 to 15 pages. Something like a
small zine. Then use all the tools and make sure you get the output you're
expecting.

I've written several booklets that I sell on Amazon and my website. If you're
interested in writing but not ready to commit to a full novel, try publishing
a small zine or two. It's a lot of fun.

My latest is a series that teaches JavaScript by creating computer art.
Readers learn by copying.

[http://splashofcode.com](http://splashofcode.com)

~~~
rmavis
Thanks for sharing this link. I've written a few simple art-making Javascripts
as well ([http://richardmavis.info/squares](http://richardmavis.info/squares),
[http://richardmavis.info/circles](http://richardmavis.info/circles),
[http://richardmavis.info/stars](http://richardmavis.info/stars),
[http://richardmavis.info/snow](http://richardmavis.info/snow),
[http://richardmavis.info/malevich](http://richardmavis.info/malevich),
[http://richardmavis.info/whale-shark-skin](http://richardmavis.info/whale-
shark-skin)) but am definitely interested in other techniques.

~~~
codazoda
I really like minimalist abstract art. These are great generative art pieces
that fit the bill.

------
splittingTimes
You are on Linux. You want to typeset a book. Why not use (pdf)LaTeX?

~~~
fsloth
LaTeX is really good if someone needs lots of math, references and so on. It's
really bar none when it comes to technical writing. But it's so clunky in some
ways that it's more of a precision tool for this particular task than generic
layout software. I'm saying this as a person who has, in past lives, written
scientific material using LaTeX and done semi-professional graphic design and
layout work as well.

Oh god, in my time I tried to tweak LaTeX so I could reach the same aesthetics
as with InDesign. While doable (I'm sure), it's not really worth the effort.
LaTeX and its ways of working are very specific to that one context. It's the
Torx screwdriver for the Torx screws of technical and scientific publishing.
But a lot of layout work needs a Phillips head, a nail and a hammer, and so on.
While a Torx screwdriver surely can be used to pound nails, I would not
suggest it as an efficient tool.

Don't feed the LaTeX fetish. Some people like to do everything with it, just
like people like doing lots of odd things that bring particular aesthetic joy
to them, that would be completely impractical or intolerable to others.

LaTeX in a non-technical or non-scientific context is an eccentric quirk. I
love eccentric quirks and people who have them! I have many myself! But I
would not push my quirks to other people in any setting.

~~~
jhbadger
There's a lot more to LaTeX than just equations though. There are packages for
handling lots of things the humanities need too. Egyptian Hieroglyphs? Use
HieroTeX. Music? Use MusiXTeX. Really, think of anything that would be hard to
typeset manually and somebody's likely to have already created a LaTeX package
for it.

~~~
ygra
> and somebody's likely to have already created a LaTeX package for it

... or a dozen, each with different features, like tables.

------
bayesian_horse
Last time I checked (maybe 2 years ago?) there wasn't a good open source html
to pdf workflow. Specifically page-numbers and anything else involved with
paged media is a nightmare, the CSS standards in that regard are not
implemented. There is "Prince", but it isn't OSS and rather expensive.

Phantomjs (and its ilk) are based on browser engines and just don't support
this. Also I would love to be able to change layout or content based on where
particular elements turn up.

~~~
sanxiyn
WeasyPrint supports CSS Paged Media and is open source. Yes, anything based on
browser engines does not support CSS Paged Media.
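
For illustration, a small Ruby helper that emits the kind of stylesheet this refers to: an `@page` rule with a page size, margins, and a page-number counter in a margin box. `paged_media_css` and its defaults are hypothetical; the CSS itself uses standard Paged Media constructs, and how completely any given renderer honours them is worth checking against its documentation.

```ruby
# Return a print stylesheet that uses CSS Paged Media features
# (hypothetical helper; the defaults are illustrative only).
def paged_media_css(size: "A5", margin: "2cm")
  <<~CSS
    @page {
      size: #{size};
      margin: #{margin};
      @bottom-center {
        content: counter(page);
      }
    }
    h1 {
      break-before: page;
    }
  CSS
end

# File.write("print.css", paged_media_css(size: "A4"))
```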

~~~
chrismorgan
WeasyPrint looks to have progressed a lot since last I looked at it (when
there was no way I could use it at all, though I can’t remember the reason),
but looks to still be quite limited. A couple of things that spring to my
attention immediately: no flexbox, and no CSSOM (so that you can’t adjust the
document based on layout at all unless you can do it in straight CSS). Still,
in practice probably usable for what I was doing several years ago, with
similar limitations to Prince (which also has no CSSOM implementation, though
at least it has a JavaScript engine). But anything more advanced in the way of
fine page-dependent layout, I get the impression that WeasyPrint won’t be able
to do, while Prince is superb at handling such things.

~~~
bayesian_horse
One of the problems with CSSOM and similar things is the problem of iterative
layout: Let's say you have a list which should be split automatically on two
pages. You also want to add a table header on the second page. But now the
second part of the list doesn't fit on the page anymore. So you need to split
the list further, and add another header.

Basically every change to the CSS or DOM requires a reflow (or clever
optimizations to avoid that).
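
A toy model of that knock-on effect, as a sketch only: it assumes every list item is exactly one line tall and that a continuation header of fixed height is added to every page after the first. `paginate` is a hypothetical function, not how any real layout engine works.

```ruby
# Split items across pages of `page_height` lines. Every page after the first
# also carries a continuation header that uses `header_height` lines.
def paginate(items, page_height:, header_height:)
  raise ArgumentError, "header leaves no room for content" if header_height >= page_height

  pages = []
  rest = items.dup
  until rest.empty?
    room = pages.empty? ? page_height : page_height - header_height
    pages << rest.shift(room)   # take as many items as fit on this page
  end
  pages
end

# Eight items would fit on two pages of four lines, but the added header
# pushes the last item onto a third page.
p paginate((1..8).to_a, page_height: 4, header_height: 1)
# prints [[1, 2, 3, 4], [5, 6, 7], [8]]
```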

~~~
sanxiyn
I have a theory about how this should be solved in web standards, which in
turn has other uses besides printing, but well, I don't feel like I can affect
web standards in any meaningful sense.

The basic idea is to add (placeholder name) stopUpdating/resumeUpdating to
window, which can be polyfilled as a no-op. The semantics are that CSSOM view
methods are allowed to return the value when update was not stopped, or any
later value. That is, the current web standard forces you to do things "live".
The new methods give the option to do things in batch.

~~~
bayesian_horse
I think the approach of WeasyPrint and Prince, implementing a dedicated layout
engine for paged media, is better than making these things work in Browser
engines.

In any case, html/css for paged media should be mostly separate from website
code. "Printing out" web pages works in many cases, but it's crappy.

~~~
chrismorgan
The only two reasons why printing web pages out is lousy are that web
developers put little to no effort into it, and that the browser
manufacturers put little to no effort into it. I would _love_ the likes of
WeasyPrint and Prince to be rendered obsolete by one or more mainstream web
browsers. If any of them decided that it was a strategic priority, they’d get
a lot done very quickly. It’s just that there’s no compelling reason for them
to, while there is for the people behind engines with a specific purpose—and
so Prince is pretty safe in its position.

------
MarsAscendant
On the topic of web design rather than bookprinting:

I would suggest not to use element-heavy background designs for as simple a
case as this website's. I have a low-grade, student-level laptop, and it makes
scrolling noticeably lag, by a few FPS.

------
jedimastert
A random thought, but you could use a horizontal rule (`<hr/>`) to split
sections instead of making a paragraph and filtering. They're part of the
markdown syntax (using 3 or more hyphens, asterisks, or underscores) and you
probably wouldn't even have to filter them out. Just use CSS. I'm actually
pretty sure it's the semantic use case for them anyways. That's assuming, of
course, that you don't use them elsewhere for other reasons.
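
A small sketch of how existing sources could be migrated to that approach. It assumes section breaks are currently written as a paragraph containing only `%` (inferred from the `<p>%</p>` filter in the article's script); `use_thematic_breaks` is a hypothetical helper.

```ruby
# Replace lines that consist only of "%" with a Markdown thematic break,
# which Markdown processors render as a horizontal rule.
def use_thematic_breaks(markdown)
  markdown.gsub(/^%[ \t]*$/, "***")
end

# A "%" inside running text is left alone because the pattern is anchored
# to the start of a line.
```

The resulting rules could then be styled with an `hr` selector in the print stylesheet, with no sed step needed for section breaks.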

Other than the unsolicited advice, I really like the workflow. I did something
similar for almost all of my papers in college.

~~~
rmavis
Great idea! Thanks.

------
chrismorgan
Last time I seriously tried wkhtmltopdf (three or four years ago, at a guess),
it produced results entirely unsuitable for printing in quite a few ways:

• It completely mangled the kerning, like it was ignoring the font’s kerning
and then making it even worse by only placing characters to 1pt precision (at
600dpi, one dot is 0.12pt). (To clarify: I never actually measured it; this is
just my rough guess as to what may have caused it.)

• It was somewhere between agonisingly difficult and impossible to actually
get _precise_ sizes; print A4, for example, with your body carefully set up so
the widths add to the right amount, and the appropriate “don’t zoom” command
line argument, and it’d still mess it up (and subtle content changes could
make it better or worse). A container of `width: 15cm` could end up 15cm wide
if you were extremely fortunate, but was more likely to be 14cm, or 17cm, or
something like that. And it might vary from page to page.

• Pages didn’t really exist, in _layout_ terms, so that any sort of finesse of
where things should appear was just impossible.

• Probably worse, you could end up with the descenders of the bottom line of
text on a page at the top of the next page. I have a vague feeling I hit a
situation where a line could even be split in half, rather than just the
descenders, but that may have been printing from Chrome or Firefox at a
similar time.

• Its header/footer stuff was mildly limiting and fairly annoying to get
working properly (and made document sizing even more troublesome, too).

It was also very crummy for producing a PDF for screen use, as regards things
like links and tables of contents and other annotations.

Had it been just one or two of these things, I would probably have filed bugs;
but it didn’t look as though there was much interest in actually fixing
things, and it was so _very_ broken for any sort of precise, serious work,
that I just gave up.

Have things improved for wkhtmltopdf since then? I’d be interested to hear.

I found the state of the art for web-to-PDF conversion to be Prince
([https://www.princexml.com/](https://www.princexml.com/)) by an _enormous_
margin, with it producing absolutely superb results. Nonetheless, it does have
some limitations; most notably, in my opinion, CSSOM, so that the JavaScript
doesn’t interact with the layout at all. There were various other CSS and
JavaScript niggles that I hit too, but they’ve steadily been fixed over time
too. Bear in mind that Prince is made by a small team and is the _entire_ web
engine.

I would really like a vector graphics pipeline for Servo:
[https://github.com/servo/servo/issues/3788](https://github.com/servo/servo/issues/3788)

~~~
jahewson
Puppeteer works perfectly [https://medium.com/@raphaelstaebler/advanced-pdf-
generation-...](https://medium.com/@raphaelstaebler/advanced-pdf-generation-
for-node-js-using-puppeteer-e168253e159c)

~~~
sanxiyn
Nope, that uses Chromium, which does not support any CSS Paged Media.

------
geraldbauer
FYI: As a (free, open source) alternative there's also octobook [1]. See the
Yuki & Moto Press Bookshelf [2], for "real world" live examples using the
octobook themes.

[1]: [https://github.com/octobook](https://github.com/octobook)

[2]: [http://yukimotopress.github.io](http://yukimotopress.github.io)

------
_Codemonkeyism
I'm going to use prince xml for this and DocRaptor (not free) to finalize
without Watermark. Also wrote a small Go tool for creating counters and TOC.

------
JeanMarcS
Well, HTML comes from SGML, so I guess the loop is closed.

~~~
mch82
Yeah, kind of funny now that you mention it. Some professional page layout
software uses XML and XSLT, but the web branch of those technologies developed
more rapidly and also seems to be more concise.

------
karmakaze
I thought Markdown supports line breaks with two trailing spaces at the end of
the line. This gist[0] mentions that <br/> or a \ at the end of the line also
works.

[0]
[https://gist.github.com/shaunlebron/746476e6e7a4d698b373](https://gist.github.com/shaunlebron/746476e6e7a4d698b373)

~~~
banku_brougham
A good question from that gist thread:

>Why does markdown do this? If I want two lines to run consecutively I won’t
introduce a newline.

~~~
laumars
A bit of an educated guess but I believe the answer stems from markdown
originally targeting terminals, which would line wrap but not word wrap. So
you could get ugly output like this:

    
    
        hello, wor
        ld!
    

(albeit this is assuming the terminal is only 10 characters wide but I hope
you get the point)

So it used to be common for developers and sysadmins to manually wrap to 80
characters (a habit I still regularly catch myself doing even now). Obviously
with GUI readers and variable-width characters, you wouldn't want that 80
character manual wrap honoured.
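
To make the difference concrete, here is a short Ruby sketch contrasting the two behaviours. `hard_wrap` and `word_wrap` are hypothetical helpers written for this illustration, and the 10-character width mirrors the example above.

```ruby
# Break text every `width` characters, ignoring word boundaries, the way a
# display that only wraps lines would show it.
def hard_wrap(text, width)
  text.scan(/.{1,#{width}}/)
end

# Greedy word wrap: start a new line when the next word would not fit.
def word_wrap(text, width)
  lines = []
  line = String.new
  text.split.each do |word|
    if line.empty?
      line << word
    elsif line.length + 1 + word.length <= width
      line << " " << word
    else
      lines << line
      line = word.dup
    end
  end
  lines << line unless line.empty?
  lines
end

puts hard_wrap("hello, world!", 10)  # prints "hello, wor" then "ld!"
puts word_wrap("hello, world!", 10)  # prints "hello," then "world!"
```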

~~~
jgtrosh
Wrapping sentences into short functional parts (motivated by more or less
arbitrary limits like 80 characters) helps having easily editable and readable
text.

~~~
laumars
Apologies but I'm not sure I understand your point. Are you able to elaborate?

~~~
jgtrosh
As others, I've grown accustomed to wrapping lines to fit in small terminals,
and I started avoiding breaking the line randomly. Instead I try to break
after self-sufficient parts of sentences so that lines flow naturally after
one another. That usually means after a comma, before a connector word, or
after an enumeration. It's not always possible, but I find that when I need to
rewrite a sentence to be naturally breakable it usually reads better.
Furthermore, each line often ends up expressing a whole idea, which means
working on that idea usually translates to simple line operations, which fits
nicely in an expressive text editor like Vim.

------
adamcccc
We write everything in Markdown and then convert it to a book with Leanpub
([http://leanpub.com](http://leanpub.com)). Fantastic tool!

