
Gov.uk content should be published in HTML and not PDF (2018) - rahuldottech
https://gds.blog.gov.uk/2018/07/16/why-gov-uk-content-should-be-published-in-html-and-not-pdf/
======
NohatCoder
PDF seems to fit a world-view where the highest objective of a document is to
be printed on paper.

From a machine-parsing perspective PDF files are a nightmare. Chunks of text
may be broken anywhere, mid-sentence, mid-word. These chunks may appear in the
document in any order.

Spaces may be encoded as spaces, or they may be created a number of other
ways, like by positioning chunks, or setting character spacing per character.

The mapping from code point to glyph does not need to be pure Unicode, a PDF
document may contain a custom font with additional glyphs.

This is all stuff I learned by trying to parse a limited set of PDFs found in
the wild.

All of these gotchas are by the way completely PDF/A compliant.
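
A minimal sketch of what that forces a parser to do, assuming each chunk has
already been extracted with a position and an advance width. The `Chunk` type,
the `rebuild_lines` helper, and both thresholds are hypothetical, chosen only
for illustration; real PDFs need per-font metrics and are messier than this.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A text fragment plus the position a renderer would draw it at."""
    text: str
    x: float      # left edge
    y: float      # baseline; larger values are higher on the page
    width: float  # advance width of the fragment


def rebuild_lines(chunks, line_tolerance=2.0, space_gap=1.5):
    """Put chunks into reading order and infer spaces from horizontal gaps."""
    # Top of the page first, then left to right.
    ordered = sorted(chunks, key=lambda c: (-c.y, c.x))
    lines, current = [], []
    for chunk in ordered:
        # A baseline that moved by more than the tolerance starts a new line.
        if current and abs(chunk.y - current[0].y) > line_tolerance:
            lines.append(current)
            current = []
        current.append(chunk)
    if current:
        lines.append(current)

    result = []
    for line in lines:
        line.sort(key=lambda c: c.x)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # No space character was stored; a visible gap is the only hint.
            if gap > space_gap:
                text += " "
            text += cur.text
        result.append(text)
    return result


# Chunks in file order: split mid-word, out of order, no space characters.
sample = [
    Chunk("ment", 130.0, 700.0, 24.0),
    Chunk("second line", 100.0, 686.0, 60.0),
    Chunk("docu", 106.0, 700.0, 24.0),
    Chunk("A", 100.0, 700.0, 4.0),
]
print(rebuild_lines(sample))
```

Reading `sample` in file order would give "mentsecond linedocuA"; only the
positions recover the intended text.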

~~~
moralestapia
>From a machine-parsing perspective PDF files are a nightmare.

Can someone with experience care to explain why? Does it have to do with each
letter having an absolute position on the document? I have no clue, to be
honest.

~~~
Kalium
You have to essentially render the document yourself in order to figure out
what the order of chunks is. Then you _might_ be able to extract content from
the chunks you're interested in - or not. A given zipped chunk might be
literally anything.
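
As a rough illustration of the "zipped chunk" point: PDF streams are commonly
stored Flate-compressed (zlib), so nothing about their content is visible
until they are inflated. The content stream below is hand-written for this
sketch, not taken from a real file, and `inflate` is a hypothetical helper.

```python
import zlib

# Hand-written content stream: two text-showing operators that split one word.
raw_stream = b"BT /F1 12 Tf 72 700 Td (docu) Tj 24 0 Td (ment) Tj ET"

# Roughly what a Flate-compressed stream would hold on disk.
compressed = zlib.compress(raw_stream)


def inflate(data: bytes) -> bytes:
    """Undo the Flate (zlib) compression applied to a stream."""
    return zlib.decompress(data)


print(compressed[:8])       # opaque bytes
print(inflate(compressed))  # drawing operators, still not plain prose
```

Even after inflating, the result is a list of drawing instructions; a stream
could just as well hold an image or a font, which is why extraction is a
guessing game.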

Didier Stevens is the best expert of PDFs that I know of:
[https://blog.didierstevens.com/programs/pdf-tools/](https://blog.didierstevens.com/programs/pdf-tools/)

------
haunter
Why not both? Wasn't the point of the PDF/A ISO standard to use it for
archiving? I always felt PDF is better for content like this than HTML which
can change dynamically

[https://en.wikipedia.org/wiki/PDF/A](https://en.wikipedia.org/wiki/PDF/A)

VeraPDF also exists for PDF/A validation
[https://github.com/verapdf](https://github.com/verapdf)

~~~
crooked-v
> I always felt PDF is better for content like this than HTML which can change
> dynamically

It's not that hard to edit a PDF file either, though.

~~~
tonyedgecombe
That's not the issue, the problem is we don't know what a document would look
like with next year's browser or if it will render at all. That isn't an issue
with PDF.

~~~
wopian
The same applies to PDFs too. There are hundreds of PDF readers and not all of
them support every feature, and features get deprecated as well. You could run
Flash in PDFs, so archived PDFs that utilized that will be broken.

~~~
icebraining
Hence the aforementioned PDF/A, which is a version with a fixed set of
features:
[https://en.wikipedia.org/wiki/PDF/A](https://en.wikipedia.org/wiki/PDF/A)

------
nattaylor
I painstakingly convert some PDFs from Boston and Massachusetts government
into HTML and people prefer them for the flexibility and accessibility.

My workflow is to use pdftohtml then edit the result as Markdown and
screenshot the figures before converting to HTML with pandoc.
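
A sketch of how the two automated ends of that workflow could be scripted,
assuming `pdftohtml` (poppler-utils) and `pandoc` are installed. The flags and
file names are assumptions, not the poster's actual invocation, and the manual
Markdown editing and screenshots happen between the two steps.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf: Path, out_base: Path) -> list[str]:
    # Assumed flags: -noframes writes plain HTML without a frameset,
    # -i skips images (the figures are screenshotted by hand instead).
    return ["pdftohtml", "-noframes", "-i", str(pdf), str(out_base)]


def pandoc_command(markdown: Path, html: Path) -> list[str]:
    # -s asks pandoc for a standalone HTML page rather than a fragment.
    return ["pandoc", "-f", "markdown", "-t", "html", "-s",
            "-o", str(html), str(markdown)]


def run(command: list[str]) -> None:
    """Run one step, raising if the tool exits with an error."""
    subprocess.run(command, check=True)


# Step 1: rough HTML from the PDF (then hand-edit the result into Markdown).
first = pdftohtml_command(Path("letter.pdf"), Path("letter"))
# Step 2: final HTML from the cleaned-up Markdown.
second = pandoc_command(Path("letter.md"), Path("letter.html"))
print(first)
print(second)
# run(first); run(second)  # uncomment once both tools are available
```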

I have been pushing to publish in HTML from the get-go, in part by citing this
blog, with some success.

For example, the Massachusetts Secretary of the Office of Environmental
Affairs recently published a typed letter as a raster PDF, which I converted
into
[https://nattaylor.com/eastboston/blog/2019/3247-2017-logan-a...](https://nattaylor.com/eastboston/blog/2019/3247-2017-logan-airport-espr/) from
[https://eeaonline.eea.state.ma.us/EEA/emepa/mepacerts/2019/s...](https://eeaonline.eea.state.ma.us/EEA/emepa/mepacerts/2019/sc/eir/3247%20-%202017%20Logan%20Airport%20ESPR.pdf)

------
fock
Best idea would probably be a download option for good, old-fashioned XML and
providing a nice XSLT stylesheet for transformation into HTML/processing to
PDF. This approach would ensure that you still can reliably save/read the
content (compared to responsive JS-nightmares) in the future...

~~~
Tijdreiziger
I think it would be best to have plain HTML available for download (without
any JS or any other cruft). Even .epub (e-books) is just zipped HTML with some
extra parts.
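
To make the "zipped HTML" point concrete, here is a small sketch that packs one
XHTML page into an EPUB-style ZIP and reads it back with nothing but the
standard library. The helper name and file paths are hypothetical, and the
result is not a valid EPUB: a real one also needs `META-INF/container.xml` and
a package document (the "extra parts").

```python
import io
import zipfile


def build_minimal_container(xhtml: str) -> bytes:
    """Pack one XHTML page into an EPUB-style ZIP held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # EPUB expects the mimetype entry first and stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("OEBPS/chapter1.xhtml", xhtml,
                         compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


page = "<html><body><h1>Notice</h1><p>Plain HTML, no JS.</p></body></html>"
data = build_minimal_container(page)

# Any ZIP reader can get the HTML back out.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
    recovered = archive.read("OEBPS/chapter1.xhtml").decode("utf-8")

print(names)
print(recovered == page)
```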

------
rahuldottech
Previous discussion:
[https://news.ycombinator.com/item?id=17541045](https://news.ycombinator.com/item?id=17541045)

------
HuangYuSan
I really like the work of the UK's GDS (Government Digital Service)

~~~
toby-
Me too. Focusing on digital services was, and remains, one of the few things
the UK government has got right in recent years, IMHO - and excelled at.

~~~
pbhjpbhj
Is it anything to do with the government _per se_, or is it driven entirely
by civil servants?

~~~
tonyedgecombe
Francis Maude was the MP responsible for a lot of that. He was the only
politician I ever heard talk about lean practices in software.

[https://en.wikipedia.org/wiki/Francis_Maude#Return_to_Govern...](https://en.wikipedia.org/wiki/Francis_Maude#Return_to_Government,_2010–2016)

------
Keverw
Interesting about PDFs. I think they still have some uses though, like scanned
documents (like court records), downloadable books... I know some software
with invoicing for say hosting might offer a downloadable version created with
some script, but just spitting out a web page is probably easier in that case,
and still printable.

However I know some PDFs don't even let you select text, they just scan in
something it seems. I noticed that even for some government sites, like some
city had their ordinances in PDFs that looked like they were typed up on a
computer or typewriter, signed by someone, scanned in and re-uploaded.
Selectable text is not only useful for blind people, but also for searching the
page or copying/pasting parts of it for research. It's probably a bit of an SEO
boost too, as I'm not sure robots try to parse text in images, even though it's
probably very possible with machine learning. I know there have been lawsuits
over government websites having to be accessible, but it seems like many cities
and even some higher-level sites aren't taking this seriously. Not sure if they
are just unaware, under budget, or maybe these are just older sites that would
be better if rebuilt as more modern ones. I remember reading somewhere on HN
that some city (I forget which one) just deleted their website and replaced it
with a plain text web page after people complained, so now if you need anything
you have to go to city hall. That basically negates having a website in the
first place.

------
Phylter
While PDF is supposed to be an open standard I'm noticing more and more
functionality in PDF files that isn't supported by readers who implement the
open standard. I have a project right now to find out why Adobe and Chrome are
so different and, though it's obvious to me, the loss of functionality because
Chrome doesn't support certain things must be explained to those to whom it is
not so obvious.

~~~
Tijdreiziger
IIRC, implementing the PDF standard requires a JavaScript implementation. To
many readers, that's not worth it.

------
ilamont
My municipal government insists on sending out via email attached PDFs or
links to download PDFs.

Some of the PDFs are prepared by the government, others by vendors working for
the government (probably based on specs that called for this experience). It
ignores the facts that many (if not most) people read on small screen devices,
don't have logon credentials handy, or even know where to find downloaded
documents. It erodes civic engagement and leads to real problems when policies
aren't followed or generate backlash because people didn't know about them
prior to changes.

The simplest thing to do would be just to send the damn data as plaintext
email. The mayor actually gets this -- her newsletters are always in the body
of the email itself, never as a PDF attachment or link to download a PDF. Yet
her administration is still stuck on PDFs for everything.

------
onei
It was refreshing to get a notice about cookie settings which when ignored,
accepted or rejected remained readily accessible (on mobile at least).
Normally, you get to reject setting once and then it vanishes meaning you
can't update your choice either way. It was a pleasant surprise to be able to
change my mind.

~~~
NullPrefix
>Normally, you get to reject setting once and then it vanishes meaning you
can't update your choice either way.

Funny how you phrase it. Normally dark websites let you accept the setting
and never show it to you again, so you wouldn't change your mind. If you
reject it - you get a redirect or a nagging screen which lets you change your
mind and accept.

~~~
tempestn
Having to have a separately implemented version of this on every website in
the world is the dark pattern. Browsers already let you white- or black-list
individual domains for cookies; it's crazy that every website design has to
accommodate this redundant feature that is never going to be possible to
implement perfectly for all use cases.

------
calewis
When I worked at GDS the aim was to use PDFs, as they were considered to have a
long shelf life, whereas HTML and the web in general can evolve and leave
some people without a way to read it. I wonder what's changed?

~~~
kick
PDFs aren't and have never been accessible, and PDFs aren't and have never
been reactive. Administrations are just finding out how frustrating the format
is.

------
macspoofing
No. Terrible idea. The authors take the perspective that content should be
easy for browsers to render. That's important but shouldn't be the only
consideration. Browsers can render HTML well, but browsers have incredibly
complicated rendering engines. This makes HTML non-portable and non-exportable,
not printable, not sharable. I'm sure for each one of those you
can create a workaround to bridge the gap but the big picture is that HTML
content tends to be dynamic and rendered by a heavy server component from
database which makes creating a true standard actually quite hard. HTML
actually really sucks as a portable document standard without some extra
semantic definition and a container format. So to make HTML workable you have
to do something like what docx and odf do: XML definition+resources in zip
container ... which now look a lot like PDF.

>We also intend to build functionality for users to automatically generate
accessible PDFs from HTML documents.

How about the other way.. PDFs -> HTML documents.

~~~
qqii
> We also intend to build functionality for users to automatically generate
> accessible PDFs from HTML documents. This would mean that publishers will
> only need to create and maintain one document, but users will still be able
> to download a PDF if they need to.

Oh I see you've added that. The problem with PDF to HTML is that you're trying
to fit something static back into something dynamic. I also don't see how PDF
rendering is any less complicated. There are many browser implementations that
can easily render gov.uk without issues.

~~~
macspoofing
This raises the question of how they create these 'HTML documents'. As far as
I'm aware, there really isn't an HTML document format. More than likely, these
documents will be created by some web-based system (maybe even proprietary and
cloud-based, making portability extremely hard) and simply exported as PDF
later. That's how every web-based content management system does it today.
It's very convenient _right now_, but when it comes to document maintenance
for the next few decades, I'm not sure that's better than something like PDF
or ODF.

~~~
petepete
The code is available on GitHub, you can see for yourself. It's mostly written
in a customised markdown variant called Govspeak[0].

You appear to be jumping to ridiculous conclusions. Decisions on how we
deliver content are made on the back of a _huge amount_ of research. It's more
accessible than it's ever been, other governments and organisations are
following our lead.

To suggest PDF, ODF or other crippled, inaccessible, OS-constrained formats
over HTML - something everyone can read with any of their devices - shows you
need to do more research before posting your ill-informed views.

[0]
[https://github.com/alphagov/govspeak](https://github.com/alphagov/govspeak)

Edit: Also, web pages have plenty of other advantages. Indexed by every search
engine, bookmarkable/shareable headings. What other medium offers that?

