Deurbanising the Web [pdf] (lab6.com)
496 points by ColinWright 16 days ago | 413 comments



* PDFs are self-contained and offlineable

HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
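
A minimal sketch of the "Base64 your images, put your CSS in the page" step, assuming the assets sit next to the HTML file. The helper names (`to_data_uri`, `inline_assets`) are made up for illustration, and the regexes only handle simple double-quoted attributes with `rel` before `href`; a real page would want a proper HTML parser.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data: URI."""
    mime, _ = mimetypes.guess_type(path.name)
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{payload}"


def inline_assets(html: str, base_dir: Path) -> str:
    """Inline local <img src> files and <link rel="stylesheet"> files.

    Naive on purpose: remote URLs and existing data: URIs are left alone,
    and anything the regexes do not recognise is passed through unchanged.
    """

    def replace_img(match: re.Match) -> str:
        src = match.group(2)
        if src.startswith(("http:", "https:", "data:", "//")):
            return match.group(0)
        return f"{match.group(1)}{to_data_uri(base_dir / src)}{match.group(3)}"

    def replace_css(match: re.Match) -> str:
        href = match.group(1)
        if href.startswith(("http:", "https:", "//")):
            return match.group(0)
        css = (base_dir / href).read_text(encoding="utf-8")
        return f"<style>\n{css}\n</style>"

    # Swap each local image reference for an embedded data: URI.
    html = re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', replace_img, html)
    # Swap each local stylesheet link for an inline <style> block.
    html = re.sub(
        r'<link\b[^>]*?\brel="stylesheet"[^>]*?\bhref="([^"]+)"[^>]*?>',
        replace_css,
        html,
    )
    return html
```

Run over a page with a local stylesheet and a local image, this returns one string that can be saved as a single `.html` file with no sidecar directory.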

* PDFs are files

HTML is files

* PDFs are decentralised

This should be "PDFs can be decentralised". PDFs aren't inherently any more decentralised than any other kind of file, including HTML.

The store is the thing that becomes decentralised, not the content.

* PDFs are page-oriented

HTML can be page-oriented. Simply build your website with pagination. PDFs can also be abused to have hugely long pages. Bad UX can be encapsulated in any medium.

* PDFs used to be large (bla bla bla Javascript weighs a lot)

Nope, PDFs are still objectively larger than the equivalent HTML. PDFs don't have any dynamic interaction, rip all that out and produce the HTML of yesteryear and your HTML will be tiny in comparison to the PDF.

Edit: I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?


When you find a page - inherently a document-oriented term - like an article, blog post, how-to, or project writeup that's interesting or useful, and you want to make sure it's available to you later, what do you do?

Do you save the HTML, CSS, and Javascript, and hope that it works offline? I used to use the "Save page as..." tool back in the early 2000s, but it's become less and less useful, with too many dysfunctional disappointments.

No, I cut out some junk I don't need with the Printliminator [1] bookmarklet, then I do a *print-to-PDF.* This gives me a file. I can save the file, back it up to my NAS, search for it later, keep it with other files from a project where it was useful, and otherwise hang onto it. This is so common, in fact, that it's gone from being an obscure thing you could do with a Postscript-to-PDF converter or (before the adware/Ask toolbar scandal) by installing the CutePDF virtual printer, to a built-in feature. Modern OSes bundle a PDF printer, and print dialogs understand that you want to "Save as PDF". Google Docs and Office 365 editors allow downloading a document as a PDF.

I totally agree that a dynamic, interactive page or a comment section is not compatible with this model of usage. There's a lot of consumption of endless feeds, and a lot of one-time video views that also don't make sense to save as offline files. However, the web for creators, where people write articles that are worth hanging onto, has a definite place for PDFs.

[1]: http://css-tricks.github.io/The-Printliminator/


> When you find a page [...] and you want to make sure it's available to you later, what do you do?

Instead of doing a bad and lossy job of archiving the page myself, I notify† our friendly neighbourhood archivists at the Internet Archive of the page; and they then do the best, most lossless job of preserving the page that they're able, given their cumulative experience.

http://blog.archive.org/2017/01/25/see-something-save-someth...

As a side-benefit, they also then take care of keeping the archive they've made around and available online in perpetuity, with no additional marginal effort on my part. The same can't be said for something in my own "private collection."


This may not be well-known, but archive.org can and does remove pages / sites from the archive. Authors can request this, site owners (separate from the authors) can request this. There may be others who can request this.

Just an FYI. If there are critical sites you want copies of, I'd recommend making your own copy. I've lost access to important pages / sites twice before taking this to heart.

Edited for clarity


There is value in having a personally curated, offline collection of documents. You can search, annotate or otherwise manipulate it to your heart's content, all without having to be connected.

Of course the Internet Archive serves other purposes for which it is (currently) irreplaceable.


Zotero is much better for this than the too-fiddly print-to-PDF workflow described in the earlier comment.


There's also opportunity cost in spending time maintaining, indexing, annotating your own archive of documents.


> in perpetuity

Hopefully it really is around a very long time, but the world is unpredictable and things change. It's great to enhance the Internet Archive, but you can bet I'm keeping my local copy too. Just in case.


That's suboptimal as well. The site could come out with a new robots.txt file which is just "User-agent: *" followed by "Disallow: /", and everything already indexed by the Internet Archive is now inaccessible to you.
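
To show what that two-line robots.txt means to a crawler that honours it, here is a small sketch using Python's standard `urllib.robotparser`. The agent name `example-archiver` and the URL are placeholders, and the sketch says nothing about how the Internet Archive itself applies robots.txt; it only shows how the rules parse.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, agent: str = "example-archiver") -> bool:
    """Return True if the given robots.txt text forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file as an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# The blanket ban described above: every agent, every path.
BLANKET_BAN = "User-agent: *\nDisallow: /\n"

# An empty Disallow line means "nothing is disallowed".
ALLOW_ALL = "User-agent: *\nDisallow:\n"
```

With these two strings, `is_blocked(BLANKET_BAN, ...)` is true for any URL on the site, while `is_blocked(ALLOW_ALL, ...)` is false.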


Do you never get online receipts that you need to keep a copy of?


I don't think I've ever had such a thing that only appeared as a web page, without being emailed to me. To me, the email is the primary-source document in that arrangement.


There was an interesting discussion about this a year ago:

https://news.ycombinator.com/item?id=23228098

——

This is still not as powerful as my one, simple trick to handle all bookmarks, ever: Print to PDF. I've been doing it since last century, and I have 10's of thousands of PDF's of every single web page I've ever found interesting, sitting right there in a directory on my computer

——

Including the suggestion that was brought up to use ripgrep to search in the pdf text content.


Sometimes if I'm researching a topic I'll dig up a big number of newspaper articles and want to print them and read them away from the screen while scribbling notes etc, but on a lot of websites banner ads or footers with copyright statements can really mess it up.


I actually dislike HTML per se, but the only two benefits I see for PDFs in the general case are:

- In my experience, it's a little harder and rarer to make PDFs utterly incompatible with different means of viewing them, and it generally requires more overt (if perhaps slightly unintentional, at times) sadism to make that happen.

- PDFs can do some things HTML can't (easily, at least) with document design -- though those things are generally things that would be disallowed in our new "deurbanized" PDF-based web replacement.

Everything else that comes to mind goes the other way, including the fact that the viewing-mechanism incompatibility thing can be even worse with PDFs, even if it's more rare for that to happen at present, and if PDFs became the new standard for the web I'm pretty sure that relative rarity would evaporate anyway. Let's also not forget that HTML can also do some things PDFs can't (as easily, at least) do.


> Do you save the HTML, CSS, and Javascript, and hope that it works offline? I used to use the "Save page as..." tool back in the early 2000s, but it's become less and less useful, with too many dysfunctional disappointments.

I'm too lazy, so I just tend to use SingleFile these days...



I've used chrome's ability to save a single .mhtml file that contains all the resources for this purpose in the past.


You got nerd sniped by the HTML vs. PDF format thing and missed the entire point of TA:

> Isn’t it a good thing that we enjoy rapid progress? To the extent that we get to enjoy things like YouTube and sandspiel, yes! But to the extent that we want the internet to be a place where we can work and live and think and communicate free of malware, surveillance, dark patterns and the insidious influence of advertising, the answer is, empirically, sadly, no. The web has become ad-corrupted hand-in-hand with growth in technological capability, and the symbiotic relationship between web and browser means they feed on each others’ churn. Ads demand new sources of novelty to put themselves on, so the web expands continually, the specs grow in complexity, the browsers grow in sophistication, the barrier to entry grows ever higher, the vast cost of it all demands more ad revenue to fund it... and thus the perpetual motion machine is complete.


The author does identify a problem, and so you want to focus on that. That's fine. There is the issue of triviality, however.

The problem described is widely felt, and also widely discussed. We already know this stuff to be a problem. For the piece to be worthwhile, then, it should do something that is not present in the other instances where the topic has been raised. It should articulate (or at the very least exhibit, without necessarily articulating) a solution for us. It doesn't. A bad remedy to a genuine problem does not yield a solved problem.


The article is called "Deurbanising the Web", and its thesis is:

- Publish in static file formats.

- Date and hash your work.

- Stop spying on your users.
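
The "date and hash" point is the one that is easy to sketch. This assumes SHA-256 and a one-line stamp format, neither of which is taken from the article; `stamp` and `verify` are illustrative names.

```python
import hashlib
from datetime import date


def stamp(document: bytes, published: date) -> str:
    """Return a one-line stamp: ISO publication date plus SHA-256 of the bytes."""
    digest = hashlib.sha256(document).hexdigest()
    return f"{published.isoformat()}  sha256:{digest}"


def verify(document: bytes, stamp_line: str) -> bool:
    """Check that `document` still hashes to the digest recorded in the stamp."""
    expected = stamp_line.rsplit("sha256:", 1)[1]
    return hashlib.sha256(document).hexdigest() == expected
```

Publishing the stamp alongside the file lets a reader confirm that a copy fetched from a mirror is byte-for-byte the one the author released. This works for any static file, HTML included.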

HN is a discussion forum, not project planning software. Not everything has to "yield a solved problem". Are you really setting the bar at "design a technology stack for replacing HTML/CSS/JS"? That's way, way too high.


You say that its thesis is (in part) to generally publish in static file formats, but that's not quite accurate. The piece specifically touts PDF/A as the best format and makes several arguments against the use of html/css. I agree that they're making a broader point than just "use pdf," but "use pdf" is definitely a large part of it.


Those points can be trivially met with static HTML and something like IPFS, and you can still download HTML for local storage and viewing. You can even print to PDF if you really want to do so. Meanwhile, PDFs also allow dynamic files, don't require dating and hashing, and can be used to spy on users or deliver malware.

EDIT: Oh, yeah, and static file formats don't necessarily have to mean static document formatting when viewing -- unless you're using PDFs, which tend to break useful stuff like reflowing for paginated documents (one of the worst things about even simple PDFs).


IPFS solves this well.


The author brings a solution, it is to publish documents in PDF instead of HTML.


"A bad remedy to a genuine problem does not yield a solved problem."


PDF is a great way to publish documents, which is what the web was originally for.

The web has become a bad remedy to some distributed software problems.


Why do you feel PDFs are a bad remedy? PDFs are the usual way I absorb information.


No, the entire point of the article is to convince people to use PDF/A. Which I find comical, since you have to go out of your way to check if a PDF is PDF/A compliant. If the web was run on PDFs, there's no reason why any big corporations would abide by those rules, and it'd be just as messy as HTML is today.


You've also been nerd sniped. TA goes on and on about surveillance capitalism and the attention economy. Weird, for an article that's supposedly convincing engineers of the merits of one file format over another.


Did you read beyond the "How did it come to this?" section? TA goes on and on about web standards and the need for PDF/A.

Edit: If the article _was_ all about surveillance capitalism, then it wouldn't be worth upvoting as actionable solutions are much more valuable than preaching to the choir.


If you don't think it's clear that the author's advocacy of PDF is a means to an end, subservient to their desire to dismantle surveillance capitalism and the duopoly that Google/Apple have on the web, I don't know where to go from here.


why don't we have both?


I think you're the one who got nerd-sniped here. 1.5 of the 13 pages in this PDF are about surveillance capitalism. The rest's about web standards.


What in the nine hells is nerd sniping?


It's when you trick a technically-minded person into jumping down a rabbit hole of a technical problem/controversy. Here it's PDF vs. HTML, but other classic nerd snipes are UTF-8 vs. anything else, "fixing" election tech, etc.

I tackled the premise. I think addressing the premise is the logical place to dismantle an argument.


But, again, the premise is not that "as a file format, PDF is better than HTML". The premise is: because HTML is two-way, it enables surveillance capitalism and allows bad actors to monopolize the attention economy. The author wrote it thus:

> Sure, you can write good HTML. I won’t argue with that. And if you’re writing good HTML, good for you. But HTML is a dual-use technology, the bad guys are dual-using it an awful lot, and I feel that the stone age still has a part to play in the progression of the information age.

The part where you engage with this is where you write:

> I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?

Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?


> > Sure, you can write good HTML.

A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.

> Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?

I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.


> A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.

Oh, yeah I'm not on the PDF train. That's wild. I'm more of a Markdown or Gemtext advocate, or even LaTeX.

> I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.

Yeah, projects like IPFS (which you reference above) are working towards this, but JavaScript still works over IPFS. Plus, fingerprinting techniques are pretty bonkers. Most of it comes down to JS and various state you keep on your local machine (cookies, flash cookies, etc.), but I think you need that. How do you maintain a session with a peer without some kind of token/cookie?


> Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?

Yes, it's called Tor. However, legislation is where we should start. Crippling/abandoning an incredibly useful technology which works very well just because it's often used nefariously seems to be a bit of an overreaction.

Until then, stop using social platforms, use an ad blocker, and use VPN if you really care about "surveillance capitalism".


The classic mistaking the example for the topic.


Saying HTML can be offlineable is like saying C can be provably terminating. There's a subset of programs where that's true, but it's not inherent to the form. A PDF is inherently self-contained; standard web technologies are not. When you open the page and it's a PDF, it gives you certain guarantees. When you open it and it's HTML, you have to do further investigation.


Firstly, C being provably terminating is a problem dealing with the full body of C programs written in the world. The OP is dealing with their own self-published content. That's a different problem: if your analogy held it would need to be limited to proving that a subset of C programs written by the author terminate.

Secondly, the level of difficulty in making HTML offlineable is many orders of magnitude simpler than your C analogy: there's really no comparison. For the OP we only need to make HTML documents that they have authored themselves offlineable and yet people have written general purpose tools to do this automatically for most webpages. This is not a hard problem.

TL;DR your analogy is absurd.


This is a helpful post because it gets to the heart of the difference. Many people are saying "if you do HTML in a particular way, you get the same benefits." I'm asking "what's inherent to the form?" That's exactly the point about C--you can write it in a way that's provably terminating, but it's not guaranteed. Consider the consumer's perspective.

When I land on a page that's a PDF, I know certain things--I can easily save it and read it later. How do I know that? Not because I have read the PDF spec, or know that much about it, but because of my experience as a consumer of the web.

When I land on an arbitrary web-page, do I know the same thing? No. I don't know what the page is doing, I don't know what my browser will do when I try to save the page. When I save this page, I have the option to save HTML only, or a complete web page. Will the complete page actually work? I go into the source, and there's a link to the javascript (which is saved locally). Does rendering the page rely on that javascript? Does that javascript do xhr or fetch calls? Since it's Hacker News, I suspect the answer is no. However that's not inherent to the medium.

There are better ways to archive the content of even dynamic JS heavy pages, but they are not things that you learn as an average user of the web.
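
For what it's worth, the "will the saved page actually work?" question can be partly answered mechanically. A rough sketch, using only the standard `html.parser`, that lists what a page would pull in from outside itself. `ExternalRefFinder` and `external_refs` are made-up names, and this is a lower bound on the page's dependencies: it cannot see requests made from script code (xhr/fetch) or from `url()` inside CSS, and it counts every `<link href>` even when the link is not a stylesheet.

```python
from html.parser import HTMLParser


class ExternalRefFinder(HTMLParser):
    """Collect URLs that a page references through common resource-loading tags."""

    # Tag name -> attribute that names the resource to load.
    FETCHING_ATTRS = {
        "script": "src",
        "img": "src",
        "iframe": "src",
        "link": "href",
    }

    def __init__(self) -> None:
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        wanted = self.FETCHING_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # data: URIs are already embedded in the page, so they are not external.
            if name == wanted and value and not value.startswith("data:"):
                self.refs.append(value)


def external_refs(html: str) -> list:
    """Return the resource URLs referenced by `html`, in document order."""
    finder = ExternalRefFinder()
    finder.feed(html)
    finder.close()
    return finder.refs
```

An empty result suggests the file is at least plausibly self-contained; a non-empty one tells you which files "Save page as" would have to get right.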


It's possible to write PDFs that don't "work" (for some useful definition of "work" similar to the case with HTML) offline. Please stop pretending that's not true.

The reason offline utility tends to be true more often for PDFs is that PDFs are not generally regarded as the preferred online-default format of choice, which is in turn a matter of social effects rather than technical capacity. Reverse the socially accepted roles of the two document formats and watch the same complaints get made against PDFs as you're making against HTML. I'd bet money the "normal" state of affairs would remain the same in terms of the perceived benefit/detriment allocation between online/offline formats; only which format was considered which would have changed.

. . . but then all the web would be even heavier documents, and even less customizable for local viewing, thanks in part to that pagination and strict formatting situation.


It's possible, but it takes work. I can't remember the last time a pdf did something unreadably weird, usually my only gripe is with something that's a scan of an old document but whoever turned it into PDF didn't do OCR.


I don't really follow. How does this author converting their entire site to PDF help readers/visitors/users?

The original HTML site[0] was printable as PDF, and save-able as both HTML and "Web page, complete", all of which result in a well-formatted & readable offline experience. (It was also responsive: very readable on mobile, but that's an aside).

The new PDF site is not accessible to some, difficult to read on mobile, and interacts poorly with all of the norms web users are accustomed to (back navigation, anchors, etc.)

[0] https://web.archive.org/web/20130127175816/http://www.lab6.c...


It's the difference between "this thing has X property" (termination or able to save for offline reading) and "this thing _obviously_ has X property, in a way that you can tell without any expertise, or doing any investigation".

How important this is to users, or whether it is worth it is something I've not commented on, but it is a difference.


Yes. The sort of discussion happening every day between a product manager and engineers.

hyperpage's analogy would work if the property was "avoids undefined behaviour", rather than "avoids nontermination". When we encounter a webpage, we are being expected to execute potentially complex, well-being threatening code whose behaviour is about as easy to predict as obfuscated C.


True but again only if we're talking about parsing the web. This is about HTML files the author is producing themselves.


PDFs are capable of the same issues.


I don't buy that the problem with the web is that HTML is not inherently offlineable. HTML may not be inherently offlineable but it can be. PDF isn't inherently a web friendly format, but it can be. There really isn't any good argument for PDFing the web.


Print the page to PDF.


> Print the page to PDF.

Even that usually sucks nowadays, because web developers don't care anymore. Probably 75% of the time before I do that, I have to go into the dev console to delete overlay elements that obscure content and garbage that will waste 10 pages (e.g. grossly oversized images, related article recommendations, etc.).

There was a time when most websites had a print view that gave you a simplified html page that worked well, but I think most of those are gone now. Now it's all some print "media-type" CSS that no one ever put the time in to do properly or keep up to date.


> When you open the page and it's a PDF, it gives you certain guarantees ….

I think that this is a lot less true than we're used to thinking. The PDF spec contains a lot more interactive capabilities than I think most people realise. (It supports JavaScript!) We're not used to seeing those capabilities abused, because there's no point; it is so much easier to abuse HTML. But, if people want to abuse PDF—and, if we somehow convinced the world to move to it, then they would—then they easily can.

(I'm not conversant enough in the spec to know, but I do know that Postscript is Turing complete, and I don't know that PDF isn't. At least HTML on its own certainly isn't—no recursion!—although all bets go out the window once you start layering other tech on top of it.)
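
A crude way to see whether a given PDF uses those capabilities is to look for the dictionary keys that introduce them. The sketch below scans raw bytes for a handful of such names (`/JavaScript`, `/JS`, `/OpenAction`, `/AA`, `/Launch`). It is a heuristic under loud assumptions: names hidden inside compressed object streams or written with `#xx` escapes will be missed, and random binary data can produce false positives. `active_content_markers` is an illustrative name, not part of any library.

```python
import re

# PDF name tokens associated with scripts or automatically triggered actions.
ACTIVE_MARKERS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")


def active_content_markers(pdf_bytes: bytes) -> list:
    """Return the marker names that appear as whole tokens in the raw PDF bytes."""
    found = []
    for marker in ACTIVE_MARKERS:
        # Reject matches followed by a letter or digit, so /JS does not match /JSON.
        if re.search(re.escape(marker) + rb"(?![A-Za-z0-9])", pdf_bytes):
            found.append(marker.decode("ascii"))
    return found
```

An empty list is not proof that a file is inert, but a non-empty one is a good reason to open it in a viewer with scripting turned off.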


I agree, I don't see why anyone would call publishing in PDF "dumb". The author of the material gets to choose his medium. If "you" don't like it then move along or convert it to your preferred format. In other words "why not both?"


I bet HTML to PDF is a lot easier conversion than PDF to HTML.

Formats matter.


> A PDF is inherently self-contained, standard web technologies are not

What technologies exactly? You can have absolutely everything you need inside the HTML. You can inline css, js, svg and images. What technologies can’t you inline?


You are correct that you CAN, but who does? That's no longer considered best practice. The argument these days is that it's a lot easier to manage css if it's in a separate file, same with js, etc. So none of the serious web developers actually do anything inline anymore. The time it would take to convert a "best practice" website with separate files for html, css, js, etc. is just not worth it. The point he's making is still valid - why not have the option for something static?


But with the same (and even much bigger) success you can declare “I’m switching to self-contained HTML! No more external resources!” instead of “I’m switching to PDF, saying farewell to interactivity and mobile devices”.

It's just the declaration of ONE person, switching ONE site.


> why not have the option for something static

You have the same option with either HTML or PDF:

- PDF files can be dynamic or static, depending on how you write them.

- HTML files can be dynamic or static, depending on how you write them.


>> * PDFs are self-contained and offlineable

> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.

You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to their hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or, I believe, sometimes a browser-specific archive; it's been a long time since I tried, since it works so poorly), and even that may not have the content you want due to dynamic sites.


As I explained, if the author wants to make HTML easily offlineable then inline CSS and Base64 images. Or, you know, make your website printable. If authors actually thought about the print to PDF "problem" it could be solved with traditional CSS and HTML. As someone else said, we used to do this. It used to be part of my every day web design job to make sure the page printed nicely.

The idea that the whole web is going to pander to edge case archivers is asinine. This whole conversation is about supporting the needs of the very, very few and romanticizing about the time when only interesting people used the internet. It's kind of elitist and self serving.


I guess I don’t really understand the point being made. Does it matter that much that saving a page creates a single file on your hard drive? If you really want a static rendering of a site, why not just print it to a PDF? Why does that have to dictate the file format you use for distribution? With PDFs you don’t have to worry about conversion, but they are also comparatively larger over the wire.

> even that may not have the content you want due to dynamic sites

But PDFs also don’t give you dynamic content. Nothing is stopping people from using HTML to serve static, JS-less content. In fact that’s what it was originally designed to do. All this web app stuff was bolted on afterwards, and it’s optional.

What do we accomplish by having some people switch over to PDFs? The people who don’t care about bloat will continue to not care about it. It’s not like thin content will become more discoverable or more common. It doesn’t really change incentives. The author says using PDFs makes it so you’re not tempted to add cruft to your sites but that’s not really a compelling argument.

Getting content creators to produce content without bloat is not really a technical problem. It’s a cultural and economic one. I don’t see how a file format addresses that.


> Does it matter that much that the artifact of saving a page be a single file in your hard drive?

Yes, it matters a lot. Word/Excel files are actually a zip archive containing many files and sub-directories. Can you imagine people working with exploded Word files, sending over mail and WhatsApp complete directory trees?
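
If you've never looked inside one, a few lines with the standard `zipfile` module show the idea. The demo archive built here is a stand-in with made-up contents, NOT a valid Word document; the part names just mirror the kind of layout a real .docx has. `list_package_parts` and `make_demo_package` are illustrative names.

```python
import io
import zipfile


def list_package_parts(package: bytes) -> list:
    """Return the sorted member names of a zip-based document package."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return sorted(archive.namelist())


def make_demo_package() -> bytes:
    """Build a tiny stand-in archive in memory (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/styles.xml", "<styles/>")
    return buffer.getvalue()
```

Passing the bytes of a real .docx or .xlsx to `list_package_parts` lists the same sort of directory tree, all of it carried around as one file.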


The file format restricts the possibilities. You know what to expect when you see a PDF - static, JS-less content. With HTML on the other hand, it depends on what the author decided.


> You know what to expect when you see a PDF - static, JS-less content.

You know to expect that, but there's no guarantee that's what you get. PDF supports JavaScript too.


Or I could just make sure that my page prints reasonably well (we used to do this) and use the print-to-pdf functionality available in modern browsers.


You can write HTML pages to be self-contained and offline-friendly.

You can write PDFs to include resources that are not part of a single, self-contained file, and to be quite unfriendly with offline use.


But if you want a page in PDF, you can print it to PDF. Sure, non-computer-savvy users might not know how to do it off-the-bat, but browsers make it pretty easy.


> But if you want a page in PDF, you can print it to PDF...

Printing a page to PDF usually sucks: See https://news.ycombinator.com/item?id=27883028


Oh, I know that. I just meant that if your goal is for the website to be easily archivable, rather than publishing the website as PDF you could use simple HTML which wouldn't suck when printed to PDF.


>> it's significantly more difficult with HTML

Right Click > Save as

Try it with this page!


> Right Click > Save as

> Try it with this page!

Say hello to your new sidecar directory (or broken CSS/images/God knows what else)!

I tried to save an NY Times article, and it 1) needed JS to display anything, 2) even with the sidecar stuff was broken, 3) it was so plastered with ads and other junk I thought it was incomplete (it wasn't, I just had to scroll waaay down past something that looked like a footer and some voids after that).

If you save a PDF, you get that exact PDF on your hard drive, and when you open it (even in 10 years) it will look exactly the same as it did on the site.

With PDF WYSIWYS: What you see is what you save.


This is of course the point of the article - that the web is a giant steaming pile of shit for the most part, plagued by JS and external resource requirements, all of which contribute to massive total page size.

I'll preface by saying I have some expertise in HTML, but none in PDF (the format).

Most commenters who suggest that HTML is still a better alternative than PDF (I agree) are assuming that if this is an important issue to you, you would craft your page in a simpler style compared to most of what we see on the web, making Print to PDF or Save As... more viable.

  > PDFs and a PDF tool ecosystem  exist today. No need for another ghost town   GitHub   repo   with   a   promising   README   and   v0.1   in progress.
This is news to me. I'm not sure that I buy it. PDFs have always been a pain in the ass to work with in my opinion. Maybe there are tools, but in my experience they aren't very good.

In general, we know that HTML is going to be much more compact (and compressible!) than PDF and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.


> In general, we know that HTML is going to be much more compact (and compressible!) than PDF and that's the biggest advantage I see on a web where bandwidth still matters.

PDFs can be tiny if they do not embed fonts. Serving fonts is very much a complex technology in HTML world.

Browsing the web is a pain in the ass if you don't use a browser compliant with up-to-date standards, but the whole "HTML can be lightweight" argument pretty much depends on avoiding much of today's standardisation. As an objection to the original argument, it is not comparing like with like.


> This is news to me. I'm not sure that I buy it. PDFs have always been a pain in the ass to work with in my opinion. Maybe there are tools, but in my experience they aren't very good.

> In general, we know that HTML is going to be much more compact (and compressible!) than PDF and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.

PDF is a display format. I once worked on a project parallel to a guy who was parsing PDF to extract text content. IIRC, Text in PDFs is stored in a way that works fine for printing/rendering but not so well for manipulation (e.g. it's a bunch of commands to render line Z at position X,Y with font W). Those commands don't have to be in reading order, nor do they have the semantic meaning you can get from markup like HTML (e.g. superscript can just be nothing more than a different line rendered with a smaller font).
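
To make that concrete, here is a toy model of the problem, not a PDF parser. It assumes the text has already been pulled out as (x, y, text) draw commands in arbitrary order, with y growing upward as in PDF's default coordinate system, and tries to recover reading order by grouping runs into lines. `TextRun`, `reading_order` and the `line_tolerance` value are all made up for illustration.

```python
from typing import NamedTuple


class TextRun(NamedTuple):
    x: float
    y: float  # larger y is higher on the page
    text: str


def reading_order(runs, line_tolerance=2.0):
    """Group runs into lines by y, then read top-to-bottom, left-to-right."""
    lines = []  # each entry: [reference_y, [runs on that line]]
    for run in sorted(runs, key=lambda r: -r.y):
        if lines and abs(lines[-1][0] - run.y) <= line_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append([run.y, [run]])
    return "\n".join(
        " ".join(r.text for r in sorted(line_runs, key=lambda r: r.x))
        for _, line_runs in lines
    )
```

Even this simple version has to guess at a tolerance, and it falls over on multi-column layouts or a superscript drawn a few points above its line, which is the kind of semantic information HTML markup carries for free.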

IMHO, PDF is actually less optimal than HTML for what this guy is advocating, except that it's precisely those limitations that have prevented PDF from becoming the mess that Web HTML has. Though, that's probably in large part because the bloaters have been too distracted by the easier target that is HTML to bother.


Yeah, no. Try it with any other page, and see why nobody would be inclined to even try "Save As.." a web page anymore.


I actually did this pretty recently, in an attempt to get some magazine articles onto my Kobo e-book reader since Pocket couldn’t fetch the paywalled ones (I do pay).

I figured I could just save the page, automate a few edits to get around dynamic stuff, and then use it as, you know, an HTML document.

Even with a nice friendly mostly-text literary magazine, after about five hours I gave up and just copy-pasted the rendered text.


> >> it's significantly more difficult with HTML

> Right Click > Save as

> Try it with this page!

HN is not a good site to illustrate the unpleasantnesses of navigating the modern web. As you'd hope for a hacker news site, it is very friendly to this sort of thing. Most sites aren't.


> You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to their hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or, I believe, a sometimes browser-specific archive; it's been a long time since I tried, since it works so poorly), and even that may not have the content you want due to dynamic sites.

Ctrl+P -> Save as PDF

You don't need the page to be a PDF to save it as a PDF.


These all seem like technical quibbles that miss the point.


The guy outlines his whole case based on those exact points which are, as you have observed, technical quibbles and not a basis for abandoning HTML.

Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.

The internet is plastic not because of HTML, but because of money and people. When you have teens driving content it's going to feel plastic. When Walmart uses the internet to sell you crap it's gonna be plastic. Gossip / social platforms are trash, no matter the medium.

It could be argued that TV is an incredible learning platform ruined by HD. Back in the standard definition days we had proper news, documentaries that were substantial, and no reality TV. We need to go back to black and white standard definition.

Sorry, but the PDF web is not a solution to societal rot.


> The guy outlines his whole case based on those exact points which are, as you have observed, technical quibbles and not a basis for abandoning HTML.

His is actually more of a social observation: it doesn't matter what the technology can do, what matters is how the developers of that technology actually use it.

People who use PDF almost never use 3D graphics and heavy dynamic JS, so PDFs almost always have many of the qualities he's seeking.

Web developers almost never inline anything, and do all kinds of things that are arguably deal-breakers except for a few lowest-common-denominator use cases.

> Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.

The premise is that the web has failed in important and clear ways, it's impossible to fix so we should give up, so many use cases should abandon it for something else, and PDFs are unexpectedly well suited for that.

On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.


Turning PDFs into the replacement for HTML would change the incentives around PDF authoring, and PDFs would then acquire the same problems identified with HTML.

The solution to the identified problems is not to switch to PDFs. Stop reshuffling the chairs on the deck of your sinking ship, and start figuring out how to design, implement, and incentivize the use of, some means of conveyance other than iceberg-vulnerable ships.

> On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.

Java Applets were killed by Flash.


> PDFs are unexpectedly well suited for that.

Not so surprising, really: the PDF standard evolved in parallel with Adobe's Flash between 2005 and 2010, which was then the key technology in Adobe's effort to keep a strategic toehold on the web. If Flash had not been a security clusterfuck, it might still be around. The PDF standard was always meant to be a complementary standard, and Adobe's attempted successor technologies have followed an even closer technological path.

The PDF standard has benefited from the fact that, unlike the W3C and WHATWG, surveillance capitalists have not been in the driving seat of its standardisation effort. Adobe's interests are not identical to those of the public, but they are not as essentially adversarial to them as the web standards bodies have been.


Is the medium the message? Does style have substance? Is form also a function?


I'm not exactly sure what point you're trying to make here, but I don't think two different formats for encoding formatted text with images constitute different "mediums".


Of course they are, and we run into it constantly in computing. You can encode text with images as a bitmap, as vector graphics, as symbolic content that references bitmaps or vectors, as an algorithm that procedurally generates any of the above...

While you can produce identical outputs from the different methods, it's not hair-splitting to say that the authoring process and hence the nature of the medium to shape expression is affected by choosing one. When you opt towards maximizing generality your production cycle can grow without bound because everything is possible by layering different media, even if all of it is unnecessary. That's how you end up with creative projects that take multiple years to decades to accomplish.


Well, you seem to get the gist of the hot take the author put out. This article is not about PDFs. There is something wrong with the world and we can sense it.

This is close to it: When you have teens driving content it's going to feel plastic.

Youth is the ultimate quality destroyer. They just fucking suck. I’m quite sick of their drivel honestly, and yet, we let them dictate the world (watch my childish cartoons, even in old age).

And the little shits complicate code bases. All you little rascals under 30, scram, I’m on to you.

And all you little adults acting like children, with your stupid motivational posts on LinkedIn, and your garbage bragging on there, I see you too.

Stop.


Unless I'm on a paper-sized tablet I would definitely rather have an offline HTML file than a PDF. Nobody likes to pan back and forth on lines of text to read something.


I had the exact opposite reaction. I’m reading this on an iPhone SE2020, and I MUCH appreciate reading this in pdf form. I didn’t have to pan back and forth or even put the phone in landscape orientation. This is one of the smallest smartphones you can still buy, and the experience of PDF is WAY better than the user-hostile auto-flow text forced down mobile users’ throats.

I was skeptical at first, but I think the author made the point fantastically well.


What.

Your browser has a zoom functionality that lets you make the text smaller, essentially replicating the PDF site above. Only the opposite of what you say is correct: I can’t read that PDF’s text without turning my phone into landscape and picking up my glasses.


To get equally small text on my desktop I have to turn the font size all the way down to 7. God forbid you have readers with less than stellar eyesight.

I get what they're going for but the PDF is not exactly an accessible reading experience.


I’m using a 2016 iPhone SE, and it’s largely unreadable without being very up close.


EPUB would beat the shit out of PDF for that.

(EPUB is basically a subset of HTML with client-oriented context.)


PDF is size-agnostic. There's nothing to stop you from creating documents the size of a phone screen.


I’m commenting here as a user reading a PDF. The fact that someone else could have laid it out differently doesn’t change the fixed layout of the PDF that I’m trying to read.

There’s a reason responsive design has been a big deal for the last 10+ years and I don’t think the benefits of PDF are worth throwing it out.


As someone who really detests responsive design, the lack of it in a PDF strikes me as a feature, not a bug.


> These all seem like technical quibbles that miss the point.

If these all "miss the point", what is the point?

It seems to me that the article's point is that PDF as a format has attributes that satisfy the author's goal, whereas HTML does not. The parent comment says that HTML does have those attributes after all (if you choose to use HTML that way). That is very directly addressing the article's point, as I understand it.


Perhaps I misunderstood, but I believe the author's point was to highlight what a steaming mess the modern web is. The PDF aspect strikes me as illustrating a point, not a seriously proposed solution.


This statement could be for both the comment you're replying to and the original article.


>PDFs don't have any dynamic interaction...

Just a caveat to that statement, you can literally do interactive and dynamic 3D graphics rendering in PDFs: https://helpx.adobe.com/acrobat/using/enable-3d-content-pdf....

You can also embed JS in PDFs: https://helpx.adobe.com/acrobat/using/applying-actions-scrip...


Yes, and many of these things are "in general" not well supported by anything but Adobe's PDF software.

Even the simplest interactive things can easily fail to work correctly, even in the more widely used PDF readers.

IMHO PDF is in many ways worse than HTML; it's just that these ways are less commonly used. But if you start a PDF-instead-of-HTML trend, it's just a matter of time until these "not so compatible" aspects of PDF become widely used by some people.


JS in a PDF? You can do that in HTML, why not use the tools you already have that work together by design?

This guy is arguing that removing JS is what makes the web better. Having published, static, paper-like content is the way forward.


Just caveating a technical statement I knew wasn't quite true, not making any sort of assessment either way.

As someone who has had to extract data from large sets of PDFs and modern web presentation formats, I'm not a fan of either, really. Even verifying that a visibly presented string exists in a PDF document programmatically can be a non-trivial task, as with a given website as well. That to me says a lot.


monkeynotes seems to take the line that technical defects in claims others make fatally undermine their case, but technical defects in his/her arguments are irrelevancies.

For what it's worth, the same objection occurred to me. The use of scripting I've seen in PDFs has been use-supporting and consistent with their book-like feel.


Also - how are PDFs exactly "discoverable"? I have petabytes of PDFs and making them easily "discoverable" for any mass use, such as analytics, search, or data analysis is a massive pain. I'd rather have them in a non-PDF format.


The author is calling for new content to be authored as PDF, which can easily be made discoverable.

I’m guessing your data set is made of scans with poor or no OCR.


Not a single researcher or data analyst I know of would prefer "discoverable" content to be in PDF format, regardless of just how awesome the OCR is (which it often isn't, especially for tabular data). Even for all-text, non-tabular documents, OCR does not provide the metadata needed to make sense of the documents. Why PDF is claimed to have superior "discoverability" in the OP essay is a mystery to me. For the sake of "discoverability", PDF is definitely not the way to go.


The essay claimed

> PDFs are discoverable. Search engines index them as easily as any other format.

What you’re talking about has nothing to do with that.


Honestly, if you're going to put out a manifesto as a PDF, at least take some time "layouting" your design. The one advantage of that format is that you control the aspect ratio. Every font is permissible, everything is absolutely positioned. Using a generator to create it is cringey. Show the art that's possible. Really sell the format.

FWIW I deliver PDFs daily as an art director; not ideal, but they work in most cases. There's certainly nothing rebellious or non-commercial about them.


...and difficult to read on the small screens of mobile devices.


Yeah. That's why they're only used for print.


> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.

I built a tool for this exact purpose[0] since the HTML specification and modern browsers have a lot of nice features for creating and reading documents compared to PDF (reflow and responsive page scaling, accessibility, easily sharable, a lot of styling options that are easy to use, ability for the user to easily modify the document or change the style, integration with existing web technologies, etc.). In general I would rather read an HTML document than the PDF document since I like to modify the styling in various ways (dark theme extensions in the browser for example) which may be hard to do with a PDF, but it's more of a personal preference. Some people will prefer that the document adjusts to the screen size of the device (many HTML pages), and others will prefer the exact same or similar rendering regardless of the screen size (PDF).

Either way, kind of a fun idea making a website using just PDFs. Not the most practical choice, but fun none-the-less.

[0] https://github.com/chowderman/hyperfiler
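
For anyone curious what "Base64 your images, put your CSS in the HTML page" amounts to, here is a rough Python sketch using only the standard library. It is not how the tool linked above works; `inline_assets` is a made-up helper, and the regular expressions assume very regular markup (double-quoted attributes, `rel` before `href`), so treat it as an illustration rather than a robust converter.

```python
import base64
import mimetypes
import re
from pathlib import Path


def inline_assets(html: str, base_dir: Path) -> str:
    """Return html with local stylesheets and images embedded in the page itself."""

    def inline_css(match: re.Match) -> str:
        # Replace the <link> tag with a <style> block holding the file's contents.
        css = (base_dir / match.group(1)).read_text(encoding="utf-8")
        return f"<style>{css}</style>"

    def inline_img(match: re.Match) -> str:
        src = match.group(2)
        # Leave remote and already-inlined images untouched.
        if src.startswith("data:") or "://" in src:
            return match.group(0)
        path = base_dir / src
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        data = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{data}{match.group(3)}"

    # Assumption: stylesheets are written exactly as <link rel="stylesheet" href="...">.
    html = re.sub(r'<link rel="stylesheet" href="([^"]+)"\s*/?>', inline_css, html)
    # Assumption: image sources are double-quoted src attributes on <img> tags.
    html = re.sub(r'(<img[^>]*\ssrc=")([^"]+)("[^>]*>)', inline_img, html)
    return html
```

Usage would be something like `Path("page.single.html").write_text(inline_assets(Path("page.html").read_text(), Path(".")))`, where both file names are placeholders. Scripts, fonts and `url(...)` references inside the CSS are not handled here.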


This reminds me of the guy who said Dropbox was stupid because he could set up an FTP server. It’s the exact same argument.

People understand PDFs, they are extremely common in the academic and business world as “digital paper” standalone documents. Hypothetically, anything in memory can be made into a file but in this scenario what matters is the practical goal of people actually using these files.

I think it makes sense for the web to be made up of discrete primitives not only so that the web can be browsed in an intuitive and frictionless way but also because it lends itself to being backed up and easily re-hosted.


This. Also who hates the huge double margins? The slow rendering? The unnatural break-up of text? Meaningless headers and footers? And the whole page-based layout? PDF is not meant for the web. Period.


You seem to miss the point of the post:

----

Call to action

Publish in static file formats

Date and hash your work

Stop spying on your users

----

All this cannot be GUARANTEED by HTML/pdf/epub and requires active cooperation from the author. This is bad.
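
The "date and hash your work" item in the quoted list is at least easy to do for any static file, whatever the format. A minimal Python sketch, assuming SHA-256 as the hash and an ISO date (the essay as quoted here specifies neither), with `fingerprint` as a hypothetical helper name:

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(data: bytes) -> dict:
    """Return a digest and a UTC date that can be published alongside a file."""
    return {
        # SHA-256 is an assumption; any widely available cryptographic hash would do.
        "sha256": hashlib.sha256(data).hexdigest(),
        "published": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
    }


# In practice the bytes would come from the published file, e.g. Path("essay.pdf").read_bytes().
print(fingerprint(b"the exact bytes of the published file"))
```

A reader can recompute the digest on their own copy to check it is byte-identical, but as the comment says, that only works if the author chose to publish the hash in the first place.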


All true. Incidentally, I do not see pagination as necessary or in most cases even desirable; rather, I see it as a vestige of the printing technology, while the need for printing has shrunk dramatically over the past 20 years.


> PDFs don't have any dynamic interaction

Oh, you are set for a world of surprises. Nearly every single one bad, but running our current web over PDFs is well within the specs.


PDF

- does not reflow, major suck

- is binary format, another major suck

So no thx, PDF is outdated tech, while HTML and friends are just abused.


What I like best about pdf files is that I can just give them to someone and be almost certain that any questions will be about the content rather than the format of the file.


agreed.

and, ancient HTML can still be easily read by modern browsers, so that's not exactly a special attribute of PDF either.


HTML can easily be offline-able.

Sure - if the publisher cares. From the user's standpoint, the safe assumption is that they don't. Of course PDF is No Good for many contexts, but for any sort of long-form document that is primarily meant to be read, it's so often better.

Also, if something is available in pdf, I can be moderately sure that someone else took the time to make sure it would be formatted correctly and print out OK.* If it only exists in HTML it's more of a roulette wheel experience.

* Unless some graphic designer thought 'gee this report would look so cool if the cover pages were black or some other highly saturated block of solid color.'


HTML used to be a very nice format in the age of XHTML 1.1: very formally specified, and a tie with the DOM was assured by the very strictly standardised DOM v3. And ACID3 was giving you pixel-for-pixel repeatability during rendering.

HTML+JS today... now it's effectively a standard in name only, and Chrome is the new IE6. The standard is now "what has worked in the last stable release"

Now go to http://acid3.acidtests.org/ and see how the latest stable Chrome release can't render a decade old CSS testcase.


> Simply build your website with pagination.

My experience is that browsers are terrible with CSS pagination support in their display and printing directly.

The only place it seems to actually work is...saving as a PDF...


PDFs aren't really meant to be read off a screen, they're much better suited for stuff that's meant to be printed out.

And you can have a single self-contained file with a webpage, it's called a "web archive", with .mhtml extension.


> Base64 your images […], put your CSS in the HTML page

Is there a tool that does those two things (or at least the first one) and that can be used by non-programmers (command line use is fine, a Python library would not be)?


You can use SingleFile for this, see https://github.com/gildas-lormeau/SingleFile/


"I come to hacker news to engage with thinkers, not just read a published article from a single author."

And how many websites today are anything like HN, in terms of relative simplicity, e.g., no images^1, 3rd party requests or ads, only a tiny bit of (gratuitous)^2 JS.

1. I do not participate in the voting scheme but I could vote from the command line if I wanted to. I use a text-only browser so the grey, fading text gimmick is irrelevant. I see all comments and treat them according to the thinking not the voting.

2. If we exclude the .ico and a .gif

There seems to be a double-standard, for lack of a better term, where many HN commenters and voters appear to work for companies that make websites with tracking and ads and various gimmicks targeted at "non-thinkers" which are nothing at all like HN. Whatever these commenters and voters see and appreciate in HN they are not working to bring it to the rest of the web. I seriously doubt they comment and vote on HN out of fear of so-called "power users" or a belief that the HN type of simplicity could become more popular and threaten their jobs that depend on surveillance, online ads and a non-thinking audience of "powerless" users. Rather, a more rational explanation might be that they see some value in a website that shows no ads and generally uses no gimmicks; that's something to think about.

"PDF web" may not make sense to many folks who have invested heavily in JS and Big Tech web browsers, but Postscript is arguably more elegant than Javascript. "Thinkers" usually like FORTH.

https://en.m.wikipedia.org/wiki/Display_PostScript

The tracking section mentions the Abe Vigoda status page.

http://www.abevigoda.com/


PDFs are also horrible to view on mobile, as the text doesn't reflow.


Sounds a lot like epub.


so because someone chooses to publish their website in an open format that they prefer "it's dumb" because they don't agree with you.


In a sea of cynicism, I gotta say.. bravo. This genuinely put a smile on my face. It has a lot of problems, sure, but it's a creative use of the Web and would surely work for some use cases. It's certainly no worse than using Flash ever was.

It reminds me a bit of a "newsletter" I'm subscribed to called, ironically, "Not a Newsletter" (http://notanewsletter.com/). You get an email from the author each month and it just points to a Google Doc where he puts the actual content. Why's this good? The content can't set off any spam filters, he can edit the issue after it's "sent" if there are mistakes or broken links..


The content can be censored arbitrarily by google, and when you click on mobile web with the docs app installed, it logs your logged in google account identity (maybe for work?) with the view when it switches to the app.

Files have none of these problems.


You're not wrong! It's always a trade-off of one set of problems for another with these sorts of things, I guess.


If the author was concerned about getting censored by Google or feeding their data empire, they could set up a self-hosted Google Docs-like, like NextCloud.

The readers would still need to trust the author's not doing anything nefarious with their IP addresses, but I guess there's a degree of implicit trust when subscribing to a newsletter.


I would just put it on my own server. Are people really worried about clicking a private link and having their IP address logged? Just opening an email with a tracking pixel triggers that already, and you have to assume clicking a link will log your IP whether with Google or Constant Contact or any other mass email provider.


Google Docs are still files. It's just up to the author (or even the readers) to keep copies outside of Google's servers. Unless Lab6 owns their own servers, whoever is hosting these pdfs can delete them as well. At least, in both cases, static files are much easier to backup and copy than entire three-tier dynamic applications. And readers can keep their own copies separate from the original, which isn't possible with an application at all.


> Google Docs are still files. It's just up to the author (or even the readers) to keep copies outside of Google's servers.

No they're not? You literally can't have a google doc as a file in a first-class way - you can export it to a file, but that's a lossy process.


Yup. Another way to say it is Google will release a file format the day offline computing drops dead. It should probably amount to an antitrust case or at least a major class action claim at this point. That said, even with PDF specs it's freakin impossible to read/write that format in an intelligible way, if the person creating the document used even the barest amount of block alignment. Adobe started with an innovative notion about layout, but ended up making content extremely hard to parse, and actually tried to open source the engine. Google started with an idea of trapping everyone's data in a format they'd never make fully available, and then charging for the privilege of storing it.


It is too early to displace HTML with PDF.

> PDFs used to be inaccessible

My eyes are not very good. I have trouble reading the font in the PDF. I am using Firefox. HTML lets me pick a font that I can read easily. I cannot do that with PDF.

> PDFs used to be unreadable on small screens, but now you can reflow them.

I am using Firefox. I cannot do that.

Realistically, how many years will I have to wait until Firefox catches up?

Over twenty years ago, I learnt Web authoring by examining the source which had a profound effect on my career. That serendipitous opportunity I had with human-readable sources will be lost to the next generation with PDF - they have to learn the technology deliberately.


My understanding is that PDF is a monster of a document format, and it's clearly not (usually and historically) meant to be reflowed. Even copy/pasting from PDFs can be very disconcerting because the viewer may not have a good idea of where blocks of text start and end (or even what the characters really are).

I can empathize with the feeling that the web is incredibly bloated, but that's IMO throwing the baby out with the bathwater. Simple HTML with some optional CSS would do the job much better IMO (and can be easily downloaded, mirrored or offlined with tools like wget).

And if you really don't like writing HTML (I won't blame you) then there's always formats like markdown, org-mode and friends which can easily be converted to pretty much anything.


Dealing with PDFs (as in, coding a system that can import/export/display them) is more obnoxious than dealing with excel spreadsheets.

Unless your system is a PDF library (as in, you make the black-box dependency that other systems use to handle PDF exports), everything you do with PDFs will be through some annoying black-box dependency that is a pain to use.

Even relatively complex HTML is much more fun to work with than PDF.


As far as I know, it is nothing specific to Firefox. You can't set your own PDF font or reflow a non-reflowable PDF in any browser.


Brief investigation suggests reflow is a super-clumsy, ultra-coarse-grained view mode that is implemented by few clients, is not easy to access, is not well known, and is vastly inferior to what you can get on the web, especially as it’s basically text-only.

In Adobe Acrobat (and I’m guessing Adobe Reader): Choose View → Zoom → Reflow, and it turns everything into one column of nigh-unformatted text.

(Word looks like it may support it, but that could be more that it’s converted it to a Word document in some way and reflow-like functionality falls out of that naturally, though I imagine the tagging would help with the conversion; and someone in this thread mentions something called “Book Reader” supporting it.)


Source code for websites hasn't been readable for years. Reading a minimized JS document that has mauled the DOM is only slightly more readable than the structure of a PDF.


> Over twenty years ago, I learnt Web authoring by examining the source

So did I. Now, it is impossible to reverse engineer the metric crapton of minified JS and CSS cryptoglyphics that comprise the modern web.


TBH it's a little bit like complaining you can't open a modern binary executable in a hex editor and learn programming from that. Days of doing your regular coding by writing direct machine code or assembly are (mostly) gone, and for the sake of advancing the craft, I'm (mostly) happy with it.

But I too wish the modern web was simpler. It took an evolutionary path of maintaining just enough backwards compatibility to only keep making things worse. Efforts like Gemini[1] bring some hope but I'm afraid the medium won't be flexible enough for much beyond personal blogs. But maybe that's for the better.

[1]: https://gemini.circumlunar.space; gemini://gemini.circumlunar.space


>It is too early to displace HTML with PDF.

'Never' will be too early.

>Realistically, how many years will I have to wait until Firefox catches up?

They should better improve reflow for HTML on small devices first. Focusing on PDF is a waste of resources.


I mean, Firefox just follows the website's command to not format it as a mobile webpage, right? But a button to forcibly reflow is handy though.


The one piece of software that I know that lets you reflow PDFs is Calibre. And the results aren't great.


At least it looks more beautiful than terminal-only Gemini sites.

https://en.m.wikipedia.org/wiki/Gemini_%28protocol%29


Gemini is as "terminal-only" as Markdown. Just because it's a text format first and foremost, does not mean that you can't display it nicely formatted. It's more like EPUB in that regard.


Unfortunately, many Gemini sites expect a fixed-width font for alignment.


Gemini sites are not terminal-only and the renderer can make it look beautiful (depending upon one's definition of beautiful). One example is Lagrange:

https://github.com/skyjake/lagrange


I read this entire document. If you've ever had to write a PDF-to-text parser - and God help you, I have - you will beg for Flash to come back as a web standard.

[edit] Generally though, I'm sympathetic with your point and it's kind of like why zines regained popularity in the 90s (and samizdat in the Soviet Union before that)... controlling your own publishing is a powerful idea. Anyone can do that though, without resorting to obscure formats, unless obfuscation is the point.


  $> cat file.pdf | strings
Done. /s


Stop cat abuse! /s

    $> strings file.pdf


  $> strings < file.pdf
?? /s/s


The Poppler library's pdftotext is remarkably effective.
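
For what it's worth, one reason the `strings` joke above stays a joke: content streams in a PDF are commonly compressed with Flate (zlib), so the text is often not sitting in the file as plain bytes. A small Python sketch with a hand-made, simplified stream (not a complete, valid PDF; the sample text and layout are made up for illustration):

```python
import zlib

# A content stream that draws one line of text (Tj is the PDF "show text" operator).
content = b"BT /F1 12 Tf 72 700 Td (Deurbanising the Web) Tj ET"
compressed = zlib.compress(content)

# Simplified stand-in for a stream object that declares the FlateDecode filter.
pdf_bytes = b"<< /Filter /FlateDecode >>\nstream\n" + compressed + b"\nendstream"

# Roughly what `strings` sees: the raw bytes of the file.
found_raw = b"Deurbanising" in pdf_bytes

# What a real extractor does first: undo the filter, then interpret the operators.
found_decoded = b"Deurbanising" in zlib.decompress(compressed)

print(found_raw, found_decoded)
```

The raw check should come out False and the decoded one True. Decompression is only the first hurdle; a tool like pdftotext also has to deal with font encodings and fragment positions before it can produce readable text.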


Yeah, 10 second load time, tiny text on a mobile device. No thanks. Sucks that people went for over-styling every site making everything painful to publish. I’d be happy with 90’s static HTML, and a few images when needed. I seek information, not “an experience”.


Exactly my reaction to opening the site.

I had no idea what the content of the site was (besides the title from HN) and around the 50% download point, I had already lost interest. I'm clearly not the only one who loses interest this quick [0][1][2].

Also, as others have mentioned in root level comments, the design & layout of the content within is also severely lacking, which makes waiting for the load to occur even less worth it.

---

[0]: https://www.pingdom.com/blog/page-load-time-really-affect-bo... (2018)

[1]: https://blog.mozilla.org/metrics/2010/03/31/firefox-page-loa... (2010)

[2]: https://www.thinkwithgoogle.com/marketing-strategies/app-and... (I know it's Google, but to be fair they have more data on this than most other companies, despite their obvious desire to sell more of their product/services related to it.)


Exactly this. It is by the way one of the main reasons I initially stuck with HN. The lean UI, text based simplicity, efficiently conveying information had me instantly. I would sacrifice styling for speed anytime, everywhere.


On the contrary, I much prefer a small text on a mobile device to the reflowed text on a mobile device that we’re always forced to use. The PDF is also the same view as on a desktop, so if I look at it on another device, my spatial memory of where stuff is remains intact.


Might as well just generate a PNG. The text is too small for me on a mobile device. PDF's main goal was print. The fonts are awful for the screen and there's no ability to reflow the text.

I can deal with things moving around, I don't need spatial memory for that. Just give good titles, headers, and indexes. Again, we can do this with simple HTML, embed images and styles. It's all there.

Unfortunately, as I mentioned, people don't really publish information anymore. It's mainly for "experience" and for "looks". Marketing, and advertising, now drive the information era. The "Information Super Highway" is now just a crumbling road plastered with billboards. Most content is useless, and is there for clicks. Heck, I'd rather someone post their site in digests in e-book formats than PDF.


Only 10 seconds!?


I just ran your PDF through an accessibility checker and it failed magnificently. For this reason alone, suggesting people make more use of PDFs instead of well-formatted HTML is a total non-starter for me (and should be for everyone).


Making properly accessible PDFs is possible, but it is a pain in the ass. Certainly more difficult than with plain HTML.


It’s entirely possible to write accessible PDFs. It’s just that no-one does.


It is indeed! And you're right, nobody does, including this example.


And even if they did, many of the readers/viewers people use wouldn't fully support it.

While it's possible to royally mess up accessibility in HTML, too, the chances of getting something usable are at least somewhat better.


My thoughts exactly, I feel like it would be easier to write accessible webpages (given the wealth of accessibility tools).


Even Word documents are more accessible than PDFs.


Heck, even PDFs produced by Word (or comparable FOSS editors) are so much better (except if you've done it incorrectly by "printing" it) than this particular one.


I find it quite amusing that the author is railing against HTML at least in part because it's practically impossible to build a new web browser at this point, and then moves to PDF instead.

In my time working with PDFs, I've found that generating them in ways that can be read with the most popular PDF readers is cryptic and difficult, and even parsing the ones made from the most popular creators is hard.

I would definitely not pick PDF over HTML in regards to how easy it is to implement a good reader or writer.

And there's plenty of authoring tools for HTML already, so the "ecosystem already exists for PDF" doesn't track either.

Even the complaint about churn makes no sense to me, because there's no need to upgrade your tools constantly. If you're using something that produces good HTML today, it'll produce good HTML in a decade, too.

OTOH, if you have a problem that could be automated, you're a lot more likely to be able to create that tool for HTML than PDF, and it's quite likely that someone else already has for HTML, but not PDF.


> In my time working with PDFs, I've found that generating them in ways that can be read with the most popular PDF readers is cryptic and difficult, and even parsing the ones made from the most popular creators is hard.

Both pdf readers on my phone can't read the pdf, so this is definitely an issue.


As someone who works with PDFs a lot, please don't. PDFs are awful in every case except those which require a very precise visual layout. From reading the article, I do not see a single case in which PDF is superior to vanilla HTML.


My kids' school used to send links to Google Docs for their announcements, and I hated it. I pretty much hate any system like that; it's purely extra steps on the web.

In both email, and the browser I'm already in a program that displays text and images and cool stuff. So then I'm just sent a link to someplace else that does the same thing?

So then what? Is it all just "pdf can do that too", but with extra steps...? I can print to PDF in most browsers if I want, but in this case it isn't a choice.

The idea that I might save and store the school emails or that website and somehow manage those files seems kinda self important in a way ... I don't mean that as a personal attack, just that this idea that they imagine me taking the time to do that with their content? When otherwise it could have just been an accessible web page? How many people care to do that?

If I'm visiting a website I'm almost certainly not interested in saving your content / managing it... almost never.

I'm a little lost on the whole 'page-oriented' idea too. That's just a limitation of paper, and it's a pain / disruptive more often than not. Even the 'page oriented' section is broken up by the page and some extra text at the bottom of the page that is irrelevant to the paragraph...

If folks want it, a 'save to pdf' option might be nice to add, or the user can just print to pdf...


Well, what's wrong with static site (generators)?

I certainly get the argument, but using something like hugo or gatsby or jekyll when you want to avoid the "churn" also seems like a perfectly valid solution.


The author addresses this pretty well. Because you can embed whatever you want, static site generators aren't really static. In particular, Jekyll blogs and what not still pretty commonly include comment sections.

Of course, pdfs aren't necessarily static, either, but that is why Lab6 is choosing to use pdf/a, an actually static format intended specifically for long-term archiving of immutable files. This way you can sign the file and guarantee it stays the same forever and everyone's copy is identical.
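To make the "everyone's copy is identical" part concrete, here is a minimal sketch that compares copies by SHA-256 digest. Note the assumption: a bare digest only shows that two copies match each other. It is not a signature, which would additionally need a key pair, and the function names are made up for illustration.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def same_copy(a: bytes, b: bytes) -> bool:
    """Treat two copies as identical when their digests match."""
    return file_digest(a) == file_digest(b)
```

A publisher could post the digest next to the download link so readers can check their copy against it.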

I'm kind of surprised at the response to this. The author seems well aware of how terrible pdf is as a format, and this isn't some treatise on why we should want to use it. It's an unfortunate compromise: given the requirements they're aiming to meet (a file that supports rich formatting and hyperlink embedding, but which can guarantee immutability and long-term archiving directly in the spec), pdf/a is all there is. So in spite of being a terrible format with a lot of shortcomings, it's what they're using.


Why don't they just use a static subset of HTML? You don't have to include comments sections, just like you don't have to include 3D CAD models and videos in your PDFs (yes you can do both of those, in theory anyway).


> The author addresses this pretty well. Because you can embed whatever you want, static site generators aren't really static. In particular, Jekyll blogs and what not still pretty commonly include comment sections.

But just like you can choose to use PDF/A, you can also choose to have a completely static and self-contained (e.g. using data URLs for images) HTML page.
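As a rough sketch of the data URL approach (assuming double-quoted `src` attributes and a hypothetical `assets` mapping from file name to bytes and MIME type; a real tool would use an HTML parser rather than a regex):

```python
import base64
import re
from typing import Dict, Tuple


def to_data_url(data: bytes, mime: str) -> str:
    """Encode raw bytes as a base64 data URL."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def inline_images(html: str, assets: Dict[str, Tuple[bytes, str]]) -> str:
    """Replace <img src="name"> with a data URL when `name` is in `assets`."""

    def swap(match: "re.Match[str]") -> str:
        name = match.group(2)
        if name not in assets:
            return match.group(0)  # leave unknown references untouched
        data, mime = assets[name]
        return f"{match.group(1)}{to_data_url(data, mime)}{match.group(3)}"

    return re.sub(r'(<img\b[^>]*\bsrc=")([^"]+)(")', swap, html)
```

The result is a single HTML file with no external image requests, which is the property being compared against PDF here.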


> pdf/a is all there is

Nobody is requiring you to use PDF/A. No mainline browser (that I'm aware of) requires it.

So what is being solved? When I click on a PDF on the web, I don't know if it's using PDF/A, I don't know if it's embedding or linking its fonts. So it's the same situation, nothing has changed.

Telling people to use PDF/A when most clients do not enforce it and when there's no indication to users before they click on a link whether or not the link is following the spec -- it is exactly the same as telling them to use a subset of HTML; the author is doing the same thing they complain about.

You can't just say that PDF/A exists. That's not enough, how will you get people to restrict themselves to that format when 99% of their users will never notice the difference and no client is enforcing it?


The only thing I like about PDF compared to HTML is that with PDF, I know for a fact that no web requests are made in the background. That means no fingerprinting, no analytics etc.

With HTML, I have to trust that some random entity does what they state in their privacy policy, and they regularly don't. Sure, I can disable JS, but then 95% of the web doesn't work anymore.

Other than that PDF is quite clearly a less accessible format.


How do you know for a fact? PDF has JS in the spec, and it supports SOAP and Web Services. Have a look at https://www.adobe.com/go/acrobatsdk_jsdevguide


That's not the PDF spec is it? That is a spec for Adobe Acrobat, which is not allowed to make any web requests thanks to my application firewall (Little Snitch).

Pretty sure a PDF opened in the browser can't run any JS, but not completely sure. So you're right: I don't really know it for a fact. Poor choice of words.


The spec is ISO 32000, and it’s expensive and closed, so difficult to reference. But according to Wikipedia at least, JavaScript is normative in it. No idea if SOAP / Web Services is part of it though.


The spec for PDF 1.7 is here: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PD...

JavaScript is allowed, but not in PDF/A, which is what I use.

The PDF 2.0 spec is damnably not public.


But you can't easily tell PDF/A and regular PDF apart, so we're back to the same situation as HTML vs. HTML with javascript turned off.
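For what it's worth, a file can at least be asked what it claims. PDF/A files declare their conformance in XMP metadata through the `pdfaid:part` property, so a crude byte scan can find the claim. This is only a sketch under assumptions: it presumes the metadata is stored as readable text in the file, the function name is invented, and a claim is not validation, so it doesn't change the point above.

```python
import re
from typing import Optional

# Matches both the element form <pdfaid:part>1</pdfaid:part>
# and the attribute form pdfaid:part="1".
_PDFA_PART = re.compile(rb'pdfaid:part(?:>|=["\'])\s*(\d)')


def claimed_pdfa_part(pdf_bytes: bytes) -> Optional[int]:
    """Return the PDF/A part number the file claims, or None if no claim is found."""
    match = _PDFA_PART.search(pdf_bytes)
    return int(match.group(1)) if match else None
```

Nothing in a link or a browser UI surfaces this before you click, so the user still can't tell in advance.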


Are you sure? I was under the impression that PDFs can reference web resources, and this is why there are more stringent standards for archiving (PDF/A and friends)


> With HTML, I have to trust that some random entity does what they state in their privacy policy, and they regularly don't. Sure, I can disable JS, but then 95% of the web doesn't work anymore.

If you only allow PDF, then 99.9999% of the web doesn't work anymore.

I'm all for getting sites to be static, but PDF doesn't fix that because the problem has never been the technology used to build the site.


How sure are you that there are no network requests happening? I tried to look this up and wasn't able to find any clear answer.

(It looks like at least some PDF readers have provided support for automatically displaying external images, for example)


The full PDF spec is insane and allows for web requests and javascript. Most readers do not implement the anti features but adobe's tools will.
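If you want a quick look at whether a given PDF uses any of this, a naive scan for the relevant names is a start. A hedged sketch: the list of names is my own selection of PDF 1.7 action keys, and the scan misses anything inside compressed object streams or written with hex-escaped names, so an empty result proves nothing.

```python
import re
from typing import List

# Names associated with scripting and outbound actions in PDF 1.7.
RISKY_NAMES = (b"/JavaScript", b"/JS", b"/URI", b"/Launch", b"/SubmitForm")


def find_active_features(pdf_bytes: bytes) -> List[str]:
    """Return the risky names that appear as whole tokens in the raw bytes."""
    found = []
    for name in RISKY_NAMES:
        # The lookahead stops /JS from matching inside a longer name such as /JSX.
        if re.search(re.escape(name) + rb"(?![A-Za-z0-9])", pdf_bytes):
            found.append(name.decode("ascii"))
    return found
```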


You are fingerprinted when you find the web link.


When I click a link you mean? Definitely true, but that way they only have access to my IP and user agent, which is still better than all the WebGL, Font library, display calibration settings, mouse movement etc. that they use otherwise.

I often use Tor, although I'm pretty sure that even then, a good analytics lib can see it's me based on scroll behaviour, mouse movement, time of day, and of course what I browse.

But yeah, you make a good point.


Where do you get the link?


DDG mostly, and they don't track users.


Your device, your device version, screen size, browser, browser version, IP address, etc... Are all tracked regardless.

You might not be a unique fingerprint, but at best you are part of a group of somewhere between 3 and 1000 similar users.

Not to be a downer, but when I webscraped I learned that big corporations can spend money to fingerprint you.
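The "group of similar users" idea can be sketched as counting how many visitors share the same combination of attributes. The attribute names and sample data below are invented for illustration; real trackers combine far more signals.

```python
from collections import Counter
from typing import Dict, List, Sequence


def anonymity_set_sizes(visitors: List[Dict[str, str]], keys: Sequence[str]) -> List[int]:
    """For each visitor, count how many visitors share the same values for `keys`."""
    counts = Counter(tuple(v[k] for k in keys) for v in visitors)
    return [counts[tuple(v[k] for k in keys)] for v in visitors]


visitors = [
    {"ua": "A", "screen": "1080p"},
    {"ua": "A", "screen": "1080p"},
    {"ua": "B", "screen": "1080p"},
]
print(anonymity_set_sizes(visitors, ("screen",)))       # every visitor looks the same
print(anonymity_set_sizes(visitors, ("ua", "screen")))  # the third visitor is now unique
```

Every extra attribute can only shrink the groups, which is why the passive signals listed above add up.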


Why?


You can choose not to use JS on your website.


> I certainly get the argument, but using something like hugo or gatsby or jekyll […]

Or a plug-in to Wordpress so you can keep the GUI/dynamic for the less technical employees:

* https://wordpress.org/plugins/simply-static/


Very surprised to see just a few comments mentioning EPUB, which IMO is much more suitable for a document-centric approach. It's an open standard with a freely available[1] specification, and I've never had any problems with EPUBs on PC, tablets and phones.

[1] - https://www.w3.org/publishing/epub32/epub-spec.html#sec-intr...


Also worth pointing out, EPUBs are (or, at least, can be. I'm not sure how much flexibility is in the specifications) basically just bundled HTML.


There’s a fixed layout version of the ePub standard too, allowing PDF quality if that’s what you’re after.


But can you open an epub in a browser? That's the main point here.


Not only simple browser plugins per the other reply (and a plethora of non-crashing mobile apps, whereas mobile PDF reading apps crash on me all the time) - the ePub format is just a zip file in disguise with plain text (HTML) inside and maybe some images/etc.

In a manner of speaking, ePub as a design has an inherent built-in fallback mechanism to manually obtain the internal content in case of failure - including ability to try and repair a broken zip format (zip -F/-FF) and grep it in place (zipgrep).
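The same fallback works without any ePub tooling at all. A minimal sketch of a zipgrep-style search using only the standard library (the function name and the list of file extensions are my assumptions):

```python
import io
import zipfile
from typing import List


def grep_epub(epub_bytes: bytes, needle: str) -> List[str]:
    """Return the names of (X)HTML files inside the archive that contain `needle`."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                text = archive.read(name).decode("utf-8", errors="replace")
                if needle in text:
                    hits.append(name)
    return hits
```

Since the content is plain markup, anything that can read a zip can get the text back out.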


Yes, but with the plugin


I also enjoyed the sentiment of the article. I used to blog a lot but in the last decade I have preferred more long form writing. Now I use the leanpub.com [1] service so when I write, I get generated PDF/ePub/Kindle formats, and material is readable online as HTML/CSS. For me leanpub is a way to make content free and accessible, but people can pay if they want. The relatively few people who pay for my material have a large effect on what I decide to write about in the future or which writing projects to drop.

I consume the web mostly by following a few very interesting people on social media and following their links. As an author, my goal is to keep producing interesting enough material to be worth people's time reading.

[1] https://leanpub.com/u/markwatson


This is an awful idea and I love it.

As others have pointed out it's strictly worse than a static HTML site in many, many ways. At the same time though, it's a brilliant criticism of many of the worst aspects of the modern web.

This is art.


Great article - so much depth and accuracy to this! I see a lot of discussion about the semantics of pdfs but I think those are missing the overarching theme here.

Feels like this is more about the fact that websites have become increasingly dynamic, unstable, unreliable, inconsistent, etc. - pdfs offer something like a book, static, stable, reliable and consistent.

Think about a book you can turn to a specific page no matter how many times you look at it and the print is the same, the information is the same, you can do the same action over and over again and get the same expected result.

Now imagine opening a book and you could have sworn that the chapter you wanted to reference was 11 but now it's 16 and the images are different, the examples are different, in fact the quote that you wanted to use for reference no longer exists in the book.

There's an insanity to this experience but it's exactly what the web is like - a book that is constantly changing, upended, even disappearing entirely. I could have sworn I had bought that book on discrete mathematics - how could it be gone? Oh, that's right, the server hosting the site is powered off - the book no longer even exists.


What is the summary?

Same as someone else: to read it on mobile I have to download and open a PDF, so I just cancelled the download and ignored the link.


What is the bump you experience that you don’t want to download and open a pdf? Here it opens in my browser directly (Safari)


It ends up in the downloads folder and needs cleanup later.


In all browsers that I use that is only true if the server sends a Content-Disposition header with its value set to “attachment” (optionally with a file name), or maybe also in the case where the server specifies incorrect or unspecific Content-Type (such as simply “application/octet-stream” instead of “application/pdf”).
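For anyone serving PDFs, the server side of this can be sketched as a small helper that builds the two headers. The function name is hypothetical, and a browser may still decide to download the file regardless of what the server asks for.

```python
from typing import Dict, Optional


def pdf_headers(inline: bool = True, filename: Optional[str] = None) -> Dict[str, str]:
    """Build response headers that ask the browser to display or download a PDF."""
    disposition = "inline" if inline else "attachment"
    if filename:
        disposition += f'; filename="{filename}"'
    return {
        "Content-Type": "application/pdf",  # a specific type, not application/octet-stream
        "Content-Disposition": disposition,
    }
```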


On the Brave browser for Android it also downloads the PDF file and stores it locally. Websites should use HTML and not PDF in my opinion.

On top of that the end result is not very readable on mobile, the font is too small.


> [...] Websites should use HTML and not PDF in my opinion.

> On top of that the end result is not very readable on mobile, the font is too small.

Agreed on both counts. Was only commenting about browsers saving PDFs.

PDF is not a comfortable format for reading on a screen. Nor a comfortable format to extract text or data from.


What I said happens for Firefox on android, unfortunately. It's a great browser, of course.


Even on mobile?


Yes. I use Safari on iOS.


None of my readers (I have multiple on my phone) can open the file, for some reason.


macos/ios have this built in but not all OSes come with a pdf viewer


Please somebody bake an icon into the browser that turns green when websites are lightweight and content-only and make it affect Google rankings.

We don’t need PDF sites, we need incentives for publishing acceptable websites.

Side note: I’d honestly love for the government to step in and outright outlaw some obvious and intentional dark patterns (example: California unsubscribe law)


> make it affect Google rankings.

Google is never going to make a change to its rankings that interferes with its real goal of 23% YoY revenue growth.


Is that actually an internal Google goal? If so, dear god, no wonder they are so willing to sacrifice the long term health of the internet in return for short term hypergrowth. No company Google's size can grow that fast without some serious dark patterns and user abuse.


You don't end up with that level of growth year over year for 20 years straight by accident. It is an unwritten assumption that missing 20% growth is a fail. I worked at Google almost 10 years and watched the dog and pony show (aka TGIF) from the inside. The real story is on the quarterly financial reports.


Maybe the author doesn't realize how difficult PDF is to work with. In PDF it's ambiguous whether any two spans of text belong together in the same sentence or paragraph. It can even be unclear where the spaces between words are. PDF also allows "optimizing" font usage that makes text unreadable without OCR-ing the custom font. The messy hacks go on and on:

https://filingdb.com/b/pdf-text-extraction
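To illustrate the missing-spaces problem: an extractor typically sees positioned glyphs and has to guess word breaks from the gaps. This is a toy sketch, not any real library's algorithm; the glyph tuple layout and the `gap_ratio` threshold are assumptions.

```python
from typing import List, Tuple

# (character, x_start, x_end, font_size), already sorted left to right on one line
Glyph = Tuple[str, float, float, float]


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.25) -> str:
    """Insert a space wherever the horizontal gap exceeds gap_ratio * font_size."""
    out = []
    prev_end = None
    for char, x_start, x_end, size in glyphs:
        if prev_end is not None and x_start - prev_end > gap_ratio * size:
            out.append(" ")
        out.append(char)
        prev_end = x_end
    return "".join(out)


line = [("h", 0, 5, 10), ("i", 5, 8, 10), ("y", 12, 17, 10), ("o", 17, 22, 10)]
print(join_glyphs(line))                 # gap of 4 exceeds 2.5, so a space is inserted
print(join_glyphs(line, gap_ratio=0.5))  # gap of 4 does not exceed 5, so the words merge
```

The same glyph positions give different text depending on a threshold the file never states, which is the ambiguity in question.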

OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.


Hello. Original author here.

I do realize how ugly PDFs are to work with (I wrote my own PDF/A generator for issue 2[2]). This is a Tagged PDF though, so you can extract text using standard tools.

To understand the mindset, have a read of the Gemini FAQ[0], specifically the answer to why not use a subset of HTML - and then read Issue 2[2] which is a hybrid Gemini+PDF polyglot, for people who don't like reading PDFs, which is apparently everyone on this thread :)

Issue 1[1] also moves beyond PDF, to try addressing some of the accessibility shortcomings by (a) prepending the content as plain text, and (b) recording myself reading the whole thing out and arranging the file as a polyglot MP3 and PDF file that can be played in an audio player as well as viewed in a PDF reader as well as a text editor.

A mini-FAQ to address some points elsewhere in the thread:

* No, it's not going to replace your blog or the web in general.

* Yes, it's an experimental art project / longitudinal CTF forensics tournament / weirdo personal blog.

* Yes, I'm serious anyway.

[0] https://gemini.circumlunar.space/docs/faq.gmi

[1] https://lab6.com/1

[2] https://lab6.com/2


> The problem is that deciding upon a strictly limited subset of HTTP and HTML, slapping a label on it and calling it a day would do almost nothing to create a clearly demarcated space where people can go to consume only that kind of content in only that kind of way. It's impossible to know in advance whether what's on the other side of a https:// URL will be within the subset or outside it. It's very tedious to verify that a website claiming to use only the subset actually does, as many of the features we want to avoid are invisible (but not harmless!) to the user

But I don't really know that your PDF website doesn't use some evil invisible PDF feature.

And I have to use a special Gemini browser to access Gemini pages. (Since an HTTPS bridge misses the point)

So why not use Dillo as my "Sane subset of HTML"? It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.


> It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.

Actually, it is. I love Dillo, but it's very limited. I like to make my images "fluid" using max-width and max-height attributes, and Dillo will not support those in any foreseeable future.

But again, I still love Dillo.


> would do almost nothing to create a clearly demarcated space

How do you create that demarcated space where PDF/A, PDF 2.0, and all other PDF versions can be mingled together, and there's no easy way to distinguish them?


I don't like reading PDFs and probably wouldn't read much of your website like that... but I appreciate the intervention drawing our attention to the advantages of PDFs in the disadvantaged present environment, which I think are real and worth thinking about. It seems almost like an artistic project. I'm not mad at you, and am not sure what makes some people seem to be so mad here (probably means you were successful at something)... but I'm still not gonna read it, PDFs are a mess to read!


I've spent entirely too much time "printing" sites and articles to PDF to save them to read or reference later. Your PDF style was perfect! No need to fuss with anything just save it!


This thread might be helpful to you https://news.ycombinator.com/item?id=27817659


I think the idea of PDFs opens up many new possibilities, and your work is quite an eye opener. Design is largely missing from websites - it’s the same design over and over when it comes to optimizing for clicks.

Designers would thrive in a PDF environment instead of handing their designs over to implementation as it is now.

Maybe PDF is just the beginning and maybe a similar format can be thought up that addresses some of the concerns expressed here, and move over in time.


Case in point: copy-pasting a paragraph from his PDF-website adds line breaks everywhere. It also loses formatting (bold/italics) and the footnote superscript doesn't translate.

  PDF is an open standard, which is freely available2, and stable. It has a 
  version number  and many interoperable implementations including 
  free and open source readers and editors.

I think ease of copy-pasting is one of the coolest things about the document-centric roots of the web (along with the back button and hyperlinks; in other words, hypertext rules), although the modern web does break it (along with the back button and hyperlinks) in many places, so I can see where he is coming from. PDFs aren't the answer, though.


> OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day.

I'm basically in agreement, but the author has a good point that PDF is obviously self-contained and self-contained HTML pages are not necessarily distinguishable from those that aren't. Perhaps we might have to revisit MHTML or embrace Web bundles as an alternative to PDF.


You want PWP <https://blog.jonudell.net/2016/10/15/from-pdf-to-pwp-a-visio...> (Later aborted, and the group's work was rolled into EPUB3. As you note, there remains a genuine need for it.)

On the other hand, there's nothing stopping you from using a double-barrelled file extension for denoting this sort of thing, e.g. "memex-opus.pub.html"; so long as it ends with something recognizable, double-clicking should still open it in the browser across all the usual platforms, AFAIK.

(I'm fond of using "xyzzy.app.htm" myself to take advantage of this trick for distributing simple, self-contained programs that are designed to run in the browser.)


This is what PWAs are kind of for.


It's not even JS. I'd argue a HTML + inline JS page is a lot more self-contained than one with external images, videos and fonts.

Note that PDFs can contain JS too.


> Note that PDFs can contain JS too.

That's why he says to use PDF/A, which can't contain JS.


> Note that PDFs can contain JS too.

Wait, why?!? When does it render? Who's supposed to have a js engine to do that? What version? How does it load dependencies? Is HTML and DOM carried along with it? So many questions.


Why - because scripting is useful. A big use of PDFs is translating paper forms into digital forms without needing to make a web app out of them. JS is used for client side validation, same reason it was put into browsers. Acrobat can handle this along with many other features that most PDF readers can't handle properly.

Basically in the PDF world, Acrobat Reader is Chrome and everything else is, like, Konqueror or something. Don't be fooled into thinking PDF is a small spec. It's not.


Why? To validate form fields.

Who? The PDF viewer.

When? Since about 2000 in PDF format version 1.3.

Dependencies? Hah, no such luck. You're stuck with ES5 and Adobe's crufty JS library. There is no HTML and DOM, there are however some pretty thorough PDF document bindings.


Or... AMP? But no, Google made that so it must be a bad idea.


MHTML, which is basically HTML email.


> it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.

Completely agree. For instance, NASA's APOD site[1] is a good example of something that'd be nontrivial using both an offline PDF and modern lightweight alternatives like Gemini, but works really well even without fancy modern design. Under 300kB including the image (HTML's under 6 kB) before gzipping.

[1] https://apod.nasa.gov/apod/astropix.html


The author addresses this: “We choose to switch to PDF in this decade, not because it easy, but because it is hard” – John F. Warnock, September 12th 1962"

The author is obviously making a statement, exploring ideas... not searching for an actual solution to his use case.


Yeah, it's kinda embarrassing that the one quote that gets pulled out in the HN commentary is the one that contains a typo. It's OK: Issue 1[0] contains a patch to fix the issue.

[0] https://www.lab6.com/1


Is this a comical misquote or is the PDF format actually 60 years old?


It's about 30 years old - its creator, however, is said person.

The actual quote was from JFK iirc regarding the Apollo missions...


It's a comical, deliberate misquote.


Comical misquote, "Switch to PDF" replaced "Go to the Moon".


It's comical, but links to the founder of Adobe. IDK what the date alludes to.


JFK announcing the US would put a man on the Moon before the decade was over.


oh... yeah


From the PDF

> “But it’s just as easy to write self-contained HTML pages!”

> Sure, but if you’re going to hide CTF forensics challenges in your publication, a coverdisk allows you to do it in style!

I think it's not meant to be taken extremely seriously

