HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
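A minimal sketch of the inlining described above, using only standard shell tools. The file names `logo.png` and `page.html` are placeholders, and a stand-in byte string substitutes for a real image so the snippet is self-contained:

```shell
# Inline an image into an HTML page as a base64 data URI,
# and put the CSS in the page itself -- one self-contained file.
printf 'fake-image-bytes' > logo.png        # stand-in for a real PNG
b64=$(base64 < logo.png | tr -d '\n')       # strip newlines for the data URI
cat > page.html <<EOF
<html>
<head><style>body { font: 16px/1.5 serif; max-width: 40em; }</style></head>
<body><img alt="logo" src="data:image/png;base64,${b64}"></body>
</html>
EOF
```

The resulting `page.html` is a single file you can save, mail, or archive much like a PDF.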
* PDFs are files
HTML is files
* PDFs are decentralised
This should be "PDFs can be decentralised". PDFs aren't inherently any more decentralised than any other kind of file, including HTML.
The store is the thing that becomes decentralised, not the content.
* PDFs are page-oriented
HTML can be page-oriented. Simply build your website with pagination. PDFs can also be abused to have hugely long pages. Bad UX can be encapsulated in any medium.
Nope, PDFs are still objectively larger than the equivalent HTML. PDFs don't have any dynamic interaction, rip all that out and produce the HTML of yesteryear and your HTML will be tiny in comparison to the PDF.
Edit: I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?
No, I cut out some junk I don't need with the Printliminator bookmarklet, then I do a *print-to-PDF.* This gives me a file. I can save the file, back it up to my NAS, search for it later, keep it with other files from a project where it was useful, and otherwise hang onto it. This is so common, in fact, that it's gone from being an obscure thing you could do with a Postscript-to-PDF converter or (before the adware/Ask toolbar scandal) by installing the CutePDF virtual printer, to a standard feature. Modern OSes bundle a PDF printer, and print dialogs understand that you want to "Save as PDF". Google Docs and Office 365 editors allow downloading a document as a PDF.
I totally agree that a dynamic, interactive page or a comment section is not compatible with this model of usage. There's a lot of consumption of endless feeds, and a lot of one-time video views that also don't make sense to save as offline files. However, the web for creators, where people write articles that are worth hanging onto, has a definite place for PDFs.
Instead of doing a bad and lossy job of archiving the page myself, I notify† our friendly neighbourhood archivists at the Internet Archive of the page; and they then do the best, most lossless job of preserving the page that they're able, given their cumulative experience.
As a side-benefit, they also then take care of keeping the archive they've made around and available online in perpetuity, with no additional marginal effort on my part. The same can't be said for something in my own "private collection."
Just an FYI. If there are critical sites you want copies of, I'd recommend making your own copy. I've lost access to important pages / sites twice before taking this to heart.
Edited for clarity
Of course the Internet Archive serves other purposes for which it is (currently) irreplaceable.
Hopefully it really is around a very long time, but the world is unpredictable and things change. It's great to enhance the Internet Archive, but you can bet I'm keeping my local copy too. Just in case.
This is still not as powerful as my one, simple trick to handle all bookmarks, ever: Print to PDF.
I've been doing it since last century, and I have tens of thousands of PDFs of every single web page I've ever found interesting, sitting right there in a directory on my computer.
Including the suggestion that was brought up to use ripgrep to search in the pdf text content.
- In my experience, it's a little harder and rarer to make PDFs utterly incompatible with different means of viewing them, and it generally requires more overt (if perhaps slightly unintentional, at times) sadism to make that happen.
- PDFs can do some things HTML can't (easily, at least) with document design -- though those things are generally things that would be disallowed in our new "deurbanized" PDF-based web replacement.
Everything else that comes to mind goes the other way, including the fact that the viewing-mechanism incompatibility thing can be even worse with PDFs, even if it's more rare for that to happen at present, and if PDFs became the new standard for the web I'm pretty sure that relative rarity would evaporate anyway. Let's also not forget that HTML can also do some things PDFs can't (as easily, at least) do.
I'm too lazy, so I just tend to use SingleFile these days...
> Isn’t it a good thing that we enjoy rapid progress? To the extent that we get to enjoy things like YouTube and sandspiel, yes! But to the extent that we want the internet to be a place where we can work and live and think and communicate free of malware, surveillance, dark patterns and the insidious influence of advertising, the answer is, empirically, sadly, no. The web has become ad-corrupted hand-in-hand with growth in technological capability, and the symbiotic relationship between web and browser means they feed on each others’ churn. Ads demand new sources of novelty to put themselves on, so the web expands continually, the specs grow in complexity, the browsers grow in sophistication, the barrier to entry grows ever higher, the vast cost of it all demands more ad revenue to fund it... and thus the perpetual motion machine is complete.
The problem described is widely felt, and also widely discussed. We already know this stuff to be a problem. For the piece to be worthwhile, then, it should do something that is not present in the other instances where the topic has been raised. It should articulate (or at the very least exhibit, without necessarily articulating) a solution for us. It doesn't. A bad remedy to a genuine problem does not yield a solved problem.
- Publish in static file formats.
- Date and hash your work.
- Stop spying on your users.
HN is a discussion forum, not project planning software. Not everything has to "yield a solved problem". Are you really setting the bar at "design a technology stack for replacing HTML/CSS/JS"? That's way, way too high.
EDIT: Oh, yeah, and static file formats doesn't necessarily have to mean static document formatting when viewing -- unless you're using PDFs, which tends to break useful stuff like reflowing for paginated documents (one of the worst things about even simple PDFs).
The web has become a bad remedy to some distributed software problems.
Edit: If the article _was_ all about surveillance capitalism, then it wouldn't be worth upvoting as actionable solutions are much more valuable than preaching to the choir.
> Sure, you can write good HTML. I won’t argue with that. And if you’re writing good HTML, good for you. But HTML is a dual-use technology, the bad guys are dual-using it an awful lot, and I feel that the stone age still has a part to play in the progression of the information age.
The part where you engage with this is where you write:
> I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?
Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?
A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.
> Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?
I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.
Oh, yeah I'm not on the PDF train. That's wild. I'm more of a Markdown or Gemtext advocate, or even LaTeX.
> I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.
Yes, it's called Tor. However, legislation is where we should start. Crippling or abandoning an incredibly useful technology that works very well just because it's often used nefariously seems like a bit of an overreaction.
Until then, stop using social platforms, use an ad blocker, and use VPN if you really care about "surveillance capitalism".
Secondly, the level of difficulty in making HTML offlineable is many orders of magnitude simpler than your C analogy: there's really no comparison. For the OP we only need to make HTML documents that they have authored themselves offlineable and yet people have written general purpose tools to do this automatically for most webpages. This is not a hard problem.
TL;DR your analogy is absurd.
When I land on a page that's a PDF, I know certain things--I can easily save it and read it later. How do I know that? Not because I have read the PDF spec, or know that much about it, but because of my experience as a consumer of the web.
There are better ways to archive the content of even dynamic JS heavy pages, but they are not things that you learn as an average user of the web.
The reason offline utility tends to be true more often for PDFs is that PDFs are not generally regarded as the preferred online-default format of choice, which is in turn a matter of social effects rather than technical capacity. Reverse the socially accepted roles of the two document formats and watch the same complaints get made against PDFs as you're making against HTML. I'd bet money the "normal" state of affairs would remain the same in terms of the perceived benefit/detriment allocation between online/offline formats; only which format was considered which would have changed.
. . . but then all the web would be even heavier documents, and even less customizable for local viewing, thanks in part to that pagination and strict formatting situation.
The original HTML site was printable as PDF, and save-able as both HTML and "Web page, complete", all of which result in a well-formatted & readable offline experience. (It was also responsive: very readable on mobile, but that's an aside).
The new PDF site is not accessible to some, difficult to read on mobile, and interacts poorly with all of the norms web users are accustomed to (back navigation, anchors, etc.)
How important this is to users, or whether it is worth it is something I've not commented on, but it is a difference.
Even that usually sucks nowadays, because web developers don't care anymore. Probably 75% of the time before I do that, I have to go into the dev console to delete overlay elements that obscure content and garbage that will waste 10 pages (e.g. grossly oversized images, related article recommendations, etc.).
There was a time when most websites had a print view that gave you a simplified html page that worked well, but I think most of those are gone now. Now it's all some print "media-type" CSS that no one ever put the time in to do properly or keep up to date.
(I'm not conversant enough in the spec to know, but I do know that PostScript is Turing-complete, and I don't know that PDF isn't. At least HTML on its own certainly isn't—no recursion!—although all bets go out the window once you start layering other tech on top of it.)
What technologies exactly?
You can have absolutely everything you need inside the HTML. You can inline css, js, svg and images. What technologies you can’t inline?
It's just the declaration of ONE person, switching ONE site.
You have the same option with either HTML or PDF:
- PDF files can be dynamic or static, depending on how you write them.
- HTML files can be dynamic or static, depending on how you write them.
> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to their hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or, I believe, sometimes a browser-specific archive; it's been a long time since I tried, since it works so poorly), and even that may not have the content you want due to dynamic sites.
The idea that the whole web is going to pander to edge-case archivers is asinine. This whole conversation is about supporting the needs of the very, very few and romanticizing the time when only interesting people used the internet. It's kind of elitist and self-serving.
> even that may not have the content you want due to dynamic sites
But PDFs also don’t give you dynamic content. Nothing is stopping people from using HTML to serve static, JS-less content. In fact that’s what it was originally designed to do. All this web app stuff was bolted on afterwards, and it’s optional.
What do we accomplish by having some people switch over to PDFs? The people who don’t care about bloat will continue to not care about it. It’s not like thin content will become more discoverable or more common. It doesn’t really change incentives. The author says using PDFs makes it so you’re not tempted to add cruft to your sites but that’s not really a compelling argument.
Getting content creators to produce content without bloat is not really a technical problem. It’s a cultural and economic one. I don’t see how a file format addresses that.
Yes, it matters a lot. Word/Excel files are actually a zip archive containing many files and sub-directories. Can you imagine people working with exploded Word files, sending complete directory trees over mail and WhatsApp?
You can write PDFs to include resources that are not part of a single, self-contained file, and to be quite unfriendly with offline use.
Printing a page to PDF usually sucks: See https://news.ycombinator.com/item?id=27883028
Right Click > Save as
Try it with this page!
> Try it with this page!
Say hello to your new sidecar directory (or broken CSS/images/God knows what else)!
I tried to save an NY Times article, and it 1) needed JS to display anything, 2) even with the sidecar stuff was broken, 3) it was so plastered with ads and other junk I thought it was incomplete (it wasn't, I just had to scroll waaay down past something that looked like a footer and some voids after that).
If you save a PDF, you get that exact PDF on your hard drive, and when you open it (even in 10 years) it will look exactly the same as it did on the site.
With PDF WYSIWYS: What you see is what you save.
I'll preface by saying I have some expertise in HTML, but none in PDF (the format).
Most commenters who suggest that HTML is still a better alternative than PDF (I agree) are assuming that if this is an important issue to you, you would craft your page in a simpler style than most of what we see on the web, making Print to PDF or Save As... more viable.
> PDFs and a PDF tool ecosystem exist today. No need for another ghost town GitHub repo with a promising README and v0.1 in progress.
In general, we know that HTML is going to be much more compact (and compressible!) than PDF, and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.
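The compressibility point is easy to see for yourself: repetitive markup and prose shrink dramatically under gzip, which is what servers send over the wire anyway. A rough, self-contained illustration (`article.html` is a placeholder name):

```shell
# Generate a plain, text-heavy HTML body and compare raw vs gzipped size.
for i in $(seq 1 200); do
  printf '<p>Paragraph %s of a plain, text-heavy article.</p>\n' "$i"
done > article.html
gzip -9c article.html > article.html.gz   # roughly what a server would send
wc -c article.html article.html.gz        # gzipped copy is a fraction of the raw size
```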
PDFs can be tiny if they do not embed fonts. Serving fonts is very much a complex technology in HTML world.
Browsing the web is a pain in the ass if you don't use a browser compliant with up-to-date standards, but the whole "HTML can be lightweight" argument pretty much depends on avoiding much of today's standardisation. As an objection to the original argument, it is not comparing like with like.
> In general, we know that HTML is going to be much more compact (and compressible!) than PDF and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself by trying to copy and pasting the above quote: PDF formatting seems to be weird.
PDF is a display format. I once worked on a project parallel to a guy who was parsing PDF to extract text content. IIRC, text in PDFs is stored in a way that works fine for printing/rendering but not so well for manipulation (e.g. it's a bunch of commands to render line Z at position X,Y with font W). Those commands don't have to be in reading order, nor do they have the semantic meaning you can get from markup like HTML (e.g. superscript can be nothing more than a different line rendered in a smaller font).
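To make that concrete, here is a hand-written content-stream fragment in the style of the PDF spec's text operators (BT/ET, Tf, Td, Tj); real streams are usually compressed, and this example is illustrative rather than taken from an actual file:

```text
BT                      % begin text object
/F1 12 Tf               % select font F1 at 12pt
72 700 Td               % move to x=72, y=700
(Second line,) Tj       % draw a string -- order need not match reading order
0 14 Td                 % move the line start 14 units *up* the page
(First line.) Tj
ET                      % end text object
```

Nothing here says "paragraph" or "superscript"; an extractor has to infer all structure from coordinates and font sizes.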
IMHO, PDF is actually less optimal than HTML for what this guy is advocating, except that it's precisely those limitations that have prevented PDF from becoming the mess that Web HTML has. Though, that's probably in large part because the bloaters have been too distracted by the easier target that is HTML to bother.
I figured I could just save the page, automate a few edits to get around dynamic stuff, and then use it as, you know, an HTML document.
Even with a nice friendly mostly-text literary magazine, after about five hours I gave up and just copy-pasted the rendered text.
> Right Click > Save as
HN is not a good site to illustrate the unpleasantnesses of navigating the modern web. As you'd hope for a hacker news site, it is very friendly to this sort of thing. Most sites aren't.
Ctrl+P -> Save as PDF
You don't need the page to be a PDF to save it as a PDF.
Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.
The internet is plastic not because of HTML, but because of money and people. When you have teens driving content it's going to feel plastic. When Walmart uses the internet to sell you crap it's gonna be plastic. Gossip / social platforms are trash, no matter the medium.
It could be argued that TV is an incredible learning platform ruined by HD. Back in the standard definition days we had proper news, documentaries that were substantial, and no reality TV. We need to go back to black and white standard definition.
Sorry, but the PDF web is not a solution to societal rot.
It's actually more of a social observation: it doesn't matter what the technology can do; what matters is how the developers of that technology actually use it.
People who use PDF almost never use 3D graphics and heavy dynamic JS, so PDFs almost always have many of the qualities he's seeking.
Web developers almost never inline anything, and do all kinds of things that are arguably deal-breakers except for a few lowest-common-denominator use cases.
> Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.
The premise is that the web has failed in important and clear ways, it's impossible to fix so we should give up, so many use cases should abandon it for something else, and PDFs are unexpectedly well suited for that.
On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.
The solution to the identified problems is not to switch to PDFs. Stop reshuffling the chairs on the deck of your sinking ship, and start figuring out how to design, implement, and incentivize the use of, some means of conveyance other than iceberg-vulnerable ships.
> On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.
Java Applets were killed by Flash.
Not so surprising, really: the PDF standard evolved in parallel with Adobe's Flash between 2005 and 2010, which was then the key technology in Adobe's effort to keep a strategic toehold on the web. If Flash had not been a security clusterfuck, it might still be around. The PDF standard was always meant to be a complementary standard, and Adobe's attempted successor technologies have followed an even closer technological path.
The PDF standard has benefited from the fact that, unlike the W3C and WHATWG, surveillance capitalists have not been in the driving seat of its standardisation effort. Adobe's interests are not identical to those of the public, but they are not as essentially adversarial to them as the web standards bodies have been.
While you can produce identical outputs from the different methods, it's not hair-splitting to say that the authoring process and hence the nature of the medium to shape expression is affected by choosing one. When you opt towards maximizing generality your production cycle can grow without bound because everything is possible by layering different media, even if all of it is unnecessary. That's how you end up with creative projects that take multiple years to decades to accomplish.
This is close to it:
When you have teens driving content it's going to feel plastic.
Youth is the ultimate quality destroyer. They just fucking suck. I’m quite sick of their drivel honestly, and yet, we let them dictate the world (watch my childish cartoons, even in old age).
And the little shits complicate code bases. All you little rascals under 30, scram, I’m on to you.
And all you little adults acting like children, with your stupid motivational posts on LinkedIn, and your garbage bragging on there, I see you too.
I was skeptical at first, but I think the author made the point fantastically well.
Your browser has a zoom functionality that lets you make the text smaller, essentially replicating the PDF site above. Only the opposite of what you say is correct: I can’t read that PDF’s text without turning my phone into landscape and picking up my glasses.
I get what they're going for but the PDF is not exactly an accessible reading experience.
(EPUB is basically a subset of HTML with client-oriented context.)
There’s a reason responsive design has been a big deal for the last 10+ years and I don’t think the benefits of PDF are worth throwing it out.
If these all "miss the point", what is the point?
It seems to me that the article's point is that PDF as a format has attributes that satisfy the author's goal, whereas HTML does not. The parent comment says that HTML does have those attributes after all (if you choose to use HTML that way). That is very directly addressing the article's point, as I understand it.
Just a caveat to that statement, you can literally do interactive and dynamic 3D graphics rendering in PDFs:
You can also embed JS in PDFs:
Even the most simple interactive things can easily fail to work correctly, even in the more widely used PDF readers.
IMHO PDF is in many ways worse than HTML; it's just that those ways are less commonly used. But if you start a PDF-instead-of-HTML trend, it's just a matter of time until these "not so compatible" aspects of PDF become widely used by some people.
This guy is arguing that removing JS is what makes the web better. Having published, static, paper-like content is the way forward.
As someone who has had to extract data from large sets of PDFs and modern web presentation formats, I'm not a fan of either, really. Even verifying that a visibly presented string exists in a PDF document programmatically can be a non-trivial task, as with a given website as well. That to me says a lot.
For what it's worth, the same objection occurred to me. The use of scripting I've seen in PDFs has been use-supporting and consistent with their book-like feel.
I’m guessing your data set is made of scans with poor or no OCR.
> PDFs are discoverable. Search engines index them as easily as any other format.
What you’re talking about has nothing to do with that.
FWIW I deliver PDFs daily as an art director; not ideal, but they work in most cases. There's certainly nothing rebellious or non-commercial about them.
I built a tool for this exact purpose since the HTML specification and modern browsers have a lot of nice features for creating and reading documents compared to PDF (reflow and responsive page scaling, accessibility, easily sharable, a lot of styling options that are easy to use, ability for the user to easily modify the document or change the style, integration with existing web technologies, etc.). In general I would rather read an HTML document than the PDF document since I like to modify the styling in various ways (dark theme extensions in the browser for example) which may be hard to do with a PDF, but its more of a personal preference. Some people will prefer that the document adjusts to the screen size of the device (many HTML pages), and others will prefer the exact same or similar rendering regardless of the screen size (PDF).
Either way, kind of a fun idea making a website using just PDFs. Not the most practical choice, but fun nonetheless.
People understand PDFs, they are extremely common in the academic and business world as “digital paper” standalone documents. Hypothetically, anything in memory can be made into a file but in this scenario what matters is the practical goal of people actually using these files.
I think it makes sense for the web to be made up of discrete primitives not only so that the web can be browsed in an intuitive and frictionless way but also because it lends itself to being backed up and easily re-hosted.
Call to action
Publish in static file formats
Date and hash your work
Stop spying on your users
All this cannot be GUARANTEED by HTML/PDF/EPUB and requires active cooperation from the author. This is bad.
Oh, you are set for a world of surprises. Nearly every single one bad, but running our current web over PDFs is well within the specs.
- does not reflow, major suck
- is binary format, another major suck
So no thx, PDF is outdated tech, while HTML and friends are just abused.
And ancient HTML can still be easily read by modern browsers, so that's not exactly a special attribute of PDF either.
Sure - if the publisher cares. From the user's standpoint, the safe assumption is that they don't. Of course PDF is No Good for many contexts, but for any sort of long-form document that is primarily meant to be read, it's so often better.
Also, if something is available in pdf, I can be moderately sure that someone else took the time to make sure it would be formatted correctly and print out OK.* If it only exists in HTML it's more of a roulette wheel experience.
* Unless some graphic designer thought 'gee this report would look so cool if the cover pages were black or some other highly saturated block of solid color.'
HTML+JS today... now it's effectively a standard in name only, and Chrome is the new IE6. The standard is now "what has worked in the last stable release"
Now go to http://acid3.acidtests.org/ and see how the latest stable Chrome release can't render a decade-old CSS testcase.
My experience is that browsers are terrible with CSS pagination support in their display and printing directly.
The only place it seems to actually work is...saving as a PDF...
And you can have a single self-contained file with a webpage, it's called a "web archive", with .mhtml extension.
Is there a tool that does those two things (or at least the first one) and that can be used by non-programmers (command line use is fine, a Python library would not be)?
And how many websites today are anything like HN, in terms of relative simplicity, e.g., no images^1, 3rd party requests or ads, only a tiny bit of (gratuitous)^2 JS.
1. I do not participate in the voting scheme but I could vote from the command line if I wanted to. I use a text-only browser so the grey, fading-text gimmick is irrelevant. I see all comments and treat them according to the thinking, not the voting.
2. If we exclude the .ico and a .gif
There seems to be a double-standard, for lack of a better term, where many HN commenters and voters appear to work for companies that make websites with tracking and ads and various gimmicks targeted at "non-thinkers" which are nothing at all like HN. Whatever these commenters and voters see and appreciate in HN they are not working to bring it to the rest of the web. I seriously doubt they comment and vote on HN out of fear of so-called "power users" or a belief that the HN type of simplicity could become more popular and threaten their jobs that depend on surveillance, online ads and a non-thinking audience of "powerless" users. Rather, a more rational explanation might be that they see some value in a website that shows no ads and generally uses no gimmicks; that's something to think about.
The tracking section mentions the Abe Vigoda status page.
It reminds me a bit of a "newsletter" I'm subscribed to called, ironically, "Not a Newsletter" (http://notanewsletter.com/). You get an email from the author each month and it just points to a Google Doc where he puts the actual content. Why's this good? The content can't set off any spam filters, and he can edit the issue after it's "sent" if there are mistakes or broken links.
Files have none of these problems.
The readers would still need to trust the author's not doing anything nefarious with their IP addresses, but I guess there's a degree of implicit trust when subscribing to a newsletter.
No they're not? You literally can't have a google doc as a file in a first-class way - you can export it to a file, but that's a lossy process.
> PDFs used to be inaccessible
My eyes are not very good. I have trouble reading the font in the PDF. I am using Firefox. HTML lets me pick a font that I can read easily. I cannot do that with PDF.
> PDFs used to be unreadable on small screens, but now you can reflow them.
I am using Firefox. I cannot do that.
Realistically, how many years will I have to wait until Firefox catches up?
Over twenty years ago, I learnt Web authoring by examining the source which had a profound effect on my career. That serendipitous opportunity I had with human-readable sources will be lost to the next generation with PDF - they have to learn the technology deliberately.
I can empathize with the feeling that the web is incredibly bloated, but that's IMO throwing the baby with the bath water. Simple HTML with some optional CSS would do the job much better IMO (and can be easily downloaded, mirrored or offlined with tools like wget).
And if you really don't like writing HTML (I won't blame you) then there's always formats like markdown, org-mode and friends which can easily be converted to pretty much anything.
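A toy illustration of how mechanically simple Markdown maps onto minimal HTML; a real pipeline would use pandoc or similar, and the file names here are placeholders:

```shell
# Convert a trivial Markdown file to HTML with two sed rules:
# headings become <h1>, remaining plain lines become <p>.
cat > notes.md <<'EOF'
# Why files beat feeds
Plain text survives.
EOF
sed -e 's|^# \(.*\)|<h1>\1</h1>|' \
    -e '/^[^<]/s|.*|<p>&</p>|' notes.md > notes.html
cat notes.html
```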
Unless your system is a PDF library (as in, you make the black-box dependency that other systems use to handle PDF exports), everything you do with PDFs will be through some annoying black-box dependency that is a pain to use.
Even relatively complex HTML is much more fun to work with than PDF.
In Adobe Acrobat (and I’m guessing Adobe Reader): Choose View → Zoom → Reflow, and it turns everything into one column of nigh-unformatted text.
(Word looks like it may support it, but that could be more that it’s converted it to a Word document in some way and reflow-like functionality falls out of that naturally, though I imagine the tagging would help with the conversion; and someone in this thread mentions something called “Book Reader” supporting it.)
So did I. Now, it is impossible to reverse engineer the metric crapton of minified JS and CSS cryptoglyphics that comprise the modern web.
But I too wish the modern web was simpler. It took an evolutionary path of maintaining just enough backwards compatibility to only keep making things worse. Efforts like Gemini bring some hope but I'm afraid the medium won't be flexible enough for much beyond personal blogs. But maybe that's for the better.
: https://gemini.circumlunar.space; gemini://gemini.circumlunar.space
>Realistically, how many years will I have to wait until Firefox catches up?
They would do better to improve reflow for HTML on small devices first. Focusing on PDF is a waste of resources.
Generally though, I'm sympathetic to your point, and it's kind of like why zines regained popularity in the 90s (and samizdat in the Soviet Union before that)... controlling your own publishing is a powerful idea. Anyone can do that, though, without resorting to obscure formats, unless obfuscation is the point.
$> cat file.pdf | strings
$> strings file.pdf
$> strings < file.pdf
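All three invocations feed the same bytes to strings; the first just spawns a useless extra cat. A self-contained demo on a throwaway binary file:

```shell
# Make a small "binary" file with one embedded printable run
# (strings only reports runs of 4+ printable characters),
# then show the three equivalent invocations.
printf 'abc\0printable-text\0xyz' > file.bin
cat file.bin | strings   # works, but spawns an extra process
strings file.bin
strings < file.bin
```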
I had no idea what the content of the site was (besides the title from HN), and around the 50% download point I had already lost interest. I'm clearly not the only one who loses interest this quickly.
Also, as others have mentioned in root level comments, the design & layout of the content within is also severely lacking, which makes waiting for the load to occur even less worth it.
: https://www.pingdom.com/blog/page-load-time-really-affect-bo... (2018)
: https://blog.mozilla.org/metrics/2010/03/31/firefox-page-loa... (2010)
: https://www.thinkwithgoogle.com/marketing-strategies/app-and... (I know it's Google, but to be fair they have more data on this than most other companies, despite their obvious desire to sell more of their product/services related to it.)
I can deal with things moving around, I don't need spatial memory for that. Just give good titles, headers, and indexes. Again, we can do this with simple HTML, embed images and styles. It's all there.
Unfortunately, as I mentioned, people don't really publish information anymore. It's mainly for "experience" and for "looks". Marketing, and advertising, now drive the information era. The "Information Super Highway" is now just a crumbling road plastered with billboards. Most content is useless, and is there for clicks. Heck, I'd rather someone post their site in digests in e-book formats than PDF.
While it's possible to royally mess up accessibility in HTML, too, the chances of getting something usable are at least somewhat better.
In my time working with PDFs, I've found that generating them in ways that can be read with the most popular PDF readers is cryptic and difficult, and even parsing the ones made from the most popular creators is hard.
I would definitely not pick PDF over HTML in regards to how easy it is to implement a good reader or writer.
And there are plenty of authoring tools for HTML already, so the "ecosystem already exists for PDF" argument doesn't track either.
Even the complaint about churn makes no sense to me, because there's no need to upgrade your tools constantly. If you're using something that produces good HTML today, it'll produce good HTML in a decade, too.
OTOH, if you have a problem that could be automated, you're a lot more likely to be able to create that tool for HTML than PDF, and it's quite likely that someone else already has for HTML, but not PDF.
Neither of the PDF readers on my phone can read this PDF, so this is definitely an issue.
In both email and the browser, I'm already in a program that displays text, images and cool stuff. So then I'm just sent a link to someplace else that does the same thing?
So then what? Is it all just "pdf can do that too", but with extra steps...? I can print to PDF in most browsers if I want, but in this case it isn't a choice.
The idea that I might save and store the school emails or that website, and somehow manage those files, seems kind of self-important... I don't mean that as a personal attack - just that they imagine me taking the time to do that with their content, when it could have just been an accessible web page? How many people care to do that?
If I'm visiting a website I'm almost certainly not interested in saving your content / managing it... almost never.
I'm a little lost on the whole 'page-oriented' idea too. That's just a limitation of paper, and it's a pain / disruptive more often than not. Even the 'page oriented' section is broken up by the page and some extra text at the bottom of the page that is irrelevant to the paragraph...
A 'save to PDF' option might be nice to add if folks want it, or the user can just print to PDF...
I certainly get the argument, but using something like hugo or gatsby or jekyll when you want to avoid the "churn" also seems like a perfectly valid solution.
Of course, PDFs aren't necessarily static, either, but that is why Lab6 is choosing to use PDF/A, an actually static format intended specifically for long-term archiving of immutable files. This way you can sign the file and guarantee it stays the same forever, and everyone's copy is identical.
I'm kind of surprised at the response to this. The author seems well aware of how terrible PDF is as a format, and this isn't some treatise on why we should want to use it. It's an unfortunate compromise: given the requirements they're aiming to meet - generating a file that supports rich formatting and hyperlink embedding, but which can guarantee immutability and long-term archiving directly in the spec - PDF/A is all there is. So in spite of being a terrible format with a lot of shortcomings, it's what they're using.
But just like you can choose to use PDF/A, you can also choose to have a completely static and self-contained (e.g. using data URLs for images) HTML page.
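A sketch of the data-URL trick mentioned above; the "image" here is a stand-in file rather than a real PNG:

```shell
# Inline an asset into an HTML page as a base64 data: URL so the
# page makes no external requests. asset.bin stands in for an image.
printf 'pretend-image-bytes' > asset.bin
b64=$(base64 < asset.bin | tr -d '\n')
printf '<img src="data:image/png;base64,%s">\n' "$b64" > page.html
cat page.html
```

The resulting page.html is one file with everything embedded - the same self-containment PDF gives you, minus the format baggage.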
Nobody is requiring you to use PDF/A. No mainline browser (that I'm aware of) requires it.
So what is being solved? When I click on a PDF on the web, I don't know if it's using PDF/A, I don't know if it's embedding or linking its fonts. So it's the same situation, nothing has changed.
Telling people to use PDF/A when most clients do not enforce it and when there's no indication to users before they click on a link whether or not the link is following the spec -- it is exactly the same as telling them to use a subset of HTML; the author is doing the same thing they complain about.
You can't just say that PDF/A exists. That's not enough, how will you get people to restrict themselves to that format when 99% of their users will never notice the difference and no client is enforcing it?
Other than that PDF is quite clearly a less accessible format.
Pretty sure a PDF opened in the browser can't run any JS, but not completely sure. So you're right: I don't really know it for a fact. Poor choice of words.
The PDF 2.0 spec is damnably not public.
If you only allow PDF, then 99.9999% of the web doesn't work anymore.
I'm all for getting sites to be static, but PDF doesn't fix that because the problem has never been the technology used to build the site.
(It looks like at least some PDF readers have provided support for automatically displaying external images, for example)
I often use Tor, although I'm pretty sure that even then, a good analytics lib can see it's me based on scroll behaviour, mouse movement, time of day, and of course what I browse.
But yeah, you make a good point.
You might not be a unique fingerprint, but at best you are part of a group of somewhere between 3 and 1000 similar users.
Not to be a downer, but when I webscraped I learned that big corporations can spend money to fingerprint you.
Or a plug-in to Wordpress so you can keep the GUI/dynamic for the less technical employees:
 - https://www.w3.org/publishing/epub32/epub-spec.html#sec-intr...
In a manner of speaking, ePub as a design has an inherent built-in fallback mechanism for manually obtaining the internal content in case of failure - including the ability to try to repair a broken zip (zip -F/-FF) and grep it in place (zipgrep).
I consume the web mostly by following a few very interesting people on social media and following their links. As an author, my goal is to keep producing interesting enough material to be worth people's time reading.
As others have pointed out it's strictly worse than a static HTML site in many, many ways. At the same time though, it's a brilliant criticism of many of the worst aspects of the modern web.
This is art.
Feels like this is more about the fact that websites have become increasingly dynamic, unstable, unreliable, inconsistent, etc. - pdfs offer something like a book, static, stable, reliable and consistent.
Think about a book you can turn to a specific page no matter how many times you look at it and the print is the same, the information is the same, you can do the same action over and over again and get the same expected result.
Now imagine opening a book and you could have sworn that the chapter you wanted to reference was 11 but now it's 16 and the images are different, the examples are different, in fact the quote that you wanted to use for reference no longer exists in the book.
There's an insanity to this experience, but it's exactly what the web is like - a book that is constantly being changed, upended - even disappearing entirely. I could have sworn I had bought that book on discrete mathematics - how could it be gone? Oh, that's right: the server running the site is powered off - the book no longer even exists.
Same as someone else: to read this on mobile I'd have to download and open a PDF, so I just cancelled the download and ignored the link.
On top of that the end result is not very readable on mobile, the font is too small.
> On top of that the end result is not very readable on mobile, the font is too small.
Agreed on both counts. Was only commenting about browsers saving PDFs.
PDF is not a comfortable format for reading on a screen. Nor a comfortable format to extract text or data from.
We don’t need PDF sites, we need incentives for publishing acceptable websites.
Side note: I’d honestly love for the government to step in and outright outlaw some obvious and intentional dark patterns (example: California unsubscribe law)
Google is never going to make a change to its rankings that interferes with its real goal of 23% YoY revenue growth.
OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.
I do realize how ugly PDFs are to work with (I wrote my own PDF/A generator for issue 2). This is a Tagged PDF though, so you can extract text using standard tools.
To understand the mindset, have a read of the Gemini FAQ, specifically the answer to why not use a subset of HTML - and then read Issue 2 which is a hybrid Gemini+PDF polyglot, for people who don't like reading PDFs, which is apparently everyone on this thread :)
Issue 1 also moves beyond PDF, to try addressing some of the accessibility shortcomings by (a) prepending the content as plain text, and (b) recording myself reading the whole thing out and arranging the file as a polyglot MP3 and PDF file that can be played in an audio player as well as viewed in a PDF reader as well as a text editor.
A mini-FAQ to address some points elsewhere in the thread:
* No, it's not going to replace your blog or the web in general.
* Yes, it's an experimental art project / longitudinal CTF forensics tournament / weirdo personal blog.
* Yes, I'm serious anyway.
But I don't really know that your PDF website doesn't use some evil invisible PDF feature.
And I have to use a special Gemini browser to access Gemini pages. (Since an HTTPS bridge misses the point)
So why not use Dillo as my "Sane subset of HTML"? It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.
Actually, it is. I love Dillo, but it's very limited. I like to make my images "fluid" using the max-width and max-height CSS properties, and Dillo will not support those in any foreseeable future.
But again, I still love Dillo.
How do you create that demarcated space where PDF/A, PDF 2.0, and all other PDF versions can be mingled together, and there's no easy way to distinguish them?
Designers would thrive in a PDF environment instead of handing their designs over to implementation as it is now.
Maybe PDF is just the beginning and maybe a similar format can be thought up that addresses some of the concerns expressed here, and move over in time.
PDF is an open standard, which is freely available [2], and stable. It has a version number and many interoperable implementations, including free and open source readers and editors.
I'm basically in agreement, but the author has a good point that PDF is obviously self-contained and self-contained HTML pages are not necessarily distinguishable from those that aren't. Perhaps we might have to revisit MHTML or embrace Web bundles as an alternative to PDF.
On the other hand, there's nothing stopping you from using a double-barrelled file extension for denoting this sort of thing, e.g. "memex-opus.pub.html"; so long as it ends with something recognizable, double-clicking should still open it in the browser across all the usual platforms, AFAIK.
(I'm fond of using "xyzzy.app.htm" myself to take advantage of this trick for distributing simple, self-contained programs that are designed to run in the browser.)
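As a sketch, such a file is nothing more than one HTML document with everything inlined (the content here is obviously made up):

```shell
# Write a tiny self-contained "app" page: one file, no external
# resources, following the double-barrelled naming trick above.
cat > xyzzy.app.htm <<'EOF'
<!doctype html>
<title>xyzzy</title>
<p>Nothing happens.</p>
<script>document.title = 'xyzzy (' + navigator.userAgent + ')'</script>
EOF
```

Double-clicking the result opens it in the default browser, since the name still ends in a recognized extension.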
Note that PDFs can contain JS too.
That's why he says to use PDF/A, which can't contain JS.
Wait, why?!? When does it render? Who's supposed to have a js engine to do that? What version? How does it load dependencies? Is HTML and DOM carried along with it? So many questions.
Basically in the PDF world, Acrobat Reader is Chrome and everything else is, like, Konqueror or something. Don't be fooled into thinking PDF is a small spec. It's not.
Who? The PDF viewer.
When? Since about 2000 in PDF format version 1.3.
Dependencies? Hah, no such luck. You're stuck with ES5 and Adobe's crufty JS library. There is no HTML and DOM, there are however some pretty thorough PDF document bindings.
Completely agree. For instance, NASA's APOD site is a good example of something that'd be nontrivial using both an offline PDF and modern lightweight alternatives like Gemini, but works really well even without fancy modern design. Under 300kB including the image (HTML's under 6 kB) before gzipping.
The author is obviously making a statement, exploring ideas... not searching for an actual solution to his use case.
The actual quote was from JFK iirc regarding the Apollo missions...
> “But it’s just as easy to write self-contained HTML pages!”
> Sure, but if you’re going to hide CTF forensics challenges in your publication, a coverdisk allows you to do it in style!
I think it's not meant to be taken extremely seriously