
Maybe the author doesn't realize how difficult PDF is to work with. In PDF it's ambiguous whether any two spans of text belong together in the same sentence or paragraph. It can even be unclear where the spaces between words are. PDF also allows "optimizing" font usage in ways that make text unreadable without OCR-ing the custom font. The messy hacks go on and on:

https://filingdb.com/b/pdf-text-extraction

OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.
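
To make the word-spacing ambiguity mentioned above concrete: a PDF may position glyphs by coordinates without storing explicit space characters, so an extractor has to guess where words end. A minimal sketch in Python, assuming a hypothetical `Glyph` record and an arbitrary `gap_ratio` threshold (neither comes from any real PDF library):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    """One positioned character (hypothetical record, for illustration only)."""
    char: str
    x: float      # left edge, in arbitrary text-space units
    width: float  # advance width, in the same units


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.25) -> str:
    """Rebuild one line of text, guessing a space wherever the horizontal gap
    between two glyphs exceeds gap_ratio times the previous glyph's width."""
    out: List[str] = []
    prev: Optional[Glyph] = None
    for g in sorted(glyphs, key=lambda item: item.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")  # a guess, not something stored in the file
        out.append(g.char)
        prev = g
    return "".join(out)


line = [Glyph("a", 0.0, 5.0), Glyph("b", 5.0, 5.0), Glyph("c", 12.0, 5.0)]
print(join_glyphs(line))  # prints "ab c"
```

Move the last glyph one unit to the left (`x=11.0`) and the same heuristic returns "abc", which is why the choice of threshold matters and why different extractors can disagree on the same file.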




Hello. Original author here.

I do realize how ugly PDFs are to work with (I wrote my own PDF/A generator for issue 2[2]). This is a Tagged PDF though, so you can extract text using standard tools.

To understand the mindset, have a read of the Gemini FAQ[0], specifically the answer to why not use a subset of HTML - and then read Issue 2[2] which is a hybrid Gemini+PDF polyglot, for people who don't like reading PDFs, which is apparently everyone on this thread :)

Issue 1[1] also moves beyond PDF, to try addressing some of the accessibility shortcomings by (a) prepending the content as plain text, and (b) recording myself reading the whole thing out and arranging the file as a polyglot MP3 and PDF file that can be played in an audio player as well as viewed in a PDF reader as well as a text editor.

A mini-FAQ to address some points elsewhere in the thread:

* No, it's not going to replace your blog or the web in general.

* Yes, it's an experimental art project / longitudinal CTF forensics tournament / weirdo personal blog.

* Yes, I'm serious anyway.

[0] https://gemini.circumlunar.space/docs/faq.gmi

[1] https://lab6.com/1

[2] https://lab6.com/2


> The problem is that deciding upon a strictly limited subset of HTTP and HTML, slapping a label on it and calling it a day would do almost nothing to create a clearly demarcated space where people can go to consume only that kind of content in only that kind of way. It's impossible to know in advance whether what's on the other side of a https:// URL will be within the subset or outside it. It's very tedious to verify that a website claiming to use only the subset actually does, as many of the features we want to avoid are invisible (but not harmless!) to the user

But I don't really know that your PDF website doesn't use some evil invisible PDF feature.

And I have to use a special Gemini browser to access Gemini pages. (Since an HTTPS bridge misses the point)

So why not use Dillo as my "Sane subset of HTML"? It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.


> It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.

Actually, it is. I love Dillo, but it's very limited. I like to make my images "fluid" using the max-width and max-height CSS properties, and Dillo will not support those in the foreseeable future.

But again, I still love Dillo.


> would do almost nothing to create a clearly demarcated space

How do you create that demarcated space where PDF/A, PDF 2.0, and all other PDF versions can be mingled together, and there's no easy way to distinguish them?


I don't like reading PDFs and probably wouldn't read much of your website like that... but I appreciate the intervention drawing our attention to the advantages of PDFs in the disadvantaged present environment, which I think are real and worth thinking about. It seems almost like an artistic project. I'm not mad at you, and am not sure what makes some people seem to be so mad here (probably means you were successful at something)... but I'm still not gonna read it, PDFs are a mess to read!


I've spent entirely too much time "printing" sites and articles to PDF to save them to read or reference later. Your PDF style was perfect! No need to fuss with anything, just save it!


This thread might be helpful to you https://news.ycombinator.com/item?id=27817659


I think the idea of PDFs opens up many new possibilities, and your work is quite an eye opener. Design is largely missing from websites - it’s the same design over and over when it comes to optimizing for clicks.

Designers would thrive in a PDF environment instead of handing their designs over to implementation as it is now.

Maybe PDF is just the beginning, and a similar format can be thought up that addresses some of the concerns expressed here, so that we can move over to it in time.


Case in point: copy-pasting a paragraph from his PDF-website adds line breaks everywhere. It also loses formatting (bold/italics) and the footnote superscript doesn't translate.

  PDF is an open standard, which is freely available2, and stable. It has a 
  version number  and many interoperable implementations including 
  free and open source readers and editors.

I think ease of copy-pasting is one of the coolest things about the document-centric roots of the web (along with the back button and hyperlinks; in other words, hypertext rules), although the modern web does break it (along with the back button and hyperlinks) in many places, so I can see where he is coming from. PDFs aren't the answer, though.


> OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day.

I'm basically in agreement, but the author has a good point that PDF is obviously self-contained and self-contained HTML pages are not necessarily distinguishable from those that aren't. Perhaps we might have to revisit MHTML or embrace Web bundles as an alternative to PDF.
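
As a rough illustration of why self-containment is hard to see at a glance, here is a minimal Python sketch that lists resource references an HTML page would fetch from elsewhere. The `RESOURCE_ATTRS` table and the `external_refs` helper are assumptions made up for this example; the list is far from exhaustive (it ignores `srcset`, CSS `url()`, `@import`, and anything loaded by scripts), so a clean result does not prove a page is self-contained.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# (tag, attribute) pairs that commonly pull in a separate resource.
# Illustrative assumption only, not an exhaustive list.
RESOURCE_ATTRS = {
    ("img", "src"), ("script", "src"), ("link", "href"),
    ("iframe", "src"), ("video", "src"), ("audio", "src"), ("source", "src"),
}


class ExternalRefFinder(HTMLParser):
    """Collects resource references that are not inlined as data: URIs."""

    def __init__(self) -> None:
        super().__init__()
        self.external: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if (tag, name) not in RESOURCE_ATTRS or not value:
                continue
            if value.strip().lower().startswith("data:"):
                continue  # embedded in the page itself
            self.external.append((tag, value))


def external_refs(html: str) -> List[Tuple[str, str]]:
    finder = ExternalRefFinder()
    finder.feed(html)
    finder.close()
    return finder.external


page = (
    '<img src="data:image/png;base64,AAAA">'
    '<img src="https://example.com/a.png">'
    "<script>console.log(1)</script>"
)
print(external_refs(page))  # prints [('img', 'https://example.com/a.png')]
```

Ordinary `<a href>` hyperlinks are deliberately left out, since linking to another document does not make the page itself depend on it.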


You want PWP <https://blog.jonudell.net/2016/10/15/from-pdf-to-pwp-a-visio...> (Later aborted, and the group's work was rolled into EPUB3. As you note, there remains a genuine need for it.)

On the other hand, there's nothing stopping you from using a double-barrelled file extension for denoting this sort of thing, e.g. "memex-opus.pub.html"; so long as it ends with something recognizable, double-clicking should still open it in the browser across all the usual platforms, AFAIK.

(I'm fond of using "xyzzy.app.htm" myself to take advantage of this trick for distributing simple, self-contained programs that are designed to run in the browser.)


This is what PWAs are kind of for.


It's not even JS. I'd argue an HTML + inline JS page is a lot more self-contained than one with external images, videos and fonts.

Note that PDFs can contain JS too.


> Note that PDFs can contain JS too.

That's why he says to use PDF/A, which can't contain JS.
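
For the curious, a very naive way to spot scripted PDFs is to scan the raw bytes for the `/JavaScript` and `/JS` name tokens that appear in JavaScript action dictionaries. The sketch below is a heuristic under stated assumptions, not a PDF/A check: the `looks_scripted` name is made up for this example, the scan will miss markers hidden inside compressed object streams or written with `#xx` hex escapes, and it can report false positives when the same bytes occur in unrelated data. Real conformance checking needs a dedicated PDF/A validator.

```python
import re

# Name tokens used by JavaScript actions in PDF dictionaries. The trailing \b
# keeps /JS from matching longer names that merely start with "JS".
JS_MARKERS = re.compile(rb"/JavaScript\b|/JS\b")


def looks_scripted(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain an uncompressed JavaScript marker."""
    return JS_MARKERS.search(pdf_bytes) is not None


scripted = b"<< /Type /Action /S /JavaScript /JS (app.alert(1)) >>"
plain = b"<< /Type /Catalog /Pages 2 0 R >>"
print(looks_scripted(scripted), looks_scripted(plain))  # prints "True False"
```

To try it on a file, read the bytes with `open(path, "rb").read()` and pass them to `looks_scripted`.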


> Note that PDFs can contain JS too.

Wait, why?!? When does it render? Who's supposed to have a js engine to do that? What version? How does it load dependencies? Is HTML and DOM carried along with it? So many questions.


Why - because scripting is useful. A big use of PDFs is translating paper forms into digital forms without needing to make a web app out of them. JS is used for client side validation, same reason it was put into browsers. Acrobat can handle this along with many other features that most PDF readers can't handle properly.

Basically in the PDF world, Acrobat Reader is Chrome and everything else is, like, Konqueror or something. Don't be fooled into thinking PDF is a small spec. It's not.


Why? To validate form fields.

Who? The PDF viewer.

When? Since about 2000 in PDF format version 1.3.

Dependencies? Hah, no such luck. You're stuck with ES5 and Adobe's crufty JS library. There is no HTML or DOM; there are, however, some pretty thorough PDF document bindings.


Or... AMP? But no, Google made that so it must be a bad idea.


MHTML, which is basically HTML email.


> it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.

Completely agree. For instance, NASA's APOD site[1] is a good example of something that'd be nontrivial using both an offline PDF and modern lightweight alternatives like Gemini, but works really well even without fancy modern design. Under 300kB including the image (HTML's under 6 kB) before gzipping.

[1] https://apod.nasa.gov/apod/astropix.html


The author addresses this: “We choose to switch to PDF in this decade, not because it easy, but because it is hard” – John F. Warnock, September 12th 1962

The author is obviously making a statement, exploring ideas... not searching for an actual solution to his use case.


Yeah, it's kinda embarrassing that the one quote that gets pulled out in the HN commentary is the one that contains a typo. It's OK: Issue 1[0] contains a patch to fix the issue.

[0] https://www.lab6.com/1


Is this a comical misquote or is the PDF format actually 60 years old?


It's about 30 years old - its creator, however, is said person.

The actual quote was from JFK iirc regarding the Apollo missions...


It's a comical, deliberate misquote.


Comical misquote, "Switch to PDF" replaced "Go to the Moon".


It's comical, but links to the founder of Adobe. IDK what the date alludes to.


JFK announcing the US would put a man on the Moon before the decade was over.


oh... yeah


From the PDF

> “But it’s just as easy to write self-contained HTML pages!”

> Sure, but if you’re going to hide CTF forensics challenges in your publication, a coverdisk allows you to do it in style!

I think it's not meant to be taken extremely seriously



