This must be a joke. LaTeX cannot even produce accessible PDF today. Believe me, I tried. I write my lecture notes in LaTeX and some of my students have disabilities. I tried to follow these guides https://tex.stackexchange.com/questions/261537/a-guide-on-ho... (note the Feb 2024 update). No matter how technically "correct" the LaTeX people want it to be, even a PDF produced by MS Word was more accessible for my students.
The tagging work is still highly experimental. A major missing element is equation tagging: for now, you need to produce an 'associated MathML' file externally, for instance using LaTeXML. Even then, PDF readers do not support the MathML tags yet! If anything, I am sure that the LaTeX3 team would appreciate you posting minimal examples of mistagged PDFs.
If you want to produce accessible documents from LaTeX, you should convert to HTML. ATs such as screen readers just work miles better than with PDFs, and given the resources put in developing browsers compared to PDF readers, I don't think this will change any time soon. Luckily conversion from LaTeX to HTML is very feasible today, as proved by arXiv. (Shameless plug: I maintain BookML specifically to help lecturers with the LaTeX to HTML work)
A 50-page PDF loads a lot faster and shows a lot smoother than an HTML of equal textual length. And I've never seen any modern tools that turn TeX into multifile HTML (one per section).
> A 50-page PDF loads a lot faster and shows a lot smoother than an HTML of equal textual length.
Very true! Although they are now comparable if you rely on the browser's native MathML support instead of MathJax rendering the LaTeX.
(You can test this on long arXiv HTML papers, e.g. https://ar5iv.labs.arxiv.org/html/1710.07304 is more than 60 pages as PDF. Mind you, the ar5iv default CSS is not great. I would use Latin Modern for formulas, at the very least.)
> I've never seen any modern tools that turn TeX into multifile HTML (one per section).
I believe all of them can do it out of the box now. I know for sure that LaTeXML, tex4ht and lwarp can split by chapter or section.
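For instance, with LaTeXML the split is a post-processing option (a sketch assuming a current LaTeXML install; check `latexmlpost --help` for the exact option names in your version):

```shell
# convert the TeX source to LaTeXML's intermediate XML
latexml notes.tex --destination=notes.xml
# post-process to HTML, splitting into one file per section
latexmlpost notes.xml --destination=notes/index.html --splitat=section
```

`--splitat` also accepts `chapter` or `subsection` if you want coarser or finer pages.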
Since you are familiar with the TeX StackExchange, perhaps you can ask there?
As you might be aware, the author of the message is David Carlisle, a very active member of TeX StackExchange (766k rep) and a maintainer of many popular packages.
Needless to say, the guide you followed is 8 years old; perhaps some things have changed.
> Since you are familiar with the TeX StackExchange, perhaps you can ask there?
I have. I don't want to link my HN identity to my SE identity (which is tied to my real world identity through mathoverflow) so I won't link here.
> Needless to say, the guide you followed is 8 years old; perhaps some things have changed.
As I said, it was updated in February 2024. It was written by Ulrike Fischer, who's also very involved in LaTeX3. It's basically the current state of the art in accessible PDFs. And it's still not good enough.
I wish it were otherwise, but latex is the pinnacle of design-by-committee and tech debt. Given how latex3 turned out, I'm very pessimistic. The only thing latex has going for it is inertia.
Is TeX Live 2024 producing tagged PDF out of the box? That would be awesome! If not, are there instructions/tutorials somewhere explaining the steps to follow?
> Is TeX Live 2024 producing tagged PDF out of the box?
Not yet. There is the Tagged PDF project [1], which aims to do something like that, and the tagpdf [2] package that tries to do some of these things. If you need good instructions, I would suggest the Overleaf tagging tutorial [3].
No, but it's getting easier and easier. To opt into this testing-stage functionality you need to add a few lines to your document, and may eventually have to work around some package incompatibilities. Take a look at https://www.latex-project.org/news/2023/03/13/latex-dev-1/ for more info.
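As a rough sketch of the opt-in (the `\DocumentMetadata` keys have changed between releases, so treat the exact values as illustrative and check the announcement for your TeX Live version):

```latex
% Must come before \documentclass; key values are illustrative
% and may differ in your LaTeX release.
\DocumentMetadata{
  pdfversion  = 2.0,       % emit PDF 2.0
  pdfstandard = ua-2,      % target the PDF/UA-2 profile
  testphase   = phase-3,   % enable the experimental tagging code
}
\documentclass{article}
\begin{document}
A paragraph that the engine can now tag automatically.
\end{document}
```

Whether a given package then cooperates with the tagging code is exactly the compatibility question the project is still working through.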
This work is ongoing. There will be a number of papers and a one-day workshop at the upcoming TeX Users Group meeting https://tug.org/tug2024/. People interested in the technical details may find the page of the working group useful https://tug.org/twg/accessibility/.
This sounds amazing! Some years ago I was involved with a project for a company where we tried to extract some data from some PDF files they had. But neither OCR nor my attempts at reconstructing order of text in the documents panned out in the end, and it all ended in tears :(
Perhaps one day in the future dealing with extracting data from PDFs will be less of a mess.
Tagged PDF is nothing new, it was added with PDF 1.4 in 2001, in order to provide an HTML-like structure in PDFs (potentially allowing PDF-HTML round-tripping). But a lot of software doesn’t bother producing it.
I am also baffled. I searched several repos for any concrete examples of what this all actually means. Nothing.
Maybe there are no source changes, and everything is handled by the engine, or maybe it introduces some extra \startread \endread tags without the need of an environment.
If you are interested in this work, you could look into joining or just supporting the TeX Users Group, https://www.tug.org, or your local users group. A lot of folks doing good things, including the ones who are doing this.
Thank you so much! Very nice to see that ISO was convinced to once again honor the promise (to Adobe) of no-cost access to the PDF standard, which held until PDF 1.7 but was broken with PDF 2.0.
However, TFA is not about PDF, it is about a profile called PDF/UA-2 aka ISO 14289-2. That one is $223 from the PDF Association[1] or 173 CHF from the ISO[2].
It's very early days for clients. I guess the support is not really there yet. However, most publishers and eprint archives have annoyingly settled on PDFs rather than HTML, so this is one of our best hopes for accessibility in academic publications.
PDF/UA makes PDFs accessible, which can otherwise be very hard to parse for screenreaders or display software that wants to reflow them (e.g. for reading them on small displays).
Without special hints (that are ignored by regular PDF readers or printers), PDF is essentially a vector graphics format, and any of these tasks amount to an exercise in OCR.
This is a somewhat little-known fact about PDFs, since many viewers do in fact implement many of these OCR-like heuristics to provide features such as text selection, search etc. that make it look a lot like a text-based format, but it really is a vector graphics format at heart. PDF/UA makes this a bit easier.
As an example, consider a multi-column layout, as is often used in scientific articles. PDF-creating software not concerned with accessibility might just intersperse all columns line by line (i.e. emit text in visual left-to-right, top-to-bottom order), but it could just as well achieve the same visual outcome by drawing column by column, in semantic reading order. Beyond encouraging that, I believe PDF/UA also defines a bunch of (invisible) metadata tags that readers can use to figure out the semantic structure of a document.
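For the curious, those "invisible tags" are marked-content operators in the page's content stream; a hand-simplified sketch (the operator names are from the PDF spec, everything else is illustrative):

```
% mark the following text run as marked-content ID 0, a paragraph
/P <</MCID 0>> BDC
  BT /F1 11 Tf 72 700 Td (First column, first paragraph) Tj ET
EMC
```

A separate structure tree in the document catalog then maps each MCID to a structure element (/P for paragraph, /H1 for heading, and so on) in logical reading order, and that tree is what an accessible reader walks instead of the drawing order.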
Sounds like this would massively simplify converting scientific articles back to HTML or text.
Seriously, I know of one experiment that has O(thousands) of papers as LaTeX source, which they publish as PDFs and which (more recently) they convert back to Markdown with an ML-based image-to-text engine. All so that it can be fed to a vector database to make searching easier, of course.
PDFs are a terrible format for machine readable information (and thus for reproducible science), but they are the currency of the scientific community and as such will be the standard output for the foreseeable future.
I'd argue it still is a terrible format for anything other than printing. It values form over function. I mean this makes sense for laying out papers to be published into a physical journal, but it doesn't make as much sense for other purposes.
I'd much rather have plain text, or markdown, or asciidoc than PDF or HTML. It works everywhere. And support for embedded mathematical typesetting is a solved problem.
I love ePub, but I also have to admit that the viewer experience for non-books is really not quite there yet.
Most viewers on iOS, macOS, Linux and Android that I've tried are very tied to the idea of all ePubs being ebooks and insist on managing them in their own library, don't optimize latency to first page rendered (because they usually index and cache a bunch of stuff for a "newly opened book", assuming you'll be reading it over the course of days/weeks) etc.
For example, on macOS, the default app associated with ePub is Apple Books, but to be able to use technical papers in ePub format, I really need something like Preview.app (and that doesn't support them). (In that way, the situation is very similar to JPEG XL: macOS supports them in quick view, but only Safari can actually open them...)
Not sure how it is supposed to look on other OSes, but this is barely a GUI application on macOS. (I have to launch it from the command line, and the UI and text rendering are extremely pixelated.)
It's definitely not up to par with any PDF viewer I've used so far, especially when it comes to quickly navigating between pages and chapters (crucial for working non-linearly with scientific papers etc.).
I think everyone on Linux uses it just as the backend inside zathura https://archlinux.org/packages/extra/x86_64/zathura-pdf-mupd... (yes, it's associated with `application/epub+zip`), and I can tell you that zathura feels very snappy. On Windows, mupdf is used by SumatraPDF, which is also stupidly fast and supports rendering EPUBs too.
Ah. Well, that's how I use it on Debian because I'm terminally addicted, but I find the rendering satisfactory. IIRC it doesn't automatically reload from disk when the document changes, for that I use Evince.
The Android version is a bit more than that, however: it has a simple file browser, and I settled on it because it handles large files better than whatever alternatives I went through at the time.
These features all sound great, but it's still extending PDF "upwards" from a quite low-level, representation-focused format.
For most content PDFs are used today, I'd much prefer extending something like ePub "downwards" with stylesheets, rendering hints etc. – that can all just be thrown out by an accessibility- or readability-focused viewer.
Yet, the PDF ecosystem is so large at this point, and so many organizations are still tied to skeuomorphic ideas focused around printed pages of text, that I can't see it happening anytime soon. I just hope that at least academia figures something out.
ArXiv's recent HTML efforts [1] are a great step in that direction, but as I understand it, they still use PDF as their main ingestion format.
In the future: better screen-reader accessibility for text (paragraph breaks), images (alt text), tables (control of reading order), and math formulae. It's up to the PDF readers too, though.