The sad thing is that the very complexity required to implement a screen reader exists solely because of the technical nightmare that information accessibility currently is.
It's reasonable to think that if only accessibility tooling could be made ubiquitous and everyone (= developers) could become collectively aware of these accessibility considerations, maybe there would be a shift towards greater semanticity.
Thing is, though, that NVDA is open source, and iOS has put a built-in free screen-reader in the hands of every iPhone user... and not much has changed.
So it's basically one of those technologies forever stuck in the "initial surge of moonshot R&D to break new ground and realign status quo" phase. :(
For my part, I would love to see PDFs which can seamlessly be viewed in continuous, unpaged mode (for example, for better consumption on a small-screen device like a phone or e-reader). Even just the minimal effort required to tag page-specific header/footer type data could make a big difference here, and I expect that type of semantic information would be useful for a screen reader also.
There already is a solution, based on html and very easy to deal with: epub. Publishers are gradually shifting to it, even for things like science textbooks, as they realize that PDFs might look pretty but they're a usability nightmare. There are still some stupid things publishers do with epubs, but they'll get it right eventually; meanwhile the main text and font settings can be overridden by ereaders.
Tagged PDF is a thing, and MS Word supports it. The problem is the very long tail of programs that generate PDFs and don't support tagged PDF. Even some widely used PDF generators, like pdflatex, don't generate tagged PDF, at least not by default.
You could also impose requirements on public companies to provide corporate documents in accessible formats; these sorts of documents are already regulated.
There are various levers that could be pulled; maybe those aren't the right ones. But you could do it.
This becomes difficult when writing Law 2, because you probably don't know all the variations that Law 1 can require (where Law 1 is actually a long list of laws and regulations).
Depending on the legal system of a particular country, writing Law 2 might not actually be feasible unless you know what Law 1 entails.
If I have a law saying that a layout needs to look like layout X, and layout X is not achievable if it also needs to be accessible, then depending on the type of legal system you are in you can say that Law Y supersedes all previous laws and requires you to make all PDFs accessible. If a legally mandated layout cannot be made accessible, then the closest similar layout that still meets accessibility requirements should be used instead (long descriptions of how "closest" is determined, in legal language, would of course follow).
I think this would work in a common law system, but I don't think it would work in a Napoleonic system of law (I could be wrong; I just think you would need to specify exactly which laws it superseded).
As an example: when I was part of the efaktura project for the Danish government (https://en.wikipedia.org/wiki/OIOXML), and the law was published mandating that all invoices to the government be sent as OIOXML efaktura, we ran into the problem that a previous law required that all telecommunications invoices (or maybe utility invoices, I forget) be sent with information for which there was no provision in the UBL standard that efaktura was based on.
Luckily I had introduced an extension method that we could use to put the extra data in, but otherwise we would have had two competing laws.
As far as mandating layouts in government PDFs goes, you normally see that kind of thing in military documents, but laws and bureaucracies are such that there might be any number of rules and regulations that cannot simply be overridden by drafting an overarching accessibility bill.
Accessible PDFs are also more easily machine readable.
ISO 32000-1 (PDF 1.7) and ISO 32000-2 (PDF 2.0) are both compliant with PDF/UA (ISO 14289), the standard for PDF accessibility.
You could just put the original text in comments or something, wrapped in more tags to say what it is.
There may already be good tools that do this, but until it's super easy to see and "everyone" knows about it, people won't think, "better just give it a quick look in the screen reader". Obviously a good next step towards more adoption is clear feedback about how to easily fix whatever issues the person is seeing...
HTML isn't perfect; it could be a lot better. But it goes a long way towards forcing developers to put their interface in pure text, and to do the visual layout afterwards.
I think separating content from styling has gone a long way towards improving accessibility, precisely because it makes the accessibility features less invisible. I suspect there are additional gains we could make there, and other ways that HTML could force visual interfaces to be built on top of accessible ones.
Well, they could have uploaded it to some online service that shows the document in the browser as an undownloadable but browsable thingy (I don't even know what to call that, and I am definitely not going to give any publicity to the online service by spelling out the name).
It also made me late for a Teams based interview, as I foolishly tried to pass the join link between phones using this webpage, and couldn't copy it on the other end.
The same app implements some seriously dark patterns to get users to install their messenger app. Said chat works in the desktop web browser but not in a mobile browser. I shall never install it on my phone. The mobile version presents as if you have pending messages even when you don't (the inbox is empty on the web version).
Everything is on Facebook instead of the real internet; a personal pet peeve...
If you are familiar with Docker, here is how you can add it to your Dockerfile.
RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
&& cd $HOME/pdf2json-$PDF2JSON_VERSION \
&& wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF... \
&& tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
&& ./configure > /dev/null 2>&1 \
&& make > /dev/null 2>&1 \
&& make install > /dev/null \
&& rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION \
&& cd $HOME
In all seriousness, though: it doesn't look like it was cut off; I think the final cd is just there to return the Docker builder to the home directory? Everything was already built and installed by that point.
People with disabilities who rely on screen readers don't love it. There is no such problem with HTML/CSS, which should be the norm for internet documents.
> With PDF, everyone sees the same thing.
Yes, provided you can see in the first place...
My bashful take is that nobody told the rest of the web development world that they aren't Facebook, and that they don't need Facebook-like technology. So everyone is serving React apps hosted on AWS microservices, filled in by GraphQL requests, in order to render you a blog article.
I am being hyperbolic of course, but I was taken completely off guard by how quickly we ditched years of best practices in favour of a few JS UI libraries.
And as a format, it's much more sane than, say, Word or Excel.
Even if the focus on "where do I put this glyph" means the original text isn't in there by default.
FWIW this is not a technical barrier; it would be absolutely trivial to associate blocks of non-flowed text with the laid-out text.
Why would I want a document type that can't even reflow to the display size to represent any longer written text that is supposed to be consumed on a digital device?
You like it for that very quality: immutable, reproducible rendering.
Those who have to extract data from PDFs face nearly the same problem as those who have to deal with paper scans: no reliable structure in the data, the source of truth is the optical recognition, by human or by machine.
"Because it's pretty" is it. 99% of people don't care about text being a data mining source.
I do wish they had focused a bit more on non-visual aspects such as screen-reader data, but to say the whole point is "because it's pretty" is a bit uncharitable. The format doesn't solve the problem you wish it solved, but it does solve a problem other than making things "pretty."
I quite like PDFs, but this thread has been an eye-opener.
I agree, I do too (LibreOffice), but for the opposite reason. Even internally, the font rendering in LibreOffice with many fonts is often quite bad. This is especially noticeable for text inside graphs in Calc.
If I'm going to read something lengthy that's a LibreOffice document, I open it (in LibreOffice), and export it to a PDF. LibreOffice consistently exports beautiful PDFs (and SVG graphs), which tells me that it "knows" internally how to correctly render fonts, just that its actual renderer is quite bad.
The other thing that is unfair is assholes who deliver tabular data in PDF format usually don’t want you to have it. When your county clerk prints a report, photocopies it 30 times, crumples it and scans to PDF without OCR, that’s not a file format issue.
I sometimes see people complain about how PDF sucks because it doesn't look quite the same everywhere (namely, in non-Adobe readers), but if you're not doing anything fancy it pretty much does. It is, at minimum, more reliable than any other "open" format I'm aware of, save actual images.
I know something about that area. Today, perhaps a 10th of CVs are sorted and prescreened by software. That fraction will only increase.
There are two issues with parsing them however.
1) PDF is an output format and was never intended to have the display text be parseable.
2) PDF is PostScript++, which means that it is a programming language.
This means that a PDF is also an input description to the output that we are all familiar with seeing on a page.
The big change that came with PDF was removing the programming capabilities. A PDF file is like an unrolled version of the same PostScript file. There is still a residue of PostScript left but in no way can it be described as a programming language.
A language is not a programming language due to the features it contains, but due to its being used to program. A program is "a series of coded software instructions to control the operation of a computer or other machine."
In this way, a PDF file embodies a program that performs specific tasks. A PDF file does not contain a general-purpose programming language, but it does contain the page description language of the output format, describing what is to be imaged. The PDF program is then given to an interpreter that displays the output.
This is the same as a simple program in turtle graphics to display a rectangle, even if no other language feature was used. In such a case, one would say that rectangle was programmed. We would not use the word program in connection with that turtle graphics program, if the rectangle description were not sent to an interpreter that displayed the rectangle.
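To make that concrete, here is a minimal, made-up sketch of what such a "program" looks like inside a page's content stream (wrapped in Python just to have something runnable). The operator names (BT/ET, Tf, Td, Tj) are real PDF operators; the font name, coordinates and text are placeholders:

import textwrap

# A hypothetical page content stream: the kind of "program" a PDF viewer
# interprets.  BT/ET bracket a text object, Tf selects a font and size,
# Td moves the text cursor, Tj paints a string of glyphs.  Note there is no
# "word", "paragraph" or "heading" anywhere, only positioned glyph runs.
content_stream = textwrap.dedent("""
    BT
      /F1 12 Tf
      72 720 Td
      (Hello, world) Tj
    ET
""")
print(content_stream)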
SVG does that too, but it can also have aria tags to improve accessibility and have text that can be extracted much more easily.
Mostly. I've seen issues where PDF looked fine on a Mac but not on Windows.
Also, the fact that you see the same thing everywhere is good if you have one context for looking at things, e.g. if everyone uses a big screen or if everyone prints the document, that's fine. But reading PDFs on e-book readers or smartphones can be a nightmare.
Why should they all see the same thing?
Using PDF here is self serving. It’s actively user-hostile.
With tagged PDF it's easy to get the text out.
There is a POC package for producing Tagged PDF here https://github.com/AndyClifton/accessibility but it's not a complete solution yet.
Here is an excellent review article that talks about all the other options for producing Tagged PDFs from LaTeX (spoiler — there is no solution currently): https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-a... via https://news.ycombinator.com/item?id=24444427
As in Healthcare.
A flattened XFA/XDP PDF (no interactive elements) is more or less a PDF/A.
Nothing is more annoying than having to manually search a document that has been exported from PDF and making sure you catch all the now-incorrect spellings where the ligatures have just disappeared: "action" -> "ac on", "finish" -> " nish".
As far as I understand, PDFs can be generated such that ligatures can be correctly cut'n'pasted from most PDF readers. I have seen PDFs where ligatures in links (ending in ”.fi”) cause problems, and I believe that's just an incorrectly generated PDF; ligatures done wrong.
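A small sketch of the "done right" case: if the generator maps the ligature glyph to the single Unicode ligature character (here U+FB01), the pasted text can be repaired with a standard normalization pass; the "done wrong" case, where the ligature maps to nothing, can't be:

import unicodedata

pasted = "\ufb01nish"   # "finish" pasted with the fi ligature preserved as U+FB01
print(unicodedata.normalize("NFKC", pasted))   # -> "finish"

broken = " nish"        # ligature dropped entirely: nothing left to normalize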
Considering that PDF is a programming language designed to draw stuff on paper, going backwards from the program to clean data is not something that one should expect to always work.
I play (tabletop) RPGs online; we use a simple, free rule system (Tiny Six), and any time I have to copy-paste a specific paragraph of the rules into chat I discover that there are missing characters (so I have to reread the block and fix it... in the end it would be faster to just read it aloud in voice).
Same might happen with scene or room descriptions taken from modules etc.
This is assuming you are cleaning up the output of `pdftotext`, which in my experience is the best command-line tool for extracting plain text.
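For anyone who hasn't used it: pdftotext ships with poppler-utils, and its -layout flag tries to keep the page's visual column layout, which helps a lot with multi-column documents. A rough sketch of the kind of pipeline I mean (the file names and the cleanup regexes are just illustrative):

import re
import subprocess

# -layout preserves the page's visual layout instead of emitting raw draw order.
subprocess.run(["pdftotext", "-layout", "rules.pdf", "rules.txt"], check=True)

with open("rules.txt", encoding="utf-8") as f:
    text = f.read()

# Illustrative cleanup only: rejoin words hyphenated across line breaks,
# and collapse long runs of blank lines.
text = re.sub(r"-\n(\w)", r"\1", text)
text = re.sub(r"\n{3,}", "\n\n", text)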
My company uses the services of, and has some sort of partnership with, a company that makes its business out of parsing CVs.
Recently we've seen a surge in CVs that, after parsing, return no name and/or no email, or where the wrong data is fetched (usually from referees).
So, out of curiosity, I took one (so far) PDF and extracted the text with Python.
Besides the usual stuff that is already known (the text is extracted as found; e.g., if you have two columns, the lines of text will appear after the line at the same level in the other column), what I found (obviously take this with a grain of salt, as this is all anecdotal so far) is that some parts of the document have spaces between the characters, e.g.:
D O N A L D T R U M P
P r e s i d e n t o f t h e U n i t e d S t a t e s o f A m e r i c a
These CVs tend to be highly graphical. Also anecdotally, the metadata in the CV I parsed stated it was from Canvas.
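In case anyone hits the same thing, a crude heuristic that handles lines like the above (very much a guess, not a general fix) is to detect lines made up almost entirely of single-character tokens and rejoin them, treating runs of two or more spaces as word breaks:

import re

def collapse_letter_spaced(line: str) -> str:
    """If a line is mostly single-character tokens, assume it was letter-spaced
    and rejoin it, guessing word breaks at runs of two or more spaces."""
    tokens = line.split()
    if tokens and sum(len(t) == 1 for t in tokens) / len(tokens) > 0.8:
        chunks = re.split(r" {2,}", line.strip())
        return " ".join(chunk.replace(" ", "") for chunk in chunks)
    return line

print(collapse_letter_spaced("D O N A L D  T R U M P"))   # -> "DONALD TRUMP"
print(collapse_letter_spaced("Regular resume line"))      # unchanged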
Consider that creating a PDF is generally just the layout software rendering into a PDF context — no different as far as it is concerned than rendering to the screen.
Spaces are not necessary for display (although they might help with text selection, so they are often present). It is not important that headers are drawn first, or footers last — so these scraps of text will often appear in unexpected places...
PDF has support for screen readers, but of course very few PDFs in the wild were created with this extra feature.
This means getting text out of PDFs requires rather sophisticated computational geometry and machine learning algorithms and is an active area of research. And yet, even after all that, it will always be the case that a fair few words end up mangled because trying to infer words, word order, sentences and paragraphs from glyph locations and size is currently not feasible in general.
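To see what "glyph locations and size" means in practice, here is a small sketch using pdfminer.six: the lowest-level objects the library hands you are individual characters with bounding boxes, and everything above that (words, lines, paragraphs, reading order) has to be inferred. The file name is just a placeholder:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

for page_layout in extract_pages("cv.pdf"):   # "cv.pdf" is a placeholder
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            for obj in line:
                if isinstance(obj, LTChar):
                    # All the format really gives you: a glyph, its box, its size.
                    print(obj.get_text(), obj.bbox, obj.size)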
Even if better authoring tools were to be released, it would still take a long time for these tools to percolate and then for the bulk of encountered material to have good accessibility.
This recent hn post is relevant: https://umij.wordpress.com/2016/08/11/the-sad-state-of-pdf-a...
I've also seen a similar situation, but in some ways quite the opposite --- where all the text was simply vector graphics. In the limited time I had, OCR worked quite well, but I wonder if it would've been faster to recognise the vector shapes directly rather than going through a rasterisation and then traditional bitmap OCR.
The use-case is basically one where there is a tabular PDF uploaded every week and the script parses it to extract the data. Thereafter a human interacts with the data. In such a scenario, every ~100 parses it fails and I'll have to patch the parser.
Sometimes text gets split up, but as long as the parent DOM node is consistent I can pull the text of the entire node and it seems to work fine.
PDF is being used wrongly. For information exchange, data should be AUTHORED in a parseable format with a schema.
Then PDF should be generated from this as a target format.
There is a solution: xml, xsl-fo.
I should add that our documents were not run-of-the-mill: they were grammars, and sometimes included non-Roman fonts (Bangla) and right-to-left text (Arabic and Thaana scripts). A lot of things came together at just the right time, like XeTeX (think Unicode-aware LaTeX) and good Nasta'liq fonts. Most people don't have those problems :-).
> There is a solution: xml, xsl-fo.
Meanwhile, PDF just worked-- until the first time you crack it open and see what's inside the PDF file. I'll never forget the horror after I committed to a time-critical project where I claimed... "Oh, I'll just extract data from the PDF, how bad could it possibly be!"
Bad programmers, as usual, frustrated by their own badness.
Today's coders want us to use Jackson Pollock Object Notation everywhere for everything.
> We ain't ever going back.
Not so, friend. ODF and DOCX are XML. And these formats won't become JPON anytime soon.
I think a good interview question would be, given a list of full names, return each person’s first name. It’s a great little problem where you can show meaningful progress in less than a minute, but you could also work on it full time for decades.
You see exactly this kind of probabilistic algorithm being used every day in huge production apps. E.g. how do you think Gmail shows something like "me .. Vince, Tim" in the left column of your email inbox?
The problem with probabilistic algorithms is that you trade predictability for getting it right more frequently (but not 100% of the time). E.g. I could match common Japanese family names or look for a .jp TLD, but neither of these guarantees that the family name comes first, and even less does their absence imply that they lead with the first name.
I imagine Google's algorithm is no more sophisticated than:
1. Are they in your contacts? Use their first name from your contacts.
2. Are they a Google user with a profile you can view? Use the first name they provided.
3. Use your locale to make a wild guess.
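For what it's worth, the naive baseline behind step 3 is basically "first whitespace-separated token wins", which is exactly what breaks for family-name-first locales, honorifics, mononyms, and so on. A rough sketch (not how Gmail actually does it):

def guess_first_name(full_name: str) -> str:
    """Naive, locale-free baseline: first token wins.  Wrong for
    family-name-first orders, honorifics, mononyms, etc."""
    tokens = full_name.strip().split()
    return tokens[0] if tokens else ""

print(guess_first_name("Tim Cook"))      # "Tim"
print(guess_first_name("Yamada Taro"))   # "Yamada", which is likely the family name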
Does anyone know of any attempt at this?
Blah blah blah transformer something BERT handwave handwave. I should ask the research folks. :-)
As an experiment, we once tried converting an OCRed dictionary (this one: https://www.sil.org/resources/archives/10969) into an XML dictionary database. (There are probably better ways to get an XML version of that particular dictionary, but as I say, this was an experiment.)
Despite the fact that it's a clean PDF, and uses a Latin script whose characters are quite similar to Spanish (and the glosses are in Spanish), the OCR was a major cause of problems: Treating the upside down exclamation as an 'i', failing to separate kerned characters, confusion between '1' and 'l', misinterpreting accented characters, and so on and so on. And for some reason the OCR was completely unable to distinguish bold from normal text, even though a human could do so standing several feet away.
So I did think of extracting the characters from the PDF. If it had been a real use case, instead of an experiment, I might have done so.
Write-up here: https://www.aclweb.org/anthology/W17-0112/
Many menus are also available in PDF form, so we're trying to figure out if it's worth bothering with the PDF itself, or if we should just render to image and thus reduce the problem to the menu-photo one.
And on top of this, many times there is no text data in the PDF, just a JBIG2 image per page.
"The sad state of PDF-Accessibility of LaTex Documents"
PDFs have accessibility features to make semantic text extraction easy... but it depends on the PDF creator to make that happen. It's crazy how best-case PDFs have identifiable text sections like headings... but worst case is gibberish remapped characters or just bitmap scans...
I still haven’t found a good way of paragraph detection. Court filings are double spaced, and the white space between paragraphs is the same as the white space between lines of text. I also can’t use tab characters because of chapter headings and lists which don’t start with a <tab>. I was hoping to get some help from anyone who has done it before.
The core is tabula-java, and there are bindings for R in tabulizer, Node.js in tabula-js, and Python in tabula-py.
My other complaint with Tabula is the total lack of metadata. It's impossible to know even what page of the PDF the tables are located on! You either have to extract one page at a time, or you just get a data frame with no idea which table is located on which page.
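The page-at-a-time workaround looks roughly like this with tabula-py (the file name and page count are placeholders):

import tabula

num_pages = 10   # placeholder: the PDF's page count, obtained however you like

# Ask for one page at a time so you at least know where each table came from.
tables_by_page = {}
for page in range(1, num_pages + 1):
    frames = tabula.read_pdf("report.pdf", pages=page, multiple_tables=True)
    if frames:
        tables_by_page[page] = frames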
Both are better than Tabula though.
Imagine that instead of parsing a PDF invoice, you just extract the SQLite file embedded in it, and it has a nice schema with invoice headers, details, customer details, etc.
Anything else in PDF would work nicely too: vector graphics, long texts, forms - all can be embedded as an SQLite file in a PDF.
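As a sketch of the mechanics (not a proposal for a real standard): PDF already supports embedded file attachments, so the hypothetical SQLite payload could ride inside the human-readable invoice, e.g. with pypdf; the file names here are placeholders:

from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
writer.append(PdfReader("invoice.pdf"))               # the human-readable rendering

with open("invoice.sqlite", "rb") as f:
    writer.add_attachment("invoice.sqlite", f.read())  # the structured payload

with open("invoice_with_data.pdf", "wb") as f:
    writer.write(f)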
(As a side note, Mozilla's reasoning is that no one knows how to turn SQLite into a backwards-compatible standard, not even the SQLite developers. So while they recognise what a fantastic feature WebSQL is, it is inadequate for the web platform, in the same way adding a Python interpreter to the browser would be.)