(Used by many journalists to analyze the data in PDFs)
We have digital formats, and we decided to standardize document distribution on the one that makes it as hard to extract data as if it were on physical paper.
Sure, PDF is light on context clues for automation and is targeted purely at humans. But formats targeted at both computers and humans consistently fail (XML with accompanying XSLT comes to mind), and/or only have terrible tools for creating files (easily parsable, pretty HTML).
Either there is very little real demand or we consistently fail at making alternatives viable.
Technically it does that, yes. Rarely do I see people taking advantage of it, though; most of the times I've tried to copy some text out of a PDF, the result had to undergo significant cleanup before it was usable.
It really is a problem with PDF that it's too easy to get a file where copy and paste yields a different result than what's displayed. But this varies widely with the software used to create the file (e.g. LaTeX ligatures never work in copy-paste).
When people write text, data, and tables on a computer and then put them into a paper-like format to share, that's when the problem happens.
If you designed a real bad format on purpose, it would be hard to top PDF. Maybe Photoshop files are worse.
This should output a single file, and ideally it should have some way of assuring the author that it can't be modified unnoticed (that's one of the features ordinary people use PDF for today: they think it is something no one can modify) -- perhaps signing it with a key from Keybase would work in the mid term.
https://github.com/iffy/lhtml has something going in this direction.
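To make the tamper-evidence part concrete, here's a minimal sketch of one way to get that property: a detached Ed25519 signature over the file, using Python's cryptography package (the Keybase part, i.e. binding the key to a public identity, is left out; the file name is hypothetical):

    # Minimal sketch: detached signature over a document so that any later
    # modification is detectable. Assumes the `cryptography` package.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    author_key = Ed25519PrivateKey.generate()   # in practice, a key tied to the author's identity
    document = open("report.pdf", "rb").read()  # hypothetical file name
    signature = author_key.sign(document)       # shipped alongside the document

    # Anyone with the author's public key can check the file was not modified:
    try:
        author_key.public_key().verify(signature, document)
        print("document matches the signature")
    except InvalidSignature:
        print("document was modified after signing")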
PDF is the successor to PostScript. PostScript is a stack-based programming language where anything can happen, while PDF enforces some document structure and metadata on top of it, so you can e.g. at least determine where page breaks are, without having to interpret ("run the code of") the whole document.
Still, PDF is simpler than PostScript in the same sense that XML is a simplification of SGML. Jumping from PDF to a well-designed format would be like jumping from XML to JSON or S-Expr.
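To make the page-break point concrete: because the page tree and document info are plain data rather than code, a parser can read them without rendering anything. A minimal sketch, assuming the pypdf library and a hypothetical file name:

    # Minimal sketch: read page boundaries and metadata straight from the
    # PDF's object structure, without rendering a single page. Assumes pypdf.
    from pypdf import PdfReader

    reader = PdfReader("example.pdf")   # hypothetical file
    print(len(reader.pages))            # page breaks come from the page tree
    print(reader.metadata)              # the /Info dictionary (title, author, ...)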
That's most likely why copying and pasting sucks too.
How do you imagine a better alternative to PDF? On the one hand, we have text-based formats, which are not a serialization of the exact rendering. On the other hand, we have PS, which is probably too complex to be manipulated as text once rendered. PDF and DjVu kind of do both, even if quite imperfectly.
So how do we construct a file format which can render a symbol (not necessarily a Unicode one) anywhere, pixel-perfect, but still has a concept of words, paragraphs, and preferably tables and such?
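Purely as a thought experiment, one could imagine pairing exact glyph placement with the logical structure it belongs to. Something like this entirely hypothetical sketch (nothing like it is standard):

    # Entirely hypothetical sketch of a format that keeps both the exact
    # rendering and the logical text structure; glyph lists are truncated.
    page = {
        "size_pt": [595, 842],
        "paragraphs": [
            {
                "text": "Hello world",   # the logical, copy-pastable text
                "words": [
                    {"text": "Hello",
                     "glyphs": [{"char": "H", "x": 72.0, "y": 760.0, "font": "F1", "size": 11}]},
                    {"text": "world",
                     "glyphs": [{"char": "w", "x": 103.5, "y": 760.0, "font": "F1", "size": 11}]},
                ],
            },
        ],
    }

Rendering would walk the glyphs; extraction would walk the text fields, so copy-paste never has to guess.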
I also think that, with more love, epub could get there too. It's not an easy problem, but if we can crack SHA-1 I'm sure we can crack this one too :)
Yes, (al)pine is my mailtool in 2017.
Though I can generally open attachments just fine, this text rendering of PDFs would be useful when I'm SSH'd into my home machine and reading stuff remotely (usually from work, where I don't want to download my personal email).
It's easy to use, pluggable, and faster than any GUI I've touched.
I had not considered the PDF issue; I just open them with an external application. The potential of reading them within Alpine hadn't occurred to me, but now that it has, I want it!
application/pdf; pdftotext %s /tmp/pdftxt && less /tmp/pdftxt
That has a -layout option that works really well sometimes and really terribly other times. It doesn't seem to be related to document complexity either.
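For reference, driving it from a script looks roughly like this (the file name is hypothetical; "-" sends the text to stdout):

    # Rough sketch: run xpdf/poppler's pdftotext with layout preservation and
    # capture the text. Assumes pdftotext is on PATH.
    import subprocess

    result = subprocess.run(
        ["pdftotext", "-layout", "report.pdf", "-"],  # "-" writes the text to stdout
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)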
As part of this work, I communicated over a period with one of the key technical people at Glyph & Cog, the company behind xpdf. I learned from him about some of the issues with text extraction from PDF, one of the key points being that in some or many cases the extraction can be imperfect or incomplete, due to factors inherent in the PDF format itself and its differences from plain text. PDFTextStream (for Java) is another one I had heard of, from someone I know personally, who said it was quite good. But those inherent issues of text extraction do exist.
So wherever possible, a good option is to go to the source from which the PDF was originally generated, instead of trying to reverse-engineer it, and get the text you want from there. Not always possible, of course, but a preferred approach, particularly for cases where maximum accuracy of text extraction is desired.
Not to be confused with xtopdf, my toolkit for generating PDF from other formats.
Most of that was fairly easy, except that the POS program that sits on the actual data only allows exporting it in a single format: PDF. Converting that PDF file to a CSV that I could feed into SharePoint was one of the nastiest things I did last year. I did manage to get it to work, though, by toying around with pdftotext for a while and exploring its command line parameters.
It was a pleasure to use! It took me a while to discover the correct set of command line parameters I needed, but I got it to work! Thanks, xpdf!
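I don't know the exact parameters that ended up working there, but that kind of pipeline usually boils down to preserving the layout and splitting columns on runs of spaces. A rough sketch with hypothetical file names; real exports usually need extra cleanup:

    # Rough sketch of a pdftotext -> CSV pipeline. Assumes the -layout output
    # separates columns with runs of two or more spaces.
    import csv
    import re
    import subprocess

    text = subprocess.run(
        ["pdftotext", "-layout", "export.pdf", "-"],
        capture_output=True, text=True, check=True,
    ).stdout

    with open("export.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for line in text.splitlines():
            if line.strip():
                writer.writerow(re.split(r"\s{2,}", line.strip()))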
Is this something that could be combined with those OCR engines? (e.g. TesseractOCR...)
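It can be, at least in the straightforward way: if a PDF has no usable text layer, rasterize the pages and feed the images to the OCR engine. A rough sketch, assuming poppler's pdftoppm and the tesseract CLI are on PATH (file names are hypothetical):

    # Rough sketch: rasterize a scanned PDF with pdftoppm, then OCR each page
    # with tesseract, which writes <output base>.txt for every image.
    import glob
    import subprocess

    subprocess.run(["pdftoppm", "-r", "300", "-png", "scan.pdf", "page"], check=True)

    for image in sorted(glob.glob("page*.png")):
        subprocess.run(["tesseract", image, image.rsplit(".", 1)[0]], check=True)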
monetize via Google AdWords.
EDIT: I'm not sure why I'm being downvoted. I am not suggesting serving PDFs. I am suggesting serving tiny text renders of top sites that are otherwise much too bloated.
The hard part is getting the text and layout right. Many people read many sites just for the text, IMO.
So I am suggesting you make an all-text version.
As an example, the front page of the New York Times right now, copied into Microsoft Word, is 2,504 words. When I save the Word document I copied it into as .txt, I get a 16.4 KB file.
By comparison, when I put the site into a Page Size Checker -- http://smallseotools.com/website-page-size-checker/ -- I get 214.23 KB. That is impressively small, and it's a fast page.
If I try their competition, the Washington Post, I get 237 KB. If I try the Wall Street Journal, I get 938.15 KB -- nearly a full Megabyte. (This is actually more what I was expecting - I'm impressed by the Times.)
Suppose someone desperately wants to glance at the Wall Street Journal over a poor connection where they barely get data. The difference between 12 KB and nearly a megabyte is huge. At the roughly 3 KB/s those figures imply, it's the difference between 4 seconds and 312 seconds: 4 seconds as compared with over 5 full minutes.
So in my opinion there is a large need for such a service, for anyone who desperately wants to see a text render. Preserving any formatting at all helps hugely.
1) Copyright: re-serving the complete content of the top 100 sites with your own ads does not fall under fair use and would almost certainly be a magnet for lawsuits.
2) Distribution: how do you find your niche of people with poor internet connections and get them to use your mirror instead of whatever site it is they want to read?
This isn't legal advice, just the approach I would use off the top of my head. I agree with you that it's hard. With the framing "Opera minifier/Turbo-fier as a service" it could work, though. Like a remote browser (in a VM). Like, present it as "Lynx as a service" (Lynx being an old terminal-based text browser). Something like that, anyway.
I tried to extract text from a PDF that already had searchable text, which could be copy-pasted. This should be the easiest task of all, but it made mistakes in every second word.
Then I asked the website to turn a PDF into a Word file. It just inserted the whole PDF as a picture in Word.
Really? I'm pretty sure that's not the way this works.
See e.g. my file sharing app https://github.com/andreif/SecretFile
The deja vu made me squint for a minute.
PS: PDFBox is nice.