> The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, and would accrue at least as much profit to any company that can solve it. But there are ... like zero dollars going into this problem.
Converting PDFs to HTML well is a very hard problem, but it is hard to build a very big company on that alone. When processing PDFs, or documents generally, the value is not in the format; it's in the substantive content.
The real money is not in going from PDF to HTML, but in going from HTML (or any doc format) to structured knowledge. There are plenty of companies trying to do this (including mine! www.docketalarm.com), and I agree it has the potential to be as big as self-driving cars. However, technology to understand human language and ideas is not nearly as well developed as technology to understand images, video, and radar (what self-driving cars rely on).
That problem is much more difficult to solve than building safer-than-human self-driving cars. If you can build a machine that truly understands text, you have built a general AI.