It's a pity that this comparison was quite narrow - there is actually a lot of competition in the data extraction cloud services space nowadays.
Since the outcome of that comparison is that nothing is really great, the best course of action would be "look further". :-)
(Speaking on behalf of one such service - Rossum.ai. We actually extracted data from a dataset in the same domain as the article, 25000 television ad invoices, in November with 94% dollar-weighted accuracy. It was a project together with eVentures data: https://rossum.ai/blog/presidential-campaign-spend-analysis/)
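To clarify the metric, since "dollar-weighted" can be ambiguous: it weights each invoice by its monetary amount rather than counting all documents equally. A minimal sketch of that kind of metric (a hypothetical helper, not our actual evaluation code):

    # Hypothetical sketch: weight each invoice by its dollar amount
    # instead of counting every document equally.
    def dollar_weighted_accuracy(invoices):
        """invoices: iterable of (amount, extracted_correctly) pairs."""
        total = sum(amount for amount, _ in invoices)
        correct = sum(amount for amount, ok in invoices if ok)
        return correct / total if total else 0.0

    # One $10,000 invoice read correctly outweighs ten botched $100 ones:
    print(dollar_weighted_accuracy([(10_000, True)] + [(100, False)] * 10))  # ~0.91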
Very interesting review, thanks for the read! If I may, I've had some different experiences.
I work for one of the biggest supermarket chains in the US as part of the team implementing an invoice processing capability for the enterprise to utilize. We literally take in thousands of paper/non-digitized invoices a day, and in our testing have found Azure's Form Recognizer (AFR) to be very dependable and accurate. I have also professionally used Google's Form Parser and ABBYY's OCR engine, though not ABBYY's cloud offering.
> it's also the only service fast enough to be part of a synchronous pipeline.
I assume what you're talking about here is exposing the processing capability and its response as part of a tool used directly by a person. Aside from the occasional one-off edge case, we've never seen the value in building for this. When talking about form processing, the real goal of any enterprise is to get the invoice data into their system of record, where it can be validated, addressed, and maintained. This does not require a "man-in-the-middle" approach wherein the user submits the invoice and then expects the results to be immediately returned so that they may...what, put them in the system of record themselves? We've found that the "time to effect" of the workflow is the same regardless of whether the data is hand-keyed or programmatically submitted from an AFR response.
> requires a custom model to be trained before extracting data
This is simply not true. AFR provides quite a few pre-built models[1] that we have found to return confidence scores consistently above 70%. To put that in perspective, a human averages 66% accuracy when performing data entry of this type[2]. Sure, the pre-built models don't necessarily handle invoice line items (which require much more complex key-value arrays and matrices), but they can be used to capture metadata on an invoice that then informs how and where it moves along in the "processing" flow.
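For reference, calling one of those pre-built models is only a few lines with the azure-ai-formrecognizer Python SDK (v3.2-style API); a minimal sketch, with the endpoint, key, and file name as placeholders:

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # Placeholders: substitute your own Form Recognizer endpoint and key.
    client = DocumentAnalysisClient(
        endpoint="https://<your-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    with open("invoice.pdf", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-invoice", document=f)
    result = poller.result()

    # Every extracted field carries a confidence score you can threshold on.
    for doc in result.documents:
        for name, field in doc.fields.items():
            print(f"{name}: {field.value} (confidence {field.confidence:.2f})")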
We've also found that a single "monolithic [custom] model" able to address our specific vendor invoices, with more finely tuned value returns, has been fairly easy to build and maintain.
Wow, that paper is super strange! The finding that a visual check adds no value over single entry is incredibly surprising, yet for all the fancy statistics in the paper, the authors made no attempt to investigate that outcome further and hunt for an explanation.
(The easiest experiment: check the total task times to see whether the participants cut corners; UX might also have been an issue; ...)
> a human averages 66% accuracy when performing data-entry
The cited material says something a bit different from that. Regardless, different users and markets will have different tolerances for error, too.
Yes, sorry, it's a tricky concept to port over: the material says that 66% of the participants provided inaccurate inputs. So, not accounting for any individual's accuracy rate, it would be more precise to say that roughly 66% of entries are inaccurate, as opposed to the average person being x% accurate.
There is so much potential in these technologies, but even autogenerated documents carry so many embedded semantics that it is hard to train these tools across a wide variety of document formats.
The natural followup question is how to make these tools easily trainable for a specific document type.
(After all, a human can do it with some basic training. Ergo we should aim for computers being able to do it with equally simple training. And in practice this doesn't seem like an insurmountable goal.)
That's precisely what people do -- they design workflows that leverage AI to handle the forms that it can, and humans to handle the forms that it can't. And constantly retrain the AI based on what the nice humans just labeled. As the AI gets better less needs to go to the humans, but they're always there for the exceptional cases. BTW this isn't theoretical -- this is current best practice.
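The routing logic itself is usually nothing fancier than a confidence threshold; a toy sketch of the pattern (every name here is hypothetical):

    CONFIDENCE_THRESHOLD = 0.85  # tune to your tolerance for error

    def route(document, model, review_queue, system_of_record, training_set):
        """Hypothetical human-in-the-loop router: confident extractions go
        straight through; the rest go to a human, whose corrections are
        saved and also fed into the next retraining run."""
        extraction = model.extract(document)
        if extraction.confidence >= CONFIDENCE_THRESHOLD:
            system_of_record.save(extraction)
        else:
            corrected = review_queue.send_to_human(document, extraction)
            system_of_record.save(corrected)
            training_set.append(corrected)  # model improves; less goes to humans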
There are various European standards for embedding XML data into PDFs to make data extraction easier. Think of commercial invoices that need to be both human readable and easily imported into ERP systems.
The French standard is called Factur-X. The German one ZUGFeRD.
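When such a payload is present there's nothing to OCR at all: the XML ships as an ordinary PDF file attachment. A minimal sketch of pulling it out with the pikepdf library (the attachment name varies by standard and version, e.g. factur-x.xml or zugferd-invoice.xml):

    import pikepdf

    # Factur-X / ZUGFeRD embed the machine-readable invoice as a standard
    # PDF file attachment alongside the human-readable rendering.
    with pikepdf.open("invoice.pdf") as pdf:
        for name, spec in pdf.attachments.items():
            if name.lower().endswith(".xml"):  # e.g. "factur-x.xml"
                xml_bytes = spec.get_file().read_bytes()
                print(f"{name}: {len(xml_bytes)} bytes of structured invoice data")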
This technology is of intense interest to organizations that process thousands of forms per day. For example, the Internal Revenue Service, insurance companies, banks, etc. It is very much worth their while to optimize that processing, and they do.
The cloud services reviewed here are typically components in a much larger end-to-end process. They are valuable because they are fast, and work at scale.
Suggesting that the technology is useless because it can't parse a random set of 51 invoices misses the actual use case for which these services are appropriate.
I think it's actually the other way round. The IRS, banks, and insurers design and mandate the use of their own forms, and if you deviate one bit from their form they will happily deny your request, or even just plain ignore you. They have the power to put the burden on the user to fill in the form in a way their system can process. That's not really who this software is aimed at. Sure, the IRS et al. can and probably will use these advanced extraction services, but only once the technology is readily available and reasonably priced.
Instead, this software is aimed at companies processing all sorts of documents with structured data but without (very) strict form requirements (or with very low compliance with those requirements). Invoice processing is actually one of the best examples out there: every company has to do it, the basic data structure is nearly universally identical, and yet the layouts are so varied and complex that general-purpose tools struggle (hence specially designed tools for invoice recognition). These companies may find great value in processing these forms and may be willing to pay for advanced text extraction tools, because their only alternative is manual processing by humans.