
Amazon Textract – Extract text and data from virtually any document - mcrute
https://aws.amazon.com/textract/
======
cmroanirgo
Found some interesting tidbits in their FAQ [0]:

"Q: What type of text can Amazon Textract detect and extract?

A: Amazon Textract can detect Latin-script characters from the standard
English alphabet and ASCII symbols."

So, English only. But, _very_ worryingly, they're going to keep your
company's documents:

"Q. Are document and image inputs processed by Amazon Textract stored, and how
are they used by AWS?

A: Amazon Textract may store and use document and image inputs processed by
the service solely to provide and maintain the service and to improve and
develop the quality of Amazon Textract..."

"Q. Can I delete images and documents stored by Amazon Textract?

A: Yes. You can request deletion of document and image inputs associated with
your account by contacting AWS Support. Deleting image and document inputs may
degrade your Amazon Textract experience."

That said, I'm still baffled about what value-add they're providing. From
the name alone, I expected it to generate documents of other common types:
.txt (without images), .doc, .html (zip). That is, a large part of extracting
text is the ability to reflow the text across page boundaries & columns.
However, this product states that:

"All extracted data is returned with bounding box coordinates" [1]

...which is how pdf documents lay things out in the first place...Have I
missed something?

[0]
[https://aws.amazon.com/textract/faqs/](https://aws.amazon.com/textract/faqs/)

[1]
[https://aws.amazon.com/textract/features/](https://aws.amazon.com/textract/features/)

~~~
tills13
The point of this service is to train their own OCR models for use in other
products like Kindle / their e-book store. There doesn't really need to be a
value add - if people use it it's a win for them... if people don't it's not
really a big loss.

~~~
acqq
But in order to train something you have to have the input of what is actually
there, I don’t see how that is provided here.

~~~
Technetium_Hat
There might be a way for users to rate the quality of the result, or at least
report it if it is very wrong.

------
danso
Given the high and continuing popularity of posts about the "simple"
conversion of regular PDF forms/tables -- even among the technically
sophisticated HN audience [0] -- if Amazon can deliver on OCR-to-data, that
feels like a huge
achievement. Not as sexy (or creepy) as Rekognition, perhaps, but almost
certainly more day-to-day useful to the many, many professionals who work with
documents and legacy data entry systems.

[0]
[https://hn.algolia.com/?query=pdf%20convert&sort=byPopularit...](https://hn.algolia.com/?query=pdf%20convert&sort=byPopularity&prefix&page=0&dateRange=all&type=story)

\-
[https://news.ycombinator.com/item?id=18199708](https://news.ycombinator.com/item?id=18199708)

\-
[https://news.ycombinator.com/item?id=5487530](https://news.ycombinator.com/item?id=5487530)

~~~
just_myles
Agreed. Anything that can lighten the load of having to write custom scripts
to handle pdf-to-data conversions will be helpful.

I do maintain some level of skepticism though. It is ocr :D

~~~
danso
Even if AWS goes the cynical route of making Textract be an upsell to MTurk --
e.g. the Textract output is not reliable enough on its own, but structured for
easy piping to a MTurk job -- that's got to be useful for the many folks who
send entire pages to MTurk when they just need a couple boxes proofread.

As an example of a more scripted/structured job, ProPublica built out a
crowdsourcing framework in Rails to extract data from FCC filings. But even
that was quite difficult, because every state/TV station has its own kind of
form:
[https://projects.propublica.org/free-the-files/](https://projects.propublica.org/free-the-files/)

------
raghavtoshniwal
This plays so well with the theory of AWS taking a slice of all web activity.
They are commoditising more and more complex tasks and enabling a huge number
of engineers to bootstrap their ideas with amazing tech from day 1. A huge jump
from S3/EC2 to this. Commendable.

~~~
jjeaff
I sort of agree. But I think the reality is a little closer to Apple's style
of innovation. Few of the things that aws offers are things that didn't exist
before. For example, data extraction and image recognition APIs have been
around for a long while now from several different providers.

AWS is just aggregating it all into one place and giving it a really good
final polish.

------
Edmond
Not sure if this is bad news for the Robotic Process Automation (RPA) sector
or an opportunity to offload the "Robotic" part while focusing on business
process...

~~~
Ftuuky
There are many RPA solutions with OCR as part of the automation.

~~~
njstraub608
Generally, these are stuffed in there for marketing and aren't very effective
when used in actual business scenarios.

Source: I do a lot of post-sales consulting work implementing RPA solutions.

------
efields
Is off the shelf open source OCR not reliable for an image of reasonable
fidelity, like a smartphone camera picture of a B&W text document?

I ask because it feels like I should have an app that lets me scan with my
phone, process the text with OCR, then let me plain text search every scanned
document I have.

The first part only natively made it into iOS Notes a year or two ago, but
that whole experience above should be out of the box, IMHO…

~~~
Holybeds
There's a difference between doing OCR and actually understanding what is what
in the document content.

For normal text OCR works well. But automatically understanding what is what
is more complex.

~~~
njstraub608
This ^^

And actually understanding the context of what you're trying to use OCR on can
work backward to determine what the text actually is, i.e. if it's a "Name"
field then the probabilities of ambiguous letters may change (in the case of
handwriting rec).
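
A minimal sketch of that idea in Python, treating the recognizer's
per-character scores as likelihoods and the field type as a prior. The
`rerank` helper and all of the numbers are hypothetical, chosen only to
illustrate the effect:

```python
def rerank(ocr_probs, field_prior):
    """Weight per-character OCR scores by a field-specific prior and renormalize."""
    weighted = {ch: p * field_prior.get(ch, 0.0) for ch, p in ocr_probs.items()}
    total = sum(weighted.values())
    if total == 0:
        # The prior rules out every candidate; fall back to the raw OCR scores.
        return dict(ocr_probs)
    return {ch: w / total for ch, w in weighted.items()}


# Hypothetical scores: the recognizer slightly prefers the digit "0" over the letter "O".
ocr_probs = {"0": 0.55, "O": 0.45}
name_prior = {"0": 0.05, "O": 0.95}    # assumption: digits are rare in a "Name" field
amount_prior = {"0": 0.95, "O": 0.05}  # assumption: letters are rare in an "Amount" field

in_name = rerank(ocr_probs, name_prior)
in_amount = rerank(ocr_probs, amount_prior)
best_in_name = max(in_name, key=in_name.get)
best_in_amount = max(in_amount, key=in_amount.get)
```

With these made-up numbers, the ambiguous glyph resolves to the letter in the
name field and to the digit in the amount field.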

------
BasHamer
If this can get me tables out of pdf's generated by crystal reports it would
be a godsend for testing. This has been a nightmare to try and solve, the best
option so far has been adobe cloud but they don't offer an API for that. I'm
excited to try it out.

~~~
mjt58
Have you tried e.g. [https://tabula.technology](https://tabula.technology),
[https://pdftables.com](https://pdftables.com),
[https://pypi.org/project/Camelot/](https://pypi.org/project/Camelot/)?

~~~
counciltime
I have a friend who has also developed a number of applications that use
Tesseract-based OCR specifically for PDFs. The Report Miner application does a
nice job of locating and extracting PDF tables.

[https://www.opait.com/tesseractstudio/](https://www.opait.com/tesseractstudio/)

[https://www.opait.com/Pdfreportminer/](https://www.opait.com/Pdfreportminer/)

~~~
minhtripham
Would love to learn more about the apps your friend developed--currently doing
research into different OCR use cases + tech. can you shoot me an email at
minh@docucharm.com?

------
ocrcustomserver
Some videos that were just released:

Announcing Amazon Textract,
[https://www.youtube.com/watch?v=PHX7q4pMGbo](https://www.youtube.com/watch?v=PHX7q4pMGbo)

Introducing Amazon Textract: Now in Preview,
[https://www.youtube.com/watch?v=hagvdqofRU4](https://www.youtube.com/watch?v=hagvdqofRU4)

Introducing Amazon Textract: Now in Preview (AIM363),
[https://www.youtube.com/watch?v=FnZFK_2oqKk](https://www.youtube.com/watch?v=FnZFK_2oqKk)

------
gingerlime
I have a personal flow using tesseract to scan docs into searchable PDFs, but
it’s not that accurate. One of the main problems is that some (now most?) of
the documents are in German since I live in Germany, but some are in English.
There’s a way to choose the language but nothing to auto detect as far as I’m
aware. I was hoping for some cloud AI service with superior OCR and simple
integration or CLI (push a PDF and download one with OCR embedded). Google
seems to be too complicated unfortunately... Any tips??

~~~
philsnow
If you're running tesseract locally (i.e. not paying per invocation), run it
once with EN and count occurrences of the/this/a/any etc, run it again with DE
and count occurrences of der/die/das/um/ab/wie, and go from there?

Edit: Hell, even average word length is probably going to be a good indicator
since German is so agglutinative. Collect some factors like this and I think
you'll be able to build a pretty good classifier.
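
A rough sketch of the stopword-counting approach in Python. It assumes you
have already run Tesseract twice and hold both text outputs; the stopword
lists and the `pick_language` helper are hypothetical and would need tuning
on real scans:

```python
import re

# Assumption: tiny illustrative stopword lists; extend them for real documents.
STOPWORDS = {
    "en": {"the", "this", "a", "any", "and", "of", "is"},
    "de": {"der", "die", "das", "um", "ab", "wie", "und", "ist"},
}


def stopword_hits(text, lang):
    """Count how many words in text are stopwords of the given language."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    return sum(1 for w in words if w in STOPWORDS[lang])


def pick_language(ocr_outputs):
    """ocr_outputs maps a language code to the text OCR produced with that language.

    Returns the language whose own output contains the most of its stopwords,
    plus the raw scores so ties and near-ties can be inspected.
    """
    scores = {lang: stopword_hits(text, lang) for lang, text in ocr_outputs.items()}
    return max(scores, key=scores.get), scores


# Made-up OCR results for one German page, run once per language setting.
outputs = {
    "en": "Dle Rechnung ist um die Ecke und der Betrag wie das",
    "de": "Die Rechnung ist um die Ecke und der Betrag wie das",
}
language, scores = pick_language(outputs)
```

Average word length or other features could be added to `scores` in the same
way before choosing a winner.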

~~~
rpedela
Good idea. I would take it one step further. I would use a ML-based language
detection tool which should return a list of languages and a confidence score.
Whichever language has the highest confidence score wins. The FastText project
has a good pre-trained model available.

------
ocrcustomserver
This is very interesting. I'm curious to see how they will execute on several
points:

1\. How it will deal with multiple templates that the system hasn't seen
before. Especially when there is significant difference between the templates.

2\. UI/UX. E.g. how it will trace the extracted data to the original source
and how it will show the confidence scores of each entity.

3\. Verification process: what the workflow will look like when the confidence
score is low and the document has to be checked by human operators.
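
For point 3, one common pattern is a simple threshold that routes
low-confidence fields to a review queue. A minimal Python sketch follows; the
`Field` type, the `route` helper, and the 0.9 threshold are all assumptions
for illustration, not anything Textract is known to provide:

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical extractor


def route(fields, threshold=0.9):
    """Split fields into auto-accepted ones and ones queued for human review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review


# Made-up extraction results for a single invoice.
fields = [
    Field("invoice_no", "A-1041", 0.99),
    Field("total", "118.40", 0.62),
]
accepted, review = route(fields)
```

Only the fields in `review` would need to be shown to an operator, rather than
the whole page.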

------
citilife
This looks a lot like what I've seen from companies such as InstaBase[1].
Given how hard it is to do well (largely due to poor initial images), I'm
curious how Amazon's product offering will work.

A team I'm working with has had a lot of success doing this, so I'm curious
what method(s) they are using.

[1]
[https://en.wikipedia.org/wiki/Instabase](https://en.wikipedia.org/wiki/Instabase)

------
sbarre
So this is Apache Tika as a Service?

[https://tika.apache.org/](https://tika.apache.org/)

~~~
bpchaps
A little late to the comment party, but I was wondering the same. I'm working
on a web scrape workflow that's currently using Tika. I'm very interested to
see how well this does in comparison.

~~~
sbarre
I was quite surprised by how powerful and flexible Tika can be, and my
use-case was pretty basic: crawling a network drive to index project artifacts
like Office docs and media files and pushing them into an Elasticsearch index.

Have you found any major problems or shortcomings in your usage?

~~~
bpchaps
One small problem is that it sometimes doesn't make newline separations
properly. In my use case, I was extracting email addresses from web scrapes -
some email addresses would come out as "blah@blah.comRandomWord"
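
One possible workaround, sketched in Python: anchor the match on a list of
known TLDs so that a capitalized word glued to the end is left out. The TLD
list and the `extract_emails` helper are hypothetical, and the heuristic only
helps when the trailing text does not start with a lowercase letter, digit,
dot, or hyphen:

```python
import re

# Assumption: a short illustrative TLD list; a real one would be much longer.
KNOWN_TLDS = ("com", "org", "net", "edu", "gov", "io")

# Local part, "@", one or more dotted labels, then a known TLD that is not
# followed by a character that could continue a lowercase domain name.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+(?:"
    + "|".join(KNOWN_TLDS)
    + r")(?![a-z0-9.-])"
)


def extract_emails(text):
    """Return email addresses found in text, trimmed at a known TLD."""
    return EMAIL_RE.findall(text)


found = extract_emails("contact blah@blah.comRandomWord now")
```

Addresses whose TLD is not in the list are skipped entirely, so this trades
recall for cleaner matches.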

------
amelius
Can't use this because my clients/contract don't allow sending of documents to
third parties.

~~~
sbarre
Have you looked at Apache Tika?

[https://tika.apache.org/](https://tika.apache.org/)

------
ironfootnz
"Amazon Textract may store and use document and image inputs processed by the
service solely to provide and maintain the service and to improve and develop
the quality of Amazon Textract..."

I still prefer the Dropbox solution for that, but I'm waiting for them to
turn it into an API.

------
jgalt212
I have been following this service from afar, as the founder is quite skilled.
Seems a bit pricey, but does similar.

[https://www.pdfdata.io/](https://www.pdfdata.io/)

------
blacksmith_tb
I wonder if they have any detection of captchas, or if they'd let people just
submit screengrabs containing them as 'documents' to be processed...

------
foxhound6
Any idea if this can support handwriting even with a reduced confidence?
Support for non-English languages?

~~~
brad0
According to the Textract preview sign-up form, there are the following
features:

\- Printed text detection

\- Handwritten text detection

\- Key-Value detection

\- Table detection

\- Checkbox detection

\- Other optical marks (e.g. barcode, QR code)

There's a decent possibility it has handwriting recognition. Not sure about
the non-English languages though.

~~~
ocrcustomserver
The docs page [1] (subject to change) mentions:

 _Do you support handwriting? – We do not support handwriting extraction._

[1]: [https://docs.aws.amazon.com/textract/latest/dg/how-it-works-...](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-faq.html)

------
hbcondo714
Arg, you have to type in all your information even if you are logged into the
AWS console

------
dvtrn
The FOIA geek in me is....well...geeking out over this. Slightly.

------
jijji
This is genius...

1\. make "strings" api

2\. hook it to a web server

3\. profit!

