tlack on Sept 9, 2015 | on: Code to transform Hillary's emails from raw PDF do...

Are there any open source tools that would slurp in content like this and develop its own sense of relationships in the data, that I could then explore by hand?

Bedarra's Text Analyzer[1] kinda floored me, and I'd like to use something similar for various tasks if there is something good and free.

[1] http://www.bedarra.com/movies/textAnalyserMovie.html

AdieuToLogic on Sept 9, 2015

> Are there any open source tools that would slurp in content like this ...

Yes, tesseract[1] can do a pretty good job of extracting the text. Here[2] is a blog post that describes using it to perform OCR on PDFs.
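For a rough idea of what that looks like in practice, here is a minimal Python sketch. It is not the method from the linked post. It assumes Poppler's `pdftoppm` is installed to turn PDF pages into images (my assumption, tesseract needs images as input), and the helper names `pdf_to_png_cmd`, `ocr_cmd` and `ocr_pdf` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pdf_to_png_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    # Assumption: Poppler's pdftoppm is available. It writes one PNG per
    # page, named <prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path, lang: str = "eng") -> List[str]:
    # tesseract <image> <outputbase> -l <lang> writes <outputbase>.txt
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf: Path, workdir: Path) -> str:
    """Render each page of `pdf` to PNG, OCR it, and return the joined text."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / pdf.stem
    subprocess.run(pdf_to_png_cmd(pdf, prefix), check=True)
    texts = []
    # Page numbers are zero-padded, so a plain sort keeps page order.
    for page in sorted(workdir.glob(f"{pdf.stem}-*.png")):
        subprocess.run(ocr_cmd(page), check=True)
        texts.append(page.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf(Path("some.pdf"), Path("ocr_out"))` would leave the page images and `.txt` files in `ocr_out` and return the combined text. Scan quality matters a lot, so expect to tune the DPI.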

As for searching the PDF contents, Solr[3] might be what you are looking for instead.
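To give a feel for what indexing the extracted text buys you, here is a toy inverted index in Python. This is only an illustration of the basic idea, not how Solr works internally or its API (Solr adds analysis, ranking, faceting and much more). The function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of documents containing every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical OCR output keyed by file name.
docs = {
    "a.txt": "Meeting with the ambassador",
    "b.txt": "Ambassador schedule update",
    "c.txt": "Lunch meeting",
}
index = build_index(docs)
hits = search(index, "ambassador meeting")
```

Here `hits` contains only `a.txt`, the one document with both terms. A real search server handles the parts this skips, such as relevance ranking and stemming.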

1 - https://github.com/tesseract-ocr/tesseract

2 - http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tessera...

3 - http://stackoverflow.com/questions/6694327/indexing-pdf-with...
