Hacker News

Are there any open source tools that would slurp in content like this and develop their own sense of relationships in the data, which I could then explore by hand?

Bedarra's Text Analyzer[1] kinda floored me, and I'd like to use something similar for various tasks if there were something good and free.

[1] http://www.bedarra.com/movies/textAnalyserMovie.html




> Are there any open source tools that would slurp in content like this ...

Yes, Tesseract[1] can do a pretty good job of the OCR part. Here[2] is a blog post that describes using it to perform OCR on PDFs.

As for searching the PDF contents, Solr[3] might be what you are looking for.

1 - https://github.com/tesseract-ocr/tesseract

2 - http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tessera...

3 - http://stackoverflow.com/questions/6694327/indexing-pdf-with...
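
For anyone who wants to try the OCR step, here is a rough sketch in Python. It is not necessarily what the linked blog post does. It assumes the `pdftoppm` tool (from poppler-utils) and the `tesseract` CLI are both installed and on your PATH. The function names, the `ocr_out` directory, and the 300 dpi default are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    # "tesseract <image> <outputbase> -l <lang>" writes <outputbase>.txt.
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf, workdir="ocr_out"):
    # Hypothetical helper: rasterize the PDF, OCR each page, join the text.
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, out / "page"), check=True)
    pages = []
    for png in sorted(out.glob("page-*.png")):
        subprocess.run(ocr_cmd(png), check=True)
        pages.append(png.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Calling `ocr_pdf("scan.pdf")` (a placeholder file name) should return the recognized text as one string, which you could then hand to whatever indexer you like.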



