

Text extraction - theslay

Hi, I'm working on plagiarism detection and I need some help with text extraction from PDFs. I've tried PDFTextStream, which works really well for extracting text from PDFs. I need to be able to extract the text into a structured format where I could query things like title, chapters, etc. I would appreciate pointers on achieving this task. Thanks
======
pedalpete
Have you tried posting this to
[http://stackoverflow.com](http://stackoverflow.com) ? That's a better forum
for these kinds of questions.

If you were to write a blog post about how to structure the extracted text,
that's more the HN thing.

------
mindcrime
I won't swear to it, but I suspect you're going to have to largely roll your
own, and that it will be at least partly heuristic-driven. I use Apache
Tika[1] to extract text from PDFs and then index it with Lucene, but we don't
need to discriminate between various chapters or anything. But I can picture
how you could use OpenNLP[2] and some custom code to break down the chapters.

[1]: [http://tika.apache.org](http://tika.apache.org)

[2]: [http://opennlp.apache.org](http://opennlp.apache.org)
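
To give a rough idea of what the heuristic part might look like, here is a minimal sketch in Python. It assumes the text has already been extracted to a plain string (by Tika, PDFTextStream, or anything else), that chapter headings look like "Chapter 3" or "CHAPTER IV: Results", and that the first non-empty line is the title. The function name `split_into_chapters` and the heading pattern are made up for illustration, not taken from any of those libraries.

```python
import re

# Assumed heading shape: "Chapter 3", "CHAPTER IV: Results", etc.
# Real documents will likely need more patterns than this one.
HEADING = re.compile(
    r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b[\s.:-]*(.*)$", re.IGNORECASE
)


def split_into_chapters(text):
    """Return {'title': str, 'chapters': [{'heading': str, 'body': str}]}."""
    lines = text.splitlines()
    # Heuristic: treat the first non-empty line as the document title.
    title = next((ln.strip() for ln in lines if ln.strip()), "")
    chapters = []
    current = None
    for ln in lines:
        if HEADING.match(ln):
            # A heading line starts a new chapter.
            current = {"heading": ln.strip(), "body": []}
            chapters.append(current)
        elif current is not None:
            # Everything after a heading belongs to that chapter.
            current["body"].append(ln)
    for ch in chapters:
        ch["body"] = "\n".join(ch["body"]).strip()
    return {"title": title, "chapters": chapters}


sample = (
    "My Thesis\n\n"
    "Chapter 1: Intro\nHello world.\n\n"
    "Chapter 2: Methods\nWe did things.\n"
)
doc = split_into_chapters(sample)
print(doc["title"], [ch["heading"] for ch in doc["chapters"]])
```

A pattern like this will misfire on ordinary sentences that happen to begin with "chapter i ...", and it ignores front matter and tables of contents, so expect to tune it per document collection.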

