
Textract, a Python package for extracting text from any document - ColinWright
http://datascopeanalytics.com/what-we-think/2014/07/27/extract-text-from-any-document-no-muss-no-fuss
======
sheetjs
The node module by the same name
([https://github.com/dbashford/textract](https://github.com/dbashford/textract))
also supports image OCR (via tesseract), excel files, RTF and other formats.

~~~
wangman
I also assumed that it was some kind of Python wrapper or implementation of
Tesseract OCR when I saw that name. One would think so, given that Tesseract is
(one of?) the best-performing OCR programs out there.

------
ddumas
This looks nice. What I'd really like to see, along these lines, is a Python
library for automated document metadata extraction with confidence assessment,
like this:

./autometa.py --author --verbose academic-paper.pdf

Author: "Edward Witten" Confidence: High (matches template "amslatex")

~~~
deanmalmgren
I thought about the metadata thing but decided to exclude it for the earliest
versions of textract to keep things simple. If you'd like to see it in there
and have a good example of how you'd like to use metadata, please feel free to
throw an issue on the issue tracker
[https://github.com/deanmalmgren/textract/issues/](https://github.com/deanmalmgren/textract/issues/)

------
quink
I realise that it's nice that it'll give you a single function to dump
whatever file format into (while actually running it through a shell command
in the backend), but it's not that hard to:

    
    
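      import pyPdf

      # `stream` is a file object opened in binary mode, e.g. open(path, "rb")
      out = ""
      pdf = pyPdf.PdfFileReader(stream)
      try:
          if pdf.getIsEncrypted():
              pdf.decrypt('')  # try the empty password
          for page in pdf.pages:
              out += page.extractText()
      except NotImplementedError:
          # Yeah, this ain't happening
          pass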

------
Contraptor
When I first read the headline, I thought there was a new Python API or SDK
for the already existing Textract OCR solution from Structurise. We've used
Structurise's product called Textract for years at work, so it was definitely
around first. I'm not sure if the creators of this new solution/product were
aware of the earlier product's existence, but using the same name for a product
that solves a similar problem seems like it would be an issue... or at the
very least confusing.

Here's a link to StructuRise's Textract product page:
[http://www.structurise.com/textract/](http://www.structurise.com/textract/)

------
kalkin
I have a little shell script which tries to do basically this:

[https://gist.github.com/djudd/1402751e2928cb8ac788](https://gist.github.com/djudd/1402751e2928cb8ac788)

It tries either abiword or OpenOffice/LibreOffice for filetypes other than
pdf, ps, and txt, which works pretty decently for doc, docx, ppt, etc.

One file type here that textract folks might want to add is Postscript.
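
For illustration, the dispatch-by-file-type idea can be sketched in Python. This is not the linked gist and not textract's code: `COMMANDS`, `build_command`, and `extract_text` are hypothetical names, and the command lines are assumptions to check against the tools actually installed (abiword and LibreOffice are left out because their flags vary by version).

```python
import os
import subprocess

# Hypothetical dispatch table: file extension -> command template.
# Tool choices follow the comment above; the exact arguments are assumptions.
COMMANDS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" asks pdftotext to write to stdout
    ".ps": ["ps2ascii", "{path}"],
    ".txt": ["cat", "{path}"],
}


def build_command(path):
    """Return the argv list that would extract text from `path`."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in COMMANDS:
        raise ValueError("no extractor registered for %r" % ext)
    return [arg.format(path=path) for arg in COMMANDS[ext]]


def extract_text(path):
    """Run the external tool and return whatever it prints, as bytes."""
    return subprocess.check_output(build_command(path))
```

Keeping `build_command` separate from `extract_text` means the mapping can be checked without any of the external tools being installed.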

~~~
deanmalmgren
Thanks for the suggestion. I wasn't familiar with ps2ascii and I just created
an issue here
[https://github.com/deanmalmgren/textract/issues/25](https://github.com/deanmalmgren/textract/issues/25)

------
jknz
Apache Tika has existed for years and seems to have the same goal:
[http://tika.apache.org/](http://tika.apache.org/)

I'm wondering why the authors wrote something from scratch?

edit: this is answered by one of the authors in the second Disqus comment on
the linked page

~~~
wiremine
Here's the comment:

"It's very similar to Apache Tika (which I didn't know about until yesterday),
but I think it is different in at least two important ways.

"1. The intention of textract is to provide many possible ways to extract
text from any document, provided words appear in the correct order in the text
output. By being method agnostic, it's possible to use different parsing
techniques in different situations. Here's more on that philosophy
[http://textract.readthedocs.or..](http://textract.readthedocs.or..). And, to
be fair, I'm not sure that Tika's philosophy differs in any meaningful way on
this.

"2. Another subtle difference is that textract is written in Python, which is
a language that is used by nearly all data people that I know. Since the
intent is to be a preprocessing framework for natural language processing, I
wanted it to be as maintainable by the community as possible."
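
To make "method agnostic" concrete, here is a minimal sketch of how several parsing techniques for one file type might sit behind a single call. It is an assumption about the general shape of the idea, not textract's actual code: `PARSERS`, `register`, `extract`, and the "strict"/"lenient" method names are all hypothetical.

```python
# Hypothetical registry: extension -> ordered list of (method name, parser).
PARSERS = {}


def register(extension, method):
    """Decorator that files a parser under `extension` with the name `method`."""
    def decorator(func):
        PARSERS.setdefault(extension, []).append((method, func))
        return func
    return decorator


@register(".txt", "strict")
def parse_strict(data):
    # Fails with UnicodeDecodeError on bytes that are not valid UTF-8.
    return data.decode("utf-8")


@register(".txt", "lenient")
def parse_lenient(data):
    # Never fails; undecodable bytes become U+FFFD.
    return data.decode("utf-8", errors="replace")


def extract(data, extension, method=None):
    """Use the named method, or try each registered method in order."""
    candidates = PARSERS.get(extension, [])
    if method is not None:
        candidates = [(n, f) for n, f in candidates if n == method]
    for name, func in candidates:
        try:
            return name, func(data)
        except UnicodeDecodeError:
            continue  # fall through to the next technique
    raise ValueError("no parser succeeded for %r" % extension)
```

A caller who knows which technique suits their documents passes `method=...`; everyone else gets the first technique that works.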

------
oblio
Which Python version is supported? PyPI doesn't list it.

On the same note, your PyPI page is borked:
[https://pypi.python.org/pypi/textract](https://pypi.python.org/pypi/textract)

(look at Build status & co, there's a formatting error)

~~~
deanmalmgren
Currently 2.7, but there's no reason Python 3 can't be supported too. Thanks
for the heads up on the borking of the PyPI page. Noted.

------
goblin89
> Ok, ok, ok. You can’t extract text from any document at the moment, but
> textract integrates support for many common formats and we designed it to be
> as easy as possible to add other document formats.

There go my hopes to see a painless OCR library for Python…

~~~
deanmalmgren
Hopefully it will be? There's a great suggestion to use tesseract-ocr to make
this happen.
[https://github.com/deanmalmgren/textract/issues/16](https://github.com/deanmalmgren/textract/issues/16)

If you have any other (better?) ways of doing this, feel free to add some
comments on the issue tracker.

------
haddr
Great tool! BTW, how does this compare to Apache Tika for text extraction from
HTML pages?

------
aphexcx
I'm using this for my git repos now. (I version control my Word docs and
PDFs.) Here, I even made a post about it:
[http://www.aphex.cx/2014/08/using-git-for-pdf-and-word-doc-files.html](http://www.aphex.cx/2014/08/using-git-for-pdf-and-word-doc-files.html)

------
ppod
Nice. Does it do any encoding conversion, e.g. latin1 to utf-8? Does Tika do
that?
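
For reference, re-encoding by hand in Python is short when the source encoding is already known. This sketch says nothing about what textract or Tika do internally, and `latin1_to_utf8` is a hypothetical helper name.

```python
def latin1_to_utf8(data):
    """Re-encode Latin-1 bytes as UTF-8 bytes."""
    # Latin-1 maps every byte to a code point, so the decode step cannot fail.
    return data.decode("latin-1").encode("utf-8")
```

The hard part in practice is knowing the source encoding in the first place, which this sketch simply assumes.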

------
hackerews
I've always thought the datascope team was awesome. textract makes them even
awesomer.

------
jpulec
This is awesome

------
ealize
Direct link:
[http://datascopeanalytics.com/what-we-think/2014/07/27/extract-text-from-any-document-no-muss-no-fuss](http://datascopeanalytics.com/what-we-think/2014/07/27/extract-text-from-any-document-no-muss-no-fuss)

~~~
dang
Thanks! Url changed from
[http://getprismatic.com/story/1406492962896](http://getprismatic.com/story/1406492962896).

