

Ask HN: How to search through 1M scanned TIF images? - esers

Here's my problem: I've got a corpus of over 1 million scanned text documents in tif format. No OCR has been performed.

What is industry best-practice for indexing and searching through these documents?

Does anyone have any first-hand experience with commercial and/or open-source solutions that are performant on this type of problem?

How about experience with ranking algos other than tf-idf?

I'm sure this is a common problem at the enterprise level. For example, just today, the WSJ published an article mentioning that over "100,000 pages of documents [have been collected from] the companies and agencies involved in the" oil spill: http://online.wsj.com/article/SB10001424052748703339304575240210545113710.html?mod=WSJ_hpp_LEFTTopStories

What is the quick, easy, and reliable way to search through these single-serving corpora?
======
hga
I used to do this 15 years ago and as far as I know the approach is still the
same:

Run the images through a fixup program that will among other things fix
scanning skew.

OCR it, of course. If you can afford it I believe that this is still the best
system: <http://www.primerecognition.com/augprime/prime_ocr.htm>

It can use multiple OCR engines and gets around a _nasty_ bug in the one that
has the best quality and speed.

You might want to hire a specialist company for the above tasks. Litigation
support companies are one possibility but the lawyers they work for usually
have a serious budget.

Then you need to put it into a full-text engine that's ideally flexible WRT
OCR errors. I.e. the old standard of an exact-match inverted index is subpar,
because a single misrecognized character makes a term unfindable.
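To illustrate what "flexible WRT OCR errors" can mean, here is a minimal sketch of one common approach: index character trigrams instead of whole terms, then match on trigram overlap. This is not the engine referred to above; the class name `TrigramIndex`, the helper `trigrams`, and the 0.4 similarity threshold are all assumptions made up for the example.

```python
from collections import defaultdict


def trigrams(term):
    """Return the set of character trigrams of a padded, lowercased term."""
    padded = f"  {term.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    """Toy index (hypothetical): trigram -> terms, and term -> document ids."""

    def __init__(self):
        self.gram_to_terms = defaultdict(set)
        self.term_to_docs = defaultdict(set)

    def add(self, doc_id, text):
        """Index every whitespace-separated term of an OCR'd document."""
        for raw in text.split():
            term = raw.lower().strip(".,;:!?\"'()")
            if not term:
                continue
            self.term_to_docs[term].add(doc_id)
            for gram in trigrams(term):
                self.gram_to_terms[gram].add(term)

    def search(self, query, threshold=0.4):
        """Return ids of documents containing a term similar to the query.

        Similarity is the Jaccard overlap of trigram sets; the threshold
        is an arbitrary value chosen for this sketch.
        """
        q = trigrams(query)
        # Collect every indexed term sharing at least one trigram with the query.
        candidates = set()
        for gram in q:
            candidates |= self.gram_to_terms.get(gram, set())
        hits = set()
        for term in candidates:
            t = trigrams(term)
            if len(q & t) / len(q | t) >= threshold:
                hits |= self.term_to_docs[term]
        return hits
```

With this sketch, a document where OCR produced "pipe1ine" is still returned for the query "pipeline", which an exact-match term lookup would miss. A lower threshold tolerates more errors at the cost of more false positives.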

Good luck.

------
esers
What I'm getting at here is: What is the SugarCRM of Enterprise Search?

I've tried Googling for "DIY enterprise search", but I haven't come up with
anything.

All I've found are very heavy (and very expensive) systems from
Autonomy, etc.

------
jacquesm
You should probably read this first:

<http://en.wikipedia.org/wiki/Document_management_system>

And then branch out from there.

------
WestCoastJustin
Might want to check this out:

<http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/>

------
trimber
I believe most enterprise systems are fairly expensive. Maybe you would
consider building your own system? I have done something similar by writing a
plugin for Google Desktop Search that indexes TIF files. Writing a plugin is
pretty straightforward. As for the OCR part, there are only a few open-source
OCR engines, the most popular being Tesseract. The quality of Tesseract's
results is pretty good, and I believe it is sufficient for many systems.
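As a rough sketch of the OCR step, the Tesseract command-line tool can be driven from a short script. This assumes the `tesseract` executable is installed and on the PATH, and that the English language data is available. The function names and directory layout are hypothetical, not taken from the plugin described above.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, out_dir, lang="eng"):
    """Build the argv for `tesseract <image> <outputbase> -l <lang>`.

    Tesseract itself appends ".txt" to the output base.
    """
    image_path = Path(image_path)
    output_base = Path(out_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(scan_dir, out_dir, lang="eng"):
    """OCR every .tif/.tiff file in scan_dir; return the expected text files."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(scan_dir).iterdir()):
        if image.suffix.lower() not in {".tif", ".tiff"}:
            continue
        cmd = tesseract_command(image, out_dir, lang)
        # check=True raises CalledProcessError if Tesseract fails on a page.
        subprocess.run(cmd, check=True)
        written.append(Path(cmd[2] + ".txt"))
    return written
```

For a corpus of a million pages, a loop like this would presumably need to be split across several processes or machines, and deskewing would happen before this step.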

~~~
keefe
Raw Tesseract is a pain to use. OCRopus (<http://code.google.com/p/ocropus/>)
is much easier to deal with and uses Tesseract as its default engine. It is a
Google-funded project, btw.

