

Building A Full-Text Index In Javascript - olivernn
http://garysieling.com/blog/building-a-full-text-index-in-javascript

======
knowtheory
This is pretty cool, but the fundamental problem is still that you (or someone
else) have to load an entire PDF (or set of PDFs) before you can use the
full-text index to search it.

If you're running a service (say, like DocumentCloud), you're way better off
precomputing a full-text index on ingest and providing a search API than
shunting substantial parts of your stored documents over to the client.
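
To make that concrete, here is a minimal sketch of the "index on ingest"
idea. It is an illustration only, not how DocumentCloud works: the
InvertedIndex class, its ingest/search methods, and the document ids are all
hypothetical names, and a real service would add stemming, ranking, and
persistent storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Hypothetical server-side index: maps each token to a set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def ingest(self, doc_id, text):
        # Runs once per document at upload time, so clients never
        # need to download the document just to search it.
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        # AND semantics: return ids of documents containing every query token.
        tokens = tokenize(query)
        if not tokens:
            return []
        result = None
        for token in tokens:
            ids = self.postings.get(token, set())
            result = ids if result is None else result & ids
        return sorted(result)


index = InvertedIndex()
index.ingest("doc-1", "Full text search in the browser")
index.ingest("doc-2", "Search API on the server")
matches = index.search("search server")
```

The search API then only ships back a short list of matching ids, rather
than the documents themselves.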

Definitely cool as a piece of gear, but not terribly practical from a
client-side perspective, I'd think.

~~~
garysieling
Yes, that is certainly true. The other issue I see with the technique is that
if I tried to scale it, I'd probably hit some maturity issues with these
libraries.

For what it's worth, it looks like DocumentCloud uses Open Calais, which is a
Thomson Reuters product. I used to work there in a different division; they
have a bunch of interesting products in this space.

~~~
knowtheory
Oh neat, what'd you do at Thomson Reuters?

I notice your blog is filled with NLP-related goodies. I've been meaning to
screw around with the Stanford NER lib, to see if I can train up some useful
custom recognizers for particular document domains.

~~~
garysieling
I worked on a bunch of products, but the longest-term one was a side product
to WestLaw, Firm360, which was a market research tool for law firms that came
from an acquisition (FindLaw). I worked on some of the data-warehousing
stuff, and got to talk to a lot of people who worked on the content side.
There were some teams near me that did similar things (People Map / KeyCite).

~~~
MWil
Thank you for posting this and for your hard work at TR. I'm developing
something related to this stuff - <http://youtu.be/3m194rui52Q> (it's a really
old video!)

It uses some sort of social, Open Calais-type activities.

------
Ygg2
Now all we need is for someone to port a LibreOffice editor to JavaScript :)

~~~
garysieling
Yeah, the thought crossed my mind. There are enough people trying to make
online products like Google Docs, or places to post PowerPoint presentations,
that it may have already happened somewhere internally. Or maybe everyone is
just using LibreOffice/Muhimbi and doing it all on the server.

------
binarymax
lunr.js looks pretty nice, and seems very useful for tiny browser-based stuff.
For something a bit more heavyweight, I've used natural node[1], which is
quite good, though it's not available in the browser.

[1] <https://github.com/NaturalNode/natural>

~~~
garysieling
That one looks neat - it has some interesting NLP features like WordNet
integration and Bayes classification.

