
What is a non-lexical phrase extractor?

I googled it but it leads back to this page.




<META>

> I googled it but it leads back to this page.

It blows my mind how often this happens to me, even though I understand how and why. Especially since it's usually just a few minutes after the original comment is written. Google is awesome.

</META>


It's a fun way to find sequential tokens (words) that have a high probability of being the names of people.

- Take a lexicon (list) of single words in a given language (English for example).

- Take a block of text and think of it as an ordered sequence of tokens; news articles work really well for this approach, books not as much.

Example (from http://www.cnn.com/2012/11/02/showbiz/movies/flight-review-c...): With its spectacular plane crash -- I would rate it fractionally behind the air disasters director Robert Zemeckis staged in "Cast Away" and Joe Carnahan in "The Grey," but still more than gut-wrenching enough to make you think about taking the train, next time -- "Flight" immediately raises the stakes on your typical addiction drama. But that's essentially what it is -- with a courtroom finish for extra lift.

- Stream through the text and throw away any word that is in your lexicon.

---- ---- ----------- ----- ----- -- - ----- ---- -- ------------ ------ --- --- --------- ------- Robert Zemeckis ------ -- "---- ----" --- Joe Carnahan -- "--- ----," --- ----- ---- ---- ------------- ------ -- ---- --- ----- ---- ----- --- -----, ---- ---- -- "------" ----------- ------ --- ------ -- ---- ------- --------- -----. --- ----'- ----------- ---- -- -- -- ---- - --------- ------ --- ----- ----.

- Treat each group of remaining tokens as a separate object; in this case we have 2.

- Write these non-lexical (not in your original lexicon) sequences out:

Robert Zemeckis

Joe Carnahan

Boom, you just made an entity extractor that plucks names out of text without having to model the English language too rigorously. It generally works in most languages that have a low intersection between name-part tokens and lexicon tokens. And it can be brutally fast.
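The steps above can be sketched in a few lines of Python. The regex tokenizer and the toy lexicon here are my own simplifications (a real lexicon would be a full English word list):

```python
import re

def extract_non_lexical_phrases(text, lexicon):
    """Return runs of consecutive tokens that are not in the lexicon.

    `lexicon` is a set of lowercase words; matching is case-insensitive.
    """
    tokens = re.findall(r"[A-Za-z]+", text)  # crude tokenizer: letters only
    phrases, current = [], []
    for token in tokens:
        if token.lower() in lexicon:
            if current:                      # a non-lexical run just ended
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)            # extend the current run
    if current:                              # flush a run at end of text
        phrases.append(" ".join(current))
    return phrases

# Toy lexicon standing in for a full single-word list of English.
lexicon = {"the", "air", "disasters", "director", "staged", "in", "and"}
print(extract_non_lexical_phrases(
    "the air disasters director Robert Zemeckis staged in and Joe Carnahan in",
    lexicon))
# → ['Robert Zemeckis', 'Joe Carnahan']
```

The whole pass is a single scan over the tokens, which is why the approach can be so fast.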

Where this gets interesting is in suppressing junk and tweaking the algorithm around things like parentheses, apostrophes, and sentence boundaries. There are lots of little edge cases like this to be mindful of: numbers in the text, for example, are never parts of names but aren't in your lexicon either, so you have to figure out what to do with those. Then you can use other heuristics to improve the results. Suppose another sentence just had "Zemeckis" in it; that's a name. But then suppose another sentence had a token not in your lexicon like "Samoflange"...do you count that as a name? What about lexical tokens that are names, like "Bush"? So you can try things like only counting sequences that have 2 or more tokens (like "Robert Zemeckis") and ignoring ones that have only 1.

And it goes on and on -- endless tweaks to improve the quality of the names you get and suppress non-name sequences.
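A couple of the heuristics above are easy to sketch as a filter over candidate phrases (this is just one possible rule set, not a complete one):

```python
def plausible_name(phrase):
    """Filter candidate phrases: drop anything containing a digit
    (numbers are never name parts) and drop single-token candidates
    like "Samoflange" that we can't corroborate."""
    tokens = phrase.split()
    if len(tokens) < 2:
        return False
    return not any(c.isdigit() for t in tokens for c in t)

candidates = ["Robert Zemeckis", "Samoflange", "Flight 93", "Joe Carnahan"]
print([p for p in candidates if plausible_name(p)])
# → ['Robert Zemeckis', 'Joe Carnahan']
```

Each heuristic is a trade-off: the 2-token rule also throws away legitimate single-surname mentions like the lone "Zemeckis" unless you add a second pass that checks them against names already found.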

To make the project more interesting, try storing your lexicon in a database or some kind of index so you can search it quickly. I like to use SQLite files with an index on the lexicon table myself, but it's a fun assignment to try different things, like in-memory tries.
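A minimal version of the SQLite approach might look like this (the schema is my own guess; the point is just that an indexed lookup stays fast even for word lists with hundreds of thousands of entries):

```python
import sqlite3

# Build the lexicon table. Declaring `word` as PRIMARY KEY gives us an
# index on it for free, so membership checks are fast.
conn = sqlite3.connect(":memory:")  # use a file path for a reusable lexicon
conn.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO lexicon VALUES (?)",
                 [("the",), ("director",), ("staged",), ("in",)])

def in_lexicon(word):
    """Case-insensitive membership test against the lexicon table."""
    row = conn.execute("SELECT 1 FROM lexicon WHERE word = ?",
                       (word.lower(),)).fetchone()
    return row is not None

print(in_lexicon("director"), in_lexicon("Zemeckis"))
# → True False
```

A file-backed SQLite lexicon is nice because it persists between runs; an in-memory trie or plain Python set will win on raw lookup speed but has to be rebuilt every time.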

If you want to try threading, play around with searching the text from different start and end points (thread 1 searches the first 25% of the text, thread 2 the second 25%, etc.), or have different threads search different articles.
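The one-thread-per-article variant is the simpler of the two to sketch, since each thread scans its own text and there's no shared mutable state to synchronize (the extractor here is a stripped-down version with a hardcoded toy lexicon):

```python
from concurrent.futures import ThreadPoolExecutor

LEXICON = {"the", "director", "staged", "in", "and"}

def extract(text):
    """Collect runs of consecutive tokens that aren't in LEXICON."""
    phrases, run = [], []
    for tok in text.split():
        if tok.lower() in LEXICON:
            if run:
                phrases.append(" ".join(run))
                run = []
        else:
            run.append(tok)
    if run:
        phrases.append(" ".join(run))
    return phrases

articles = [
    "the director Robert Zemeckis staged",
    "Joe Carnahan and the director",
]

# One task per article; pool.map preserves input order in the results.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(extract, articles))
print(results)
# → [['Robert Zemeckis'], ['Joe Carnahan']]
```

Splitting a single text at arbitrary percentage boundaries is trickier: a name can straddle the split point, so each chunk needs a little overlap or a merge step at the seams.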

You can try all kinds of different things to keep it interesting and as you start abstracting the problem you can play with all kinds of different control and data structures to accomplish the task. Trying to make this as fast as possible (with all the heuristics turned on) can also be a fun challenge.



