

Ask HN: Most efficient way to test if a string contains a word? - bladeaod

For an app I wrote I need to check if a String contains a word, finding the longest word it contains.
Currently I check the whole string against a dictionary, then check every substring possible with decreasing length until I find a word.<p>Is there a more efficient way to do this?
======
Xichekolas
EDIT: Revised running time and added code (in Ruby)

Build a Trie out of your dictionary. Then plug your string into and record
depth/chars where it deadends. Repeat for you string with the first char
removed. At the end you should have the longest substring.

Minor optimization would be to stop plugging things into the Trie once the
remaining characters in the string are less than the longest word you have
already found.

This should run in linear time (relative to your search string) because the
recursion is limited by the longest word in the dictionary (depth of the
trie). I have tested this out just for fun.

First I grabbed the Wikipedia page for 'line of succession to the British
throne' and stripped it down to text only (352,000 chars). Then I added
'monarch' to my dictionary. Then I took the first 10k chars of the text and
put it into my trie, getting back monarch as my match and timing it. Then I
figured the ratio of time taken to size of string (10k). I repeated this for
20k chars, 30k chars ... 350k chars, and the ratio remains constant,
indicating time increases in proportion to string length... linear time.

    
    
      DICTIONARY = %w(a ants apple bear bicycle)
      
      class LameTrie
        def initialize(dict)
          @trie = {}
      
          dict.each {|word| insert(word) }
          # obviously it'd be nice to only build this once
          # and then write it out to disk, since our
          # dictionary probably won't change much
        end
        
        def insert(word, level = @trie)
          c, rest = h_t(word)
          level[c] = {} unless level.has_key?(c)
          if rest.empty?
            level[c][:term] = true
          else
            insert(rest, level[c])
          end
        end
        
        def longest_word(str)
          result, n = '', str.length
          n.times do |i|
            t = longest?(str[i..-1])
            result = t if t.size > result.size
            break if result.size > n-i # our micro optimization
          end
          result
        end
        
        def longest?(str, level = @trie)
          return '' if str.empty? or str.nil?
          
          c, rest = h_t(str)
      
          if level.has_key?(c)
            x = longest?(rest, level[c])
            if level[c].has_key?(:term) || !x.empty?
              c + x
            else
              ''
            end
          else
            ''
          end
        end
        
        def h_t(str)
          return str[0,1], str[1..-1]
        end
      end
      
      t = LameTrie.new(DICTIONARY)
      
      puts "Dictionary: #{DICTIONARY.inspect}"
      puts '[emptystring] -> ' + t.longest_word('')
      %w(a an antler crab crabapple pants polarbear adntsants oteuhbicyclentehut bearclaw).each do |str|
        puts str + ' -> ' + t.longest_word(str)
      end

~~~
Xichekolas
Just to clarify, when I said I 'put the first 10k chars into my Trie', I
didn't mean I inserted them into the Trie... merely that I called longest_word
with a 10k-char string.

------
ajuc
I'll use radix tree for dictionary, to speed up testing.

For words (son, sun, so, sunny) it would be:

    
    
       root -> S -> O -> .
               \     \-> N -> .
                \-> U -> N -> .
                          \-> N ->  Y -> .
    

Then - for each character in given string I'll make pointer pointing at root
of dictionary tree.

Then in loop I would advance down each pointer in turn, until it's no longer
possible to go down the tree, or the until the characters for given pointer
ended.

When no pointer can be advanced, the pointer that gives longest word is the
best.

------
bladeaod
So by the nature of the app, the string I am checking will often increase by 1
letter, and then I check the string again. Right now I save a list of strings
I have checked, because it is quicker to see if a string in in that small list
than in the dictionary, then once I find a word I clear the list of strings I
have checked. Is there any thing else like this I could do to speed things up
so I am not constantly checking if the same strings are a word? Should I not
be clearing that list so often?

~~~
khafra
There's always <http://en.wikipedia.org/wiki/Aho-Corasick_algorithm>

~~~
ajuc
Everything has been done before :)

------
mbrubeck
Just to be clear, is this a valid example of the output you're looking for?

    
    
        longest_word("oteuhbicyclentehut") => "bicycle"

~~~
bladeaod
At first glance, yes, unless there is a longer word in there

