
Elegant n-gram generation in Python - striglia
http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
======
coolsunglasses
ngrams from the book NLP for the Working Programmer:

    
    
        import Data.List (tails)

        ngrams' :: Int -> [b] -> [[b]]
        ngrams' n = filter ((==) n . length) . map (take n) . tails
    

ghci session:

    
    
        λ> let ngrams' n = filter ((==) n . length) . map (take n) . tails
    
        λ> inputList
        ["all","this","happened","more","or","less"]
    
        λ> tails inputList
        [["all","this","happened","more","or","less"],["this","happened","more","or","less"],["happened","more","or","less"],["more","or","less"],["or","less"],["less"],[]]
    
        λ> map (take 2) (tails inputList)
        [["all","this"],["this","happened"],["happened","more"],["more","or"],["or","less"],["less"],[]]
    
        λ> filter ((==) 2 . length) (map (take 2) (tails inputList))
        [["all","this"],["this","happened"],["happened","more"],["more","or"],["or","less"]]
    

Pointfree Haskell code often ends up being a lot like piping together Unix
commands.

Except typed and pure. Thanks to global type inference, if you have a target
type in mind, you can slap something together, query its type with _:t_ in
ghci, and see if it looks like what you wanted.

------
e271828
This is an elegant solution for sequences, but it doesn't work for arbitrary
iterables (which need not support slicing). While this generalization might
not be needed for language ngrams, the general problem of taking n items at a
time from an iterable pops up in various places.

Here's a generator that yields ngrams from an arbitrary iterable:

    
    
      from collections import deque
      from itertools import islice
      
      def ngram_generator(iterable, n):
          iterator = iter(iterable)
          # Pre-fill a sliding window with the first n-1 items
          d = deque(islice(iterator, n-1), maxlen=n)
          for item in iterator:
              d.append(item)  # maxlen=n evicts the oldest item automatically
              yield tuple(d)
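
A quick illustrative check, feeding it a generator expression rather than a
list, since a generator can't be sliced:

    
    
      words = (w for w in "all this happened more or less".split())
      list(ngram_generator(words, 2))
      # [('all', 'this'), ('this', 'happened'), ('happened', 'more'),
      #  ('more', 'or'), ('or', 'less')]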

------
syllogism
But but...Why?

I would've just written:

    
    
        def find_bigrams(input_list):
            ngrams = []
            last_word = '-EOL-'  # sentinel paired with the first word
            for word in input_list:
                ngrams.append((last_word, word))
                last_word = word
            return ngrams
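
For the example list used in the thread above, that gives (note the '-EOL-'
sentinel paired with the first word):

    
    
        find_bigrams(["all", "this", "happened", "more", "or", "less"])
        # [('-EOL-', 'all'), ('all', 'this'), ('this', 'happened'),
        #  ('happened', 'more'), ('more', 'or'), ('or', 'less')]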
    

I have trouble seeing the requirement to generalize to arbitrary n as
important...If the data is big enough to want n >= 4, it's probably large
enough that you'll write this in another language anyway. And n is unlikely
ever to be larger than 5.

~~~
fiatmoney
You don't want to restructure your code to change the degree of N. N is a
hyperparameter; you expect it to change as you figure out what gives you the
best result. You also typically want 1-, 2-, and 3-grams and so on, not just
a single degree, and it's silly to have to call different functions for each
of those.
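
Something like this rough sketch (the function name is just illustrative)
keeps n as a parameter and hands back every order up to max_n in one call:

    
    
        def ngrams_up_to(words, max_n):
            # One list of tuples per order: unigrams, bigrams, ..., max_n-grams
            return {
                n: [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
                for n in range(1, max_n + 1)
            }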

And n-grams of quite large degree are not uncommon in hardcore natural
language processing or bioinformatics, both of which Python (usually wrapping
Numpy and Scipy) is heavily used for.

For instance, written Chinese doesn't delimit its words (all the characters
are packed together), which means you usually end up doing something like
taking n-grams (of potentially large degree) over the character space, doing
a lot of lookups into a dictionary and a language model, and seeing if you
can get everything to "fit" so that all characters are accounted for and the
resulting sentence makes sense.
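
A toy sketch of that idea (greedy longest-match only, with no language-model
scoring, and with a made-up lexicon argument):

    
    
        def greedy_segment(text, lexicon, max_len=6):
            # At each position, take the longest character n-gram found in
            # the lexicon, falling back to a single character.
            words, i = [], 0
            while i < len(text):
                for n in range(min(max_len, len(text) - i), 0, -1):
                    candidate = text[i:i + n]
                    if n == 1 or candidate in lexicon:
                        words.append(candidate)
                        i += n
                        break
            return words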

~~~
syllogism
I didn't think of character n-grams; that's a case where, yeah, you do want
larger n. Same with bioinformatics.

But as far as word n-grams go, I've been doing NLP research for over ten
years, and you almost never want 4-grams or 5-grams, let alone n-grams of
greater length. The data's simply too sparse to be useful. So it's really a
matter of generating bigrams and generating trigrams, which I think it's
reasonable to have separate functions for.

