
A First Excercise in Natural Language Processing with Python: Counting Hapaxes - cristoperb
http://catswhisker.xyz/log/2017/9/7/a_first_excercise_in_natural_language_processing_with_python_counting_hapaxes/
======
dec0dedab0de
I get that the point is to be an introduction to the libraries and whatnot,
but was I the only one who immediately thought of just using Counter?

    
    
        from collections import Counter
        import re
    
        # r'\w+' rather than '\w*': the latter also matches empty strings
        [word for word, count in Counter(re.findall(r'\w+', text.lower())).items() if count == 1]

~~~
kleiba
As the article states in the introductory paragraph, this problem encompasses
more than just counting strings. It also involves _" some fundamental tasks of
natural language processing (NLP): tokenization (dividing a text into words),
stemming, and part-of-speech tagging for lemmatization"_, so a little more
work is required here.
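
To make the difference concrete, here is a minimal sketch of how normalizing word forms changes which words count as hapaxes. It is not the article's code: `toy_stem` is a hypothetical, deliberately crude stand-in for a real stemmer or lemmatizer, and the sample sentence is made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def toy_stem(word):
    """Hypothetical, crude stemmer: strip a single trailing 's'.

    Illustration only; a real stemmer or lemmatizer handles far more cases.
    """
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def hapaxes(tokens):
    """Return the tokens that occur exactly once, in first-seen order."""
    counts = Counter(tokens)
    return [tok for tok, n in counts.items() if n == 1]


text = "The cat chased the cats. A dog barks; dogs bark."
tokens = tokenize(text)

# Counting raw strings treats "cat" and "cats" as unrelated words.
raw_hapaxes = hapaxes(tokens)

# Counting after normalization merges inflected forms first.
stemmed_hapaxes = hapaxes(toy_stem(t) for t in tokens)
```

On this sample, counting raw strings reports eight hapaxes, while counting after the toy normalization leaves only "chased" and "a".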

------
newman8r
For anyone interested in more good beginner resources, I really enjoyed this
YouTube playlist on Python NLTK:
[https://www.youtube.com/watch?v=OGxgnH8y2NM&list=PLQVvvaa0Qu...](https://www.youtube.com/watch?v=OGxgnH8y2NM&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v)

Edit: I accidentally linked to another good playlist, but here's the first video
of the NLTK list from the same user:
[https://www.youtube.com/watch?v=FLZvOKSCkxY](https://www.youtube.com/watch?v=FLZvOKSCkxY)

------
grabcocque
Hapax Legomenon is such a satisfying phrase to say. Even the opportunity to
look at it makes my eyes happy.

------
visarga
I counted word n-grams up to length 6 in a corpus of 6 billion words with
Madoka, a library built on the Count-Min sketch data structure.

[https://pypi.python.org/pypi/madoka](https://pypi.python.org/pypi/madoka)
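
For readers unfamiliar with the idea, here is a rough pure-Python sketch of a Count-Min sketch applied to n-gram counting. It does not use Madoka's API; the `CountMinSketch` class, its `width` and `depth` parameters, and the `ngrams` helper are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter using fixed memory: depth rows of width counters."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        """Yield one (row, column) cell per row, using a per-row salted hash."""
        data = key.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        """Increment the key's cell in every row."""
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        """Return the minimum over rows; it may overcount, never undercount."""
        return min(self.table[row][col] for row, col in self._cells(key))


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()
sketch = CountMinSketch()
for n in (1, 2):
    for gram in ngrams(tokens, n):
        sketch.add(gram)

estimate_to_be = sketch.estimate("to be")  # true count is 2
```

The trade-off is that memory stays fixed no matter how many distinct n-grams appear, at the cost of estimates that can be too high when keys collide in every row.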

------
cristoperb
Author here. The misspelling in the title is embarrassing, but luckily not
very noticeable (I've fixed it on the site).

~~~
bluzeee
It's so beautifully put together and so easy to understand. Thank you
very much; it helped greatly in my learning.

