Hacker News

Read the article. It claims they have found a better estimate of word importance in a document than tf-idf, which would be very significant. It also doesn't seem to need text segmentation, which means no language-specific tokenizers are required. The research paper is here (I haven't read it yet):

http://bioinfo2.ugr.es/publi/PRE09.pdf



Hmm, I have no background in this, but how can they determine what a 'term' is for the purposes of frequency without some form of tokenization, unless they're imposing an arbitrary maximum term length and eliminating short terms?


tf–idf: term frequency–inverse document frequency
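For reference, a minimal sketch of the standard tf-idf scoring the article compares against. Whitespace tokenization is a simplifying assumption here; avoiding exactly that step is what the linked paper claims to do.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document by tf-idf.

    docs: list of strings. Assumes naive lowercase whitespace
    tokenization (an assumption, not the paper's method).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: how many docs contain each term
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        # tf-idf = (term count / doc length) * log(N / df)
        scores.append({term: (count / total) * math.log(n / df[term])
                       for term, count in tf.items()})
    return scores
```

A term appearing in every document gets idf = log(1) = 0, so ubiquitous words like "the" score zero regardless of how often they occur.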



