

Show HN: TF-IDF search engine in 30 lines of Scala - felipehummel
https://github.com/felipehummel/TinySearchEngine

======
boyter
You can do the same in Python (in about the same amount of lines I suspect).
The below is the core of the linked implementation (Vector Space) in 15 lines
of Python. All you need is something to build a clean concordance on the
search terms/documents, then just compare the terms against the documents and
sort based on the return value of relation.

Actually the below implementation is usually the first non trivial thing I try
to implement in any language I am learning.

    
    
      import math
    
      class VectorCompare:
        def magnitude(self,concordance):
          total = 0
          for word,count in concordance.iteritems():
            total += count ** 2
          return math.sqrt(total)
    
        def relation(self,concordance1, concordance2):
          relevance = 0
          topvalue = 0
          for word, count in concordance1.iteritems():
            if concordance2.has_key(word):
              topvalue += count * concordance2[word]
          return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
    
    

EDIT - Was going to fork this on GitHub and make a quick Python port but
corporate firewall got in the way.

~~~
boyter
Cannot edit anymore, but here is an example I had written some time ago in
Python. [http://www.wausita.com/2010/08/build-vector-space-search-
eng...](http://www.wausita.com/2010/08/build-vector-space-search-engine-
python/)

------
DrJosiah
A bit opaque, but cute. I've got a Redis + Python version (keeps the index in
Redis) that I talked about last year: [http://dr-
josiah.blogspot.com/2010/07/building-search-engine...](http://dr-
josiah.blogspot.com/2010/07/building-search-engine-using-redis-and.html)

------
whakojacko
As someone who works on search and is a scala fan, I'm super impressed. Next
obvious question: how many lines would it take to add some sort of stemming?

~~~
felipehummel
Maybe in a couple of lines we could do a very naive plural removal. I've seen
code for stemming, they have tons of different conditions and possibillities,
I guess it would be difficult to condense that into few lines.

------
freakinasshowl
i don't think tf-idf relevant for tiny in-memory index >__>

~~~
felipehummel
The index does not have to be tiny, the source code is. I tested with a 500k
documents dataset.

~~~
boyter
Out of curiosity how long did that take to index?

