A search engine in 80 lines of Python (alexmolas.com)
634 points by alexmolas 9 months ago | 95 comments



This is really cool. I have a pretty fast BM25 search engine in Pandas I've been working on for local testing.

https://github.com/softwaredoug/searcharray

Why Pandas? Because BM25 is one thing, but you also want to combine with other factors (recency, popularity, etc) easily computed in pandas / numpy...
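
For example, once the BM25 scores live in a DataFrame, blending in other ranking signals is only a few lines of pandas / numpy. A rough sketch (the column names and weights below are made up for illustration, not SearchArray's API):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "doc_id": ["a", "b", "c"],
        "bm25": [12.3, 9.1, 7.8],     # scores from your BM25 implementation
        "days_old": [2, 40, 400],
        "clicks": [15, 300, 40],
    })

    recency = np.exp(-df["days_old"] / 30.0)   # decay older documents
    popularity = np.log1p(df["clicks"])        # dampen raw click counts
    df["score"] = df["bm25"] * (1 + 0.2 * recency + 0.1 * popularity)

    print(df.sort_values("score", ascending=False))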

BTW phrases are the hard thing. There are a lot of edge cases in phrase matching. Not to mention slop, etc. And you want to compact positions into as little memory as possible

https://github.com/softwaredoug/searcharray/blob/main/search...


Thanks for your comment Doug! I just bought your book (Relevance Search) and I'm planning to read it asap.

I'll take a look at your project and try it for experimenting; thanks for sharing.


I also bought the book (it’s Relevant not Relevance BTW), that one and Deep Learning for Search were invaluable when building a news search engine a few years ago (right before and in the middle of the pandemic).


Can confirm this book is excellent.

It's one I point people at all the time when they ask me why something isn't working as expected in any standard search tool, and something I reference from time to time to refresh my own knowledge.

Well worth the money.


Thank you! It's a bit old now, but I think the principles in it are still valid :)


Hey, I tackled phrase matching in my toy project here: https://github.com/vasilionjea/lofi-dx/blob/main/test/search...

I think I tested it thoroughly but any feedback would be appreciated!

Edit: I delta-encoded and base36-encoded the positions
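
The general idea looks roughly like this in Python (a sketch of the technique only, not lofi-dx's actual on-disk format):

    def encode_positions(positions):
        # Delta-encode: store the gap to the previous position rather than the
        # absolute value, then write each gap in base 36 so small gaps stay short.
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"

        def to_base36(n):
            if n == 0:
                return "0"
            out = ""
            while n:
                n, r = divmod(n, 36)
                out = digits[r] + out
            return out

        deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
        return ",".join(to_base36(d) for d in deltas)

    print(encode_positions([3, 17, 120, 121]))  # "3,e,2v,1"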


Oh nice I’ll take a look!


Have you found that including sentiment analysis has helped (or hurt) with phrases? I also have found that phrases are difficult to work with and I’m wondering what I can do to improve performance.


How come you found this post and commented on it so quickly? Do you have some sort of search scanning the front page for terms of interest, or was it by chance? Curious


I got that HN gaaaame


Huh? I check Hacker News multiple times a day - it's not odd to click on an article within an hour of it being posted.


It's not the first time I've seen an article posted and then an expert in the field comment on it rather quickly. I thought I might be missing something about how other people use this site. I had no negative intentions asking this, and thanks for the answer ;)


HN has an RSS feed [0] so there's no need to keep refreshing or make the rounds of this and other sites that have interesting information.

I have my own feed setup with sites I like to frequent [1].

[0] https://hackaday.com/blog/feed/

[1] https://mechaelephant.com/feed
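
If you'd rather script it than use a reader, here's a minimal sketch with feedparser (a third-party package; the keyword filter is just an example):

    import feedparser  # pip install feedparser

    feed = feedparser.parse("https://hackaday.com/blog/feed/")
    for entry in feed.entries:
        if "python" in entry.title.lower():
            print(entry.title, entry.link)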


what software did you use for your own feed?


Various scripts in python, shell and javascript [0].

[0] https://github.com/abetusk/www.mechaelephant.com/tree/releas...



Awesome!


thanks! I'm currently looking at the files

how did you filter which posts appear in the feed?

EDIT: I was able to make it work, I replaced the node module xmltojson with xq (python lib available from pip)


https://f5bot.com is a way to get an email notification when a keyword is mentioned


Some people set up something like a Google Search Alert for terms about them or their writings, etc.

Also, a lot of subject matter experts hang out here so it just happens a lot :)



Checks out. Most of the difficulty in search is dealing with the data volume. The logic itself is surprisingly easy, or it can be. Of course you can complicate it endlessly, but this project successfully cuts most of the fluff...

So with that in mind, approaching the problem not as how to make the search engine bigger, but how to make the data smaller (physically, or with a better signal-to-noise ratio) goes a very long way.


Looking at the code (src/microsearch/engine.py), we have:

    class SearchEngine:
        def __init__(self, k1: float = 1.5, b: float = 0.75):
            self._index: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
            self._documents: dict[str, str] = {}
            self.k1 = k1
            self.b = b

I've no idea what `k1` or `b` are. Nor is there a single comment in the entire file. Are comments considered unfashionable these days?

Looking at `_documents`, I'm guessing the keys are URLs and the values contents of those URLs, but I might be wrong.

The whole thing looks like it would've been a useful resource for people to learn how to build search engines with, that people could build on, had the writer bothered to document it. But as it is I'm disappointed with the poor code.


That's explained in the article, which serves as the documentation for the code within the article. The link for BM25 goes to some of the math, and a little more searching the internet about BM25 parameters can lead you to some relevant articles on how to choose them.


BM25 is famous enough in traditional search that it isn't worth explaining all over again.


Hi, author here.

If I wanted a catchy title for the post I needed to cut the number of LOC as much as possible ;)

Joking apart, thanks for your feedback. I agree that usually it's better to have documentation and code together, but in this case since it's an educational project I decided to split code and documentation, and document the code in a blog post.


Put the comments at the end of the lines at least, maybe? LOC remains the same, but now there are at least some comments.


Fair enough


On a mobile device, but those are the standard weighting values for either TF-IDF or BM25, in this case BM25.

A comment would be useful but they are also instantly recognisable to anyone familiar with the problem.


> instantly recognisable to anyone familiar with the problem.

I always love reading those when not familiar. It's almost as funny as reading something one already knows, waiting for the punch line...


k1 and b are tuning parameters for the BM25 ranking function [1]. These are not names the OP invented. Virtually every implementation and definitely every textbook uses these variable names. You give them the names k1 and b because otherwise nobody who's into information retrieval will understand what they do.

[1] https://en.wikipedia.org/wiki/Okapi_BM25
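
For anyone who hasn't run into them, here's a rough sketch of where k1 and b enter the standard BM25 formula (the names and the 1.5 / 0.75 defaults follow the usual textbook presentation; this is not lifted from the OP's code):

    import math

    # One term's BM25 contribution (Wikipedia-style IDF).
    # k1 caps how much repeated occurrences of a term can help (saturation);
    # b controls how strongly scores are normalised by document length.
    def bm25_term(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))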


This trend of `a: float` always reminds me of the Rich Hickey "you don't want types, you want proper names" talk. I really hate this (to me, Go-inspired) tendency toward undescriptive single-letter variables, with the type system abused as a naming assistant.

Names can convey proper semantic information about what your program does; use them, goddammit.


> tendency of undescriptive single letter variable

There are 2 schools of thought on which one is clearer,

    F = G * m1 * m2 / r**2
or

    force = gravitational_constant * mass_of_body_1 * mass_of_body_2 / distance_between_bodies ** 2


Physicist:

> F = G * m1 * m2 / r**2

Computer scientist:

    nonrelativistic_gravitational_force = ( 
      Physics.NonRelativistic.Gravity.gravitational_constant 
      * body1.NonRelativistic.mass() 
      * body2.NonRelativistic.mass() 
      / body1.NonRelativistic.distanceTo(body2) ** 2
    )


Software engineer straight out of uni:

# Get force of gravity as below to be used later

F = G * m1 * m2 / r**2

Software engineer 5 years after uni:

    gravitational_force = ( 
      PhysicsContextConstructorFactory.createByRelativisticEnum(SystemConfig.getRelativisticEnum()).construct().getGravitationalConstant().value()
      * body1.getRelativisticContext(SystemConfig.getRelativisticEnum()).mass().value() 
      * body2.getRelativisticContext(SystemConfig.getRelativisticEnum()).mass().value() 
      / SystemConfig.getRelativisticContext(SystemConfig.getRelativisticEnum()).distanceTo(body1, body2).value() ** PlatformUnsignedInt(PlatformMinMaxAwareUnsignedInt(2).value()).getUnsignedInt().value()
    )

Software engineer after 15 years:

# Newton's law of universal gravitation is well within wanted margin of error

F = G * m1 * m2 / r**2


Sadly the 5y SE forgot about dependency injecting the computation... after all we're always on the brink of new physics discovery, we'd hate to have to rewrite all that code when we can readily swap in fresh laws with the proper architecture!


Oh, but that comes back as a needs-to-be-fixed on the pull request from the 6y SE, so no worries at all.


Short names are OK if they are explained somewhere. I often explain them in a comment at the start of a class or function.


This is right, but if you are implementing a formula known in the domain or documented elsewhere, you can (and should) use the letters used in the formula (in this case, b and k1) instead of making up names.


I agree with you that better names are always preferable. But type hints also work as documentation. We can have both.

However, in this particular case the undescriptive names are like this for historical reasons. I agree these are not the best names, but they are the names used in the literature. If I was working in physics I would probably use "c" for the speed of light or "kb" for the Boltzmann constant, which are not very descriptive names.


What is the point of flexing about LOC if it is not the total number of \r\n, since we are using external deps? I know there is no SI unit for a codebase, but I think we should measure cognitive load somehow.


Although it's not formal, my team sometimes says "this code is not grug" or "this code is pretty grug" in reference to https://grugbrain.dev


One of the best scholarly essays on the net.


This grug not big brain. This grug ate food and read article on big screen at the same time. Now food on screen, not in belly. Grug still hungry, but need to find rag now to clean screen. Otherwise no more work and no more shiny rocks for grug. But grug thanks other grug for article. Grug belly empty, but grug brain full now.


> "grug tempted reach for club when too much agile talk happen but always stay calm"


me nod head save to pdf not just bookmark link


Best thing I've read today, thanks!


That's awesome, I'm stealing this


The 80 line search engine isn't using any external deps though. It only imports collections, math and string, all in the standard library. Maybe it'd be more accurate to call it a search engine engine though. The crawler and interface aren't counted towards that goal, but they are obviously needed in some form, and the implementations presented add a bunch of lines and a lot of libraries.

But even then, those libraries aren't related to search engines. If we start counting generic dependencies like pandas and fastapi, we might as well start counting the millions of loc of the operating system needed to run this search engine, and the firmware in the network card. Maybe even account for the complexity of the hardware this is running on.


Why not celebrate the achievements of the industry allowing us to build something like this in 80 LOC?


I mean, I could import this achievement into my own project and build a search engine in 1 LOC.


"X in N lines" is interesting because it's going to show you the minimal implementation of the interesting or central part of the solution without all the other moving parts of a production system, usually for the purpose of demonstrating how something works.

You can see how your 1-liner pitch doesn't fulfill this expectation.


You could just use an existing search engine and it would be 0 LOC, but I think you're missing the point. The focus wasn't on 80 LOC, but rather on being able to talk through it in a short blog post.


I don't think that LOC affects one's ability to effectively communicate the way their code operates. And, if we really need a metric, lowering the number of operations required to attain the desired result would be better than lowering lines of code, since fewer ops = less explanation.


It's meaningful here because if it said "A search engine in 4000 lines of Python" most readers' eyes would glaze over, but 80 is short enough to warrant a glance.


"A search engine in 80 columns of Python code!"


"A search engine in 80 lines of Python code" (but every line is 10 statements separated by semicolon)



I am aware of it, but as far as I understand, cyclomatic complexity is used to compare implementations of single flows (algorithms), not for codebases. What is the cyclomatic complexity of Facebook? :)


Probably at least 5


Operating system and programming language are also external dependencies.


I like it!

Here is a recommendation engine in < 20 lines of Python to go along with your search engine (if you keep session logs of clicked urls).

  from typing import List

  def build_recommendations(logs: List[List[str]], window_size: int = 10,
                            max_recommendations_per_url: int = 50) -> dict:
      recommendations = {}
      for session in logs:
          for i in range(0, len(session)-1):
              url_id = session[i]  # reference element
              window = session[i+1 : i+window_size]  # sliding window
              recommendations[url_id] = recommendations.get(url_id, {})
              for pos, next_link in enumerate(window):
                  weight = window_size - pos  # elements in the window get decreasing weight proportional to their distance from the reference element
                  recommendations[url_id][next_link] = recommendations[url_id].get(next_link, 0)
                  recommendations[url_id][next_link] += weight

      for url_id, link_recommendations in recommendations.items():
          # sort and truncate the recommendations
          recommendations[url_id] = dict(sorted(link_recommendations.items(), key=lambda item: item[1], reverse=True)[:max_recommendations_per_url])

      return recommendations

  recommendations = build_recommendations(your_logs)
  print(list(recommendations[some_url_id].keys()))  # gives an ordered list of the recommended url_ids for the given url_id

With some tweaking (mix typed queries and clicked urls in the logs you feed) you can get a spellcheck suggestion out of it as well :)


This is very cool and very educational. Don't deploy it, though. :-)

I needed something like this once, but at a slightly larger scale (a few tens of thousands of documents). The answer, as always, was [sqlite](https://www.sqlite.org/fts5.html); structurally, though, it's just what you have here but with someone else writing the inverted-index persistence layer.
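
For reference, a minimal sketch of that setup with Python's built-in sqlite3 module (assuming your SQLite build ships with FTS5; the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # or a file on disk
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(url, content)")
    conn.execute("INSERT INTO docs (url, content) VALUES (?, ?)",
                 ("https://example.com", "a tiny search engine in python"))
    conn.commit()

    # bm25(docs) is FTS5's built-in ranking function; smaller values rank higher.
    rows = conn.execute(
        "SELECT url, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
        ("search engine",),
    ).fetchall()
    print(rows)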


I use SQLite FTS for just about everything. Has never let me down.


Wow it literally has the same equations and everything. Thanks for this comment, gave me a "shiver of understanding" if that makes sense haha.


Google allows you to search for "search engine" (notice the double quotes) and it’ll only show you results where the two words appear in this specific order.

At least it only does so some of the time, unfortunately. What power users want is "grep for the Web", not "Google, tell me what you want me to see."


> What power users want is "grep for the Web", not "Google, tell me what you want me to see."

I can almost guarantee that nobody actually wants this. "Grep for the web" is strictly worse than a search engine that does the tiniest amount of query expansion. Google is definitely taking too many liberties in interpreting the query, but there are many things any search engine should do that will be a straight improvement over not doing them.

The problem with Google search right now is that it's hard to reason about why it gives the results it does, seemingly because they rely too heavily on embeddings to compare strings. It's frustrating when "cat food" matches "dog restaurant" because the two were semantically close in some embedding space that doesn't quite align with human reasoning.


> chunk for chunk in chunks if chunk

I believe there's a saying about a woodchuck somewhere in here


Imo, it's not fair to talk about 80 lines of code while using third-party libraries (feedparser, bs4, etc)


If they'd built it on top of elasticsearch I'd agree with this sentiment, but given the actual search engine bit is implemented in those 80 lines, I think it's fair. The libraries being pulled are exactly the sort of libraries you shouldn't try to hand-roll.

(sometimes you see those articles like 'build your own search engine' and it's a guide on how to install searxng or yacy or something.)


it is if they're extremely general and mainstream


Nice! It wouldn't be much work to add fuzzy-search functionality (all results with a prefix edit distance below some threshold delta, so that a search for "hackrnew" matches "hackernews"). Basically, what you would do is add an additional inverted index, but this time the keys are n-grams of words in your document collection (typically 3-grams), and the postings are the words (or their IDs) in which these n-grams occur. There is then a nice lemma that basically says the following (where PED is the prefix edit distance, and N(x) are the n-grams of word x):

If PED(x, y) <= delta, then |N(x) ∩ N(y)| >= |N(x)| - n ∙ delta

That is, x and y must have at least |N(x)| - n ∙ delta n-grams in common to have a PED(x, y) less than or equal to delta.

If you now have an input x, you calculate N(x) and retrieve all postings from the n-gram index for each n-gram of x. You can now merge all of these postings and get a list that looks like this: [worda, worda, worda, wordb, wordb, wordc] (for each n-gram that x and some word y have in common, you get one entry of y). If you merge the duplicates, you get: [(worda, 3), (wordb, 2), (wordc, 1)], and for each y (in this case, worda, wordb, wordc), the number in the corresponding tuple is |N(x) ∩ N(y)|.

If this number is larger than |N(x)| - n ∙ delta, you explicitly compute PED(x, y) and check whether it is below your threshold. If the number is smaller, you can simply skip it, saving you large amounts of costly PED calculations.

Your result is a list of words y with a PED(x, y) to the input x below some threshold, and you can then use this list of words to query your existing index.

I used this approach many years ago to implement a fuzzy client-side JS search engine on https://dont.watch/ (if you look into the JS code, you can see that the inverted index and the (compressed) n-gram index are simply transferred in the JS file). The actual search engine is around 300 lines of JS, with no external dependencies and some very basic heuristics to improve the search results.
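
A rough Python sketch of the filtering step described above (the "$" prefix padding, the defaults, and leaving the final explicit PED check to the caller are my choices here; the dont.watch version is JS):

    from collections import defaultdict

    def ngrams(word, n=3):
        padded = "$" * (n - 1) + word.lower()  # pad so short prefixes still yield n-grams
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}

    def build_ngram_index(vocabulary, n=3):
        index = defaultdict(set)
        for word in vocabulary:
            for gram in ngrams(word, n):
                index[gram].add(word)
        return index

    def fuzzy_candidates(query, index, n=3, delta=2):
        grams = ngrams(query, n)
        shared = defaultdict(int)  # word -> |N(x) ∩ N(y)|
        for gram in grams:
            for word in index.get(gram, ()):
                shared[word] += 1
        # Lemma: PED(x, y) <= delta implies |N(x) ∩ N(y)| >= |N(x)| - n * delta,
        # so anything below the bound can be skipped without computing PED.
        # Survivors still need the explicit PED(x, y) <= delta check.
        bound = len(grams) - n * delta
        return [w for w, count in shared.items() if count >= bound]

    index = build_ngram_index(["hackernews", "hacker", "python"])
    print(fuzzy_candidates("hackrnew", index))  # ['hackernews', 'hacker'] (order may vary)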


How much does that increase the index size?


Last time I checked lxml.html and lxml.html.clean (possibly with cssselect) were much faster than BeautifulSoup. Not sure this is still the case.


This checks out! Pretty much exactly what we did in my undergrad Information Retrieval class.


Is it really a good idea to build something like this (big data, needs to crunch data fast) in Python (slow)?


I would argue that Python does not inhibit this project as much as you imply. There are multiple benefits to Python, such as:

- A built-in dictionary type, used for indexing words

- Clean and easy to read code, which is one of Python's core strengths

- It's fast to draft code in, perfect for toy programs

- Easy async support, which the author comments on

- Plenty of libraries to do the heavy lifting of tasks not focused on by the post, such as hosting a web server, rendering template HTML, and parsing CLI arguments

Yes, Python is not fast relative to C or Rust, but it's perfect for this type of project.


Yes. Paragraph 3:

> This implementation doesn’t pretend to be a production-ready search engine, just a usable toy example showing how a search engine works under the hood.

The whole point is expressiveness.


I'm not a fan of Python, but the bigger "issue" here, if this was meant to be production code (it's clearly not), is not the choice of Python but the algorithm choices. It shows the principles, but for a production search engine there are a whole lot of tricks you'd apply whether you stick with Python or not.

And once you do, odds are you'd find most of the runtime spent in very small portions of code doing fairly basic composable operations on large compressed arrays, and you can get very far with just rewriting a tiny core in something faster.

The number of people who need a scale where this is hard to do fast enough is small...


This is a toy project to understand the space. He is using Solr at work.


"Crunch data fast" is something numpy is really good at :)

You can see my comment about SearchArray above, but you can do a lot of things at near-native performance if you embrace array-based programming


For an inverted index with JavaScript/Typescript, here is my implementation: https://github.com/vasilionjea/lofi-dx

It was a fun little project but definitely way more than 80 LOC :)


Thanks for this lovely, well-written account. It made me wish I still taught this kind of coding, because it would be such a lovely example for students to work through. (We used to do a "build your own Facebook and Google in a week" bootcamp.) Very inspirational.


Really cool project! Building a search engine from the ground up to tackle small site discoverability? That's a challenge I can get behind. I'm especially into how you used asyncio in Python for faster crawling – smart move.

But, let's talk scale and features. As it stands, handling bigger data sets or adding more complex search features might be tough. On the features side, playing around with query operators or n-gram indexing could seriously level up your search results.

Expanding beyond RSS for content could also give your engine a nice touch. Just throwing in my thoughts – been around the block with search tech a bit. Can't wait to see what's next for your project!


Don't use keywords (1-grams); the best results for English can be achieved with 2- and 3-grams. n-grams retain context.


I think you'd probably get the best result with both. There's definitely real merit to a keyword-level understanding of which terms appear in the title or as a named entity, for example.


Usually you would want weighted n-grams, with 1-grams having the lowest weights. In many cases it's better to weight them at zero. For English, 4-grams react to generic phrases/idioms, 5+ are way too selective, and keywords alone usually reduce relevance too much. 2- and 3-grams are the best.
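
A rough sketch of what that weighting could look like (the weights here are made up for illustration):

    def weighted_ngrams(tokens, weights=None):
        # 2- and 3-grams carry the context; lone keywords get a small weight
        # so they don't dominate relevance.
        weights = weights or {1: 0.2, 2: 1.0, 3: 1.0}
        features = {}
        for n, w in weights.items():
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                features[gram] = features.get(gram, 0.0) + w
        return features

    print(weighted_ngrams("a search engine in python".split()))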



I have dabbled a little bit in that subject myself. Some of my notes:

- some RSS feeds are protected by Cloudflare. It is true, however, that this is not an issue for the majority of blogs. If you want to go further, Selenium would be one way to handle Cloudflare-protected links

- sometimes even headless Selenium is not enough and a full-blown browser driven by Selenium is necessary to fool the protection

- sometimes even that is not enough

- then I started to wonder why some RSS feeds are so well protected by Cloudflare, but who am I to judge?

- sometimes it is beneficial to spoof the user agent. I feel bad for setting my user agent to Chrome, but again, why are RSS feeds so well protected?

- you cannot parse or read the entire Internet, so you always need to think about compromises. For example, in one of my projects I have narrowed the scope of my searches to domains only. Now I can find most of the common domains, and I sort them by their "importance"

- RSS links do change. There needs to be an automated way to disable feeds so that inactive domains are not checked forever

- I do not see any configurable timeout for reading a page, but I am not familiar with aiohttp. Some pages might waste your time (see the sketch after this list)

- I hate that some RSS feeds are not configured properly. Some sites do not provide a valid meta "link" with "application/rss+xml". Some RSS feeds have naive titles like "Home", or no title at all. Such a waste of opportunity
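
The timeout sketch mentioned above, for what it's worth (aiohttp's ClientTimeout caps how long a single fetch may take; the URL and the 10-second limit are just placeholders, not anything from microsearch):

    import asyncio
    import aiohttp

    async def fetch(url: str) -> str:
        timeout = aiohttp.ClientTimeout(total=10)  # seconds for the whole request
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(url) as resp:
                return await resp.text()

    print(asyncio.run(fetch("https://example.com/feed.xml"))[:200])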

My RSS feed parser, link archiver, and web crawler: https://github.com/rumca-js/Django-link-archive. The file rsshistory/webtools.py might be especially interesting. It is not advanced programming craft, but it gets the job done.

Additionally, in another project I have collected around 2378 personal sites. I collect the domains in https://github.com/rumca-js/Internet-Places-Database/tree/ma... . These files are JSONs. All personal sites have the tag "personal".

Most of the things are collected from:

https://nownownow.com/

https://searchmysite.net/

I also wanted to process domains from https://downloads.marginalia.nu/, but haven't had time to work out the structure of the files


> I also wanted to process domains from https://downloads.marginalia.nu/, but haven't had time to work out the structure of the files

Feel free to shoot me an email if you have any questions about how to use the files. They're very much there to fan the flames of other indie search projects :D


Also, forgot to add: for this approach you're better off using sketches (Bloom filter / HyperMinHash). Same results with lightning performance.


I know it is unrelated, but I liked your website.


thanks! The design was carefully chosen; I was looking for something minimalistic. I'm glad you enjoyed it :)


Just FYI, PageRank and BM25 address orthogonal problems: reputation/popularity (PageRank) and query/semantic matching (BM25). Saying you're using BM25 instead of PageRank makes no sense.


Both are similar in the sense that they can be used to sort a series of sites/documents by relevance. PageRank assumes that sites with better in/out links are more relevant, while BM25 assumes that documents with high keyword matches are more relevant. In this sense both solutions are similar.
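
To make the distinction concrete, here's a toy power-iteration PageRank (the three-page link graph and 0.85 damping factor are made up): it yields one query-independent score per page, whereas BM25 scores a document against a specific query.

    # Toy PageRank by power iteration.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    pages = list(links)
    damping = 0.85
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(50):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    print(rank)  # query-independent popularity scores, unlike BM25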



