Hacker News new | past | comments | ask | show | jobs | submit login
Building a full-text search engine in 150 lines of Python code (degoe.de)
381 points by bartdegoede 19 days ago | hide | past | favorite | 83 comments

Nice introduction. A good ramp-up from basic splitting to ranking.

It does need to be said that when Lucene had a set of features this small, it was also pretty tiny. And, if those are the needs, one could still download it and it will probably run on modern JVM:


lucene-1.4.3.jar 2004-11-29 14:13 316K

It is a bit bigger of course, but that's because it already had stemmers, multilingual support, multiple Query Parsers including phrase, RAM and Disk directory implementations and a bunch of other advanced features one can easily see by unzipping the jar file.

it amaze me how much is about creating an eco-system to scale and handling complexity and how software grow to answer that instead of trying to split into two different piece of software. Many people/companies/websites just need those basic features when search is not a core essential capability

Well, Solr and Lucene projects are right now in the process of splitting up (after joining for versions 3-8).

And, you could have always used Lucene directly as an embedded search library. Amazon does, for example, for their customer facing search as their scalability patterns do not align with either Solr or Elasticsearch.

And if you do use Lucene directly, you can choose just the libraries that apply to your use case. I think at minimum, you can get away with maybe 3 jars (core, queryparser, analyzers-common) and that's just over 5Mb for the latest Lucene, which includes things like: http://blog.mikemccandless.com/2021/03/open-source-collabora...

Sometimes, there is a disconnect between implementation that is very flexible and messaging that just shows 'use everything' garden path.

> Amazon does, for example, for their customer facing search as their scalability patterns do not align with either Solr or Elasticsearch.

Can you explain this to a non-programmer?

That is a good challenge, as I am sure their reasons were extremely technical. Twitter also uses Lucene and also in different ways. And I am not at Amazon or Twitter, so I just go by their public presentations, quite technical ones.

But basically I suspect it is like this.

Vast Majority of actual search magic (vector analysis and heavy math and optimized index management) is in Lucene. But Lucene does not care much about getting data into the system (json, csv, xml, etc) or getting it out. It just has internal representation of documents and fields. Plus it does not do any sort of multi-index management, needed for scale.

So, both Solr and Elasticsearch build on top of Lucene to do the user-friendly and scaling parts. Solr allows to send data in multiple formats, Elasticsearch sticks to JSON. Solr has multiple way to build search queries (url params, xml, json, changed over time), Elasticsearch sticks to JSON. Solr is a bit rigid about how schema (field collection and type definition) is done, Elastisearch is a bit more hands-off. Solr has composable pre-processor chains and explicit field analyzer chains, Elasticsearch focuses more on logs and has external pre-processors and custom scripting language. Solr uses Apache Zookeeper to coordinate distributed state, Elasticsearch rolled their own. Solr has one particular way to split data into replicas and shards and throw them from one server to another, Elasticsearch has another. Solr is perfectly happy to run a single node=server=collection=core. Elasticsearch starts with fully distributed cloud setup.

All of these are trade-offs on top of actual search. That's why most of the consultants work with both Solr and Elasticsearch, the most difficult search optimization concepts are same for both.

Bloomberg, I think, uses Solr directly and builds on top of that. They contributed Machine Learning ranking to Solr, for example. They also use Elasticsearch for log analysis, I believe.

Amazon's needs are different again. Their data comes in through different routes (direct from databases?), their multi-tier replication and sharding strategy have different needs, they don't need user-facing user interface, etc. So, they take Lucene directly, and build the rest on top. And then contribute back to Lucene, which makes it available to both Solr and Elasticsearch. Everybody benefits in the end.

>Many people/companies/websites just need those basic features when search is not a core essential capability

and some companies convince themselves that search isn't essential to them and they end up with awful search not much better than an SQL like query and when they do UX evaluations of how people use their site say look - nobody uses our crappy search let's not put any time into fixing it!

agreed. but lets be honest to operate an elasticsearch for instance. You need a sizeable team. its not a one man job.

You can use embedded Lucene if you want great search but no UI/schema management and you are happy to code the integration. You can start with 3 jars (5Mb) in size and add additional jars as your needs expand (e.g. multilingual processing).

You can use a single node Solr, if you want UI and send data in multiple formats and may eventually need to scale.

You can use Elasticsearch if you know you are starting big and deal with scaling from the start.

You have the choice. It is not just clustered Elasticsearch vs code your own.

well, I'm pretty good at all the non operations aspects of ElasticSearch, so in my experience if you have ES in some sort of managed cloud instance - like searchly, doesn't have to be AWS or elastic - it's a one man job - and also in my experience not necessarily full time devoted to just es.

May vary dependent on if you have a lot of data, but if so you probably have people to handle lots of data problems anyway.

I went down this rabbit hole some ten years ago, and somewhere along the line I discovered SQLite has an FTS engine (up to version 5 now, I think) and just started using that instead. Massive performance for tens of gigs’ worth of text content on a single core.

Neat! One production ready python library I love to use in this space is https://radimrehurek.com/gensim/ It is quite mature and can handle a good amount of data well! It offers topic modeling too and can b helpful to find similar documents also.

As Joel spolsky said: "Just do me a favor and search the damned hard drive, quickly, for the string I typed, using full-text indexes and other technologies that were boring in 1973."

That's a standard feature I use nearly everyday and I'm very glad for in MacOS.

Does Spotlight search even index contents? I didn't think that happened, but maybe I'm wrong?

Excellent read. Was searching for a full text search engine but not finding any suitable one. Plan to implement one just this way.

Author here: you may want to check out something like Whoosh https://whoosh.readthedocs.io/en/latest/intro.html (it's like a clone of Lucene but in pure Python). I've used this to build some basic search for a small Python website and it was more than fast enough for my purposes :-)

SQLite has a pretty good built-in fts engine: https://www.sqlite.org/fts5.html

Problem is FTS5 isn't included in the most default installation through package managers [I use Fedora]. And recompiling from source breaks a lot of things, as sqlite libraries are generally linked with all apps that use it.

I admit I only used sqlite through the go driver (https://github.com/mattn/go-sqlite3) where using fts5 amounts to one flag during the compile phase.

But SQLite's FTS5 has no support for the `offsets` function...

Many years ago I ran into this paper "Self-indexing inverted files for fast text retrieval" http://citeseerx.ist.psu.edu/viewdoc/download?doi=

It's short and to the point. And then I implemented all that ... in PHP and MySQL :)

It feels daunting at first, but once you understand what it wants you to do, it's actually not that hard (for this particular paper, and this particular approach).

However, you do want to employ a stemming library to normalize word forms.

A well worked example using Python and Redis:


Reminds of this David Crawshaw (CTO at Tailscale) presentation on full-text search with SQLite which probably requires 10 lines or less: https://www.youtube-nocookie.com/embed/RqubKSF3wig

It's not relevant or an indicator of power that this example requires ten lines of SQL. This presentation just uses SQLite's builtin full-text search system [1]. Of course it's going to require less code to call a library than to implement it.

[1]: https://sqlite.org/fts5.html

Relevant because "building a full-text search engine" is solved using SQLite just as much as it can be solved by writing Python on top of lxml and py-stemmer.

SQLite is no heavyweight dependency àla Apache Lucene. It also helps that it is bundled in billions of Android and iOS devices.

I'm sure there's some pypi package that has a full search system you can download. For that matter, there's the python sqlite package. It's irrelevant that it's ten lines of library calling code when you're comparing it to a toy implementation.

Apache Lucene can be as small as 5Mb, if you don't load what you don't need.

Yea! FTS5 FTW! Amazing for what it costs.

The article looks suspiciously similar to https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full.... Very similar examples, code and structure.

Does it? Both are implementations and explanations of a well-known algorithm. Most articles on quicksort will also look alike, but there's no reason to assume the author has plagiarized anything.

Compare "def analyze" and "func analyze". It's obvious this article was copying the other.

So? Wikipedia is one of the most convenient, large English corpora available, and I doubt there are many significantly different ways to write the bit of functionality that's built up here. I'm not sure if that's what you're meaning to suggest, or that there was some kind of plagiarism / inspiration going on here.

I think it is plagiarize. Compare "def analyze" and "func analyze". Very similar.

I saw a case once, I forget where, when someone was accused of plagiarising from wikipedia and the truth was that they'd themselves written a large part of the wikipedia article.

Except that one is written in go and the other in python...

Yes, but where's the attribution?

How many different ways do you think there are to write this code?

A billion?

Everything is a paraphrase of the original source - whatever and/or whenever that maybe from.

Including our very own DNA...

True, but it’s also common for people to acknowledge where their DNA came from.

You would likely be called out if you tried to hide that fact in order to seem more accomplished than your are. If for instance your dad is the owner of the company you are working at, you’d be rightfully be called out if you tried to hide the fact that your accomplishment is lessened by the fact that you copied your fathers DNA.

No one is claiming the author did anything wrong in copying the source material for his article. However it is wrong to not provide proper attribution, especially when it’s as easy as an “inspired by link” at the bottom.

Cool article.

I recently build program in Go that takes wikipedia article and gets all dependencies then using tfidf*count ranks concepts in order of "importance". Seems quite good for math articles to get list of more basic concepts to understand first.

Fun times. I once applied pagerank onto a set of 8000 math articles and ran the result as a web app in 2009/10.


As a gimmick, I created 36 groups, with group 1 containing the most important concepts:


Spoiler, the top ten concepts were:

Function · Set · Number · Integer · Real Number · Point · Property · Finite · Ring · Relation Theory

BTW: Glad the archive has a copy, because I do not.

Wow thats such a cool idea. Really nice.

Using nlp techniques to create tokens is for a fast and simple python search often not needed. It makes things slow and python standard string function are often good enough. Issues arise when searching for combinations of words like ‘machine learning’ in a sentence. Nice read, but the GitHub repro needs a license to be more useful.

Added an MIT license. Have fun :-)

Failed for me. I use Chinese and Japanese. There are no spaces to split on. I also search code where this also fails. I know it was meant to be simple and illustrate some points. I think the point it fails at is that this is actually a much harder problem in real life than a simple 150 line solution suggests.

if you use a library for Chinese/Japanese tokenization (which is harder because the lack of space), it seems like the rest of the code would work?

@op great article. I ran your code first without fully reading the article. It took the code almost an hour to parse and index the data using the terminal.

A suggestion, if you want to make the code more friendly. Either:

1. Write a note to README.md that once you exit the program, you loose the indexed data.


2. Save the indices to disk. :)

thanks for the article

  return [token for token in tokens if token]
What the what?

Comprehensions in Python are pretty handy, but they can look weird sometimes. This is filtering out `None`, empty string, etc values from a list.

It is iterating over a list (tokens) and creating a temporary variable (token). It tests the truthyness of it (if token), which means None and '' will return False and thus be excluded, and then returns it (the first token).


    list(filter(None, tokens))

    list(filter(lambda x: bool(x), tokens))

Lol. Compact map.

This doesn't account for synonyms. I'd rather use a document embedding search.

Very nice work! Thank you for sharing.

Does anybody work for Atlassian here? If so, can you please share this with the Confluence team? Thanks...

If you want to feel extra sad about it, see this: https://jira.atlassian.com/browse/CONFSERVER-13499

If a wiki page has the word adrecordset, a search on recordset should include the page. Currently only a search on adrecordset or a search with wildcards returns the page.

Wait..this seems unreasonable?

Isn't this how it works with elasticsearch or solr? or even google for that matter..a search for green won't return evergreen..

They don't support leading wildcards. So foo* will find foobar but *bar won't find foobar. Elastic does fine with that as of 7.9.

Well, in my experience they cannot even find URLs, IP addresses, or wildcarded words like foo*

Do people really find the search in confluence that bad? Would they actually be willing to pay to have it fixed?

I wouldn't pay to fix it, Confluence isn't free. There are a lot of Open Source alternatives that provide such features out of the box. It's the only feature that shouldn't be broken.

I really hate Confluence nowadays (and I really liked in the past, when I was using it randomly): - Counter productive syntax that is not even equal to the one of Jira

- Search functionality that cannot find anything (I spend HOURS trying to find my content, let alone somebody else's). Some days I just wish I had the DB at hand / a folder with the whole pages exported to grep it myself.

- Proprietary file format - good luck migrating away from it

I'm really sad to say these things, I like Atlassian as a company but the more I use Atlassian products, the more I realize they're bloated and have _broken_ features that were implemented quickly just to tick some more boxes on their features sheet.

And the Jira team

JQL really isn't that bad is it?

Not bad - horrible.

What's horrible about it? The syntax is very straightforward

Not around text field searches. Something as simple as searching for exact fragment with punctuation (think function header) often does not work at all.

I worked previously for a very high traffic ecommerce company (Alexa top 300 site).

As part of the search team, I worked on a project where we deliberately rewrote the whole product search engine in Python and Cython, including our own algorithms manipulating documents for deletion, low latency reindexing after edits, and more.

We did this because SOLR was too slow and the process of defining custom sort orders (for example, new sort orders defined by machine learning ranking algorithms, and needing to be A/B tested regularly) was awful and performance in SOLR was poor.

It was a really fun project. One of the slogans for our group at the time was “rewriting search in Python for speed.”

The ultimate system we deployed was insanely fast. I doubt you could have made it faster even writing the entire thing directly in C or C++. It became a reference project in the company to help avoid various flavors of arrogant dismissal of Python as a backend service language for performance critical systems.

Defining custom sort orders in Solr is as simple as uploading a text file with the values you intend to use for ranking. This is a great feature that is in fact missing from Elasticsearch and saves you so much reindexing time.

There certainly are usecases where Lucene based solutions aren't the best fit. But I think the claim that you couldn't make something faster by moving away from Python is outlandish.

> There certainly are usecases where Lucene based solutions aren't the best fit. But I think the claim that you couldn't make something faster by moving away from Python is outlandish.

I read that as a statement that they implemented a proper and bespoke algorithm, not that the speed of Python is greater than C. I am surprised that you read it that way. Who in their right mind would say Python speed is faster than C speed?

You read

>I doubt you could have made it faster even writing the entire thing directly in C or C++.


> a statement that they implemented a proper and bespoke algorithm, not that the speed of Python is greater than C.


Yes. If the implementation language isn't the determining factor for speed, then what is the cause? There is a branch of Computer Science called Algorithmics[0], wherein one expression of measurement is called Big-O notation[1].

[0] https://en.wikipedia.org/wiki/Algorithmics

[1] https://en.wikipedia.org/wiki/Big_O_notation

You seem fundamentally confused, for example with tools like Cython.

Many extension module implementations in Python are literally as fast as pure C (not just nearly as fast with minor extra CPython overhead, but literally as fast as pure C by deliberately bypassing CPython VM loop and data models).

Cython can be as fast as C for simple use cases.

Because you're writing regular Python for a production service though, and not artificially writing optimized examples, then you will occasionally have to pay extra costs.

Are you perhaps a bit too invested in your own narrative?

You are incorrect. This is only true if you can precompute all sort orders (many types of sort orders cannot be precomputed and depend on additional context data only available at query time, especially for personalization or trending solutions). Additionally you must deal with precommitment to very poor sharding properties with Solr. With our Python approach, we could hold all results (billions of content items with trimmed down data structures representing only the thinnest container needed for each sort order representation) in memory easily, and dynamically resort on the fly or double sort by multiple sort orders that each required contextual data only available at query time.

Could i send you a mail somewhere? I'd like to have a casual conversation about e-commerce and some of your insights.

Why would one choose Python for this instead of Go or Rust?

I imagine Python performance would be terrible.

The idea was to illustrate the concepts of an inverted index and tf-idf :-) I've built some stuff like this for a smaller project written in Python because it was Easier™ than spinning up an Elasticsearch cluster (ie it definitely didn't need to scale :-D)

Yes, I've been really interested in FTS recently and love articles like this. I'm currently implementing a paired-down version myself in Elixir because I'm searching one "table" (not postgres) and I don't want to bring in external dependencies for it.

Because it's easier for the average developer to follow Python code? This is a tutorial/demo, not a library.

The article is good and so can be applied to any language, also Python is not that slow.

As a lot of the work is done by code library code, there's a very good chance that the code is not all that slow.

If the dataset is small, speed doesn't matter. Also, there's PyPy if you need speed.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact