Show HN: Search 1M Linux kernel commit messages

jabo · on May 13, 2021

Hey everyone,

I'm Jason and I work on Typesense (an open-source Algolia/Elasticsearch alternative): https://typesense.org.

I recently learnt that the Linux Kernel I've used for the past 20 years crossed 1 Million commits and also turned 30 years old in 2021: https://news.ycombinator.com/item?id=27014732

I've been on the lookout for interesting datasets to build search interfaces with, and with the 1M Linux commits milestone I figured it would be fun to index all 1 million Linux commit messages in Typesense and build an instant-search and browsing interface on top of them, especially to find interesting tidbits in the commit messages.

This site is the result of that effort, and a tribute (if I may) to all the hard work that has gone into a piece of software that powers a good chunk of the modern world and has personally been a bedrock for my journey into programming. Thank you Linux Kernel Team!

If you find any interesting commit messages & search terms, let me know and I can add them to the highlights on the site.

dzonga · on May 13, 2021

Btw, does Typesense work on attachments, e.g. PDFs or Office docs?

jabo · on May 13, 2021

Typesense operates on JSON documents. So you'd have to first extract the data from PDFs / office docs into JSON and then index it.

There are libraries that let you do the conversion:

https://github.com/modesty/pdf2json

https://github.com/microsoft/Simplify-Docx
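As a rough illustration of the extract-then-index flow described above, here is a minimal Python sketch that turns already-extracted page text into JSON Lines, one JSON document per page. The function name `pages_to_jsonl`, the field names (`id`, `title`, `page`, `content`) and the id scheme are assumptions made for this example, not a schema Typesense requires; the text extraction itself is assumed to have been done by a separate tool.

```python
import json


def pages_to_jsonl(doc_id, title, pages):
    """Turn extracted page texts into JSON Lines, one JSON document per page.

    All field names here are hypothetical; adapt them to your own schema.
    """
    lines = []
    for number, text in enumerate(pages, start=1):
        record = {
            "id": f"{doc_id}-{number}",  # hypothetical id scheme: <doc>-<page>
            "title": title,
            "page": number,
            # Collapse the stray whitespace that text extraction often leaves.
            "content": " ".join(text.split()),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

Splitting by page keeps each document small, so a search hit can point at a specific page rather than at the whole file.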

java-man · on May 13, 2021

Quite fast.

Is there a document/whitepaper which describes how it works?

karterk · on May 13, 2021

We don't have a proper design document (but I certainly think we should have one). I will try to offer a high level summary:

At the heart of Typesense is a `token => documents` inverted index backed by an Adaptive Radix Tree (https://db.in.tum.de/~leis/papers/ART.pdf), which is a memory-efficient implementation of the Trie data structure. ART allows us to do fast fuzzy searches on a query.

All indices are stored in-memory, while the documents are stored on disk on RocksDB. All underlying data structures were carefully designed, benchmarked and optimized to exploit cache locality and utilize all cores efficiently.
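To make the `token => documents` idea concrete, here is a small Python sketch. It is not Typesense's implementation: it swaps the Adaptive Radix Tree for a plain dict-based trie, uses an in-memory dict as a stand-in for the on-disk document store, and all class and method names are made up for the example. It shows why a trie suits both prefix lookups and typo-tolerant lookups: the fuzzy search carries one Levenshtein row per trie node and abandons a branch as soon as no extension can stay within the edit budget.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.doc_ids = set()  # non-empty only where a complete token ends


class InvertedIndex:
    """Toy token => documents index on a plain trie (not an ART)."""

    def __init__(self):
        self.root = TrieNode()
        self.store = {}  # stand-in for the on-disk document store

    def add(self, doc_id, text):
        self.store[doc_id] = text
        for token in text.lower().split():
            node = self.root
            for ch in token:
                node = node.children.setdefault(ch, TrieNode())
            node.doc_ids.add(doc_id)

    def prefix_search(self, prefix):
        """Return ids of documents containing a token that starts with prefix."""
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return set()
        found, stack = set(), [node]
        while stack:
            current = stack.pop()
            found |= current.doc_ids
            stack.extend(current.children.values())
        return found

    def fuzzy_search(self, word, max_edits=1):
        """Return ids of documents containing a token within max_edits of word."""
        word = word.lower()
        found = set()
        first_row = list(range(len(word) + 1))
        stack = [(child, ch, first_row) for ch, child in self.root.children.items()]
        while stack:
            node, ch, prev = stack.pop()
            # Extend the Levenshtein table by one row for this trie character.
            row = [prev[0] + 1]
            for i in range(1, len(word) + 1):
                cost = 0 if word[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1, prev[i] + 1, prev[i - 1] + cost))
            if row[-1] <= max_edits:
                found |= node.doc_ids
            # Prune: descend only if some alignment is still within budget.
            if min(row) <= max_edits:
                stack.extend((child, c, row) for c, child in node.children.items())
        return found
```

For example, after indexing "fix scheduler deadlock" as document 1, `fuzzy_search("schedular")` still finds document 1, because "scheduler" is one substitution away from the misspelled query.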

hpeinar · on May 13, 2021

Is there an estimate of how much memory this site needs to hold the full index in memory?

I gotta say, I've seen at least one other Typesense post here on HN at some point, and I can't really comprehend HOW FAST this actually is, especially considering how much more bloated and slower the general web has become in the past years.

I don't really have anything to search for on the given site, but I just played around with it to enjoy the speed.

jabo · on May 13, 2021

Ha! Once you're used to instant search-as-you-type experiences, it's pretty addictive. From a UX perspective, it actually helps increase engagement as people tend to search for more queries, since there's no cognitive overhead in typing something and waiting for a result.

The commits data is ~950MB on disk, with ~1 million records. It takes up about 3GB in RAM when indexed in Typesense.
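For a back-of-the-envelope view of those numbers, the short Python calculation below works out the per-record footprint. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and treats the approximate figures quoted above as exact, so the results are rough estimates only.

```python
def per_record_bytes(total_bytes, records):
    """Average number of bytes per record."""
    return total_bytes / records


records = 10**6           # ~1 million commits
disk_bytes = 950 * 10**6  # ~950MB on disk (decimal units assumed)
ram_bytes = 3 * 10**9     # ~3GB in RAM once indexed

disk_per_record = per_record_bytes(disk_bytes, records)  # 950.0 bytes
ram_per_record = per_record_bytes(ram_bytes, records)    # 3000.0 bytes
ram_to_disk_ratio = ram_bytes / disk_bytes               # roughly 3.16
```

So each commit averages a bit under 1KB of raw data and about 3KB of RAM once indexed, roughly a 3x expansion.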

quickthrower2 · on May 14, 2021

Cool project. My second search stumbled upon this gem. Not sure how to link so pasting here:

>>>>>>>>

Authored by Rusty Russell Committed by Jessica Yu

ed875ea1fcc6c34ea232610c3041d0978e327bbe

Tue Aug 15 2017 It's been 20 years since I became a kernel maintainer, so despite how much I'm loving my new career, this patch elicits deep feelings[0].

I went to 1997 USENIX, my first conference. I remember[1] standing around with Alan Cox, Linus, Ted Ts'o and David Miller as they wrote the code for the BKL on a napkin. I listened in awe as this homeless-looking guy described porting Linux to the Ultrasparc, and then described how he then proceeded to beat Solaris on every single lmbench microbenchmark.[2]

A lot of it I didn't understand, but I got home knowing that I had to work with this random bunch of hackers. I had some firewalling hacks which I turned into ipchains, and sent it to DaveM with a config option to switch between the old ipfwadm code and my new code. He liked it so much he replaced ipfwadm entirely, and I woke up one day as kernel firewall maintainer[3].

I found someone to fund my work the next year, and suddenly I was doing my dream job full time. I flew myself around Australia visiting every LUG to convince them to come to the first Australian Linux conference. And of course, DaveM was top of my list for speakers.

There was so much work to do on the kernel; everywhere you'd look there was code which could be simplified, improved. I read the module code and was so horrified at its complexity that I rewrote it, not realizing how epic that would be. Of course I broke lots of things; halfway through the patch series I broke SCSI, so Linus applied up to that point and we had half a module subsystem for a while; I was literally in the airport in Tokyo on my way to Spain when he applied it, too. Every arch maintainer woke up to find they had to implement a whack of complex relocation code, and I got a lot of grumbling.[5]

But one person disagreed with my approach so much and so continuously that I developed a dread of reading my mail every morning: eventually I wrote a filter to send their mail to a separate mbox, which I've still never read and don't intend to.

But mainly, it was a huge amount of fun. I got to hack, and geek out with hackers all around the world. When I flew into San Jose for the first time, DaveM offered to pick me up: turns out he had a two seater so I rode squashed under the rear glass on the overside parcel shelf to see the sights (Sun campus, Berkeley). Back home, I moved to Canberra to join the legendary group of hackers at OzLabs.

The mailing list changed: I gradually learned not to be an asshole (unless, y'know, it was really funny, and eventually not even then). Most of my peers trended the same way. The kernel itself became more formal, more complex, and giant overarching changes became far, far fewer. There are still horrible APIs (the return value of copy_to/from_user, using the same type for list heads and elements, to name two[7]), but the modern calculus of disruptive changes means sometimes we simply step over the broken paving stones instead of repairing them.

I built a team around netfilter, then handed maintenance off to Harald Welte and ceased contributing: I wanted him to own it entirely. I was more nervous handing module maintenance over to someone I've never even met or spoken to, but it's clear now that with Jessica Yu I have scored 2 for 2. I'm as proud of choosing them as of any individual piece of kernel code[8].

To my fellow maintainers: stay harsh on code and don't be afraid to say "No" or "Why?"; there really are more bad ideas than good ones, and complexity is such a bright candle for us hacker-moths. But be gentle, kind and forgiving of your peers: respect from people you respect is really the only reward that sticks[9].

Farewell all, and I look forward to crossing your paths again! Rusty.

[0] Which means I'm now going maudle for NINE paragraphs! And no TLDR, bwahaha!

[1] OK, I remember this. Reality may differ.

[2] There's no recording of this talk, but it was the best technical talk anyone has ever given on anything[1].

[3] On the internet, nobody knows you barely passed Computer Networking![4]

[4] OTOH I topped COBOL/Database programming, so I have no idea what happened.

[5] Except DaveM. I'd written test reloc code for sparc/spac64, but he didn't know that so he just cheerfully reimplemented it.[6]

[6] Those reading this post closely may suspect that I have a massive hackercrush on David S. Miller. Those reading the code closely, of course, already feel that way themselves.

[7] But set_bit finally takes a long! Seriously...

[8] Though the ARRAY_SIZE macro and the poetry in lguest are a close second.

[9] Actually, bitcoin is a nice reward too; it's like crystalized machine sweat!

Signed-off-by: Rusty Russell

Signed-off-by: Jessica Yu