Lessons from building GitHub code search [video] (youtube.com)
256 points by costco 9 months ago | 100 comments



They list things like "Fast" and "Regular Expressions", but what about correct results? I often use GitHub web search and it never finds terms which I know for a fact exist. After it fails to return results, I have to instead manually navigate to the file containing the search term and yes it's there.


(I gave this talk and work on code search.)

Sorry you've had a frustrating experience with the product. When we look into complaints about missing results, it's almost always one of two things. First, the repository is not indexed yet, but searching triggers indexing. Future searches work. Second, the files or the repo hit our documented limitations (https://docs.github.com/en/search-github/github-code-search/...). I think these limits are pretty reasonable, but we need to do a better job making it clear why content isn't in the index.


> First, the repository is not indexed yet

I understand this is important. But the issue I have is that it’s hard or maybe impossible to know what’s been indexed and what isn’t.

I run a few orgs with hundreds of repos. Which are indexed? I don’t know.

This makes your search suck for my organization. I understand the reasons. They aren’t reasonable for me. I don’t want to search using your tool if it won’t work for my org.

Code search isn’t just for what’s popular. It needs to be for what is real and accurate.


Yeah, we need to do better on the visibility of this for owners. And we're trying to scale the system so it's not a problem at all. In the meantime, you can ask support to index all the repos in your orgs and we can take care of it.


When a user's search includes a non-indexed repository, GitHub needs to show a warning message along with the search results, something similar to what you just mentioned:

"One or more repositories in this search are not yet indexed; please try your search again later for accurate results."

Likewise for the 2nd case.


For the repositories not in the index, we do show a message to try again later. For documents that have been excluded, we agree we need better visibility.


Do you list the repos not in the index?


What does "Exhaustive search is not supported" mean? (that phrase is from your link, in the bullet-points near the top)


It means that we have a result limit. If a term matches in a lot of files, not all the hits will be shown.


The new UX does a really bad job of helping you realize when you're logged out and search results are being suppressed. It very much implies "zero results" when in fact it means to say "zero results, except in code, where there might be results, but you need to log in first to find out".


Sure but I was talking about incorrect results when logged in, and for public repos.


Weird, I've never found that to be the case. There's a "flash" sort of message right at the top of the search results that says "Sign in to see results in the `x` org"


Yup I was about to write something about how I quite literally clone & grep code bases after GitHub search fails probably half the time.


For repository wide search

   1. Press dot "."
   2. CTRL + SHIFT + F
   3. Enjoy results powered by rg (ripgrep)


I just use Sourcegraph; it's usually faster than cloning.


Sourcegraph is nice but completely overpriced. I have 500 people who want to search, maybe 1000. I’m not paying $100/user/year forever when I can just clone all thousand repos onto a drive and let people grep it.

Grep is not as good, but the cost is like $500/year for infinite users vs $50k/year (500 users at $100 each) with a marginal cost for every new user.

Also, my general comment on pricing applies here. Copilot is $100/year. So I’m supposed to think being able to search my code is the same price as being able to write the code.

Also, they are kind of in trouble because this is exactly what will be replaced once I just fork over $500/user/year for superchatgpt, which will definitely cover code search.


GitHub search used to be amazing, then circa 2019(?) they nerfed it, I think because of security credential leakage.


Nah, it's always been unreliable. Around 2016 a coworker wrote a code indexer for our private repos and we used that to find occurrences of things which were critical.

We've been burnt sooo many times ('are you sure we changed all the occurrences of this in all our 3252 microservices?').

Between shoddy UX (the + being part of the code you were copying!), blatantly missing features for years (ZenHub, late GitHub actions), GitHub really succeeded only because of timing and network effect. The developers' community is quite powerful.

I'm happy for the original team, but now it's just another MS acquisition I hope we'll stop using.


I don't know if it is related to the changes to the code search (I use GitHub exclusively as a guest), but in the last year GitHub got slower and heavier, to the point that even scrolling a page showing a file of 200 lines is very painful. One with more than 2000 lines crashes the tab. I don't have the most powerful machines at my disposal, but what I have should be more than powerful enough to browse a repository. My intuition points at the syntax highlighter and the file browser/go-to-symbol as possible causes. Does anybody know of some magic setting I can use to have a better experience? Right now I am forced to avoid using it as much as possible, and clone the repositories to browse them locally.


There was also a recent rollout of React, and I've seen a number of performance complaints that line up with its timing.

Personally, my Firefox Nightly on Android will regularly crash (along with the whole System UI) when opening / trying to type into a GitHub page


the treadmill of perpetual tech migrations only to find out about the downsides after months of work


Same experience here on Firefox stable on Android. The file editing portion of Github is unusable for files more than a few hundred lines.


The magic setting seems to be don't use Github, unfortunately.

There were a couple GH discussion threads about the dreadful UI rollout, but someone complained one too many times about (code) search returning the wrong results so they've since been locked. For me it's the issue search that seems to consistently return the wrong results, and the text input widgets that are glitchy as hell in Firefox.

Performance-wise it's worth keeping in mind that Github is now rendering all the text client side. If you go through the discussion threads you can see where the GH devs didn't know how to account for multibyte codepoints and how that broke their non-native rendering with things like emoji and Chinese characters.


all these posts about how it's so obvious that search doesn't work, but no examples that demonstrate it from anyone.


You're certainly welcome to go find the discussion threads on Github where people were posting use cases and screenshots. I already found my solution: don't use Github.


Oh yeah, they use react now, and have to use smart tricks so you can actually select/search highlighted code. The code is also probably added as you scroll too, using a virtual list.


GitHub has definitely gotten worse.

Stale data, PRs and such don’t update properly, page loads are glacial, half the time it’s doing some loading bar of its own (no doubt as part of some trying-to-be-clever SPA thing), and it’s awful.


In the actions overview, just the spinner icon (svg w/ css animation) takes 25-40% cpu and 10% gpu in my chrome. Enough to keep my fingers very warm on the laptop


I'm looking into options for syncing issues and PRs for local browsing as well due to the frontend performance regressions in GitHub.

There's a hanging draft for it in Gitea (inb4 yes also Forgejo).

https://github.com/go-gitea/gitea/pull/20311

Maybe this is the push I need to actually start using taskwarrior...

https://github.com/GothenburgBitFactory/bugwarrior


I’ve noticed the code reader is worthless on even slightly out of date browsers now, and even on newer browsers it tends to choke and stutter on large files. Sad :( it used to be the best


Oh hey, that's my talk! Thanks for submitting it. It was a huge honor to present my team's work at Strange Loop.


I’d just like to say: thank you!

GitHub’s previous search was not great, and when the new version launched it was a massive leap forward, where it’s now part of my daily workflow. Before this I thought good search at this scale might just be an intractable problem. :)


Meh. Before the Microsoft acquisition, you could get an API key for any service you wanted just by searching GitHub. I'm not sure how many people knew about it (it was probably a dirty secret), but I used to crawl tons of stuff by rotating API keys found on GitHub. None of that is possible anymore.

On the plus side, I've lost count of how many reports I've sent to companies that leaked not only their usernames and passwords but also the proxies you could use to get inside their networks. The weirdest one was a guy working in security at Thales, which is supposed to handle security-sensitive work for governments, who leaked all that information while working on a poker-related side project during business hours ...


This is definitely still possible. Saw a guy a year or two ago in a web scraping Discord who does this for fun and found all sorts of API keys. I think he found a 2captcha API key for an account with a 5 grand balance by spamming the search API endpoints. I hope he didn’t actually use any of that….

Pretty sure some people also made a fortune in crypto exchange API keys because I’ve seen threads where people advertise services to “cash out” Binance API keys for 5 cents in the dollar. I assume they use the balance in the account to inflate the price of some random coin that the attacker bought just before the attack. Yeah, that’s what this world is coming to.


This was novel 10 years ago. Maybe you were still doing it but hardly worth mentioning.


Do you have any paper/talk that gives more details about the "geometric XOR filter"? If not, is there any plan to publish something?


Yes! My colleague who created it has started working on an open source version so we can publish it. I am not sure when it will be ready, but I'm excited because it is extremely interesting and has a lot of potential use cases.


Amazing, thank you!

PS: where/what can I follow to know when it is published? Could you share your colleague github handle (I guess he'll have one :) )?


We'll definitely blog about it (https://github.blog/). His handle is https://github.com/aneubeck (it's on the thank you slide of my talk btw)


Thank you!


> more details about the "geometric XOR filter"?

I know we don’t typically say this here but, you have a very relevant username for your question xoranth :D


Does anyone know what geometric means in this context?


It refers to the geometric progression of bucket sizes used in the data structure.


Hi Luke,

Could you please give an update on whether or not GitHub is still considering adding “sort by recent” to search?

——

E: I just saw you answered that already. It’s a dearly missed feature.


Please see my other comment about why this is difficult: https://news.ycombinator.com/item?id=38638214


Hey, it would be nice to be able to browse in other branches besides main


Really great summary of a huge work! Thank you.


Here's a detailed text outline with key frames for those who don't have time to watch the 36 minute video:

https://www.videogist.co/videos/lessons-from-building-github...


Nice! Seems like a useful tool for digesting videos.


this is awesome. i think it'd be super cool if it can read/summarize the comments too


Not sure if it can, but Kagi can:

• GitHub's previous code search was slow, limited, and did not support searching Forks due to indexing challenges. A new system called Blackbird was built from scratch to address these issues.

• Indexing code poses unique challenges compared to natural language documents, such as handling file changes in version control systems and deduplicating shared code across repositories.

• The talk discussed techniques used in Blackbird like trigram tokenization, delta compression, caching, and dynamic shard assignment to improve indexing speed and efficiency at scale.

• Architectural decisions like separating indexing from querying and using message queues helped Blackbird scale independently without competing for resources.

• Data structures like geometric XOR filters were developed to efficiently estimate differences between codebases and enable features like delta compression.

• Iteration speed was improved by making the system easier to change through frequent index version increments without migrations.

• Resource usage was optimized through techniques such as document deduplication, caching, and compaction to reduce indexing costs.

• Blackbird's design allowed it to efficiently support over 100 million code repositories while the previous system struggled at millions.

• Building custom solutions from scratch can be worthwhile when leveraging data structure to outperform generic tools for a domain.

• Anticipating and addressing scaling challenges at each magnitude is important to ensure a system remains performant as it grows over time.
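
For anyone unfamiliar with the trigram tokenization mentioned in the summary, here is a minimal sketch of the general technique. It is not Blackbird's implementation; the function names (`trigrams`, `build_index`, `search`) and the sample documents are made up for illustration. The idea is that a substring query can only match documents containing every 3-character slice of the query, so the index narrows the candidates and a final scan confirms them.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all overlapping 3-character substrings."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for gram in trigrams(content):
            index[gram].add(doc_id)
    return index


def search(index, docs, query):
    """Return sorted ids of documents containing query as a substring."""
    grams = trigrams(query)
    if not grams:
        raise ValueError("query must be at least 3 characters")
    # A matching document must contain every trigram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all trigrams is necessary but not sufficient, so verify.
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical corpus, for illustration only.
docs = {
    "a.py": "def get_info(): pass",
    "b.py": "info = get()",
    "c.py": "print('hello')",
}
index = build_index(docs)
print(search(index, docs, "get_info"))
```

Note that `b.py` contains both "get" and "info" but not the full query, which is why the verification step after the trigram intersection matters.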


Those don't look like video comments. Which are: fantastic/well done/wonderful/great talk etc.


Everyone has 36 minutes to watch a video. Just skip binge-watching Netflix.

Poor society we are part of, if everything needs to be consumed in 5 minute chunks.

One should also think and reflect about the content being presented. Grasp the ideas.

It is also about honoring the time the speaker put into the presentation preparing it.


Since we're talking about GH code search, a frequent issue raised on community forums is that some repos stopped being indexed and yield 0 results. (This also happened to repos in my bigcorp.)

Has there been a systemic fix for that issue, other than asking GH support teams to reindex the repo?


I also enjoyed the Tree-sitter talk from 5 years ago by Max Brunsfeld https://www.youtube.com/watch?v=Jes3bD6P0To

I'm currently building a query language whose grammar is very much inspired by GitHub's search syntax. I'm using Lezer, which is a GLR parser generator like Tree-sitter, so this talk learned me some parser generators (I've no formal CS education). Here's my grammar, a playground, and an example search query if anyone wants to play with it

https://github.com/AlexErrant/Pentive/blob/main/app/src/quer...

https://littletools.app/lezer

    -(a) spider-man -a b -c -"(quote\"d) str" OR "l o l" OR  a b c ((a "c") b) tag:what -deck:"x y"

I just converted the syntax tree to an abstract syntax tree so I can do De Morgan's law transformations. Literally the only time in my professional life where I feel like I'm solving a leetcode-style problem.
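
To make the De Morgan step concrete, here is a small sketch of pushing negations down to the leaves of a query AST. The tuple-based node shape (`"term"`, `"not"`, `"and"`, `"or"`) and the `push_not` name are assumptions made for this example; they are not taken from the Pentive grammar or from Lezer.

```python
def push_not(node, negate=False):
    """Rewrite the tree so negations only wrap leaf terms."""
    kind = node[0]
    if kind == "term":
        return ("not", node) if negate else node
    if kind == "not":
        # A nested negation flips the pending one (double negation cancels).
        return push_not(node[1], not negate)
    if kind in ("and", "or"):
        if negate:
            # De Morgan: not (a and b) == (not a) or (not b), and vice versa.
            kind = "or" if kind == "and" else "and"
        return (kind, push_not(node[1], negate), push_not(node[2], negate))
    raise ValueError(f"unknown node kind: {kind!r}")


# NOT (a AND NOT b), roughly what a query like -(a -b) would parse to.
query = ("not", ("and", ("term", "a"), ("not", ("term", "b"))))
print(push_not(query))
```

Once every negation sits directly on a term, the query is easier to translate into index lookups, since each leaf is either "must contain" or "must not contain".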


Tree-sitter is very cool! We use it to power semantic analysis for GitHub code search.


Do they mention why the new search still doesn't provide sorting options? I found them very useful in the previous version.


(I'm the speaker.) This is frequently requested, and we've tried to answer it in GitHub's feedback forums. The reason it's not implemented is that it's quite complicated and there are other things we'd rather work on (sorting by recency was mostly used by scrapers, so it's not high yield for us, unfortunately). It's complicated due to some consequences of things I discuss in the talk.

First of all, the old system didn't really implement sorting for "recently changed" files. It implemented sorting for "recently indexed" documents. This was a decent proxy when we only rebuilt the index every two years. But as I discuss in the talk, we rebuild the index all the time now, so it would be pretty worthless as a sorting option. Another reason is due to the content deduplication I also discussed in the talk. When a blob is shared between repositories, what time do you store? Finally, it's complicated because of how Git works. If we did implement this, we'd like it to reflect when the path changed in the branch, which is what people mean by "when did this file last change". But Git doesn't have an efficient way to retrieve that information, so we'd have to build a system to keep track of it.

In short, it's hard to do right and easy to abuse. :-(
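
The deduplication point can be illustrated with a toy content-addressed index. This is only a sketch of the general idea, not how GitHub stores anything: the `blob_id` and `add_file` helpers, the repository names, and the dates are all hypothetical, and real Git object IDs are computed differently.

```python
import hashlib
from collections import defaultdict


def blob_id(content: bytes) -> str:
    """Content-addressed key: identical bytes always map to the same id."""
    return hashlib.sha256(content).hexdigest()


# One index entry per unique blob, listing every place that blob appears.
index = defaultdict(list)


def add_file(repo, path, content, changed_at):
    index[blob_id(content)].append((repo, path, changed_at))


# Hypothetical repositories and dates, for illustration only.
add_file("alice/lib", "LICENSE", b"MIT License text", "2015-03-01")
add_file("bob/fork", "LICENSE", b"MIT License text", "2023-11-20")
add_file("bob/fork", "main.py", b"print('hi')", "2023-11-21")

license_refs = index[blob_id(b"MIT License text")]
timestamps = sorted(t for _, _, t in license_refs)
print(timestamps)
```

The two LICENSE files collapse into a single indexed document with two different "last changed" dates, so a single sort key per document has no obviously correct value.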


I found sorting by recency to be helpful to see what mistakes users would make. There is not much feedback in open-source until after an api is released, at which point it is usually too late to fix problems. Seeing what users did, what frustrated them, and providing feedback helped make a better library. That's typically what internal teams get to do on a large corpus, like google3, and github is equally excellent for these insights.

An alert on recently indexed content that matches keyword subscriptions, ala google alerts, would be an excellent alternative for that use-case.


Thanks for sharing, that's good feedback.


> not implemented is it's quite complicated and there are other things we'd rather work on

I get this, but why break things if you don’t want to fix them? That’s great that you want to work on other things, but that feature was useful and existed for years and people depended on it. I pay for GitHub and you’re taking away features.

Not you personally, but this attitude is frustrating for me.

As a user, I don’t agree that the new features you implemented are better than the ones you took away.

It’s your choice, of course, but I don’t like this shift in dev mindset where really basic features that have been around since Unix time and are essential to programmers aren’t implemented because they are too hard.

Thanks for engaging in this thread and glad you’re working on this. But hoping since you’re involved in the development that you might be able to shift things a little toward “the good way.”


I hear you, but we didn't break this for no reason. We rewrote our entire code search stack to support features it didn't before, like exact match, regular expressions, indexing forks, navigating to where methods are defined, not timing out constantly, etc. None of that was possible in the old system. Unfortunately, the architecture we chose to make the new features possible made some of the features of the old system difficult. We made the choice to focus on things we believe (more) users get (more) value out of.


Seconded. I liked the previous GitHub code search where I sort by recency to pick up new mentions of keywords. Now the new search is useless to me.


One of the (among many) reasons I stopped using GH code search is that the default ranking algorithm is extremely poor. Search relevance is awful, especially when it comes to surfacing forks.

90+% of the time I'm executing a code search, I'm looking for example uses of a library API in open source code. But most of the time, Github code search just surfaces pages upon pages of the same exact code snippet across the origin repository and hundreds of forks.

For example, one search I executed earlier today: https://github.com/search?q=load%28%22%40rules_foreign_cc%2F...


You might like grep.app. It looks like it filtered out all of the noise in this case:

https://grep.app/search?q=load%28%22%40rules_foreign_cc//for...


I work on this, and I agree we should do more to improve ranking. Exact duplicates are suppressed, but often forks have different versions of some files, so they come up in the results.

If you don't want to see forks, you can exclude them. Here's your same search, but converted to a regex and not including forks. I only get 3 results.

https://github.com/search?q=%2Fload%5C%28%22%40rules_foreign...


Not including forks should be the default IMO. There can be a button in the Advanced section of the sidebar to include them.


Are you at all embarrassed by how buggy and intentionally unusable GitHub has become?


Totally agree both on the intended use case (see some project where some library is being used and how) and result — most often finding bupkis.


Yeah, Debian code search often works better for this purpose.


Do they explain the recent change to require login for code search?


I was about to type “why would the engineers who upgraded search have a say on the login behavior” then I answered it myself by realizing a far more powerful search feature means they spend more compute per query. Thus it actually makes sense they use login requirement to limit the search to users who are adding actual value to their bottom line.

Again you could disagree but it is a decision that can be rooted in sound logic.


The same sound logic could have been used to put the old search behind a login. That also consumed some amount of compute per query, after all. Actually, there's no reason to serve logged-out users anything except a login page. Why spend any resources on users who aren't adding value?

Or, maybe this is all value-destroying nonsense that only makes the service worse.

In any case, the performance of the new search system is supposed to be better, not worse. [1]

[1] https://github.blog/2023-05-08-github-code-search-is-general...


Making something more efficient but also more useful can still make it more expensive because significantly more people will now use it.

In principle they could make it all login only, but hosting open source repositories is one of the main charters of GitHub or at least it was. So it makes sense and is consistent with that vision to keep code browsing free without login. Not to mention the requirement to be available on search engines.

Code search however doesn’t seem like it’s some mandatory part of that mission, I mean all they’re asking is that you login. People in this thread seem to have a chip on their shoulder believing corporations should just give them compute for free. No they don’t!


It's amazing how you can make something explicitly worse, specifically to fatten your bottom line, and have people cheer you for it.

If the goal is openness, i.e. why GitHub has the prestige it does, allying itself with open software development goals, then you should be able to do most things without having to sign up for the service. That includes browsing and searching. Neither of these need your identity.

On the other hand, if GitHub really just wants a walled garden with paying customers (either directly or the money to be made from datamining their identity and activity), it ought to shut itself off completely and get no benefit from being associated with openness.

What they've done is insidious. They're open, they're the trusted custodians of open source, but ah ah not really. For a preem experience you gotta pay up, choob! They let you search for 15 years but now they don't. And here you are cheering them rather than questioning them.

What's next? The certificate transparency project requires sign-up and logins, it's just too expensive to let anonymous users see the transparency logs you see, and by some amazing co-incidence, everyone signed in @google.com sees no results for mis-issued GMail certificates?


> If the goal is openness, i.e. why GitHub has the prestige it does, allying itself with open software development goals

I guess that is why Microsoft acquired them.


The rest of github (viewing code and issues, downloading releases and zips, etc) is easily cacheable, minimizing the cost.


> Why spend any resources on users who aren't adding value?

What makes you think that logged out users aren't adding any value?

The behaviour of github here is consistent with them assuming that logged out users provide a small but non-zero amount of value for them.

The link you provided seems to suggest that the code search is supposed to be faster for users, but I can't see anything in it that suggests that it's less taxing on the backend?


> What makes you think that logged out users aren't adding any value?

I'm curious because I'm not sure of the answer myself: what value do you think anonymous users provide to them?

I suspect gating it behind a login is a simple way to rate limit, and prevent abuse with a cheap and effective way to mitigate false positives. As someone who's had to make similar decisions in the past (albeit smaller scale), I can sympathise with them turning off a computationally expensive feature that's not at the core of their product, but just an add-on.


Anonymous users are potential future customers. Too much friction and they will leave and never convert.

Also network effects of those users since GitHub is not just a git repo host but a social media and community for code.

I think product needs to strike the right balance of free and paid features to maximize their profits.


Right, I kinda see that point, but I'm also not sure how I feel about the code search issue. On one hand, I agree that it makes the user experience nicer and can aid conversion, but on the other it's easy to abuse the feature (in a few ways off the top of my head).

Some of the things you highlighted (i.e. the social network effects) are best experienced when logged in, and other than "just" browsing, there's not much interaction available to logged-out users already (can't host code, open issues, comment, star, etc). I think allowing logged-out users to browse code and its history, alongside other parts of the platform, already feels pretty open to me.


I may not have made it clear enough, but my first line was meant to be absurd and obviously wrong.

I think a service should focus on how it provides value to users, not the other way around. People use it because of what it provides to them, not because they desire to support the company's bottom line.

GitHub built a remarkably valuable product, which attracted 100M users and $1B in revenue. It has tons of public features that never required login. They could have put search, or any number of other things, behind a login at any point in the last 15 years. But they didn't, because that's not actually a good thing to do. Doing that can only ever make the feature less usable.

What drives them to do it now is unclear. Perhaps just wanting to juice their login numbers or whatever. But, performing little bait-and-switches on random features isn't going to make people use them more or spend more money. It's going to make them seek hedges against the destruction of the features they enjoy.

As for my link, the search result speed for users is directly related to cost per query, unless you believe that all of this is fake and the new search system is secretly just a bunch of expensive new hardware.


The cynic in me says it's two main things:

1. Effectively serves as a walled garden for copilot to make sure that competitors can't use their data (your code) in coding assistants, knowledge discovery, etc.

2. Ensure that competitors who are using github are maximally mineable so they can easily discover and implement their solutions to eat their lunch (much like OpenAI's strategy here)

This is covered up by the excuse of performance, but it's fundamentally no different than the loginwalls we saw go up around twitter and reddit

... except this time done by a big industry player who LARPs as part of the open source community but is positioned to cannibalize it from within.


Shouldn't the competitors just clone the repository locally and train their model on it instead of relying on the search API, which probably costs more computational resources?


You could, yes, and probably many are doing this, but you now have to git pull on all of those if you want to, say, know which LLM libraries are currently trending, or how quickly PopularLib v0.2 is being used in codebases related to Y, etc.

IMO It's much less about the legacy code (there exist already terabyte size datasets that take in a lot of things on github) and MUCH more about how up-to-date your LLM/AI is with new repos, "best practices" (or most common practices), etc.

Plus you often get "LLM Code poisoning" from older training data as it attempts to use functionality which has experienced breaking changes versus the current stable release. Current is King.

Also there's the whole goldmine of github discussions, issues, etc that a repo just.. doesn't have.

Right now you can still easily index those (though iirc they sometimes ban datacenter IPs), but they may also fall victim to the loginwall.


Very likely because public search APIs are a huge target for bot activity.

For reference, about 99.5% of the search queries that go to my search engine are demonstrably bot traffic (excluding the traffic that goes to the free and public API endpoint I offer).

A capable search functionality isn't exactly cheap to run, so requiring a login is the obvious way to be able to rate limit searches, which is otherwise impossible against a botnet.
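
As a rough illustration of why a login makes rate limiting tractable, here is a per-account token bucket. This is a generic sketch, not a description of what GitHub or any particular search engine does; the class, the `allow_search` helper, and the limits (3 searches burst, 1 per second refill) are invented for the example.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = now - self.updated
        # Top up for the time that passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = {}  # one bucket per authenticated account


def allow_search(user_id, now=None):
    """Return True if this account may run a search right now."""
    if user_id is None:
        # Anonymous request: there is no stable identity to meter.
        return False
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(capacity=3, refill_per_sec=1.0, now=now)
    return buckets[user_id].allow(now)
```

The key is the account id: a botnet can rotate IP addresses cheaply, but each account has its own bucket, and creating accounts in bulk is comparatively expensive and easier to detect.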


Gotta pump those MAUs.


It's also a perf / DDoS thing. If anyone from anywhere can run expensive searches, it could degrade performance. A little login won't hurt.


Now I just clone the repo and run grep myself. Anything I can do to help lighten the load.


Same, but silver searcher or fuzzy search ftw :)


FWIW, github.dev has a search feature that automates this flow using the fastest available subroutines for each step, all you do is a repo full text search with the same UI as normal VS Code. In my testing it searches about as fast as AG, and much faster than Grep. RipGrep beats it in a direct search time comparison, but the overhead of cloning the files is dominant in large repos. GitHub.dev searches via a .tar.gz archive stored in IndexedDB, which is much faster than cloning even at depth 1 and building up the files on disk (~50% speedup on my testing in very large repos, ~90k files).
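
The archive-based approach described above can be approximated with the standard library: stream a .tar.gz through memory and scan each file without ever writing a working tree to disk. This sketch says nothing about how github.dev is actually implemented; `make_archive`, `search_archive`, and the sample files are assumptions for illustration, with the in-memory archive standing in for a downloaded repository tarball.

```python
import io
import tarfile


def make_archive(files):
    """Build an in-memory .tar.gz from a {path: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def search_archive(archive_bytes, needle):
    """Yield (path, line_number, line) for every line containing needle."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            text = tar.extractfile(member).read().decode("utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    yield member.name, lineno, line


# Hypothetical repository contents, for illustration only.
archive = make_archive({
    "repo/a.py": b"import os\nprint(os.getcwd())\n",
    "repo/b.txt": b"nothing here\n",
})
hits = list(search_archive(archive, "getcwd"))
print(hits)
```

Skipping the checkout avoids creating one file per entry on disk, which is plausibly where much of the time goes for repositories with tens of thousands of files.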


Thanks for the pointer.


Bingo. Some product manager is probably getting a fat bonus for that.



I wonder if he got a job at Github or not

https://news.ycombinator.com/item?id=22396824


One of my favourite tools, I use it almost daily at org level to find related code.


Excellent talk! Sad seeing Strange Loop go…


Does anyone know of a good (preferably US based) replacement? It had a unique vibe.


It boggles my mind that GitHub search is case-insensitive and doesn't match whole words. For the search term getInfo it will return: GetInfoCell

Don't get me started about how it's harder than it needs to be to limit the search to your own repos
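
For comparison, this is what case-sensitive whole-word matching looks like with a plain regular expression. It is a local sketch using Python's `re` module, with a made-up `whole_word_matches` helper and sample string; the thread does not confirm whether GitHub's regex mode accepts this exact pattern.

```python
import re


def whole_word_matches(term, text):
    """Case-sensitive matches of term bounded by non-word characters."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return pattern.findall(text)


# Hypothetical source line, for illustration only.
code = "cell = GetInfoCell(); info = getInfo(); x = getInfoCell"

strict = whole_word_matches("getInfo", code)
# Case-insensitive substring search, similar to the behaviour complained about.
loose = re.findall("getinfo", code, flags=re.IGNORECASE)
print(len(strict), len(loose))
```

The `\b` anchors reject `getInfoCell` because the character after the term is still part of an identifier, and leaving out `re.IGNORECASE` rejects `GetInfoCell`.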


Use sourcegraph to search instead. It doesn't even require you to be logged in.



