I use their new code search a lot to grok how people use certain features, or implement certain things. But I do wish there was a way to filter out forks. Sometimes I search a string and just get a bunch of forks all with the same result. For example, searching a common class in a Rails app often just shows a bunch of rails/rails forks, which is a lot of noise to sift through when you're trying to see how devs commonly use a certain feature.
Thanks for the feedback! That's coming; we've been prioritizing scaling the index and ingest process and haven't had a chance to add that yet. There are a bunch of value-add features like this I am looking forward to knocking out soon.
Out of interest, if I have a repo with many millions of files that compress quite nicely down to about a 1.4gb packfile, is it better for the ingestion and/or indexer if I break this down into many smaller pushes or one large push?
Because I pushed such a repo yesterday and it’s still not been indexed.
Sorry, there are limits on the number of files in a repository that we'll index. That is probably why your repository wasn't indexed. Out of curiosity, what are you storing in this repository?
Sequences of code from Python package releases from pypi, as an experiment I’m working on. They compress quite nicely as the deltas between releases are fairly small.
I thought it would be nice to be able to search through them, but I guess the file limit was reached. That’s a shame.
It's not - it's something that builds upon that. It's not really ready for prime time yet, but it will be here: https://github.com/orf/pypi-code-repo
if it works, which I really hope it does, we should have a repo containing every file (sub-25mb) published to pypi.
it's not really useful to clone this, but the git packfile + index seems to be pretty ideal for storing + querying this data, reducing 14tb of compressed packages down to like 100gb or so.
this should enable large scale exploration of the contents of pypi from only your laptop, which is useful if you want to look at how the python language is evolving (how many packages use f-strings over time?).
Is it possible for you to clarify the bigram weight function? I'm assuming it's inverse frequency, but that would be a great tidbit to better understand why the covering ngrams work.
As an open source library maintainer I've been using it for that too - it's fantastic for answering questions like "is anyone using this API method I wrote?" and "how much of a mess would it likely cause people if I deprecated this function?"
I run into this a lot where I'd like to see some real world use cases of a method in FooClass, but get a hundred pages of forks of the FooClass.h header file. In some cases I've been successful adding "-filename:FooClass.h" to filter out that header file. If the method is used a lot in the primary project it can be a game of whack-a-mole but it often eventually works.
Perhaps it could benefit from something like a "dissimilarity" filter, which ranks the current result set by returning the most unique hits first. You wouldn't always want this, because sometimes you're searching for how something is typically used, and with many duplicative results you can confirm that's the preferred pattern. But other times you're looking for more esoteric usage of a certain function, and it would be nice to filter the "standard usage" from the results (although you can already do this with carefully chosen negated keywords).
Personally, I'm happy with the new code search so far. I stopped using Sourcegraph because I could never get the deeper results I wanted - it would just return the top five repositories including some common code snippet and I couldn't explore further than that. GitHub Code Search doesn't seem to have this problem to such a degree, since I can use negation more naturally, and since my query is not limited to some shallow subset of the corpus before refining it.
Re: Sourcegraph, we're working on improving that, and sorry you couldn't get the results you wanted. We primarily build for the code within customers, where this particular problem is less common than across all open-source repositories. But we want it to work really well in every case.
Sure! I will try it next time I'm using code search. I can't remember the specific queries I've tried, but I only ever used it when looking for "how have people used this function," so I remember the limited depth of results being particularly annoying.
Thanks for the link to your blog post. I didn't realize Steve Yegge had joined your team - congratulations, that's quite an endorsement!
I've always liked Sourcegraph just because our company names are so similar (founder of Splitgraph here)... we might have even gotten a few candidates because of that initial confusion. :)
How has GitHub Code Search impacted your product direction? Do you see it as an opportunity to focus more on the internal use case, or do you have plans for some other differentiation? It's always unfortunate when a big company introduces a product so similar to the core product of a startup, but I'm sure there is a silver lining there, especially when you have a talented team and a mature codebase (for example, Fly.io has been able to carve out a niche for itself despite Cloudflare moving to compete in the same areas). Either way, best of luck to you from a fellow S-grapher!
Similar to this, there are some generated files it would be nice to rank very low, or even completely ignore, as they probably weren't intended to be committed in the first place. Some search results are dominated by files in paths like ".npm/_cacache/...", ".apm/...", ".svn/...".
As a C++ developer, I maybe-ironically have this same complaint with JavaScript projects ;P. I particularly hate it when people embed minified versions of libraries, as line-based search and display will then take a hit for essentially every query on the one giant line that is tens of kilocharacters long.
If your language lacks a de facto way to install dependencies, people will commit vendored code. See: C, C++.
If your language's package manager puts your deps within the repo directory by default, people will commit the vendored code. See: Node, Go (since go.mod).
As long as we can all agree that committing compiled code is a crime punishable by 24 hours in the shame cube.
I have this issue with GitHub code search very often. I'll search up a function to see some examples of how to use it, and the first 20 or so pages of results are just copies of the source where it's defined.
I’m glad to see some folks in the comments section here using this. My first thought on seeing the headline was, “why not just use grep?”, which the article addressed. My second was, “who on earth wants to search all of GitHub?” I’m happy someone does.
But also, you know what you’re looking for and you can filter the results.
Someone less experienced than you may take any of those results as gospel. Gospel the same way StackOverflow could be, but this time from a source that will do its best to tell you what you want to hear.
My personal rule about AI is about the same as the lawyer's rule: never ask a question you don’t know the answer to.
> Just use grep?
First though, let’s explore the brute force approach to the problem. We get this question a lot: “Why don’t you just use grep?” To answer that, let’s do a little napkin math using ripgrep on that 115 TB of content. On a machine with an eight core Intel CPU, ripgrep can run an exhaustive regular expression query on a 13 GB file cached in memory in 2.769 seconds, or about 0.6 GB/sec/core.
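The arithmetic in that quote is easy to reproduce. A minimal sketch using only the figures stated above (13 GB in 2.769 seconds on eight cores, 115 TB of content):

```python
# Napkin math from the quoted blog post: how long would an exhaustive
# ripgrep-style scan of the whole corpus take at the measured throughput?

CORPUS_TB = 115           # total content, per the blog post
MEASURED_GB = 13          # benchmark file size
MEASURED_SECONDS = 2.769  # exhaustive regex query time on that file
MEASURED_CORES = 8

# ~0.59 GB/sec/core, matching the post's "about 0.6 GB/sec/core"
throughput_per_core = MEASURED_GB / MEASURED_SECONDS / MEASURED_CORES

def scan_hours(corpus_tb: float, cores: int) -> float:
    """Hours to brute-force scan the corpus with the given core count."""
    corpus_gb = corpus_tb * 1000
    return corpus_gb / (throughput_per_core * cores) / 3600

print(f"{throughput_per_core:.2f} GB/sec/core")
print(f"8 cores:    {scan_hours(CORPUS_TB, 8):.1f} hours per query")
print(f"1024 cores: {scan_hours(CORPUS_TB, 1024) * 60:.1f} minutes per query")
```

Even at a thousand cores per query it's minutes of compute for a single search, which is the post's argument for building an index instead of brute-forcing.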
But you don't NEED to do this, do you? I'm ALREADY in a repository. I just don't want to check out, say, all of WebKit; I just need to find where a specific reference is defined.
Maybe, on a really serious day, I need to search an entire organization. But hardly ever.
I have never, in over a decade, wanted sophisticated symbolic searching from GitHub code search; I just need remote grep.
Why isn't code search bifurcated into this 99% use case, and then the occasional global repository search, which could behave entirely differently?
If you check out our prior blog post, "A brief history of code search at GitHub" (https://github.blog/2021-12-15-a-brief-history-of-code-searc...), you can learn a bit about the evolution of this feature. And, in fact, we used to use git grep to search repositories.
This doesn't work well at GitHub's scale. We have 100M users and over 200M repositories in a multi-tenant environment. Your git grep is going to be competing for resources with other users' pushes and clones.
From a resource utilization perspective, is it easier on your servers if I just clone the repo and do the grep myself? Because when the search doesn’t show me what I’m looking for on the first page, this is exactly what I do. (And I can never remember if it’s --shallow or --depth=1, so they’re not shallow clones, either.)
I use GitHub-wide searches all the time to see how people are using certain APIs, to find libraries used in some blob from the strings I find there, to find people working with the same data I'm about to attempt to work with, and the list goes on.
What you use github search for doesn't require all this engineering, but what I use it for does. Why wouldn't they build something that satisfies both our necessities well?
Same. I'll go even further and say that the narrower the search the less useful it is, as if I want to search a single repository I can and will just download it as local grep is so much better than remote grep it is painful. They could make a really amazing remote search for a repository--with a find-as-you-type low-latency display that supported language-specific syntax tree queries--but it would require way too many resources and so they are never going to bother. The result: if I am searching on GitHub at all, it is probably a full-site query, and I absolutely do use full-organization searches to figure out what repository to download as so many people now do extremely fine-grained linked repositories (though, in my experience, GitHub's search for this purpose sucks... maybe it is better now).
Same here. I find doing a quick org-wide code search for a specific call a great starting point for impact analysis for API changes. Using it multiple times a week!
Actually I've been using https://grep.app for ages and while I agree on GitHub I basically only search the repo I'm in, that's mainly because Github's existing search sucks.
On grep.app I regularly search all repos. It's very useful for finding out how to use APIs or where APIs from dependencies are defined.
So I suspect you don't want it because subconsciously you know that Github's "search all" feature won't return you useful results.
Hell they still don't provide a way to filter out test directories which makes the code search inside a single repo useless a lot of the time.
> I'm ALREADY in a repository, I just don't want to check out, say all of WebKit, I just need to find where a specific reference is defined.
If you're in some repository that uses a webkit api, and you want to know how that api is defined, how do you do that without global cross references or a global lookup?
Even for local lookups, indices are useful (as any ctags user will tell you!), but for any kind of cross-repo xrefs they're necessary.
That seems very slow by today's standards. There's a rather... eccentric guy who easily beat that almost 10 years ago with his implementations of string search: https://news.ycombinator.com/item?id=6954298
The difference may be regex matching. This can often be optimized to an impressive degree, depending on the regex, but unless it's a simple substring search without any metacharacters, I'm not sure those approaches are comparable.
Have you tried using github.dev for your remote grep? Press . in a code repo (while logged in) and it loads a remote VS Code for you in that repo. I don't entirely understand the complex interplay between how github.dev uses both the GitHub search indexes and in-browser in-memory grep, but being as it tries its best to reproduce the local VS Code ripgrep experience, it may still be more optimized for your use cases.
My beef with GitHub's code search is that it doesn't distinguish between the definition of a symbol and the uses of the symbol, so you need to wade through 5 pages of results to get the one result you're looking for. I would contrast that to my IDE which usually scores a direct hit if I enter a search in the right box.
The indexing they talk about in that article seems like rearranging the deck chairs on the Titanic so far as that is concerned.
The new code search supports regular expressions so this is pretty easy to do yourself. If you’re looking for where a go method is defined:
/func.+MyFunc/
vs a search for where it is used:
/\.MyFunc/
I agree that it would be nice to have more language aware IDE-style features but I’m just happy to have regex support which is almost always powerful enough to express what I am looking for.
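Those two patterns can be sanity-checked locally with any regex engine. A quick sketch (the Go-style lines and the `MyFunc` name are purely illustrative):

```python
import re

# Hypothetical Go source lines: one definition site, two call sites.
lines = [
    "func (s *Server) MyFunc(ctx context.Context) error {",
    "\terr := s.MyFunc(ctx)",
    "\tclient.MyFunc(ctx)",
]

definition = re.compile(r"func.+MyFunc")  # where the method is defined
usage = re.compile(r"\.MyFunc")           # where it is called

defs = [l for l in lines if definition.search(l)]
uses = [l for l in lines if usage.search(l)]

print(len(defs))  # 1 -- only the definition line
print(len(uses))  # 2 -- only the call sites
```

The two patterns partition the results cleanly here because a definition has `func` before the name while a call site has `.` before it; more exotic code (function values, embedding) would need more careful patterns.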
I'm not sure how to read the post and come away thinking any of this is "super trivial", but I'd imagine if it were that easy, they would have done it already.
The fact that most IDEs implement it doesn’t mean it’s trivial. I believe most IDEs leverage “language servers” written ~once per language and used by many IDEs.
It’s possible GitHub could also leverage those language servers (which would be super cool) but doing it at GitHub scale is certainly not trivial.
Poor man’s version using the new GitHub search would be to construct a regex that matches one but not the other.
I think running a language server over a repo would be a reasonable addition to the current index. It should also be pretty small compared to the current index, because the number of ngrams far exceeds the number of symbols/items.
GitHub hosts 28 million public repositories. Do you think your IDE can open 28 million projects at once and search there "trivially" without hanging? Unless you're talking about searching inside a single repository?
I am talking about searching in a single repository, who would expect to get useful results otherwise? I have no idea how you’re going to rank 28 million repos in a way that matches my perception of relevance.
To be specific, I was looking for the definition of one method in Highcharts so I could understand what it does and override it, and GitHub gave 6 pages of results. I was able to find the function immediately in my IDE once I checked out the 100mb+ repository and it indexed it. If I’d been able to do the same with the search on GitHub it would have saved me considerable time and hassle.
This search could be implemented by something that compiles and indexes like the IDE does (Sourcegraph), or maybe some kind of shallower parsing. Highcharts is in TypeScript, which I don’t know well, but in JavaScript the latter might be a little tricky because there are so many ways to define a function (one hell of a regex). I’d contrast that with Java, where it would be very easy to write a rule that would turn up a class definition if not a method definition; in my case, finding the class would have solved my problem.
Sourcegraph engineer here. I'd be interested to know what you were searching for and what your expectation for top result was. We already do things like boost class name matches higher than functions (GitHub's new search does the same) amongst other possible signals.
This is exciting! I see a lot of familiar pieces here that propagated from Google's Code Search, and I know a few people from Code Search went to GitHub, probably specifically to work on this. I always wondered why GitHub didn't invest in decent code search features, but I'm happy it is finally getting to the state of the art one step at a time. Some of the folks I know who went to GitHub to work on this are just incredible, and I have no doubt GitHub's code search will be amazing.
I also worked on something similar to the search engine that is described here for the purposes of making auto-complete fast for C++ in Clangd. That was my intern project back in 2018 and it was very successful in reducing the delays and latencies in the auto-complete pipeline. That project was a lot of fun and was also based on Russ Cox's original Google Code Search trigram index. My implementation of the index is still largely untouched and is a hot path of Clangd. I made a huge effort to document it as much as I can and the code is, I believe, very readable (although I'm obviously very biased because I spent a lot of time with it).
I also wrote a... very long design document about how exactly this works, so if you're interested in understanding the internals of a code search engine, you can check it out:
As a comparison to Sourcegraph: Sourcegraph shards and indexes a repository at a time, and uses trigrams and bloom filters (to skip shards).
Github shards and indexes individual files according to their hashes. It also uses variable length ngrams (neat!). This makes horizontal scaling simpler, but also means more of the index needs to be scanned for org/repo-scoped queries ("Due to our sharding strategy, a query request must be sent to each shard in the cluster.").
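That sharding strategy can be sketched in a few lines (a toy illustration; the hash function and shard count here are assumptions, not GitHub's actual scheme):

```python
import hashlib

def shard_for(content: bytes, num_shards: int = 64) -> int:
    """Route a file to a shard by its content hash (toy version of
    content-addressed sharding)."""
    digest = hashlib.sha1(content).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Identical files land on the same shard regardless of which repo they
# come from, so each unique blob only needs to be indexed once -- that's
# the deduplication win for a corpus full of forks.
a = shard_for(b"package main\n")   # blob in repo X
b = shard_for(b"package main\n")   # identical blob in a fork
assert a == b
```

The trade-off noted in the parent comment follows directly: since shard placement depends on content rather than repo, a repo-scoped query can't be routed to a subset of shards and must fan out to all of them.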
The GitHub code search looks like an impressive piece of work (congrats!).
That said, I'm curious about the nuances regarding corpus size. Their blog post claims they have 115 TB of source code, but that a positional index is "too expensive". A positional index is a ~3.5x blow-up, which is ~500 TB of data. A 1 TB SSD retails for $50, so that's $25,000 for storing a positional index; 500 TB of GCP local SSD also runs about $25k/year. Even if you factor in replication/redundancy, the resource cost is far less than hiring a software engineer. I guess they think machines with local SSD are too much overhead to manage?
The sparse grams solution to deal with stupidly common ngrams such as "for" or "tes" is very interesting.
I’d love to see more discussion on how they are dealing with the false positives though. It looks like a positional index is being used to achieve this, but that usually blows out your index size.
Additional information about deduplication would be especially interesting to me as well. They seem to solve this quite well. I usually try a search for jQuery to test this, and it does not return multiple copies of different versions of it, which is a good indicator that it's slightly fuzzy.
What I find really interesting about all the code search engines I know of is that each one implemented its own index. Nobody is using off the shelf software for this. I suspect that might be down to no off the shelf software providing a decent enough solution, and none providing a solution that scales. At least none that scales with decent costs.
I did a small comparison of GitHub code search a while ago https://twitter.com/boyter/status/1480667185475244036?s=61&t... But I should note a lot has improved since then, and it looks like sourcegraph now also does default AND of terms rather than exact match, so my complaints there are resolved.
Impressive work by GitHub. I am sure some of the people behind it will read this comment, let me say well done to you all. I am very impressed. Also please post more information like this. There is so little out there.
Thanks! I enjoyed reading your blog posts about building your code search engine. One minor point of clarification, we do not use a positional ngram index, which as you note blows up the index size. Instead, we use the covering sparse ngrams to produce candidate documents and then search the content.
An early version of Blackbird experimented with trigrams plus a bitmask of the next character, but it didn't work well because it wasn't selective enough. This is mentioned in the blog post:
We tried a number of strategies to fix this like adding follow masks, which use bitmasks for the character following the trigram (basically halfway to quad grams), but they saturate too quickly to be useful.
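The follow-mask idea in that quote can be made concrete. A toy sketch (bucketing the following character into 64 bitmask slots is my assumption for illustration, not GitHub's actual layout):

```python
from collections import defaultdict

def char_bit(ch: str) -> int:
    """Map a character to one of 64 bitmask buckets (toy scheme)."""
    return 1 << (ord(ch) % 64)

def index_follow_masks(doc: str) -> dict:
    """For each trigram occurrence, OR in a bit for the character
    that follows it -- 'basically halfway to quad grams'."""
    masks = defaultdict(int)
    for i in range(len(doc) - 3):
        masks[doc[i:i + 3]] |= char_bit(doc[i + 3])
    return masks

def might_contain(masks: dict, query: str) -> bool:
    """Prefilter: every query trigram must have been seen followed by a
    character in the right bucket. False positives are possible; false
    negatives are not."""
    for i in range(len(query) - 3):
        tri, nxt = query[i:i + 3], query[i + 3]
        if masks.get(tri, 0) & char_bit(nxt) == 0:
            return False
    return True

masks = index_follow_masks("for testing the tester")
print(might_contain(masks, "test"))   # True
print(might_contain(masks, "fork"))   # False: no 'k' ever follows "for"
```

The saturation problem the post mentions is visible in this design: over a large corpus, a common trigram eventually gets followed by nearly every character, the mask fills with ones, and the extra bit of selectivity disappears.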
That's what I get for a cursory glance at 4am when I wrote this. I will have a much better look after I get some coffee into me.
Thanks for the clarification. Looking forward to seeing what else you and your team end up writing about. Which reminds me to publish some other posts I have about searchcode.
Cannot edit previous reply, but I would love to know more about how the sparse grams work. There isn't enough detail in the post, just a few tantalizing crumbs of information.
Seems a lot of others in this thread are interested as well.
Implementing your own index gives you more control over it. I think at this scale you probably want to tweak things specifically to your product rather than using a generic solution. I would guess that what you're indexing on (e.g. language, file, repo, etc.) and your sharding strategy affect the structure of your index as well.
Believe me I am aware. I am one of those who implemented their own index for a code search engine :) I did it for my own learning, but find it interesting because something like elastic with trigrams can get you very close, albeit at a far greater cost.
I'm reading your blog posts about building your own index now.
I started writing my own very simple index and search engine, but quickly decided to just use ClickHouse via https://tinybird.co as my backend (Serverless SQL with automatic APIs is pretty sweet) because I wanted to build out the product side of things and my data is really small, so I felt like it was going to be a lot of effort for little reward.
Maybe one day I will need to write a custom index or search engine that actually scales though :).
I've been using it since it entered beta a few months back. It's a huge improvement. Still not perfect, but it's actually usable now, and I don't just give up with it every time.
Hey everyone, I'm Colin from GitHub's code search team: happy to answer any questions people have about it. Also, you can sign up to get access here: https://github.com/features/code-search
Hi Colin, I’m curious how you search for repeated letters through an ngram index. I understand the example search with the string “limits” (find the intersection of “lim”, “imi”, “mit” and “its”). However, if the user wants to search for the string “aaaaa”, how would you go about that?
Good question. We still construct ngrams for it, exactly the same way. So for example, we might extract `aaa`, `aaa`, and `aaa`. Or we may extract `aaaa` and `aaaa`, or perhaps `aaaaa`. Then we deduplicate to find the unique ngrams and look them up in the index.
So it's possible that a document containing only `aaa` might match our ngram search, but we double check the content after retrieving candidates and exclude any false positives from the result set.
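The candidate-then-verify flow described above is easy to demonstrate with plain trigrams (a simplification: GitHub actually uses variable-length sparse grams):

```python
def trigrams(s: str) -> set:
    """Unique trigrams of a string -- the deduplication step."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Toy inverted index: trigram -> set of document ids.
docs = {0: "the limits of my language", 1: "aaa", 2: "aaaaa bbb"}
index = {}
for doc_id, text in docs.items():
    for g in trigrams(text):
        index.setdefault(g, set()).add(doc_id)

def candidates(query: str) -> set:
    """Docs containing every query trigram. May include false positives,
    which a post-retrieval content check removes."""
    result = None
    for g in trigrams(query):
        postings = index.get(g, set())
        result = postings if result is None else result & postings
    return result or set()

print(candidates("limits"))  # {0}: intersection of "lim","imi","mit","its"
# "aaaaa" dedupes to the single trigram "aaa", so doc 1 ("aaa") is a
# candidate even though it doesn't contain "aaaaa".
cands = candidates("aaaaa")
print(cands)                                       # {1, 2}
print({d for d in cands if "aaaaa" in docs[d]})    # {2} after verification
```

This is exactly the repeated-letter case from the question: deduplication collapses the query's ngrams, and the final content scan is what makes the answer exact.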
Have you considered using an index directly on language tokens (eg. the abstract language tree representing the file) instead of ngrams on the source text?
Actually, our search engine is so fast that syntax highlighting the search results is often slower than finding them... so if we store the language tokens directly in the index, we'll be able to directly emit syntax highlighted snippets and make it even faster.
It may also enable some interesting search capabilities in the future, like searching within comments or by code structure.
I really appreciate that this includes details about how search permissions work - how they ensure that search results include data from my private repos.
I'd always wondered how they implemented that: it turns out they add extra internal filters to their searches along the lines of "RepoIDs(...) or PublicRepo".
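A sketch of how such a filter might be attached to every query (the query-AST shape here is invented for illustration; only the "RepoIDs(...) or PublicRepo" structure comes from the post):

```python
def authorized_query(user_query: str, accessible_repo_ids: list) -> dict:
    """Wrap a user's search in an authorization clause, along the lines
    of the "RepoIDs(...) or PublicRepo" filter described in the post."""
    return {
        "and": [
            {"content": user_query},
            {"or": [
                # repos this user can access (private, via membership)
                {"repo_ids": sorted(accessible_repo_ids)},
                # plus everything public
                {"public": True},
            ]},
        ],
    }

q = authorized_query("parse_config", [42, 7])
print(q["and"][1])  # the permission clause attached to the search
```

The appeal of this design is that authorization becomes just another posting-list intersection inside the index, rather than a filter applied after retrieval.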
Question for the team: Do you have an additional permission check in the view layer before the results are shown to the end-user? I worry that if I switch a repo from public to private it may take a while for the code search index to catch up to the new permissions.
Yes, we never fully trust the search index so before anything is displayed to the user there are a number of final checks performed to make sure you're actually allowed to see that content.
Another fun example is that your SSO session might have expired. While you technically have access to view the result, we can't show it until you go through the refresh dance to get another valid token.
I’ve been using the new code search for a couple of months and I like it, but the UI is kind of antagonistic to how I typically want to search for things. For one, the new experience doesn’t actually load code onto the page; it does some sort of lazy loading thing as you scroll around, so ⌘F doesn’t work. I understand that there’s a custom search box to try to get around this, but it’s pretty slow and fiddly and I don’t really want to use it.

I also find the layout to be pretty annoying, because invariably there’s a symbol panel on the side that doesn’t work for the code I want to look at, and then it’s just there taking space. If I hit “t” to enter a file name and start typing, the text field loses focus after a second and I need to click on it again.

I know there are a couple of people on the team in this thread: I search a lot of code on GitHub and I feel like there are a couple of tweaks that would greatly improve my experience. I think I could even show you a video of all the places where the UI has gotten less usable for me. What would be the best way to get this feedback to you? I’ve posted stuff on the forum, but it’s unclear to me if this is the intended way to raise issues.
Hey saagarjha, thanks for the feedback. It's our goal to make the experience as good as possible, and we're aware of shortcomings with cmd+F and `t`, among other things. We're working on it, and your feedback helps us a lot.
We read all the feedback on the forum here: https://github.com/orgs/community/discussions/38692, so please keep providing it. Videos and screenshots are super helpful too. Thanks for bearing with us as we continue to polish the UX!
I really like the new search, though sometimes it is a bit deceptive. E.g. when navigating to a function name by clicking on a piece of code, suddenly you are in an entirely different code base looking at an unrelated function that happens to share the name.
It feels like GitHub code browsing is a step between a full editor with LSP and a static site. I hope they work out the kinks and make it more smooth.
It sounds like you are navigating in a language where we currently only support "fuzzy" or "search-based" Code Navigation, which only uses the _unqualified_ symbol name as the search target. As you point out, this can be very imprecise when there are particular symbol names that are used a lot. (Think `render` in a Rails codebase, for instance.)
We also have Precise Code Navigation — still currently only for Python [1], but we're working on other languages as fast as we can. As mentioned down-thread, that is built on our new stack graphs framework [2]. We're often asked why we're not leaning on something LSP-shaped for this. I go into a bit of detail about that in a FOSDEM talk from a couple years back [3], and also in the Related Work section of a paper that we just published [4].
I can only think of one hosted repo service that provides a working go-to-definition feature and it is not github or sourcegraph. What makes you think this will suddenly become widespread? Github spent years doing this and their new thing is strictly non-semantic. It doesn't have the faintest idea where the name is defined, or if there's even a difference between a function name, a parameter name, or a word in a comment.
> It doesn't have the faintest idea where the name is defined, or if there's even a difference between a function name, a parameter name, or a word in a comment.
I don't think what you are saying is actually true for stack-graphs[0][1].
FWIW Sourcegraph has fully precise/semantic go-to-definition, find-references, etc. We use SCIP code indexers (a spiritual successor to LSIF, the Microsoft standard for indexing LSP servers).
Not for C++. To test my recollection I navigated to abseil-cpp/strings/str_split.h, clicked on the declaration of absl::ByString::Find, and clicked "Go to definition". I was presented with every function in Abseil named "Find" regardless of its scope or parameter types. That's not "precise code intelligence"!
In the top right corner of the tooltip it will say either "Search-based" or "Precise" - in this case, you're right, we don't have the abseil-cpp repo indexed so it falls back to search-based as you describe.
We do have a C++ code indexer in beta, https://github.com/sourcegraph/lsif-clang - it is based on clang but C++ indexing is notably harder to do automatically/without-setup due to the varying build systems that need to be understood in order to invoke the compiler.
As my colleague mentioned in a sibling comment, we have an existing indexer lsif-clang which supports C++. I just added a Chromium example to the lsif-clang README right now: (direct link) https://sourcegraph.com/github.com/chromium/chromium@cab0660...
We are also actively working on a new SCIP indexer which should support features like cross-repo references in the future. https://github.com/sourcegraph/scip-clang
Right now, Abseil doesn't have precise code navigation because no one has uploaded an index for it. In an ideal world, we would automatically have precise indexes for all the C++ code on Sourcegraph, but that's a hard problem because of the large variety in build systems, build configurations, and system dependencies that are often specified outside the build system.
The rise and popularity of LSP and projects such as Tree-sitter are a superb foundation for features like this. Both support a wealth of languages; it is still, and will remain, quite hard to assume the toolchain and settings required for producing accurate information, though.
But this could be tied into CI, especially for projects utilizing runners for building the code.
So the barrier to entry now is orders of magnitude less and with github and others pushing for codespaces it could be the final piece to tie everything together.
> But this could be tied into CI, especially for projects utilizing runners for building the code.
One often overlooked drawback of generating this data during CI is that you, the project owner, are now paying for the compute. Of course, maybe you qualify for a free tier. And if not, if you're already running a CI job for testing/linting/etc, the marginal cost of also generating code nav symbols might not be too bad. But one benefit of the stack graphs approach mentioned up-thread (and the code search indexing described in OP) is that all of the analysis and extraction work is done in dedicated server-side jobs that GitHub is footing the bill for.
By necessity I already do it many times a day on my workstation. It would be neat for the cloud to start utilizing the endless compute available on client machines. A back-channel when doing a git push to publish some related artifacts?
I think that there should be some sort of standardized, language-agnostic metadata format for semantically indexing a codebase. It could include e.g type information for expressions or declared variables (for languages that infer types), and an index of symbols and how they're connected.
This metadata file would be generated by a language-specific tool. For instance, Cargo could generate it for Rust projects, ctags/cmake for C / C++, Sorbet for Ruby.
Then a service like Github / Gitlab / your own homegrown code viewer could provide things like "Show type", "Jump to source" without ever needing to build a language-specific parser or interpreter (which seems arbitrarily difficult, most build systems provide escape hatches, so you can't assume much about project structure).
Basically, LSP as a static file in a standard format that tools can read to understand a codebase, without needing to model the language's semantics.
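To make the proposal concrete, here is a toy version of such a static metadata file (the schema is invented purely for illustration; real formats in this space are far richer):

```python
import json

# A toy, language-agnostic symbol index of the kind described above.
# A language-specific tool would emit this file; a generic code viewer
# could then serve "Show type" and "Jump to source" from it without
# understanding the language at all.
index = {
    "format": "toy-symbol-index/0",
    "symbols": [
        {
            "name": "parse_config",                 # hypothetical symbol
            "kind": "function",
            "type": "(path: str) -> Config",
            "definition": {"file": "src/config.py", "line": 12},
            "references": [
                {"file": "src/main.py", "line": 3},
                {"file": "tests/test_config.py", "line": 8},
            ],
        }
    ],
}

def jump_to_definition(idx: dict, name: str):
    """What a generic viewer does: a pure lookup, no parsing required."""
    for sym in idx["symbols"]:
        if sym["name"] == name:
            return sym["definition"]
    return None

blob = json.dumps(index)  # the static artifact a tool would publish
print(jump_to_definition(json.loads(blob), "parse_config"))
```

The hard part, as the comment notes, isn't the file format; it's getting every language's build tooling to emit accurate data.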
This is pretty much exactly what we've built at Sourcegraph. Microsoft had introduced (but pretty much abandoned before it even started) LSIF, a static index format for LSP servers which encodes in detail all possible LSP requests/responses, effectively.
We took that torch and carried it forward, building the spiritual successor called SCIP[0]. It's language agnostic, we have indexers for quite a few languages already, and we genuinely intend for it to be vendor neutral / a proper OSS project[1].
All of our SCIP indexers are open-source: scip-java (for Java, Kotlin and Scala), scip-typescript (for TypeScript and JavaScript), scip-python, scip-ruby, scip-go and scip-clang (for C and C++).
SemanticDB (https://scalameta.org/docs/semanticdb/guide.html) is a protobuf-based file format that does almost exactly this for JVM languages, primarily Scala (I was a contributor a while back). It is used to build an intelligent online code browser, as the backend for a language server, and to do intelligent refactorings.
I think a language-agnostic semantic metadata format is a good idea, but requires a lot of compromise. ctags partially does this, but only to a very coarse level (mostly definitions and references). I think some ctags implementations also define 'extension fields' that could be used to give type information, but I don't know how/if these are used in practice. SemanticDB is extremely fine-grained, but highly specialized to JVM languages and type systems that are designed to work with the JVM. Finding a common set of semantic features that can be used across languages and type systems that is fine-grained enough to be more useful than ctags sounds very difficult to me.
I think simple things like "go to reference" or "show type" would be sufficient for 95% of usecases. But if you split languages up into a few different categories (maybe along the lines of Algol-like vs Lisp-like), and were flexible with extensions, I'd imagine we'd see some common patterns emerging, and clients would take advantage of that. Best effort is probably good enough to greatly improve the ergonomics of search.
This is a great intro / overview of full-text search for those wondering how to build your own search engine.
It's a great 101-level exercise to write an inverted index implementation (you can do it in an afternoon), and then expand it to a leaf/aggregator architecture in follow-up exercises.
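For anyone wanting to try the exercise, a minimal inverted index really is only a few lines (a toy sketch, nothing like what GitHub runs):

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal inverted index: token -> set of document ids (a posting list)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query):
        # AND semantics: intersect the posting lists of all query tokens.
        sets = [self.postings.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()

idx = InvertedIndex()
idx.add(1, "github code search")
idx.add(2, "code search at scale")
idx.add(3, "github actions")
print(idx.search("code search"))  # -> {1, 2}
```

The leaf/aggregator follow-up is then "run N of these, fan the query out, merge the results".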
I did this for our organization using SQLite's FTS module and Datasette, and boy was it fast. Unfortunately I did get (temporarily) banned from the organization's GitHub account, but it was definitely worth it.
Even now I find myself using it despite the index being a few months out of date.
It is indeed Information Retrieval 101 level stuff which leads to the question of why this is the best GitHub can do with all the resources of Microsoft behind them. It's almost useless, at least for C++. It can't tell the difference between foo(int) and foo(double) or this::foo vs. that::foo.
If I wanted the kind of search engine I can get a teenager to write in 16 weeks why would I expect my org to be paying $$$ for the service?
What a shit take. The article itself is perhaps a nice light overview of 101-ish level concepts, although knowing how and when to apply them in a real engineering context is not something I would consider 101 level. And certainly, building something that is actually at the scale of GitHub Search is nowhere near 101 level.
Have you tried the new search? Thanks to the variable length ngram indexing mentioned in the post, it can handle all of those cases. Sign up here to try it: https://github.com/features/code-search
Symbol extraction for C and C++ is currently disabled because we were having problems with the performance of the tree-sitter queries we were using, but we are planning to bring that back.
Sorry, it cannot handle any of those cases. You're talking about the ability to find the literal `this::foo` but that's not how it would normally appear. It normally will appear anywhere inside a `namespace this` scope, which cs.github does not grok. And cs.github cannot address finding the definition related to a given call site. It doesn't even try.
Grimoire (I see that name in the HTML and HTTP requests for cs.chromium.org and cs.android.com, so I presume that's what it's called) is really cool, although sadly not OSS. It completely falls apart when faced with JS, which is increasingly being checked in as part of frontend and Mojo glue code, so from the perspective of an outsider trying to get their feet wet it's a bit creaky. But being able to click around in C++, which I don't really understand at this point, and learn something new almost every time is really cool, and IMO representative of at least one concrete benefit whenever you do get this working.
I wonder if it would be possible to leverage LSP as a kind of tokenization generalization framework, or even piggyback off of the existing effort by incorporating search-friendly/-helpful metadata into future versions of the protocol spec.
"all the resources of Microsoft" doesn't really say anything about the size of the team involved here. Frankly, it sounds like a pretty small one: clearly GitHub is a successful business with many customers and a significant user base even with very basic code search.
It seems to me that you know what you want from such a service, but focusing on making C++ exceptionally great in this service would come at the cost of, say, general quality across all languages, or frontend usability. A very reasonable tradeoff for a beta-quality product.
With the current search, I can search [0] the Django repo for a class that definitely exists [1] in Django, and there are 0 code results. Zero. GitHub search is mystifyingly bad; I hope this is a LOT better.
I use this one almost daily. It's great to find real world examples of APIs/contracts being used. Also, instant results!
The underlying data may be limited (I have no idea how large it is, I doubt it has indexed every public repository out there), but I never failed to find examples of what I was looking for.
Was looking for more details on the data structure 'Geometric filter' mentioned in the footnotes.
Couldn't find anything (a few unrelated papers on object recognition aside). If anybody can share anything, that would be great!
Damn, it's about time, the current search sucks. What I have found to work very well is SourceGraph; they offer search for public repos. Maybe this'll be an alternative to it.
I wish they provided short-name versions of their filters. For example, instead of "withContext language:python path:tests", I could write "withContext l:python p:tests".
Blackbird being written in Rust is a natural approach. Trying to build the whole thing with a single technology is unwise (looking at you, isomorphic JavaScript).
My biggest feature request would be sorting or filtering by code/commit/repo age, or even repo stars.
Most often I end up using code search for figuring out where a piece of code originated, just to find thousands of random projects that have also copied the same code verbatim. Sorting for "relevance" or "latest/oldest indexed" are equally useless.
In general, I really recommend code search as a tool for supplementing reading the documentation and source code of your dependencies (you are reading the source code, right?). I reach for it almost every day, and I find it's a reliable tool for identifying "the right way" to use a library, especially one that isn't fully documented.
1) I never want to search all repos globally. At worst I want to search all of my org's repos.
2) the search UI is a little clunky, in a way I'd need to be using it again to remember.
Between those two I think there's loads of progress to be made outside of raw search power. Of course it's nice to have the raw power, but the usability side is what I'm really after.
I use global search from time to time to see how other projects use certain libraries. When the documentation of said libraries is sparse this can sometimes be a good timesaver.
The other big thing that works well is being able to jump directly into the source of an open source library from your code. That is powerful, but again, possibly doesn't need a giant ultra search. Just some clever linking.
This sounds like Cross-Repo Code Navigation [1], which we do support, though only for Python at the moment. And you're right that it does not use the same search index under the covers — OP is specifically describing the index for "search box"-style code search across a large set of repositories (org-scoped, global-scoped, me-scoped, etc). For Code Navigation, we have a _different_ set of interesting and bespoke indexers and storage formats. I've posted some links here if you want to learn more: [2]
So in the sparse grams explanation, what are the bigram weights?
Is it inverse frequency, so common bigrams get split last? And the goal is to be able to search on a larger gram that covers the more common trigrams as often as possible?
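To make my question concrete, here's my guess at how a weighted splitting might work (the weights, threshold, and splitting rule are entirely made up; the post doesn't give the real scheme):

```python
# Hypothetical bigram weights: in reality these would presumably come from
# corpus statistics. Low weight here means "cheap to split on".
BIGRAM_WEIGHT = {"li": 5, "im": 2, "mi": 7, "it": 3, "ts": 1}

def sparse_grams(s, threshold=3):
    """Emit variable-length grams, splitting after low-weight bigrams so
    common (high-weight) bigrams stay inside longer grams."""
    grams, start = [], 0
    for i in range(len(s) - 1):
        if BIGRAM_WEIGHT.get(s[i:i + 2], 0) <= threshold:
            grams.append(s[start:i + 2])
            start = i + 1
    grams.append(s[start:])
    return [g for g in grams if len(g) >= 2]

print(sparse_grams("limits"))  # e.g. -> ['lim', 'mit', 'ts']
```

If something like this is right, a query only needs to intersect a handful of longer grams instead of every trigram, which would explain the "sparse" in the name.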
One nit I have about current search: I’ll look something up and find I’m getting results for some obtuse commit in some old branch somewhere. I’d like to be able to optionally say “latest commit on branches only please” or “main branch only please.”
Another thing, which might betray that I don't understand search all that well: language-aware searching that knows, for example, that a single and a double quote are syntactically interchangeable. Don't omit half the results because I used one quote over the other when looking up `interpolation = 'nearest'`.
Will this allow for a happy closure of this question about searching partial words? [0]
Like searching for "OPTION" and getting "-DOPTION=TRUE" among the results. Very commonly needed to find all usages of a flag, even instances where the flag is being passed to (at least, that I know of) CMake and Meson.
I don't think Sourcegraph is in big trouble here. Their whole play is enterprises, who likely have code spread across many different hosts. On top of that, their code search is still miles ahead of GitHub's.
It's very useful for enterprises to have one solution like GitHub. GitHub isn't the best in every field, but overall it's the best offer (they have git hosting, continuous integration, bug tracking, discussion forums, hosted dev environments, and now code search). It reminds me of the bundling strategy used in MS Office.
I've worked alongside the CEO/CTO of Sourcegraph for the past 8 years, everyone else is at our company offsite so I figured I'd chime in :) nobody asked me to write this (nor did I ask) :)
The article is a top-notch technical write-up, the devs on GitHub code search should be proud of what they've achieved so far!
Honestly, we're rooting for GitHub to improve their code search; we view them as a close peer, not a competitor. We also maintain OSS projects like Zoekt, which IIRC GitLab is looking at using for their own code search. The more devs that 'get' code search, the better off Sourcegraph is, frankly!
GitHub has a nice, intuitive/simple UX; we could learn a thing or two there (though that's easier to do with fewer features).
Still, Sourcegraph search tech is quite a bit more powerful:
* Searching over commit messages, diffs, filename, etc. are super nice for tracking down regressions / finding 'that PR I swear my coworker made'
* Expressiveness like "find this regexp in repositories, but only if the repo has had a commit in the last month AND has a file named package.json in its root"
* Since Steve Yegge joined us, we've started thinking about ranking of search results, a notoriously difficult thing to do well in code search unless you have great factors to rank on (e.g. a semantic understanding of code): https://about.sourcegraph.com/blog/new-search-ranking
* We stream results back, so you can get a comprehensive set of results from our API, not just a few pages.
* Works in GitHub Enterprise, not just GitHub.com. Plus on all your code hosts, think BitBucket, GitLab, Azure DevOps, Gerrit, Phabricator, etc. and even non-Git VCS like Perforce.
* Respects permissions of all your code hosts (a very difficult problem, as there are no official APIs to query this info from code hosts in general)
Having code search is one thing, but using it is another:
* Code Insights (we use search as an API to gather statistics about code, track code quality, keywords, etc. both over time and retroactively and let you build dashboards)
* Batch changes (find+replace, but over thousands of repositories. Run a Docker container per repo, run your custom linter script etc. and then draft or send PRs to thousands of repos, manage/track campaigns with thousands of PRs like that over time, etc.)
* Precise code intel / semantic awareness of code, we use SCIP indexers for this (spiritual successor to Microsoft's LSIF format for indexing LSP servers.)
I am super happy GitHub continues to push their code search effort, and genuinely believe it's a great thing for all developers and us over at Sourcegraph. Also excited to see when they do their public rollout of this :)
Anyway, that's just my take as someone who works there; other Sourcegraphers will chime in later if anything I said above feels off to them, I'm sure :)
Thanks for the information! I came to the hackernews discussion because SourceGraph was so curiously absent from the article and the "brief history of codesearch" article it linked to, and I wanted to see what others thought!
The new code search includes private repos. Miles ahead doesn’t matter if you get the search at GitHub for free. GitHub actions started out pretty terrible but they are now dominating hosted CI.
SourceGraph likely has challenging times ahead considering the valuation.
Sourcegraph CEO here. “Challenging times” is how it should be all the time. That means competition is forcing both GitHub and us to build better stuff for devs. Devs win.
But to be clear, as a company we are doing well and growing nicely inside customers, with a ton of cash in the bank, an awesome team, and a huge opportunity ahead of us. GitHub’s new code search has been out for 14 months now, so this is nothing new.
It’s a big market and there’s way more room for differentiation and dev choice in code search/intelligence than in CI. There’s a lot of code intelligence that GitHub won’t support (precise code nav for more languages, comprehensive code ownership, metadata from other dev tools that know things about code outside the GitHub/Microsoft suite, etc.), there’s a need for the ability to fix (with our Batch Changes) not just find, and even in the core search workflow there’s so much room for improvement with AI fine-tuned on your own code, etc.
But talk is cheap and only shipping matters. So, watch what we ship, and send any feedback and requests our way!
Sourcegraph has become pretty obscenely expensive in the last couple years (with borderline hostile sales folks to boot). I know at least one company who would love to be able to cancel that contract if GH search is "good enough" now.
Sourcegraph CEO here. We definitely need a cheaper tier for smaller companies or those who don’t need our entire feature set. I agree! What do you think that should be?
Overall, we’re building what our customers need, and our product goes way beyond what GitHub can offer. Sourcegraph indexes all the code and increasingly all the code intelligence (including code nav but also code ownership and other metadata in the future from your other dev tools). We charge based on active usage, so we make money when devs at customers /choose/ to use us over the alternatives. We’re trying to do this the right way, and tons of customers agree. (If anyone reading this disagrees, please let me know!)
Re: your comment about our sales team, I’m really sorry to hear that and want to understand more so I can fix the problem. Can you please email me at sqs@sourcegraph.com?
Sourcegraph had to have known GitHub would do this if they didn't accept the offer. Since this should be expected, the launch of this feature shouldn't change what their decision should have been.
Sourcegraph CEO here. Just to be clear, so internet rumors don’t get started, there was no “offer” here. We started Sourcegraph with the intent of remaining independent because building really good code search and intelligence means working across all code (not just on GitHub), all devs, and all code intelligence sources (code nav plus every dev tool you use that knows stuff about code, not just the ones in the GitHub/Microsoft suite bundle). We’ve never entertained any kind of acquisition interest for this reason.
We don’t think any of today’s code host vendors with their current strategies can make truly great code search and intelligence because they’ll be biased toward their own bundled tools and limited to the subset of code hosted on that instance. It’d be kind of like Encyclopedia Britannica or The NY Times building a web search engine: helpful, but so much more limited compared to what the independent Google became.
And yes, none of this was a surprise. GitHub’s new code search has been out for 14 months now.
GitHub's code search product looks limited compared to what Sourcegraph provides, but I don't think being limited to code on their instance is a problem. Even if a company doesn't want to use the full GitHub source code management stack, if GH search becomes good enough, people could mirror their repos onto GitHub or GitHub Enterprise Server just to use its search functionality, and Microsoft will go after that segment if they see a market there.
> Shard by Git blob object ID which gives us a nice way of evenly distributing documents between the shards while avoiding any duplication. There won’t be any hot servers due to special repositories and we can easily scale the number of shards as necessary.
What exactly do they mean by "special repositories" here?
(I edited this post.) Maybe that could have been worded more clearly... What we mean is repositories with atypical activity patterns, for example a busy monorepo with lots of files and continuous pushes. If we sharded by repository, a single shard would be responsible for processing all the updated documents for this busy repository.
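The sharding rule itself is simple to sketch (this is illustrative, not our actual code; the shard count and hash choice are made up):

```python
import hashlib

NUM_SHARDS = 32  # illustrative; the real deployment differs

def shard_for_blob(blob_oid: str) -> int:
    # Git blob OIDs are already content hashes, so a hash of the OID spreads
    # documents uniformly across shards regardless of which repository they
    # came from. Identical blobs (e.g. files shared across forks) also land
    # on the same shard, which is what enables deduplication.
    digest = hashlib.sha256(blob_oid.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# The empty blob's well-known OID, just as an example input.
print(shard_for_blob("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))
```

Under repo-based sharding, by contrast, every push to that busy monorepo would hit the same shard.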
The biggest problems I have with their code search are basic usability features, not the search itself. I need a way to exclude private repositories in the result so I’m not clogged by internal instances of what I’m looking for. I need the UI to improve so I don’t have to go to advanced search for every filter I want to do.
Interesting stuff; I was curious how they handle repeated letters in the ngram index. I understand their example search with the string "limits" (find the intersection of "lim", "imi", "mit" and "its"). However, if the user wants to search the string "aaaaa", how would they go about searching that?
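For context, my understanding of the textbook approach with a positional index (I don't know what GitHub actually does):

```python
def trigrams(s):
    """Extract every trigram together with its offset in the string."""
    return [(s[i:i + 3], i) for i in range(len(s) - 2)]

# "aaaaa" yields the same trigram "aaa" three times, at offsets 0, 1 and 2.
# A posting list that stores (doc, position) pairs can therefore still
# verify a run of five 'a's by checking the matches occur at consecutive
# offsets, even though the term dictionary only has one entry for "aaa".
print(trigrams("aaaaa"))   # -> [('aaa', 0), ('aaa', 1), ('aaa', 2)]
print(trigrams("limits"))  # -> [('lim', 0), ('imi', 1), ('mit', 2), ('its', 3)]
```

Without positions, I'd guess the index just returns "documents containing 'aaa'" as candidates and a post-filtering pass confirms the full run.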
Search is a fascinating topic because it's such a fundamental problem and every search engine is based around the same extremely simple data structure (Posting list/inverted index). Despite that, search isn't easy and every search engine seems to be quite unique. It also seems to get exponentially harder with scale.
You can write your own search engine that will perform very well on a surprisingly large amount of data, even doing naive full-text search. A search tool I came across a while back is a great example of something at that scale: https://pagefind.app/.
For anyone who doesn't know anything about search I highly recommend reading this (It's mentioned in the blog post as well): https://swtch.com/~rsc/regexp/regexp4.html.
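The core trick from that article, in miniature (the posting lists and documents here are made up):

```python
import re

# Before running an expensive regex over every document, use a trigram index
# to narrow down candidates: abc.*def can only match text that contains BOTH
# the trigram "abc" AND the trigram "def".
postings = {          # made-up index: trigram -> set of doc ids
    "abc": {1, 2, 3},
    "def": {2, 3, 5},
}
docs = {2: "abcxxdef", 3: "defabc", 5: "xdefx"}

candidates = postings["abc"] & postings["def"]   # -> {2, 3}
# Only the surviving candidates get the real (expensive) regex match.
matches = {d for d in candidates if re.search(r"abc.*def", docs[d])}
print(matches)  # -> {2}
```

Doc 3 shows why the post-filter matters: it contains both trigrams but not in an order the regex accepts.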
It's interesting that GitHub seems to have quite a few shards. Algolia basically has a monolithic architecture with 3 different hosts which replicate data and they embed their search engine in Nginx:
"Our search engine is a C++ module which is directly embedded inside Nginx. So when the query enters Nginx, we directly run it through the search engine and send it back to the client."
I'm guessing GitHub probably doesn't store repos in a custom binary format like Algolia does though:
"Each index is a binary file in our own format. We put the information in a specific order so that it is very fast to perform queries on it."
"Our Nginx C++ module will directly open the index file in memory-mapped mode in order to share memory between the different Nginx processes and will apply the query on the memory-mapped data structure."
100ms p99 seems pretty good, but I'm curious what the p50 is and how much time is spent searching vs ranking. I've seen Dan Luu say that majority of time should be spent ranking rather than searching and when I've snooped on https://hn.algolia.com I've seen single digit millisecond search times in the responses, which seems to corroborate this.
I'm curious why they chose to optimize ingestion when it only took 36hrs to re-index the entire corpus without optimizations. A 50% speedup is nice, but 36hrs and 18hrs are the same order of magnitude and it sounds like there was a fair amount of engineering effort put into this. An index 1/5 of the size is pretty sweet though, I have to assume that's a bigger win that 50% faster ingestion.
Since they're indexing by language I wonder if they have custom indexing/searching for each language, or if their ngram strategy is generic over all languages. Perhaps their "sparse grams" naturally tokenize differently for every language. Hard to tell when they leave out the juiciest part of the strategy, though: "Assume you have some function that given a bigram gives a weight".
> It's interesting that GitHub seems to have quite a few shards. Algolia basically has a monolithic architecture with 3 different hosts
I used to work at an Algolia competitor. I don't know for sure, but my guess is that Algolia shards their indices by customer. Algolia does not provide global search; GitHub code search does. That, and the desire to deduplicate data, is what led us to our current sharding strategy (notably, it is different from the old GitHub code search's sharding).
> I'm guessing GitHub probably doesn't store repos in a custom binary format like Algolia does though:
We have a custom index format, so I would say this is the same, unless you mean something different. We of course translate repos from their Git form to our index document form for indexing.
> I'm curious why they chose to optimize ingestion when it only took 36hrs to re-index the entire corpus without optimizations. A 50% speedup is nice, but 36hrs and 18hrs are the same order of magnitude and it sounds like there was a fair amount of engineering effort put into this. An index 1/5 of the size is pretty sweet though, I have to assume that's a bigger win that 50% faster ingestion.
The index size is a bigger win, but being able to reindex quickly is huge for our development velocity and trying things out. We really feel it when things are slow. This is also not our final goal, we want to scale the system up considerably.
I'm not familiar with production search systems at scale (Very curious about them though). How do you think Algolia shards their data given that architecture? Based on their description it seems like the search engine itself is monolithic. Maybe they're running a 3-node cluster with a monolithic index for each customer?
Interesting, do you keep a copy of the index document form of repos or is that done on the fly during indexing? Is your custom index format a binary format? I have no idea whether that's standard practice, or just a compressed text format is enough. I guess that non-binary formats would be enormous though, and given that an index is by definition relatively unique it probably wouldn't compress that well.
I do feel the development velocity thing. I've felt something similar on my smaller scale projects. Being able to fully re-index the corpus in less than a day definitely seems like it would provide a lot of opportunities to experiment and try stuff out without it being too costly.
Scale up in terms of what? Is the current system not indexing all of GitHub, or you mean you want to index on more things (E.g. commits, PRs, etc)?
> How do you think Algolia shards their data given that architecture?
My guess is that Algolia's indices are sharded by customer and each cluster probably has multiple customer indices.
> do you keep a copy of the index document form of repos or is that done on the fly during indexing?
As mentioned in the post, the index contains the full content. Our ingest process essentially flattens git repos (which are stored as DAGs) into a list of documents to index (the prior state is diffed for changes).
> Is your custom index format a binary format? I have no idea whether that's standard practice, or just a compressed text format is enough. I guess that non-binary formats would be enormous though, and given that an index is by definition relatively unique it probably wouldn't compress that well.
Binary formats are normal, posting lists are giant sorted blocks of numbers, so there are a lot of techniques that can be used to compress them. Lucene's index format is pretty well documented if you're interested in learning more (interestingly, Lucene has a text format for debugging: https://lucene.apache.org/core/8_6_3/codecs/org/apache/lucen...).
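A toy illustration of the most basic of those techniques, delta encoding plus varints (not our actual format, just the general idea):

```python
def varint_encode(n: int) -> bytes:
    """Standard LEB128-style varint: 7 bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(doc_ids):
    # Posting lists are sorted, so store deltas between successive ids;
    # small deltas then encode in one or two bytes each.
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        out += varint_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)

postings = [1000000, 1000003, 1000009, 1000010]
compressed = compress_postings(postings)
# 3 bytes for the first id, then 1 byte per subsequent delta.
print(len(compressed))  # -> 6
```

Real formats layer block-wise encodings, skip lists, etc. on top of this, which is where the Lucene docs are worth a read.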
> Scale up in terms of what? Is the current system not indexing all of GitHub, or you mean you want to index on more things (E.g. commits, PRs, etc)?
It's not indexing all of GitHub yet, nor do all users have access yet. Those are the things we are focusing on now. In the future, we want to support indexing branches.
I've been using this since it was still an email-signup beta. I don't do anything too complicated, but man, it's been invaluable to do exact-string searches across all of my organization's repos. I use it most days at work.
I feel search is the most complex domain, tech-wise. I always feel overwhelmed by how people design such systems. Would love to learn more about search. Any books or courses? Right now I can only do binary search.
I was working on a research project a while ago, and every time I searched for something particular it immediately thought I was a bot after like 2-3 specific/exact queries.
Ever since then, I've exclusively used sourcegraph.
Kythe is not a regex search engine. It depends on extracting precise semantics of all the code it runs on to compute correct edges like "calls-function". This only works for a few languages, and is extremely difficult to do generically across all of github.
Our Code Search team is currently working on moving to Zoekt[0] which is expected to be a significant improvement as it is purpose-built for code search.
We also shipped an improvement[1] to our existing search functionality at the end of last year. If you haven't used it recently, I'd encourage you to check out code search again to see if the quality has been improved for you.
Hmm, not sure if I should delete my (2nd) GitHub account again. Just thinking about how much data they are getting from users, it could become the Facebook of Git.