Hacker News
Improving GitHub Code Search (github.blog)
505 points by todsacerdoti on Dec 8, 2021 | 213 comments

If anyone from GitHub is listening, being able to exclude test code with a few clicks would be an absolute game-changer. By far the biggest source of noise in my GH code search results, and I use the tool (and similar tools like Searchfox) super super heavily. Either way, stoked to try this out.

Thanks for the feedback! We downrank test files with a heuristic, though we'll definitely be looking to make this more sophisticated. You can also exclude results using a regular expression, like `foo NOT path:/_test\.go$/`.
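To make the path filter above concrete, here is a minimal sketch of what a pattern like `_test\.go$` keeps and drops. It uses Python's `re` module rather than GitHub's own matcher, so treat it only as an illustration of the regex; `exclude_tests` and the sample paths are made up for this example.

```python
import re

# Same pattern as in the query above: paths that end in "_test.go".
TEST_FILE = re.compile(r"_test\.go$")

def exclude_tests(paths):
    """Return only the paths that do NOT look like Go test files."""
    return [p for p in paths if not TEST_FILE.search(p)]

paths = ["pkg/foo.go", "pkg/foo_test.go", "docs/_test.go.md"]
print(exclude_tests(paths))
```

Because of the `$` anchor, `docs/_test.go.md` is kept: the pattern only matches when `_test.go` is the very end of the path.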

And also note that if you often need to add this kind of qualifier to many searches, you can create a "custom scope" that includes it for you transparently.

And inversely, searching specifically for test code can also be useful, for example when you are looking for a sample implementation.

Exactly this. It will also reduce unnecessary requests on their servers.

> Search for an exact string, with support for substring matches and special characters, or use regular expressions (enclosed in / separators).


Search-for-literal is so important when you have technical users working on non-prose text.

They say this is going in a dedicated search page 'to start with'. If "<literally any text>" doesn't eventually work in the top bar, this is still going to be miserable.

I'm from the team that developed this at GitHub - if you are in the technology preview, then you can jump into cs.github.com from searches done at the top bar.

Thank you

I use Github's UI for exploring and searching codebases more often than my own environment, since I do a lot of curious browsing.

No offense, but the search is so bad for anything more complex than a single word that I've developed a sort of intuition for how to phrase things -- and then still spend a lot of time crawling pages of results haha.

This was sorely needed

Couldn't agree more - that's why we built it! Please give the new search a shot, I think you'll like it :D

how to get access to the new search :)

Hi Abdallah. If you haven't signed up yet, you can do so here: https://github.com/features/code-search/signup. People on the waitlist are getting access as quickly as the team can support it.

What's your take on developing a new code search instead of partnering with an existing global code graph like Sourcegraph? What are the advantages of GitHub Code Search over Sourcegraph?

Well, in the past I've tried Sourcegraph several times, but it never gave me an experience that matched the long-dead Google Code Search. I hope the new GitHub code search does.

Hey Edwin, if you're open to providing feedback, I'd love to understand which types of searches worked well for you in Google Code Search but not in Sourcegraph. We've invested a lot of thought into our query syntax, supporting literal matches, regex, and the Comby pattern matching syntax with a rich set of keywords and filters—but we know the syntax isn't always intuitive for every user. We're always trying to improve the experience for all our users (I'm the CTO at Sourcegraph), so if you have any recollections you're open to sharing, would love to hear them!

For me, one thing: although I do use Sourcegraph for searching public code (thanks for creating and maintaining it!), I find the website too slow and heavy-feeling. Not in a way that makes it impossible to use, for sure, but just slow enough that it feels "bulky" and laggy rather than "snappy". This, and maybe some other UX choices (such as not always fully supporting right-click and "open in new tab"; I'm not sure that's actually the case, but I always feel afraid to do it, perhaps because I fear slowing down my browser), makes me search with it less than I need or want to, and makes me hesitate every time I think about opening a new tab with Sourcegraph. I know the results will be very good, but I also know the lagginess will tire me out.

Is there a way to change the default search context (which is global)?

It's not quite what you're asking, but if you want to limit your searches to a particular repo, you can just type that into the URL bar (and then bookmark that): https://cs.github.com/$OWNER/$REPO, just like how the repo's primary site on GitHub is https://github.com/$OWNER/$REPO

Not right now, but that's some common feedback that we hope to soon address.

+1, pretty much what was on my mind seeing this. Does this compete with or complement Sourcegraph?

I use github search a lot and this would be an insane productivity boost. I signed up for the waitlist. can you please give me a nudge in the queue? This is my profile https://github.com/abdallahmansour6

Given the shoutouts to Burntsushi and Lemire this is almost certainly a bitmap trigram index based engine similar to https://github.com/google/zoekt

The index is likely based on Roaring bitmaps, presumably https://github.com/RoaringBitmap/roaring-rs in this case.

Nice architecture, exactly how I would have done it also.

Nope, I would have used an existing search solution, like xapian. It does so much more, and much faster.

You need to support a proper query syntax, with tags, rankings, stopwords, stemming. Then you need to have a proper db backend (reverse indices). Trigrams don't help for regex. Then a templated representation. Google codesearch would do only the 2nd of 3. Elasticsearch is commercial, and only Java.

Doing that from scratch is a bit silly.

Oh OK, you have clearly spent more time thinking about this problem than the team of engineers at GitHub who've been researching code search at scale for more than four years. I bet they feel real silly right now knowing they could have shipped this search engine in a couple weeks taping together off-the-shelf libraries if they only had your talent for software architecture.

Sure, I did. I've implemented a proper search for a very big company's document and knowledge base. Similar to gmane, which also used xapian. Much better than what I see there. Or with the old Google code search.

Everything you are talking about is useful for full-text search. It's practically pointless for code search. Trigrams definitely help for regex, you can make use of a trigram index for a quick false positive index lookup.
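As a rough illustration of that "quick false positive lookup", here is a toy trigram index. It is a sketch under stated assumptions, not how GitHub or zoekt actually implement it: plain Python sets stand in for bitmaps, the function names are invented for this example, and the literal that every match must contain is passed in by hand instead of being derived from the regex.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def search(docs, index, required_literal, pattern):
    """Narrow candidates with the index, then confirm with the real regex.

    required_literal is a substring every match must contain (an assumption
    supplied by the caller here). The index step may keep false positives,
    but it never drops a document that truly matches.
    """
    candidates = set(docs)
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    return sorted(d for d in candidates if regex.search(docs[d]))

docs = {
    "a.c": "char *p = strstr_r(buf, key);",
    "b.c": "size_t n = strlen(buf);",
    "c.c": "/* strstr is documented elsewhere */",
}
index = build_index(docs)
print(search(docs, index, "strstr", r"strstr[a-z]*_r"))  # ['a.c']
```

Here `b.c` is rejected by the index alone (it has `str` but not `trs`), while `c.c` survives the index step as a false positive and is only thrown out when the regex runs. The regex engine touches two documents instead of three; at scale that ratio is the whole point.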

Code and log search are two specialized use-cases of search that definitely warrant a non-full-text approach and as far as I am aware the trigram bitmap index would be considered state of the art for both.

That isn't to say you can't solve the problem with a full-text search engine; many people do, with the solutions you alluded to. However, they are drastically less efficient and probably out of the question at GitHub's scale.

Code search typically does not need many (most?) full-text search features like TF-IDF, stopwords, stemming, tagging, etc. It's a categorically different domain.

> Trigrams don't help for regex.


PostgreSQL uses trigrams to optimize regex searches.

I love how the Microsoft acquisition continues to result in increased investment in GitHub, with Microsoft's resources and real vision; that's not always how an acquisition goes.

There was a moment where I thought GitLab was poised to start hitting critical mass and GitHub's best days were behind it, that was probably around right before or after the acquisition.

Microsoft has done a great job of actually improving the product and investing in it, something that seems to not happen most of the time with giant acquisitions.

GitHub today, at least for me, is definitely improved over the GitHub five or so years ago. It does feel like some things are bloated due to a push for feature after feature, but the core features have gotten better to the point I don't care. I just wish I didn't have to spend 2 minutes turning off all the annoying features I don't care about for small projects on every new repo I create.

(From a GitHub product manager) Thank you for this feedback about the pain of having to repeatedly change repo settings. It makes great sense that you'd want to repeat certain settings. Would it meet your needs if repository templates allowed you to have settings that got copied when you created a repo from the template (https://docs.github.com/en/repositories/creating-and-managin...)? I also wonder if you'd be interested in repository settings being settable in a text file. If you want to continue chatting about this it would be great if you could post it in GitHub's feedback discussions here: https://github.com/github/feedback/discussions/categories/ge.... Thanks again!

Having some sort of repo config file (either account-wide or in the .github folder) would be very helpful.

Thanks for your helpful perspective, geerlingguy.

and a departure of all the key execs

what about it? That's not even a sentence.

Microsoft has always been a dev tool company.

I doubt this would have been true before WSL. I mean, developing on Windows was always far more difficult than on Linux or macOS.

Win32 API isn't nice, but Microsoft was always relatively good with documentation etc., and don't forget all the developer support within Excel, VBA, Visual Basic etc. Bill Gates understood early on the premise of building a platform and not breaking it, even if that meant the Win32 API became ugly over time. Old Windows programs still work on newer releases.

It depends on what you were developing.

You wouldn't know it looking at MS Visual Studio though.

What makes you say that? I love to hate on Microsoft, wouldn’t touch Windows with a 100ft pole, but I always give credit to Microsoft for VS, VS Code (and my favorite languages, TS and C#).

I admittedly haven’t used Visual Studio in a while (though I use VS Code daily) but I remember it as my favorite IDE. Certainly better than XCode, and I even preferred it to IntelliJ.

To the credit of teams past, SSMS still does not have an Electron-based equal (Azure Data Studio will get there, but it's not there yet).

Still waiting for the ability to search in other branches. It's a pain when some codebases have stable releases on the next/dev branch but keep their main branch to the previous release.

Absolutely. I get they don't want to index every branch, but they could at least set some heuristics, like whether it has a certain amount of activity or something, per repo. Or even allow repos to opt into 1 or 2 other branches besides main. Especially for bigger projects.

That'd cover 95% of the repos I've seen.

That will blow up the size of the index a lot. Need to be clever about this.

Great. I'm using grep.app[1] usually as for me the GitHub search is mostly useless. Your mileage may vary though. That being said there are many other great search interfaces that I am using often when I'm trying to find solutions to common problems or specific design patterns. Chromium search[2] comes to mind, Mozilla's Firefox[3], Android[4] or of course Google[5]

[1] https://grep.app/

[2] https://cs.chromium.org/

[3] https://dxr.mozilla.org/mozilla-central/source/

[4] https://cs.android.com/

[5] https://cs.opensource.google/


[6] https://about.sourcegraph.com

See also https://searchfox.org for Firefox. Not as featureful as DXR, but quite a bit faster.

Now, can we please get GitHub issues back into third party search engines? Now, whenever I search for something I know is in an issue I only ever get results from those crappy GitHub scraper sites. This is happening on both Google and DuckDuckGo.

I don't think Github has any control over this without changing their content license.

Only thing missing is indexing of branches and forks.

My main use case for GitHub search is identifying provenance of misc. changes in vendor source code tarballs for e.g. Android kernel releases. It's hard, but sometimes possible to rehydrate most of the existing commits through cherry-picks and careful rebases.

The biggest problem with the lack of indexing branches and forks is that sometimes vendors make releases through branches, or that sometimes repos of interest are forks of e.g. `torvalds/linux`.

Hopefully we can see those being indexed in the future.

I'm also curious: has the plan to drop "less active" repos from the index gone through? Has anything changed?

> I'm also curious: has the plan to drop "less active" repos from the index gone through? Has anything changed?

Whaaat? I hope it doesn't go through. I use GitHub code search for clues when reverse engineering cheap Chinese IoT crap. Usually I can find some headers / SDKs accidentally uploaded and set to public by a random Chinese guy. Those repos usually have one commit and zero traffic, but they contain invaluable information about proprietary MCUs.

I would personally like to see less indexing of duplicate files! There are many things I’ve searched for which return 100s of results from independent checkin-uploads of big libraries like the Android SDK. It would be great if results were filtered by file similarity regardless of git history (if that is in fact the issue).

Nice idea, ihnorton. Thanks for the feedback.

Got into the preview, can finally search for actual code! One thing I'd like to see, though, is the ability to mark directories to be ignored in the search results. No one needs to search the raw HTML of my generated documentation, yet it shows up in every search for project symbols. And since HTML is considered "source", I can't filter it out unless I select a particular language.

Also the search text field is a bit messed up in Safari when the text gets longer than the field.

GitHub Code Search developer here - try creating a custom scope to filter out that stuff! Click on the scopes dropdown and scroll to the bottom. You can filter out HTML by using a query like:

NOT language:html

It would be great if this used the same filter format as sourcegraph and other internal code search tools. ex. -file:.html is enough to filter away files ending in html in the main search box.

Having to use dropdowns and multiple input fields is more cumbersome than the filter language of repo:, file:, lang: etc.

Here are the details of that syntax: https://cs.github.com/about/syntax

Ah, I was trying language:!html.

Would still be great to ignore my docs directory.

You could do this with `-path:docs/` (or `NOT path:docs/`).

Are there any open source powerful code search engines out there? As a Googler the internal code search we have here is one of the most incredible things I've ever seen, it's so fast and powerful I'm amazed by it daily. Is there anything near that quality out there?

We built Sourcegraph taking inspiration from Google Code Search (https://about.sourcegraph.com/blog/ex-googler-guide-dev-tool...) to bring the power of code search—and precise code intelligence that just works—to every dev. Try it out here: https://sourcegraph.com. A super common thing we see is people leaving Google, missing code search, and then bringing Sourcegraph into their new org. We'd love to hear your feedback!

The best thing about the Sourcegraph instance hosted on sourcegraph.com is that you can edit the URL in your browser from https://github.com/foo/bar to https://sourcegraph.com/github.com/foo/bar to be dropped down into a Sourcegraph search for that GH repo. I've been using it for a long time because of this convenience.
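The URL edit described above is mechanical enough to script, for instance as a bookmarklet-style helper. A minimal sketch in Python; the function name `to_sourcegraph` is made up for this example, and the only thing it relies on is the URL shape given in the comment.

```python
from urllib.parse import urlsplit

def to_sourcegraph(github_url):
    """Turn https://github.com/OWNER/REPO into the matching sourcegraph.com URL."""
    parts = urlsplit(github_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL, got: " + github_url)
    # Sourcegraph's public instance prefixes the repo path with the code host.
    return "https://sourcegraph.com/github.com" + parts.path

print(to_sourcegraph("https://github.com/foo/bar"))
# https://sourcegraph.com/github.com/foo/bar
```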

(Though it would be even better if the two options for case-sensitivity and regex search were enabled by default instead of needing me to toggle them on every time.)

You should be able to do that over in your User Settings (Click your picture in the top right and then Settings.) Adding these two things should change that default for you:

   "search.defaultCaseSensitive": true,
   "search.defaultPatternType": "regexp",

Also see: https://docs.sourcegraph.com/admin/config/settings#search-de...

I don't have a user account (nor do I want to make one).

Sourcegraph is open-core, with a dual licensing approach. You can run the open-source version here: https://github.com/sourcegraph/sourcegraph#sourcegraph-oss, and we have an enterprise offering for companies that want to adopt for their teams. Similar to GitLab, both our enterprise and OSS code is publicly available.

Are you worried this new Github Code Search might steal all your users?

I helped write DXR for indexing Mozilla's source code based on an instrumented compiler run; this has eventually been developed into mozsearch (https://github.com/mozsearch/mozsearch), whose indexing for mozilla-central is visible here: https://searchfox.org.

I thought it was abandoned! This is great to hear it just moved. Is there anyone at Mozilla that can update the old DXR repo [0] to direct people to MozSearch?

[0]: https://github.com/mozilla/dxr

I work on a very large c++ monolith at work and DXR has been a real game changer for helping me just figure out how so much of the codebase works. Thanks!!

Hoogle is pretty neat — you can search by type signature and it’ll find matching APIs from hackage packages: https://hoogle.haskell.org/

Source: https://github.com/ndmitchell/hoogle

My job uses https://oracle.github.io/opengrok/ and I'm generally happy with it. It has some problems with special character searches at times but generally does what I want. It's certainly better than code search in our on-prem github instance.

Yeah, opengrok is great. It is very fast and usually returns good results.

https://grep.app/ is a great alternative to github's current search engine.

For the grepping aspect, https://github.com/google/zoekt is a powerful one-stop-shop. For the navigating, I don't know. SourceGraph maybe, but the linking is somewhat heuristic I assume, not compilation-graph powered. But maybe that changes or depends per language.

We're using https://searchcodeserver.com/ internally and it works quite well; it was miles ahead of GH search. After today, I'll have to test GH again to reassess this statement.

If you don’t mind me asking, any insight into why it hasn’t been open sourced?

There is some older version that's open source, I haven't tried it and I don't know how much of today's code search is based on it.


Not a Googler, so I can't say. There was Mozilla DXR but it has been abandoned.

DXR has largely been replaced with mozsearch (https://github.com/mozsearch/mozsearch), and a quick glance through the really early history does show that it adopted a fair amount of stuff from DXR. The downside is that it's not as easy to set up a local mozsearch instance as old-school DXR was.

Would you by any chance be allowed to record a demo screencast?

You can try it yourself, e.g., the instance the Android team uses: https://cs.android.com/

Oh, I didn't know this existed. The syntax seems to be on par with the internal one, I couldn't find any info on what's driving it.

Also don't know how search works there, but the cross-reference functionality is powered by an open-source Kythe project: https://kythe.io/

It really looks like they took a lot of inspiration from https://sourcegraph.com/search with this. Not a bad thing at all. I hope SourceGraph doesn't get obsoleted by this though, they're great people.

Sourcegraph CEO here. Imitation is the sincerest form of flattery. We are very transparent, have a ton of users, and are open-core, so it's easy to get inspiration from us. :) We want way more devs to be using code search since it's so valuable 10x+/day, and if this helps, then we are very happy for that. Devs get to choose the code search tool they use, so the best tool will win (you wouldn't use Bing if your boss made you...likewise, code search isn't like team chat or team docs).

I remember seeing this years ago and thought it was a bit subpar but it appears they've made strides since then. I might start using this again.

had a pretty awful interview experience there a while back. Can't say I experienced great people

I've met two of their devs randomly in different Discord servers. Both were great people (Noah, Olaf) and are very active in OSS communities. Perhaps not coincidentally, both worked on Language Server related stuff.

Ólafur is responsible for a lot of Scala tooling and some pretty neat original ideas.

Sourcegraph also came up with LSIF, which is a useful format for building tooling for language servers.


If you want to build this sort of stuff, the work Sourcegraph has done with LSIF + SemanticDB is probably your easiest bet.

N=2 isn't great, but those are my experiences, if we're tossing them out there.

Sourcegraph CEO here. I'm really sorry about that. We work really hard on making our interviews good for everyone, including documenting it publicly at https://handbook.sourcegraph.com/talent/interview_process. Could you please email me at sqs@sourcegraph.com so I could find out what happened?

I interviewed for Sourcegraph and it was one of the best. Super transparent process, open source handbook, fun coding tasks -- really nothing to complain about. Would be curious to know what made you have such a different experience.

I'm surprised... I absolutely loved my interview with Sourcegraph. I kind of wish every tech company interviewed like they do.

Seems to be that Rust's killer app is burntsushi's mind and ripgrep. :-)

I just want to say about time. A lot of the time when using libraries with inadequate documentation, being able to find usages of a method or class gives really good insight into the library. But the current code search's stemming removes all the context needed to find that and then gives alternate spellings too.

I always thought GitHub's bad search functionality was a business decision. It was so bad for so long. Even if basic improvements are significantly harder at their scale, I just can't comprehend how Microsoft let something so potentially useful be so bad for so long.

Yeah, I'd bet on https://about.sourcegraph.com. Fully focused on code search and are still light years ahead.

Meanwhile on GitLab, you can't even search in issue comments (only the title/description from the author).

GitLab team member here.

Comment (and code) search is available for projects in all GitLab tiers: https://docs.gitlab.com/ee/user/search/#basic-search

Premium and Ultimate users have access to Advanced Search: https://docs.gitlab.com/ee/user/search/advanced_search.html

There is a way to search for comments using the "global search", but no way to search for text over issues and their comments. In particular, no way to search from the issue tab, no way to search over comments only in issues (or only in merge requests), no way to combine a text search with label/milestone/status filters, etc.

So it's a workaround, but a bad one.

Here's the ticket (2015): https://gitlab.com/gitlab-org/gitlab/-/issues/13891. The fact that it has so many duplicates in your own project's issue tracker is a good indicator of how bad your issue search is.

GitLab team member here.

> no way to combine a text search with label/milestone/status filters, etc.

You can combine text search with field search (like label/milestone/status). Here’s an example: https://gitlab.com/gitlab-org/gitlab/-/issues?search=Visuali...

This doesn't search over comments.

Not sure if it's good enough to replace https://grep.app/

I think that app triggered the inspiration to do this. So I would think what they deliver will be similar or have some feature parity.

(I worked on this.)

Give it a shot and let us know what you think! Where can we improve it?

Today I wanted to search for "strstr[a-z]+?_r" but got the error message "This is a partial result set. The search was stopped early because it would take too long to check every file for this regular expression.". However, I got results for the less restrictive regex "strstr.+?_r" which is weird since I'd expect that it would be easier to return results for more restrictive regular expressions. Not sure if there is a perfect solution for this, but in many cases, you could probably search for the less restrictive version and filter the results with the more restrictive one after that.
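The workaround suggested above (run the looser pattern, then filter with the stricter one) is easy to do client-side. A minimal sketch, assuming you already have the result lines in hand; `two_stage_search` and the sample lines are invented for illustration, and it uses Python's `re`, not grep.app's engine. It only works because every line matching the strict pattern also matches the broad one.

```python
import re

def two_stage_search(lines, broad, strict):
    """Keep lines matching the broad regex, then re-filter with the strict one."""
    broad_re = re.compile(broad)
    strict_re = re.compile(strict)
    candidates = [line for line in lines if broad_re.search(line)]
    return [line for line in candidates if strict_re.search(line)]

lines = [
    "strstrcase_r(a, b)",
    "strstr(a, b); foo_r()",
    "strlen(a)",
]
print(two_stage_search(lines, r"strstr.+?_r", r"strstr[a-z]+?_r"))
# ['strstrcase_r(a, b)']
```

The second line shows why the broad pattern alone is noisy: `.+?` happily spans `(a, b); foo` to reach an unrelated `_r`.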

Also it would be great if more repositories were indexed. How do things work behind the scenes? Maybe it is possible to build a more memory-efficient index just for exact string search, which probably makes up most searches.

Anyway, this website is amazing and I use it quite often. Thank you a lot for working on this!

You can search for that regular expression on Sourcegraph (disclaimer: I work there) https://sourcegraph.com/search?q=context:global+strstr%5Ba-z...

Nice! Which repositories are indexed?

Some feedback:

* Copy & Paste does not work. When trying to select text from code snippets, the code is dragged like an image instead of being selected.

* The GitHub icon during search is not clickable. Clicking to the right of the GitHub icon does not go to GitHub but instead shows an embedded view of that repository. There, the GitHub icon somewhere on the right is a bit hard to find (maybe write "View on GitHub" next to it so it is discoverable with Ctrl + F?), but at least it is clickable.

Thanks for this feedback, johndough. The copy & paste problem is likely a bug that only happens with Firefox and we'll fix it. More details here: https://github.com/github/feedback/discussions/8567

Thanks for the feedback, we're working on some changes to improve regular expression performance.

We're also working hard to increase the number of repositories indexed. :)

What kind of indexes do you use to provide regex searches?

Wow, HUGE feature, congrats to the team working on it! GH code search is a feature with such massive potential utility, but the old implementation was so weak it was basically useless. Looking forward to this, will use it constantly if it’s good.

A way to exclude seeing the exact same line of code in each of the 400 forks of some project on the next 40 pages would be just great.

The results are just so full of repetition

Yes please! I like to search for examples of how to use libraries and often times the results are all the same exact call in forks or copies of the same code in multiple places. Perhaps deduplication could be optional when searching?

We hear you, jhokanson. Thanks for this feedback.

Of all the tools I use on a daily basis Github is probably the worst. I mean the "Find a repository..." input field on the start page can not even filter out named repositories I have access to in all my organizations. It works for some repos but not all.

Search improvements? It is impossible to create a worse search experience than Github. Just clone and use git grep instead in most cases.

Edit: ...and the 425% price increase for SSO..

Could be worse, could be Reddit search.

(Granted, this is largely due to a culture of titles like "Check out this thing" that provide zero searchable metadata + no tag system.)

No need to clone if you just use https://sourcegraph.com/search.

I hope they add deduplication. I can’t count the number of times when I get 100 pages of results where 95 pages is from the same included library.

(I worked on this.)

This is on our radar! We de-duplicate exact matches now, but we'd like to do the same for near-similar documents.
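For readers wondering what "de-duplicate exact matches" could look like, one common approach is to hash file contents and keep the first hit per hash. This is only a sketch of that general idea, not a description of GitHub's implementation; `dedupe_exact` and the `(repo, path, content)` tuple shape are assumptions made for the example.

```python
import hashlib

def dedupe_exact(results):
    """Keep the first (repo, path) for each distinct file content.

    results is an iterable of (repo, path, content) tuples, assumed to be
    already sorted by rank so the best copy of each file comes first.
    """
    seen = set()
    unique = []
    for repo, path, content in results:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((repo, path))
    return unique

results = [
    ("upstream/lib", "src/util.c", "int add(int a, int b) { return a + b; }"),
    ("fork1/lib", "src/util.c", "int add(int a, int b) { return a + b; }"),
    ("other/app", "vendor/util.c", "int add(int a, int b) { return a+b; }"),
]
print(dedupe_exact(results))
```

The third entry differs only in whitespace, so an exact hash keeps it. That is the "near-similar documents" case, which needs something fuzzier than a plain content hash.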

De-duping exact matches is a game changer -- search has been miserable to use because of the dupes for so long. I can live with near-similar documents. Very excited to test this out.

Another GitHub Code Search developer here - to add more to this, we rank all the search results, and try to bring the most relevant results to the top. Ideally, if you have 10 pages of results, you shouldn't have to leave page 1 to find what you're looking for :D

That would be a tough problem. When de-duping, you probably want to show/point towards the 'original' tree. But which one is the source? Or even worse: someone abandons a project, but someone else forked it and kept going. Should it show that one instead? Or should it show the one it was forked from, depending on the version number? Which one is the 'true' repo now? Most certainly an interesting problem.

I kind of don't care about correctness. Just hide results that seem to be duplicate.

I get that. Just remove the 'extra'. That is a good first pass. I was thinking that in the longer term you'd want to show the 'original' higher in the list, wouldn't you? What sort of criteria would you use to show one copy vs another? In many cases it probably would not matter much. But if you wanted to figure out the lineage of imports, it could. Some projects have thousands of forks, yet only maybe a dozen of those actually have anything going on. Wouldn't those be more useful to show?

I use github search a lot and this would be an insane productivity boost. I signed up for the waitlist. Does anyone working at Github want to bump me in the queue? This is my profile https://github.com/adamnemecek/

I just got access to it. I'm not sure if someone here helped but if yes, then thank you very much.

You can skip the wait and use https://sourcegraph.com/search instead.

this looks awesome! two things I've always wanted and haven't found satisfying solutions for in code search (in an editor)

1) an ability to easily express higher level concepts in a search that's aware of code semantics ("match only function names", "find call sites of a method") etc. Maybe this is possible with existing tools (probably is?) but I tend to get lazy about learning DSLs - would love to see this in a UI if it's possible

2) ability to save searches I do frequently - after a certain level of complexity in a query (I've added ignore rules, I crafted the right regex, etc), I want to be able to save the "context" of a search so that I can easily return to it later

GitHub Code Search developer here:

> would love to see this in a UI if it's possible

We do have code navigation via the UI, so in a way it's possible!

> ability to save searches I do frequently

Absolutely! This is possible using "custom scopes". If you're in the technology preview, click on the scope dropdown, scroll to the bottom, and choose "custom scopes". You can make a custom scope to search a set of repositories, a particular language, within a directory, or any combination with boolean operators!

I wonder if “custom scopes” is going to be a recognizable term for people?

It's local-only search, but you reminded me that this is possible with MacOS Spotlight. I wrote an indexer (for Common Lisp) that let you search for function definitions, etc.


For example, if you're looking for a search-and-replace function you know you wrote or had somewhere on your machine, you could do

    mdfind "org_lisp_defuns == '*search*replace*'"
(Or just use the regular Spotlight UI.)

I've been in the preview for a bit.

1) This doesn't seem to exist in quite that way, but you can prefix a literal with "def:" and the engine will return only definitions of that thing (so far as it can tell). It's not quite what you (or I!) want, but close.

2) This exists and is called "scopes". On the landing page, to the left of the search bar, click the grey pill that says "All repos". At the bottom there is a "custom scopes" option.

Might also be worth checking out the syntax guide: https://cs.github.com/about/syntax#symbol

One feature I would absolutely kill for is a setting that lets you hide issues and PRs from bot users in the global search.

I've been lucky enough to have a few projects that others have found useful, and so they've ended up in Conda forge, Gentoo, and other package repos. After I make a release, the Github-wide search is just absolutely flooded with dependabot PRs, Snyk PRs, and dozens of other bots. Literally thousands (and sometimes tens of thousands for CVEs) of automated PRs and issues that make it impossible for me to see how others are using or discussing my packages (usually to see if I've just horribly broken something).

Hah, reminds me of this issue where 90% of the content is spam: https://github.com/chronotope/chrono/issues/499

It looks like this doesn't really understand the code... If I have a bunch of functions in different files all called "print", it won't be able to determine exactly which one is linked in and called at runtime.

Google's code search tool[1] actually compiles the code and uses the compiler's parse tree to build the search index. The only time it doesn't work is when you are looking for code that has been disabled with "#if 0", in which case it falls back to regular string matching, and the difference is night and day.

Please, GitHub... please try to build a search index by compiling everyone's files. Plenty of projects have CI buildbots that have all the info needed to compile millions of projects automatically and generate the necessary parse trees at the same time. Even for interpreted languages like JavaScript/npm or Python/pip, you can use heuristics to build a cross-module function call graph that is accurate most of the time.

[1]: Example URL from it: https://source.chromium.org/chromium/chromium/src/+/main:thi...

We’ve been working on a framework called “stack graphs” that lets us extract exactly this kind of information without having to build anything. More details in my Strange Loop talk from October: https://dcreager.net/talks/2021-strange-loop/

Stay tuned for more!

This is great! As a project manager, I use GitHub search every day to look for specific methods or parts of the code when tracking down logic issues or bugs.

Got an opportunity to try it a few minutes ago and it's awesome so far. I was able to look for my code in repos I don't own, e.g. `not org:user foo::bar`

Ah, great. GitHub throwing out special characters in searches was infuriating for languages with sigils and patterns, like $somevar or %sql% and so on.

Curious if this is something completely bespoke or simply a beefy ElasticSearch cluster which uses the (relatively) new "wildcard" field for enabling regex search on select fields. The search syntax certainly maps 1:1 to the ElasticSearch Query String syntax, including phrase search, boolean operations, grouping, regex search, etc.

(I worked on this and the prior version of code search that uses Elasticsearch.)

It is a custom search engine, built from the ground up for code. We'll be sharing more details about it on the GitHub blog soon.

I use github search a lot and this would be an insane productivity boost. I signed up for the waitlist. can you please nudge me in the queue? This is my profile https://github.com/abdallahmansour6

I wonder if it is the followup to this conversation from last year when https://grep.app was released: https://news.ycombinator.com/item?id=22397728

Probably not, they've been looking at/working on improved Code Search since 2019: https://youtu.be/9EoNqyxtSRM?t=1726

Just in case people from GitHub are listening to feedback here: please stop blocking search for logged-out users. I mean, you're not exactly as terrible as your main competitor gitlab.com, which entirely blocks Tor users from cloning repos (unless you add ".git" at the end of the URI), but having to log in every time I'm looking for a string, e.g. across an org, is the worst. I understand I have to log in to publish code and engage in discussions, but requiring login for read-only content is bad UX.

PS: How do you feel about being bought by Microsoft? Maybe some of you feel it's a good time to implement s2s inter-forge federation to plant a nail in the coffin? Sounds like Gitea is on a good way to support it based on ActivityPub/forgefed and it would be sad if Github was relegated to a for-profit walled garden.

Reminds me of https://grep.app, Search across a half million git repos [1]

1. https://news.ycombinator.com/item?id=22396824

I still don't understand how it can be so fast, haha

Some logic to exclude duplicate results would be useful. I often search to see how many external users there are of some API in postgres. But there's hundreds of separate repos with similar contents showing up in the search results...
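For what it's worth, here is a minimal client-side sketch of the kind of duplicate filtering being asked for. It assumes results arrive as (repo, path, content) tuples, which is a made-up shape and not anything GitHub's API is known to return, and it only collapses byte-for-byte identical files. Near-duplicates (forks with small edits) would need something fuzzier.

```python
import hashlib


def dedupe_results(results):
    """Keep the first hit for each distinct file content.

    `results` is an iterable of (repo, path, content) tuples. That shape is
    an assumption made for this sketch, not a real API response format.
    """
    seen = set()
    unique = []
    for repo, path, content in results:
        # Identical file contents hash to the same digest.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((repo, path))
    return unique
```

Keeping the first hit is arbitrary; ranking the copies by stars or by "is not a fork" before deduplicating would probably give nicer results.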

We hear you, anarazel. Thanks for the feedback.

The addition of exact match search is so exciting that I haven’t internalized any of the other new features. I’ve abandoned an ungodly number of semi-common-word searches after getting 30 pages of results in a monorepo

I didn't even see this in the feature list before doing the signup. One of the signup questions is "how do you usually search?" or so, I wrote in the blank "I want to search for symbols, not substrings, so if I'm searching for `bar` I don't want `foo_bar` to show up as a match". I usually do this with word boundaries in regexes, but I pretty much have to have the repo downloaded, so it's useless for searching on github.com this way.
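To illustrate the word-boundary trick on a local checkout, here is a small Python sketch. The helper name `find_symbol` is made up for this example. It leans on the fact that `_` counts as a word character in Python's `re`, so `\b` does not match between `foo_` and `bar`.

```python
import re


def find_symbol(symbol, text):
    """Return whole-word occurrences of `symbol`, skipping longer identifiers."""
    # \b matches only between a word character and a non-word character.
    # "_" is a word character, so there is no boundary inside "foo_bar".
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    return pattern.findall(text)


source = "bar(); foo_bar(); x = bar + barbaz; obj.bar"
matches = find_symbol("bar", source)
```

This only behaves well for plain identifiers. A symbol that starts with a sigil such as `$` would need a different boundary rule, since `\b` before a non-word character requires a word character on the other side.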

This is great, specified search on GitHub has previously been very hit or miss. Generally I use the search feature for learning / trying to see if something I'm trying to do already exists. I personally think VS Code has the best code search implementation, in terms of "exact", "partial" and "regex" matching. The UI is clear, non-technical team members can navigate their way around it and it's relatively fast assuming you don't have too many extraneous plugins installed.

Any plans to integrate Copilot goodness into this? All I want is to search for "the function that concatenates multiple items with an Oxford comma" and get to that bad boy!

I'd love some shorter keywords here for searching so this was quickly composable into something useful.


p: or f: instead of path: for filenames

l: instead of language:

-f: to exclude specific filenames (makes it easy to filter out tests)

You get the idea.
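Until something like that exists, a tiny wrapper could expand the short forms before the query is submitted. This is only a sketch: the short prefixes are the ones proposed above and are not real GitHub syntax, and mapping `-f:` to `NOT path:` is an assumption based on the `NOT path:` example a GitHub engineer gave earlier in the thread.

```python
# Hypothetical short forms from the suggestion above; none of these are
# confirmed GitHub search syntax.
ALIASES = {"p:": "path:", "f:": "path:", "l:": "language:"}


def expand_aliases(query):
    """Rewrite short qualifiers into long ones; a leading '-' becomes NOT."""
    out = []
    for term in query.split():
        negated = term.startswith("-")
        body = term[1:] if negated else term
        for short, full in ALIASES.items():
            if body.startswith(short):
                body = full + body[len(short):]
                break
        out.append(f"NOT {body}" if negated else body)
    return " ".join(out)
```

Splitting on whitespace means quoted phrases and regexes containing spaces would be mangled, so a real version would need a proper tokenizer.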

Great idea, ryanseys. Thanks for the suggestion. - GitHub PM

Glad to see GitHub’s search has improved. I hope GitHub finally improves the search functionality on gists. You can’t search your own gists by name.

I wish they would fix the advanced search feature. Searches that have multiple filters don't show any of the matches (this has been broken for over a year; they acknowledged it as a known issue). For example, the search "camera -filename:camera.css -filename:depend.make" will say there are 100 million matches but won't show any of them. It's a super useful feature when it's working.

Thanks for the feedback, jjrob13. We'll definitely want to make sure this works well in the new search service. - GitHub PM

They haven't implemented wildcard search for... well, ever:


I don't even care it's very fast. Just make it work. I just hope this isn't snake oil. Weird that they claim regex support but no wildcard support.

Has the "Last indexed" been fixed?

Whenever I search for code, it will say something like "Last indexed on Apr 2", but if you go to the actual file, the date will say 5 years ago or something. So currently the listed "Last indexed" date is completely useless, and you basically have to click through to every result.

(I worked on that system and the new one.)

Yes, sadly, that is literally when the file was _indexed_. So it's not particularly useful. It's a difficult problem to solve, but I'll bring up your feedback to the team.

I was actually really surprised that this did not exist when I went to search GitHub for the first time. You would think that an open source giant would have this ability, but I guess search in general takes a ton of computational load. I’ll probably get downvoted for bringing up a wacky idea, but imagine having some type of referencing system that runs over multi-node P2P, so that searches use shared resources. I guess the major problem would be whether devs would actually spare some of their personal computational resources to help the community find things and not rely on special interest groups. I get it, I am old school as well. I started out on Pascal and BASIC. But I still think using creative solutions is fun. You know, Napster was cool back in the day, prior to their lawsuits, and P2P was starting to pick up speed.

There was a recent post on search engines where I believe a P2P solution was mentioned (but maybe it was on some related post within a few days of this one): https://news.ycombinator.com/item?id=29417061

Code search on GitHub is only available to people that log in with Microsoft. Clicking on 'Code' redirects to the login page.

It is not a friendly site. Open source projects would do better to use an open source code forge like <https://sr.ht/>.

GitHub only supports login with GitHub credentials unless another identity provider's SSO is configured.

I’ve been using grep.app for exact match search for a while. Exciting to see GitHub now has this too.

I always thought the search was purposely bad and overly limited to prevent scraping for credentials.

Now can they fix doing a language search for “Visual Basic”? If you filter a user's repos or stars on that language, it just shows all their repos or stars. Code search for language “Visual Basic” returns all repositories and does not limit by language like it should.

Will it be possible to see all the search results within each file? The lack of that feature is why I almost never use GitHub's code search for my repositories. Instead I'll download the repository locally and search there.

Related, I believe or at least worth sharing - https://news.ycombinator.com/item?id=22396824

Thank f for this. Github search has been complete garbage for years tbh. Being able to search, for instance, for 'def method' and not finding EVERY other def first is gonna be kinda nice

I have 10 different git + github instances across my org. (~50k strong workforce, pre github repos, m&a etc). Does this cs offer aggregated searches across all those distributed repos?

Hi zxienin. I'm a GitHub product manager. May I assume the GitHub instances you're describing are GitHub Enterprise Server instances? We plan to bring advanced code search features to all GitHub plans including Enterprise Server once we've stabilized the UX and feature set. But it sounds like your situation goes beyond that, where the search needs to include code from Git repositories outside of GitHub Enterprise Server. That makes good sense, and we'll definitely consider it. If you want to keep in touch about it, please feel free to post in our feedback forum: https://github.com/github/feedback/discussions/categories/co.... Thank you!

I shall, thnx.

ps: yes, enterprise server instances

Thank you!

I find when I do code-search, similar code that is copy and pasted into multiple repositories (vendor code) is always a problem too.

Skipping well known vendoring directories would be cool too.
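As a rough sketch of what skipping vendored code could look like when post-filtering result paths yourself: the directory list below is an assumption (projects vendor code under many other names), and nothing here reflects how GitHub's ranking actually treats these paths.

```python
from pathlib import PurePosixPath

# Assumed list of common vendoring directories; extend it to taste.
VENDOR_DIRS = {"vendor", "node_modules", "third_party"}


def is_vendored(path):
    """True if any directory component of `path` is a known vendoring dir."""
    # parts[:-1] drops the file name so a file called "vendor" is not flagged.
    return any(part in VENDOR_DIRS for part in PurePosixPath(path).parts[:-1])


def drop_vendored(paths):
    """Return only the paths that do not live under a vendoring directory."""
    return [p for p in paths if not is_vendored(p)]
```

Matching on whole path components rather than substrings avoids throwing out files like `src/vendor_list.go` by accident.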

Security researchers are gonna love this :)

Time to go secrets and url hunting.

You could already do this, grep.app for example has existed for a while. This is just bringing those features in-house.

grep.app doesn't index all repositories on GitHub though. I was doing some research a few months ago and couldn't find anything that would search all of github quickly.

It would be great if the search could exclude forks. I often search for something and the results are the same code in different repos (forks).

Thanks for the feedback, fhaldridge7. - GitHub PM

How about searching for keywords in file names? Believe it or not, GitHub search doesn't do this unless you prefix with filename:

Check out https://cs.github.com/about/syntax -- indeed, by default terms are searched in both content and paths. You can restrict to one or the other with `content:` or `path:`.

Then their documentation is wrong. I learned the hard way that GitHub code search didn't search file names in my case. I searched for a short bare string with some alphabet letters and one underscore, and it failed to find the file with that exact string in the file name, costing me a lot of time missing what I was looking for.

Unfortunately I can't reproduce the problem publicly because it happened while searching a private repo.

They should just buy Sourcegraph. I always have the Sourcegraph extension installed; it's so much better than GitHub search.

Does anyone know (or guess) what kind of index they use to provide regex searches? I'm really curious.
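As a guess only: one well-known approach for this kind of search is a trigram index, as described in Russ Cox's write-up on Google Code Search. Nothing in this thread says GitHub does it this way. Here is a toy sketch of the idea for literal queries: index every 3-character substring, intersect the posting lists for the query's trigrams, then confirm each candidate with a real substring check.

```python
from collections import defaultdict


def trigrams(text):
    """All distinct 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(name)
    return index


def search_literal(index, docs, literal):
    """Find documents containing `literal`: narrow by trigrams, then verify."""
    grams = trigrams(literal)
    if grams:
        # Every trigram of the query must appear in a matching document.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Queries shorter than 3 characters cannot use the index at all.
        candidates = set(docs)
    # The index can yield false positives, so confirm with a real search.
    return sorted(name for name in candidates if literal in docs[name])
```

A regex-capable version would first derive which trigrams any match must contain from the pattern and only then run the regex on the surviving candidates. That step is left out here.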

We'll be sharing more details soon on the GitHub blog.

Is there any good reason why the search doesn't find file names? Or does it now with the new search?

The new search does find filenames! :D

This is awesome, but I don't know why we needed faster search; it seems like the time would've been better spent on more search features. I guess this is classic pointless programmer optimization.

> Here are some things to look out for:

No mention of case-sensitive search...

Any idea how to get further ahead on the waitlist for Copilot?

hopefully i'll be able to search for usages of a library's function without getting 30 pages of that library's source code cloned in vendor directories

Did anyone run the log4j RCE search?

Does it remove dupes from results?

Thank goodness

Slight tangent: The video has a guy describing the tool and he includes the fact that it’s written in Rust when introducing it. I’ve always found this sort of name dropping in Rust projects/devs baffling. Is there anything that I’m expected to infer from it? Is it that its backend is memory safe? I can’t think of anything else. Now it may very well be very memory safe, but why include that highly specific detail when talking about something as high level as the UX of search? What if it was written in Haskell or C#? Would it still be brought up? It’s almost as if being written in Rust is a feature in itself these days. As a technical guy I can’t help but take the person less seriously, especially when it’s as unwarranted as this.

Hey! That was me in the video.

Not ashamed to be a Rust evangelist! The reason I mentioned Rust is because we spent a lot of time making the experience really fast - which is super important for a product like this. I really think getting the performance we have would have been enormously more difficult in any other language.

Fellow Rustacean here. Is the search engine secret sauce or something that could perhaps be open sourced? I'd like better tooling for searching private code bases. Also, would you consider writing about optimization techniques you used?

We are looking into open sourcing some libraries that we've developed for search. And we're going to write a blog post with way more technical details soon!

I agree with you, but I just wanted to point out the following:

In general, Rust, C, and C++ are going to be faster than languages like Ruby*. He brought up Rust while discussing the performance of the new tool. Although performance is more complex than language choice, etc., saying it's written in Rust gives the viewer an approximate lower bound as to how fast the tool should be.

*: (GH started as a Ruby shop, so I wouldn't be surprised if that's what the original tool was written in).

He’s talking about text search and the post thanks @BurntSushi. That means they’re using the fastest text search tool out there - ripgrep. I won’t mention what it’s written in, because that clearly upsets you.

Benchmark - ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} (2016) - https://blog.burntsushi.net/ripgrep/

It's obviously personal preference but as a technical guy I am always curious what lang. a project is using.

Go had this issue for a while, too, it's finally started to calm down as Go hits a mainstream that is (imo) much farther than Rust is currently. I think much is just people trying to add validity to Rust for large-scale production workloads, in the same way that Kubernetes was "a compute scheduler written in Go" or Terraform was "infrastructure as code written in Go" (maybe those are bad examples, but I know I've seen the "X written in Go" thing going on).

This is exactly how I see it as well. Rust used to be an obscure language with a compiler written in OCaml. If something is written in D or Zig, it’s noteworthy, so you mention it. I think Rust has come into the mainstream enough that we can drop the “written in Rust” line imo.

I think depending on where the audience is coming from—for example people who primarily work in scripting/interpreted languages—Rust can also be a positive signal for performance.

It really is like the joke about vegans. So tiring.
