I'm at Sourcegraph (mentioned in the blog post). We obviously have to deal with massive scale, but for anyone starting out adding code search to their product, I'd recommend not starting with an index and just doing on-the-fly searching until that does not scale. It actually will scale well for longer than you think if you just need to find the first N matches (because that result buffer can be filled without needing to search everything exhaustively). Happy to chat with anyone who's building this kind of thing, including with folks at Val Town, which is awesome.
And when you're ready to do indexed search, Zoekt (of which Sourcegraph graciously took over maintainership a while ago) is the best way to do it that I've found. After discounting both Livegrep and Hound (they both struggled to perform in various dimensions with the amount of stuff we wanted indexed, Hound more so than Livegrep), we migrated to Zoekt from a (necessarily) very old and creaky deployment of OpenGrok, and it's night and day, both in terms of indexing performance and search performance/ergonomics.
Sourcegraph of course adds many more sophisticated features on top of just the code search that Zoekt provides.
I've been surprised at how far you can get without indexing.
For example, I always assume we'll need to add an index to speed up GritQL (https://github.com/getgrit/gritql), but we've gotten pretty far doing search entirely on the fly.
Yes, exactly. When doing a search, we parse and search every file without any indexing.
Of course, it could still be sped up considerably with an index but brute force is surprisingly effective (we use some of the same techniques/crates as ripgrep).
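To make that concrete, here's a minimal sketch of bounded, on-the-fly search in TypeScript. Everything here (the file walk, the maxResults cutoff) is an illustrative assumption, not how GritQL or Val Town actually implement it; the early return is the "first N matches" trick mentioned above.

import { promises as fs } from "fs";
import * as path from "path";

interface Match {
  file: string;
  line: number;
  text: string;
}

// Walk a directory tree and regex-search every file, stopping as soon as the
// result buffer is full -- the cutoff is what keeps brute force fast enough
// in practice, since most searches never need to scan everything.
async function bruteForceSearch(
  root: string,
  pattern: RegExp,
  maxResults = 50,
): Promise<Match[]> {
  const results: Match[] = [];
  const stack = [root];

  while (stack.length > 0 && results.length < maxResults) {
    const dir = stack.pop()!;
    for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        stack.push(full);
        continue;
      }
      const lines = (await fs.readFile(full, "utf8")).split("\n");
      for (let i = 0; i < lines.length; i++) {
        if (pattern.test(lines[i])) {
          results.push({ file: full, line: i + 1, text: lines[i] });
          if (results.length >= maxResults) return results;
        }
      }
    }
  }
  return results;
}

// bruteForceSearch("./vals", /fetch\(/, 20).then(console.log);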
I apply this thinking to lots of problems. Do the dumb thing that involves the least state, and only lean on memory for speed once we've proven we need to. It's much simpler to keep things correct when nothing is cached.
There was someone doing temporal databases who was compressing blocks on disk and doing streaming decompression and search on them. Things in L2 cache go very, very fast.
Hello! I am Head of Engineering at Sourcegraph. I'd love to get feedback on which SCIP indexers you've had issues with, and, if you have the time, feedback on what sort of problems you've had with them. Thank you so much!
Hey guys, it's been over two months since I was in the weeds with SCIP, so I won't be able to write very detailed issues. Most of my experience was with scip-python and some with scip-typescript.
1. roles incorrectly assigned to symbol occurrences
2. symbols missing - this is a big one. I've seen many instances of symbols being included in the "relationships" array that were not included in the "symbols" array for the document, and vice versa. Plus, "definition" occurrences have been inconsistent/confusing: only some symbols have them, they don't always match where the thing is actually defined (file/position), and sometimes a definition occurrence has no counterpart in the symbols array
3. the treatment of external packages has been inconsistent; they sometimes get picked up as internal definitions and sometimes not
I think SCIP is a great idea and I'd explore using it again if it got better. But I see issues sitting in the backlog for 6+ months, which makes it seem from the outside like Sourcegraph is not prioritizing further development of SCIP.
It indeed is hard, and a good code search platform makes life so much easier. If I ever leave Google, the internal code search is for sure going to be the thing I miss the most. It's so well integrated into how everything else works (blaze target finding, guice bindings etc), I can't imagine my life without it.
I'm reminded to appreciate it even more every time I use GitHub's search. Not that GitHub's search is bad; it's just inherently so much harder to build a generalized code search platform.
If you ever leave you can use Livegrep, which was based on code-search work done at Google. I personally don't use it right now but it's great and will probably meet all your needs.
> If you ever leave you can use Livegrep, which was based on code-search work done at Google.
If I've learned anything from the fainting spells that I-work-at-X commenters have over their internal tools on HN: no, the public/OSS variant is always a mere shadow of the real thing.
You would be better off actually trying to understand those sentiments instead of posting sarcastic replies on HN.
The sort of tight-knit integration and developer focus that internal tools at developer-friendly companies like Google have cannot be matched by cobbling together 50 different SaaS products, half of which will probably run out of funding in 3 years.
You literally have entire companies just LARPing internal tools that Google has because they are just that good. Glean is literally Moma. There's really nothing like Critique or Buganizer.
My take is that there's a difference between a company that is willing to invest money into EngProd endeavors, and a company that uses SaaS for everything. While I can understand that most companies don't have the financial means to invest heavily into EngProd, the outcome is that the tightly integrated development experience in the former is far superior. Code Search is definitely #2 on the list of things I miss the most.
I suspect you're being sarcastic - but can confirm that being nearly two years out of Amazon, I still miss its in-house CD system nearly every day. I've actively looked around for OSS replacements and very few come anywhere close.
(I would be _delighted_ for someone to "Umm actually" me by providing a great product!)
> (I would be _delighted_ for someone to "Umm actually" me by providing a great product!)
I think the issue is that nobody would be willing to pay for a good solution, since they usually come with a steep maintenance cost. I wouldn't be surprised if the in-house CD team at Amazon were putting out fires every week or month behind the scenes.
My experience has been that these in-house things do not adapt well to the high chaos of external environments: if there are 3 companies, one will find 9 systems and processes in use, making "one size fits all" a fantasy.
But, I'll bite: what made the CD system so dreamy, and what have you evaluated thus far that falls short?
Amazon's internal tools for building code are _amazing_.
Brazil is their internal dependency management tool. It handles building and versioning software. It introduced the concept of version sets, which essentially allows you to group up related software, e.g. version 1.0 of my app needs version 1.1 of library x and 2.0 of runtime y. This particular set of software versions gets its own version number.
Everything from the CI/CD to the code review tool to your local builds use the same build configuration with Brazil. All software packages in Brazil are built from source on Amazon's gigantic fleet of build servers. Builds are cached, so even though Amazon builds its own version of Make, Java, etc., these are all built and cached by the build servers and downloaded.
A simple Java application at Amazon might have hundreds of dependencies (because you'll need to build Java from scratch), but since this is all cached you don't have to wait very long.
Lastly, you have Pipelines which is their internal CI/CD tool which integrates naturally with Brazil + the build fleet. It can deploy to their internal fleet with Apollo, or to AWS Lambda, S3 buckets, etc.
In all, everything is just very well integrated. I haven't seen anything come close to what you get internally at Amazon.
How did you avoid version hell? At Google, almost everything just shipped from master (except for some things that had more subtle bugs, those did their work on a dev branch and merged into master after testing).
Version sets take care of everything. A version set can be thought of as a Git repo with just one file. The file is just key/value pairs with the dependencies and major/minor version mappings, e.g.
<Name> <Major>-<Minor>
Java 8-123
Lombok 1.12-456
...
A version set revision is essentially a git commit of that version set file. It's what determines exactly what software version you use when building/developing/deploying/etc.
Your pipeline (which is a specific noun at Amazon, not the general term) acts on a single version set. When you clone a repo, you have to choose which version set you want, when you deploy you have to choose a version set, etc.
Unlike most other dependency management systems, there's no notion of a "version of a package" without choosing what version set you're working on, which can choose the minor versions of _all of the packages you're using_.
e.g. imagine you clone a Node project with all of its dependencies. Each dependency will have a package.json file declaring what versions it needs. You have some _additional_ metadata that goes a step further that chooses the exact minor version that a major version is mapped to.
All that to say that the package can declare what major version they depend on, but not what minor version. The version set that you're using determines what minor version is used. The package determines the major version.
Version sets can only have one minor version per major version of a package which prevents consistency issues.
e.g. I can have Java 8-123 and Java 11-123 in my version set, but I cannot have Java 8-123 and Java 8-456 in my version set.
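Purely as an illustration (Brazil's real formats and tooling are internal; this just mirrors the <Name> <Major>-<Minor> file sketched above), the "one minor per major" rule could be modeled like this in TypeScript:

// Parse a version-set file of "<Name> <Major>-<Minor>" lines into a map keyed
// by "name@major", so each major version pins exactly one minor.
function parseVersionSet(text: string): Map<string, string> {
  const pins = new Map<string, string>();
  for (const line of text.trim().split("\n")) {
    const [name, version] = line.trim().split(/\s+/);
    const [major, minor] = version.split("-");
    const key = `${name}@${major}`;
    if (pins.has(key)) {
      // the invariant described above: one minor per major, per package
      throw new Error(`duplicate pin for ${key}`);
    }
    pins.set(key, minor);
  }
  return pins;
}

// A package only declares the major it needs; the version set supplies the minor.
function resolve(pins: Map<string, string>, name: string, major: string): string {
  const minor = pins.get(`${name}@${major}`);
  if (minor === undefined) throw new Error(`${name} ${major} not in version set`);
  return `${major}-${minor}`;
}

// resolve(parseVersionSet("Java 8-123\nLombok 1.12-456"), "Java", "8") === "8-123"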
Your pipeline will automatically build new minor versions into your version set from upstream. If the build fails, then someone needs to do something. Every commit produces a new minor version of a package; that is to say, you can say your package is major version X, but the minor version is left up to Brazil.
This scheme actually works pretty well. There are internal tools (Gordian Knot) which perform analysis on your dependencies to make sure that your dependencies are correct.
It's a lot to know. It took me a year or so to fully understand and appreciate. Most engineers at Amazon treat it like they do Git -- learn the things you need to and ignore the rest. For the most part, this stuff is all hands off, you just need one person on the team keeping everything correct.
so what I'm hearing is that app-1.0 needs app-1.0-runtime-build-20240410, which was itself built from a base of runtime-y-2.0 and layering library-x-1.1 upon it, kind of like
# in some "app-runtimes" project, they assemble your app's runtime
cat > Dockerfile <<FOO
FROM public.ecr.aws/runtimes/runtime-y:2.0
ADD https://cache.example/library-x/1.1/library-x-1.1.jar /opt/lib/
FOO
tar -cf - Dockerfile | podman build -t public.ecr.aws/app-runtimes/app-1.0-runtime-build:20240410 -

# then you consume it in your project
cat > Dockerfile <<FOO
FROM public.ecr.aws/app-runtimes/app-1.0-runtime-build:20240410
ADD ./app-1.0.jar /opt/app/
FOO

cat > .gitlab-ci.yml <<'YML'
# you can also distribute artifacts other than just docker images
# https://docs.gitlab.com/ee/user/packages/package_registry/supported_package_managers.html
cook image:
  stage: package
  script:
    # or this https://docs.gitlab.com/ee/topics/autodevops/customize.html#customize-buildpacks-with-cloud-native-buildpacks
    - podman build -t $CI_REGISTRY_IMAGE .
    # https://docs.gitlab.com/ee/user/packages/#container-registry is built in
    - podman push $CI_REGISTRY_IMAGE

review env:
  stage: staging
  script: auto-devops deploy
  # for free: https://docs.gitlab.com/ee/ci/review_apps/index.html
  environment:
    name: review/${CI_COMMIT_REF_SLUG}
    url: https://${CI_ENVIRONMENT_SLUG}.int.example
    on_stop: teardown-review

teardown-review:
  stage: staging
  script: auto-devops stop
  when: manual
  environment:
    name: review/${CI_COMMIT_REF_SLUG}
    action: stop

# ... etc ...
YML
which is why, as I originally asked GP: what have you already tried and what features were they missing
I presume by "exactly backwards" you mean that one should have absolutely zero knobs to influence anything because the Almighty Jeff Build System does all the things, which GitLab also supports but is less amusing to look at on an Internet forum because it's "you can't modify anything, it just works, trust me"
Or, you know, if you have something constructive to add to this discussion feel free to use more words than "lol, no"
I don't work at Amazon, and haven't for a long time, and this format is insufficient to fully express what they're doing, so I won't try.
You're better off searching for how Brazil and Apollo work.
That being said, the short of it is that: imagine when you push a new revision to source control, you (you) can run jobs testing every potential consumer of that new revision. As in, you push libx-1.1.2 and anyone consuming libx >= 1.1 (or any variety of filters) is identified. If the tests succeed, you can update their dependencies on your package and even deploy them, safely and gradually, to production without involving the downstream teams at all. If they don't, you can choose your own adventure: pin them, fork, fix them, patch the package, revise the versioning, whatever you want.
It's designed to be extremely safe and put power in the hands of those updating dependencies to do so safely within reason.
Imagine you work on a library and you can test your PR against every consumer.
It's not unlike what Google and other monorepos accomplish but it's quite different also. You can have many live versions simultaneously. You don't have to slog it out and patch all the dependents -- maybe you should, but you have plenty of options.
It all feels very simple. I'm glossing over a lot.
Sorry, I wish I could phrase it better for you. All I can say is that I have tried a _lot_ of tools, and nothing has come close. Amazon has done a lot of work to make efficient tools.
Yep! Well, a branch is created under-the-hood, and you can still check it out locally if you want to - but the standard code contribution workflow doesn't require you to think or work in those terms at all.
It's not just that. Livegrep isn't just a pale imitation of something inside Google. It's totally unrelated in implementation, capabilities, and use case.
Agreed. There are some public building blocks available (e.g. Kythe or meta's Glean) but having something generic that produces the kind of experience you can get on cs.chromium.org seems impossible. You need such bespoke build integration across an entire organization to get there.
Basic text search, as opposed to navigation, is all you'll get from anything out of the box.
In a past job I built a code search clone on top of Kythe, Zoekt and LSP (for languages that didn't have Bazel integration). I got help from another colleague to make the UI based on Monaco. We created a demo that many people loved, but we didn't productionize it for a few reasons (it was an unfunded hackathon project and the company was considering another solution when they already had Livegrep).
Producing the Kythe graph from the bazel artifacts was the most expensive part.
Working with Kythe is also not easy as there is no documentation on how to run it at scale.
Very cool. I tried to do things with Kythe at $JOB in the past, but gave up because the build (really, the many many independent builds) precluded any really useful integration.
I see most replies here mentioning that the build integration is what is mainly missing in the public tools. I wonder if Nix and nixpkgs could be used here? Nix is a language-agnostic build system, and with nixpkgs it has build instructions for a massive number of packages. Artifacts for all packages are also available via Hydra.
Nix should also have enough context so that for any project it can get the source code of all dependencies and (optionally) all build-time dependencies.
Build integration is not the main thing that is missing between Livegrep and Code Search. The main thing that is missing is the semantic index. Kythe knows the difference between this::fn(int) and this::fn(double) and that::fn(double) and so on. So you can find all the callers of the nullary constructor of some class, without false positives of the callers of the copy constructor or the move constructor. Livegrep simply doesn't have that ability at all. Livegrep is what it says it is on the box: grep.
The build system coherence provided by a monorepo with a single build system is what makes you understand this::fn(double) as a single thing. Otherwise, you will get N different mostly compatible but subtly different flavors of entities depending on the build flavor, combinations of versioned dependencies, and other things.
Nix builds suck for development because there is no incrementality there. Any source file changes in any way, and your typical nix flake will rebuild the project from scratch. At best, you get to reuse builds of dependencies.
The short answer is context. The reason why Google's internal code search is so good, is it is tied into their build system. This means, when you search, you know exactly what files to consider. Without context, you are making an educated guess, with regards to what files to consider.
Try clicking around https://source.chromium.org/chromium/chromium/src, which is built with Kythe (I believe, or perhaps it's using something internal to Google that Kythe is the open source version of).
By hooking into C++ compilation, Kythe is giving you things like _macro-aware_ navigation. Instead of trying to process raw source text off to the side, it's using the same data the compiler used to compile the code in the first place. So things like cross-references are "perfect", with no false positives in the results: Kythe knows the difference between two symbols in two different source files with the same name, whereas a search engine naively indexing source text, or even something with limited semantic knowledge like tree sitter, cannot perfectly make the distinction.
Yes, the clicking around on semantic links on source.chromium.org is served off of an index built by the Kythe team at Google.
The internal Kythe has some interesting bits (mostly around scaling) that aren't open sourced, but it's probably doable to run something on chromium scale without too much of that.
The grep/search box up top is a different index, maintained by a different team.
If you want to build a product with a build system, you need to tell it what source to include. With this information, you know what files to consider, and if you are dealing with a statically typed language like C or C++, you have build artifacts that can tell you where the implementation was defined. All of this takes the guesswork out of answering questions like "What foo() implementation was used?"
If all you know are repo branches, the best you can do is return matches from different repo branches with the hopes that one of them is right.
Edit: I should also add that with a build system, you know what version of a file to use.
Google builds all the code in its monorepo continuously, and the built artifacts are available to the search. Open source tools are never going to incur the cost of actually building all the code they index.
The short summary is: It's a suite of stuff that someone actually thought about making work together well, instead of a random assortment of pieces that, with tons of work, might be able to be cobbled together into a working system.
All the answers about the technical details or better/worseness mostly miss the point entirely - the public stuff doesn't work as well because it's 1000 providers who produce 1000 pieces that trade integration flexibility for product coherence. On purpose mind you, because it's hard to survive in business (or attract open source users if that's your thing) otherwise.
If you are trying to do something like make "code review" and "code search" work together well, it's a lot easier to build a coherent, easy to use system that feels good to a user if you are trying to make two things total work together, and the product management directly talks to each other.
Most open source doesn't have product management to begin with, and the corporate stuff often does but that's just one provider.
They also have a matrix of, generously, 10-20 tools with meaningful marketshare they might need to try to work with.
So if you are a code search provider are trying to make a code search tool integrate well with any of the top 20 code review tools, well, good luck.
Sometimes people come along and do a good enough job abstracting a problem that you can make this work (LSP is a good example), but it's pretty rare.
Now try it with "discover, search, edit, build, test, release, deploy, debug", etc. Once you are talking about 10x10x10x10x10x10x10x10 combinations of possible tools, with nobody who gets to decide which combinations are the well lit path, ...
Also, when you work somewhere like Google or Amazon, it's not just that someone made those specific things work really well together, but often, they have both data and insight into where you get stuck overall in the dev process and why (so they can fix it).
At a place like Google, I can actually tell you all the paths that people take when trying to achieve a journey. So that means I know all the loops (counts, times, etc) through development tools that start with something like "user opens their editor". Whether that's "open editor, make change, build, test, review, submit" or "open editor, make change, go to lunch", or "open editor, go look at docs, go back to editor, go back to docs, etc".
So I have real answers to something like "how often do people start in their IDE, discover they can't figure out how to do X, leave the IDE to go find the answer, not find it, give up, and go to lunch".
I can tell you what the top X where that happens is, and how much time is or is not wasted through this path, etc.
Just as an example. I can then use all of this to improve the tooling so users can get more done.
You will not find this in most public tooling, and to the degree telemetry exists that you could generate for your own use, nobody thinks about how all that telemetry works together.
Now, mind you, all the above is meant as an explanation - I'm trying to explain why the public attempts don't end up as "good". But myself, good/bad is all about what you value.
Most tradeoffs here were deliberate.
But they are tradeoffs.
Some people value the flexibility more than coherence.
or whatever.
I'm not gonna judge them, but I can explain why you can't have it all :)
Just want to note that Livegrep, its antecedent "codesearch", and other things that are basically grep bear no resemblance to that which a person working at Google calls "Code Search".
Basic code searching skills seem like something new developers are never explicitly taught, but they're an absolutely crucial skill to build early on.
I guess the knowledge progression I would recommend would look something like this:
- Learning about Ctrl+F, which works basically everywhere.
- Transitioning to ripgrep https://github.com/BurntSushi/ripgrep - I wouldn't even call this optional, it's truly an incredible and very discoverable tool. Requires keeping a terminal open, but that's a good thing for a newbie!
- Optional, but highly recommended: Learning one of the powerhouse command line editors. Teenage me recommended Emacs; current me recommends vanilla vim, purely because some flavor of it is installed almost everywhere. This is so that you can grep around and edit in the same window.
- In the same vein, moving back from ripgrep and learning about good old fashioned grep, with a few flags rg uses by default: `grep -r` for recursive search, `grep -ri` for case insensitive recursive search, and `grep -ril` for case insensitive recursive "just show me which files this string is found in" search. Some others too, season to taste.
- Finally hitting the wall with what ripgrep can do for you and switching to an actual indexed, dedicated code search tool.
It does! And I only recently learned this, and it explains why I've always found the VS Code search and replace across all files to be tremendously useful.
GitHub's code search functionality is only available to people who are logged in.
It used to be possible to perform global/multi-org/multi-repo/single-repo code searches without being logged in but over time they removed all code search functionality for people who are not logged in.
It is completely stupid that it's not possible for a non-logged-in person to code search even within a single repo[0].
It is textbook enshittification by a company with a monumental amount of leverage over developers.
(The process will presumably continue until the day when being logged in is required to even view code from the myriad Free and Open Source projects who find themselves trapped there.)
[0] Which is why I, somewhat begrudgingly[1], use Sourcegraph for my non-local code search needs these days.
[1] Primarily because Sourcegraph are susceptible to the same forces that lead to enshittification but given they also have less leverage I've left that as a problem for future me to worry about. (But also the site is quite "heavy" for when one just wants to do a "quick" search in a single repo...)
You make it sound as if being logged in to github is somehow a big hurdle. It's free and it's easy, so why should one care if it's only available to logged in users?
Better Unicode support in the regex engine. More flexible ignore rules (you aren't just limited to what `.gitignore` says, you can also use `.ignore` and `.rgignore`). Automatic support for searching UTF-16 files. No special flags required to search outside of git repositories or even across multiple git repositories in one search. Preprocessors via the `--pre` flag that let you transform data before searching it (e.g., running `pdftotext` on `*.pdf` files). And maybe some other things.
`git grep` on the other hand has `--and/--or/--not` and `--show-function` that ripgrep doesn't have (yet).
Hound has made an interesting choice to not bound searches. https://codesearch.wmcloud.org/search/?q=test&files=&exclude... produces an ajax request that (for me) took 13s to produce a 55MB JSON response, and then takes many more seconds to render into the DOM.
It's why IDE and developer tool builders have long had the insight that in order to do code search properly, you need to open up the compiler platform as a lot of what you need to do boils down to reconstructing the exact same internal representations that a compiler would use. And of course good code search is the basis for refactoring support, auto completion, and other common IDE features.
Easier said than done, of course, as tools are often an afterthought for compiler builders. Even JetBrains made this mistake with Kotlin initially, which is something they are partially rectifying with Kotlin 2.0 now to make it easier to support things like incremental compilation. The Rust community had this insight as well, with a big effort a few years ago to make Rust more IDE friendly.
IBM actually nailed this with Eclipse back in the day, and that hasn't really been matched since. IntelliJ never even got close, being 2-3 orders of magnitude slower. We're talking seconds vs. milliseconds here. Eclipse had a blazing fast incremental compiler for Java that could even partially compile code in the presence of syntax errors. The IDE's representation of that code was hooked into that compiler.
With Eclipse, you could introduce a typo and break part of your code and watch the IDE mark all the files that now had issues across your code base getting red squiggles instantly. Fix the typo and the squiggles went away, also without any delay.
That's only possible if you have a mapping between those files and your syntax tree, which is exactly what Eclipse was doing because it was hooked into the incremental compiler.
IntelliJ was never able to do this; it will actively lie to you about things being fine/not fine until you rebuild your code, and it will show phantom errors a lot when its internal state gets out of sync with what's on disk. It often requires full rebuilds to fix this. If you run something, there's a several-second lag while it compiles things. Reason: the IDE's internal state is calculated separately from the compiler, and this gets out of sync easily. When you run something, it has to compile your code because it hasn't been compiled yet. That's often when you find out the IDE was lying to you about things being ready to run.
With Eclipse all this was instant and unambiguous because it shared its internal state with the compiler. If it compiled, your IDE would be error free; if it didn't, it wouldn't be. And it compiled incrementally and really quickly, so you would know instantly. It had many flaws and annoying bugs, but that's a feature I miss.
While Eclipse truly had an incredible incremental compiler for Java, IntelliJ's better integration with external build systems like Maven and Gradle, together with better cross-language support, is what won me over.
> This is a pretty bad index: it has words that should be stop words, like function, and won’t split a.toString() into two tokens because . is not a default word boundary.
So GitHub used to (maybe still does) "fix" this one, and it's annoying. Although GitHub is ramping up its IDE-like find-usages, it's still not perfect, so sometimes you just want a text-search equivalent for "foo.bar()" to catch the uses it misses, and this stemming behaviour then finds every file where foo and bar are mentioned, which bloats the results.
I don't understand their hand-waving of Zoekt. It was built exactly for this purpose, and is not a "new infrastructure commitment" any more than the other options. The server is a single binary, the indexer is also a single binary, can't get any simpler than that.
To me it doesn't make sense to be more scared of it than Elasticsearch...
The hardest part about getting code search right imo is grabbing the right amount of surrounding context, which septum is aimed at solving on a per-file basis.
Another one I'm surprised hasn't been mentioned is stack-graphs (https://github.com/github/stack-graphs), which tries to incrementally resolve symbolic relationships across the whole codebase. It powers github's cross-file precise indexing and conceptually makes a lot of sense, though I've struggled to get the open source version to work
Oracle has USER/ALL/DBA_SOURCE views, and all of the PL/SQL (SQL/PSM) code that has been loaded into the database is presented there. These are all cleartext visible unless they have been purposefully obfuscated.
It has columns for the owner, object name, LINE[NUMBER] and TEXT[VARCHAR2(4000)] columns and you can use LIKE or regexp_like() on any of the retained source code.
I wonder if EnterpriseDB implements these inside of Postgres, and/or if they are otherwise available as an extension.
Since most of SQL/PSM came from Oracle anyway, these would be an obvious desired feature.
Is it? I find it near-useless most of the time, and cloning + ripgrep to be way more efficient. Perhaps the problem is more in the UX being awful than the actual search.
I suppose using something like tree sitter to get a consistent abstract syntax tree to work with would be a good starting point. And then try building a custom analyzer (if using elasticsearch lingo) with that?
Another option is to start with Kythe, which is Google's own open source framework for getting a uniform cross-language semantic layer: https://kythe.io/
Worth looking at as a source of inspiration and design ideas even if you don't want to use it itself.
Might be overkill unless you're looking to do semantic search. I've thought about what a search DSL for code would look like; it's challenging to embody a query like "method which takes an Int64 and has a variable idx in it" into something compact and memorable.
But a tokenizer seems like a good place to start, I think that's the right granularity for this kind of application. You'd want to do some chunking so that foo.bar doesn't find every foo and every bar, that sort of thing. Code search is, as the title says, a hard problem. But a language-aware token stream, the one you'd get from the lexer, is probably where one should start in building the database.
Sure, you should definitely not try to do the overkill use case first, but I would assume that tree-sitter can emit "just" tokens as well? Getting the flexibility and control of a tool like tree-sitter should let you quickly throw away stuff like comments and keywords if you want, since you can do syntax-aware filtering.
Then again I haven't used tree-sitter, can just imagine that this is a strength of it.
Code search is indeed hard. Stop words, stemming and such do rule out most off the shelf indexing solutions but you can usually turn them off. You can even get around the splitting issues of things like
a.toString()
With some pre-processing of the content. However, where you really get into a world of pain is allowing someone to search for "ring" in that example. You can use partial term search (prefix, infix, or suffix), but this massively bloats the index and is slow to run.
The next thing you try is trigrams, and suddenly you have to deal with false positive matches. So you add a positional portion to your index, and all of a sudden the underlying index is larger than the content you are indexing.
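To make the bloat concrete, here's a toy TypeScript sketch of positional trigram extraction; it's not how any particular engine lays out its index, but it shows why storing positions quickly outgrows the content itself:

// Emit every (trigram, offset) pair for a document. A real engine would fold
// these into postings lists per trigram; keeping positions is what lets it
// verify candidate matches and rule out trigram false positives.
function positionalTrigrams(doc: string): Array<[string, number]> {
  const out: Array<[string, number]> = [];
  for (let i = 0; i + 3 <= doc.length; i++) {
    out.push([doc.slice(i, i + 3), i]);
  }
  return out;
}

// "a.toString()" is 12 characters but already yields 10 trigram/offset pairs.
console.log(positionalTrigrams("a.toString()"));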
It's good fun though. For those curious about it, I would also suggest reading posts by Michael Stapelberg https://michael.stapelberg.ch/posts/ who writes about Debian Code Search (which I believe he started), in addition to the other posts mentioned here. Shameless plug: I also write about this at https://boyter.org/posts/how-i-built-my-own-index-for-search... where I go into some of the issues when building a custom index for searchcode.com
Oddly enough, I think you can go a long way brute forcing the search if you don't do anything obviously wrong. For situations where you are only allowed to search a small portion of the content, say just your own (which looks applicable in this situation), that's what I would do. Adding an index is really only useful when you start searching at scale or you are getting semantic search out of it. For keyword search, which is what the article appears to be talking about, brute force is what I would be inclined to do.
The preprocessing that you need is (in Lucene nomenclature, but it's the same principle for search in general) an Analyzer made for code search: the component that knows how to prepare the plain text for storage in an index, plus the corresponding component for the search query. That's not different from analyzers for other languages (stemming sucks for almost everything but English). Thinking about it, the frontend of most compilers for a language could maybe make a pretty good Analyzer. It already knows the language-specific components and can split them into the parts it needs for further processing, which is basically what an analyzer does.
> It’s hard to find any accounts of code-search using FTS
I'm actually going to be doing this soon. I've thought about code search for close to a decade, but I walked away from it, because there really isn't a business for it. However, now with AI, I'm more interested in using it to help find relevant context and I have no reason to believe FTS won't work. In the past I used Lucene, but I'm planning on going all in with Postgres.
The magic to fast code search (search in general) is keeping things small. As long as your search solution is context aware, you can easily leverage Postgres sharding to reduce index sizes. I'm a strong believer in "disk space is cheap, time isn't", which means I'm not afraid to create as many indexes as required to shave hundreds of milliseconds off searches.
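A minimal sketch of that direction, assuming node-postgres and a made-up snippets table; the 'simple' text search configuration avoids the stemming and stop-word behavior that hurts code search:

import { Client } from "pg";

// Index code snippets with Postgres FTS. The generated tsvector column uses
// the 'simple' config, which does no stemming and drops no stop words.
async function setup(client: Client) {
  await client.query(`
    CREATE TABLE IF NOT EXISTS snippets (
      id       bigserial PRIMARY KEY,
      owner    text NOT NULL,
      code     text NOT NULL,
      code_tsv tsvector GENERATED ALWAYS AS (to_tsvector('simple', code)) STORED
    );
    CREATE INDEX IF NOT EXISTS snippets_tsv_idx ON snippets USING gin (code_tsv);
  `);
}

// Scope the query to one owner's snippets (the "context aware" part), so each
// search only touches a small slice of the index.
async function search(client: Client, owner: string, q: string) {
  const { rows } = await client.query(
    `SELECT id, code
       FROM snippets
      WHERE owner = $1
        AND code_tsv @@ plainto_tsquery('simple', $2)
      LIMIT 50`,
    [owner, q],
  );
  return rows;
}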
Mmm, it's not that straightforward: indexes can vastly slow down large-scale ingest, so it's really about when to index as well.
I work with a lot of multi billion row datasets and a lot of my recent focus has been on developing strategies to avoid the slow down with ingest, and then enjoying the speed up for indexed on search.
I've also gotten some mind-boggling speed increases by summarizing key searchable data in smaller tables, some with JSONB columns that are abstractions of other data, indexing those, and using pg_prewarm to serve those tables purely from memory. I literally went from queries taking actual days to < 1 sec.
Yeah I agree. I've had a lot of practice so far with coordinating between hundreds of thousands of tables to ensure ingestion/lookup is fast. Everything boils down to optimizing for your query patterns.
I also believe in using what I call "compass tables" (like your summarization tables), which I guess are indexes of indexes.
Scaling databases is both oddly frustrating and rewarding. Getting that first query that executes at 10x the speed of the old one feels great. The week of agony that makes it possible… less so.
Fully agree. I do have to give hardware a lot of credit though. With SSDs and now NVMe, fast random read/write speed is what makes a lot of things possible.
> Sourcegraph’s maintained fork of Zoekt is pretty cool, but is pretty fearfully niche and would be a big, new infrastructure commitment.
I don't think Zoekt is as scary as this article makes it out to be. I set this up at my current company after getting experience with it at Shopify, and it's really great.
Use ElasticSearch. It will scale more than Postgres. Three hosted options are AWS, Elastic, Bonsai. I founded Bonsai and retired (so am partial), but they will provide the best human support for you, and you won't have to worry about java Xmx.
Your goal with ES is to use the Regex PatternAnalyzer to split the code into reasonable exact code-shaped tokens (not english words).
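Roughly what that could look like with the Elasticsearch JS client (assuming the 8.x request shape; the regex and field names are just a starting point to tune):

import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

// Create an index whose analyzer splits on anything that isn't part of an
// identifier, so `a.toString()` tokenizes to ["a", "tostring"] instead of
// being stemmed like English prose.
async function createCodeIndex() {
  await client.indices.create({
    index: "code",
    settings: {
      analysis: {
        analyzer: {
          code_tokens: {
            type: "pattern",
            pattern: "[^A-Za-z0-9_]+", // the pattern matches separators, not tokens
            lowercase: true,
          },
        },
      },
    },
    mappings: {
      properties: {
        source: { type: "text", analyzer: "code_tokens" },
      },
    },
  });
}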
GitLab is also using Elasticsearch, so one could recreate the Elasticsearch indices they came up with. [1]
They also share some of the challenges they faced along the way, including interesting ones like implementing the authorization model. [2], [3]
When GitHub removed its most useful search feature, sorting results by date, I wrote a small "search engine" with Elasticsearch to selectively index Microsoft repositories. It works well enough for my needs. [4]
Elasticsearch is good, and it does scale, but it is much more cumbersome and expensive to scale and operate than Postgres. If you use the managed service, you'll pay for the operational pain in the form of higher pricing.
The Postgres movement is strong and extensions like ParadeDB https://github.com/paradedb/paradedb are designed specifically to solve this pain point (Disclaimer: I work for ParadeDB)
I've found embeddings to perform quite poorly on code because
1) user queries are not semantically similar to target code in most cases
2) oftentimes two very concretely related pieces of code are not at all semantically similar
Are any of the tools mentioned in these comments better suited to searching SQL code, both DML and DDL?
We maintain a tree of files with each object in a separate "CREATE TABLE|VIEW|PROCEDURE|FUNCTION" script. This supports code search with grep, but something that could find references to an object when the name qualifications are not uniform would be very useful:
INSERT INTO table
INSERT INTO schema.table
INSERT INTO database.schema.table
All of these can be matched with a regex, but that kind of search is not so easy for programmers new to regular expressions.
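For example, a sketch of such a regex in TypeScript (identifier syntax simplified; quoted/bracketed names not handled, and the table argument is assumed to be a plain identifier):

// Match INSERT INTO with zero, one, or two qualifier levels:
//   INSERT INTO table / schema.table / database.schema.table
function insertRegex(table: string): RegExp {
  const ident = "[A-Za-z_][A-Za-z0-9_]*";
  return new RegExp(`INSERT\\s+INTO\\s+(?:${ident}\\.){0,2}${table}\\b`, "i");
}

// insertRegex("orders").test("insert into Sales.dbo.Orders (id) values (1)")  // -> true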
Why am I not seeing anything here about ctags[0] or cscope[1]? Are they that out of fashion? cscope language comprehension appears limited to C/C++ and Java, but “ctags” (I think I use “uctags” atm) language support is quite broad and ubiquitous…
exactly THIS <sorry for shouting !> the only problem with `cscope` is that for modern c++ based code-bases it is woefully inadequate. for plain / vanilla c based code-bases f.e. linux-kernel etc. it is just _excellent_
language-servers using clangd/ccls/... are definitely useful, but quite resource heavy. for example, each of these tools seems to start new threads per file (!) and there are no knobs to not do that. i don't really understand this rationale at all. yes, i have seen this exact behavior with both clangd and ccls. oftentimes, the memory in these processes balloons to some godawful numbers (more with clangd than ccls), necessitating a kill.
moreover, this might be an unpopular opinion, but mixing any regex based tool (ripgrep/... come to mind) with language-server f.e. because the language server does not really find what you are looking for, or does not do that fast enough, are major points against it. if you already have language-server running, regex based tools should not be required at all.
i don't really understand the reason for sql'ization of code searches at all. it is not a 'natural' interface. typical usage is to see 'who calls this function', 'where is the definition at' of this function etc. etc.
There are tools from bioinformatics that would be more applicable here for code search than the ones linguistics has made for searching natural language.
Is it possible to combine n-grams and an AST to build a better index?
Take `sourceCode.toString()` as an example: the AST can split it into `sourceCode` and `toString`. A further indexer can break `sourceCode` into `source` and `code`.
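A rough sketch of that two-level splitting in TypeScript, using a regex as a stand-in for the AST walk (names and boundaries are simplified):

// Pull identifier tokens out of an expression, then break each identifier on
// camelCase / snake_case boundaries, emitting both levels so queries can hit
// either `sourceCode` or just `source`.
function indexTokens(code: string): string[] {
  const tokens = new Set<string>();
  for (const ident of code.match(/[A-Za-z_][A-Za-z0-9_]*/g) ?? []) {
    tokens.add(ident);
    for (const part of ident.split(/(?<=[a-z0-9])(?=[A-Z])|_/).filter(Boolean)) {
      tokens.add(part.toLowerCase());
    }
  }
  return [...tokens];
}

// indexTokens("sourceCode.toString()")
//   => ["sourceCode", "source", "code", "toString", "to", "string"]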
You could, but I don't know what you gain out of it. The underlying index would be almost the same size, and n-grams would also allow you to search for "e.t", for example, which you lose in this process.
Surprised not to see Livegrep [0] on the list of options. Very well-engineered technology; the codebase is clean (if a little underdocumented on the architecture side) and you should be able to index your code without much difficulty. Built with Bazel (~meh, but useful if you don't have an existing cpp toolchain all set up) and there are prebuilt containers you can run. Try that first.
By the way, there's a demo running here for the linux kernel, you can try it out and see what you think: https://livegrep.com/search/linux
EDIT: by the way, "code search" is deeply underspecified. Before trying to compare all these different options, you really would benefit from writing down all the different types of queries you think your users will want to ask, including why they want to run that query and what results they'd expect. Building/tuning search is almost as difficult a product problem as it is an engineering problem.
When I investigated using livegrep for code search at work, it really struggled to scale to a large number of repositories. At least at the time (a few years ago) indexing in livegrep was a monolithic operation: you index all repos at once, which produces one giant index. This does not work well once you're past a certain threshold.
I also recall that the indexes it produces are pretty heavyweight in terms of memory requirements, but I don't have any numbers on hand to justify that claim.
Zoekt (also mentioned in TFA) has the correct properties in this regard. Except in niche configurations that are probably only employed at sourcegraph, each repo is (re)indexed independently and produces a separate set of index files.
But its builtin web UI left much to be desired (especially compared to livegrep), so I built one: https://github.com/isker/neogrok.
I like this better than livegrep. I haven't actually operated either zoekt OR livegrep before, but I'll probably start with zoekt+neogrok next time I want to stand up a codesearch page. Thanks for building and sharing this!
>Lemmatization: some search indexes are even fancy enough to substitute synonyms for more common words, so that you can search for “excellent” and get results for documents including “great.”
This isn't what lemmatization is about.
Stemming the word ‘Caring‘ would return ‘Car‘.
Lemmatizing the word ‘Caring‘ would return ‘Care‘.
A feature I'd appreciate from Val Town is the ability to point it to a GitHub repo that I own and have it write the source code for all of my Vals to that repo, on an ongoing basis.
Then I could use GitHub code search, or even "git pull" and run ripgrep.
Right now it only commits one val, but it would be trivial to write it into a loop and then use a scheduled val to have it run over all your vals as a cron job!
nvm, just went in and changed it so now you could theoretically do all your vals at once. I realized though that it will not update file contents as is, so I need to figure that out ...
When a val is deployed on val town, my understanding is that it's parsed/compiled. At that point, can you save the parts of the program that people might search for? Names of imports, functions, variables, comments, etc.
Is "hard" a bit of an overstatement for problems like "I'm using a library that mangles the query"? Couldn't you search for the literal text the user inputs? Maybe let them use regex?
I wrote Zoekt. From what I understand Val Town does, I would try brute force first (i.e. something equivalent to ripgrep). Once that starts breaking down, you could use last-updated timestamps to reduce the brute force:
* make a trigram index using Zoekt or Hound for JS snippets older than X
* do brute force on snippets newer than X
* advance X as you're indexing newer data.
If the snippets are small, you can probably use a (trigram => snippets) index for space savings relative to a (trigram => offset) index.
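A sketch of that cutover in TypeScript (the two search functions passed in are placeholders for whatever index and scanner you actually plug in):

interface Snippet {
  id: string;
  code: string;
  updatedAt: Date;
}

type Search = (pattern: RegExp) => Promise<Snippet[]>;

// Everything older than the watermark X is served from the trigram index
// (Zoekt/Hound); everything newer gets brute forced. Re-indexing advances X.
async function hybridSearch(
  pattern: RegExp,
  searchIndex: Search, // covers snippets with updatedAt < X
  scanFresh: Search,   // covers snippets with updatedAt >= X
): Promise<Snippet[]> {
  const [indexed, fresh] = await Promise.all([
    searchIndex(pattern),
    scanFresh(pattern),
  ]);
  return [...indexed, ...fresh];
}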
It's a result of trigrams themselves. For example, indexing "searchcode" (please ignore the plug, this is just the example I had to hand) goes from 1 thing you would need to index to 8.
Tsvector is amazing and it goes a long way, but unfortunately, as you say, it lacks some of the more complex FTS features like typo tolerance, language tokenizers, etc.
the RUM index has worked well for us on roughly 1TB of pdfs. written by postgrespro, the same folks who wrote core text search and json indexing. not sure why RUM is not in core. we have no problems.
RUM is good, but it lacks some of the more complex features like language tokenizers, etc. that a full search engine library like Lucene/Tantivy (and ParadeDB in Postgres) offer
i was under the impression the text search config and dictionaries did support such remappings. we only query in english or literal, so my knowledge is limited.
There's an HN comment I'll never forget where the commenter suggests that Discord move their search infrastructure to a series of text files searched with ripgrep, but Val.town's scale is small enough that they could actually consider it.
It seems like some of their gists have documentation attached and maybe that’s enough? I’m not sure I’m all that interested in seeing undocumented gists in search results.