Tool author here. There are a bunch of really good tools for code search already which are a lot more sophisticated than this one - livegrep for example.
I built this partly as a learning exercise, to see how far I could push the idea of a Datasette plugin and to figure out how to call external processes from Python asyncio.
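Roughly, that pattern boils down to something like this (a simplified sketch, not the plugin's exact code):

```python
import asyncio

async def run_ripgrep(pattern, path, time_limit=1.0):
    # Spawn rg without a shell and capture its output asynchronously.
    proc = await asyncio.create_subprocess_exec(
        "rg", "--json", pattern, path,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), time_limit)
    except asyncio.TimeoutError:
        proc.kill()          # stop runaway searches
        await proc.wait()
        return ""
    return stdout.decode("utf-8", "replace")

# e.g. asyncio.run(run_ripgrep("plugin_config", "/path/to/checkouts"))
```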
I also built it because I had a hunch that the "datasette publish" mechanism for deploying Datasette to serverless hosting providers such as Google Cloud Run could make it easy to deploy code search engines too. I think the most interesting thing about this project is the GitHub Actions workflow I wrote that deploys the demo instance running against code from 60+ repositories: https://github.com/simonw/datasette-ripgrep/blob/main/.githu...
People in comments suggesting alternatives like sourcegraph, zoekt, hound, etc -- I set up local code search for the repositories on my computer a while ago and tried out most of the tools I could find at the time [0]. Nothing comes remotely close to a simple ripgrep against literally all of the repositories, both in terms of convenience and speed (it's instantaneous against hundreds of repositories, at least on an SSD!). The only hassle is configuring .ignore files (most of it is covered by .gitignores), but you usually only have to do it a few times to exclude the few spammy offenders.
I'm using it with my Emacs (+helm); I just have a global keybinding that instantly opens a new code search. I can imagine a web frontend being very convenient for people who aren't hooked on Emacs, and even I often want to persist a code search result for several days, so I'll give datasette-ripgrep a try!
[0] It's a regular expression search engine built on ripgrep that can search in virtually anything (pdf files, zip archives, movie subtitles, sqlite databases, png+ocr, ...).
Like I guess most devs, I have used regular expressions to search for code; but, quite apart from artificial limits such as a maximum time or a maximum number of results, regex has never been a reliable search method for me.
When trying to find all instances of some code, there is always that one case which is not covered by the initial intuition, and the regex misses it.
For example, to find uses of the function "foo()", the first idea is usually to search for ".foo(". But that won't find places where the function is passed as a value, nor places where devs were inconsistent and did or didn't write a space between the name and the opening parenthesis, like ".foo (" (this is a common style for Gnome or Glib code). Not to mention that if the language is C++ you then have to distinguish between ".foo", "->foo", and "::foo".
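(To illustrate: even a more tolerant pattern that handles those operator variants still misses the passed-as-a-value case. A quick sketch:)

```python
import re

# Tolerates ".foo(", ".foo (", "->foo" and "::foo", but still misses a bare
# "foo" passed as a value, as in "bar(foo)".
call_like = re.compile(r"(\.|->|::)\s*foo\b")

for snippet in ["obj.foo(1)", "obj.foo (1)", "ptr->foo()", "Ns::foo", "bar(foo)"]:
    print(snippet, bool(call_like.search(snippet)))  # the last one prints False
```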
In conclusion, regex might be OK for having a superficial look at some code, but never as a reliable source of information, unless a lot of work ironing out corner cases has been invested in the regex itself, effectively treating the regexes as part of the code, with unit tests and all the fuss that entails.
Agreed; this is why I like the idea of searching parsed files, for example using scope selectors against code tokenized with a sublime-syntax grammar. Unfortunately, that means the code needs to be parsed/indexed first, which is slower than a plain regex search.
I wonder if the AST built by Tree-Sitter could also help with this type of search - does anyone know of any existing solutions for this?
> does anyone know of any existing solutions for this?
https://semgrep.dev, though it's mostly an analysis tool, it can be used as a search tool. IIRC it's not super fast, but for the cases where there is no way to really contort a regex into something suitable (the regex has way too many false positives and/or negatives) it works rather well.
> I like the idea of searching parsed files, for example using scope selectors against code tokenized with a sublime-syntax grammar.
I've been working on a trigram-based search engine with support for exactly this (via github.com/trishume/syntect) over the past several months and plan on open-sourcing it soon. Cool to see someone else with this idea!
You might also like https://comby.dev - it is aware of code structure without parsing files
> I've been working on a trigram-based search engine with support for exactly this (via github.com/trishume/syntect) over the past several months and plan on open-sourcing it soon.
Awesome, I look forward to that - you'll post a "show HN" for it, I hope? :)
Does syntect's lack of support for the newest sublime-syntax features cause you any problems?
> You might also like https://comby.dev - it is aware of code structure without parsing files
> Awesome, I look forward to that - you'll post a "show HN" for it, I hope? :) Does syntect's lack of support for the newest sublime-syntax features cause you any problems?
Show HN -> Yep :)
Lack of support for newest sublime-syntax features: less than you would expect, but a little for sure.
Most of the syntax definitions in the wild today don't really use the newer features. At Sourcegraph I wrote a little Rust HTTP server wrapping syntect[1], and we've used it for all our syntax highlighting for the past several years; I would say it works on something like 95% of code files, even if you include a lot of additional syntaxes that are untested with Syntect[2]. That said, it does barf hard on some specific files, either taking a really long time to do the work or getting completely stuck in a busy-waiting loop for some reason. Even so, it's still the 2nd-best syntax highlighter out there (second only to Sublime itself).
One of my hopes for this side project is that I'll be able to contribute more time upstream with e.g. a more extensive test suite for Syntect against a much larger number of syntax definitions from the wild instead of just Sublime's built-in ones.
In some cases the problem can be somewhat mitigated if the search is incremental (and fast), as that basically allows you to refine your regex in real time and helps train the correct intuition.
For example, you can type out "foo" at first, which will, say, display 7 results; then when you add "()" you see the list shrink by one, so you immediately know the regex missed one it was supposed to capture.
Etsy has a web app for searching a git repo, https://github.com/hound-search/hound. But you have to figure out how to efficiently load and update all interesting repos on disk.
I really wish the git protocol had a way to perform a remote git grep so you don't need to clone potentially massive repos just to search for keywords.
Suggestion: for the ".plugin_config(" example [0], using fixed-string search with the -F option would be much better. Perhaps you could add an option to choose among -F, the default regex engine, and PCRE (for features like lookarounds).
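Roughly, the plugin would only need to map that choice onto ripgrep's existing flags; a sketch (the function and parameter names here are invented):

```python
def build_rg_args(pattern, mode="regex"):
    # mode is a hypothetical UI choice: "fixed", "regex" or "pcre2".
    args = ["rg", "--json"]
    if mode == "fixed":
        args.append("--fixed-strings")  # -F: treat the pattern literally
    elif mode == "pcre2":
        args.append("--pcre2")          # -P: enables lookarounds etc.
    return args + ["--", pattern]

# build_rg_args(".plugin_config(", mode="fixed")
# -> ["rg", "--json", "--fixed-strings", "--", ".plugin_config("]
```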
Seems really useful. There's another tool, https://grep.app, that searches for code patterns throughout GitHub; very useful for taking inspiration from others' code or looking for examples.
Code search is an amazing tool, and this is a great step, but it’s missing clickable cross-references, which are the killer feature in Kythe-based code search: https://kythe.io/
I'm a little bit confused by the state of that website; it almost makes Kythe look abandoned, but if you look at the Pulse tab on GitHub it tells a different story (the project looks very much alive).
Is Kythe used to index some prominent open source project/s?
Just compare kythe.io with opensource.google; Kythe's page looks more like a collection of markdown files than a live project website.
Clicking "web ui" on the front page took me to this page [1], which is scant on details and links to a caley release from a year ago; it was probably written a long time ago and never updated (which makes me think the rest of the pages may be the same). I can tell ease of deployment is not a priority either.
Just a little bit of CSS, plus linking to source.chromium.org/chromium and cs.opensource.google as example deployments, would make the project look much more alive.
OpenGrok is powered by Universal Ctags (https://github.com/universal-ctags/ctags) underneath and uses a Java servlet web server (Apache Tomcat) with search powered by Apache Lucene.
OpenGrok is older, boring technology but it works.
It is refreshing to see a new, Rust-powered approach to code search.
Feedback: code search in a repository (or multiple repositories) should allow blacklisting and/or whitelisting files and directories. Every time I try to search for something on GitHub, I have to sift through tons and tons of useless results from tests.
Edit: The discussion has taken a weird turn into support for ignore files. To be clear, this feedback is about dynamically filtering searches when using a website to do code search, most likely searching someone else’s repo on someone else’s website; you don’t have access to the filesystem, the search form is the only input. So all this “rg supports ignore files” (I’m aware, but thanks) is not really relevant.
GitHub searches are particularly bad. They always give you test folders and examples before and after actual relevant source code. You'd think they would know that folders named "test" should be in a separate section...
The worst thing about GitHub searches is that it only shows the first result for each file, and often that's not the interesting result. I almost never use GitHub's search; instead I clone the repository locally and use ripgrep on it (or git grep).
I agree the replies you're getting are weird. Ignore files would be nice if you wanted repo-specific filters or even user-wide global filters. But for on-demand filters at search time, the frontend could add a feature for it and implement it with the `-g/--glob` flag.
So, since I started this entire thread, I guess here is why it even came up: Datasette (a tool I use) is something you run locally. In particular, this plugin is pointed at a local checkout of the repositories you're working with.
For instance, in my case I have all the code I work with in ~/Development. Since I also use ripgrep for local development, I typically include .ignore/.rgignore files to control what I'm searching in my repos.
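(A hypothetical .rgignore in one of those repos might be as small as:)

```
dist/
vendor/
*.min.js
```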
Doing "excludes"/"includes" across a variety of repositories is hard at query time because they layouts of those repositories can be very different. For instance in some JS repos you do want to search in "dist" whereas in others you don't because "dist" gets generated out of "src" for instance and just ends up being a duplicate etc.
The way I see doing includes/excludes at query time is this. You run a search without excludes for example. You see a bunch of stuff coming back all in one directory that you don't want for this particular search. So you add an exclude rule for that directory. And then just refine results that way.
If you have a bunch of similarly named directories, then I think that just means that the exclude rules at query time need to be more specific. e.g., use the full path to the directory.
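As a concrete sketch (hypothetical repo and directory names, but ripgrep's real -g/--glob syntax):

```python
import subprocess

def search(pattern, excludes=(), cwd="."):
    args = ["rg", "--line-number"]
    for glob in excludes:          # "!" means exclude; globs use gitignore syntax
        args += ["-g", f"!{glob}"]
    args += ["--", pattern]
    return subprocess.run(args, capture_output=True, text=True, cwd=cwd).stdout

# First pass: no excludes; the results turn out to be dominated by a
# generated dist/ directory. Second pass: exclude that one directory by its
# full path, just for this search.
everything = search("createElement", cwd="Development")
refined = search("createElement", excludes=["some-js-repo/dist/**"],
                 cwd="Development")
```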
I think the feature being requested here is an interactive feature of the user interface, whereas ignore files are more like a static fact of a directory tree.
> To be clear, this feedback is about dynamically filtering searches when using a website to do code search, most likely searching someone else’s repo on someone else’s website; you don’t have access to the filesystem, the search form is the only input.
I think people are imagining that it should be the repo-author's responsibility to create the equivalent of an `.rgignore` file ahead-of-time (since they know the codebase's structure, they know what's useful vs. useless to show up in a search of the code — pretty objectively, "code" should show up, and "not-code" shouldn't); and that code-search on websites like Github should respect the rules in such a file if it were discovered in the repo.
If such a thing existed and was in wide use, you — a random passerby to the codebase, who doesn't yet understand it — wouldn't need to struggle to do any dynamic filtering, because static filtering would already have been done for you.
> If such a thing existed and was in wide use, you — a random passerby to the codebase, who doesn't yet understand it — wouldn't need to struggle to do any dynamic filtering, because static filtering would already have been done for you.
That's the thing, it's impossible to create a generic ignore list that caters to everyone's interest. Could the_mitsuhiko, the creator of flask, provide such a list for flask? Nope. Some people want to understand how register_blueprint is implemented and don't want to see anything from test, docs and examples. Some people want to learn how to use register_blueprint, so they specifically want to limit themselves to examples. Some other people want to check if there are test cases for a certain API. You can't ignore any file that may be interesting to someone, which is basically all of them.
Right. Ignore files and search-time dynamic filters solve different problems. They aren't mutually exclusive. Otherwise, I wouldn't have added the -g/--glob flag to ripgrep in the first place.
> You can't ignore any file that may be interesting to someone, which is basically all of them.
ripgrep already does ignore certain files! Specifically, it ignores binary and hidden files, and files in your .gitignore. This is part of what makes ripgrep's default output so much more "readable" than GNU grep's default output: its few default filters mean that it's not matching from e.g. the contents of your .git/ directory, whereas GNU grep does show those matches by default.
> Could the_mitsuhiko, the creator of flask, provide such a list for flask?
That wasn't what I was suggesting. I'm saying that each repo owner should be maintaining such a list for their own repo, custom-tailored to it. Just like every repo owner manages their repo's .gitignore, .gitattributes, .dockerignore, etc.
And specifically, that list shouldn't be a set of things that are always filtered out of all searches, but rather a set of patterns that apply abstract purpose-type tags to files, a la .gitattributes, making them much easier (or even "default") to filter in/out of your searches.
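(Something like this, with a syntax invented purely for illustration:)

```
tests/**       purpose=test
**/*_test.py   purpose=test
docs/**        purpose=docs
examples/**    purpose=examples
```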
...ripgrep and other search tools could then see those files as having that purpose-tag "test" attached; and you'd be able to filter files in/out by their purpose-tag, just like you can filter files in/out by their filetype with ripgrep's -t/-T switches.
(You could go wild from there, if you like, and let this patterns-file match not just files but lines inside files. For example, doc-comments lines [=~ lines with syntax X in files of type Y] could be tagged as having purpose=docs.)
And I would then propose, on top of that, some higher-level behavior — not necessarily implemented in rg(1) itself — that could look at a file of "filter-context specifications" like this:
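(Again, the syntax here is invented just to illustrate the idea:)

```
[scopes]
default  = -purpose:test -purpose:docs -purpose:examples
docs     = +purpose:docs
examples = +purpose:examples
```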
...and then, when you do a search, it would use that `default` search-scope if not otherwise specified; or use another scope if you name it. So given the config above, by default you'd be searching only code; but you could instead search your docs scope, or your examples scope, or an implicit "all" scope.
The big win with those purpose-tags and scopes, would be if they had conventional or formal (e.g. URNs in an RDF namespace) names, such that different repos could agree on their usage.
Then tools that do code-search across many different repos, like Github, could layer UI on top of these purpose-tags and search-scopes discovered from the repo, to enable the repo to be searched using this explicitly-encoded abstract understanding of the purpose/function of the different parts of the repo. (Picture the Github search autocomplete for a search 'foo' giving several drop-down options like "Search 'foo' in [code] of this repo", "Search 'foo' in [docs] of this repo", etc. Then, after you get to the search page, a set of checkboxes to refine your search from the original scope, into a custom scope, by including/excluding arbitrary purpose-tags.)
This has nothing to do with ignore files. Tests pushed to a GitHub repository obviously won't be matched by gitignore (unless you ignore them in gitignore and then force-commit them, which would be really rare); I'm just not interested in them in most of my searches.
I'm talking about excluding files and directories during searches, through a directive like, say, "-exclude tests -exclude test_*.py".
Sure, I guess technically you can add a bunch of additional patterns to gitignore before you use this web frontend, but that would be a really weird workflow, and it's impossible when the web frontend is run by someone else.
Which is again impossible when you use someone else’s web frontend for code search. Plus if you’re editing .rgignore on the filesystem for every search, what’s the point of using a web frontend?
Not sure why ignore files are not the solution here. You can check an ignore file into your repo, and then whoever uses ripgrep through whatever frontend gets that logic applied. I'm using this and I don't quite see the downsides of that.
Let's flip this around: why would that make sense? Filtering expressions belong to a query, not configuration. Having to use ignore files for this is like not being allowed to use the WHERE clause in SQL and having to rely on a .sqlwhere config file instead.
That’s great that it works for you, but it doesn’t work for me.
What about the one-off search of someone else’s code? I don’t want to check out the repo just to add/edit a file so I can run ripgrep locally. If GitHub has a search feature that is entirely online, it would make sense for people to request they improve it.
The topic is datasette and ripgrep. I would be curious why in the context of those tools ignore files are not a solution. I agree that github codesearch leaves a lot to be desired.
Put it this way, OP is basically a web interface for ripgrep; ripgrep supports not just 'automatic' filtering through ignore files, but also 'manual' filtering per-query through the -g flag. OP should have an interface to the -g flag.
I agree that a -g flag would be useful. However given the original comment and how datasette is used I figured the ignore file would be a good solution.
- I can’t control what someone else checks into their repo. I use web frontends for code search almost exclusively on other people’s repos, because my own repos are already on my machine, so what’s the point.
- Most of the time I don't want to see search results in tests, but occasionally I actually want to find things in tests and nowhere else. Different people have different filtering needs at different times; you can't have an ignore file that satisfies every need. Search is about filtering to begin with.
Well, all this tool does is call rg from a directory with a bunch of repos, so providing glob filters on every query is about as awful as rg providing a --glob flag... which is arguably not that awful to a lot of users.
> Datasette is a tool you run against your stuff usually. ... datasette runs locally.
I don't think so? I could be wrong but I thought a major use case (or the major use case) is providing a frontend for your data to other people. Like on https://ripgrep.datasette.io/-/ripgrep.
> Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world.
> I don't think so? I could be wrong but I thought a major use case (or the major use case) is providing a frontend for your data to other people.
_I_ use datasette and I only ever used it locally or in a docker container. In either case for datasette-ripgrep to work you need to check out the repos you want to search in one folder that datasette will then invoke ripgrep on.
There are folks that publish Datasette installations for public consumption, but even in that case I would assume the code search was preconfigured with ignore files for it to make any sense.
>> Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world.
That doesn't preclude installing the program locally.
The official "getting started" talks of remote / web datasettes as "demos" and "trials".
> That doesn't preclude installing the program locally.
Of course not. But the project makes it pretty clear that it's designed with multiuser in mind. Can you host multiuser web apps only for yourself? Of course, I do that all the time. The unreasonable thing is saying "I use this thing exclusively for my own stuff on my local network, so screw your requests for multiuser, public-facing use cases."