I'll share my similarly named tool `grep-ast` [0], which sort of does the opposite of the OP's `ast-grep`. The OP's tool lets you specify your search as a chunk of code/AST (and then do AST transforms on matches).
My tool lets you grep a regex as usual, but shows you the matches in a helpful AST-aware way. It works with most popular languages, thanks to tree-sitter.
It uses the abstract syntax tree (AST) of the source code to show how the matching lines fit into the code structure. It shows relevant code from every layer of the AST, above and below the matches.
It's useful when you're grepping to understand how functions, classes, variables etc are used within a non-trivial codebase.
Here's a snippet that shows grep-ast searching the django repo. Notice that it finds `ROOT_URLCONF` and then shows you the method and class that contain the matching line, including a helpful part of the docstring. If you ran this in the terminal, it would also colorize the matches.
django$ grep-ast ROOT_URLCONF
middleware/locale.py:
│from django.conf import settings
│from django.conf.urls.i18n import is_language_prefix_patterns_used
│from django.http import HttpResponseRedirect
⋮...
│class LocaleMiddleware(MiddlewareMixin):
│ """
│ Parse a request and decide what translation
│ object to install in the current thread context.
⋮...
│ def process_request(self, request):
▶ urlconf = getattr(request, "urlconf", settings.ROOT_URLCONF)
Hey paulg, ast-grep author here! This is something I also want to do in ast-grep!
ast-grep prints the surrounding lines around matches but they are not aware of which function/scope the matches are in.
May I ask how you do the scope detection in a general fashion? (say language agnostic)
https://github.com/ast-grep/ast-grep/issues/155
The command line tool is a thin wrapper around the `TreeContext` class, whose purpose is to show you a set of "lines of interest" in the context of the entire AST. This all exists because my other project aider [0] uses TreeContext to display a repository map [1] so that GPT-4 can understand how the most important classes, methods, functions, etc fit into the entire code base of a git repository.
But it was easy to make a CLI interface to grep lines of interest and display them with TreeContext, and it turned out to be quite useful.
The TreeContext class is line-oriented, and is mainly interested in tracking language constructs whose scope spans multiple lines. Typically these are things like classes, methods, functions, loops, if/else constructs, etc. Given a line of interest, we look at all the multi-line scopes which contain it. For each such multi-line scope, we want to display some "header" lines to provide context.
In this example, the match for "two" is contained in the multi-line scopes of a method and a class. So we print their headers.
$ grep-ast two example.py
⋮...
│class MyClass:
│ "MyClass is great"
⋮...
│ def print2(self):
▶ print("two")
⋮...
The trick is determining the header for each multi-line scope. It's not ideal to just use the first line. For example, it's nice that the header for the class included the docstring as well as the bare `class MyClass:` line.
For any multi-line scope, we look at all the other AST scopes which start on the same line. We take the smallest such co-occurring scope, and declare the header to be the lines that it spans. For a simple method like `def print2(self):`, that's all that gets picked up.
But a complex method like `print1()` below picks up all the lines which are part of its full function signature:
$ grep-ast one example.py
⋮...
│class MyClass:
│ "MyClass is great"
⋮...
│ def print1(
│ self,
│ prefix,
│ suffix,
│ ):
⋮...
▶ print(f"{prefix} one {suffix}")
⋮...
It's a heuristic, but it seems to work well in practice.
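To make the "enclosing scopes plus headers" idea concrete, here is a minimal sketch. It is not grep-ast's implementation: it uses Python's standard `ast` module instead of tree-sitter, so it only handles Python, and it swaps in a simpler header rule (everything from a scope's first line up to the line before its first body statement) because the stdlib AST doesn't expose the fine-grained syntax nodes that the smallest-co-occurring-scope heuristic relies on. Under this rule a class docstring is not part of the header. The names `enclosing_scopes`, `header_lines`, `context_for` and `render` are made up for illustration.

```python
import ast

SOURCE = '''\
class MyClass:
    "MyClass is great"

    def print1(
        self,
        prefix,
        suffix,
    ):
        print(f"{prefix} one {suffix}")

    def print2(self):
        print("two")
'''

# Node types treated as "scopes" in this sketch (an assumption, not grep-ast's list).
SCOPE_TYPES = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef,
               ast.For, ast.While, ast.If, ast.With, ast.Try)


def enclosing_scopes(tree, line):
    """Return the multi-line scope nodes containing `line` (1-based), outermost first."""
    scopes = [
        node for node in ast.walk(tree)
        if isinstance(node, SCOPE_TYPES)
        and node.lineno <= line <= node.end_lineno
        and node.end_lineno > node.lineno  # single-line scopes need no header
    ]
    return sorted(scopes, key=lambda node: node.lineno)


def header_lines(node):
    """Lines from the scope's first line up to just before its first body statement."""
    first_body_line = node.body[0].lineno
    return range(node.lineno, max(node.lineno + 1, first_body_line))


def context_for(source, line):
    """Line numbers to display: the line of interest plus the headers of its scopes."""
    tree = ast.parse(source)
    shown = {line}
    for scope in enclosing_scopes(tree, line):
        shown.update(header_lines(scope))
    return sorted(shown)


def render(source, line):
    """Format the selected lines, marking the line of interest with an arrow."""
    lines = source.splitlines()
    out = []
    for number in context_for(source, line):
        marker = "▶" if number == line else "│"
        out.append(f"{marker}{lines[number - 1]}")
    return "\n".join(out)
```

Calling `render(SOURCE, 9)` picks up the whole multi-line signature of `print1`, while `render(SOURCE, 12)` only needs the single `def print2(self):` line.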
Wow! What a coincidence. Just the other day I finished "v1" of a similar tool: https://github.com/alexpovel/srgn , calling it a combination of tr/sed, ripgrep and tree-sitter. It's more about editing code in-place, not finding matches.
I've spent a lot of time trying to find similar tools, and even list them in the README, but `AST-grep` did not come up! I was a bit confused, as I was sure such a thing must exist already. AST-grep looks much more capable and dynamic, great work, especially around the variable syntax.
This looks really interesting, thank you for putting this together! I’ll likely give it a go today. I say that as someone who has explored quite a few of these and found them mostly quite basic. srgn looks like more than the usual.
One minor comment: I personally found the first Python example involving a docstring a little hard to parse (ha). It may show a variety of features, but in particular I found that it was hard to spot at a glance what had changed.
If you could use diff formatting or a screenshot with color to show the differences it would make it much easier to follow. If I get around to using it later today, I might submit a PR for that. :)
I initially hoped the queries would be more powerful, but they are really not. You can write queries and a resulting template in a yaml file. The program will scan a list of repositories for all these YAML files, and expose them as command line verbs.
oak go definitions /home/manuel/code/wesen/corporate-headquarters/geppetto/pkg/cmds/cmd.go
type GeppettoCommandDescription struct {
Name string `yaml:"name"`
Short string `yaml:"short"`
Long string `yaml:"long,omitempty"`
Flags []*parameters.ParameterDefinition `yaml:"flags,omitempty"`
Arguments []*parameters.ParameterDefinition `yaml:"arguments,omitempty"`
Layers []layers.ParameterLayer `yaml:"layers,omitempty"`
Prompt string `yaml:"prompt,omitempty"`
Messages []*geppetto_context.Message `yaml:"messages,omitempty"`
SystemPrompt string `yaml:"system-prompt,omitempty"`
}
type GeppettoCommand struct {
*glazedcmds.CommandDescription
StepSettings *settings.StepSettings
Prompt string
Messages []*geppetto_context.Message
SystemPrompt string
}
While I can use it for good effect for LLM prompting as is, I really would like to add a unification algorithm (like the one in Peter Norvig's Prolog compiler) to get better queries, and connect it to LSP as well.
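For readers who haven't met unification: here is a minimal sketch of the idea over toy syntax trees. It is not oak's code or Norvig's implementation. The assumptions are that trees are nested tuples, that pattern variables are strings starting with `?`, and that no occurs check is needed; `is_var`, `resolve` and `unify` are illustrative names.

```python
def is_var(term):
    """Pattern variables are strings such as "?x" (a convention of this sketch)."""
    return isinstance(term, str) and term.startswith("?")


def resolve(term, bindings):
    """Follow variable bindings until reaching a non-variable or an unbound variable."""
    while is_var(term) and term in bindings:
        term = bindings[term]
    return term


def unify(x, y, bindings=None):
    """Return bindings that make x and y equal, or None if they cannot match."""
    bindings = {} if bindings is None else bindings
    x, y = resolve(x, bindings), resolve(y, bindings)
    if x == y:
        return bindings
    if is_var(x):
        return {**bindings, x: y}
    if is_var(y):
        return {**bindings, y: x}
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        # Unify children left to right, threading the bindings through.
        for child_x, child_y in zip(x, y):
            bindings = unify(child_x, child_y, bindings)
            if bindings is None:
                return None
        return bindings
    return None


# A repeated variable must match the same subtree everywhere it appears.
pattern = ("call", "?f", ("args", "?x", "?x"))
same_args = ("call", "print", ("args", "a", "a"))
diff_args = ("call", "print", ("args", "a", "b"))
```

Here `unify(pattern, same_args)` binds `?f` and `?x`, while `unify(pattern, diff_args)` fails because `?x` cannot be both `a` and `b`. That consistency constraint is what would make such queries more expressive than plain structural matching.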
# via Homebrew
brew install ast-grep
# via Cargo
cargo install ast-grep
# via npm
npm i @ast-grep/cli -g
# via pip
pip install ast-grep-cli
# I tested and pipx works too:
pipx install ast-grep-cli
I really like this - it means the tool is available to people familiar with any of those four distribution mechanisms.
I was curious so I had a look at how the "pip install ast-grep-cli" command works. It downloads a wheel for the correct platform from https://pypi.org/project/ast-grep-cli/#files
The wheel just contains the two binaries (sg and ast-grep) and no Python code.
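If you want to check this yourself: a wheel is an ordinary zip archive, so Python's standard `zipfile` module can list what is inside. A minimal sketch; the helper names are made up, and the file name in the usage comment is a placeholder rather than a real release artifact.

```python
import zipfile


def wheel_contents(wheel):
    """Return the member names of a wheel, given a path or a binary file object."""
    with zipfile.ZipFile(wheel) as archive:
        return archive.namelist()


def has_python_code(names):
    """True if any archive member looks like Python source."""
    return any(name.endswith(".py") for name in names)


# Usage with a wheel downloaded from PyPI (placeholder file name):
# names = wheel_contents("ast_grep_cli-<version>-<platform>.whl")
# print(names, has_python_code(names))
```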
This is how Ruff works too! (Ruff is also a standalone binary with no Python dependency.) If you're interested, I recommend checking out Maturin, which makes this pretty easy -- you can ship any standalone Rust binary as a Python package by zipping it into a wheel.
AST-grep is well done - the speed is particularly impressive and it's quite easy to get started with.
One of the downsides of the simplicity is that rules are written in yaml. This works nicely for simple rules, but if you try to save a complex migration as a rule, you end up programming in YAML (which is very hard).
Hey morgante, nice to meet you again! Indeed YAML is a compromise between expressiveness and ease of learning. Grit did a great job of providing advanced code manipulation!
A looping gif is an unfortunate choice for a demo. It looks cool to start, but then I'm trying to see what it's done when it restarts and I have to sit through it again. Some before and after still screenshots would help.
If you have mpv or an avconv/ffmpeg-based video player, you can play and seek within the gif file. You can use `mpv http-gif-url` to play it directly.
Hi Herrington_d,
Thanks for sharing this awesome tool! I'm trying to use it locally on some C++ code to match a method call:
v = map->GetValueForKey(somethingUseless, key);
For some reason my simple query pattern 'GetValueForKey' isn't a match. Is there a way to get the 'sg' CLI to output the AST for some line of my target file so I can see what kind of pattern I need to write?
Interesting use-case: We recently started using ast-grep at CodeRabbit[0] to review pull request changes with AI.
We use gpt-4 to generate ast-grep patterns to deep-dive and verify pull-request integrity. We just rolled this feature out 3 days back and are getting excellent results!
tree-sitter gives us a uniform interface to parse and manipulate code, which is awe-inspiring work. I wish tree-sitter could have more contributors to the core library. It still has a lot of room for improvement.
Take performance, for example. tree-sitter's initial parsing speed can be easily beaten by a carefully hand-crafted parser. Tree-sitter, written in C, has a similar JavaScript parsing speed to Babel, a JS-based parser. See the benchmark https://dev.to/herrington_darkholme/benchmark-typescript-par...
Besides, it doesn't shine at syntax highlighting, either! In the sense that it doesn't add anything that the traditional text-based algorithms embedded in practically any text editor can't already do. For example, if I declare a variable called "something", it should highlight all successive occurrences of "something" in a remarkably different style than "somethink". And the "a" in "sizeof(a)" should be rendered differently when it's a variable and when it's a type.
Also plugging my related project: https://github.com/Ichigo-Labs/cgrep
From the comments in this thread, it seems a lot of people have built or needed an easy way to quickly create static analysis checks, without a bunch of hassle. I think extended regex is a great way to do this.
Well, when I search for "semgrep", I get a very nice corporate landing page with a "Book Demo" button. Which is a level of hassle that just isn't worth it for smaller teams, because "Book Demo" usually means "We're going to do a dance to see how much money we can extract from you." Which smaller teams may only want to do for a handful of key tools.
(4 years ago, I was more willing to put up with enterprise licensing. But in the last two years, I've seen way too many enterprise vendors try to squeeze every penny they can get from existing clients. An enterprise sales process now often means "Expect 30% annual price hikes once you're in too deep to back out." The lack of easy VC money seems to have made some enterprise vendors pretty desperate.)
There's also an open source "semgrep" project here: https://github.com/semgrep/semgrep. But this seems to be basically a vulnerability scanner, going by the README.
Whereas AST-grep seems to focus heavily on things like:
1. One-off searching: "Search my tree for this pattern."
2. Refactoring: "Replace this pattern with this other pattern."
AST-grep also includes a vulnerability scanning mode like semgrep.
It's possible that semgrep also has nice support for (1) and (2), but it isn't clearly visible on their corporate landing page or the first open source README I found.
Thanks ekidd for your kind words! ast-grep author here.
This is a hobby project and mainly focuses on developers' daily work like search and linting. I appreciate that you like it!
Semgrep's vulnerability scanning is much more advanced, mostly for enterprise security usage.
Yeah, I had this feeling a bit. I guess I'm curious what problems they solve differently (if any). My sense is that semgrep is an enterprise managed solution of the same kind (and btw, is still itself OSS).
Hi, ast-grep author here. This is a great question and I asked this in the first place before I started the hobby project.
TL;DR: I designed ast-grep to be on a different track than semgrep.
Semgrep is for security and ast-grep is for development.
First and foremost, I have always been in awe of semgrep. Semgrep's documentation, product sites and Padioleau's podcast all gave me a lot of inspiration. Using code to find code is such a cool idea that I never need to craft an intricate regex or write a lengthy AST program. sgrep and patch from https://github.com/facebookarchive/pfff/wiki/Sgrep have helped me a lot in real large codebases.
When I used semgrep as a software engineer, instead of a security researcher, I found semgrep has not touched much on routine development work. I can use `semgrep -e PATTERN` but the Python wrapper is not too fast compared to grep.
While the pattern syntax is cool, it cannot precisely match some syntax nodes (for example, selecting a generator expression in Semgrep is very hard). It also does not have an API to find code programmatically.
Why I think semgrep is a security tool different from ast-grep:
* Semgrep is security focused. It has many advanced static analysis features in its core product, such as dataflow analysis, symbolic propagation, and semantic equivalence, all of which are useful for security analysis. They are not available in ast-grep.
* Semgrep’s pattern syntax also prefers matching more potentially vulnerable semantics than matching precise syntax. Semantic-level information is the better level of abstraction for a security model. ast-grep, on the other hand, sticks to faithfully translating users' queries syntactically.
* Semgrep has a one-off search and rewrite feature, but it is not its primary focus. The CLI is a bit slow compared to other tools. ast-grep strives to be a fast CLI tool.
* Semgrep has a product matrix for vulnerability detection: detecting secrets, supply chain vulnerabilities, and cross-file detection. It also has a plethora of security rules in the registry. These features will not be included in ast-grep.
I was hoping this could be a local replacement for Azure DevOps's functional code search[1], but this seems lower-level than that. Basically, I want a tool where I can write something like `class:Logger` and it'll show me which file(s) define a class with that name, or `ref:Logger` to find all usages of that/those class(es).
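To illustrate the kind of query I mean, here is a minimal single-file sketch using Python's standard `ast` module. It is purely name-based (no type or import resolution), only handles Python, and the function names are hypothetical; a real `class:`/`ref:` search would need cross-file, multi-language indexing.

```python
import ast

SAMPLE = '''\
class Logger:
    def log(self, msg):
        print(msg)

class Service:
    def __init__(self):
        self.logger = Logger()

def make():
    return Logger()
'''


def class_definitions(source, name):
    """Line numbers where a class called `name` is defined (like `class:Logger`)."""
    tree = ast.parse(source)
    return sorted(
        node.lineno for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef) and node.name == name
    )


def references(source, name):
    """Line numbers where the bare name `name` is used (like `ref:Logger`)."""
    tree = ast.parse(source)
    return sorted(
        node.lineno for node in ast.walk(tree)
        if isinstance(node, ast.Name) and node.id == name
    )
```

Unlike a plain text grep for `Logger`, this would not report the attribute `self.logger` or a mention inside a string or comment.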
The problem with any tree-sitter based tool is that there will typically be edge cases where the tree-sitter parser is wrong. Probably not a big deal most of the time, but it makes me wary of using it for security.
Cool! Like many others here, I also relatively recently created a similar tool called syntax-searcher or syns[1] which is focused on searching, and doesn't support replacing. It's implemented with a hand-written tokenizer/parser rather than using eg. tree-sitter. It works best with (mostly) context-free languages, but it doesn't crash or anything with Python and the like. At least I find its query syntax better than the alternatives, but it's probably because I wrote it :)
I find the best uses for it being answering questions like 'how is this function called' and 'what is this struct's definition':
# find struct definition
$ syns 'struct Span {}'
[./src/psi.rs:21-26]
pub struct Span {
/// Starting byte index of the span.
pub lo: usize,
/// End byte index of the span.
pub hi: usize,
}
I came up with a similar concept for in-editor SSR as an extension to existing find/replace functionality: https://codepatterns.io/
It worked great for the use case I built it around initially but I think it would need a scripting/logic component to generalise to any conceivable refactoring.
This looks exciting. One thing I've always wanted to do is search Rust code but excluding code in tests (marked by a #[cfg(test)] annotation). Can it do that?
I certainly hope some excellent AST-based CLI code search tools come to exist; hopefully this is one of them.
Thanks! How would you do that for a #[cfg(test)] attribute in Rust? (I believe that's the true identifier of test code; `mod test {}` is just a convention). I assume Rust attributes "wrap" the AST node rooted at the node that follows them?
That looks like a case where "analyse the AST after constant folding" might be a theoretical path if you had a language frontend that could emit the AST at that point.
I suspect that things like "these two functions both start with the same conditional+early return" would be more useful to -me- given the sort of things I tend to be working on. Also a 'fuzzy possible copy+paste detector' in general to help identify refactoring targets.
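As a rough sketch of the first idea, restricted to Python and the standard `ast` module: group functions whose first statement (after an optional docstring) is a structurally identical `if ...: return ...` guard. The helper names are made up, and "identical" here means an exact match of the dumped AST, so a renamed variable already counts as different; a fuzzy detector would need something smarter.

```python
import ast
from collections import defaultdict

SAMPLE = '''\
def a(x):
    if x is None:
        return 0
    return x + 1

def b(x):
    "Same guard as a(), after a docstring."
    if x is None:
        return 0
    return x * 2

def c(y):
    if y is None:
        return 0
    return y
'''


def leading_guard(func):
    """Return a structural key for a leading `if ...: return` guard, or None."""
    body = func.body
    first = body[0]
    # Skip a docstring if present.
    if (isinstance(first, ast.Expr) and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)):
        body = body[1:]
    if not body:
        return None
    first = body[0]
    if isinstance(first, ast.If) and isinstance(first.body[-1], ast.Return):
        return ast.dump(first)  # line/column attributes are omitted by default
    return None


def functions_sharing_guard(source):
    """Return sorted groups of function names that start with the same guard."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            key = leading_guard(node)
            if key is not None:
                groups[key].append(node.name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```

On `SAMPLE`, `a` and `b` are grouped together, while `c` is left out because its guard tests a different variable name.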
It also strikes me that something that was mostly 'just' a structure-aware diff would be handy, so e.g. you got diffs within-if-body and similar, but I'm now into vigorous hand waving because it's been ages since I've thought about this and I probably need more coffee.
I -did- do a pure maths degree many years ago but I don't generally seem to end up working on computational code
to the downvoter: I thought this was a reasonable question? Semantic equivalence is IIRC undecidable in general. Some languages (Backus' FL?) try to deal with that but I dunno.
I've tried using this, but the documentation and learning resources weren't very good (at least at the time ~6 months ago) and structuring refactors with YAML made it very cumbersome for me to write and edit one-off commands.
Tree Sitter also leaves a lot to be desired for C++ editing, but that's a special problem.
I have been improving ast-grep's documentation since the release of the project.
The resources are arguably not as good or abundant as those for eslint/babel. But they have improved a lot (say, example page[0] and deep dive page[1]). The doc site is also carefully crafted to make it accessible, compared to libcst or other similar projects.
I also appreciate blogs/introductions to ast-grep if the community can help! Let me know if there are issues/problems when using ast-grep!
Unless the search/replace is super simple, you need the YAML as far as I can tell. The refactor I gave up on automating had to do with changing variadic C++ macros into arithmetic expressions, which wasn't conceptually very complicated, but felt almost impossible while constantly tripping over YAML syntax errors.
> changing variadic C++ macros into arithmetic expressions
I don't know your exact use case but it sounds a little bit hard to me. I guess you want to change things like this.
```cpp
add_macro(a, b, c, d); // before
a + b + c + d; // after
```
It isn't that easy to do in pattern syntax. The pattern/replacement must support repetition and it isn't straightforward as far as I can tell from Rust's macro[0].
That said, if you need to support use cases like this, ast-grep has a Python API to support programmatic usage[1].
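To show why a programmatic approach helps with repetition, here is a minimal plain-Python sketch of the rewrite above. It does not use ast-grep's API at all: it works on text, splits the macro arguments at top-level commas, and joins them with `+`. The function names and the `add_macro` name are illustrative, and it ignores strings, comments, identifier boundaries and operator precedence inside arguments, which a real AST-based rewrite would have to handle.

```python
def split_top_level_args(arg_text):
    """Split "f(a, b), c" into ["f(a, b)", "c"], ignoring commas inside brackets."""
    args, current, depth = [], [], 0
    for ch in arg_text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        args.append(tail)
    return args


def rewrite_add_macro(code, macro="add_macro"):
    """Replace each `macro(a, b, ...)` call in `code` with `a + b + ...`."""
    needle = macro + "("
    out, pos = [], 0
    while True:
        start = code.find(needle, pos)
        if start == -1:
            out.append(code[pos:])
            return "".join(out)
        # Scan forward to the parenthesis that closes the macro call.
        depth, end = 0, start + len(macro)
        while end < len(code):
            if code[end] == "(":
                depth += 1
            elif code[end] == ")":
                depth -= 1
                if depth == 0:
                    break
            end += 1
        if end == len(code):  # unbalanced parentheses: leave the rest untouched
            out.append(code[pos:])
            return "".join(out)
        args = split_top_level_args(code[start + len(needle):end])
        out.append(code[pos:start])
        out.append(" + ".join(args))
        pos = end + 1
```

The loop over an arbitrary number of arguments is the part that is awkward to express in a single pattern and replacement template.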
hey.. are these tools (or a combination thereof) capable of converting parts of code in one language to another? Given no (or minimal) idiosyncrasies...
e.g. python to javascript or other way around? (And no, ML is not the answer, i need provable correctness)
I've done a lot of work in this space, and unfortunately the answer is largely no.
These provide a nice frontend for writing simple rules, but I would not want to (essentially) write an entire transpiler in yaml.
For Python->JavaScript, you likely want a transpiler focused specifically on that.
Unfortunately, every such effort eventually hits serious limits in the emergent complexity for languages. There's a reason most of the SOTA techniques are ML-based.
Provable correctness means you have to model your source and target languages.
And then translate the source model to the target model. It is theoretically possible, but in practice, modeling an industry language is way too much work. Some languages do not even have a spec :/
[0] https://github.com/paul-gauthier/grep-ast