
Semgrep: Lightweight static analysis for many languages - kiyanwang
https://github.com/returntocorp/semgrep
======
ievans
I work on Semgrep; there are a bunch of examples at
[https://semgrep.live](https://semgrep.live) if you're curious about what the
syntax looks like.

For context, Semgrep started as a Facebook open-source project inspired by an
Inria project named Coccinelle, which has made a couple thousand or so
automatic patches to the Linux kernel over the years using a semantic patch
language
([http://coccinelle.lip6.fr/sp.php](http://coccinelle.lip6.fr/sp.php)).

~~~
danpozmanter
Impressive work!

Are there any plans to include C# or F#?

~~~
ievans
C# is high on the list; F# isn't a priority at the moment, though. Behind the
scenes, we've recently changed to use tree-sitter as the parser library; if
there is a good F# tree-sitter library, integration becomes quite easy. I don't
see one at
[https://tree-sitter.github.io/tree-sitter/](https://tree-sitter.github.io/tree-sitter/)
but perhaps there's one maintained elsewhere.

------
tabbott
We've been using semgrep for Zulip's python codebase for the last few months;
here's our configuration:

[https://github.com/zulip/zulip/blob/master/tools/semgrep.yml](https://github.com/zulip/zulip/blob/master/tools/semgrep.yml)

I really appreciate the semantic checks. They're especially nice for security-
sensitive lint rules, but really it removes the hacky regular expressions feel
of adding lint rules to a codebase. It's also been useful for some codebase
migrations (semgrep is more precise than e.g. `git grep -w` for finding "All
the places we use code pattern X that we want to stop doing").
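
To illustrate the precision gap described above, here is a minimal sketch that uses Python's standard `ast` module as a stand-in for a syntax-aware matcher (it is not Semgrep itself). The sample source and the helper names `grep_word` and `ast_calls` are made up for this example.

```python
import ast
import re

# Hypothetical sample file: "eval" shows up in a comment, a string, and a real call.
SOURCE = '''
# never call eval on user input
message = "eval is dangerous"
result = eval(user_input)
'''


def grep_word(source: str, word: str) -> list[int]:
    """Line numbers where `word` appears as a whole word, roughly like `git grep -w`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def ast_calls(source: str, func_name: str) -> list[int]:
    """Line numbers of actual calls to a bare function named `func_name`."""
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    )


print(grep_word(SOURCE, "eval"))  # text search: comment, string, and call
print(ast_calls(SOURCE, "eval"))  # syntax-aware search: only the call
```

The text search reports every line that mentions the word, while the syntax-aware search reports only the line containing a real call.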

My main complaint about it is performance -- it's too slow per rule for
us to replace the regular expression based system that we run on our whole
codebase (so we can't happily convert our other ~100 regular expression-based
lint rules to semgrep; see
[https://github.com/zulip/zulip/blob/master/tools/linter_lib/...](https://github.com/zulip/zulip/blob/master/tools/linter_lib/custom_check.py)).

But performance has been improving a lot over time, and I think there's
potential for it to be faster (e.g. mypy, the Python type-checker, has gotten
way way faster in the last year or two). Because semgrep is getting active
investment from a venture-funded company that I imagine will improve the
performance, I expect semgrep to be a tool that most projects serious about
code quality are using in a few years.

I should add that performance may also be less important to others than it is
to us; we run all of our linters (currently 20 distinct linters, including
eslint, prettier, pyflakes, isort, shellcheck, etc.) in parallel using
[https://github.com/zulip/zulint](https://github.com/zulip/zulint), with the
goal of being able to lint the entire codebase in <30s or changed files in
under 1s (obviously time depends on number of files changed).
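
The general idea of running independent linters side by side can be sketched as follows. This is not how zulint is implemented; the function names and the two placeholder "linters" (which just exit with a fixed status) are assumptions made so the example runs anywhere.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_linter(name: str, argv: list[str]) -> tuple[str, int]:
    """Run one linter command and return its name with its exit code."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return name, proc.returncode


def run_all(linters: dict[str, list[str]]) -> dict[str, int]:
    """Run every linter concurrently; threads suffice because the work happens in child processes."""
    with ThreadPoolExecutor(max_workers=max(1, len(linters))) as pool:
        return dict(pool.map(lambda item: run_linter(*item), linters.items()))


# Placeholder commands standing in for real linters such as pyflakes or eslint.
DEMO_LINTERS = {
    "passes": [sys.executable, "-c", "import sys; sys.exit(0)"],
    "fails": [sys.executable, "-c", "import sys; sys.exit(1)"],
}

if __name__ == "__main__":
    for name, code in run_all(DEMO_LINTERS).items():
        print(f"{name}: exit code {code}")
```

With this shape, total wall-clock time is bounded by the slowest linter rather than the sum of all of them, which is why a single slow tool stands out.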

~~~
kevincox
I wonder if this could be improved by extracting fixed strings from the
pattern and only actually parsing the files that could possibly match. I think
the major issue would be alias support but even that should be possible for
most languages as your fixed-string extraction would notice the alias itself.
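
A minimal sketch of this prefilter idea, assuming patterns mark metavariables with a leading `$`. The helpers `literal_tokens` and `may_match` are hypothetical and do not reflect Semgrep's internals.

```python
import re


def literal_tokens(pattern: str) -> set[str]:
    """Identifiers in the pattern that must appear verbatim in any matching file.

    Tokens starting with `$` are treated as metavariables and skipped.
    """
    tokens = re.findall(r"\$?[A-Za-z_][A-Za-z0-9_]*", pattern)
    return {token for token in tokens if not token.startswith("$")}


def may_match(source: str, pattern: str) -> bool:
    """Cheap over-approximation: False means the file can be skipped without parsing."""
    return all(token in source for token in literal_tokens(pattern))


PATTERN = "subprocess.call($CMD, shell=True)"

# The aliased import still mentions "subprocess", so the file is kept for full parsing.
aliased = "import subprocess as sp\nsp.call(cmd, shell=True)\n"
unrelated = "print('hello')\n"

print(may_match(aliased, PATTERN))
print(may_match(unrelated, PATTERN))
```

The filter only rules files out; anything it keeps still needs the full parse and match. It also assumes the matcher never treats two differently spelled identifiers as equivalent, which would need extra handling.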

~~~
aryx
Great idea! Will do that.

------
stephen-bunn
Just went through the examples. Seems really intuitive and looks like it would
be a good approach for homegrown linters. Would also love to see some plugin
support for editors.

~~~
dlukeomalley
Agreed. What editors do you have in mind?

I filed a ticket for VS Code support because I’ve seen it mentioned in a few
of the other comments:
[https://github.com/returntocorp/semgrep/issues/1329](https://github.com/returntocorp/semgrep/issues/1329)

~~~
stephen-bunn
VS Code and vim would be the ones I would be most concerned about as I
typically jump between the two. Although a pre-commit hook is great and
something I will definitely use, having this hook reporting issues in a more
live manner would be a huge bonus.

------
staticassertion
Semgrep's pretty slick. I tried out a demo and I was pretty blown away by how
I could essentially just guess my way to a signature.

------
ccktlmazeltov
Regexes are such a horrible thing to deal with when you're just trying to
parse code quickly and don't want to deal with AST. I've always wished for a
library of regexes that just work.

------
vmchale
Cool stuff! Seems to hook into tree-sitter?

Love seeing OCaml (or any functional language) :)

------
glouwbug
I've always wondered if we could leverage the vast amount of GitHub code -
that presumably all compiles without error or undefined behaviour on their
master branches - to train some sort of neural net to better catch syntax errors.

Has anyone done something like this, or am I riding the 2016 neural net hype
train still?

~~~
karlding
This isn't specifically for syntax errors, but Jacob Jackson released TabNine
[0] last year, which is an autocompleter trained on files from GitHub [1].

TabNine was acquired by Codota earlier this year [2].

[0] [https://www.tabnine.com/](https://www.tabnine.com/)

[1] [https://www.tabnine.com/blog/deep/](https://www.tabnine.com/blog/deep/)

[2] [https://techcrunch.com/2020/04/27/codota-picks-up-12m-for-an...](https://techcrunch.com/2020/04/27/codota-picks-up-12m-for-an-ai-platform-that-auto-completes-developers-code/)

~~~
glouwbug
Pretty amazing, and congrats to Jacob Jackson. (I may be a little envious) ;)

------
dorian-graph
I only recently came across Semgrep and then after that, Comby
([https://comby.dev/](https://comby.dev/)).

Has anyone compared the 2? They seem similar (structured find/replace, with
registries of rules).

~~~
rusbus
Comby seems more like "parenthesis matching + search" (they don't implement a
full parser for the language, just some basic required constructs to make a
basic AST). I imagine this limits the resolution of the search?

Semgrep uses an AST that's equivalent to the parser of the language itself so
it's much higher resolution in terms of what you can match.
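
To make "parenthesis matching + search" concrete, here is a toy sketch of delimiter-balanced matching without a parser. It is not Comby's implementation; the function `balanced_args` is hypothetical, and this toy version knows nothing about comments, strings, or identifier boundaries.

```python
def balanced_args(source: str, name: str) -> list[str]:
    """Return the text between balanced parentheses after each `name(` in `source`."""
    results = []
    needle = name + "("
    start = source.find(needle)
    while start != -1:
        open_at = start + len(name)  # index of the opening parenthesis
        depth = 0
        for index in range(open_at, len(source)):
            if source[index] == "(":
                depth += 1
            elif source[index] == ")":
                depth -= 1
                if depth == 0:
                    results.append(source[open_at + 1 : index])
                    break
        start = source.find(needle, start + 1)
    return results


# Nested parentheses are handled by counting depth.
print(balanced_args("x = log(fmt(a, b), c)", "log"))

# Without a real parser, text inside a comment is matched as well.
print(balanced_args("# log(old)\nlog(new)", "log"))
```

The second call shows the lower resolution: telling the commented-out call apart from the real one takes knowledge of the language's syntax.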

~~~
dorian-graph
Ah yeah, that is a strong distinction. Comby seems to have a slightly nicer UX,
but then, as you've said, it would have a lower matching resolution.

That also explains why Comby supports so many languages so easily, and how
easy it is to add your own DSL.

------
estebarb
Nice to see more work in this direction. I used coccinelle a lot for
automating changes/bug detection and I immediately missed it when working on
anything that is not C.

------
lsorber
Looks neat. Are you considering a flake8 extension like bandit for easy
adoption (in CI and in VS Code)?

------
skanga
pip3 install semgrep fails on windows 10 with Python 3.7.8 and pip 20.1.1 and
the error seems to be an invalid path separator char.

error: can't copy 'XXXXXXXXXXXXXX\Local\Temp\pip-install-cq40rzma\semgrep-files/semgrep-core': doesn't exist or not a regular file

Anyone here know how to fix that?

~~~
dlukeomalley
Semgrep should work on Windows Subsystem for Linux (WSL). Mind filing a ticket
for myself and the other maintainers to help debug?

[https://github.com/returntocorp/semgrep/issues/new?assignees...](https://github.com/returntocorp/semgrep/issues/new?assignees=&labels=&template=bug_report.md&title=Invalid%20Path%20Separator)

~~~
skanga
Done

