
Don’t underestimate grep-based code scanning - perch56
https://littlemaninmyhead.wordpress.com/2019/08/04/dont-underestimate-grep-based-code-scanning/
======
secure
One thing which was not immediately obvious to me for a while: the stricter
your language’s formatting is, the easier it will be to grep source code.

I work a lot with Go, where all code in our repository is gofmt'ed. You can
get quite far with regular expressions for finding/analyzing Go code.

(And when regexps don’t cut it anymore, Go has excellent infrastructure for
working with it programmatically.
[http://golang.org/s/types-tutorial](http://golang.org/s/types-tutorial) is a
great introduction!)

~~~
felipeerias
Related to this, it is generally a very good idea to be strict when naming
functions, parameters, variables, etc. so that each concept has exactly one
name throughout the codebase.

~~~
splittingTimes
But how do you effectively organize/enforce this for a code base of several
million LOC where geographically distributed teams are working on different
ends of the system all the time?

The amount of cross team coordination is staggering.

~~~
peterstensmyr
Fail verification in CI if the change doesn’t pass your checks. You can check
anything, such as whether it has a duplicate name already in the codebase.

~~~
splittingTimes
I would be interested to know, which CI tool can check "that each concept has
exactly one name throughout the codebase."

I thought code reviews are the only way and then you need to have every Dev
aligned and on the same page on this topic... Which never happens. :/

~~~
peterstensmyr
It wouldn’t be something provided by the CI tool, you’d have to write the test
yourself. At the end of the day it’s just another test, albeit a more complex
one than a standard unit test.
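
A rough sketch of what such a hand-written check could look like, in Python. The glossary below is entirely hypothetical (the thread names no concrete concepts), and splitting identifiers into words with a regex is only an approximation of "one name per concept":

```python
import re

# Hypothetical glossary: canonical name -> synonyms that should not appear.
CANONICAL_NAMES = {
    "customer": ["client", "purchaser"],
    "invoice": ["bill"],
}


def find_naming_violations(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, found_synonym, canonical_name) for each banned word."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Split snake_case and camelCase identifiers into lowercase words.
        words = [w.lower() for w in re.findall(r"[A-Za-z][a-z]*", line)]
        for canonical, synonyms in CANONICAL_NAMES.items():
            for synonym in synonyms:
                if synonym in words:
                    violations.append((lineno, synonym, canonical))
    return violations
```

In CI you would run this over the changed files and fail the job if the returned list is non-empty. It will not catch two teams inventing different names that are both missing from the glossary, so it complements code review rather than replacing it.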

------
fredley
Don't use grep. Use ag[0], which is specifically designed for searching code.
It's _much_ faster, honors .gitignore, and the output can be piped back
through grep if you like.

    
    
        ag FooBar | grep -v Baz
    

It's in brew/apt/yum etc as `the_silver_searcher` (although brew install ag
works fine too).

0:
[https://github.com/ggreer/the_silver_searcher](https://github.com/ggreer/the_silver_searcher)

~~~
Joeboy
Is there a reason to use ag rather than rg? afair the latter was a lot faster
when I tried it (on ubuntu / intel).

~~~
dkarl
Same experience here, started with ack and switched to ag and then to rg for
speed. I've found them roughly equivalent in functionality, but for those who
need specific features here's a link to a feature comparison table:

[https://beyondgrep.com/feature-comparison/](https://beyondgrep.com/feature-comparison/)

~~~
burntsushi
That table is quite out of date for ripgrep, which has added a number of
features. See:
[https://github.com/beyondgrep/website/issues/97](https://github.com/beyondgrep/website/issues/97)

~~~
dmix
Thanks for ripgrep, I use it daily and was recently going through the source
code to learn how to build production quality Rust apps!
([https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep))

------
jimmaswell
Still never going to beat AST-integrated searching like VS has for C#. Which
has a regex search too.

~~~
arethuza
Are there any stand-alone AST based search tools?

~~~
hchasestevens
I'm going to plug my own here: astpath, which is AST-based search for Python.
[https://github.com/hchasestevens/astpath/](https://github.com/hchasestevens/astpath/)

------
jolmg
The ability to easily grep for functions in C-like code is why I've come to
appreciate projects defining their functions like:

    
    
      int
      foo_func(void) {
    

You can grep for `^foo_func\b` to get to a declaration or definition, or
`^foo_func\b.* {$` to get to a definition or `^foo_func\b.* ;` to get to a
declaration. This is instead of using something like `^\w.* \bfoo_func\\(`,
which is what you'd need for:

    
    
      int foo_func(void) {
    

By the way, anyone know of a way to insert a literal asterisk here without
having to follow it up with a space?
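
To make the three patterns concrete, here is a small Python sketch that applies them to a made-up C snippet written in the "return type on its own line" style. The `grep` helper and the sample source are illustrative only; the patterns are written without the extra space and with the brace escaped:

```python
import re

# Made-up C source using the "return type on its own line" convention.
C_SOURCE = """\
int
foo_func(void);

int
foo_func(void) {
    return 0;
}

int main(void) {
    return foo_func();
}
"""


def grep(pattern: str, text: str) -> list[str]:
    """Return the lines of text that match pattern, roughly like grep -E."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


declaration_or_definition = grep(r"^foo_func\b", C_SOURCE)
definition_only = grep(r"^foo_func\b.*\{$", C_SOURCE)
declaration_only = grep(r"^foo_func\b.*;", C_SOURCE)
```

The call site inside `main` never matches, because there `foo_func` is not at the start of the line.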

~~~
paulddraper
Just one more reason to love languages with trailing instead of leading types
(Scala, Typescript).

    
    
        fooFunc(): Int {
    

"fooFunc returns an integer."

Not

"An integer is returned by fooFunc."

------
timwaagh
> If not, the reviewer can quickly dismiss it as a false positive

This is where you could be wrong. We would need to give a reason for dismissing
it and then the risk officer would need to approve it (or reject it). False
positives can be a real pain in the ass.

------
tannhaeuser
The post's core message seems to be lost on HN. It's about screening sources
for supposedly insecure and/or injection-prone funcs using simple text
scanning (such as strcat, which however is considered in iOS apps when it is a
C std API func); supposedly grepability is also about quickly finding code
locations of messages and variables. But comments are all about Rust or Go
superiority, irrelevant grep implementation details, and AST-based code
analysis tools when these are specifically dismissed in TFA as producing too
many false positives. Talk about bubbles and echo chambers.

~~~
secure
Or maybe the core message just resonates and people have additional discussion
in the comments?

------
KuhlMensch
I do a few VERY SIMPLE greps. The most useful is a pre-commit hook that checks
that no blacklisted env vars appear in the commit diff. So, useful.
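
A minimal sketch of that kind of check, in Python rather than shell. The variable names in the blacklist are invented for illustration, and the function only looks at added lines of a unified diff:

```python
import re

# Hypothetical blacklist; a real hook would load its own list.
BLACKLISTED_ENV_VARS = ["AWS_SECRET_ACCESS_KEY", "DATABASE_PASSWORD"]


def find_blacklisted_vars(diff_text: str) -> list[tuple[str, str]]:
    """Return (variable, line) pairs for added diff lines that mention a blacklisted name."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, BLACKLISTED_ENV_VARS)) + r")\b"
    )
    hits = []
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for match in pattern.finditer(line):
            hits.append((match.group(1), line[1:].strip()))
    return hits
```

In a hook, `diff_text` would typically be the output of `git diff --cached`, and the hook would exit non-zero when the list is not empty.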

Grepping leans into the shell. If you have other environments available
(Python, JavaScript, etc.), it makes sense to lean into them too, e.g. I use
JavaScript to examine my package.json and ensure my dependency SemVers are
"exact".
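
The comment describes doing this in JavaScript; the same idea can be sketched in Python. The definition of "exact" used here (a plain `major.minor.patch` with optional pre-release or build suffix, no range operators) is an assumption:

```python
import json
import re

# Assumed definition of "exact": major.minor.patch, optionally with a
# pre-release or build suffix, and no range operators such as ^ or ~.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$")


def inexact_dependencies(package_json_text: str) -> dict[str, str]:
    """Return the dependencies whose version spec is not an exact version."""
    manifest = json.loads(package_json_text)
    inexact = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(spec):
                inexact[name] = spec
    return inexact
```

Parsing the JSON instead of grepping it avoids false hits on fields like `"version"` or `"engines"`, which is the sort of thing that makes leaning into a real language worthwhile here.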

That said, I rarely write static-analysis scripts: In JavaScript-world there
is already a plethora of easily configurable linting & type-checking tools. If
I wanted to focus in on static-analysis etc I'd probably reach for
[https://danger.systems/js/](https://danger.systems/js/)

Side note: my CI generates a metrics.csv file, which serves as a "metric
catch-all" for any script I might write, e.g. grep to count "// TODO" and
"test.skip" strings, plus my JavaScript tests generate performance metrics
(via monkey-patching React).

I don't actually DO ANYTHING with these metrics, but I'm quite happy knowing
the CI is chugging away at its little metric diary. One day I'll plug it into
something.

------
johnny-lee
I went down this road years ago.

While there's no install and initial results are quick to appear, the false
positives that grep or any string search tool generates will make the cynics
shoot down this simple attempt to find problems in the source code.

Problems that arose:

\- what about use of those questionable APIs/constants in strings (perhaps for
logging) or in comments?

\- some of the APIs listed in the article were only questionable when certain
values were used - sometimes you can get grep/search tool of choice to play
along, but if the API call spans multiple lines or the constant has been
assigned to a variable that is used instead, then a plain string search won't
help.

\- it's hard to ignore previously flagged but accepted uses of the
API/constants.

\- so there's a possible bug reported, but devs usually want to see the
context of the problem (the code that contains the problem) quickly/easily.
Some text editors can grok the grep output and place the cursor at the
particular line/character with the problem, some can't.

If you go down that road to try and reduce false positives, you'll end up with
a parser for your development language of choice.
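
To illustrate the first problem and the slippery slope, here is a Python sketch that blanks out C comments and string literals before searching. It is a crude tokenizer written for this example, not a real C lexer: it ignores preprocessor sections, line continuations, and calls that span multiple lines.

```python
import re

# Crude C tokenizer: matches comments and literals so they can be blanked out.
TOKEN = re.compile(
    r"""
      //[^\n]*                 # line comment
    | /\*.*?\*/                # block comment
    | "(?:\\.|[^"\\\n])*"      # string literal
    | '(?:\\.|[^'\\\n])*'      # character literal
    """,
    re.DOTALL | re.VERBOSE,
)


def blank_out(match: re.Match) -> str:
    # Preserve newlines so reported line numbers still match the original file.
    return "".join("\n" if ch == "\n" else " " for ch in match.group(0))


def scan(source: str, api: str = "strcpy", strip: bool = True) -> list[int]:
    """Return the line numbers where api appears to be called."""
    if strip:
        source = TOKEN.sub(blank_out, source)
    pattern = re.compile(r"\b" + re.escape(api) + r"\s*\(")
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


SAMPLE = '''\
/* TODO: replace strcpy(dst, src) below */
log("strcpy(a, b) failed");
strcpy(dst, src);  // real call
'''
```

On `SAMPLE`, the plain search (`strip=False`) flags all three lines, while the stripped search flags only line 3. Every further refinement (macros, `#if` sections, values assigned to variables) pushes this closer to an actual parser.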

~~~
time0ut
I haven't tried this approach, but having spent years using one of the best
commercial SAST tools, I'm reluctant to dismiss it too quickly.

My SAST generates tons of false positives and is unforgivably slow. If this is
orders of magnitude faster, it might be worth the extra false positives.

As a side note, my dream is a SAST that comments directly in the PR like a
human reviewer would. Maybe that exists?

~~~
johnny-lee
The SAST program is probably doing a lot more than a string search tool does.

If the SAST has to process C/C++ source code, then the SAST will parse all the
#include'd header files. The SAST may track values to determine if
illegal/uninitialized values are used.

A string search tool will skip doing all of that.

If the class of problems you're looking for contains only bad
functions/constants, then a string search tool may be fine.

But as I mentioned before, the string search tool may get confused if these
bad strings occur in strings/comments/irrelevant #if/#else/#elif sections.

There is another class of bugs, dealing with data values, which a string search
tool can't deal with easily.

As an example, PC-Lint lists the type of problems the program may flag -
[https://www.gimpel.com/html/lintchks.htm](https://www.gimpel.com/html/lintchks.htm).
A string search tool won't know about classes and virtual destructors or other
concepts relevant to the programming language in question.

With a string search tool, you'd either invoke the tool several times with
different search strings over the same source code or, slightly more
efficiently, use one long search string containing all your search strings as
alternate search targets.

In either case, when the string search tool spits out a positive result, it
won't explain why there is a problem. The dev will have to know or look up the
problem associated with that search result.
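
Both points (one combined pattern, and a bare match with no explanation) can be addressed with a thin wrapper. The following Python sketch builds a single alternation from a rule table and attaches a reason to every hit. The rule table is illustrative and not taken from the article:

```python
import re

# Illustrative rules only; a real rule set would come from your own guidelines.
RULES = {
    "strcpy": "no bounds check on the destination buffer",
    "strcat": "no bounds check on the destination buffer",
    "strncpy": "may leave the destination without a terminating NUL",
    "gets": "cannot limit how much input is read",
}

# One pattern with every name as an alternative, so each file is scanned once.
COMBINED = re.compile(r"\b(" + "|".join(map(re.escape, RULES)) + r")\s*\(")


def scan(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, api_name, reason) for each flagged call."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in COMBINED.finditer(line):
            name = match.group(1)
            findings.append((lineno, name, RULES[name]))
    return findings
```

The `\b` before the group keeps `fgets(` from being reported as `gets(`. The reasons are still only as good as whoever maintains the table, which is the gap a real SAST fills with per-finding documentation.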

When I worked on this area, C/C++ compilers stopped at syntax errors. Most
have gotten better at flagging popular problems like variable assignments
within if statements, operator precedence bugs, and printf-format string bugs.

Some divisions at Microsoft required devs to run a lightweight SAST before
committing changes to locate possible problems ASAP.

It's relatively easy to integrate an SAST into your build system to scan the
modified source code just before you're ready to commit the changes.

------
anon1253
I tend to work a lot in Lisp and XML, both are more or less trees if you
squint (with the Lisp syntax famously being the AST due to homoiconicity) and
it always makes me wonder if there are better command line tree search or tree
diff algorithms out there (extra awesome if it works with git merge
strategies). I mean whitespace preference is fine and all, but sometimes you
just don’t care :p

------
alxmdev
Hold on, strncat and strncpy are considered dangerous too, now? Not just the
older versions without the _size_t num_ argument?

~~~
unilynx
They don't \0-terminate the target on overflow, so you still need to test for
that condition. So most people will have a wrapper around those to ensure the
\0 is there.

~~~
guitarbill
I think BSD has strlcpy and strlcat for exactly this reason

~~~
jandrese
strlcpy has the braindamage that it returns the length of the source buffer,
which means it has to traverse the entire buffer to figure out the length.

If you want to copy out the first line from a buffer that happens to be a 10TB
mapped file, that strlcpy call will take a long time to finish. If you are
using strncpy/strlcpy because you don't trust the src buffer is properly null
terminated but you still want to stop the copy at the first null or when the
buffer is full, well, you're out of luck because strlcpy is going to blast
past the end of the source buffer regardless.

I would have been much happier if it had just returned a flag indicating
either successful copy (0), buffer was truncated (1), or an error occurred and
errno was set (-1). Possible errors could be that the src or dest was NULL or
the size was 0 (ERR_BAD_ARGUMENT).

------
parentheses
You can search your codebase using livegrep [0] and get near instant results.

[0]
[https://github.com/livegrep/livegrep](https://github.com/livegrep/livegrep)

------
jpalomaki
Random idea: maybe you could supercharge this by introducing to grep some
constructs from programming languages. Now you have things like "word
character", "whitespace", "start of line". In supercharged version you would
have "function", "identifier"

~~~
LandR
Visual Studio with Resharper does this.

It's very very fast / almost instant even with hundreds of source code files
and millions of lines of code.

I hit ctrl+T and can then search everything. This gives me a dropdown that
narrows as I type; I select an item in the dropdown and it goes to that
source file.

I can also type:

/t and search just types

/m members

/mm methods

/u unit tests

/f file

/fp project

/e event

/mp property

/mf field

/ff project folder

e.g.

/t Foo

will find all the Foos

/mm SavePhoto

will find any methods called SavePhoto

Same works in JetBrains Rider for C# stuff.

I couldn't dev without this now, and it's all built into my IDE.

------
switch007
The imperative headline strikes again!

------
K0nserv
Just a small note that I would highly recommend ripgrep[0] over standard
grep. It's another modern tool that has been created by leveraging Rust and
it's from BurntSushi[1] who is excellent.

0:
[https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep)

1\. [https://github.com/BurntSushi](https://github.com/BurntSushi)

~~~
pepper_sauce
Why is it better than grep?

~~~
Canada
It's way faster, which is great when you're working with big repos. It's
designed for recursively searching through a lot of files.

~~~
masklinn
Grep (GNU grep at least) is really fast at the actual search. The gain there
is mostly that "smarter" tools will ignore e.g. VCS data or binary files by
default, whereas grep will trawl through your PNGs and git packfiles.
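
As a rough illustration of that filtering step (not how ag or ripgrep actually implement it), here is a Python sketch that walks a tree, prunes VCS directories, and skips files that look binary. Treating a NUL byte in the first kilobyte as "binary" is a common heuristic and an assumption here. Unlike the real tools, this ignores .gitignore entirely.

```python
import os

# Version-control directories that are almost never worth searching.
SKIP_DIRS = {".git", ".hg", ".svn"}


def looks_binary(path: str, probe_size: int = 1024) -> bool:
    """Heuristic: treat a file as binary if its first bytes contain a NUL."""
    with open(path, "rb") as f:
        return b"\0" in f.read(probe_size)


def searchable_files(root: str):
    """Yield paths under root, skipping VCS directories and binary-looking files."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into VCS directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not looks_binary(path):
                yield path
```

Feeding only these paths to the search is where most of the practical speed-up on a large repository would come from, independent of how fast the matcher itself is.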

~~~
icebraining
Explanation from the original author on why GNU grep is fast:
[https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...](https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html)

Excerpt: "The result of this is that, in the limit, GNU grep averages fewer
than 3 x86 instructions executed for each input byte it actually looks at (and
it skips many bytes entirely)."

~~~
masklinn
There's also this bit:
[https://ridiculousfish.com/blog/posts/old-age-and-treachery....](https://ridiculousfish.com/blog/posts/old-age-and-treachery.html)

However note
[https://news.ycombinator.com/item?id=19522987](https://news.ycombinator.com/item?id=19522987)

> It does not. ripgrep does not use Boyer-Moore in most searches.

> In particular, the advice in [the freebsd mailing list post] is generally
> out of date.

 _although_ the out of date bits are really the Boyer-Moore ones:
[https://lobste.rs/s/ycydmd](https://lobste.rs/s/ycydmd)

> much of Mike Haertel’s advice in this post is still good. The bits about
> literal scanning, avoiding searching line-by-line, and paying attention to
> your input handling are on the money.

> But the stuff about Boyer-Moore is outdated.

