
A hands-on introduction to static code analysis - dolftax
https://deepsource.io/blog/introduction-static-code-analysis/
======
UncleMeat
It's good to see discussions of static analysis, but I often feel that these
blog posts do a disservice to the techniques. The post leads by mentioning
applications like bugfinding and security vuln detection but the examples here
are barely above local syntactic checks. This is the common scenario in the
majority of blog posts I see about static analysis, probably because it is
just much easier to put together a quick write up on AST-linting. Heck, this
article has a diagram that directly states that an AST is the input to a
static analysis module, but that is true only for some kinds of things!

AST level analysis is certainly useful. Everybody should be using some sort of
style checker. But AST pattern matching is such a _completely_ different
technique from the stuff used to do bugfinding that I worry these blog posts
will give the wrong impression about what static analysis can and can't
do.
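
To make the distinction concrete, here is a minimal sketch of the AST
pattern-matching kind of check, using Python's standard ast module. The names
NoneComparisonChecker and check are made up for illustration and are not taken
from the article. It flags comparisons written as `== None` and nothing more:
it has no idea whether a value can actually be None at a given point, which is
the kind of question that needs dataflow or pointer analysis.

```python
import ast


class NoneComparisonChecker(ast.NodeVisitor):
    """Hypothetical lint rule: flag `x == None` and `x != None`.

    This is a purely syntactic pattern match on the tree shape.
    """

    def __init__(self):
        self.findings = []  # list of (line number, message)

    def visit_Compare(self, node):
        # A chained comparison `a == b != c` has operands [a, b, c]
        # and operators [==, !=]; operator i sits between operands i and i+1.
        operands = [node.left, *node.comparators]
        for i, op in enumerate(node.ops):
            if not isinstance(op, (ast.Eq, ast.NotEq)):
                continue
            pair = (operands[i], operands[i + 1])
            if any(isinstance(o, ast.Constant) and o.value is None for o in pair):
                self.findings.append(
                    (node.lineno, "compare to None with 'is' / 'is not'")
                )
        self.generic_visit(node)


def check(source):
    """Parse `source` and return the findings of the hypothetical rule."""
    checker = NoneComparisonChecker()
    checker.visit(ast.parse(source))
    return checker.findings
```

Something like `check("if x == None:\n    pass\n")` reports line 1, while the
same code written with `is None` reports nothing, regardless of what x can
hold at runtime.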

I'd love to see blog posts about interprocedural pointer analysis, for
example.

~~~
rj722
Article author here. I agree that the post only scratches the surface of
static analysis -- it was aimed at an audience looking for an
introduction to static analysis. The scope of the examples in this post had
to be limited for this reason.

Inter-procedural pointer analysis -- yes, a lot trickier than these, but
definitely juicier. Will try to write a post on it in the coming weeks.

~~~
UncleMeat
I think limiting the scope is fine in general. But one small suggestion would
be to make it more clear that this is just one very simple technique. This
does not come across at all in the blog post. The diagram you show, for
example, seems to state that this is just how static analyses work - they are
given ASTs to work with. Or at the very least include some examples of
semantic properties. It seems incongruent when you describe static analysis as
understanding the behavior of the program without running it and then use
examples that are about syntactic style violations.

------
saagarjha
The kinds of analyses mentioned here are typically grouped under
"linting"; more advanced static analysis tools will typically do things like
dataflow analysis.

~~~
g_delgado14
Any beginner friendly articles on more advanced analysis that you'd recommend?

~~~
jjtheblunt
[https://en.wikipedia.org/wiki/Static_single_assignment_form](https://en.wikipedia.org/wiki/Static_single_assignment_form)

~~~
UncleMeat
While computing phis for SSA does require dataflow analysis, SSA itself is not
tremendously useful. The natural follow up to this would be "so what?"
Something like live variable analysis is probably a much better first
introduction to dataflow analysis since its application is much more obvious.

SSA is also not even universal among IRs for static analysis at this point.
Heap-SSA is growing in popularity for complex dataflow problems involving
fields.
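
For a sense of why live variable analysis makes the application obvious, here
is a rough sketch of the classic backward fixed-point formulation,
live_in = use | (live_out - def). The function live_variables, the block
names, and the hand-built control-flow graph are all hypothetical; a real tool
would derive the use/def sets from an IR instead of writing them by hand.

```python
def live_variables(blocks, successors):
    """Backward dataflow over a CFG, iterated to a fixed point.

    blocks:     block name -> (use, defs), where `use` holds variables read
                before any write in the block and `defs` holds variables
                written in the block.
    successors: block name -> list of successor block names.
    Returns (live_in, live_out), each mapping block name -> set of variables.
    """
    live_in = {name: set() for name in blocks}
    live_out = {name: set() for name in blocks}
    changed = True
    while changed:
        changed = False
        for name, (use, defs) in blocks.items():
            # live_out is the union of live_in over all successors.
            out = set()
            for succ in successors.get(name, []):
                out |= live_in[succ]
            new_in = use | (out - defs)
            if out != live_out[name] or new_in != live_in[name]:
                live_out[name], live_in[name] = out, new_in
                changed = True
    return live_in, live_out


# Hand-built example CFG (an assumption for illustration):
#   B1: x = 1; y = 2        -> B2
#   B2: if x < 10           -> B3 or B4
#   B3: x = x + 1           -> B2
#   B4: return x
BLOCKS = {
    "B1": (set(), {"x", "y"}),
    "B2": ({"x"}, set()),
    "B3": ({"x"}, {"x"}),
    "B4": ({"x"}, set()),
}
SUCCESSORS = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": ["B2"], "B4": []}

LIVE_IN, LIVE_OUT = live_variables(BLOCKS, SUCCESSORS)
```

In this example y is written in B1 but is not live on exit from B1 (and is not
read inside B1 either), so the assignment to y is a dead store. That is the
"so what?" answer: the analysis result maps directly onto a warning.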

------
dtornabene
Going to drop a top-level comment and say that while this is interesting
(sincerely!), if people are interested in deeper tools/techniques, the book
Practical Binary Analysis is excellent. It ends with taint checking and
symbolic execution techniques, and uses Pin.
[https://practicalbinaryanalysis.com/](https://practicalbinaryanalysis.com/)

Also worth checking out is BAP, the Binary Analysis Platform, which is the
successor project to BitBlaze, and is one of the most fascinating binary
analysis frameworks out there for my money. It was the only one of the DARPA
CGC entries that ran on real binaries, not just the much less complicated ones
developed specifically for the challenge.

[https://github.com/BinaryAnalysisPlatform/bap](https://github.com/BinaryAnalysisPlatform/bap)

~~~
saagarjha
I’m unsure of what you mean: while I did not participate in CGC personally,
IIRC they used a custom platform that teams had to retool for. How would
an entry that runs “on real binaries” be useful for this situation?

~~~
dtornabene
Because the test binaries they used were not really close enough to reality to
test finding real vulnerabilities. BAP can, which seems useful if you want to
learn static binary analysis.

------
flohofwoe
Slightly tangential to what the article is about, but at least in the C/C++
world, the most important change to make static analysis popular for "the rest
of us" was probably Xcode's decision to integrate clang analyzer right into
the Xcode UI under a menu item (Xcode doesn't do many things right, but this
is definitely one of the very good features).

This way, analyzing the code is a simple "button press" and works out of the
box on every Xcode project.

Soon after, Microsoft followed suit in Visual Studio (even though in my
experience, the MS analyzer doesn't catch quite as many things as the clang
analyzer).

Before that, static analyzers were those no doubt useful but obscure "magic
tools" which were very hard to integrate into an existing build process.

Even the most useful tool will be ignored when it is hard to use.

~~~
saagarjha
Somewhat annoyingly, the static analyzer that ships with Xcode doesn't seem to
be packaged separately as in the command line tools…

~~~
flohofwoe
Hmm, command-line clang accepts a --analyze option here ("Apple clang version
11.0.0"), and this seems to give additional output over the regular warnings.
I'm not sure if that's the same thing as the analyzer integrated into Xcode,
but some sort of static analyzer seems to be there.

~~~
saagarjha
Oh, I will have to try that. Thanks for sharing!

------
pwaivers
Thanks for this article, dolftax! I followed all the examples on my machine
with no problem, and I learned some new stuff.

I have a question: how difficult is it to implement the AST? It seems like
that is the bulk of the work for this static code analysis.

~~~
rj722
"Crafting Interpreters" by Bob Nystrom
([https://craftinginterpreters.com](https://craftinginterpreters.com)).
Although the book does not cover static analysis (understandably),
implementing an AST is covered in detail.

------
ecuaflo
For "Detecting unused imports", why not record the line numbers on the first
pass as well? Then we don't need to traverse the tree again.
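
Roughly what I have in mind, as a sketch using Python's standard ast module.
This is not the article's code; UnusedImportChecker and unused_imports are
hypothetical names. The visitor stores the line number next to each imported
name during the single traversal, so reporting needs no second walk. It
ignores scoping, `__all__`, and star imports, so treat it as an illustration
only.

```python
import ast


class UnusedImportChecker(ast.NodeVisitor):
    """Hypothetical single-pass check: collect imports (with line numbers)
    and name uses in the same traversal."""

    def __init__(self):
        self.imported = {}  # bound name -> line number of the import
        self.used = set()   # every name that is read anywhere in the module

    def visit_Import(self, node):
        for alias in node.names:
            # `import a.b` binds `a`; `import a.b as c` binds `c`.
            name = alias.asname or alias.name.split(".")[0]
            self.imported[name] = node.lineno

    def visit_ImportFrom(self, node):
        for alias in node.names:
            if alias.name == "*":
                continue  # star imports cannot be tracked by name here
            self.imported[alias.asname or alias.name] = node.lineno

    def visit_Name(self, node):
        # Only reads count as uses; assignments to a name do not.
        if isinstance(node.ctx, ast.Load):
            self.used.add(node.id)


def unused_imports(source):
    """Return a sorted list of (line number, name) for unused imports."""
    checker = UnusedImportChecker()
    checker.visit(ast.parse(source))
    return sorted(
        (line, name)
        for name, line in checker.imported.items()
        if name not in checker.used
    )
```

For a module that imports os and sys but only calls os.getcwd(), this reports
sys together with the line of its import statement.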

