
Static Analysis in GCC 10 - fanf2
https://developers.redhat.com/blog/2020/03/26/static-analysis-in-gcc-10/
======
WalterBright
Double-frees can be tracked by doing data flow analysis on the function. This
is how D does it in its nascent implementation of an Ownership/Borrowing
system. It can be done without DFA if getting it only 90% right, with numerous
false positives, is acceptable.

I've used many static checkers in the past, and the rate of false positives
was high enough to dissuade me from using them. This is why D uses DFA to
catch 100% of the positives with 0% negatives. I knew this could be done
because the compilers were using DFA in the optimization pass.

In order to get the tracking to work, one cannot just track things for a
function named "free". After all, a common thing to do is write one's own
custom storage allocators, and the compiler won't know what they are. Hence,
there has to be some mechanism to tell the compiler when a pointer parameter
to a function is getting "consumed" by the callee, and when it is just
"loaned" to the callee (hence the nomenclature of an Ownership/Borrowing
system).

One of the difficulties in doing this for D is that there are several complex
semantic constructs that need to be deconstructed into their component pointer
operations. I noticed that Rust simplified this problem by simplifying the
language :-)

But once done, it works, and works satisfyingly well.

Note that none of this is a criticism of what GCC 10 does, because the article
gives insufficient detail to draw any informed conclusions. But I do see this
as part of a general trend that people are sick and tired of memory safety
bugs in programming languages, and it's good to see progress on all fronts
here.

~~~
anchpop
> This is why, for D, I was determined to use DFA to catch 100% of the
> positives with 0% negatives. I knew this could be done because my compilers
> were using DFA in the optimization pass.

Is this really true? I thought this was impossible due to Rice's theorem

~~~
WalterBright
Fortunately, I am unaware that it was impossible and did it anyway :-)

But it is possible I made a mistake.

It is also true that for it to work, one has to change the way one writes
code, like Rust does. This is why D requires an @live attribute to be added to
functions to enable the checking for just those functions, so it doesn't break
every program out there. It enables incremental use of the checking at the
user's option.

~~~
nullc
You're probably using a different definition of 100% than any impossibility
proof would use.

Consider some code:

    a = malloc(1);
    needfree = true;
    if (hashfn(first_factor(huge_static_rsanum1)) & 1) { needfree = false; free(a); }
    if (hashfn(first_factor(huge_static_rsanum2)) & 1) { needfree = false; free(a); }
    if (needfree) free(a);

Whether this has a double free depends on the factorizations of two huge,
difficult-to-factor constants. It either double-frees or not depending on
those constants.

Surely your software cannot decide that...

What you probably mean is something like "100% on real programs rather than
contrived cases". Of course, in that case, your definition of 'real programs'
is the catch. :P

Sometimes things that seem like they should always work except on contrived
junk like the above example actually run into limitations in practice because
macros and machine code generation produce ... well ... contrived junk from
time to time.

~~~
WalterBright
> Surely your software cannot decide that...

The D implementation would reject such code. The DFA assumes all control paths
are executed. For example,

    if (c) free(p);
    *p = 3;

is rejected as use-after-free, and

    if (c) free(p);
    if (!c) *p = 3;

is also rejected as use-after-free. If the DFA is done properly, you will not
be able to trick it.

~~~
UncleMeat
Then that doesn't mean "0% of the negatives".

~~~
bonzini
No, it means you have false positives. But no false negatives.

~~~
UncleMeat
And that's a "negative" in a practical sense.

An abstract interpretation that outputs Top for all programs is sound but
useless. In practice, most sound static analyses for complex problems aren't
too far from that.

~~~
bonzini
It's not a "negative", it's a disadvantage. "Negative" has a specific meaning
that should not be used in this context.

Safe Rust is also in the same boat: it has a lot of false positives, programs
that are rejected by the borrow checker even though they would be okay, and
yet it's being used just fine. Think of doubly linked lists, which are pretty
much impossible to implement in safe Rust unless you replace pointers with
integer IDs, which basically disables borrow checking. Non-lexical lifetimes
are an example of downright changing the definition of the language in order
to remove some of these false positives.

~~~
UncleMeat
I didn't mean it that way. I read it as "downside" rather than "false
negative", especially because a sound static analysis is trivial and not
something to be proud of in the abstract.

"Output Top" is sound for all non-inverted lattices and takes constant time.
Woohoo! But it is also useless.

------
mynegation
I, at one time, worked on a tool, commercial and external to the compiler,
that did this (among other things). Probably the most intellectually
challenging job I have ever had. I am happy static analysis makes inroads into
mainstream!

A few takeaways from that time: interprocedural analysis matters. If your
function reallocates a pointer passed as an argument, you want to treat it as
‘free’ with regard to that argument; conversely, if your function returns
newly allocated memory, you want to mark it as such, and so on. There is also
a trade-off between the breadth of the analysis and the human ability to
comprehend it; the author mentions a 110-node path in the article.

The subject of my unfinished PhD thesis, and something I hope also picks up,
is the combination of static and dynamic analysis, used iteratively. If your
static analysis flags a suspicious path but does not have the means to figure
out whether it is a true positive, instrument it and leave it to the dynamic
analysis to run through it (the idea here is that total instrumentation a la
valgrind is detrimental to performance, so you will get some gain from
selective instrumentation). Conversely, dynamic analysis may provide some
hints as to where static analysis should be applied at a greater depth, and
provide automatic annotation of functions with regard to their behaviour and -
possibly - invariants, which help with the state explosion.

~~~
DyslexicAtheist
Ca. 2000-2004 I had the luck to work on a massive C/C++ code base building
base station / telecoms infrastructure. We had several hundred engineers
contributing code with all kinds of different philosophies (many grey beards
who had done C/C++ for decades).

Running Flexelint was part of our CI chain and also part of the internal
coding standard (e.g. the definition of done). There was no other time in my
life when I learned as much about secure coding as I did back then. The
biggest challenge was agreeing on false positives, and we had one guy on the
team who maintained the official wiki document on when a lint warning needed
to go on a whitelist, with an agreed description of why. The initial overhead
of becoming _lint-clean_ got a lot of push-back, but thanks to management
support we got there, and you could really see how things stayed at that level
even after years.

It felt at times bureaucratic, or like yak shaving, but in retrospect linting
was what kept the code base at a quality I haven't seen again since. It also
ensured everyone was on the same page. Taking linting seriously required a
small learning curve, though, and led to some discussions here and there.
These discussions were really valuable, since we got to learn from one another
too.

When I left and went to another project, it felt like a step back: we were
chasing the same old bugs due to bad coding practices, and it was a major
regression in my career as a dev. I miss those days.

Really love that this is becoming part of GCC.

~~~
neilv
Kudos for lint-clean, and for high-quality C/C++ programming. One of my own
stories of that (coincidentally, from when I worked on dev tools for
aerospace/datacomm):
[https://news.ycombinator.com/item?id=21158546](https://news.ycombinator.com/item?id=21158546)

------
archi42
Oh, that's really nice. Though, as a user one should remember: The approach
described here gives up at some point. So it doesn't prove the absence of a
bug class (e.g. double free), but it finds some instances. Which is already a
very good thing, and hugely non-trivial.

The problem with "not giving up at some point" is the computational
complexity: Analyzing big code bases takes half an eternity (days), while
using huge amounts of memory (>128GB). And once you enter the "least-defined
state", you either throw lots of false positives (which gives the users a hard
time) or you need to "give up" (and hence potentially miss bugs).

Disclaimer: I work for a company that builds static analysis tools. I don't
see this as competition, though. Our tools are used in industries where
"safety-critical" is _really_ important - so the "giving up" part of the
analysis is not an option for us, and relying solely on GCC isn't an option
for our customers either ;-)

~~~
a1369209993
Per Rice's theorem, it's _not_ giving up that's not an option; it's just a
question of whether you have false positives or false negatives. (To be fair,
for safety-critical code, insisting on only false positives (i.e. treating
anything you give up on as a positive) is a pretty reasonable choice.)

~~~
nullc
A tool like this could be sound but incomplete. E.g. return true, false, or
idunno.

~~~
a1369209993
Yes, exactly; "idunno" is the giving-up answer.

------
_bxg1
I have to wonder if Rust is putting pressure on C/C++ to have more static
analysis (while at the same time blazing trails in what's possible in that
space, and in terms of error message helpfulness). I think it's a great idea
to start baking these things into the compiler, even if it will never be 100%
free of false negatives because of the limitations of what the language can
express and guarantee. It still seems like a great way to eliminate a lot of
common problems, as a default across the ecosystem instead of as an extra
step.

~~~
rurban
It's clang, not Rust.

And clang's analyzer has a different, web-based UI concept, which is far
superior. And for on-screen output, valgrind has a far superior solution. I
don't see the advantage of gcc's analyzer yet. Far too verbose. And the most
important errors, like wrong optimizer decisions based on their interpretation
of UB code, are still silenced.

~~~
TwoBit
I'd love to see a compiler warn me that it's doing something potentially
unexpected due to UB considerations.

~~~
ali_m
clang has UBSan, which adds runtime checks for detecting various kinds of
undefined behaviour:
[https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)

~~~
rurban
That's not helping. I'm talking about wrong decisions, made during compile-
time optimizations, like assuming dead code or a value being NULL, and then
ripping apart the written code. Without warning.

Or the famous optimized away memset call. Which is a security issue. At least
a warning would be in order. Or at least an analyzer warning.

------
saagarjha
Nice, this looks pretty cool! It seems a bit like Clang’s static analyzer:
[https://clang-analyzer.llvm.org/](https://clang-analyzer.llvm.org/)

------
jchw
> As of GCC 10, that option text is now a clickable hyperlink (again, assuming
> a sufficiently capable terminal)

Seems like mostly only GNOME Terminal and iTerm2. Here are some that
apparently won't work:

- Konsole

- Kitty

- LXDE Terminal

- MATE Terminal

- hyper

- Windows Terminal

- ConEmu

- PuTTY

... so it's kind of weird to suggest this is an accepted standard, especially
since some of the discussions in feature requests suggest they will likely not
implement it due to security concerns or otherwise.

~~~
guerrilla
How does one produce such a link in output? (Not that I really want this
bloat and increased attack surface.)

~~~
bonzini
Escape sequences. It's similar to the "tell the emulator about the current
directory" feature that is used to open new windows on the current directory.

~~~
labawi
Does the CWD feature actually use terminal escape sequences?

I always assumed the emulator accessed the working directory of the child
process (as in /proc/$PID/cwd). On my terminal the CWD feature only seems to
work for the topmost shell, and symlinks get resolved.

EDIT: The linked bug report mentions OSC 7 (presumably an escape sequence) as
one of the ways to track the CWD.

------
leni536
It looks great and useful. I suspect that this only works within a single
translation unit and can't work across separate translation units. But maybe
it could work together with LTO; that would be pretty awesome.

Some of the worst lifetime issues are third-party library calls with unclear
ownership semantics, where static analyzers are just as clueless as you are.
The function signature doesn't help you out in this regard (in C). My recent
"favorite" is libzip's zip_open_from_source, which conditionally takes
ownership of its zip_source_t* argument.

[https://libzip.org/documentation/zip_open_from_source.html](https://libzip.org/documentation/zip_open_from_source.html)

[https://libzip.org/documentation/zip_source_free.html](https://libzip.org/documentation/zip_source_free.html)

------
mshockwave
Just a side note that the equivalent solution in LLVM/Clang is the Clang
Static Analyzer: [https://clang-analyzer.llvm.org](https://clang-analyzer.llvm.org).
But it's an external tool rather than being integrated into clang.

------
3fe9a03ccd14ca5
These types of tools go a really long way in improving the reliability and
safety of C code.

Hats off to the redhat team for putting in the effort on this. Their blog
posts have been really interesting. It’s definitely changing my perception of
what redhat really does.

~~~
jabedude
I've always had a positive impression of redhat, but I was recently blown away
with their dedication to upstreaming contributions across different open
source projects. I was investigating a new Linux kernel feature that redhat
contributed and saw that the same developer opened pull requests that added
support for the new kernel feature in three major open source projects. And
one of the projects took over a year to accept the changes, but he was
persistent in reaching out, making requested changes, etc. It really shows the
passion at the company to share their contributions.

~~~
Vogtinator
It doesn't have to be passion - having something upstreamed has a lot of other
benefits as well.

------
olivierduval
"using the compiler the code is written in as part of the compile-edit-debug
cycle, rather than having static analysis as an extra tool “on the side”
(perhaps proprietary)"

Mmmm... and why not have an external tool, part of the GCC family but with a
proper API, to allow using ANY TOOL, instead of bloating GCC with one more
tool that won't be usable with other compilers and will need specific
maintainers, although this field is already really complex and needs a lot of
different kinds of knowledge?

For example, it could be based on intermediate code, so better than just
source-code or machine-code analysis...

~~~
olivierduval
Just to be more specific: why not use the "UNIX philosophy", with a compiler
to compile (translate to an Intermediate Representation), an optimizer to
optimize, an assembler to produce machine code from IR (with register
allocation), and so on?

~~~
UncleMeat
Because GCC is explicitly designed to be a tangled mess (as opposed to
Clang/LLVM), in part because it makes it harder for groups with different
beliefs about FLOSS code to repurpose it.

It's a choice that has caused them to cede a lot of territory to Clang/LLVM.

~~~
ndesaulniers
I think the major mistake was FSF refusing the objective c front end from
Apple.

~~~
NovaX
They also refused Apple's offer to relicense LLVM to the GPL, contribute it to
GCC, and assign copyright to FSF.

~~~
saagarjha
When did Apple offer this?

~~~
NovaX
In 2005.

[https://lists.gnu.org/archive/html/emacs-devel/2015-02/msg00594.html](https://lists.gnu.org/archive/html/emacs-devel/2015-02/msg00594.html)

------
thekhatribharat
I believe _type systems / type theory_ is likely going to be the most popular
method for _formal verification of programs_ (aka _static analysis_). And of
course, there's a limit to what you can _prove_ about programs (ref: Rice's
theorem).

~~~
jfkebwjsbx
Static analysis is not formal verification.

~~~
thekhatribharat
Sure, things like enforcing style guides, etc. can be seen as _lightweight_
formal verification.

~~~
irundebian
Yes, but you wrote "formal verification of programs (aka static analysis)".
Formal verification is not also known as static analysis.

------
wyldfire
Previously submitted as
[https://news.ycombinator.com/item?id=22708586](https://news.ycombinator.com/item?id=22708586)

------
ufo
I wonder... Has anyone here tried using gcc with the -fanalyzer option? Did it
find any bugs that you did not know about?

------
ape4
You don't want that option every time since it's slower. But I wonder if there
would be a smart way to run it occasionally, like an option to run -fanalyzer
every 10th time, or when the size of a source file changes a lot, etc.

~~~
saagarjha
Perhaps as part of your CI?

------
6gvONxR4sf7o
I'm super happy to see more static tools to prevent or at least find buggy
code.

------
google234123
How far behind LLVM/Clang is this?

~~~
anarazel
Last time I checked - I'm not sure how long ago that was - llvm didn't detect
double frees etc. statically. There's an annotation framework for locking,
though, which I hope to play with more.

