
A semantic model for a substantial fragment of C - edwintorok
https://www.cl.cam.ac.uk/~pes20/cerberus/
======
gravypod
I've heard many "real" computer scientists say the "C remains central to our
computing infrastructure but still lacks a clear and complete semantics" line
but I have no idea what they mean.

Can someone explain what this means to the academic-CS type? How can
something with a parser & lexer not have "clear and complete semantics"? If
you have a computer language with defined syntactic elements, how can you
not have clearly defined meanings for combinations of that syntax?

Does someone have an example of a piece of C that is not "deterministic" (the
best word I have for this) to a lexer & parser?

~~~
brianberns
Example of undefined behavior: What arguments will f be called with?

    
    
       int i = 0;
       f(i++, i++);
    

More examples here:
[https://www.cl.cam.ac.uk/teaching/1415/CandC++/lecture10.pdf](https://www.cl.cam.ac.uk/teaching/1415/CandC++/lecture10.pdf)

~~~
gravypod
I'd assume it would be

    
    
        f(1, 0); 
    

Because function arguments are pushed in reverse order in C, and funnily
enough, GCC actually does that, even though that's crazy. But clang catches it
as an "unsequenced modification".

I don't think that's a good example. The C standards make no guarantees about
unsequenced modifications, so relying on them means the programmer isn't using
C; they're using a language derivative implemented by their compiler.

Why do people say "C remains central to our computing infrastructure but still
lacks a clear and complete semantics" when the standards are very clear about
what it allows? If you use something it doesn't allow you can't blame the spec
for that.

~~~
laughinghan
You're missing the point.

Can we agree that people are learning a programming language that they call
"C", and writing text files in that language and running programs that they
call "C compilers" such as gcc on those text files to produce binary
executables?

You are correct to note that the programming language I'm describing does not
perfectly conform to any of the documents ever called a "C standard". But that
programming language is central to our computing infrastructure, and whether
you or the C standard authors like it or not, that programming language is
widely called "C".

That language's semantics are unclear and incomplete: even though both gcc
and clang are commonly called "C compilers", they have differing behavior on
`f(i++, i++)`, as you found; reportedly, there's code on which gcc behaves
differently depending on the optimization level.

~~~
wruza
There is a point that you're probably missing. UB is one of the greatest
things that could have been "invented" in engineering. Instead of covering
idiotic cases that should never happen with complex algorithms that poison
code simplicity, you can just declare them UB and rely on your client code to
respect that. If you have a registry of objects with names, you don't have to
check name uniqueness, because you can state that adding a duplicate name is
UB. Likewise for removing a non-existent record. If a range of a function's
inputs produces wrong results (anything around limits.h), don't fix it;
you're probably not even able to do so correctly. Just call it UB for
arguments above INT_MAX/4. I use this a lot in my projects, because it lets
me start without waiting for all these cases to be covered in the callee when
there are four to ten absolutely correct call sites that will never produce UB.
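To make the registry example concrete, here is a minimal sketch (names and types are mine, not from any real codebase) of a contract that declares duplicate names UB instead of checking for them:

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 16

struct registry {
    const char *names[MAX_ENTRIES];
    int count;
};

/* Contract: adding a name that is already present is undefined behavior.
 * The implementation therefore never checks for duplicates; the few,
 * audited call sites are responsible for upholding the precondition. */
void registry_add(struct registry *r, const char *name) {
    assert(r->count < MAX_ENTRIES);
    r->names[r->count++] = name;
}

/* Contract: removing a name that is not present is undefined behavior. */
void registry_remove(struct registry *r, const char *name) {
    int i = 0;
    while (strcmp(r->names[i], name) != 0)
        i++;                      /* walks off the array if absent: UB */
    r->names[i] = r->names[--r->count];
}
```

The check the callee skips doesn't disappear; it moves into the heads of the callers, which is exactly the trade being described.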

UB is not a flaw. It is an efficient engineering method. The kids will be
alright; it's a hard thing to learn, but once you get it, it is simple, clear
and complete. If kids want to stay in a sandbox, well, we have plenty of
sandboxes.

Those who blame UB probably never turned their attention to electronics,
physics or chemistry. _That_ is a mess. Every time you pass wrong arguments or
do something out of spec, it can explode and really hurt your foot.

Edit: typo

~~~
laughinghan
If you're saying that sometimes UB is a good idea, as illustrated by your
examples, then I agree with you.

But if you're saying all of C's UB is good, and none of it is a flaw, I think
you'll find yourself in the minority there. Do you consider integer overflow
to be an "idiotic case that should never happen and poisons code simplicity"?
Relying on it to be two's complement is so common that successors like Java,
Go, and Rust all standardized it.

If you'll concede that _some_ of C's UB is relied upon in real, important,
non-idiotic code, then it follows that it is indeed important for there to be
efforts like Cerberus that develop unambiguous semantics for more of C than is
currently covered by C standards.

~~~
wruza
>integer overflow, two's complement

As UB is an engineering method, we need to look at it from a practical point
of view ONLY. To my limited knowledge, platforms with truly random overflow-
then-back behavior (x+n-m) do not exist. Note this has nothing to do with
n's-complement; it is just a sad consequence of (x+n) being undefined, so the
compiler is free to assume (x+n <= INT_MAX) and throw out some "dead" code,
which is not dead, ymmv. This specific UB is about what you get in
(INT_MAX + positive) cases, and these are very strange, because in real life
127+3 is 130, not -126. It doesn't even help in the famous m=(l+r)/2.

Though I can see the social rationale behind the standardization you
mentioned. Some behaviors appear undefined too _suddenly_ to newcomers who
don't bother to read the manual, or at least a FAQ, and even those who have
been programming for 5-7 years still cannot explain what the restrict keyword
means. Maybe I'm wrong or somewhat elitist about it; today's software
development is different. We have lots of programmers who hardly go further
than learning "a new syntax", and we are doomed to use them as a productive
force. Then these UB-clearing moves make perfect sense; that's why I
mentioned sandbox languages.

~~~
laughinghan
I have no idea what your point is, if any.

I will note that "from [a] practical point of view ONLY", it is incredibly
impractical for a compiler to gratuitously break real code that relies on
behavior that deviates from spec; if that reliance is common, the practical
thing to do is to support that behavior, which they do: "The optimizer does go
to some effort to "do the right thing" when it is obvious what the programmer
meant (such as code that does "* (int *)P" when P is a pointer to float)."
--a blog post from clang/llvm, the second most widely used C compiler:
[http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html](http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html)

Which is why projects like Cerberus are valuable.

~~~
wruza
My point is that UB is almost never bad, and when it is, there is a live case
behind it. There is no value in defining the undefined. Clang does what it
does at internal levels for a completely different purpose than supporting
"real code that relies on behavior that deviates from spec". The article
describes that in detail. That is completely disconnected from the value of
the Cerberus-like thing you mentioned.

------
nickpsecurity
Also see the work done in the under-utilized K Framework which captures &
rejects undefined behavior as well:

[http://fsl.cs.illinois.edu/index.php/Defining_the_Undefinedn...](http://fsl.cs.illinois.edu/index.php/Defining_the_Undefinedness_of_C)

Since it's K, they were able to turn it into a GCC-like compiler that
verifies things about your application. If it even compiles, you have no
undefined behavior.

[https://github.com/kframework/c-semantics](https://github.com/kframework/c-semantics)

On the concurrency side, it's built on top of the Maude tool, which an
inexperienced student was able to use to find errors in Eiffel's SCOOP
concurrency model. So it can probably handle that aspect as well.

~~~
fiddlerwoaroof
Any write up about the errors they found? Are they a problem with SCOOP as
such or an implementation issue?

~~~
nickpsecurity
My memory is hazy but the abstract suggests this is the one I read before:

[http://se.ethz.ch/~meyer/publications/concurrency/scoop_maud...](http://se.ethz.ch/~meyer/publications/concurrency/scoop_maude.pdf)

The good thing about SCOOP in particular is that there was a lot of CompSci
work improving on it in many ways. Most of the safer concurrency models
didn't get such attention. Example:

[http://cme.ethz.ch/publications/](http://cme.ethz.ch/publications/)

------
JoelJacobson
"Its front-end is written from scratch to closely follow the C11 standard,
including a parser that follows the C11 standard grammar, and a typechecker."

A C front-end written from scratch sounds interesting. I cannot find the code
for this front-end anywhere. Has anyone found it, or is it perhaps not
released yet?

------
qznc
There is also the K framework, where C semantics were specified and an
interpreter is generated.

[http://www.kframework.org/index.php/Chucky_Ellison](http://www.kframework.org/index.php/Chucky_Ellison)

~~~
nickpsecurity
I didn't even see you beat me to it. You should also link to KCC when you
bring it up. Developers love practical stuff they can download and use to
their benefit. :)

[https://news.ycombinator.com/item?id=14012891](https://news.ycombinator.com/item?id=14012891)

