
Achieving Safety Incrementally with Checked C - pjmlp
https://www.cs.umd.edu/~mwh/papers/ruef18checkedc-incr.html
======
Animats
Interesting. The important development is the porting tool which automatically
infers the size of objects. Many people, including me [1], have proposed safer
variants of C. The problem is converting legacy code. They've made some
progress, but not enough to use yet.

The important result is in Table 1 of the actual paper.[2] The converter looks
at pointers, and classifies them as "pointer", "array of known size", or
"unknown". "Unknown" ranges from 49% to 76%. You can't convert an existing
code base automatically with that low a success rate.

The big design mistake in C was "pointer" and "array" being the same thing
syntactically. Trying to undo that has taken decades.

Their converter is on Github.[3] This is a Microsoft project.

The syntax they use is rather clunky:

    
    
        void copy(char *dst : bytecount(n),
                  const char *src : bytecount(n),
                  size_t n);
    

That's halfway between C and Pascal/Ada/Rust.

I proposed more C-like syntax like this:

    
    
        void copy(size_t n, char (&a)[n], const char (&b)[n])
    

It's annoying that you need the parentheses, but that's C/C++ syntax.

I wanted to put C++ style references into C so you could pass arrays as
arrays, rather than degrading them to pointers. I also wanted to add slices of
arrays, so you could pass a slice of an array as an array. Those are standard
features of most later languages, and handle most of the cases for which
pointer arithmetic is used. C doesn't have the expressive power to even talk
about those things unambiguously. So they can't be checked either at compile
time or run time.

The hard job is converting legacy code by automatically inferring the
information that C source code lacks. It's good that they're trying. But a 50%
success rate indicates they're just getting the easy hits. If anybody ever
gets a good array size inference system, it would be a huge win. Then existing
C could be converted to something better. Not necessarily "Checked C". Rust or
Go would be options.

[1]
[http://www.animats.com/papers/languages/safearraysforc43.pdf](http://www.animats.com/papers/languages/safearraysforc43.pdf)

[2] [http://www.cs.umd.edu/~mwh/papers/checkedc-incr.pdf](http://www.cs.umd.edu/~mwh/papers/checkedc-incr.pdf)

[3] [https://github.com/Microsoft/checkedc-clang](https://github.com/Microsoft/checkedc-clang)

~~~
duneroadrunner
> "Unknown" ranges from 49% to 76%.

Yeah, this is interesting. They're saying they can't determine whether a
pointer targets an array buffer or not? Perhaps they might want to take a look
at the (long neglected) "C to SaferCPlusPlus" translator[1] which can do this.
(It was an unexpectedly taxing undertaking though.) It converts C arrays and
allocated buffers used as arrays into memory safe implementations of
_std::array<>_s and _std::vector<>_s, so failure to properly identify them
would generally result in output code that wouldn't compile.

The examples they give of problematic code in the paper:

    
    
        void f(int* a) {
            *(int**)a = a;
        }
    

and

    
    
        f1(((int*) 0x8f8000));
    

don't strike me as the kind you would often encounter in real-world code.

> The syntax they use is rather clunky

The output code of the "C to SaferCPlusPlus" translator replaces the types and
declarations with macros[2] that can be redefined with a compile-time
directive to either use the safe C++ implementation, or revert to the original
unsafe native C implementation. The argument being that using macros instead
of custom syntax makes the source code more versatile. And existing C
programmers already "get" macros.

[1] shameless plug: [https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...](https://github.com/duneroadrunner/SaferCPlusPlus-AutoTranslation)

[2]
[https://github.com/duneroadrunner/SaferCPlusPlus/blob/master...](https://github.com/duneroadrunner/SaferCPlusPlus/blob/master/mselegacyhelpers.h)

~~~
Animats
Saw this in the translated "SaferCPlusPlus" output examples [1]:

    
    
        static void string_set(char** out, const char* in)
    

What happened there? Where are the array types? Wrong place to look?

If inference can't make a definitely good decision, maybe translators should
guess, conservatively. That is, if it looks like something needs an array type
parameter, make it an array type parameter with subscript checking. Then run
tests on the translated program and see if that works. That's what humans do
on such code. Machine learning has potential here. For any array in a working
program, there must be some expression of some variables that expresses the
size of the array. If humans can't find that expression, the program is
unmaintainable and probably has a bug.

There are really 3 cases.

1\. This is a pointer, and it's never subscripted or offset. That's a pointer
to a single instance of something.

2\. This is a pointer which is subscripted or offset, and we can tell from
context how big the array is.

3\. This is a pointer which is subscripted or offset, but auto-translation
fails to figure out how big the array is supposed to be.

The problem is to convert (3) into (2).

I tend to think that a good metric for C code quality is how hard that is. If
it's not obvious by looking how big something is supposed to be, there's
probably a potential bug.

[1] [https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...](https://github.com/duneroadrunner/SaferCPlusPlus-AutoTranslation/tree/master/examples/lodepng/lodepng_translated/src/lodepng.cpp)

~~~
duneroadrunner
Thanks for noticing :) It's been quite a while since I worked on the code, but
I believe that the translator intentionally left types declared as "char *"
unmodified, assuming that they were being used as strings [1] rather than
"regular" array buffers. I'm guessing that dealing with strings would have
been a lot more work because it would require providing safe compatible
replacements for all the standard C library string functions.

I think you should find that array buffers of other types, like "unsigned
char" or "const unsigned char", and their associated pointer iterators are
translated to their corresponding macros. I'd be interested if you find
otherwise. If you're interested, the relevant code for the translator is in
the "safercpp" subdirectory [2]. It's not super-well commented so if you have
any questions feel free to post them in the "issues" section of the
repository.

[1] [https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...](https://github.com/duneroadrunner/SaferCPlusPlus-AutoTranslation/blob/cf72155bbc7cf7f9e9288c22cbb332c9d2f5e16f/mutator_snapshot/safercpp/safercpp-arr.cpp#L1439-L1440)

[2] [https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...](https://github.com/duneroadrunner/SaferCPlusPlus-AutoTranslation/tree/cf72155bbc7cf7f9e9288c22cbb332c9d2f5e16f/mutator_snapshot/safercpp)

~~~
Animats
OK, here's a non-string function where the translator is trying to deal with C
written like it's 1980:

    
    
        static unsigned countZeros(MSE_LH_ARRAY_ITERATOR_TYPE(const unsigned char) data,
            size_t size, size_t pos)
        {
          MSE_LH_ARRAY_ITERATOR_TYPE(const unsigned char)  start = data + pos;
          MSE_LH_ARRAY_ITERATOR_TYPE(const unsigned char)  end = start + 
              MAX_SUPPORTED_DEFLATE_LENGTH;
          if(end > data + size) end = data + size;
          data = start;
          while(data != end && *data == 0) ++data;
          /*subtracting two addresses returned as 32-bit number
              (max value is MAX_SUPPORTED_DEFLATE_LENGTH)*/
          return (unsigned)(data - start);
        }
    

What guarantees that the "while" loop will not run away and take "data"
outside the array bounds?

I proposed a version of C with slices and references, where you could write
that like this:

    
    
        static unsigned countZeros(const unsigned char (&data)[size],
            size_t size, size_t pos)
        {
          const unsigned char (&data1)[size-pos] = data[pos:size-pos]; // slice
          size_t cnt = 0;
          while (cnt < LENGTH(data1) && cnt < MAX_SUPPORTED_DEFLATE_LENGTH && data1[cnt] == 0) ++cnt;
          return (unsigned)cnt;
        }
    

The "data" parameter has size info, so the language knows how big it is. The
"data1" variable is a slice of "data". This eliminates the need for pointer
arithmetic. Much pointer arithmetic in C, especially where you have a pointer
partway into an array, is an attempt to emulate a slice.

Automatically extracting slice usage from code with pointer arithmetic is a
tough problem. But not impossible. When you see code constructing something
like

    
    
        data = start;
        while(data != end && *data == 0) ++data;
    

you have to recognize that as subscripting.

    
    
        while(data != end && *data == 0) ++data;
        return (unsigned)(data - start);
    

should become first

    
    
        data = start;
        size_t dataix = 0;
        while(&data[dataix] != end && data[dataix] == 0) ++dataix;
        return (unsigned)(&data[dataix] - start);
    

by substituting subscripting for pointer arithmetic.

Next, when you see an offset array being created, as in

    
    
        start = data + pos;
    

turn that into a slice:

    
    
        const unsigned char (&data1)[size-pos] = data[pos:size-pos]; // slice
    

The slice is the same pointer, but there's now valid size information
associated with it.

If you do transformations like that, you get a version of C where subscript
checking is possible. You can then hoist or prove out many of the subscript
checks. Here, the compiler would be expected to understand that if an array
subscript is less than LENGTH of the array, it's safe. LENGTH here, as I wrote
in my paper, refers to the length of the array as known to the compiler from
the array declaration. Here, array lengths can be expressions evaluated at
declaration time. That's how length info gets passed around.

    
    
        const unsigned char (&data)[size]
    

as a parameter means "this is an array of size 'size'". "size" comes in via
another parameter. The function can assume "size" is valid, and all callers
must check that, either at compile time or run time.

If you can't write an expression for the size of something, you have a big
problem with your program.
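
For comparison, here is a rough sketch of the same function in standard C, with the pointer walk replaced by subscripts and the slice length written out explicitly. The value of MAX_SUPPORTED_DEFLATE_LENGTH and the early return for pos > size are assumptions made for illustration; they are not taken from the original code.

```c
#include <stddef.h>

/* Assumed value for illustration only; the real library defines its own. */
#define MAX_SUPPORTED_DEFLATE_LENGTH 258

/* Counts leading zero bytes in data[pos..size), capped at the maximum length. */
static unsigned countZeros(const unsigned char *data, size_t size, size_t pos)
{
  if (pos > size) return 0;   /* assumed policy: treat a bad offset as an empty slice */
  size_t len = size - pos;    /* length of the slice data[pos..size) */
  if (len > MAX_SUPPORTED_DEFLATE_LENGTH) len = MAX_SUPPORTED_DEFLATE_LENGTH;
  size_t cnt = 0;
  /* cnt < len guarantees pos + cnt < size, so every subscript is in bounds */
  while (cnt < len && data[pos + cnt] == 0) ++cnt;
  return (unsigned)cnt;
}
```

Every subscript here is guarded by a comparison against a length derived from "size", which is the property a checker would need to prove or test at run time.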

~~~
duneroadrunner
> What guarantees that the "while" loop will not run away and take "data"
> outside the array bounds?

What do you mean "the array bounds"? The code is memory safe. "data" is an
iterator that knows exactly what array/container it's pointing to, and that
container knows its own size. Dereferences are bounds checked (by default).

This translated code is not intended to be performance optimal. The translator
does not add, remove or rearrange any of the original source code elements, it
simply replaces some of them with macros that are defined as functionally
equivalent, memory safe C++ substitutes for the original element. Doing it
this way has the benefit of allowing you to "disable" the memory safety
mechanisms by reverting the macro definitions to the original (unsafe)
elements.
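
As a loose analogy in plain C (the real translator emits C++ types, and the names below are hypothetical, not the macros from mselegacyhelpers.h), the idea of a macro that can be switched between a checked and an unchecked definition looks roughly like this:

```c
#include <stddef.h>
#include <stdlib.h>

/* Returns i if it is a valid index for an array of len elements, else aborts. */
static size_t checked_index(size_t i, size_t len)
{
    if (i >= len) abort();
    return i;
}

/* Hypothetical macro, for illustration only. */
#ifdef DISABLE_CHECKS
  /* Revert to the original unchecked access. */
  #define ARRAY_AT(arr, len, i) ((arr)[(i)])
#else
  /* Checked access: an out-of-range index aborts the program. */
  #define ARRAY_AT(arr, len, i) ((arr)[checked_index((i), (len))])
#endif

/* The source code is identical under either definition of ARRAY_AT. */
static int sum(const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; ++i)
        total += ARRAY_AT(values, count, i);
    return total;
}
```

Compiling with DISABLE_CHECKS defined gives back the original unchecked indexing without touching the function body.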

I have not yet gotten around to addressing performance of the translated code.
In order to preserve the ability to revert back to pure C code, there would
need to be an additional set of macros (like maybe an "array view" macro) that
could be mapped to their (safe) high performance C++ counterparts but that
would be more restricted in their usage.

But at this point I think the value of that is questionable. If you need your
code to be memory safe and high performance, the most expedient thing to do is
to just accept the translated code as C++ code (or SaferCPlusPlus code) and
re-optimize the performance bottlenecks as idiomatic SaferCPlusPlus code.
SaferCPlusPlus is, along with Rust, the fastest [1] option for memory (and
data race) safe programming.

And if you don't like the C++ language as a whole, just (define and) stick to a
subset you're comfortable with, right? I mean, (I think your proposal is fine
as an extension of C, but) I don't see the point in extending the C language
with things like views/slices/spans, when the C language is already extended
with those. It's called C++ (or some subset thereof) right? And with C++ you
can solve the memory (and data race) issues much more comprehensively and
performantly (if that's a word :) than with any extension to C. No?

[1] [https://github.com/duneroadrunner/SaferCPlusPlus-BenchmarksG...](https://github.com/duneroadrunner/SaferCPlusPlus-BenchmarksGame)

------
pornel
Incremental improvement is a hard constraint here.

Keeping track of pointer and length in two separate variables, when the C
program has the freedom to do whatever it wants with them, creates a lot of
hassle and edge cases to deal with. If C had a concept of a slice/array view
that manages ptr+length as a single entity, it'd be so much easier.
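
A minimal sketch of what such an entity could look like if written by hand in today's C. The type and function names are made up for illustration, and aborting on a bad index is just one possible policy:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical slice type: the pointer and its length travel together. */
typedef struct {
    const unsigned char *ptr;
    size_t len;
} byte_slice;

/* Returns the sub-slice [start, start + count); aborts if it does not fit. */
static byte_slice slice_sub(byte_slice s, size_t start, size_t count)
{
    if (start > s.len || count > s.len - start) abort();
    byte_slice r = { s.ptr + start, count };
    return r;
}

/* Bounds-checked element access. */
static unsigned char slice_at(byte_slice s, size_t i)
{
    if (i >= s.len) abort();
    return s.ptr[i];
}
```

Nothing in the language stops code from reaching into ptr directly, which is the gap a built-in slice type would close.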

~~~
duneroadrunner
For cases where the platform supports C++ (and its standard library), there is
kind of a corresponding "checked C++"[1] that also supports the "completely
incremental" migration approach. (And obviously supports "array view" type
objects.)

[1] shameless plug:
[https://github.com/duneroadrunner/SaferCPlusPlus#safercplusp...](https://github.com/duneroadrunner/SaferCPlusPlus#safercplusplus-versus-checked-c)

------
dwheeler
This is really promising work. However, it doesn't support free() at all;
without free(), it seems less useful. It's not clear to me what supporting
free() would involve; I'd love to hear comments.

~~~
kayamon
Memory safety requires garbage collection. A system with explicit frees can
never be safe.

------
ericlewis
This may be a silly question, but should these kinds of problems be solvable
with linting?

~~~
dwheeler
Short answer: No.

The original "lint" was developed many decades ago. The problem is that C just
doesn't provide enough type information to really do this properly. For
example, there's no difference between "pointer to one stand-alone object",
"pointer to beginning of an array (and here is its size)", and "pointer to an
object within an array (and here is its size)".
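
A small illustration of that point (function names are made up for the example): the same parameter type accepts all three kinds of pointer, so nothing in the signature tells a tool how far it may be indexed.

```c
#include <stddef.h>

/* The parameter type is the same no matter what the pointer refers to. */
static int first(const int *p) { return *p; }

static int demo(void)
{
    int single = 1;
    int array[4] = {10, 20, 30, 40};
    return first(&single)     /* pointer to one stand-alone object */
         + first(array)       /* pointer to the beginning of an array */
         + first(&array[2]);  /* pointer to an object within an array */
}
```

Inside first(), an access such as p[1] would be valid for two of the three calls and undefined for the other, and only the call sites know which is which.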

For many academics this is a "solved" problem - just use almost any language
except C, C++, Forth, and assembly. There were solutions to these problems
before C was created. Of course, out here in the real world there are large
existing C programs that no one is going to convert, and there are _reasons_
people write new C code today.

