
ManagedC: Memory safe execution of C on a JVM [pdf] - mike_hearn
http://chrisseaton.com/plas15/safec.pdf
======
c99throwa1
It is an interesting implementation. Tl;dr: they took a C interpreter that
runs on the JVM (with unsafe memory management) and implemented fat pointers
(Java object + offset) in the interpreter.

The paper claims the implementation obeys C99, but there seems to be a
violation around pointer round-tripping. Specifically, they forbid all casts
from integers to pointers.

    
    
      char a, *p;
      
      p = (char*)(uintptr_t)&a;
    

In most C implementations, this is perfectly valid. In the paper's, it breaks.

Anyway, very interesting.

~~~
_delirium
They discuss this in section 3.2 of the paper. The C99 standard (in section
6.3.2.3) says that integers can be cast to pointers, but, except in the
special case of the integer 0, "the result is implementation-defined, might
not be correctly aligned, might not point to an entity of the referenced type,
and might be a trap representation". This implementation chooses the last
option: if you cast an integer to a pointer, you get a pointer as a result,
but one that you can't successfully dereference.

~~~
c99throwa2
Of course I read that, or else I would have no idea they broke pointer round-
tripping.

I believe this violates C99 §7.18.1.4 (if the TruffleC language defines the
uintptr_t type in stdint.h):

    
    
      The following type designates an unsigned integer type
      with the property that any valid pointer to void can be
      converted to this type, then converted back to pointer to
      void, and the result will compare equal to the original
      pointer:
    
        uintptr_t
    

If the TruffleC language does not define uintptr_t ("these types are
optional"), then hey, that's fine. A lot of valid code won't compile, though.
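
A minimal sketch of the round trip that §7.18.1.4 describes, assuming the
implementation provides uintptr_t. The function name `round_trip` is made up
for illustration and does not come from the paper:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a void pointer to uintptr_t and back.
 * Where uintptr_t exists, C99 7.18.1.4 says the result compares equal
 * to the original pointer. */
void *round_trip(void *p)
{
    uintptr_t bits = (uintptr_t)p;  /* pointer -> integer */
    void *q = (void *)bits;         /* integer -> pointer */
    assert(q == p);
    return q;
}
```

Going by the description above, an implementation that makes every
integer-to-pointer cast yield an unusable pointer would presumably fail on the
use of `q`.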

~~~
mike_hearn
Is uintptr_t widely used? I don't recall ever seeing it before.

~~~
david-given
It's really useful if you want a type to which you can cast _any_ (non-float)
primitive value and know that you can cast it back to the same value you had
before. I've used it as the cell type for a Forth interpreter, for example
(where cells have to be able to represent addresses as well as integers). It's
probably not particularly useful for general application programming.
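
A rough sketch of the Forth-cell idea, assuming a tiny fixed-size data stack.
The names `cell_t`, `push`, `pop` and `forth_store` are hypothetical, not taken
from any real interpreter:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t cell_t;        /* one cell holds an integer or an address */

#define STACK_DEPTH 16
static cell_t stack[STACK_DEPTH];
static int sp;                   /* number of cells currently on the stack */

void push(cell_t v)
{
    assert(sp < STACK_DEPTH);
    stack[sp++] = v;
}

cell_t pop(void)
{
    assert(sp > 0);
    return stack[--sp];
}

/* Forth "!" (store): ( value addr -- ) writes value to the cell at addr. */
void forth_store(void)
{
    cell_t *addr = (cell_t *)pop();  /* top of stack is an address */
    cell_t value = pop();            /* next cell is a plain integer */
    *addr = value;
}
```

The same stack carries both kinds of value, which only works if addresses
survive the trip through the integer type.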

------
friendzis
I have only skimmed the paper so far, but it raises some thoughts on C in
general: why, where and how we use it. I can distinguish 5 distinct use cases
for C:

    
    
      1. Legacy code
      2. Shared (closed) code
      3. Zero runtime dependency code
      4. Hardware control
      5. Resource limitations
    
    

This could be a good thing for (1.): rewriting (most of) a project in a safe
language is just infeasible, yet this offers some guarantees/protection
basically for free. Even if a tool (any tool) catches fire all over the place
on first run because of a bug (in app code), it's still a good thing.

Sometimes we want to share some functionality (a library), yet stay closed
source. Or have the ability to take a file, drop it on a remote machine and
expect it to run. This is what native binaries with a stable API are for. Yet
calling foreign functions in VM'd languages is often so awkward that we just
end up reimplementing a lot of software instead of calling a foreign function
possibly shipped/managed at the OS level.

By hardware control I mean something like writing to 0xABAD1DEA and having
data fly out of a serial port, or a NIC change modes, or something along those
lines. I guess this is doable in managed languages with some built-in magic
proxy object, but I'm not sure this doesn't start with "write a hardware
definition file and rebuild the VM". Just a thought.
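
To make the hardware-control case concrete, here is a hedged sketch of a
memory-mapped register write in plain C. The address is just the made-up one
from the paragraph above, and `reg_write8` and `UART_TX` are illustrative
names, not a real device's API:

```c
#include <assert.h>
#include <stdint.h>

/* Write one byte to a memory-mapped register. `volatile` tells the
 * compiler that the store has side effects and must not be elided. */
void reg_write8(volatile uint8_t *reg, uint8_t value)
{
    assert(reg != 0);
    *reg = value;
}

/* On real hardware the register pointer is built from an integer constant.
 * This address is hypothetical; dereferencing it on a hosted OS would
 * almost certainly crash. */
#define UART_TX ((volatile uint8_t *)(uintptr_t)0xABAD1DEAu)
```

On a microcontroller this would be called as `reg_write8(UART_TX, 'A')`. Note
that the macro is itself an integer-to-pointer cast, which is exactly the
operation a fat-pointer implementation has trouble supporting.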

(5.) is the main reason I'm attempting to discuss this. Small MCUs are still
general purpose computers, just very limited, and are rather good litmus tests
- is it possible to cram a Hello World into an ATtiny with 512 bytes of
memory? Is it possible to run something on a bare-metal ARM Cortex-M3/M4 (no
FPU, MPU/MMU)? No? Then the technology is by no means "general purpose" and we
should thoroughly discuss the limitations it imposes.

Programming is a discipline just too diverse for a single individual to grasp
and too rarely we step outside our boxen to see the whole world.

~~~
iofj
For legacy code a converter might be the better option. E.g.

[http://www.tangiblesoftwaresolutions.com/Product_Details/CPl...](http://www.tangiblesoftwaresolutions.com/Product_Details/CPlusPlus_to_Java_Converter_Details.html)

Yes, it's not perfect but at least debugger, instrumentation, ... will work
against the module.

~~~
friendzis
Well... If you just want to take old abandoned source and somehow run it -
anything goes (IIRC, NumPy requires Fortran for MKL). I meant old projects
that are still maintained at some level yet are large in scope and cannot be
reimplemented incrementally, think OpenSSL.

If we are talking about e.g. servers, we can safely _assume_ pretty beefy
x86-compatible hardware and discuss in the context of that. In my book C is
the ultimate at general-purposeness and anything we attempt to do with C must
be discussed in that light.

------
munin
I think that this work is a lot more thorough:
[http://www.cl.cam.ac.uk/~dc552/papers/asplos15-memory-safe-c...](http://www.cl.cam.ac.uk/~dc552/papers/asplos15-memory-safe-c.pdf)

They go through and identify C programming idioms as they are reflected in
real code, and design their new memory model in part around that. The rest of
the work on CHERI is also very interesting.

------
gruez
I don't get why they went with ManagedC when Managed C++ aka C++/CLI (which
runs on the CLR rather than the JVM) has existed for a decade.

~~~
mike_hearn
If you read the paper, ManagedC is doing something very different to Managed
C++, despite the similarities in name. C++/CLI is a different language where
you have to define garbage collected pointers manually. ManagedC is basically
the same as C99, albeit a whole lot more strict about things that might work
on other compilers whilst being technically undefined.

To be more specific, in ManagedC the C code is interpreted, profiled and then
JIT compiled just like in Java, where the JIT compiler (Graal) can then do
very aggressive profile guided optimisations like inlining huge amounts of
code, and using the resulting compile graphs to eliminate the overheads
introduced by the sanity checking. It can also do things like eliminate
dispatch costs when using function pointers and the like, in the same way it's
done for Java.

What's also super interesting about this approach is the language interop you
get. ManagedC is based on a project called TruffleC, which doesn't have any of
the security/safety checks. TruffleC was built to enable Ruby code that's also
being compiled by the same compiler to call into Ruby C extensions intended
for the MRI interpreter. What's really crazy about this work is, the compiler
can merge the C and Ruby code together incredibly tightly, to the extent that
running a mix of Ruby and C on this experimental JVM can be much faster than
running the Ruby and C together on the original implementation. The JVM can
actually optimise out all the interop costs of moving between the Ruby and C
worlds, just using compiler optimisations.

------
flohofwoe
At first glance I assumed they used an approach similar to how Emscripten
compiles C/C++ to JS, where the entire C-accessible memory heap is one big
JavaScript array object (which has the nice side-effect of basically switching
off the garbage collector, unless you need to cross over to the JS side). But
it looks like they are actually mapping granular C structs to JVM 'objects'.

There is (or was?) an Emscripten alternative called Duetto which had a
'granular' approach somewhat similar to the C-on-JVM one described here, but
it couldn't compete on performance and also needed a customized C/C++ dialect.

~~~
david-given
I've done this --- see:

[http://cluecc.sourceforge.net/](http://cluecc.sourceforge.net/)

It compiles C89 into Java, Javascript, Lua, Perl and Common Lisp. Pointers
are represented as a reference to the array backing the object plus an offset
into that array, so sizeof(void *) == 2 and sizeof(everything else) == 1. This
allows efficient pointer arithmetic while still using one native allocation
per object.
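
For readers who haven't met the idea, here is a loose sketch of such a
(base, offset) pointer written in C itself. It is neither Clue's nor
ManagedC's actual code; the struct and function names are invented, and the
explicit `size` field is an assumption standing in for the length that a Java
or Lua array already carries:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fat pointer: the allocation it points into plus an offset. */
struct fat_ptr {
    unsigned char *base;  /* start of the allocation */
    size_t size;          /* allocation size in bytes */
    size_t offset;        /* current position inside the allocation */
};

/* Pointer arithmetic only touches the offset; base never changes. */
struct fat_ptr fat_add(struct fat_ptr p, size_t n)
{
    assert(n <= p.size - p.offset);  /* at most one past the end */
    p.offset += n;
    return p;
}

/* Checked load: returns 0 and stores the byte in *out, or -1 when the
 * pointer does not refer to a byte inside the allocation. */
int fat_load(struct fat_ptr p, unsigned char *out)
{
    if (p.offset >= p.size)
        return -1;
    *out = p.base[p.offset];
    return 0;
}
```

Because every access goes through the allocation it came from, an
out-of-bounds read turns into a reportable error instead of a stray memory
access.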

It relies heavily on C89 undefined behaviour. Alas, C99 adds (IIRC) a defined
mapping from an object to bytes and vice versa, which means this approach
won't work. Also, the compiler frontend I was using, sparse, had bugs where it
would try to convert a pointer to an int, do arithmetic, and then convert back
again. So it's not suitable for real work.

Performance was better than I was expecting: it would run C on Java at 1/3
the speed of native. That's pretty good for a naive toy. Running C on LuaJIT
was amazing; very nearly the same speed as native. (For a set of artificial,
non-representative benchmarks.)

Java, Javascript and Lisp are all crippled by not having goto. goto allows you
to express arbitrary basic block graphs, which C supports. Without it, I have
to use a big switch statement inside a while loop --- JITs hate this.
Emscripten gets around this by using algorithms to try and represent the basic
block graph as much as possible with structured control flow statements, but
this can't work in all situations, so it has to fall back to the explicit
state machine if that doesn't work. Which is, of course, slow.
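
To illustrate the lowering described above, here is a small sketch in C: the
same basic-block graph written once with goto and once as a switch inside a
loop. The functions are toy examples invented for this purpose, not output
from Clue or Emscripten:

```c
/* Sum 1..n with explicit basic blocks connected by goto. */
int sum_goto(int n)
{
    int i = 1, total = 0;
head:
    if (i > n)
        goto done;
    total += i;
    i++;
    goto head;
done:
    return total;
}

/* The same graph for a target without goto: each basic block becomes a
 * case, and `block` holds the index of the block to run next. */
int sum_switch(int n)
{
    int i = 1, total = 0;
    int block = 0;
    for (;;) {
        switch (block) {
        case 0:                       /* head: loop test */
            block = (i > n) ? 2 : 1;
            break;
        case 1:                       /* body */
            total += i;
            i++;
            block = 0;
            break;
        case 2:                       /* done */
            return total;
        }
    }
}
```

Every jump in the second version goes through the dispatch variable, so the
real control flow is hidden from the JIT.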

LuaJIT does support goto. When I converted from Lua 5.1 (with no goto) to
LuaJIT, I estimated about a 30% speed improvement.

Languages without goto statements are _toys_ , dammit...

~~~
dmm
Common Lisp does have a form of goto in the "go" operator within a tagbody.
Was this not usable for your purposes? Did you need a global goto?

[http://clhs.lisp.se/Body/s_go.htm#go](http://clhs.lisp.se/Body/s_go.htm#go)

~~~
david-given
TBH I don't know Lisp - the code generator was donated. There isn't enough
libc to run the benchmarks and I haven't really examined the code much. Maybe
it does use go. If you're interested in having a look, feel free...

~~~
dmm
> Maybe it does use go.

I just took a look at your repository. In the Lisp implementation each
generated function is wrapped with a prog, which provides a tagbody; a label
is then emitted for each basic block and go is used to jump to the labels. So
be sure to leave out Lisp in your future lists of toy languages... :D

~~~
david-given
Excellent news!

Normally at this point I'd ask if you were interested in fixing the libc for
me, but frankly, it's not worth it --- clue's technology is pretty crappy. One
day, in my copious free time, I'd like to retry the whole idea using a better
compiler.

------
norswap
Do they allow the commonplace idiom of casting objects to void* and then back
to the original type? (That is allowed by the standard)

And do they allow "overlay" casting where multiple structs are used to access
the same memory region? This is used in the POSIX networking API for instance
(and isn't that rare besides -- the alternative is unions, but those have the
disadvantage of being closed -- you can't add new variants).
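
A small sketch of both idioms, in the spirit of (but not copied from) the
sockaddr family. All type and function names are hypothetical. It relies on
the C rule that a pointer to a struct can be converted to a pointer to its
first member:

```c
#include <assert.h>

enum { KIND_PING = 1 };

struct header { int kind; };                  /* common prefix */
struct ping   { struct header h; int seq; };  /* one "variant" */

/* Generic code sees only the shared header through a void pointer. */
int kind_of(const void *msg)
{
    const struct header *h = msg;  /* overlay: view the prefix only */
    return h->kind;
}

/* Cast back to the original type after checking the tag. */
int ping_seq(void *msg)
{
    assert(kind_of(msg) == KIND_PING);
    struct ping *p = msg;          /* void* -> original struct type */
    return p->seq;
}
```

New variants can be added later without touching `struct header`, which is
the openness that a union lacks.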

------
mk00
This uses garbage collection, which requires stop-the-world (STW) pauses
unless you use something like the Zing JVM. Avoiding STW pauses is why many
(real-time) applications choose C/C++ in the first place.

