
Baby Steps: Slowly Porting Musl to Rust - pygy_
http://blog.adamperry.me/rust/2016/06/11/baby-steps-porting-musl-to-rust/
======
quotemstr
I'm disappointed by the idea that something is semantic nonsense, but it's
okay to do it anyway. The right line of code isn't "return usize::MAX", but
"abort()". (I'd also accept "__builtin_unreachable()".)

In general, if a program enters an impossible state, you need to stop running
the program, not do something you _know_ is wrong just so the program keeps
working a while longer. That's why strcpy_s is much better than strlcpy.

It's also really not fair to compare the legibility of Rust to the legibility
of musl: whatever its other merits, musl is written in that old-school C style
that makes one imagine that programs run faster when variable names are
shorter and braces are omitted.

~~~
dikaiosune
This is a hobby project for my own edification, and maybe _maybe_ to
eventually show Rust's suitability for this kind of work. While I appreciate
that the semantics of returning usize::MAX aren't correct, I'm also not an
expert on C standard library implementations. Just a hobbyist writing a little
bit about something I'm trying out.

Potentially relevant is this section of the repo's README:

[https://github.com/dikaiosune/rusl#goals-and-non-
goals](https://github.com/dikaiosune/rusl#goals-and-non-goals)

EDIT: After a suggestion in a sibling here, here's a version which neither
explicitly aborts (musl doesn't do abort or unreachable), nor uses an explicit
return of usize::MAX:

[https://is.gd/lck8c7](https://is.gd/lck8c7)

It doesn't autovectorize, but I suspect that with some work I could maybe
convince LLVM to do so.

~~~
briansmith
IMO, it would be more interesting to see it done top-down, instead of bottom-
up. That is, instead of starting with rewriting `strlen`, why not start with
rewriting `getaddrinfo`? That would be a very interesting project because you
could write `getaddrinfo` with a safe Rust API (i.e. one that doesn't require
using `unsafe` to call it), and then wrap it in an `unsafe` function that
exports the unsafe C `getaddrinfo` function. This could arguably then be an
improvement on the musl code.
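
The shape described above might look something like this minimal sketch (all
names here are invented for illustration; the real `getaddrinfo` has a much
richer signature): a safe Rust function holding the logic, plus an `unsafe
extern "C"` wrapper that exports it to C callers.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Safe Rust API: callers never need `unsafe` to use it.
fn host_label_len(host: &str) -> usize {
    // Length of the first dot-separated label, as a stand-in for real logic.
    host.split('.').next().map_or(0, |label| label.len())
}

// Unsafe C-ABI wrapper around the safe core. A real libc export would also
// carry #[no_mangle] so C callers can link against the symbol by name.
pub unsafe extern "C" fn rusl_label_len(host: *const c_char) -> usize {
    if host.is_null() {
        return 0;
    }
    let s = unsafe { CStr::from_ptr(host) };
    match s.to_str() {
        Ok(s) => host_label_len(s),
        Err(_) => 0,
    }
}

fn main() {
    assert_eq!(host_label_len("example.com"), 7);
    let c = std::ffi::CString::new("example.com").unwrap();
    assert_eq!(unsafe { rusl_label_len(c.as_ptr()) }, 7);
}
```

Only the thin wrapper touches raw pointers; everything behind it is checked
by the compiler.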

~~~
dikaiosune
I think it very well could be.

However, I would imagine (not having looked at the problem in depth) that one
would want to be able to use things like Vec<T> or CString from the Rust
standard library when implementing a DNS resolver. Since the Rust standard
library depends on libc, I'm not sure how one would properly allow the Rust
symbols to replace the linked-in libc symbols while still depending on the
rest of libc. Maybe it's not an issue? Maybe there's some magic which could be
done?

Nonetheless, I mostly started doing this to see if it could be done, not to
prove any point (although I'd love it if a point is proven along the way), so
a bottom-up approach with #![no_std] was the easiest way to prevent any sort
of cyclic dependencies and/or linker issues.

~~~
briansmith
> Since the Rust standard library depends on libc, I'm not sure how one would
> properly allow the Rust symbols to replace the linked-in libc symbols while
> still depending on the rest of libc. Maybe it's not an issue? Maybe there's
> some magic which could be done?

Static libraries can have circular dependencies like that. For example, in
_ring_ [1] I have C code in one static library, which calls some Rust
functions in my Rust code. And, that Rust code calls functions in the C static
library. The linker...links them together.

[1] [https://github.com/briansmith/ring](https://github.com/briansmith/ring)

~~~
dikaiosune
Circular dependencies are different than duplicate symbols, though, yes?
Anyways, my point is not that it's impossible, but that I'm already working at
the edge of my knowledge, and I picked the lowest-risk/smallest-chunk pieces
of work to get started.

~~~
spc476
On Linux, the symbols in glibc are marked as "weak"; that is, if the symbol
"malloc" isn't found in another object file, then the one in glibc is used.
It's that way so that users can override system-provided functions (duplicated
non-"weak" symbols are considered an error).

~~~
strcat
Some symbols are. Not all.

------
nikital
I wonder what is the benefit of rewriting the libc in Rust, as most of the
functions are actually unsafe. Implementing memcpy in Rust won't make memcpy
any safer.

As far as I understand, the main selling point of Rust is safety, so if we
don't get any safety, why bother? Am I missing something? (Other than having
some fun with Rust, which is obviously welcome :) )

~~~
pcwalton
I came to this thread hoping to answer this exact question. :)

There isn't a huge benefit to writing things like strlen in Rust. But a lot of
libc is big, complicated things like the DNS resolution logic in
gethostbyname. These are also where we find a huge number of vulnerabilities
in popular libcs. The internal logic of functions like gethostbyname should be
able to profitably take advantage of Rust's safety features.

~~~
panic
Here's musl's DNS resolution logic, for other people who may be curious:
[http://git.musl-
libc.org/cgit/musl/tree/src/network/lookup_n...](http://git.musl-
libc.org/cgit/musl/tree/src/network/lookup_name.c)

~~~
asveikau
That is crazy simple. I bet glibc's is a lot longer.

~~~
jcranmer
Here's part of glibc's:
[https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/pos...](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/getaddrinfo.c;h=31bb7e66dcbe1868b8e5e34185e048fe5bde15ab;hb=0d261f406d7c8f70b4ad7ca7d9247da1db3dfb41)

------
cpeterso
Has anyone experimented with a C-to-Rust translator?

Google wrote a C-to-Go translator to aid their conversion of the Go compiler
from C to Go. Carol Nichols gave a talk about her function-by-function, by-
hand rewrite of the Zopfli compression library from C to Rust. I wonder how
much of that could be automated, even if most of the resulting Rust code is in
unsafe blocks.

Carol's slides:

[https://github.com/carols10cents/rust-out-your-c-
talk](https://github.com/carols10cents/rust-out-your-c-talk)

~~~
legulere
I guess the output would be very ugly, as the C semantics are probably ugly to
reproduce in Rust. For instance, arithmetic with unsigned types in C is
defined to be wrapping, so you would need e.g. calls to wrapping_add() instead
of + all the time.
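
A quick illustration of the mismatch (a sketch, not actual translator
output): C's unsigned addition wraps silently, while Rust's `+` panics on
overflow in debug builds, so a mechanical translator would have to spell out
the wrapping explicitly.

```rust
fn main() {
    let x: u8 = 250;

    // What C's `x + 10` means for an unsigned type: wrap modulo 256.
    assert_eq!(x.wrapping_add(10), 4); // 260 mod 256 = 4

    // Rust's plain `x + 10` would panic in a debug build; checked_add
    // makes the overflow visible instead of wrapping.
    assert_eq!(x.checked_add(10), None);
}
```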

~~~
pornel
I'd be OK with a shallow syntactic translator that only does the first
laborious step of the conversion, without trying to preserve too much
semantics. Rust with exact C semantics isn't going to be better than C, so I'd
prefer readable code that is easier to read & refactor.

I've converted a few utilities from C to Rust. Rewriting `void foo(int a)` to
`fn foo(a: int)` is a _boring_, menial job. Only the next step, replacing C's
pile'o'pointers with Rust idioms, _is fun_.

------
fpgaminer
As a corollary to writing the C standard library in Rust, are there any plans
by the Rust devs to remove the C standard library dependency itself? That, to
me, is just as interesting a proposition.

~~~
pcwalton
It's not a high priority task, no. (Though it would be quite interesting to
see it done.)

On Windows and Mac OS X you shouldn't really avoid C libraries (kernel32.dll
and libSystem.dylib respectively) because the syscall interface is considered
a private API. (You _can_ avoid it technically, and apps sometimes do, but
Microsoft and Apple would be within their rights to break us at any time if we
did that.)

~~~
pjmlp
But those are technically the OS system calls, not libc.

Would the Rust team accept a PR to remove the dependencies on libc for
Windows?

~~~
masklinn
> But those are technically the OS system calls, not libc.

They're libc calls, libSystem is the union of a bunch of libraries including
libc. In fact, these unioned libraries are just symlinks to libSystem:

    
    
        libc.dylib -> libSystem.dylib
        libdbm.dylib -> libSystem.dylib
        libdl.dylib -> libSystem.dylib
        libgcc_s.1.dylib -> libSystem.B.dylib
        libinfo.dylib -> libSystem.dylib
        libm.dylib -> libSystem.dylib
        libmx.dylib -> libSystem.dylib
        libpoll.dylib -> libSystem.dylib
        libproc.dylib -> libSystem.dylib
        libpthread.dylib -> libSystem.dylib
        librpcsvc.dylib -> libSystem.dylib

~~~
pjmlp
Still, isn't it ironic that a systems programming language can't be used
without C, because its runtime was made to rely on libc instead of the raw OS
APIs?

I understand the shortcut as a way to reduce development time, but maybe that
is something to improve.

Go made the right decision by integrating directly with the OS APIs.

~~~
claudius
> rely on libc instead of the raw OS APIs?

The argument here is that on Windows and Mac OS X, the libc library _is_ the
‘raw OS API’ and it hence makes perfect sense to rely on it if you want to
support such operating systems.

~~~
bjourne
On Windows it isn't. Microsoft has made it clear that you aren't supposed to
link to the msvcrt.dlls shipped in Windows. See
[https://sourceforge.net/p/mingw-w64/wiki2/The%20case%20again...](https://sourceforge.net/p/mingw-w64/wiki2/The%20case%20against%20msvcrt.dll/)

------
scotty79
For me the coolest part is the algorithm for determining whether a word
contains a zero byte.

    
    
        #define ALIGN (sizeof(size_t))
        #define ONES ((size_t)-1/UCHAR_MAX)
        #define HIGHS (ONES * (UCHAR_MAX/2+1))
        #define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)
    

Assuming sizeof(size_t) is 4 and UCHAR_MAX is FF...

It takes FF FF FF FF (by casting -1 to the unsigned size_t) and divides it by
FF, getting 01 01 01 01 (ONES).

Then it gets 80 80 80 80 (HIGHS) by multiplying 01 01 01 01 by 80 (which it
gets from (FF+1)/2).

Once it has these two values, it tests which bytes are lower than 80 by doing
~(x) & HIGHS, which keeps the result in the highest bit of each byte and
zeroes the lower ones.

The other part is subtracting ONES from (x), which sets the highest bit in
each byte either because (x) had more than 80 in that byte, or because (x) had
zero in that byte and borrowed from a higher byte (underflow in the case of
the highest byte).

ANDing those two parts rules out a highest bit set due to an (x) byte being
more than 80. What's left is all zeroes if and only if all bytes of (x) were
non-zero.

The funny thing is that if borrowing from a higher byte happens (because some
byte was zero), it might mess up the test for that higher byte, but it doesn't
matter: you will already detect the zero in the lower byte, so you don't care
about the test on the higher bytes.

If what I wrote above is unclear then try here:
[http://stackoverflow.com/a/34643025/166921](http://stackoverflow.com/a/34643025/166921)

This is an algorithm that checks whether a word has a byte that is less than a
given value (1 in this case, but surely valid up to 7F ... maybe more by some
magic? probably not...). It also works for different lengths of "words" and
"bytes", provided that a whole number of "bytes" fits in a "word".
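
The same check can be written out in Rust for a 32-bit word to see it in
action (this is just the macro above transcribed, with the wrapping
subtraction made explicit):

```rust
const ONES: u32 = u32::MAX / 0xFF; // 0x01010101
const HIGHS: u32 = ONES * 0x80;    // 0x80808080

// True if any byte of x is zero, per the musl HASZERO trick.
fn has_zero(x: u32) -> bool {
    x.wrapping_sub(ONES) & !x & HIGHS != 0
}

fn main() {
    assert!(!has_zero(0x4142_4344)); // "ABCD": no zero byte
    assert!(has_zero(0x4142_0044));  // zero in an interior byte
    assert!(has_zero(0x0041_4243));  // zero in the highest byte
    assert!(has_zero(0x0000_0000));  // all zeroes
}
```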

~~~
scotty79
Thus I give you my first Rust program ;)

    
    
        // Constants for the word-at-a-time scan. On current Rust,
        // mem::size_of is a const fn, so no feature gate is needed.
        use std::mem;
        use std::os::raw::c_char;

        const ALLSET: usize = usize::MAX;
        const MAXCHR: usize = u8::MAX as usize;
        const ONES: usize = ALLSET / MAXCHR;            // 0x0101...01
        const HIGHS: usize = ONES * ((MAXCHR + 1) / 2); // 0x8080...80
        const ALIGN: usize = mem::size_of::<usize>();

        pub unsafe extern "C" fn strlen(s: *const c_char) -> usize {
            let mut i: usize = 0;
            // Step byte-by-byte until the *address* (not just the index)
            // is word-aligned, so the word-sized reads below are aligned.
            while (s as usize + i) % ALIGN != 0 {
                if *s.offset(i as isize) == 0 { return i; }
                i += 1;
            }
            // Scan a word at a time until some word contains a zero byte.
            loop {
                let v = *(s.offset(i as isize) as *const usize);
                if v.wrapping_sub(ONES) & !v & HIGHS != 0 { break; }
                i += ALIGN;
            }
            // Locate the exact zero byte within that final word.
            loop {
                if *s.offset(i as isize) == 0 { return i; }
                i += 1;
            }
        }

------
dman
Any plans for rust to switch to an allocator written in Rust?

~~~
steveklabnik
On nightly, we have the ability to swap out allocators. But I'm not aware of
an allocator written in Rust that's as high-quality as jemalloc yet; it'd have
to be pretty good before we'd move over. I'm not aware of any serious effort
to write one, yet.

------
pron
While Rust can indeed provide great benefits over C in many respects (mostly
security and safety, but also maintainability), rewriting "everything" in
Rust, as the author says, is so monumental an effort that there is no chance
of it happening, certainly not before a "better" language than Rust comes
along (say, in another ten or fifteen years), and then what? Rewrite
everything in that language?

A far more feasible approach (though still very expensive) -- and probably a
more "correct" one, even theoretically -- is to verify existing C code using
some of the many excellent verification tools (from static analysis to white-
box fuzzing and concolic testing) that exist for C[1]. It is more "correct"
because, while perhaps not providing some of the other benefits, those tools
can prove more properties than the Rust type system can, and such tools don't
(yet) exist for Rust.

Even if you want to write a green-field project today that absolutely must be
correct, it's better to do so in C (or in some of the verified languages that
compile to C, like SCADE) and use its powerful verification tools, than to
write it in Rust. Verification tools are usually much more powerful than any
language feature (even though Rust's borrowing system seems like it can
prevent a lot of expensive real-world bugs). Of course, it would be optimal if
we could do both, but the Rust ecosystem simply isn't there just yet.

This is part of a more general problem, namely that rewriting code in new
languages -- any new language -- is _so_ costly (and often risky, too), that
the benefits are rarely if ever commensurate with the effort. This is why good
interop with legacy libraries is so important (something, I understand, Rust
does well), often much more important than most language features. Rewriting
code is often the worst possible way to increase confidence in a
library/program in terms of cost/benefit, both because verification tools are
often more powerful than any language feature, as well as because new bugs are
introduced -- either in the code or in the compiler (bugs that won't be
uncovered by the usually incomplete legacy test suite, but would break some of
the millions of clients). Rewriting is usually worth the effort if the
codebase is about to undergo a significant change anyway, but almost never to
increase confidence in the existing functionality.

EDIT: Clarification: if you're _not_ going to use verification tools in your
_new_ project anyway, then obviously using Rust would give you much better
guarantees than C.

[1]: C and Java are the languages with the best collection of powerful
verification tools (maybe Ada as well, but there's far less Ada code out there
than C or Java).

~~~
pcwalton
I don't buy the argument that verification using heavyweight verifiers is less
expensive than rewriting in Rust. Formally verifying a piece of code is an
enormous effort, while writing in Rust is usually less effort than the
original unsafe code took to write, thanks to Rust's ergonomic improvements
over C.

Besides, most safe languages for embedded use (SPARK, etc) get their safety by
disallowing dynamic allocation, which is great for avionics systems and not so
great for libc or any consumer or server software. (Or they use a garbage
collector for the heap, which is not what you want for libc for the usual
reasons.) Not coincidentally, the novel part of Rust's type system is the
marriage of easy-to-use, zero-overhead C++-style dynamic allocation with the
flexibility and safety of Cyclone-style region checking.

~~~
pron
> while writing in Rust is usually less effort than the original unsafe code
> took to write, thanks to Rust's ergonomic improvements over C.

Absolutely, but do you think that rewriting billions of lines of C code in
Rust is the _best_, most cost-effective way to increase confidence in them?
Also, using verification tools can be done much more gradually (say, function
by function) than rewriting in a new language (which may require a lot of ffi
if done piecemeal), and there's no need to fear new bugs.

> Besides, most safe languages for embedded use (SPARK, etc) get their safety
> by disallowing dynamic allocation, which is great for avionics systems and
> not so great for libc or any consumer or server software.

True. I don't actually suggest using SCADE or SPARK for libc :) Astrée
wouldn't work, either, as it also assumes no dynamic code allocation. But
something like SLAyer (a tool by MSR based on proving separation logic
properties through abstract interpretation) or Facebook's Infer should. I
would guess that making those tools work well on more and more real-world code
would be a much smaller undertaking than rewriting in Rust and proving no new
bugs have been introduced.

I'm generally skeptical of significant, real-world, bottom line benefits new
languages provide, but Rust has convinced me that, at least potentially, it
can have some _very_ significant benefits over C. But a wholesale rewrite of a
near infinite amount of code, much of which may work well enough, plus the
risk of adding _new_ bugs in the process, without the means to detect them???
That's like draining the ocean with a spoon in the hope of finding a sunken
treasure. Software engineers often speak of using the best tool for the job;
rewriting vast amounts of code (with very incomplete test suites) in a new
language is certainly far from the best tool for the job of increasing
confidence in its correctness (unless a particular piece of code is buggy
beyond repair).

Same goes for physical infrastructure, BTW. If your infrastructure is old and
you fear some bridges may collapse, you inspect them all. You rebuild the
crumbling ones and strengthen those with cracks. What you don't do is rebuild
all the bridges in the country. That's not only wasteful and comes at the
expense of new bridges you could have built, but may end up increasing the
danger in some cases, rather than decreasing it.

~~~
dikaiosune
A big part of writing about this was to share one potential strategy for
reducing risk in rewrites by doing it function by function. That's a strategy
which is available to Rust with zero overhead FFI, so I'm not sure why you're
holding that up as a benefit exclusive to static analysis and verification
tools.

~~~
pron
Because -- and I may be wrong about this -- the FFI has zero _runtime_
overhead, not zero coding effort overhead (i.e., FFI code isn't identical to
native Rust code). So you'd want to rewrite again (the FFI code to native
code) as you translate more functions.

~~~
dikaiosune
How does that differ from the reportedly extensive work needed to appease
static analysis tools?

~~~
pron
It differs by being a far _more_ expensive (and probably less effective) way
to achieve the goal. Those verification tools were _designed_ to be
spectacularly less expensive than a rewrite in a new language (which is often
prohibitively expensive), or they wouldn't have been designed in the first
place (they're all much newer than safe systems languages like Ada).

I said that using verification tools is expensive, but I'd be surprised if
it's not at least one, if not two, orders of magnitude _less_ expensive than a
rewrite (I can think of few if any more expensive ways of increasing
confidence in such large amounts of legacy code than rewriting it). It can
provide similar -- sometimes worse, sometimes better -- guarantees, requires
much less creativity, it automatically focuses you on places in the code that
are likely to be buggy (or, alternatively, shows you which code is likely
_not_ buggy) and cannot introduce new bugs that you have absolutely no way of
detecting before pushing the new code to production systems.

(BTW, the chief complaint against those tools -- and what makes them expensive
to use -- is that they have too many false-positive reports of potential bugs,
but even "too many" is far fewer than _every line_, which is what you'll need
to consider and study when rewriting.)

~~~
pcwalton
> and cannot introduce new bugs that you have absolutely no way of detecting
> before pushing the new code to production systems.

Yes, they can. Remember the Debian OpenSSH vulnerability that arose from
blindly making changes that Valgrind suggested?

> It can provide similar -- sometimes worse, sometimes better

Much worse, unless you're talking about systems that verify memory safety
problems by disallowing dynamic allocation.

> Those verification tools were designed to be spectacularly less expensive
> than a rewrite in a new language (which is often prohibitively expensive),
> or they wouldn't have been designed in the first place (they're all much
> newer than safe systems languages like Ada).

I don't agree. Are you talking about things like Coverity? Coverity hasn't
been effective at stemming the tide of use-after-free and whatnot.

I strongly encourage you to familiarize yourself with Rust's ownership and
borrowing discipline and to understand how it prevents problems like use-
after-free. You'll find that it's very hard to retrofit onto C, and that's why
no practical static analyses do it.

~~~
nickpsecurity
I agree that Rust is a superior solution if we're talking use-after-free
detection. Also agree on tool immaturity. However, recent results might make
you reconsider how practical they are.

[http://research.microsoft.com/en-
us/um/people/marron/selectp...](http://research.microsoft.com/en-
us/um/people/marron/selectpubs/Undangle.pdf)

[https://dl.acm.org/citation.cfm?id=2662394&dl=ACM&coll=DL&CF...](https://dl.acm.org/citation.cfm?id=2662394&dl=ACM&coll=DL&CFID=799998173&CFTOKEN=79327699)

IMDEA's Undangle seems to have nailed it totally with one other doing pretty
good. You're the compiler expert, though. Does Undangle paper seem to be
missing anything as far as UAF detection or is it as total as it appears?

~~~
kbenson
What I think is relevant in this case is not whether an external tool for C
can reach the same levels of safety presented by Rust, but whether an
external, _optional_ tool will ever be able to provide the same level of
assurance, given that you can't ensure conformity on the (source) consumer end
without getting all the same tools and rerunning them (which is the situation
you have when it's built into the compiler).

Another way of stating this is "code verification tools for C are great! What
level of market penetration do you think we can achieve? Oh. That's
disappointing..." :/

~~~
nickpsecurity
That's a great point but a bit orthogonal. It's _very important_ if one is
aiming for mass adoption. Hence, why C++ was built on C and Java combined a
C-like syntax with a marketing technique I call "throw money at it." We still
need to consider the techniques in isolation, though.

Here's why. One type of tool needs to either not force conformity or blend in
seamlessly with everyone's workflow, achieving wide adoption through awesome
compatibility, safety, efficiency, and price tradeoffs. That's HARD. The other
is just there for anyone that chooses to use it recognizing value of quality
in a lifecycle. It just needs to be efficient, effective, have low to zero
false positives, work with what they're using, be affordable, and ideally plug
into an automated build process. These C or C++ techniques for safety largely
fall into category number 2.

I totally agree with you on overall uptake potential. There's almost none.
Most of the market ends up producing sub-optimal code with quality non-
existent or as an after-thought. Those that do quality in general follow the
leader on it. It's a rare organization or individual that's carefully
researching tools, assessing their value, and incorporating everything usable
into the workflow. Nothing I can really do about this except push solutions
that are already mainstreaming with the right qualities. Rust, Go, and Visual
Basic 6 [1] come to mind.

[1] When hell freezes over...

------
matt_wulfeck
The first example uses an "unsafe" method. I'm not very familiar with Rust,
but isn't this an inherently "bad thing"?

~~~
dikaiosune
Not an inherently bad thing, really. The unsafe keyword has two meanings:

1. An unsafe block tells the compiler "trust me, I know this is unsafe and I
know how to do it safely"

2. An unsafe function tells the compiler "this is unsafe! it needs to be put
inside another unsafe function or an unsafe block"

So an unsafe block essentially terminates an unsafe call chain. There are a
bunch of things which are inherently marked as unsafe in Rust, usually because
the compiler can't reason about them, like dereferencing a raw pointer as
opposed to a borrow-checked Rust reference. That's what calls for an unsafe
function in the case of strlen -- to only iterate over the string once we just
have to assume that the pointer points to valid memory and is null-terminated,
so like a lot of things in C, we just cross our fingers and go for it. The
fact that Rust knows this could fail miserably is why it needs to have an
unsafe block.
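
Those two meanings can be seen side by side in a small sketch (hypothetical
functions, not from the post):

```rust
// Meaning 2: an `unsafe fn` pushes the obligation onto the caller --
// the compiler can't verify that `p` points to valid memory.
unsafe fn first_byte(p: *const u8) -> u8 {
    unsafe { *p }
}

// Meaning 1: a safe function containing an `unsafe` block terminates the
// chain. The slice type guarantees the pointer is valid and in bounds,
// so the "trust me" is actually justified here.
fn first_or_zero(bytes: &[u8]) -> u8 {
    if bytes.is_empty() {
        return 0;
    }
    unsafe { first_byte(bytes.as_ptr()) }
}

fn main() {
    assert_eq!(first_or_zero(b"hi"), b'h');
    assert_eq!(first_or_zero(b""), 0);
}
```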

~~~
xienze
> 1. An unsafe block tells the compiler "trust me, I know this is unsafe and
> I know how to do it safely"

Right, so aren't you giving up the biggest advantage of Rust, that your code
will be safe by virtue of it getting through the Rust compiler's safety
checking?

~~~
dikaiosune
Yes, although there are other advantages to using Rust (about halfway down
there's a "What does Rust offer here?" section that discusses in more detail).

Frankly, the POSIX C standard library does not seem to have been designed with
safety in mind (I don't think this requires an expert eye to see). So writing
it in Rust is going to require a number of compromises on Rust's safety.

Also, if I were taking this more seriously, the wildly unsafe Rust code would
just be a beach head for a bottom-up conversion of a project. You have a
project which uses strlen? Great, now use this Rust version. Now incrementally
rewrite the strlen-using code in Rust. Now change strlen to use a Rust
CString...and so on. Each time you identify a new module boundary, rewrite its
dependencies in Rust, and then consume the module while preserving its
outward-facing interfaces.

~~~
xienze
> Frankly, the POSIX C standard library does not seem to have been designed
> with safety in mind (I don't think this requires an expert eye to see). So
> writing it in Rust is going to require a number of compromises on Rust's
> safety.

That's kind of what I'm getting at. If cstdlib is inherently unsafe, and Rust
cstdlib is going to be inherently unsafe, what's the point? Because of some
other Rust advantages things may end up safer, but there's a lot to be said
for years/decades worth of fixing existing cstdlib implementations such that
you can be reasonably sure that most of what's in there is as safe as it's
going to get.

~~~
sitkack
Security and safety aren't a binary property; they're a continuum.

I think what you are experiencing is a transitive-covering fallacy. The
safeties you are speaking of aren't the same type or in the same units. The
argument you're making is often made by people on security discussion lists
to _prove_ that modifying X to make it more secure is pointless because there
is still a small possibility of Y.

> years/decades

How long has MUSL been in development? How many programmers are there in the
world?

~~~
masklinn
> How long has MUSL been in development?

A bit more than 5 years. The initial public release was 0.5.0 in February
2011: [http://www.musl-libc.org/oldversions.html](http://www.musl-
libc.org/oldversions.html)

------
bitmadness
Wonderful work! Keep it up!

~~~
dikaiosune
Thanks!

