
Porting C to Rust - ScottWRobinson
https://wiki.alopex.li/PortingCToRust
======
jstimpfle
> Achieving abstraction is super hard when you can just reach down into some
> bytes and noodle around with them instead.

Disagree. In my experience, everything was fine once I got past the point
where I decided to just not worry. Quite the opposite: so much pointless and
confusing OOP boilerplate (as well as blood, sweat, and all that) goes into
"securing" code against misuse.

Invest some time into data structures with obvious meaning, or a procedural
API that is easily understood. "Nobody" will "ever" misuse it, and misuse will
be easy to detect. Much better tradeoff IMO.

And lastly, of course, abstraction has nothing to do with "security systems".
It's against popular opinion, but you can in fact have good abstraction with
nothing but plain functions and void pointers. I like to view abstraction as
mostly a conceptual thing that doesn't even happen in code.

~~~
ekidd
> _"Nobody" will "ever" misuse, and misuse will be easy to detect... And
> lastly, of course, abstraction has nothing to do with "security systems"._

I've spent a lot of time digging around in other people's C code over the
years, and with very few exceptions, the kind of low-level byte munging in the
original post tends to be absolutely riddled with bugs. There are off-by-one
errors, buffer overflows, integer overflows, and memory leaks. The error-
handling code paths tend to be broken, too.

And if you run a fuzzer like libFuzzer or American Fuzzy Lop, it will almost
always find even more vulnerabilities. (A handful of C programs do better,
including djb's tools, SQLite, dovecot, and significant portions of Apache.
Apache seems to mostly succeed because it has good abstractions for strings
and buffers.)

Back in 2002, I published a ~7,000 line XML-RPC library in C, with extensive
unit tests. I ran it through multiple code quality tools, including Electric
Fence. I spent a long time carefully examining each function for correctness.
I chose Expat as my XML parser, because it was one of the best-written at the
time.

Overall, I thought I was an unusually paranoid and careful C programmer. But
here's the list of CVEs against my library and—most importantly—its
dependencies: [https://people.canonical.com/~ubuntu-security/cve/pkg/xmlrpc-c.html](https://people.canonical.com/~ubuntu-security/cve/pkg/xmlrpc-c.html)

If anybody here writes C code, and if your code runs on potentially hostile
data, I strongly recommend experimenting with a fuzzer. It can be a brutally
humbling experience, even if you think you're an exceptionally careful
programmer. And if you rely on anybody else's code, like I did, you inherit
all their bugs.

~~~
todd8
We need all the help from our programming languages (and tools) that we can
get. Donald Knuth is a remarkable programmer and computer scientist.
Interestingly, he kept a list of all the errors in his program TeX[1], which
starts in March of 1978.

TeX is an exceptionally stable program; its design has been frozen since
version 3.0 in 1990, reflecting Knuth's desire to have a format suitable for
reproducing the typesetting precisely, even for archived texts that were
developed on hardware that is no longer running. Since 1990 only minor changes
have been made to TeX, and this is reflected in the current version number,
3.14159265, which is converging to pi.

A comment by Knuth in the error log in March of 1990 was _"We're now up to
Version 3.0; I sincerely hope all bugs have been found."_ At that point there
had been 908 errors logged. The last entry is currently one made in 2014 for
error 947.

I've always been fascinated by this list (see [1]).

[1] [http://texdoc.net/texmf-dist/doc/generic/knuth/errata/errorlog.pdf](http://texdoc.net/texmf-dist/doc/generic/knuth/errata/errorlog.pdf)

~~~
svat
That is a great list, and the paper that accompanies it is a must-read. Note
that to make sense of the list (to put it in context, and to learn what all
the abbreviations stand for), it is best to read Knuth's paper, "The Errors
of TeX", also reprinted in Literate Programming. (The paper's DOI is
10.1002/spe.4380190702, which may help one find and read it.) Just the
abbreviations are also given in
[https://www.tug.org/TUGboat/tb10-4/tb26knut.pdf](https://www.tug.org/TUGboat/tb10-4/tb26knut.pdf)
but the paper is much more.

------
mschwaig
> A lot of these complaints are just, well, it’s called Progress. If we
> designed the C in 2018 and made it so we actually wanted to use it, then it
> would look very different. Probably a lot like Nim or Swift. I think that’s
> still a niche that is currently unfilled; a modern, powerful language that
> doesn’t try to provide the same guarantees as Rust and so can be much more
> minimalist.

I think Zig is quite a good candidate to fill that niche. It is very
compatible with C, but it offers meaningful improvements while still being
very picky about what goes into the language.

~~~
grok2
Zig is good, but I like the look of Kit
([https://www.kitlang.org/](https://www.kitlang.org/))... it feels a lot more
C-like, with many of the features (ADTs/pattern-matching/generics/type-
inference) of the latest crop of languages. Not fully ready yet, but
promising.

~~~
jungler
The two projects have different ambitions, really. Kit is a "compiles to C"
language that adds some nifty semantics. There are many similar efforts in
that vein; they benefit by reusing the leverage of the existing
infrastructure and patching it up, but that isn't going to change the
underlying environment - you'll still build and debug as if it were C. Zig is
trying to replace the
very bottom of the stack. Although it's bootstrapping off a C++/LLVM
environment itself, the effort is more ground-up and will remove much of the
undefined or vendor-specific behavior that currently exists in C tooling,
while still adding some high-level niceties, just probably not as many as an
applications-focused language.

------
sinistersnare
Here is the discussion on the Rust Subreddit, with author comments [1].

[1]:
[https://old.reddit.com/r/rust/comments/9mioiv/porting_c_to_r...](https://old.reddit.com/r/rust/comments/9mioiv/porting_c_to_rust_a_case_study_minimp3/)

------
monocasa
> What is it about C that makes people think L3_imdct_gr() is a perfectly good
> function name?

Early ANSI C, like FORTRAN 77, only required the first 6 characters of an
external identifier to be significant, and compilers didn't go much further
than that. At that point, it sort of becomes a "when in Rome" thing.

~~~
rayiner
Unpopular opinion: it also benefits the human reader to have the significant
distinguishing parts of the function name at the beginning, instead of after a
bunch of foreplay.

As to this particular function name, you're in MP3 land, so MDCT means
Modified Discrete Cosine Transform. The "i" is probably "inverse." "L3" is
probably "Layer 3" since we're talking about MPEG Layer 3. I had to look up
"gr" since it seems specific to the encoding, but it seems to refer to
"granule" which is a basic unit of the MP3 data stream (that's also clear from
the context, since many functions operate on a gr_info structure). A comment
on the gr_info structure might have helped there, but there's nothing wrong
with consistently using a shorthand that is used in the underlying spec. In
context, it's not actually a bad function name.

~~~
wyldfire
I suppose I'd grant that there are two distinct audiences: one that's
familiar with the codebase already and has to read these symbols but knows
what they mean, and one that's new to the code. I think your point only works
for the former and doesn't help the latter at all.

So we should weigh the negative impact to the familiar-reader of scanning a
64-character identifier distinguished only by a suffix like "_gr" (or
"_greater", if that's what is meant here) against the negative impact of the
difficult-to-decipher abbreviations on the unfamiliar-reader. IMO the net win
is to optimize for the unfamiliar-reader in this case: even though
unfamiliar-readers are much rarer than familiar-readers, the positive impact
per reader is much more significant.

> As to this particular function name, you're in MP3 land, so MDCT means
> Modified Discrete Cosine Transform. The "i" is probably "inverse." "L3" is
> probably "Layer 3" since we're talking about MPEG Layer 3. In context, it's
> not actually a bad function name.

Agreed, this codec implementation will likely use some abbreviations for sane
reasons. But even if you know what mdct/l3 mean in this context (I did), can
you say what this function does (or should do) by looking at the name? What
about it is distinct from L3_imdct12/L3_imdct36/etc?

~~~
lucideer
> _I'd grant that there's two distinct audiences: one who's familiar with the
> codebase already and has to read these symbols but knows what they mean, and
> one who's new to the code. I think your point only works for the former and
> doesn't help the latter at all._

With the exception of the "i" (possibly inverse?), the rest of the parts seem
like things you would very quickly become familiar with after a short period
of looking at the code and reading up on the subject matter (both of which
would be required if you wanted to contribute or port).

I guess having names that are immediately understandable to any fresh reader
with zero domain knowledge is a noble goal, but it's a high bar.

It's not a perfect function name (you've pointed out some other valid
problems), but it's far from objectively bad.

~~~
tjoff
_> I guess having names that are immediately understandable to any fresh
reader with zero domain knowledge is a noble goal, but it's a high bar._

I'd argue that it is often detrimental to readability for those with domain
knowledge, and if the code (as is often the case) requires the domain
knowledge, it can be a disservice.

Consistency is key: _if_ you are consistent, these shortened function names
can be disturbingly pleasant.

Long names are sometimes a consequence of the programmer not taking the
necessary time to think about it (also, consistency is still paramount).

------
eridius
On the "possible bugs in minimp3" section, the links are all to the current
master, which appears to have changed since this was written (e.g.
[https://github.com/lieff/minimp3/blob/master/minimp3.h#L232](https://github.com/lieff/minimp3/blob/master/minimp3.h#L232)
doesn't point at the function with the bitshift anymore; I expect it should be
[https://github.com/lieff/minimp3/blob/644e0fb7fed34f803b6634...](https://github.com/lieff/minimp3/blob/644e0fb7fed34f803b6634f72e5ad8cc20a520f7/minimp3.h#L232)).

Protip: When viewing something like this in GitHub, press "y" and the URL bar
will change to be a proper permalink to the current version of the code. This
permalink can then be shared. Alternatively, with the line highlighted, press
the … button in the gutter and it will offer a "Copy Permalink" option (which
gives you the same permalink you get by pressing "y").

------
capdeck
> _In the end, the results of my work are in the rinimp3 crate, because I suck
> at naming things._

Why not minimp3-rs? That way, people familiar with the original C library
will know where it's coming from...

~~~
saghm
"Crate" refers to a package in the centralized repository for Rust packages,
crates.io. Given that it's not a repository for C packages as well, adding
"-rs" to the end is a bit redundant. That being said, there's nothing
enforcing that the repository name has to be the same as the crate name, so
it would be pretty reasonable to add "-rs" to the end of the name on GitHub,
etc.

~~~
steveklabnik
Yes, but also no. "Crate" is Rust's unit of compilation. Crates.io distributes
_packages_, which are made up of one or more crates.

That said, yes, it's considered bad form to give your crate a -rs suffix. For
the GitHub repo name, it's okay.

------
eximius
I admit, I was curious about the `some_struct foo[1];` thing.

This is all I could find. It isn't terribly compelling.[1]

[1] [https://stackoverflow.com/questions/6390331/why-use-array-size-1-instead-of-pointer](https://stackoverflow.com/questions/6390331/why-use-array-size-1-instead-of-pointer)

~~~
GuB-42
The Stack Overflow link is about a different matter. And it is compelling: I
managed to lower memory usage and improve performance very significantly
using this technique.

That kind of control over memory is what C is very good at, and it is, I
think, the main reason why C tops almost every benchmark.

Note that modern C++ offers everything C has to offer, plus even better, more
advanced tools. As a result, it should be even faster. However, "proper" C++
is a bit more hands-off when it comes to memory management, with things like
generic containers, smart pointers, etc. That tends to result in slightly
worse performance.

As for the "foo[1]" as it is described in the article, it is just syntax. Some
people may find it more readable than &foo, it doesn't matter. Kind of like
&table[1] vs table+1. Use the form you prefer, or even 1[table] if you are
making an IOCCC entry.

------
JoshuaAshton
"The lack of real bool": Which is fine, as the minimum space it would take to
store a bool is 1 byte anyway (unless you're using bitflags), and you
probably want to give more detailed information than true/false anyway.

"but a shitty way to engineer software as a whole. Achieving abstraction is
super hard when you can just reach down into some bytes and noodle around with
them instead.": Data-oriented design is usually better, in both ease of use
and performance, than any random class abstractions; also, see Linux.

"Bloody hell you can't tell whether a pointer points to a single object or an
array just by looking at it": You can, usually. Plural words are usually used
to denote this (item_s_), along with the size argument he mentioned. If the
code doesn't have that, it's usually just bad practice or poor code quality.

"heckin’ ternary operators, just make your if statements not suck.": Ternary
operators are great, and usually quite concise. Not sure what they are
specifically referring to here :/

"The pre and post increment operators are just the worst damn thing in the
world.": Again, this knowledge comes with experience, and they actually make
things more concise.

C is not designed to be a ""beginner"" friendly language, but its essence is
simple -- and I would recommend it for any beginner as it really drives home
the majority of actual programming principles, and makes you think about what
you are doing on a deeper level rather than coating things in a magical dust
layer of classes with vtables and garbage collection.

"it’s called Progress.": However, with modern programming languages it's one
step forward with two steps back most of the time.

I agree with the stuff about automatic conversions between types; however,
some compilers will warn you (unless you told them to shut up) about any
narrowing conversions you do, and that's the main trap people fall into.

The majority of debugging/compiling tools are designed primarily with C in
mind and are also fairly simple to use.

The majority of the rant about C was based not on issues with C itself, but
on the code quality of minimp3, which is quite depressing, as C itself does
have some bad traits IMO: function pointer definitions are bulky, the `:`
syntax for bitfields, no predictability for most undefined behaviours, dodgy
bitshifting too, and probably more things I can't think of off the top of my
head.

