
Notes on Type Layouts and ABIs in Rust - pcr910303
https://gankra.github.io/blah/rust-layouts-and-abis/
======
quotemstr
Rust's ABI needs a lot of work before it's frozen. They're leaving performance
on the table. Consider this code:

[https://rust.godbolt.org/z/jN2AA_](https://rust.godbolt.org/z/jN2AA_)

    
    
      pub fn square(num: i64) -> i64 {
          return num * num;
      }
    
      //      mov     rax, rdi
      //      imul    rax, rdi
      //      ret
    
      pub fn square2(num: i64) -> Result<i64, Box<dyn std::error::Error>> {
          Ok(num * num)
      }
    
      //      mov     rax, rdi
      //      imul    rsi, rsi
      //      mov     qword ptr [rdi + 8], rsi
      //      mov     qword ptr [rdi], 0
      //      ret
    

In square(), we multiply the input by itself and return the result in rax. In
square2(), we call the function with an extra hidden pointer argument and
store the output Result via that pointer. That's really bad considering how
central this fat-return-value model is to Rust programming.

x86_64 has nine caller-saved registers. We should be using _all_ of those to
return function return values. Yes, I know that we're returning by pointer
because the Box has a non-trivial destructor. That's irrelevant, a holdover
from the bad old C++ ABI. Rust should be returning from square2() via two
registers, and I hope that it implements this optimization before freezing its
ABI.

In C++, we can address this ABI wart using the [[clang::trivial_abi]]
attribute: see [https://quuxplusone.github.io/blog/2018/05/02/trivial-
abi-10...](https://quuxplusone.github.io/blog/2018/05/02/trivial-abi-101/)

There's no reason Rust can't do the same thing.

~~~
zozbot234
> That's irrelevant, a holdover from the bad old C++ ABI.

It's not irrelevant, it's doing whatever's most convenient _given that_ the
caller is to drop the Result<> later. As mentioned in the blogpost you link
to, the [[clang::trivial_abi]] attribute changes how these things happen, such
that it's no longer clear whether the caller or callee is responsible for a
value that's passed via trivial_abi. I don't think it would make sense for
Rust to adopt this, especially not in all cases.

> That's really bad considering how central this fat-return-value model is to
> Rust programming.

It's not 'fat return values' that lead to this, it's just _boxed_ return
values. Or more generally, values with non-trivial drop implementations.
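The distinction can be shown with a small sketch (function names here are made up for illustration; the exact register assignment is rustc-version and target dependent, the comments describe behavior observed on x86_64 SysV):

```rust
// A Result whose Ok and Err payloads are both plain integers has no
// Drop impl, so current rustc on x86_64 SysV has been observed to
// return it in a register pair (rax:rdx) with no hidden out-pointer.
pub fn square_nodrop(num: i64) -> Result<i64, i64> {
    Ok(num * num)
}

// Box<dyn Error> has a destructor, so this variant is returned through
// a hidden out-pointer (sret), as in the parent comment's disassembly.
pub fn square_boxed(num: i64) -> Result<i64, Box<dyn std::error::Error>> {
    Ok(num * num)
}

fn main() {
    assert_eq!(square_nodrop(7), Ok(49));
    assert!(square_boxed(7).is_ok());
}
```

Both functions compute the same thing; only the drop glue on the error payload changes the calling convention.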

~~~
quotemstr
It's not unclear in the slightest. If the value dropped before the callee
returns, the callee drops it. If the value is dropped after the callee
returns, the caller drops it. The claim that returning by pointer is better
for a caller that's going to drop the value later makes no sense. If you think
there's an actual problem, point it out and _be specific_.

The problem isn't the boxing. The problem is that the boxing triggers a
needlessly inefficient calling convention, and returning boxed values _is_
central to Rust. The language implementation as it is today uses the machine
inefficiently. Rust can make up its own ABI. Returning via pointer when the
architecture has tons of registers available is just bad ABI no matter what
excuses people use to justify bad performance.

~~~
manwe150
My mental model comparison says that the cost of those may potentially be
about the same, with no clear winner over many benchmarks. I’ve even seen
recently a claim that using two registers was measured to be slower than stack
memory return when tested in another language, so it’s been on my TODO list
now to performance test it (I work on another language runtime). But I don’t
actually expect to be able to measure the difference—instead I expect the
latency often may simply disappear in the execution of the `ret` statement.
Plus, for a non-trivial function, returning via a register may mean you have
to materialize an extra load, whereas the box could have been filled much
earlier and kept a tiny bit more of the stack hot.

~~~
quotemstr
Sure, that part of the stack is going to be in L1 cache, and there's not a
huge difference between that and the register file. But think about the code
size cost too. The icache hit won't show up in a microbenchmark, but when you
have 100,000 of these functions, the difference will be real.

> I’ve even seen recently a claim that using two registers was measured to be
> slower than stack memory return

I find that difficult to believe. Those output registers are undefined anyway.
I get that store-to-load forwarding may hide the latency hit of the stack
access, but they'll apply to the registers too. At _best_ , the stack pattern
runs at the same speed as register return, but with a bigger code size. How on
earth could a register return be worse? If it really were worse, we'd use a
pointer return for multi-word PODs, yet on x86_64 we do in fact use a pair of
registers to return a pair of words.

~~~
manwe150
I agree it’s hard to believe it could be faster. But we expect there must be
_some_ cut off where more registers is worse, it’s just a question of where.
But more-than-one being usually worse seems surprising to me too. I expect
it’s unlikely someone actually benchmarked when making that decision, but just
went with the “obvious must” reaction too though. On reflection now, I feel
like there could be some reasons the stack pattern may sometimes be better or
equivalent for multi-return. To your point though, curiously the Win64 ABI
does only use one return register (unlike the SysV ABI which allocates up to
two integer registers as you say).

I’m guessing their microbenchmark may have been that if the compiler had to
spill the value earlier, it’s better to be able to spill directly to the sret
pointer, than to need extra code to reload it at the end.

The bigger reason I usually see that this matters none is that if you’ve
missed the inlining opportunity of something that small, you’re already so far
away from optimal {code size, performance, memory usage} it is really too
premature to optimize the shape of this code (and the ABI).
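To illustrate the inlining point with a sketch (names invented for this example): once a function this small is inlined, no `Result` is ever materialized in memory, and the out-pointer-vs-registers question disappears entirely.

```rust
// A tiny fallible function; with #[inline] it is a candidate for
// inlining even across crates.
#[inline]
pub fn square2(num: i64) -> Result<i64, Box<dyn std::error::Error>> {
    Ok(num * num)
}

// After inlining, the optimizer sees the Ok path directly and the whole
// Result (discriminant, payload, and any sret pointer) folds away.
pub fn caller(n: i64) -> i64 {
    match square2(n) {
        Ok(v) => v,
        Err(_) => 0,
    }
}

fn main() {
    assert_eq!(caller(5), 25);
}
```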

------
zozbot234
I'm somewhat surprised to learn that there's no plan to have Rust support non-
byte-addressable architectures, such as are found in DSPs (that is,
architectures where the addressable word is something other than 8 bits). It
shouldn't be _too_ hard to add such support, at least for no_std use.

~~~
quotemstr
Rust seems to be targeting a development environment that's a bit more
constrained than C and C++. C++ is general enough to work _everywhere_. Rust
seems specialized to, well, the kinds of systems that run Firefox. Embedded
into the shape of the language isn't only byte addressing, but also this idea
that memory is practically unlimited and that allocation failure recovery
isn't important.

Rust may be perfectly fine in its domain, but my position is that these baked-
in assumptions make its domain a subset of C++'s.

~~~
Ar-Curunir
Nothing in the language assumes infallible allocation. It's only the current
standard library APIs that don't support fallible allocation, but there is
work underway to remedy this.
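One piece of that work is already stable: `Vec::try_reserve` reports allocation failure as a `Result` instead of aborting. A minimal sketch (the `grow` helper is invented for this example):

```rust
use std::collections::TryReserveError;

// Reserve space up front and only then grow the vector, so allocation
// failure surfaces as an Err value rather than an abort.
fn grow(v: &mut Vec<u8>, extra: usize) -> Result<(), TryReserveError> {
    v.try_reserve(extra)?;
    v.resize(v.len() + extra, 0);
    Ok(())
}

fn main() {
    let mut v = Vec::new();
    assert!(grow(&mut v, 16).is_ok());
    assert_eq!(v.len(), 16);
    // An absurd request fails gracefully instead of crashing the process.
    assert!(v.try_reserve(usize::MAX).is_err());
}
```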

~~~
quotemstr
The fundamental mistake in the Rust programming language is avoiding the use
of exceptions for reporting errors. It's _because_ the Rust designers eschewed
exceptions that lots of standard library functions that need to allocate
memory are also no-fail. If we reported allocation failure with Result, we'd
have Result _everywhere_, which would have been an ergonomic disaster, especially
before the "?" operator appeared.

If Rust had been designed to use exceptions from the start, this problem
wouldn't exist, and it'd be a much nicer language besides. Some people seem to
just have a philosophical or aesthetic aversion to exceptions, which is a
shame, because exceptions work wonderfully in practice.

Rust's development path is especially unfortunate because when you return
Result everywhere and use the "?" operator consistently, what you end up
writing looks _just like_ exceptional code, but with weird warts and
limitations from the language essentially implementing exceptions with return
codes instead of first-class affordances built into the language itself.
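The resemblance is easy to see in a sketch (the function here is made up for illustration): with `?` on every fallible call, the happy path reads like exception-based code, and errors propagate implicitly to the caller.

```rust
use std::num::ParseIntError;

// Each `?` is the moral equivalent of an implicit "rethrow": on Err,
// the function returns early and the error propagates upward.
fn parse_and_double(s: &str) -> Result<i64, ParseIntError> {
    let n: i64 = s.trim().parse()?; // "throws" on malformed input
    Ok(n * 2)
}

fn main() {
    assert_eq!(parse_and_double(" 21 "), Ok(42));
    assert!(parse_and_double("oops").is_err());
}
```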

It's because Rust got error handling wrong that I still prefer C++ despite
Rust's advantages in variable lifetime reification. Exceptions turn out to be
so useful in general-purpose programming that Rust had to add them in a hacky
way after the fact using macros and special operators.

I understand that my position is controversial, but it's still my opinion and
it's going to inform my language choice.

~~~
nixpulvis
I disagree. In fact, I'd go a step further in the opposite direction and say that
the inclusion of panics was a mistake in the language design, though we're
stuck with it now. To be fair, they do allow much better diagnostics when
things fail, so this is more of a matter of simplicity vs developer economics,
unless I'm missing something.

~~~
quotemstr
I agree that supporting panics was a mistake as long as exceptions aren't
really supported. Panics force everyone to pay for the possibility of
unwinding without getting to use unwinding for exceptions. It's unfortunate.
If you're going to have to consider the possibility of unwinding, running
destructors, and so on, just add exceptions!

You don't need the panic mechanism to get good stack traces. You can get good
stack traces in plain C with abort(): walking the stack does not require
_unwinding_ the stack! Confusing these two concepts is just another example of
the unfixable confusion baked into the core of Rust that ruins an otherwise-
interesting language.

~~~
steveklabnik
If you don’t want to pay for landing pads, you can compile your program with
panic=abort, and you won’t get them. As you say, you can still get stack
traces.

The only reason that this is feasible is precisely because panics are not used
for recoverable error handling in Rust; libraries don't break if you turn this
option on.

~~~
mwcampbell
Can I go further and disable stack traces as well, without giving up entirely
on std? I don't want my release builds to have any of the extra code required
to walk the stack. Just let the app crash; if I want a stack trace, I'll use a
debugger.

~~~
connicpu
Almost all of the bloat you get for stack traces is the extra information
embedded in the stack frames, which you still need if you want your debugger
to be able to display stack traces. If you don't want them to get printed on
panic, you can give it a custom panic hook that doesn't do the stack trace
walk.
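A minimal sketch of that hook (real `std::panic` APIs; whether anything is printed on panic otherwise depends on `RUST_BACKTRACE` and build settings):

```rust
use std::panic;

fn main() {
    // Replace the default hook (which prints the panic message and,
    // with RUST_BACKTRACE set, a backtrace) with one that is silent.
    panic::set_hook(Box::new(|_info| {}));

    // The panic still unwinds and is catchable; nothing gets printed.
    let result = panic::catch_unwind(|| {
        panic!("silent");
    });
    assert!(result.is_err());
}
```

Note that with `panic=abort` (as mentioned above) unwinding and `catch_unwind` are unavailable, but the hook still runs before the abort.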

------
mwcampbell
Regarding systems with segmented memory models, I think it would be kind of
cool if a language like Rust, with its emphasis on both safety and zero-
overhead abstractions, could be cross-compiled for my first computer, the
Apple IIGS, with its 65C816 processor. But I admit I don't want it enough to
put any serious work into making it happen. Probably time to let that
nostalgia go.

~~~
zozbot234
You'd need to get the architecture supported by LLVM first. This can be a
hurdle, because LLVM backends need to be actively maintained to keep up with
the rest of the code. The MC68K (far more popular than the 65C816) is still
not supported for example, and work on it is really just starting.

Segmented memory can be supported by C itself in a quasi-standard way, because
the "Embedded C" technical report has a proviso for "multiple address spaces".
Loosely speaking, this means that a "fully general" pointer type, encompassing
all possible address spaces, still has to exist; but address-space specific
pointers are also made possible.

~~~
pshc
Would it work to compile rust code to WebAssembly first and translate that to
65C816 opcodes?

~~~
zozbot234
It would be quite inefficient. And you'd still need to write a JIT from WASM
to the 65C816.

------
davidgerard
> To my knowledge the last great bastion of these properties being violated is
> some DSPs (Digital Signal Processors), because they really don't like 8-bit
> bytes.

Now, there's definitely a tangent to be explored there. Anyone know more about
this?

~~~
steveklabnik
The C spec has a concept called CHAR_BIT that specifies how many bits are in a
byte. Looking for systems where it’s not 8 will give examples. I always link
to [https://stackoverflow.com/questions/32091992/is-char-bit-
eve...](https://stackoverflow.com/questions/32091992/is-char-bit-
ever-8/38360262) which has a TI DSP as the first answer, for example.

