I guess he refers to pipelining with register renaming, which can effectively give you the same logical register twice. For example, you may be able to write a new value to a register before the instruction writing out the old value has completed.
Then the second add can start while the first add and the first mov are still in flight, because the second use of eax is logically independent of the first and so receives a different entry in the rename buffer.
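A rough sketch of the kind of sequence I mean (hypothetical x86-ish code; the instructions and physical-register names are just for illustration):

```asm
mov  eax, [a]     ; load a; rename maps eax -> physical register P1
add  eax, 1       ; first add, depends on P1
mov  [b], eax     ; first mov, writes out the old value (reads P1)
mov  eax, [c]     ; new value: rename maps eax -> a fresh register P2
add  eax, 2       ; second add depends only on P2, so it can start
                  ; while the instructions above are still running
```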
Happy to be confirmed or corrected by someone who knows more.
Right, but pretty often the payment is not obvious. For example, it's often shocking how many non-obvious wasteful copies are made behind the scenes by relatively innocent-looking code. That's of course the "programmer's fault", not denying that.
So you end up paying for things you didn't intend to use, but used anyways because of some small detail.
Sometimes I feel like the only way to write good C++ is to keep looking at the disassembler output...
That doesn't actually eliminate range checks. Hardware doesn't have native saturating overflow operations, so the saturating overflow methods are implemented by doing a checked operation and substituting the type's max/min value on overflow/underflow. Which is to say, it removes them from your code, but there's no performance benefit to it.
They do? I admit I'm not an expert on either x86 or ARM assembly, but I've never seen code that used a saturating arithmetic instruction, and the saturating math in Rust is implemented using checked arithmetic rather than any kind of LLVM intrinsic for saturating math. Looking at the LLVM docs, I don't see any intrinsic for saturating math either (checked math uses llvm.sadd.with.overflow.* and friends).
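For reference, the relationship described above can be sketched in safe Rust. This is just an illustration of how saturating math can be expressed in terms of checked math, not necessarily the exact standard-library implementation:

```rust
// Saturating add expressed via checked add, as described above.
// Not necessarily how the standard library implements it internally.
fn saturating_add_i32(a: i32, b: i32) -> i32 {
    match a.checked_add(b) {
        Some(sum) => sum,
        // On overflow, pick the saturation bound from the sign of b.
        None => if b > 0 { i32::MAX } else { i32::MIN },
    }
}

fn main() {
    assert_eq!(saturating_add_i32(1, 2), 3);
    assert_eq!(saturating_add_i32(i32::MAX, 1), i32::MAX);
    assert_eq!(saturating_add_i32(i32::MIN, -1), i32::MIN);
    // Matches the standard library's observable behavior.
    assert_eq!(i32::MAX.saturating_add(1), i32::MAX);
}
```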
Maybe it's an Apple thing. Sometimes they make up silly names like "Grand Central Dispatch" for executing a block in parallel in multiple threads. "ARC" still makes me think "adaptive replacement cache", not "automatic reference counting". At least they don't call it "automatic retain count". After all, they used to call a "reference count" a "retain count". Confused the heck out of me on my first exposure to Objective-C.
Not that others like Microsoft don't do the same thing, renaming commonly known concepts to something they made up. Like DEP vs NX vs XD. Or Interlocked* for atomic operations. Atomic add in Microsoftese is InterlockedAdd.
I know that UB can be useful for optimizations. Where did I assert otherwise?
I'm saying that overflow is not undefined in Rust (the very first section of the blog post says this). I made this same mistake a few days ago and thought it was undefined (and implementation-defined to panic on debug), but I was wrong. Overflow is well defined in Rust.
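Concretely, here's a small sketch of the defined behaviors (the debug-mode panic can't be demonstrated in one runnable snippet, so this sticks to the explicit methods):

```rust
fn main() {
    let x: u8 = 255;

    // Explicitly wrapping: defined to wrap around.
    assert_eq!(x.wrapping_add(1), 0);

    // Checked: overflow is reported via Option, never undefined.
    assert_eq!(x.checked_add(1), None);
    assert_eq!(x.checked_add(0), Some(255));

    // Overflowing: returns the wrapped value plus an overflow flag.
    assert_eq!(x.overflowing_add(1), (0, true));

    // Plain `x + 1` panics in debug builds and wraps in release
    // builds (with default settings); both are defined, not UB.
}
```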
That can easily lead to excessive branch target buffer invalidation and mess up branch prediction.
It might look acceptable in microbenchmarks, with just a 1-10% performance loss, and still be a total disaster in an actual running system.
A mispredicted branch costs about 15 clock cycles. You'll have a lot of those when the CPU can't do bookkeeping on all of your range-check branches anymore. The range checks themselves might run fast and fine, but the other branches your program has to execute can become pathological: the ones where the default prediction, used for branches not present in the CPU's branch prediction buffer and branch target buffer, is wrong.
15 clock cycles is rather excessive on any modern CPU that's supposed to execute 1-4 instructions per clock cycle! In the worst pathological cases this can mean an order of magnitude (~10x) worse performance.
Of course data flow analysis might mitigate the need for actual range checks, if the compiler can prove the value can never be out of range (of the values representable by a given type).
Regardless, it is important to understand that the worst-case price can be very high; this type of behavior should not be the default in any performance-oriented language.
To add, to avoid any misunderstanding: the cost is probably within 1-10% in likely scenarios on most real-world code.
The pathological case would be code with a lot of taken branches mixed with range/overflow checks.
For range checks, any sane compiler would generate code that doesn't branch when the value is in range. If no branch predictor entry is present, branches are assumed to be not taken. These branches can still thrash the BTB (a kind of cache thrashing) and cause taken branches that have fallen out of the BTB to be mispredicted every time, which is very expensive.
Some architectures do support branch hints; even x86 did in the past, but modern x86 CPUs simply ignore these hints.
It's very hard for an overflow check to be mispredicted. On x86, for static branch prediction, it's sufficient to generate the check as a forward branch for it to be predicted not-taken the first time. From that point on, as the branch will never be taken (unless it actually triggers a panic, but then who cares), it will keep being predicted as not taken.
A correctly predicted not-taken conditional jump on x86 has 0 latency. In the specific case of overflow, you don't even need to generate a cmp instruction. Performance-wise, it's basically a wash.
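A sketch of the codegen being described (hypothetical x86 snippet; the label name is made up):

```asm
add  eax, ebx         ; the add itself sets the overflow flag (OF),
                      ; so no separate cmp instruction is needed
jo   .overflow_panic  ; forward branch to the cold panic path,
                      ; statically predicted not-taken
; ... fall through on the common, non-overflowing path ...
```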
The actual negative effect that could be measurable is disabled optimizations (like fusing multiple operations, since each one now needs its own overflow check).