
> You can really use a register more than once at the same time because they will be renamed.

What do you mean by this?

Do you realize register renaming is only done for instruction scheduling purposes, to avoid bubbles in the pipeline?


I guess he refers to pipelining, which sort of can give you a logical register twice. For example, you may be able to write a new value to a register before the instruction writing out the old value has completed.

Yes. I'm not an architecture expert but I understand that if you do something like:

    addl    %ebx, %eax
    movl    %eax, foo(%rip)
    addl    %ecx, %eax
    movl    %eax, bar(%rip)
Then the second add can start while the first add and the first mov are still running, because the logical reuse of eax is independent of the first and so will receive a different entry in the rename buffer.

Happy to be confirmed or corrected by someone who knows more.


Right, but pretty often what you're paying for is not obvious. For example, it's often shocking how many non-obvious wasteful copies are made behind the scenes by relatively innocent-looking code. That's of course the "programmer's fault", not denying that.

So you end up paying for things you didn't intend to use, but used anyways because of some small detail.

Sometimes I feel like the only way to write good C++ is to keep looking at the disassembler output...

C++ is my dayjob.


Say you're computing a buffer size with untrusted inputs.

With saturated arithmetic, you could add the (unsigned) sizes together without a possibility of an overflow, so you could eliminate all except the last range check (=branch).

If the end result is larger than what is reasonable, return an error. It's not possible that the value wrapped around and later code will be writing to unallocated memory.
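
A minimal Rust-flavored sketch of the idea (the function, parameter names, and the limit are made up for illustration):

    // Sketch only: saturating adds can't wrap, so intermediate results never
    // overflow and a single final range check is enough.
    const MAX_REASONABLE: usize = 1 << 20; // illustrative limit

    fn buffer_size(header: usize, payload: usize, padding: usize) -> Result<usize, ()> {
        let total = header
            .saturating_add(payload)
            .saturating_add(padding);
        // The only branch: reject anything that saturated or is simply too big.
        if total > MAX_REASONABLE { Err(()) } else { Ok(total) }
    }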


That doesn't actually eliminate range checks. Hardware doesn't have native saturating arithmetic operations, so the saturating methods are implemented by doing a checked operation and using the min/max value of the type if it overflows/underflows. Which is to say, it removes the checks from your code, but there's no performance benefit to it.

>Hardware doesn't have native saturating overflow operations

Uh, what hardware are you talking about? x86 and ARM at the least have saturating arithmetic instructions.


They do? I admit I'm not an expert on either x86 or ARM assembly, but I've never seen code that used a saturating arithmetic instruction, and the saturating math in Rust is implemented using checked arithmetic as opposed to any kind of llvm intrinsic for saturating math. Looking at the LLVM docs I don't see any kind of intrinsic for saturating math either (checked math uses llvm.sadd.with.overflow.* and friends).

It's part of the vector ISAs for both. x86 has it as part of SSE (or SSE2, or MMX, etc., I don't remember). On ARM it's part of the DSP extensions.

And x86 (SSE).

http://felix.abecassis.me/2011/10/sse-saturation-arithmetic/


Maybe it's an Apple thing. Sometimes they make up silly names like "Grand Central Dispatch" for executing a block in parallel on multiple threads. "ARC" still makes me think "adaptive replacement cache", not "automatic reference counting". At least they don't call it "automatic retain count". After all, they used to call a "reference count" a "retain count". Confused the heck out of me on my first exposure to Objective-C.

Not that others like Microsoft don't do the same thing, renaming commonly known concepts to something they made up. Like DEP vs NX vs XD. Or Interlocked* for atomic operations. Atomic add in Microsoftese is InterlockedAdd.


UB can be confusing, but please remember it also enables a lot of very reasonable optimization opportunities.

Say in data flow analysis, when the compiler can assume "(value + 1) > value" to be always true, a whole expensive branch can be removed.

If the overflow is never going to happen anyways, the savings are considerable (expensive branch eliminated) and the result is always correct.
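
For illustration only, a Rust sketch of the same idea using the unsafe unchecked_add (where overflow is undefined, like signed + in C); the function name is made up:

    // Because overflow via unchecked_add is undefined behavior, the optimizer
    // may assume `value + 1 > value` always holds and delete the else branch.
    unsafe fn next_without_check(value: i32) -> i32 {
        // SAFETY: the caller must guarantee value < i32::MAX.
        let next = unsafe { value.unchecked_add(1) };
        if next > value {
            next
        } else {
            unreachable!() // provably dead code under the no-overflow assumption
        }
    }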


I know that UB can be useful for optimizations. Where did I assert otherwise?

I'm saying that overflow is not undefined in Rust (the very first section of the blog post says this). I made this same mistake a few days ago and thought it was undefined (and implementation-defined to panic on debug), but I was wrong. Overflow is well defined in Rust.


That can easily lead to excessive branch target buffer invalidation and mess up branch prediction.

It might look acceptable in microbenchmarks, with just a 1-10% performance loss, and be a total disaster in an actual running system.

A mispredicted branch costs about 15 clock cycles. You'll have a lot of those when the CPU can't do bookkeeping on all of your range-check branches anymore. The range checks themselves might run fast and fine, but the other branches your program has to execute can become pathological: the ones where the default prediction for branches not present in the CPU's branch prediction buffer and branch target buffer is wrong.

15 clock cycles is rather excessive on any modern CPU that's supposed to execute 1-4 instructions per clock cycle! In the worst pathological cases this can mean an order of magnitude (~10x) worse performance.

Of course, data flow analysis might mitigate the need for actual range checks, if the compiler can prove the value can never be outside the range of values representable by a given type.

Regardless, it is important to understand that the worst-case price can be very high; this type of behavior should not be the default in any performance-oriented language.


To add, to avoid any misunderstanding: the cost is probably within 1-10% in likely scenarios on most real world code.

The pathological case would be code with a lot of taken branches mixed with range/overflow checks.

For range checks, any sane compiler would generate code that doesn't branch if the value is in range. If no branch predictor entry is present, branches are assumed [1] to be not taken. These branches can still thrash the BTB (a kind of cache thrashing) and cause taken branches that fell out of the BTB to be mispredicted every time, which is very expensive.

[1]: Some architectures do support branch hints, even x86 did so in the past. Modern x86 CPUs simply ignore these hints.


Not only that, you lose the ability to pattern match simple arithmetic to LEA and bloat your code size.

It's very hard for an overflow check to be mis-predicted. On x86, for static branch prediction, it's sufficient to generate code with a forward branch to be predicted as unlikely the first time. From that point, as the branch will never be taken (unless it actually triggers a panic but then who cares), it will keep being predicted as not taken.

A correctly predicted not-taken conditional jump on x86 has zero latency. In the specific case of overflow, you don't even need to generate a cmp instruction. It basically means it's close to a wash, performance-wise.

The actual negative effect that could be measurable is disabling optimizations (like composition of multiple operations, as you need to check each one for overflow).


It would be interesting to have support for saturated types as well. u8saturated 200 + u8saturated 200 = 255. That's often desired behavior.

Saturated arithmetic is supported in hardware in SSE. Although there's not much point if the values need to be constantly shuffled between SSE and scalar registers...


.saturating_add() exists. You can create a wrapper type around this and implement the Add/Sub/Mul traits.
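
Something along these lines (a hand-rolled sketch; the Sat name is made up):

    use std::ops::Add;

    // Wrapper whose arithmetic clamps at the numeric bounds instead of wrapping.
    #[derive(Clone, Copy, Debug, PartialEq)]
    struct Sat(u8);

    impl Add for Sat {
        type Output = Sat;
        fn add(self, rhs: Sat) -> Sat {
            Sat(self.0.saturating_add(rhs.0))
        }
    }

    fn main() {
        // 200 + 200 clamps to 255 instead of wrapping to 144.
        assert_eq!(Sat(200) + Sat(200), Sat(255));
    }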

Does it have branchless codegen?

Not right now (I'm surprised that's the case, there should be intrinsics for it), but that could change.

It gets optimised to be branchless if LLVM thinks that will be faster, yes.

So slower than LZ4? I wonder how Kraken's compression ratio compares to LZ4's.

Better, if you read the article.

You mean the slow RAM expansion slot? It's connected to the chipset side, right? So it's limited in capacity and slow.

On some Amiga models I actually saw this expansion "slow RAM" behave as true chip RAM: you could play samples from it, set up copper lists, set Denise to display bitplanes from it, etc.
