Here's a quick and dirty example: https://gcc.godbolt.org/z/3WK7eYM4z I observe ...

titzer · on March 31, 2022

From the godbolt link, it looked like most of the vector operations were not getting inlined. You'll need -O2 or higher to have representative results.

I could counter by just writing a Virgil program and turning off bounds check with a compiler flag. We could stare at the machine code together; it's literally an additional load, compare, and branch. The Virgil compiler loves eliminating bounds checks when it can. I know your original comment was in the context of the STL, but it really muddies the waters to see huge piles of code get inlined or not depending on compiler flags. Machine code is what matters.

Regardless, this is still microbenchmarking. Maybe matrix multiply is an important kernel, but we need to benchmark whole programs. If I turn off bounds checks in a big program like the Virgil compiler, I cannot measure more than about 1.5% performance difference.

jcelerier · on March 31, 2022

> From the godbolt link,

I just used godbolt to quickly share the code. On my computer I tried with -Ofast -march=native (broadwell)

> I could counter by just writing a Virgil program and turning off bounds check with a compiler flag. We could stare at the machine code together; it's literally an additional load, compare, and branch.

This sounds like Virgil is not vectorizing, which makes the comparison much less useful.

I see much more than a couple instructions changed there: https://gcc.godbolt.org/z/8jPMb734x

titzer · on March 31, 2022

> Virgil is not vectorizing

That's just even more confounding variables. We are, after all, not even doing experiments with Rust vectors, which is what your original comment was about. You wrote examples in C++ and we went down a wormhole already, but I let it slip since at least it was empirical. But I think we shouldn't get distracted with even more side alleys now. Bounds checks are literally just a couple machine instructions, and often eliminated by the compiler anyway, which enables vectorization.

jcelerier · on March 31, 2022

> We are, after all, not even doing experiments with Rust vectors,

they are very close to C++ ones, down to the stack unwinding to report the panic in case of bound error if I'm not mistaken.

> That's just even more confounding variables.

No they are not. We are trying to see how much bound checks costs. If you compare between suboptimal programs the comparison is meaningless (or rather not interesting to anyone) - the starting point has to be the absolute best performance that it is possible to get, and add the worst-case bound-checking (I'm happy for you if you never have to worry about the worst case though!)

> Bounds checks are literally just a couple machine instructions, and often eliminated by the compiler anyway

please provide a source for this ? sure, if you use spans as other commenters mentioned, that moves the checking at the span creation time but that only works for the simplest cases where you are going to access linearily - and I would say that it's a library feature rather than a compiler one.

   for(double v : vec) { } // or any sub-span you can take from it

in C++ also does not need bound-checks by design but this kind of construct also utterly does not matter for so HPC workloads.

I can look into my pdfs folders and bring out dozens of papers where the core algorithms all use funky non-linear indexing schemes where you cannot just iterate a range (algorithms based on accessing i, i-1, i+1, or accessing i and N-i, or accessing even / odd values, etc etc) - how would you implement an FFT for instance ? This is the code that matters !

titzer · on March 31, 2022

> accessing i, i-1, i+1, or accessing i and N-i,

A compiler can do loop versioning for that. And they do. Hotspot C2 (and Graal, too) does a ton of loop optimizations, partitioning the index space into in-bounds and potentially out-of-bounds ranges, unrolling loops, peeling the first iteration off a loop, generating code before a loop to dispatch to a customized version if everything will be in bounds.

When a compiler is doing that sophisticated of loop transforms, you are not measuring the cost of bounds checks anymore, you are measuring a whole host of other things. And sometimes if a loop is just a little different, the results can be disastrously bad or miraculously good. Which is why microbenchmarking is so fraught with peril. A microbenchmark might be written assuming a simple model of the computation and then a sophisticated compiler with analyses never dreamed of comes along and completely reorganizes the code. And it cuts both ways; a C++ compiler might do some unrolling, fusion, tiling, vectorization, software pipelining or interchange on your loops and suddenly you aren't measuring bounds check cost anymore, but loop optimization heuristics that have been perturbed by their presence. You end up studying second-order effects and not realizing it. And it keeps running away from you the more you try to study it.

>> That's just even more confounding variables.

> No they are not. We are trying to see how much bound checks costs. If you compare between suboptimal programs the comparison is meaningless (or rather not interesting to anyone) - the starting point has to be the absolute best performance that it is possible to get, and add the worst-case bound-checking (I'm happy for you if you never have to worry about the worst

We are always comparing suboptimal programs. No compiler is producing optimal code, otherwise they would dead-code eliminate everything not explicitly intertwined into program output, statically evaluate half of our microbenchmarks, and replace them with table lookups.

You're going down a linear algebra rabbit hole trying to come up with a result that paints bounds checks in the worst possible light. If this is the real problem you have, maybe your linear algebra kernels would be better off just using pointers, or you could even try Fortran or handwritten assembly instead, if it is so important. Unsafe by default is bad IMHO. For real programs bounds checks really don't matter that much. Where this thread is going is all about loop optimizers, and they don't really get a chance to go nuts on most code, so I think we're way off in the weeds.

tialaramex · on March 31, 2022

Note that Rust has unchecked index, you just have to explicitly ask for that if you want it. You can even - if you just insist on writing cursed code - which it seems jcelerier is, write your own type in which the index is always unchecked and it is Undefined Behaviour when you inevitably make a mistake. Just implement std::ops::Index and if you also want to mutate these probably invalid indices, std::ops::IndexMut and in your implementation say you're unsafe and just don't bother doing bounds checks.

You can shoot yourself in the foot with Rust, it's just that you need to explicitly point a loaded gun at your foot and pull the trigger, whereas C++ feels like any excuse to shoot you in the foot is just too tempting to ignore even if you were trying pretty hard not to have that happen.

pjmlp · on April 1, 2022

C++ devs that actually care about security the same way as titzer, do turn on the security checks that are disabled by default, thus you can have a same experience.

Example, on Visual Studio define _ITERATOR_DEBUG_LEVEL to 1, enable /analize as part of the build.

While not Rust like, it is already much better than not caring.

steveklabnik · on March 31, 2022

I think you’re confusing two different senses of “vector”, there is the contiguous series of items “vector” data structure that both C++ and Rust have in their standard libraries that’s used in the code example, and there’s “vectorizing” which is an optimization to use things like SIMD to operate on multiple things at the same time.

titzer · on March 31, 2022

I understood what was meant by these terms.

steveklabnik · on March 31, 2022

Okay, then I got lost reading your comment, my apologies.

titzer · on March 31, 2022

No worries. It's too late for me to edit it now, though, but I meant vectorizing="compiler introduces SIMD".

nybble41 · on March 31, 2022

Part of the problem is that the multiply() function takes variable-size matrices as inputs and isn't inlined because it's declared as an external function and so could, in theory, be called from another compilation unit. If it's declared as "static" instead then the compiler generates identical code (modulo register selection) at -Ofast for both versions—eliminating the bounds checking at compile-time:

https://gcc.godbolt.org/z/qqWr1xjT8

jcelerier · on March 31, 2022

> Part of the problem is that the multiply() function takes variable-size matrices as inputs

so does real-life code ? I don't know about you but I pretty much never know the size of my data at compile time.

> If it's declared as "static"

again, real-life code calls math operations defined in other libraries, sometimes even proprietary.

titzer · on March 31, 2022

In Java or other languages with bounds-checked arrays, you would typically just loop from 0 to the end of the array. In Java the JIT will analyze the loop, see that it is bounded by the length of the array, conclude all accesses are in-bounds and eliminate bounds checks. Virgil does this simple type of bounds check elimination as well in its static compiler, but its analysis is not quite as sophisticated as Java JITs.

nybble41 · on March 31, 2022

> I don't know about you but I pretty much never know the size of my data at compile time.

Whole-program optimization (LTO) can deal with that. Also, Rust inlines across modules much more aggressively than C++ does, even without LTO, so optimization will be more effective and its bounds-checking won't have as much (if any) runtime overhead. Especially if you write your Rust code idiomatically, as others have already suggested. Your earlier Rust example was practically designed to inhibit any attempt at optimization. Simply iterating over a slice of the vector, rather than the indices, results in a much tighter inner loop as it only needs to check the bounds once.

That being said, in this case I think it would be better to have fixed-size matrices (using const generic/template arguments) so that the bounds are encoded in the types and known to the compiler locally without relying on whole-program optimization.

stouset · on March 31, 2022

> Especially if you write your Rust code idiomatically, as others have already suggested. Your earlier Rust example was practically designed to inhibit any attempt at optimization.

This is pretty much word-for-word what I've seen in other similar disagreements. Rust Skeptic transcribes their solution from its original programming language into Rust line by line, producing something nobody using Rust would actually write. They find that Rust performs poorly. Rust Proponent writes it the way you'd expect someone who actually knows Rust to write it, and performance meets or exceeds the original. Sometimes they even catch a few bugs.

Yes, if you don't know Rust, and you think it's just a weird re-skin of C++ that can be gsub'd from one to another, you're going to have a bad time. If you're an expert in C++ and have never written Rust, you probably aren't going to beat your finely-tuned C++ program with a ten minute Rust translation. But someone with a year of experience in Rust is probably going to be able to rewrite it in half an hour and come within spitting distance of the thing you've optimized by hand over the course of a few years.

I've written Rust for half a decade and I'm not sure I've ever actually explicitly indexed into a Vec or slice in production code. If I needed to, and it was in a hot loop, and it wasn't a fixed-size array whose size is known at compile time... there's always `get_unchecked()` which is functionally identical to indexing into a C++ vec without `at`.

jcelerier · on April 1, 2022

> Yes, if you don't know Rust, and you think it's just a weird re-skin of C++ that can be gsub'd from one to another, you're going to have a bad time.

most C++ code is not "idiomatic C++" either, it's code which looks like this:

https://github.com/cdemel/OpenCV/blob/master/modules/calib3d...

or this

https://github.com/madronalabs/madronalib/blob/master/source...

or this

https://github.com/MTG/essentia/blob/master/src/3rdparty/spl...

etc ... which is in general code from science papers written in pseudocode which is ported at 2AM by tired astrophysics or DSP masters students to have something to show to their advisor. You can have whatever abstraction you want but they won't be used because the point is not to write rust or C++ or anything but to get some pseudo-MATLABish thing to run ASAP, and that won't have anything that looks like range / span, only arrays being indexed raw.