
Getting 4 bytes or a full cache line: same speed or not? - ingve
https://lemire.me/blog/2018/07/31/getting-4-bytes-or-a-full-cache-line-same-speed-or-not/
======
Const-me
I don’t like the article.

The code is not portable to other compilers/OSes so I can’t even run it here
on Windows. The code uses inline assembly a lot, i.e. optimizer likely failed
to achieve much, and for code like this optimizations are very important.

The reason for the inline assembly is also worth mentioning: it’s for the rdtsc
instruction. Mainstream CPUs have had dynamic frequency for the last decade.
Recent CPUs return rdtsc values unrelated to the count of instructions
executed; it’s a counter running at some constant frequency. The code uses
AVX, so the CPU is at least Sandy Bridge or an AMD equivalent. Now in
2018, __asm rdtsc is no longer a good method for performance measurement.

~~~
ordu
_> The code uses inline assembly a lot, i.e. optimizer likely failed to
achieve much, and for code like this optimizations are very important._

Yes, optimizations are important, but not for the question the author tries to
ask. To answer his question he needs to control optimizations, to have
vectorized code and non-vectorized code. The simplest way to control
optimizations is to turn them off and do them by hand.

 _> Recent CPUs return rdtsc values unrelated to the count of instructions
executed; it’s a counter running at some constant frequency._

So this counter is the best fit for measuring the speed of code. What’s the
point of counting the number of instructions executed when different
instructions take different amounts of time to execute? If rdtsc counts clock
ticks, then it fits best. It is an OS-independent way to measure time. It is
compiler-dependent, but it works with gcc and clang, the compilers you can get
for free. Alternatively, it is not so hard to rewrite the two non-empty __asm
statements into any asm syntax your compiler understands. There are a total of
6 instructions.

~~~
Const-me
> it works with gcc and clang, the compilers you can get for free

When I’m building for Windows, I usually use Visual C++ 2017 community
edition, the compiler I got for free.

> into any asm syntax your compiler understands

When building 64 bit binaries, my compiler doesn’t understand any syntax. The
portable way to produce that instruction is the __rdtsc() intrinsic; it’s
supported by all major compilers, just in different headers: x86intrin.h for
gcc/clang, intrin.h for MSVC.

> There are total of 6 instructions.

Right, but I’m not sure assembly, inline or not, has good compatibility with
advanced optimization techniques implemented by all modern compilers. It’s
just too low level: harder to inline, impossible to move stuff across
registers…

~~~
ordu
_> When I’m building for Windows, I usually use Visual C++ 2017 community
edition, the compiler I got for free._

Is it an obstacle to get one more free compiler?

 _> When building 64 bit binaries, my compiler doesn’t understand any syntax._

Is it really? I'm surprised.

 _> The portable way to produce that instruction is the __rdtsc() intrinsic;
it’s supported by all major compilers, just in different headers: x86intrin.h
for gcc/clang, intrin.h for MSVC._

I’m glad to learn that. Thank you.

 _> Right, but I’m not sure assembly, inline or not, has good compatibility
with advanced optimization techniques implemented by all modern compilers.
It’s just too low level: harder to inline, impossible to move stuff across
registers…_

It’s not relevant: this code does not rely on compiler optimizations.

~~~
Const-me
> Is it an obstacle to get one more free compiler?

I have both gcc and clang installed, but they’re for Linux. I’m using Windows
as my main OS, and I prefer running native binaries, e.g. because of the tools
for debugging & profiling.

> It’s not relevant: this code does not rely on compiler optimizations.

Right, and that’s GCC’s problem — MSVC-produced code is optimized, and is
free from the described issue.

You know how I found that? I’ve made a version portable across compilers,
built it with VC++ and ran it on Windows: [https://github.com/Const-
me/MemoryAccessCosts](https://github.com/Const-me/MemoryAccessCosts) I don’t
have the Intel compiler, but my expectation is it’ll work even better.

~~~
ordu
_> Right, and that’s GCC’s problem — MSVC-produced code is optimized, and is
free from the described issue.

> You know how I found that? I’ve made a version portable across compilers,
> built with VC++ and run on Windows: [https://github.com/Const-
> me/MemoryAccessCosts](https://github.com/Const-me/MemoryAccessCosts) I don’t
> have Intel compiler but my expectation is it’ll work even better._

Yes! You have found that optimization doesn’t allow you to make the comparison.
The code is free of the described issue, so you cannot measure the effect this
issue has on the speed. As I understand from the asm listings, you are now
comparing not 1 iteration with 16 iterations, but 1 iteration with some
vectorized code. The original article covers this case, with hand-crafted
vectorized code.

Do you see? When optimization is turned on, you cannot compare 16 iterations
with 1 iteration or with vectorized code. It is the basics of the experimental
method: you need to control the independent variable and measure the dependent
one. With compiler optimizations on, that is much harder; you would need to
check the generated code.

Turn off vectorization of code in msvc, and hopefully you'll get the same
results.

~~~
Const-me
> you are now comparing not 1 iteration with 16 iterations

GCC made code with zero iterations, it unrolled the complete inner loop and
produced 16 scalar add instructions.

MSVC made code with two iterations, each iteration containing two paddd SSE2
instructions, which stand for _mm_add_epi32.
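For anyone following along, a sketch of what those paddd instructions amount to: summing 16 ints four lanes at a time with _mm_add_epi32 (SSE2; the function name is illustrative, and the horizontal sum at the end is just to produce a scalar result):

```c
#include <emmintrin.h>  /* SSE2 */

/* Sum 16 ints using two-at-a-time... rather, four-lane paddd additions.
   Four 128-bit loads, four _mm_add_epi32, then a horizontal sum. */
int sum16_sse2(const int *p) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < 16; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(p + i)));
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);  /* spill the 4 partial sums */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```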

> Turn off vectorization of code in msvc

I can’t. When I try, the compiler says “Command line warning D9002: ignoring
unknown option '/arch:IA32'”

SSE2 is a requirement for AMD64 CPUs, that’s why the MS compiler doesn’t
support scalar code anymore. And the data size is too large for 32 bit builds;
there’s not enough address space for an 8GB vector.

~~~
ordu
_> GCC made code with zero iterations, it unrolled the complete inner loop and
produced 16 scalar add instructions._

It doesn't matter really. That is the amount of work needed for the experiment.

 _> I can’t. When I try, the compiler says “Command line warning D9002:
ignoring unknown option '/arch:IA32'”_

I have zero experience with MSVC, but I believe you are doing it wrong. You
don't need to change the arch; you need control over the optimizations. For
example, gcc has options -ftree-vectorize, -ftree-loop-vectorize, -fno-tree-
vectorize and so on. You need to find docs on the compiler's optimization
techniques and on the options that let you control them.
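As a sketch of that kind of control in gcc (the per-function optimize attribute is GCC-specific, and the option name mirrors -fno-tree-vectorize; function names are illustrative):

```c
/* Two copies of the same loop: one pinned to scalar code via a GCC
   attribute, one left for the auto-vectorizer. Same result either way;
   only the generated code differs. */
__attribute__((optimize("no-tree-vectorize")))
int sum_scalar(const int *p, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += p[i];
    return s;
}

int sum_auto(const int *p, int n) {  /* compiler may vectorize this one */
    int s = 0;
    for (int i = 0; i < n; i++) s += p[i];
    return s;
}
```

Comparing the disassembly of the two at -O3 is the quickest way to confirm what the vectorizer actually did.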

 _> MS compiler doesn’t support scalar code anymore._

Strange, but ok. If we return to the starting question of why the author didn't
write the code portably, then we now know the answer. There is no way to
write scalar code with MSVC. MSVC doesn't allow making this kind of experiment,
so there is no reason to make such tests portable to MSVC.

~~~
Const-me
> You don't need to change the arch

See the pics:
[http://const.me/tmp/vc/codegen.png](http://const.me/tmp/vc/codegen.png)
[http://const.me/tmp/vc/optimize.png](http://const.me/tmp/vc/optimize.png)

> You need to find docs on compiler optimization techniques

Indeed, #pragma loop(no_vector) in the source code before the inner loop did
the trick. The results are very close to GCC.
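A minimal sketch of where that pragma goes (the pragma is MSVC-specific; other compilers just warn about the unknown pragma and compile the loop normally, so the function stays portable; the function name is illustrative):

```cpp
// Sum 16 ints; the pragma tells MSVC's optimizer not to vectorize
// the loop that follows, so it stays 16 scalar adds.
int sum16(const int* p) {
    int s = 0;
#pragma loop(no_vector)  // MSVC-only; ignored elsewhere
    for (int i = 0; i < 16; i++)
        s += p[i];
    return s;
}
```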

------
lscharen
To me, the takeaway is that there is a ~10x performance difference between the
near-worst-case memory access pattern (330 cycles) and the best case using
SIMD (41 cycles).

A potential order of magnitude improvement hiding in the memory access
patterns of your code makes this worthwhile information for a lot of devs.

~~~
corysama
Additionally, once you've done the wait for one byte, you've done 90% of the
wait for the 63 adjacent bytes. Software becomes memory bound when it peeks at
4 bytes here, 4 bytes there in zillions of unpredictable cache lines. So,
instead of waiting 41 cycles to process 64 bytes, it waits 300 cycles to
process 4.

It helps to think of the * and -> operators in C++ as "int
cache64BytesAndLoad4FromCache(int *p);" That's when you start to really feel
"int x = a->b->c->d; a = a->next;" deep in your gut.

------
mjw1007
When 'percache' is higher, there will be more instructions executed between
calls to pick_address_of_random_cache_line().

That means as the out-of-order processor "runs ahead" while waiting for the
first main-memory request, it will make fewer requests to main memory before
it runs out of space in its reorder buffers, and so it won't be able to
overlap so many main-memory requests.

I think this is the main effect they're seeing. It would fit in with SIMD code
giving better results, and it explains why the effect mostly goes away when
they make each load depend on the previous calculation.
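A sketch of that distinction (illustrative names): with independent loads the addresses are all known up front, so the out-of-order core can overlap the cache misses; with a dependent chain each address comes from the previous load, so nothing overlaps.

```c
/* Independent loads: the addresses a[idx[0]], a[idx[1]], ... don't
   depend on each other, so the reorder buffer can have many misses
   in flight at once. */
long sum_independent(const int *a, const int *idx, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += a[idx[i]];
    return s;
}

/* Dependent chain: the next address is the previous load's value,
   so each miss must fully resolve before the next can even issue. */
int walk_dependent(const int *a, int start, int n) {
    int j = start;
    for (int i = 0; i < n; i++) j = a[j];
    return j;
}
```

Making each load depend on the previous one, as in the article's follow-up test, turns the first pattern into the second, which is why the effect mostly disappears there.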

~~~
Const-me
No it’s not that. The CPU runs fine, the OP was screwed up by GCC optimizer.
With MSVC, there’s very small difference when testing with 8GB vectors, only
8% between 1 and 16.

Here’s my portable version, scalar-only, C++, and slightly more optimized:
[https://github.com/Const-me/MemoryAccessCosts](https://github.com/Const-
me/MemoryAccessCosts)

The results are in result-gcc.txt and result-msvc.txt in that repo. They’re
from the same PC, I’m using gcc 5.4.0 running in WSL.

~~~
mjw1007
I think those results just show that if you persuade the autovectoriser to
work you'll get the same result as for manual SIMD code, as you'd expect.

~~~
Const-me
Initially, I didn’t know the reason was automatic vectorization. I found the
reason and wrote the readme after that comment: [https://github.com/Const-
me/MemoryAccessCosts/commits/master](https://github.com/Const-
me/MemoryAccessCosts/commits/master)

