What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here. For an L1 hit that's usually 4 or 5 cycles, and for L2 hits and beyond it's worse, as you point out. How much that actually matters depends on whether the code is latency-bound and the involved lookup is on the critical path: in many cases where there is enough ILP it won't be (a general rule is that in most code, most instructions are not on a critical dependency chain).
If the involved method isn't that hot then L1 misses (like your example) or worse are definitely a possibility. On the other hand, in that case performance isn't that critical by definition. If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.
The comparison with 384 FOPs seems a bit off: I guess you are talking about some 32-FOP-per-cycle SIMD implementation (AVX-512?) - but the assumption of data dependencies kind of rules that out: one would assume it's scalar code here. If it's vectorization, then the whole equation changes!
If it's not vectorizable, the LUT result is often used for an indirect jump/call (like a large switch statement) or a memory access (say, a histogram, etc.).
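To make the basic trade-off concrete, here's a minimal sketch (all names hypothetical, not from either comment) of the pattern under discussion: replacing a branchy scalar calculation with a small read-only LUT, so the compares and branches disappear but a load joins the dependency chain:

```c
#include <stdint.h>

/* Branchy version: classify a byte as 0 = digit, 1 = letter, 2 = other. */
int classify_branchy(uint8_t c) {
    if (c >= '0' && c <= '9') return 0;
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return 1;
    return 2;
}

/* LUT version: a 256-byte read-only table replaces all the compares and
 * branches with a single load (~4-5 cycles of latency on an L1 hit). */
uint8_t class_lut[256];

void init_class_lut(void) {
    for (int c = 0; c < 256; c++)
        class_lut[c] = (uint8_t)classify_branchy((uint8_t)c);
}

int classify_lut(uint8_t c) {
    return class_lut[c];
}
```

The branches are gone, but the table load is now on the critical path if the caller depends on the result - which is exactly the latency question above.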
> What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here.
Yeah, I used a bit of sloppy terminology. Loads can affect performance system-wide. To be a win, the function being replaced by the LUT needs to be something pretty heavy, while the LUT itself needs to be small (less than 16 kB at the very least, preferably under 1 kB).
> If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.
That really depends. It's generally good to keep the L1 footprint small: a 32 kB L1 has just 512 64-byte cache lines, and a hyperthread sibling shares the L1 as well. There can be other hot loops nearby that could also benefit from a hot L1. It's very easy to start spilling to L2 (and further). Microbenchmarks often miss "system" level issues.
> The comparison with 384 FOPs seems a bit off
384 was for the extreme vectorization case, 12 x 2 x 8 FMACs (AVX). Most vendors count FMACs nowadays as two FOPs...
> If it's vectorization, then the whole equation changes
Well, isn't that where the performance wins are and what you need to do to extract maximum performance from that hot loop? A good truly parallel vector gather implementation could make (small) LUTs very interesting performance-wise.
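For what that would look like, here's a minimal sketch (helper name made up; assumes an AVX2-capable x86 CPU) using `_mm256_i32gather_epi32`, which issues 8 table lookups from one instruction:

```c
#include <immintrin.h>  /* AVX2: _mm256_i32gather_epi32 */
#include <stdint.h>

/* Hypothetical helper: 8 parallel lookups into a small int32 LUT.
 * Requires AVX2 at runtime. */
__attribute__((target("avx2")))
void lut_gather8(const int32_t *lut, const int32_t idx[8], int32_t out[8]) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4: the indices are in units of 4-byte elements */
    __m256i vals = _mm256_i32gather_epi32(lut, vidx, 4);
    _mm256_storeu_si256((__m256i *)out, vals);
}
```

One instruction, eight lookups - though whether that's actually fast on a given microarchitecture is a separate question.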
You've lost me. The parent comments were specifically talking about using LUTs to replace calculation, in particular calculations involving branches. So basically, rather than some ALU ops plus possibly some branches, you use a LUT and fewer or zero ALU ops, and fewer or zero branches.
No one is talking about the context of a LUT of function pointers being used for a switch statement.
A histogram is generally a read/write table in memory and doesn't have much to do with a (usually read-only) LUT - unless I missed what you're getting at there.
> Yeah, I used a bit of sloppy terminology. Loads can affect performance system-wide. To be a win, the function being replaced by the LUT needs to be something pretty heavy, while the LUT itself needs to be small (less than 16 kB at the very least, preferably under 1 kB).
Definitely. Useful LUTs are often a few dozen bytes: the JumpMForceData one I was referring to was only 8 bytes!
> Microbenchmarks often miss "system" level issues.
But it's like a reverse myth now: at some point (apparently?) everyone loved LUTs - but now it's popular to just dismiss any LUT use with "yeah, but a cache miss takes yyy cycles, which will make a LUT terrible!" or "microbenchmarks can't capture the true cost of LUTs!".
Now the latter is certainly true, but you can still put reasonable bounds on the cost. The key observation is that the "deeper" the miss (a miss to DRAM being the deepest, ignoring swap), the lower the implied frequency at which the LUT-using method was being called anyway. If the method is always missing to DRAM, the LUT entries are being evicted before the next invocation (that hits the same line), which must take a "while". Conversely, when the LUT method is very hot (a high frequency of calls), the opportunity for the LUT to stay in cache is great.
You can even analyze this more formally, basically looking at the cost-benefit of every cache line used by the LUT: if the LUT didn't use that line, what would be the benefit to the surrounding code? For any non-trivial program this gets hard to do exactly, but with integration benchmarks and performance counters you can certainly make some reasonable tests. Associativity makes everything tougher and less linear, though...
> Well, isn't that where the performance wins are and what you need to do to extract maximum performance from that hot loop?
Sure, if it can be vectorized. My point there was that you were discussing the latency of L2 misses rather than the throughput, which implies that the operations on consecutive elements were dependent (otherwise it would be the throughput of 1 or 2 per cycle that would matter). So you just have to keep it apples to apples: if you assume independent ops, you can perhaps vectorize, but then the LUT comparison is a throughput one (as is the scalar alternative); if the code is dependent and non-vectorizable, then the LUT latency becomes more important.
> A good truly parallel vector gather implementation could make (small) LUTs very interesting performance-wise.
Anything that can be vectorized usually adds another factor of 4 or 8 to performance, making it much harder for LUTs to come out on top - since the gather implementations on x86, anyway, just use the same scalar load ports and are thus still limited to 2 loads/cycle, without any "smarts" for identical or overlapping elements.
Sometimes you can use pshufb to do what amounts to 16 or 32 parallel lookups in a 16-element table of bytes. If your LUT can be made that small, it works great.
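For example, the classic per-byte popcount trick - a sketch assuming SSSE3, with a made-up function name - keeps a 16-entry nibble LUT entirely inside a register:

```c
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */
#include <stdint.h>

/* Count set bits in each of 16 bytes at once. The "LUT" is the popcount
 * of each nibble 0..15, held in an XMM register; pshufb performs 16
 * parallel lookups into it per instruction. Requires SSSE3. */
__attribute__((target("ssse3")))
void popcnt_bytes16(const uint8_t in[16], uint8_t out[16]) {
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i low_mask = _mm_set1_epi8(0x0F);
    __m128i v  = _mm_loadu_si128((const __m128i *)in);
    __m128i lo = _mm_and_si128(v, low_mask);                    /* low nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask); /* high nibbles */
    __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                               _mm_shuffle_epi8(lut, hi));
    _mm_storeu_si128((__m128i *)out, cnt);
}
```

Since the table lives in a register, there's no load latency or cache footprint at all - the best case for a LUT.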