
> replaces short[65536] look up table

Is that not quite dim to begin with (a LUT the size of the whole L1 cache?), or does it work surprisingly well because of some probabilistic fudging?




The lookup table does surprisingly well because the workload is otherwise extremely cache-hostile: it doesn't really matter if you blow up your L1 cache, since none of the data you evicted to make room for the LUT was ever going to be reused anyway.

ML workloads are generally streaming workloads that read the entire dataset linearly on every iteration.
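For illustration, a minimal sketch of the kind of access pattern being described (hypothetical table and names, not the actual code under discussion): a short[65536] table is 128 KiB, several times a typical 32-48 KiB L1d, but in a streaming pass every input element is touched exactly once, so the lines it evicts were not going to be reused anyway.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical 128 KiB table: one 16-bit result per possible 16-bit input. */
    static int16_t lut[65536];

    /* Streaming pass over the whole dataset: each src element is read once and
       each dst element is written once, so the only reusable data is the LUT
       itself -- evicting everything else costs nothing. */
    void transform(const uint16_t *src, int16_t *dst, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = lut[src[i]];
    }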


One interesting option for these big memory scans on x86 and ARM CPUs is to use the non-temporal load/store instructions. Those largely bypass the caches* and may help with the cache pressure of LLM workloads that just do scans. The lookup table is still probably the wrong solution even with this sort of thing.

* Not quite all of it: there are still buffers for write combining, and some read caching happens on scans.
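A minimal sketch of the idea on x86 (hypothetical helper; assumes 16-byte-aligned buffers and a size that is a multiple of 16 bytes; ARM has analogous non-temporal pair instructions):

    #include <immintrin.h>
    #include <stddef.h>

    /* Copy a large buffer with non-temporal instructions so a pure
       streaming pass does not evict the rest of the cache. */
    void streaming_copy(void *dst, const void *src, size_t bytes) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < bytes / 16; i++) {
            /* MOVNTDQA (SSE4.1): non-temporal load. It only truly bypasses
               the cache on write-combining memory; on ordinary write-back
               memory it behaves like a normal load. */
            __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);
            /* MOVNTDQ: non-temporal store, drains through the
               write-combining buffers instead of the cache hierarchy. */
            _mm_stream_si128(&d[i], v);
        }
        _mm_sfence();  /* make the non-temporal stores globally visible */
    }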


"Why you (probably) shouldn't use a lookup table" (https://specbranch.com/posts/lookup-tables/) gives some discussion of when one is generally appropriate. My narrow experience is that you can do a lot of real-time calculation before doing a lookup becomes faster.
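As a toy illustration of that trade-off (my own example, not from the linked post): a byte-wise population count via a 256-entry table looks like a single load per call, but every cold access is a potential cache miss, while the computed version is a handful of ALU ops and no memory traffic at all.

    #include <stdint.h>

    /* 256-entry LUT, built at compile time with the classic macro trick. */
    #define B2(n)  (n), (n)+1, (n)+1, (n)+2
    #define B4(n)  B2(n), B2((n)+1), B2((n)+1), B2((n)+2)
    #define B6(n)  B4(n), B4((n)+1), B4((n)+1), B4((n)+2)
    static const uint8_t popcnt_lut[256] = { B6(0), B6(1), B6(1), B6(2) };

    static inline unsigned popcnt_table(uint8_t x) {
        return popcnt_lut[x];                 /* one load -- cheap only if it hits */
    }

    static inline unsigned popcnt_compute(uint8_t x) {
        x = x - ((x >> 1) & 0x55);            /* count bit pairs   */
        x = (x & 0x33) + ((x >> 2) & 0x33);   /* sum into nibbles  */
        return (x + (x >> 4)) & 0x0F;         /* sum the whole byte */
    }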


One case the post doesn't mention explicitly: sometimes your problem is lucky enough that you have the choice between:

1. for each function, for each datum, call function with datum

2. for each datum, for each function, call function with datum

For 1, if your main data is fetched predictably but does not saturate bandwidth, it can be beneficial to use even a fairly large LUT. Note also that while actual L1i pressure is semi-ignorable, things like branch prediction benefit much more from remaining "in cache".

For 2, assuming other functions also have their own miscellaneous data, the non-LUT approach is likely to win. But you should probably benchmark against #1.

But sometimes you can get the best of both worlds:

3. for each arbitrary chunk of data (smaller than L3, or maybe even L2 depending on when your throughput saturates), for each function, for each datum in the chunk, call function with datum

Yes, you should of course do whole-system benchmarks, but being able to guess helps you write a useful implementation to benchmark in the first place.
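A rough sketch of the three orderings (hypothetical names and a made-up CHUNK size; tune the chunk so a chunk of data plus each function's working set stays resident in L2/L3):

    #include <stddef.h>

    /* Hypothetical kernel type: each "function" transforms one datum and
       may carry its own miscellaneous per-function data. */
    typedef void (*kernel_fn)(float *datum);

    /* 1. for each function, for each datum */
    void run_function_major(float *data, size_t n, kernel_fn *fns, size_t nfns) {
        for (size_t f = 0; f < nfns; f++)
            for (size_t i = 0; i < n; i++)
                fns[f](&data[i]);
    }

    /* 2. for each datum, for each function */
    void run_data_major(float *data, size_t n, kernel_fn *fns, size_t nfns) {
        for (size_t i = 0; i < n; i++)
            for (size_t f = 0; f < nfns; f++)
                fns[f](&data[i]);
    }

    /* 3. chunked: CHUNK is a made-up value; measure on your hardware. */
    enum { CHUNK = 4096 };

    void run_chunked(float *data, size_t n, kernel_fn *fns, size_t nfns) {
        for (size_t c = 0; c < n; c += CHUNK) {
            size_t end = (c + CHUNK < n) ? c + CHUNK : n;
            for (size_t f = 0; f < nfns; f++)
                for (size_t i = c; i < end; i++)
                    fns[f](&data[i]);
        }
    }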


In the two cases you have specified here, #2 is almost always the winner on performance. So much so that in performance-sensitive code, many people will (justifiably) default to it without a benchmark. Computers almost always operate in a memory-bandwidth-bound state with comparatively idle cores, and #1 is likely to waste the resource that is almost always the binding constraint.

Examples are ECS systems in games, and async run-to-completion runtimes on servers. HPC systems also tend to operate this way.

Also, in the interest of disclosure, I wrote the blog post you are responding to.


For some parsing, serialization, or filtering tasks, you end up with quite large lookup tables of shuffles that would otherwise be pretty costly to compute on the fly, but real workloads hit only a tiny part of the lookup table at a time. (This comment fits entirely within the author's point about benchmarks, I guess.)
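A small sketch of the pattern (my own toy example, not from any particular library): left-packing 32-bit lanes with pshufb, driven by a table of shuffle controls indexed by a comparison mask. Here the mask is only 4 bits, so the table is tiny; the 8- and 16-lane variants of the same trick need 256- or 65536-entry tables, of which a given workload typically touches only a small, hot subset.

    /* Requires SSSE3 (compile with -mssse3) and GCC/Clang builtins. */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* 16-entry table: for each 4-bit "keep" mask, a pshufb control that
       packs the kept 32-bit lanes to the front and zeroes the rest. */
    static uint8_t shuf_table[16][16];

    static void init_shuf_table(void) {
        for (int mask = 0; mask < 16; mask++) {
            int out = 0;
            for (int lane = 0; lane < 4; lane++) {
                if (mask & (1 << lane)) {
                    for (int b = 0; b < 4; b++)
                        shuf_table[mask][out * 4 + b] = (uint8_t)(lane * 4 + b);
                    out++;
                }
            }
            for (int b = out * 4; b < 16; b++)
                shuf_table[mask][b] = 0x80;   /* high bit set -> zero byte */
        }
    }

    /* Keep the non-negative elements of src, left-packed into dst.
       Assumes n % 4 == 0 and dst has room for n elements (the full
       vector store may write past the packed count). */
    size_t filter_nonneg(const int32_t *src, int32_t *dst, size_t n) {
        size_t out = 0;
        for (size_t i = 0; i < n; i += 4) {
            __m128i v    = _mm_loadu_si128((const __m128i *)&src[i]);
            __m128i neg  = _mm_cmplt_epi32(v, _mm_setzero_si128());
            int keep     = (~_mm_movemask_ps(_mm_castsi128_ps(neg))) & 0xF;
            __m128i shuf = _mm_loadu_si128((const __m128i *)shuf_table[keep]);
            _mm_storeu_si128((__m128i *)&dst[out], _mm_shuffle_epi8(v, shuf));
            out += (size_t)__builtin_popcount((unsigned)keep);
        }
        return out;
    }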



