Cache invalidation really is one of the hardest problems in computer science (surfingcomplexity.blog)
306 points by azhenley on Nov 26, 2022 | 152 comments



Slightly off topic...

Back before my Phoenix/Elixir days, we used to have to cache almost everything. Now we don't cache anything -- even with more traffic and fewer servers. Granted, the projects I work on are not getting millions of views every day, but they are still important to many customers.

We don't really design anything that needs to be cached anymore. Just better access patterns and table design as we get wiser.

Take Wordpress as an example. There really should not be a need to cache a Wordpress site. But some of the most popular plugins are those that cache. To be fair, I suppose, their legacy, non-optimal database design, access patterns, and code are responsible for this.

Just getting something up and aiming for product-market fit is priority #1; you can struggle with caching if it gets popular, at which point more resources can be thrown at it.


Elixir is not some golden ticket, and in my experience it still needed caching (ETS and/or Redis). I worked on an Elixir service that topped out at around 2k requests per second with 5 servers, even with some caching of the high-read/low-write flows. This was mostly limited by DB CPU pressure against Postgres, with lots of materialized views and monstrous queries being inefficient due to over-normalization and, maybe, partly affected by UUID primary keys.

I came to this system with prior experience using Go and MySQL, where I expected a base performance of 1-5k queries per second per node before any optimizations like (non-CPU) caching, but also didn't use triggers or anything fancy, which kept DB CPU pressure down.

This got a bit rambling and perhaps a bit off topic, but Elixir is not something that sidesteps the need for a cache when your DB is maxed out.


2000 requests per second implies your system was badly designed. If you’re saying it was mostly database bottlenecks, why would that not affect the Go version in the same way?


The point I attempted to make was that Elixir and your caching needs are orthogonal.

However, to indulge Go vs Elixir in this context (note it was two different systems, companies, and scales referenced above): Elixir has the amazing Ecto query builder, and I think poor schema design combined with good tooling let folks stay productive while building a couple of years of cruft into a tangled knot of table dependencies, access patterns, and materialized views.

Go tends to eschew ORM-like things, and I believe some pressure on the devs to think more about the actual queries being (hand) generated would have exposed schema inefficiencies earlier (instead of Ecto automagically pulling all the data you need, you see the heavy lifting when writing the queries).

And to your point, with a fucked schema and/or access patterns, Go vs Elixir or any other language is a moot point. Cheers!


Of course you can use an ORM with Golang, and you will have reduced throughput. However, I wonder how much you could profile both languages and fix the bottlenecks by rewriting things. It can also depend on the machine and how the VM allocates threads and memory (for example, multi-chiplet or multi-rambank support).


Ecto isn’t an ORM; it’s two things: 1) an Elixir DSL for building SQL queries, and 2) Ecto.Changeset, a system for validating and managing data changes before they go into your database.

I think bad database design can happen anywhere, and good database design can happen in a PHP application. It’s not necessarily language dependent, and continually refactoring technical debt in the database design can be some of the most productive work you do on a service.


> Take Wordpress as an example. There really should not be a need to cache a Wordpress site. But some of the most popular plugins are those that cache.

Caching gives sites about two orders of magnitude speed uplift in my experience. Sometimes more. Especially if you are building the pages live and compressing them.

Brotli -11 is really, really expensive, and egress data transfer is one of the more expensive commodities to purchase from cloud vendors.

Caching also reduces latency significantly - which is a much better user experience.


> Take Wordpress as an example. There really should not be a need to cache a Wordpress site. But some of the most popular plugins are those that cache. To be fair, I suppose, their legacy, non-optimal database design, access patterns, and code are responsible for this.

To anyone reading this, do not assume that you don't need to cache your Wordpress site! I remember running a test on my site once (basic WooCommerce site running on a DigitalOcean VPS), and simply turning on the free version of WP Fastest Cache was the difference between serving 30 users/min and crashing vs. serving 1000. Drastically lowered page load times too.


In my experience, caching WordPress became essential on shared hosting, which one of my clients was somewhat bound to for... reasons. Without it, each page load was atrociously slow, but once caching was enabled and the cache was warmed, the site moved as quickly as anywhere else.

It's interesting because a custom legacy PHP application on the same shared host didn't demonstrate such slowness, but I did write some aggressive in-request caching (static variables for repeatedly called functions, mainly) to great performance effect.


Just throwing more resources at it will make your hosting company happy. I work for one, so I will not disagree with you. If I could mention its name I would, though, since a lot of what we do is help our clients make their web applications run more efficiently (and thus save them a lot of money in the long run).


WordPress by itself isn’t so bad. It’s the plugin/theme developers who likely have no idea what they’re doing or why what they’re doing is bad.


My client is running 21 plugins :(


At Automattic, I think the mu-plugins folder for WordPress.com had at least a hundred directories, and at least as many again as individual files. It was massive.


Keeping operating modes simple is something that pays off especially in large scale systems (let alone smaller ones): https://archive.is/2S2ME (Doing Constant Work, Colm MacCárthaigh, AWS Builders' Library).

Generally, I've found that, in the guise of doing the simplest thing, I've done the most naive thing possible, which unfortunately comes with the burden of tech debt. So, as a rule of thumb, given the small tech shop we are, I tend to favour systems that are fully managed or serverless and prefer to take on whatever tech debt using them throws at me.


WP cache is "necessary" if you run on a low-end shared host, where contention for resources, context switching, and basically any dynamic feature of the site gets too slow in a big traffic surge.


1M views per day is only ~12 views per second. It should be possible to easily run at least 1K DB queries per second. Based on this, it seems that you would have quite a bit of room with a non-cached model. Of course, peak traffic estimates are really the key.

I don't understand how Phoenix/Elixir impacts this however.


Phoenix/Elixir (really, BEAM) enforce a worldview that results in non-bottlenecking code. You can do this yourself with a certain amount of discipline in other languages. You should do this yourself if you are getting millions (or billions, etc...) of hits per second, to gain access to a system with a lower constant factor than BEAM. This is why Discord (for example) shifted to Rust.

Two interesting things are true:

1. Engineers have, through lack of experience (or care), bottlenecked systems that had 12 views per day. Guardrails are useful, particularly if there are shared resources or bottlenecks can impact busier neighbors.

2. The threshold where a shift out of Elixir is a good idea climbs at the same rate as Moore's law. Discord made their Elixir->Rust moves a lot later than Twitter made Ruby->Java ones.


Bottlenecked by what? Cache misses? IO?


Aren't you just pushing the caching down from the HTML and app layers to the DB, OS, disk, and CPU cache layers?

Something always needs to be cached, so the DB and disk need at least block caches.


If your main page is put together by running a bunch of queries it’s just a waste of resources to run these for every bot hit.


I believe in caching where it makes sense. This is usually app and implementation specific. But I do agree that a cache shouldn't make up for bad DB design and tuning.


> Cache coherency ensures that the behavior is correct, but every time a cache is invalidated and the same memory has to be retrieved from main memory again, it pays the performance penalty of reading from main memory.

1. First no. If any such rereads occur, they will be from the LLC (last-level cache, or L3 cache for Intel/AMD CPUs).

2. Second no. IIRC, modern caches are snooping and/or directory caches. This means that when Core#0 says "I'm changing cacheline X", Core#0 knows that Core#1 has it in its L2 (or deeper) caches. So Core#0 will publish the results of that change to Core#1.

3. The examples they gave are missing read/write barriers, which are rather important. #2 will only happen during read/write barriers. That is to say: your code is memory-unsafe by default. If order and cache-invalidation matters, you actually need a barrier instruction (!!!) to make sure all these messages are passed to the right place at the right time.

---------

For better or for worse, modern CPU design is about doing things thread-unsafe by default. If the programmer recognizes a problem may occur, it is the responsibility of the programmer to put the memory barriers in the right place.

This is because the CPU is not the only thing that can reorder things... but also the cache system... as well as the compiler. So memory barriers inform the compiler + CPU + cache system when to synchronize.
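To make that concrete, here is a minimal C++11 sketch (illustrative names, using standalone fences rather than the more common release/acquire operations on the atomic itself): the fences are the points where the compiler, the CPU, and the cache system are told to synchronize.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready{false};   // flag, accessed with relaxed ordering plus explicit fences

    void producer() {
      payload = 42;                                          // write the data
      std::atomic_thread_fence(std::memory_order_release);   // nothing above may sink below this point
      ready.store(true, std::memory_order_relaxed);          // publish the flag
    }

    void consumer() {
      while (!ready.load(std::memory_order_relaxed)) { }     // spin until the flag is seen
      std::atomic_thread_fence(std::memory_order_acquire);   // nothing below may hoist above this point
      assert(payload == 42);                                 // guaranteed to observe the write
    }

    int main() {
      std::thread t1(producer), t2(consumer);
      t1.join();
      t2.join();
    }

Without the two fences (and with a non-atomic flag) the compiler or the hardware is free to reorder the payload write past the flag store, which is exactly the "memory-unsafe by default" behavior described above.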

The original Netflix article is more precisely worded.

> This consistency is ensured with so-called “cache coherency protocol.”

With a link to MESIF (F for Forwarding state, which is one such protocol / way to share L1 cache info without going all the way to LLC or DRAM)

I'm pretty sure that GPUs actually invalidate the cache dumbly, without any MESI / MESIF (or whatever). GPUs are actually really bad at these kinds of ping-pong operations and synchronization... preferring thread-fences and other synchronization mechanisms instead.

------

That being said, I think the blogpost is a good introduction to the subject. But note that it's a bit imprecise with some of the low-level details. I guess it's correct for the GPU world though, so it's not completely inaccurate...

-----

The original Netflix article has an even more difficult problem revolving around Java's Superclass cache and how it is affecting "true sharing" of caches. I'm not very familiar with JVM internals, so I lost track of the discussion at that point.


> 3. The examples they gave are missing read/write barriers, which are rather important. #2 will only happen during read/write barriers. That is to say: your code is memory-unsafe by default. If order and cache-invalidation matters, you actually need a barrier instruction (!!!) to make sure all these messages are passed to the right place at the right time.

Wait... wait what? Am I to understand that if I have code like this:

    #include <cstdio>
    #include <mutex>
    #include <thread>

    struct {
      int foo;
      int bar;
    } foobar;

    std::mutex io_mutex;

    void baz(int* x, int n) {
      *x = 0;
      for (int i = 0; i <= n; i++)
        *x += i;
      io_mutex.lock();
      std::printf("%d\n", *x);
      io_mutex.unlock();
    }

    int main() {
      std::thread t1(baz, &foobar.foo, 10000);
      std::thread t2(baz, &foobar.bar, 10000);
      t1.join();
      t2.join();
    }
You're saying this code is memory-unsafe without adding barrier instructions? (as opposed to just having terrible performance) (Let's also assume the compiler isn't smart enough to just compile the loop down to `*x = n*(n+1)/2;` and actually reads/writes the value behind the pointer.)


    io_mutex.lock();
    std::printf("%d\n", *x);
    io_mutex.unlock();
Have you ever looked at lock/unlock code? There's a memory + compiler barrier inside of any mutex.

The easiest way to use memory barriers is through locking/unlocking mutexes. Alas, that's too slow for a number of applications, so advanced programmers have begun to use more complicated patterns.


But you aren't actually mixing data here, right? Even though foo and bar could be on the same cache-line, the threads need to read/write either foo OR bar, so the respective core caches being out of sync for the var they don't use does not really cause any safety issues. Maybe I'm missing the point of your example.

No idea how the two cores' copies of that cache line get merged back to main memory in the end though; is there some kind of write mask for that?

Also I imagine them being in the same struct is really just an implementation detail and shouldn't actually influence the memory safety?


I was going to say something like this but I like this version better. Key thing to keep in mind: this is the Intel Core micro-architecture; other micro-architectures may make other choices.

In particular, using snooping or shoot downs is one of those choices. Why use one or the other? Consider whether your memory controller is "single threaded" or not[1]. If you have multiple memory controllers that have access to the same "slow" memory, then a shoot down (cache invalidation message) over the cache coherency interconnect can start the refresh transaction for the unoccupied memory controller context. In these cases you can overlap memory transactions and "hide" some of the latency. If you're snooping and see a write-through (where the write is headed for main memory), you can mirror a copy of that write-through to update your cache, but you might simply invalidate your cache if there is a shared cache with the CPU (thread) in question, as it will get changed in the shared cache and the invalidation will cause the shared cache to move it from L2 to L1 as soon as you look at it again.

The availability of multi-core RISC-V designs for FPGAs gives someone who is interested in this an opportunity to actually try the different strategies and to compare instruction throughput (instructions per clock) and maximum latency for correct operation.

There was a good paper at Microprocessor Forum in the early 2000's from Toshiba on a multi-core cache design where the L2 cache was visible but segmented between cores (it appeared as a single cache, but you didn't have access to invalidate all of the look-aside buffers). I thought it was very creative, and of course it rewarded algorithms that understood that (small working set per thread, larger working set for the application).

[1] Basically how many memory transactions can it have outstanding at any given time.


Directory vs snooping just means you either know exactly which other agents have an interest in a line so you can target them directly, or that you don't so you broadcast and other agents have to "snoop" requests and respond to relevant ones.

The logical coherency protocol that is (somewhat) above that level is what would dictate whether you can forward modified data between CPUs or it would have to write back and be fetched from memory.

AFAIK protocols don't tend to have the writer publish its modification to other owners for two major reasons. First, because you then would no longer own the line exclusively and would have to go off-chip if you wanted to modify it again immediately afterwards. Second, because problems with supporting atomic operations and preventing deadlocks in the protocol can become unmanageable if you could have other owners while you are writing lines. I believe generally the transfer of data is demand-driven, so if core 1 wanted to read the line, it would ask core 0 for it. And modifications will always remove access permission from other cores before the modification completes, so core 0 would have to send an intent-to-write message to core 1 in this case before its store could reach coherency.

Barriers don't necessarily drive any coherency on a cache line level. Caches are always coherent (ignoring access by incoherent agents, where the coherency has to be managed at a software level). Barriers provide ordering between more than one memory access.

The MESIF page is a good basic intro to the concepts; naturally the state of the art has moved beyond that. Here for example https://course.ece.cmu.edu/~ece742/f12/lib/exe/fetch.php?med... a 15-year-old CPU has 13 states in its coherency protocol. It's difficult to find a lot of public information about this stuff. Suffice to say these things are so complex that mathematical proofs are used to ensure correctness, no deadlocks, etc. You are right that data, including updates, can go directly between agents rather than via memory.

But no matter how smart the hardware gets, the advice to programmers is still the same: stores to a line will slow down access to that line by any other agent, so sharing read-only lines between CPUs is fine, but don't add a lot of writes to the mix.



> For better or for worse, modern CPU design is about doing things thread-unsafe by default. If the programmer recognizes a problem may occur, it is the responsibility of the programmer to put the memory barriers in the right place.

This so much. I truly wish RISC-V had decided on the weak memory model (WMM)[0] that allowed load-value speculation in the absence of explicit `Reconcile` fences between dependent loads (i.e., when first loading an address from memory and then loading a value from that address, it can load an old value that was stored after the value's address was stored to memory, provided it speculated the value (of pointer-type) of the first load correctly and loaded from this speculated address before it properly executed the first load).

Not needing to repeat the load from a speculated address in response to that address getting a cache invalidation before the load that fetched the (correctly speculated) pointer "goes through" severs the cache snooping/invalidating feed to the reorder buffer (ROB). Essentially, the difference from banning the dependent load-load reordering is that WMM frees up the second load's ROB entry once the value speculation of the first load confirms, so when the first load instruction returns and affirms the value speculated for it, the second load can retire immediately with the first load to free up both ROB entries. The decoupling between an L2 invalidating other L2s and responding to another L2's read request allows the "Invalidation Buffer" of an L2 to function similarly to an L1's store buffer. So if L3 is implemented by just giving each core a large multi-bank L2, so that the L3 slices are just the different cores' L2s, write congestion into an L2 through other cores all writing to the same bank is just buffered away (and redundant stores can be squashed and adjacent stores write-combined, which is useful if the writing cores specify they themselves don't want read access to the cachelines those writes target).

See this excerpt from the paper:

> Speculations: OOO can issue a load speculatively by aggressive predictions, such as branch prediction (Figure 3d), memory dependency prediction (Figure 3e) and even load-value prediction (Figure 3f). As long as all predictions related to the load eventually turn out to be correct, the load result got from the speculative execution can be preserved. No further check is needed. Speculations effectively reorder dependent instructions, e.g., load-value speculation reorders data-dependent loads. Since WMM does not require preserving any dependency ordering, speculations will neither break WMM nor affect the above correspondence between OOO and WMM.

Yes, WMM is similar to Alpha, just that Alpha allowed a load from some address to be reordered after a store to a different address (which is forbidden in WMM).

[0]: https://arxiv.org/abs/1707.05923


Not all the world is x86; ARM doesn't do as much.


ARM is cache coherent, meaning it has some kind of MESI (or better, like MESIF) protocol going on.

ARM is _RELAXED_ however, or perhaps... "more relaxed" than x86. More orderings are possible on ARM, but that only makes memory-fences (load-acquire, store-release) even more important.


> but that only makes memory-fences (load-acquire, store-release) even more important.

I would say that the importance is in the correctness of the algorithm that implements the shared memory synchronization.

Lowering the algorithm/code to architecture instructions that enforce memory ordering is trivial at that point. So what if it means a fence instruction isn't needed on x86 but it is on arm64 at some particular point in the code.


Makes me wonder why there are no portable system language keywords that abstract "hey, I know my peripheral touched this DRAM, please don't take it from the cache".


I suspect you're talking about the case where the peripheral issues a memory read request and you want the coherence protocol to return the value from CPU cache via snoop instead of having that value already been evicted from CPU cache and having to go to DRAM (off-chip memory) for the value?

If the peripheral issues a memory write, that location in the CPU cache must be invalidated so a CPU memory read of the location does not return an old/stale value.

In my own experimentation (non real-world use case) on a very specific system, I was surprised by the rate of peripheral read requests that resulted in snoop hits where the value would be returned from CPU cache (instead of from DDR PHY controller). The base case was surprisingly low. Modifying the experiment to have the CPU continuously access the memory (read accesses) while the CPU-peripheral interaction was taking place resulted in much much higher snoop hit rates. The overall performance difference between the two cases was not as big as I would have hoped at all. Perhaps the value being returned from DDR PHY controller was not as slow as I would have expected (some unknown/unexpected behavior caching/bypassing in the DDR controller?)--again, this was not a use case that had real-world memory accesses...

A language keyword for "please don't take it from cache" is tricky because it would be an incredibly low low low level specifier intended to be used for performance reasons in a system whose performance is very complex to reason about. Maybe having more knobs could help (much easier to use this language specifier rather than having to write code to have the CPU continuously access the memory hoping that will keep it in cache), but I think this could get into the realm where people get distracted by performance and just start doing things in the name of performance without having proper controls and measurements in place for assisting in understanding what may be happening in the system.

Instructions related to the memory model exist for correctness. Memory prefetch instructions are just suggestions for an already sophisticated memory unit. Memory QoS can be thought of as having an impact on performance, but it is a much higher level solution aimed at partitioning of resources.


std::atomic in C++. I think Rust has something as well.


Isn't that what C's volatile keyword does?


Close, but not fully. Volatile enforces an actual memory read instruction (it does not use a register-cached value), but provides no memory barriers. In ye (really) olden days that seemed enough, but then, you know, backwards compatibility...


Well, that's what Java's "volatile" actually enforces.

C's "volatile" isn't enough, but more recent C compilers have atomics and memory models.

The memory-model problem wasn't solved until the 00s (that late!!!) when multicore CPUs became more commonplace. Java, C++11, and other language / language updates after that point quickly adopted memory models. And the rest is history.

So yeah, use atomics + memory barriers.
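As a small sketch of that (C++11 here, but Java's volatile behaves similarly for the flag; names are illustrative): the release/acquire pair on the atomic flag is what orders the plain write, which is the part C's volatile never promised.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                         // ordinary, non-atomic field
    std::atomic<bool> published{false};   // the "flag"; in Java this would be a volatile boolean

    void writer() {
      data = 123;                                         // plain write
      published.store(true, std::memory_order_release);   // release: the write above becomes visible first
    }

    void reader() {
      while (!published.load(std::memory_order_acquire)) { }  // acquire: pairs with the release store
      assert(data == 123);                                    // cannot observe a stale value
    }

    int main() {
      std::thread w(writer), r(reader);
      w.join();
      r.join();
    }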


It has earned its title as "one of the two hardest problems in computer science": cache invalidation, naming things, and off-by-one errors.


IMHO off-by-one errors only deserve to be there for the irony. In reality, they are orders of magnitude easier than the other two.


Yes, off-by-one is easier to fix, but far more common. Hence if suffering is the product of frequency and difficulty, then off-by-one suffering (S=FD+1) is the same order as cache invalidation (S=FD).


That is, indeed, the joke.


No? They're talking about something mostly orthogonal to the joke, which is whether it actually deserves to be in the list.

It's the difference between "here's three very hard things, let's make a joke with them" and "here's two very hard things, but we can make a joke if we shoehorn in a less-hard thing".


But that is why it's there, as a joke, not as an actual hard thing.


Are you sure? I think it is intended as an actually hard problem and is forming a joke.

I've never interpreted "oh it's not actually hard" as being part of the joke.


I always assumed that at the time it was a relatively harder problem, with lower-level languages that are mostly array-focused, no real built-in containers and no for-each support. This was cured with better programming languages.

On the other hand, cache invalidation and naming things are problems that grow harder and harder when systems get more complex, which they obviously did.


Perhaps the category was improperly named.


the multiple layers of irony in this one just always amaze me


You may also like this one (from https://twitter.com/mathiasverraes/status/632260618599403520):

There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery


I like how your answer wasn't just one of the list. I was almost not going to post this reply to illustrate your point.


Despite the complexity of the cause of the problem, the fix was really simple: they just introduced 64 bytes of padding between the two struct members to guarantee that they never could fall within the same cache line. Sometimes the best solution is brute force!


They still saw the same performance though due to “true sharing” or CPU enforced memory ordering on a highly contended variable.

Patching the JDK to skip the secondary cache altogether was the real solution.


Weird they added 64 bytes and not 56, which is the minimum required to have the two variables in different cache lines.
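The actual JVM layout aside, here is a toy C++ sketch of the arithmetic (field names made up): 56 bytes of padding puts the two 8-byte fields exactly one line apart, which is enough only if the object itself is 64-byte aligned; a full 64-byte gap keeps them on different lines no matter how the object is aligned.

    #include <cstddef>
    #include <cstdint>

    // 8-byte field + 56 bytes of padding: the second field starts 64 bytes after the first.
    struct alignas(64) Padded56 {
      std::int64_t hot_a;   // offset 0
      char pad[56];
      std::int64_t hot_b;   // offset 64: a different line, *given* the alignas
    };
    static_assert(offsetof(Padded56, hot_b) == 64, "fields are one line apart");

    // 64 bytes of padding: a full line-sized gap between the end of hot_a and the start of hot_b,
    // so they can never share a 64-byte line regardless of where the object happens to sit.
    struct Padded64 {
      std::int64_t hot_a;   // offset 0
      char pad[64];
      std::int64_t hot_b;   // offset 72
    };
    static_assert(offsetof(Padded64, hot_b) - sizeof(std::int64_t) == 64, "line-sized gap");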


Maybe the 64 bytes was better for some kind of alignment with other architectural structures....


Here's the HN discussion of the original Netflix TechBlog article that this one references: https://news.ycombinator.com/item?id=33540091


> false sharing

Usually the "cache invalidation" remark is about correctness and knowing when to invalidate, like for example making sure all your login tokens get expired when you change your password. This is about performance impacts from having multiple logical entities in one shared cached entity.


Yes I’d say this is more along the lines of concurrency problems than caching problems: it’s about synchronizing access to memory (cache) between multiple cores.


More generally, it's about making sure that no consumer uses stale data, in the presence of multiple concurrent producers and consumers.


In the fix in the original Netflix blogpost, shouldn't they pad with 56 instead of 64 bytes? Otherwise they are wasting some cache space.

I would also expect the fix to reflect the platform it is applicable for, but maybe they simplified it for the blog post...


You’re correct as far as I know, and now I wonder why the Netflix people didn’t elaborate on this choice.


Off topic: What level of sophistication about modern CPUs is _good_ to have? And where does one learn it? Resources?

Say that I want to work with HPC-type applications and otherwise just squeeze the most out of the processor, e.g. quant finance / trading systems.


First question is a bit subjective, but Hennessy and Patterson is a good place to start.

It's thick but reads fairly easy in my recollection


There are a lot of issues here, so I can share some stuff about some of them and hope that some helpful internet commenters come along and point out where I have neglected important things.

A single modern CPU core is superscalar and has a deep instruction pipeline. With your help, it will decode and reorder many instructions and execute many instructions concurrently. Each of those instructions can operate on a lot of data.

As famous online controversial opinion haver Casey Muratori tells us, most software just sucks, like really really bad (e.g. commonly people will post hash table benchmarks of high-performance hash tables that do bulk inserts in ~100ns/op, but you can do <10ns/op easily if you try), and using SIMD instructions is table stakes for making good use of the machinery inside of a single CPU core. SIMD instructions are not just for math! They are tools for general purpose programming. When your program needs to make decisions based on data that does not contain obvious patterns, it is often a lot cheaper to compute both possible outcomes and have a data dependency than to have a branch. Instructions like pshufb or blendv or just movemask and then using a dang lookup table can replace branches. Often these instructions can replace 32 or 64 branches at a time[0]. Wojciech Muła's web site[1] is the best collection of notes about using SIMD instructions for general-purpose programming, but I have found some of the articles to be a bit terse or sometimes incorrect, and I have not yet done anything to fix the issue. "Using SIMD" ends up meaning that you choose the low-level layout of your data to be more suitable to processing using the instructions available.
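As a concrete (if hedged) illustration of the branches-to-masks idea, not taken from any of the links above: count the bytes in a buffer equal to some value with SSE2, so 16 per-byte branches collapse into a compare, a movemask, and a popcount.

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstddef>
    #include <cstdint>

    // Assumes len is a multiple of 16 for brevity; __builtin_popcount is GCC/Clang-specific.
    std::size_t count_byte_sse2(const std::uint8_t* buf, std::size_t len, std::uint8_t target) {
      const __m128i needle = _mm_set1_epi8(static_cast<char>(target));
      std::size_t total = 0;
      for (std::size_t i = 0; i < len; i += 16) {
        const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
        const __m128i eq    = _mm_cmpeq_epi8(chunk, needle);   // 0xFF where bytes match, 0x00 elsewhere
        const int mask      = _mm_movemask_epi8(eq);           // one bit per byte lane
        total += static_cast<std::size_t>(__builtin_popcount(static_cast<unsigned>(mask)));
      }
      return total;
    }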

Inside your single CPU core there is hardware for handling virtual -> physical address translation. This is a special cache called the translation lookaside buffer (TLB). Normally, chips other than recent Apple chips have a couple hundred entries of 1 4KiB page each in the TLB, and recent Apple chips have a couple hundred entries of 1 16KiB page each. Normal programs deal with a bit more than 1 meg of RAM today, and as a result they spend a huge portion of their execution time on TLB misses. You can fix this by using explicit huge pages on Linux. This feature nominally exists on Windows but is basically unusable for most programs because it requires the application to run as administrator and because the OS will never compact memory once it is fragmented (so the huge pages must be obtained at startup and never released, or they will disappear until you reboot). I have not tried it on Mac. As an example of a normal non-crazy program that is helped by larger pages, one person noted[2] that Linux builds 16% faster on 16K vs on 4K pages.
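Not exactly what is described above (explicit huge pages go through MAP_HUGETLB or hugetlbfs), but as a Linux-only sketch of the cheaper transparent-huge-page variant:

    #include <sys/mman.h>
    #include <cstddef>

    // Ask the kernel to back a large anonymous mapping with huge pages.
    // madvise(MADV_HUGEPAGE) is only a hint; the kernel may or may not comply.
    void* alloc_thp(std::size_t bytes) {
      void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) return nullptr;
      madvise(p, bytes, MADV_HUGEPAGE);
      return p;
    }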

Inside your single CPU core is a small hierarchy of set-associative caches. With your help, it will have the data it needs in cache almost all the time! An obvious aspect of this is that when you need to work on some data repeatedly, if you have a choice, you should do it before you have worked on a bunch of other data and caused that earlier data to be evicted (that is, you can rearrange your work to avoid "capacity misses"). A less obvious aspect of this is that if you operate on data that is too-aligned, you will greatly reduce the effective size of your cache, because all the data you are using will go into the same tiny subset of your cache! An easy way to run into this issue is to repeatedly request slabs of memory from an allocator that returns pretty-aligned slabs of memory, and then use them all starting at the beginning. That this could cause problems at all seems relatively unknown, so I would guess lots and lots of software is losing 5-10% of its performance because of this sort of thing. Famous online good opinion haver Dan Luu wrote about this here[3]. The links included near the bottom of that post are also excellent resources for the topics you've asked about.

When coordinating between multiple CPU cores, as noted in TFA, it is helpful to avoid false sharing[4]. People who write trading systems have mostly found that it is helpful to avoid sharing *at all*, which is why they have work explicitly divided among cores and communicate over queues rather than dumping things into a concurrent hash map and hoping things work out. In general this is not a popular practice, and if you go online and post stuff like "Well, just don't allocate any memory after startup and don't pass any data between threads other than by using queues" you will generally lose imaginary internet points.

There are some incantations you may want to apply if you would like Linux to prioritize running your program, which are documented in the Red Hat Low Latency Performance Tuning guide[5] and Erik Rigtorp's web site[6].

Some other various resources are highload.fun[7], a web site where you can practice this sort of thing, a list of links associated with highload.fun[8], Sergey Slotin's excellent online book Algorithms for Modern Hardware[9], and Denis Bakhvalov's online course Perf Ninja[10] and blog easyperf[11].

> Off topic: What level of sophistication about modern CPUs is _good_ to have?

Probably none? These skills are basically unemployable as far as I can tell.

[0]: https://github.com/lemire/despacer

[1]: http://0x80.pl/articles/index.html

[2]: https://twitter.com/AtTheHackOfDawn/status/13338951151741870...

[3]: https://danluu.com/3c-conflict/

[4]: https://rigtorp.se/ringbuffer/

[5]: https://access.redhat.com/sites/default/files/attachments/20...

[6]: https://rigtorp.se/low-latency-guide/

[7]: https://highload.fun/

[8]: https://github.com/Highload-fun/platform/wiki

[9]: https://en.algorithmica.org/hpc/

[10]: https://github.com/dendibakh/perf-ninja

[11]: https://easyperf.net/notes/


All of this, especially the end.

Funnily enough, the part about not sharing cache feels a lot like Erlang with less scheduling...


Hey, thanks for sharing! That was a lot of effort to type all that up, with references to boot.


If you’re just starting out, I suggest the introductory book by Patterson and Hennessy (not the Quantitative Approach, which is a tome) - https://www.amazon.in/Computer-Organization-Design-MIPS-Inte...

Another one is Computer Systems: A Programmer’s Perspective: http://csapp.cs.cmu.edu/


CS looks so much like biology sometimes

Sometimes the relevant phenomenon lies at the scale of cache-lines, of which there is a whole zoo of possible specifications


I've made lots of headway with MBA types explaining software as a lifeform. Care and feeding and mutations and cancer.


Great article from a few years ago "A Codebase is an Organism": https://meltingasphalt.com/a-codebase-is-an-organism/

HN discussion: https://news.ycombinator.com/item?id=20787848


Emergent behavior…


In the fix, they have padded. Instead of padding, is there a compiler directive that forces a variable to start on a fresh cache line (instead of putting it at the end of the previous cache line)?

The reason I am asking: it would make the fix independent of the architecture and size of cache lines.

Also, a compiler directive/annotation would help developers use it more liberally when they think false sharing can happen.


C++17 has std::hardware_destructive_interference_size which you can use in conjunction with alignas to get this behavior, see https://en.cppreference.com/w/cpp/thread/hardware_destructiv...

I’m not sure whether a single target architecture can (either in theory or in practice) map to hardware with different line sizes. If so, another problem is that the compiler’s idea of the cache line size might not match the hardware’s.
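A short sketch of that combination (assuming a standard library that actually defines the constant):

    #include <atomic>
    #include <new>   // std::hardware_destructive_interference_size (C++17)

    // Two counters that are updated by different threads; the alignas keeps each
    // on its own cache line so neither update invalidates the other's line.
    struct Counters {
      alignas(std::hardware_destructive_interference_size) std::atomic<long> hits{0};
      alignas(std::hardware_destructive_interference_size) std::atomic<long> misses{0};
    };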


AArch64 processors are available with 64 and 128 byte cache lines, and big.LITTLE processors may report different icache line sizes on different cores[1], so a single thread can even be rescheduled across different icache line sizes (no idea about dcache).

[1]: https://www.mono-project.com/news/2016/09/12/arm64-icache/


There is __attribute__((aligned)) that some compilers provide (e.g. gcc, clang) but it's non-standard.
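For reference, the GCC/Clang spelling looks roughly like this (non-standard, as noted; the type name is just an example):

    // Force every object of this type to start on a 64-byte boundary,
    // which matches the cache line size on most current x86 parts.
    struct Klass {
      long a;
      long b;
    } __attribute__((aligned(64)));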


Thank you for the response.

Can we write like this? Klass ssc __attribute__((64));

This could convey more info than padding.

Further, I wish gcc provided an attribute that explicitly forces cache-line alignment rather than just generic alignment. Generic scalar type alignment is not the same as cache line alignment.

Example: Klass sc __attribute__((start_at_cache_line));


alignas(std::hardware_destructive_interference_size) is the portable solution.


Thank you for sharing this. I was not aware of it. But now I agree, it is indeed a portable solution.


alignas() has been standard for a decade now.


Cache invalidation really sounds to me like a math problem. Instead of a processor, let's say it is a jewel store with limited display space at the front of the store and a lot of inventory in the back. Instead of cache misses you have window shoppers and walk-ins that won't engage with sales staff or have limited time, so you have lost sales per customer, and a cache hit is the opposite of that. At what point do you invalidate a jewel displayed at the front and rotate it with an inventory item? Let's also assume customers buy multiple jewels of the same type and will not buy in smaller quantities than they desire or wait for staff to get more from the back, because you have a competing store across the street.

Is that about right, the way I described the problem? If so, this really might be a math problem like the travelling salesman, not exclusive to compsci.


What you are talking about is not really cache invalidation but a cache algorithm/policy.

Invalidation means to remove a potentially inconsistent element from the cache, for example after the original data or part of the inputs to the function has changed. Both problems are important with huge effects, but they are different.


It's CS, they are all math problems


I like a good analogy challenge, let me try to shape something around your framework!

We sell particularly high-value, unique jewels, that are seated in jewellery. In fact, our jewels are so valuable to us, that our stores are able to modify the case around the jewel even before they sell to a customer. Each customer will be offered the last casing the jewel was in, and be given the opportunity to change it - though they might still choose not to buy.

We have a set of stores that we can sell from, and of course only one store can have the jewel in it at a time. We’re losing customers by only having the local stock available to sell, so we instigate a new policy; every store takes a picture of the casing for the jewel when they have it in store, they can show this last copy of the jewel in their catalogue. If a customer wants the jewel from the catalogue, we call around the other stores asking for the physical jewel to be brought to us.

This works well a lot of the time, but we find a new problem. Sometimes the customer really liked the casing the jewel was in for the photograph in the catalogue, but when the physical jewel arrives the casing has changed. The customer mutters something about false advertising and storms out disgusted by the stale photo shown. So we add a new policy. If a store changes the casing for a jewel it has to send a picture to every other store of the new casing so they can all update their catalogues.

Well! Now we have chaos! Every store owner is so busy updating catalogues and sending pictures around the other stores that they’re getting less real work done. We look at our catalogue, sometimes we’re updating the same page many times without a customer even seeing it!

So we come up with a new policy - instead of immediately changing the catalogue when we get told that another store has changed the physical jewel, we’re just going to mark the catalogue page as out of date. If eventually a customer asks to see that jewel, we’ll apologise to them for the wait, we’ll ask the other stores who has the jewel right now, then we’ll go get the physical jewel from them, take a picture for our catalogue, and show it to the customer. (As a minor optimisation, if we think the customer will want to change the casing we’ll politely inform the store we picked it up from that they can now mark their catalogue out of date).

It is all going well, but there’s a lot of overhead in our catalogue with only one jewel per page, so we start putting multiple jewels on each page. That makes our catalogue page update a bit difficult so we declare that a store must always collect all physical jewels for any given catalogue page.

One day we find that two stores are sending the same two jewels backwards and forwards all day. We investigate and find that the two jewels are on the same catalogue page, one jewel is being changed by a particularly indecisive customer in one store, and the other is being changed by an equally determined customer in the other store. Neither cares about the jewel the other is trying to change, but both customers are getting frustrated by the wait, as our policy requires both jewels in the same location before an update can be made. The store managers bemoan ever putting the two jewels on the same page to create this false sharing nightmare. They agree that even if wasteful, for these particular jewels they’ll leave blank space in the catalogue, the two jewels land on different pages, and the shuffling of jewels between stores stops.


Thanks for taking the time and explaining. In your opinion is this a problem unique to CS then?


Recommend reading about CPU cache coherence protocols such as MESI

https://en.m.wikipedia.org/wiki/MESI_protocol


Isn't anything functional (i.e. its computation only depends on its known inputs/input state) inherently trivially cacheable by hashing the input state?
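I'm thinking of plain memoization, roughly this sketch (illustrative names, single-threaded, unbounded):

    #include <cstdint>
    #include <unordered_map>

    // Stand-in for something expensive and deterministic.
    std::uint64_t slow_pure_fn(std::uint64_t x) { return x * x; }

    std::uint64_t memoized(std::uint64_t x) {
      static std::unordered_map<std::uint64_t, std::uint64_t> cache;
      auto it = cache.find(x);
      if (it != cache.end()) return it->second;   // hit: the result can never be stale
      const auto result = slow_pure_fn(x);
      cache.emplace(x, result);
      return result;
    }

Invalidation disappears because a pure result can never go stale; eviction to bound memory is a separate question.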


Yes, but most use-cases are not pure functions.


What is interesting to me is how cache controllers can signal all other caches when shared data is modified in a single clock cycle.


Netflix and others are discovering one of the reasons why I think one-socket servers are the future. Single sockets are big enough for almost every workload, and come with a ~10% perf uplift from not having inter-socket cache coherence nonsense.


Modern x86 CPUs typically have three layers of caching, where L1 and L2 are per-core, and only L3 is shared across all cores. You still have to solve all the cache coherency issues, and everything in the article still applies, even with just one socket.


While that is true, and AMD CPUs are even worse thanks to chiplets, the latency and bandwidth of the inter-chip cache coherent interconnect is a lot worse than the on-chip (or on-socket in AMD's case) buses. That is the expensive part of it.

Cache line sharing (false and true sharing) still hurts in 1S configurations, but it hurts a lot less.


This of course requires a citation of the old joke...

There are two hard problems in computer science: naming things, cache invalidation, and off by one errors.


There are three hard problems in CS: naming cache things, invalidation, off by one and concurrency. errors,


There are four hard problems in CS: naming cache things, invalidation, off by one and concurrency. errors, and memory alloca


There are 5 hard proble… to unlock send 2 BTC to 1lockerA7h……


Something something all problems fixed by blockchain


problems(problems)


There are two hard problems in CS: cache invalidation, naming things, and cache invalidation.


I thought off by one always had to go at the end.


A thought, reasonable but that how you is fail concurrency. at


I think it’s like those tired old UDP multicast jokes which some people don’t get.


Well they sure beat the TCP joke, don't they?

"Hi, I’d like to hear a TCP joke."

"Hello, would you like to hear a TCP joke?"

"Yes, I’d like to hear a TCP joke."

"OK, I’ll tell you a TCP joke."

"Ok, I will hear a TCP joke."

"Are you ready to hear a TCP joke?"

"Yes, I am ready to hear a TCP joke."

"Ok, I am about to send the TCP joke. It will last 10 seconds, it has two characters, it does not have a setting, it ends with a punchline."

"Ok, I am ready to get your TCP joke that will last 10 seconds, has two characters, does not have an explicit setting, and ends with a punchline."

"I’m sorry, your connection has timed out. Hello, would you like to hear a TCP joke?"

https://www.reddit.com/r/Jokes/comments/2ezdl1/i_told_my_fri...


TCP has only 2 round trips, so more like

> C: I would like to tell a joke

> S: You may tell a joke

> C: We are now exchanging the joke

> C: "Knock knock ..."


The nature of UDP


What joke?


Joke: "I would tell you a UDP joke but you might not get it."

Of course, I might be spoiling the point someone else is making by not posting the joke.


I've heard it:

"I've got a UDP joke for you. You might not get it... but I don't care.".


The joke is that UDP doesn't guarantee packet delivery (or order of arrival).

> What joke?


This is a UDP joke.

I don't care if you get it.


I still didn't get it.


You use UDP (as opposed to TCP) when you want to send data but don't need to wait for any response. A good example is streaming metrics, where you can afford for the metrics aggregator to miss some, or sending out player location data in a multiplayer game: if you miss the player location update in packet N, you'll just get the update in the N+1 packet or later and update their position at that time. This is totally different from TCP, where you want the other server to acknowledge it got the packet, or else you send it again.

So the joke is a play on words: "you may not get it" (the UDP packet, because that is how UDP works by design) and "you may not get it" (the joke, due to being unfamiliar with UDP).


I think that’s the joke - that concurrency errors is in the wrong place.


It’s that the “errors” from “off by one errors” comes after concurrency (the hard things being: cache invalidation, off by one errors, naming, and concurrency, not concurrency errors)


can’t tell if meta or whoosh


There are problems.


There is only one hard problem in software engineering: people.


It really doesn’t require it, not here anyway where comments are supposed to be substantive.

I hope it doesn’t seem like supporting the HN guidelines is against fun or anything. I think it’d be great to hear at some social get-together or even as part of a tech talk.

One problem with it here, as you can see, is the long cascading joke reply comments, which make it hard for interesting comments to be seen.


And yet I found the parent comment interesting and on topic and yours tangential and pedantic.


Actually, the reference to the joke about cache invalidation being hard is about something other than what the article is actually about: the article is about undesired dirty caches and concurrency, while the original joke is about designing algorithms to invalidate caches.


Redditors often bemoan how comment threads are ruined by low-effort commenters making low-effort jokes. It's an easy, knee-jerk complaint that ignores the bigger, tougher problem of bad moderation. Subreddits are plagued by bad moderators whereas HN at least has a few good moderators who can regulate the discourse well.

Some commenters can't tell the difference: they think so lowly of their fellow commenters that even one joke is cancer. Quite a silly condescension given that this one "joke" thread is buried towards the middle of the page and is also producing interesting links like https://en.wikipedia.org/wiki/Naming_and_Necessity


I find it interesting that, while accusing commenters about thinking lowly about others and being condescending, you’re actually doing that exact same thing.


My point was that small minds use pedantry to compensate for insecurity. I’m sorry that it hits too close to home.


I see that my citation of that joke was not to your taste. I appreciate the feedback.

HN provides a solution for the long cascading joke thread. To the right of every comment is a [-] link which collapses the entire thread.

Have you watched The Marvelous Mrs. Maisel? I recommend it. There are so many scenes where she or another comedian just bombs. That is part of the life of a comedian, and learning from when you bombed is how you get better at your craft.

Regarding the off by one joke, personally I find it insightful and relevant to the life of a software engineer. Doesn't it get you thinking about all the ways an off by one error can sneak into your code?

Consider your everyday ruler or tape measure. What is the first whole number you see on it? 1, of course. But the ruler doesn't actually start at 1. It starts at 0.

Or consider a graphic system where you describe the coordinates of a rectangle. You may describe these as (left,top)-(right,bottom), or in other words (x1,y1)-(x2,y2).

Now how do you describe a single pixel? Is it (0,0)-(0,0)? That kind of makes sense. Or should you use (0,0)-(1,1) for a single pixel, and (0,0)-(0,0) is an empty rectangle?

These are not easy questions to answer. One of the best explanations I've seen was in Inside Macintosh, written by my friend Caroline Rose.
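The convention I usually end up with is half-open ranges, something like this sketch:

    // Half-open convention: a rect covers x in [left, right) and y in [top, bottom).
    struct Rect { int left, top, right, bottom; };

    int width(const Rect& r)  { return r.right - r.left; }    // no +1 fudge
    int height(const Rect& r) { return r.bottom - r.top; }

    // A single pixel at (0,0) is then {0, 0, 1, 1}; {0, 0, 0, 0} is the empty rect.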

Here is a discussion with Guido's and Caroline's thoughts on zeroes and ones:

https://news.ycombinator.com/item?id=6602497

Let me leave you with an Amicus Curiae brief filed before the US Supreme Court:

https://www.supremecourt.gov/DocketPDF/22/22-293/242596/2022... [PDF]

The brief begins:

> The Onion is the world’s leading news publication, offering highly acclaimed, universally revered coverage of breaking national, international, and local news events. Rising from its humble beginnings as a print newspaper in 1756, The Onion now enjoys a daily readership of 4.3 trillion and has grown into the single most powerful and influential organization in human history.


It's funny that if you say "cache" invalidation is hard the CPU people will nod, the database people will nod, the CDN people will nod and they will all be thinking of different things.


I had always heard this saying as an undergrad in CS but never understood why names were supposed to be a hard problem until reading about modal logic about a decade later.

https://en.wikipedia.org/wiki/Naming_and_Necessity


The first two are the same problem so this is almost correct.


There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton


It's like they played a game of Jenga that got out of hand

-- Karl Pilkington


There are only two hard things in Computer Science:

1. Cache invalidation

2. Off by one errors

3. Naming Things

Famous CS Joke


I don’t know. It’s pretty easy if you just don’t use cache.


Just flush it down the toilet when it stinks.


and orders of magnitude slower!


The amount of inefficiency in code these days is a testament to the enormous computational power of modern computers. We're talking billions of multiplications or additions per second and millions of divisions per second.

A mad genius could code most business logic to never be greater than O(n) outside of the most complex of problems. At that point a cache would become a liability due to cache invalidation errors, one of the hardest problems. Not that I recommend this approach. Caches are good tools and can help with DoS.


I think you vastly underestimate just how much slower DRAM is compared to cache (SRAM). Your super fast CPU will be sitting idly doing nothing if it were not for the caches between it and DRAM, which keep it fed with data and instructions.


we take a clean-room approach to computation.


As much as folks are downvoting you, there's a nugget of beauty in your demand: our systems should abstract caching. Ideally your web server framework should cache results itself. Your DB too. If Postgres were good at it, you wouldn't need Redis as well, right?


CPUs already do, except for when the abstraction breaks down for performance reasons.


Sounds like you might like a server-based framework like WordPress


Re: cache coherency and mixing in tiering, blah blah - This looks more like a CAP problem than a cache invalidation problem.


How is partition tolerance relevant to CPU cache hierarchies? How is cache coherency not a question of when and how to perform cache invalidation?


> How is partition tolerance relevant to CPU cache hierarchies?

It's not. Hope that answers the question. A lot of this article is about additional concerns that are being leveraged, while avoiding the core issue.

Cache coherency and consistency are equivalent, in this case.

> How is cache coherency not a question of when and how to perform cache invalidation?

It's a consistency concern, for which you have to make a tradeoff. Cache invalidation is incidental, not specifically relevant to the problem at hand. It just happens to take place in memory that has the name cache.


> A lot of this article is about additional concerns that are being leveraged, while avoiding the core issue.

What’s the core issue that is being avoided here?


They are trying to optimize for what hardware they are stuck with, then characterizing it as if it's a cache invalidation problem. At the core, it is not a problem with when to do cache invalidation. Their "problem" (which isn't a problem, they just want to see a different use profile) is a consequence of what they have decided to stick with due to concerns that are not reasoned and wouldn't be very interesting anyway. I don't understand why anyone would want to belabor this further.


Seems like you are implying that there is some hardware they could move to where they would not have the problem, were they not stuck with their current one.

This makes me very curious. What hardware could they move to that does not have the same problem?

Or “consequence”, as you call it. Though in my mind performance problems are still problems. Were they characterizing it as a cache invalidation problem? The first paragraph says “a performance problem involving caching”. I read the title as being because their false sharing problem arises from the way CPUs invalidate cache lines, to ensure coherency, which is notoriously complex.


This is an aphorism I got tired of a long time ago.

There was a batch job that I sped up about 40 times with a cache. The invalidation strategy I used was "dump the whole cache and start a new one" which wasn't too bad because the system got about 10% through the job when it ran out of RAM and it had a pattern of a long-tail distribution plus locality which fit that model well.

If I'd been memory-constrained, or had a distributed system, it would have been a different story, but it's not unusual for caching to be a win with a simple invalidation strategy.
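That strategy is about as simple as it gets; a sketch of the shape (made-up names, size-triggered rather than RAM-triggered):

    #include <cstddef>
    #include <unordered_map>

    // "Dump the whole cache and start a new one": no per-entry invalidation logic,
    // just throw everything away once a size limit is hit.
    template <class K, class V>
    class FlushingCache {
      std::unordered_map<K, V> map_;
      std::size_t max_entries_;
    public:
      explicit FlushingCache(std::size_t max_entries) : max_entries_(max_entries) {}

      const V* find(const K& key) const {
        auto it = map_.find(key);
        return it == map_.end() ? nullptr : &it->second;
      }

      void put(const K& key, V value) {
        if (map_.size() >= max_entries_) map_.clear();   // the entire invalidation strategy
        map_.emplace(key, std::move(value));
      }
    };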


To be fair, rocket science is also hard, but not very much when all you plan to do is hit a barn door with fireworks. Same with prime factorization when the number to factor is “15”, or spontaneously sketching out the pole-zero diagram of f(x)=0… you get the gist.

Usually when people talk about a topic being hard, they don’t mean the very constrained cases that are easy.


It's the other way around.

90% of the cases are easy cases, where thinking too hard about invalidation (as opposed to testing what works) is likely to give you worse performance.

There is a rare set of cases that are devilishly hard, such as synchronizing the caches across an SMP microprocessor or making a system like Cloudflare really speed up your web site as opposed to slowing it down.


When it comes to cache invalidation worse performance isn’t the primary concern in most cases, correctness is.


> 90% of the cases are easy cases

I'll second this. For example, I can usually get away with simple and short lived TTL caches in varnish (with grace) for high traffic read only APIs. Flushing (invalidation) is automated since it just expires. It doesn't always fit, but often it does.


It’s amusing because rocket science is pretty easy. Rocket mechanics is the tricky part.


Rocket scientists can be heard saying "It's not like it's cache invalidation!".


I was initially disappointed by not seeing a discussion of jargon in the field. Specifically, I think you have to discuss eviction and expiration as two distinct things with many caches.

That said, I enjoyed the article and highly recommend it. With the obvious caveat of "this is the kind of problem that you are likely never to be impacted by," I think it is a fascinating incident, and I'm very grateful the analysis was shared. Having another pass at explaining the analysis was a pleasant surprise and a fun read.


[flagged]


Huh? Didn't realize I said that.



