Idk, it's a general rule of thumb that the more mutable shared state an algorithm has, the worse it scales. So if you're trying to scale something to be concurrent, mutable shared state is an antipattern.
At scale, algorithms are commonly limited by memory bandwidth, not concurrency. Most code can be engineered with enough cheap concurrency to efficiently saturate memory bandwidth.
This explains why massively parallel HPC codes are mostly minimal mutable state designs despite seemingly poor theoretical properties for parallelism. Real-world performance and scalability are dictated by minimizing memory copies and maximizing cache disjointness.
As someone else noted, it is lock contention that doesn't scale, not mutable shared state. Lock-free data structures, patterns like RCU ... in many cases these will scale entirely appropriately to the case at hand. A lot of situations that require high-scale mutable shared state have an inherent asymmetry in the data usage (e.g. one consumer, many writers; or many consumers, one writer) that nearly always allows a better pattern than "wrap it in a mutex".
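To make the "many consumers, one writer" case concrete, here is a minimal copy-on-write sketch in the spirit of RCU, not RCU itself. The class name `SnapshotConfig` and its methods are made up for illustration. It assumes CPython, where rebinding an attribute is a single atomic reference swap and the garbage collector reclaims old snapshots once no reader holds them (real RCU has to track grace periods for that).

```python
import threading
from types import MappingProxyType


class SnapshotConfig:
    """Hypothetical single-writer, many-reader holder (copy-on-write)."""

    def __init__(self, initial):
        # Readers only ever see a read-only view of a dict that is never
        # modified after it has been published.
        self._snapshot = MappingProxyType(dict(initial))
        # This lock serializes writers only; readers never touch it.
        self._write_lock = threading.Lock()

    def read(self):
        # No lock: return whichever snapshot is current right now.
        return self._snapshot

    def update(self, **changes):
        with self._write_lock:
            new = dict(self._snapshot)   # copy the current state
            new.update(changes)          # modify the private copy
            # Publish by swapping one reference (assumed atomic in CPython).
            self._snapshot = MappingProxyType(new)
```

A reader that called `read()` before an update keeps a consistent old view; it never sees a half-applied change. The cost moves to the writer, which copies on every update, so this only pays off when reads heavily outnumber writes.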
Mutable shared state is literally the nature of contention. It's true that locking is the mediocre default, but "avoid locks" is not a silver bullet. Alternatives have their own tradeoffs. If you "carefully design" a solution, it's probably because you're not just using an alternative but actually taking care to optimize, and because you have a specific use case (which you described).
It's lock contention that slows things down more than anything.
But it's really an 'it depends' situation.
The fastest algorithms will smartly divide up the shared data being operated on in a way that avoids contention. For example, if working on a matrix, then dividing that matrix into tiles that are concurrently processed.
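A rough sketch of the tiling idea, assuming a plain list-of-lists matrix and an in-place scaling operation (both picked just for illustration, as are the function names). The matrix is still mutable shared state, but every cell belongs to exactly one tile, so no two workers ever write the same element and no locks are needed. Under CPython's GIL this shows the partitioning rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_ranges(n_rows, n_cols, tile):
    """Yield (r0, r1, c0, c1) bounds that cover the matrix without overlap."""
    for r0 in range(0, n_rows, tile):
        for c0 in range(0, n_cols, tile):
            yield r0, min(r0 + tile, n_rows), c0, min(c0 + tile, n_cols)


def scale_tile(matrix, bounds, factor):
    """Multiply every element inside one tile by factor, in place."""
    r0, r1, c0, c1 = bounds
    for r in range(r0, r1):
        row = matrix[r]
        for c in range(c0, c1):
            row[c] *= factor


def scale_in_place(matrix, factor, tile=2, workers=4):
    """Scale the whole matrix by handing disjoint tiles to a thread pool."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(scale_tile, matrix, bounds, factor)
            for bounds in tile_ranges(n_rows, n_cols, tile)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return matrix
```

In a compiled language you would also want the tile edges to line up with cache lines, so neighbouring tiles don't end up sharing a line.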
> It's lock contention that slows things down more than anything.
It's all flavors of the same thing. Lock contention is slow because sharing mutable state between cores is slow. It's all ~MOESI: every write to a shared cache line forces coherence traffic between cores.
> The fastest algorithms will smartly divide up the shared data being operated on in a way that avoids contention. For example, if working on a matrix, then dividing that matrix into tiles that are concurrently processed.
Yes. Aka shared nothing, or read-only shared state.
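For contrast, a shared-nothing sketch of a word count (the function names are made up for illustration). Each worker reads its own chunk and fills a private `Counter`; nothing mutable is shared while the workers run, and the partial results are merged by a single thread at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(words):
    """Count words into a Counter that only this worker ever touches."""
    local = Counter()
    for w in words:
        local[w] += 1
    return local


def parallel_count(words, workers=4):
    """Split the input, count each chunk privately, then merge once."""
    if not words:
        return Counter()
    size = -(-len(words) // workers)  # ceiling division
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # the merge is the only place results meet
    return total
```

The tradeoff is memory: every worker holds its own table until the merge.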
It's true that it's easier to write correct async code using immutable shared data or unshared data.
However, it's very hard, if not impossible, to write fast, low-memory concurrent algorithms without mutable shared state.