Huh, that's ... still quite inefficient. Multiple variables sharing a cache line are fine as long as they're all accessed from the same CPU, so you ideally want a separate allocator for that; then you can avoid all the spooky dynamic growth. And most CPUs offer a direct "atomic add" instruction, which is much faster than a CAS loop. For pure stat counters you generally want `relaxed` memory ordering on that; for other cases acquire and/or release may be desired (this is tricky to get performance-optimal: some platforms upgrade to a stronger memory ordering for free, so it's best to require it in your algorithm, whereas others are best served by an explicit fence in the edge case). I've never found a real-world use for `seq_cst`.
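A minimal Java sketch of the atomic-add-vs-CAS-loop difference (class and method names here are my own, not from any real codebase). Java doesn't expose a relaxed fetch-add, so both paths carry full volatile semantics; the point is the structure: the add is one unconditional hardware operation (HotSpot typically emits `lock xadd` on x86), while the CAS version has to re-read and retry under contention.

```java
import java.util.concurrent.atomic.AtomicLong;

public class AddVsCas {
    static final AtomicLong counter = new AtomicLong();

    // One unconditional hardware operation; never retries.
    static void incrementWithAdd() {
        counter.getAndAdd(1);
    }

    // Read-modify-retry loop; under contention each failed CAS wastes a round trip.
    static void incrementWithCas() {
        long cur;
        do {
            cur = counter.get();
        } while (!counter.compareAndSet(cur, cur + 1));
    }

    public static void main(String[] args) {
        incrementWithAdd();
        incrementWithCas();
        System.out.println(counter.get()); // 2
    }
}
```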
It's unfortunate that per-CPU variables are difficult in userland, but there are at least 2 ways to fully emulate them - rseq and pinning - and you can also just fall back to full-blown thread-locals (which have much better tooling support) if you aren't creating a gratuitously large number of threads, or shared thread-locals if you do have a lot of threads. If you make the wrong choice here, correctness never suffers, only performance.
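The thread-local fallback might look like this in Java (a sketch; `ThreadLocalCounter` is a made-up name). Each thread gets its own `AtomicLong`, so every increment is uncontended, and a reader sums over a registry of all the slots:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class ThreadLocalCounter {
    // Registry so a reader can find every thread's slot.
    private final Map<Thread, AtomicLong> slots = new ConcurrentHashMap<>();
    private final ThreadLocal<AtomicLong> local = ThreadLocal.withInitial(
            () -> slots.computeIfAbsent(Thread.currentThread(), t -> new AtomicLong()));

    public void increment() {
        local.get().incrementAndGet(); // always uncontended: only this thread writes this slot
    }

    public long sum() {
        long total = 0;
        for (AtomicLong s : slots.values()) total += s.get(); // not a snapshot; fine for stats
        return total;
    }
}
```

A `sum()` taken under concurrent updates is approximate, which is usually acceptable for stat counters.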
Uncontended CAS without carried dependencies on the result (almost always the case here) is similar in performance to atomic add on most platforms.
The CAS is the price they pay for contention detection, though it would be interesting to consider solutions that usually use unconditional atomics with only the occasional CAS to check for contention, or that rely on some other contention-detection approach (e.g., doing a second read to detect that the value incremented by more than your own increment).
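The second-read idea could be sketched like this (hypothetical names; note the probe can miss contention that happens outside the add/re-read window, so it's a heuristic, not a guarantee):

```java
import java.util.concurrent.atomic.AtomicLong;

public class ProbedCounter {
    private final AtomicLong counter = new AtomicLong();

    // Unconditional atomic add, then a second read as a cheap contention probe.
    // Returns true if some other thread's increment landed between the two.
    public boolean addAndProbe(long delta) {
        long before = counter.getAndAdd(delta); // value before our add
        long after  = counter.get();            // second read
        return after != before + delta;         // extra growth means somebody else was here
    }

    public long value() {
        return counter.get();
    }
}
```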
The solution looks reasonable to me given the constraints.
Part of my point was that "check for contention" is often a silly thing to do in the first place, when you can just avoid it entirely (and with simpler code too).
Admittedly, Java is fundamentally incapable of half of the solutions, but making a simple bump allocator (called once per statistic at startup) over per-thread arrays is still possible.
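That bump allocator can be as little as an integer handed out at registration time. A sketch with made-up names (`MAX_STATS` is an assumed fixed capacity):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class StatRegistry {
    private static final int MAX_STATS = 1024;                       // assumed upper bound
    private static final AtomicInteger nextId = new AtomicInteger(); // the "bump allocator"
    private static final ConcurrentHashMap<Thread, long[]> perThread = new ConcurrentHashMap<>();
    private static final ThreadLocal<long[]> local = ThreadLocal.withInitial(
            () -> perThread.computeIfAbsent(Thread.currentThread(), t -> new long[MAX_STATS]));

    // Called once per statistic at startup; returns a stable index into every thread's array.
    public static int register() {
        return nextId.getAndIncrement();
    }

    public static void add(int statId, long delta) {
        local.get()[statId] += delta; // plain uncontended write; readers may see it slightly late
    }

    public static long sum(int statId) {
        long total = 0;
        for (long[] a : perThread.values()) total += a[statId];
        return total;
    }
}
```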
Yeah exactly, and this is a commonly used trick in concurrent data structures in general. The Java implementation has the additional twist that it doesn't use a fixed number of "slots" but rather starts at 1 and uses CAS failures as a mechanism to detect contention, then grows the number of slots until there are no more CAS failures.
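That description matches `java.util.concurrent.atomic.LongAdder` (built on `Striped64`): updates first try a CAS on a single base field, and only on CAS failure does it inflate to a cell array and grow it. Usage is trivial:

```java
public class LongAdderDemo {
    public static void main(String[] args) {
        java.util.concurrent.atomic.LongAdder hits = new java.util.concurrent.atomic.LongAdder();
        hits.increment(); // fast path: CAS on a single base field
        hits.add(5);      // contended CASes inflate to striped cells, grown on further failures
        // sum() folds base plus all cells; it is not an atomic snapshot under concurrent updates
        System.out.println(hits.sum()); // 6
    }
}
```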
Updating a SQL database row can be expensive because of the transaction overhead: once you increment the counter, the row holding it remains locked until the transaction is committed. But the actual useful work, incrementing a number, is already quicker than selecting a random slot. If the increment can be a simple ADD instruction, without transactional semantics, then this random-slot approach is slower than just incrementing directly. Especially if you don't care about missing a few increments and therefore don't need a mutex.
For even more scale, don't use Redis but have a shared memory block containing your integer.
For even more scale, you can have one such slot per CPU. Make sure they're in different cache lines, to avoid false sharing. Now you have slots again, but they're per-CPU and so you still avoid all the SQL database overhead.
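In Java you can't ask which CPU you're on, so a common stand-in is to stripe by thread id and pad the slots a full cache line apart. A sketch (names mine; 64-byte lines assumed, and the 8-long stride is a crude way to keep active slots on separate lines):

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class StripedCounter {
    private static final int STRIDE = 8; // 8 longs = 64 bytes: one slot per cache line
    private final int slots;
    private final AtomicLongArray cells;

    public StripedCounter(int slots) {
        this.slots = slots;
        this.cells = new AtomicLongArray(slots * STRIDE);
    }

    public void increment() {
        // Thread id as a cheap proxy for a per-CPU index (the JVM can't query the CPU).
        int slot = (int) (Thread.currentThread().getId() % slots);
        cells.getAndAdd(slot * STRIDE, 1);
    }

    public long sum() {
        long total = 0;
        for (int i = 0; i < slots; i++) total += cells.get(i * STRIDE);
        return total;
    }
}
```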
Performance is all about structuring the task as close to the machine as possible.
You could add a row for each count. You could add more associated information with the count too, like date, why, etc. Then that insert would trigger a debounced method that queries the event stack to give you a count. You can make periodic snapshots of the total to avoid O(n) table scans. This allows scale without row locking and much more accurate accounting of when the counts happen. You can rewind/fast-forward, get a time series, whatever, with the event stack. This is all part of the CQRS pattern.
Could be SQS or an in-memory queue in your app (hand-waves about multiple nodes, but that can be dealt with, or use Redis). Have a single processor that aggregates and does the writes. You can use a DLQ to avoid data loss if something goes wrong.
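A single-consumer aggregator might look like this in Java (a sketch; the `AtomicLong` stands in for the real DB write, and all names are made up):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class CounterAggregator {
    private final BlockingQueue<Long> queue = new LinkedBlockingQueue<>();
    private final AtomicLong total = new AtomicLong(); // stands in for the DB row

    public CounterAggregator() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    long delta = queue.take(); // single consumer: the write side never contends
                    total.addAndGet(delta);    // in real life: batch these and UPDATE the DB
                }
            } catch (InterruptedException e) {
                // shutdown
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    public void record(long delta) {
        queue.offer(delta); // producers just enqueue; with SQS/Redis this would be a network call
    }

    public long total() {
        return total.get();
    }
}
```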
Inside the DB, maybe just append a row for each add to a table without indexes. That way the engine probably just needs to write a journal record and append, without spending time updating indexes. Then periodically read everything from the table and aggregate it into another. You could cycle between 2 tables.
This is eventually consistent in the aggregations but transactional in the source of truth. I.e., if you write, you can be sure the data is stored, but it may take time to show up in the aggregate.
Or use a more exotic DB that is distributed but still supports transactions.
Usually for this kind of use case you want both INCREMENT and DECREMENT. Using a random number for the "Slot" (AKA "Shard") seems like an anti-pattern. If the Slot were derived from a hash of the data itself, then DECREMENT would be much more straightforward.
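A sketch of the hash-derived slot (names mine; `Math.floorMod` keeps negative hash codes in range). Because the same key always lands in the same shard, a later DECREMENT exactly undoes the earlier INCREMENT:

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class ShardedCounter {
    private final AtomicLongArray shards;

    public ShardedCounter(int n) {
        shards = new AtomicLongArray(n);
    }

    // Deterministic: the same key always maps to the same shard.
    private int slot(Object key) {
        return Math.floorMod(key.hashCode(), shards.length());
    }

    public void increment(Object key) {
        shards.getAndIncrement(slot(key));
    }

    public void decrement(Object key) {
        shards.getAndDecrement(slot(key)); // hits the same shard the increment used
    }

    public long sum() {
        long total = 0;
        for (int i = 0; i < shards.length(); i++) total += shards.get(i);
        return total;
    }
}
```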
https://github.com/openjdk/jdk/blob/master/src/java.base/sha...