The Slotted Counter Pattern (2020) (planetscale.com)
49 points by eatonphil 15 days ago | 21 comments



This is similar to how java.util.concurrent.atomic.LongAdder works:

https://github.com/openjdk/jdk/blob/master/src/java.base/sha...


Huh, that's ... still quite inefficient. Since multiple variables in a cache line are fine as long as they're all accessed from the same CPU, you ideally want a separate allocator for that; then you can avoid all the spooky dynamic growth. And most CPUs offer a direct "atomic add" instruction, which is much faster than a CAS loop. For pure stat counters you generally want `relaxed` memory ordering on that; for other cases acquire and/or release may be desired (this is tricky to get performance-optimal, given that some platforms upgrade to a stronger memory ordering for free, so it's best to require it in your algorithm, whereas others are best with an explicit fence in the edge case). I've never found a real-world use for `seq_cst`.

It's unfortunate that per-CPU variables are difficult in userland, but there are at least 2 ways to fully emulate them - rseq and pinning - and you can also just revert to full-blown thread-locals (which have much better tooling support) if you aren't constructing a gratuitously large number of threads, or shared thread-locals if you do have a lot of threads. If you make the wrong choice here, correctness never suffers, only performance.
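If you do fall back to thread-locals, the shape is easy to sketch. Here's a rough Python version (all names invented; the GIL stands in for real per-thread ownership, but the structure carries over):

```python
import threading

class ThreadLocalCounter:
    """Each thread increments its own slot; reads sum every slot.
    Writers never contend with each other, so a wrong sizing choice
    only costs read performance, never correctness."""
    def __init__(self):
        self._lock = threading.Lock()   # guards the slot registry only
        self._local = threading.local()
        self._slots = []

    def increment(self, n=1):
        slot = getattr(self._local, "slot", None)
        if slot is None:                # first increment on this thread
            slot = [0]
            self._local.slot = slot
            with self._lock:
                self._slots.append(slot)
        slot[0] += n                    # only this thread touches this slot

    def value(self):
        with self._lock:
            return sum(s[0] for s in self._slots)

counter = ThreadLocalCounter()
threads = [threading.Thread(target=lambda: [counter.increment() for _ in range(1000)])
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value())  # 8000
```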


Uncontended CAS operations without carried dependencies on the result (almost always the case in this use case) are similar in performance to an atomic add on most platforms.

The CAS is the price they pay for contention detection, though it would be interesting to consider solutions which usually use unconditional atomics with only the occasional CAS in order to check contention, or which relied on some other contention detection approach (e.g., doing a second read to detect when the value incremented by more than your own increment).

The solution looks reasonable to me given the constraints.
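For reference, the pattern under discussion can be sketched with SQLite's upsert syntax (the article targets MySQL; table and column names here are invented for illustration):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE slotted_counter (
                    name TEXT NOT NULL,
                    slot INTEGER NOT NULL,
                    count INTEGER NOT NULL DEFAULT 0,
                    PRIMARY KEY (name, slot))""")

def increment(name, n=1, slots=100):
    # Picking a random slot spreads row-lock contention across `slots` rows
    # instead of serializing every writer on one hot row.
    conn.execute("""INSERT INTO slotted_counter (name, slot, count)
                    VALUES (?, ?, ?)
                    ON CONFLICT (name, slot)
                    DO UPDATE SET count = count + excluded.count""",
                 (name, random.randrange(slots), n))

def value(name):
    # Reads pay the cost: sum all slots for the final tally.
    return conn.execute("""SELECT COALESCE(SUM(count), 0)
                           FROM slotted_counter WHERE name = ?""",
                        (name,)).fetchone()[0]

for _ in range(1000):
    increment("page_views")
print(value("page_views"))  # 1000
```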


Part of my point was that "check for contention" is often a silly thing to do in the first place, when you can just avoid it entirely (and with simpler code too).

Admittedly, Java is fundamentally incapable of half of the solutions, but making a simple bump allocator (called once per statistic at startup) over per-thread arrays is still possible.


Yeah exactly, and this is a commonly used trick in concurrent data structures in general. The Java implementation has the additional twist that it doesn't use a fixed number of "slots" but rather starts at 1 and uses CAS failures as a mechanism to detect contention, growing the number of slots until there are no longer CAS failures.
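A loose Python analogue of that growth trick, with a failed non-blocking lock acquire standing in for a CAS failure (all names invented; this is a sketch of the idea, not LongAdder's actual algorithm):

```python
import random
import threading

class StripedCounter:
    """Start with one slot; when a writer finds its slot contended
    (lock already held), grow the slot list -- the same adaptive idea,
    minus the real CAS."""
    def __init__(self, max_slots=64):
        self.max_slots = max_slots
        self.slots = [[threading.Lock(), 0]]
        self._grow_lock = threading.Lock()

    def increment(self, n=1):
        while True:
            slot = random.choice(self.slots)
            if slot[0].acquire(blocking=False):  # uncontended: just add
                try:
                    slot[1] += n
                finally:
                    slot[0].release()
                return
            self._grow()                         # contended: add slots, retry

    def _grow(self):
        with self._grow_lock:
            if len(self.slots) < self.max_slots:
                # Copy-on-grow keeps old slots (and their counts) alive.
                self.slots = self.slots + [[threading.Lock(), 0]
                                           for _ in self.slots]

    def value(self):
        return sum(s[1] for s in self.slots)
```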


Are there any other approaches to this problem? Seems like there should be.


I think probabilistic counters are really interesting for this situation, as long as you don't need exact numbers.

  import random

  # Roughly 1 in 100 events actually touches the database
  # (db and row are whatever your counter API is):
  if random.randint(1, 100) == 1:
      db.increment(row)

Then, looking at the database, you know that the actual count is approximately 100x the number of times the row was incremented.

There's an even more aggressive version that I've seen called a 'Morris Counter' which uses just a few bits to build an order-of-magnitude estimate of the number of events: https://en.wikipedia.org/wiki/Approximate_counting_algorithm
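A minimal sketch of the idea (not tuned; real implementations use a base closer to 1 for tighter error bounds):

```python
import random

class MorrisCounter:
    """Store only an exponent c (~log2(n) bits) and increment it with
    probability 2**-c; the estimate 2**c - 1 is unbiased for the true
    event count."""
    def __init__(self):
        self.c = 0

    def increment(self):
        if random.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1
```

After n events, c hovers around log2(n), so even billions of events fit in a handful of bits, at the cost of large variance in any single counter.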


Yes: use a different tool, such as Redis.

Updating a SQL database row can be expensive because of the transaction overhead. Once you increment this counter, the row with the counter remains locked until the transaction is committed. But the actual useful work of incrementing a number is quicker than even selecting a random slot. If the increment can be a simple ADD instruction, without transactional semantics, then this random-slot-based approach is slower than just incrementing. Especially if you don't care about missing a few increments and therefore don't need a mutex.

For even more scale, don't use Redis but have a shared memory block containing your integer.

For even more scale, you can have one such slot per CPU. Make sure they're in different cache lines, to avoid false sharing. Now you have slots again, but they're per-CPU and so you still avoid all the SQL database overhead.

Performance is all about structuring the task as close to the machine as possible.


Some people, when they have the problem of incrementing integers think, "I'll use Redis."

Now they have two problems.


Can you go into more detail? Redis supports counters: https://redis.io/glossary/storing-counters-in-redis/



so... what's the problem with using Redis to store a shared counter?

It's a joke about answering complexity with additional complexity. Laugh, it's ok. :)

Redis is a data structure server, for a limited set of data structures. If that's what you want, then use Redis.

You could add a row for each count. You could add more associated information with the count too, like date, why, etc. Then that insert would trigger a debounced method that queries the event stack to give you a count. You can make periodic snapshots of the total to avoid O(n) table scans. This allows scale without row locking and a much more accurate accounting of when the counts happen. You can rewind/fast-forward, get a time series, whatever, with the event stack. This is all part of the CQRS pattern.
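Sketched with SQLite (table and column names invented; the point is that the counter becomes a query over an event log):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE count_events (
                    id INTEGER PRIMARY KEY,
                    name TEXT NOT NULL,
                    reason TEXT,
                    at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def record(name, reason=None):
    # Every count is an append -- no locking on a shared counter row.
    conn.execute("INSERT INTO count_events (name, reason) VALUES (?, ?)",
                 (name, reason))

def count(name, up_to_id=None):
    # "Rewind" by bounding the event id; a time series is a GROUP BY on at.
    query = "SELECT COUNT(*) FROM count_events WHERE name = ?"
    args = [name]
    if up_to_id is not None:
        query += " AND id <= ?"
        args.append(up_to_id)
    return conn.execute(query, args).fetchone()[0]

for _ in range(5):
    record("signups", reason="organic")
print(count("signups"))              # 5
print(count("signups", up_to_id=3))  # 3
```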

https://martinfowler.com/bliki/CQRS.html


Outside of the DB: some kind of queue.

Could be SQS or an in-memory queue in your app (hand-waves about multiple nodes, but that can be dealt with, or use Redis). Have a single processor that aggregates and does writes. You can use a DLQ if something goes wrong, to avoid data loss.

Inside the DB, maybe just append rows for each add to a table without indexes. That way the engine probably just needs to write out a journal record and append, but doesn't spend time updating indexes. Then periodically read everything from the table and aggregate it into another. You could cycle between 2 tables.

This is eventually consistent in aggregation but transactional in the source of truth. I.e., if you write, you can be sure the data is stored, but it may take time to aggregate.

Or use a more exotic DB that is distributed but still supports transactions.
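The in-database variant, sketched with SQLite (names invented; a real setup would cycle two staging tables so the aggregation step never blocks writers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE increments (name TEXT, delta INTEGER);  -- no indexes: cheap appends
    CREATE TABLE totals (name TEXT PRIMARY KEY, total INTEGER);
""")

def increment(name, delta=1):
    conn.execute("INSERT INTO increments VALUES (?, ?)", (name, delta))

def aggregate():
    # Periodic job: fold pending rows into totals, then clear them.
    # (The WHERE true is required by SQLite's parser before ON CONFLICT
    # in an INSERT ... SELECT.)
    conn.execute("""INSERT INTO totals
                    SELECT name, SUM(delta) FROM increments WHERE true
                    GROUP BY name
                    ON CONFLICT (name) DO UPDATE
                        SET total = total + excluded.total""")
    conn.execute("DELETE FROM increments")

def value(name):
    row = conn.execute("SELECT total FROM totals WHERE name = ?",
                       (name,)).fetchone()
    return row[0] if row else 0

for _ in range(3):
    increment("clicks")
aggregate()
increment("clicks")
aggregate()
print(value("clicks"))  # 4
```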


Usually for this kind of use case you want both INCREMENT and DECREMENT. Using a random number for the "Slot" (AKA "Shard") seems like an anti-pattern. If the Slot were derived from a hash of the data itself, then DECREMENT would be much more straightforward.


Hm, why wouldn't decrements work with random slots, if you're always just summing up all the slots to get the final tally?
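A quick sketch suggests they do work: summation is order- and sign-independent, so interleaved increments and decrements net out exactly (plain in-memory lists standing in for the DB slot rows):

```python
import random

slots = [0] * 10  # stand-in for the N counter rows

def add(delta):
    # Decrements land in a random slot just like increments;
    # SUM over all slots still nets out exactly.
    slots[random.randrange(len(slots))] += delta

for _ in range(500):
    add(+1)
for _ in range(200):
    add(-1)
print(sum(slots))  # 300
```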


It's an interesting approach. Has anyone got experience in similar cases using a database more optimized for writes, e.g. Cassandra?


phillip it's called lock striping


Really just a special case of sharding.



