I've seen the talk! The issue with using a global lock on a global work queue is that, unless the work items have drastically different compute times, there _will_ be high contention on the lock.
I ran a benchmark [1], which shows that this is correct:
Results on quad-core Intel Linux box:

$ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
Benchmark 1: target/release/testit
  Time (mean ± σ):      2.526 s ±  0.139 s    [User: 4.709 s, System: 11.425 s]
  Range (min … max):    2.391 s …  2.730 s    10 runs

Benchmark 2: env USE_RAYON=1 target/release/testit
  Time (mean ± σ):     174.1 ms ±   0.9 ms    [User: 212.1 ms, System: 121.1 ms]
  Range (min … max):   173.1 ms … 175.4 ms    16 runs

Summary
  env USE_RAYON=1 target/release/testit ran
    14.51 ± 0.80 times faster than target/release/testit

Results on M1 Pro:

$ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
Benchmark 1: target/release/testit
  Time (mean ± σ):     692.2 ms ±   8.3 ms    [User: 491.4 ms, System: 5693.6 ms]
  Range (min … max):   683.2 ms … 704.5 ms    10 runs

Benchmark 2: env USE_RAYON=1 target/release/testit
  Time (mean ± σ):      63.0 ms ±   2.1 ms    [User: 97.7 ms, System: 47.0 ms]
  Range (min … max):    61.0 ms …  71.2 ms    44 runs

Summary
  env USE_RAYON=1 target/release/testit ran
    10.99 ± 0.39 times faster than target/release/testit
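
For context, the two code paths have roughly this shape. This is a simplified sketch of the idea rather than the exact code at [1]: `work` is just a stand-in for the real per-item job, and `rayon` is the only external dependency.

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

use rayon::prelude::*;

// Stand-in for the real per-item job.
fn work(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        *b = b.wrapping_mul(31).wrapping_add(7);
    }
}

// One global Mutex-protected queue: every dequeue takes the same lock, so
// with cheap, uniform work items the threads mostly queue up on it.
fn with_global_lock(items: Vec<Vec<u8>>) {
    let queue = Arc::new(Mutex::new(items.into_iter().collect::<VecDeque<_>>()));
    let nthreads = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let handles: Vec<_> = (0..nthreads)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                // The lock is held only for the pop, but short work items
                // bring every thread straight back here.
                let item = queue.lock().unwrap().pop_front();
                match item {
                    Some(mut buf) => work(&mut buf),
                    None => break,
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}

// Rayon's work-stealing pool: each thread works on its own part of the input
// and only synchronizes when it runs out of local work.
fn with_rayon(mut items: Vec<Vec<u8>>) {
    items.par_iter_mut().for_each(|buf| work(buf));
}

fn main() {
    // Mirrors the USE_RAYON=1 switch from the hyperfine command line.
    let items: Vec<Vec<u8>> = (0..200_000).map(|_| vec![1u8; 4096]).collect();
    if std::env::var_os("USE_RAYON").is_some() {
        with_rayon(items);
    } else {
        with_global_lock(items);
    }
}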
Ah, yes. It's good that you ran a benchmark. However, if I read the code correctly, your version with a lock does an extra memcpy per item. When the work function is as short as AES encryption of a block of memory, that copy is a significant cost on its own, and the queue is going to be heavily contended on top of it.
Fedor Pikus has a good talk on this from CppCon 2019.
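
To make the copy concrete, here is a rough sketch of the difference I mean, not the code from [1], just the general shape: in variant A every push and pop moves the block itself through the queue, while in variant B only a slice header crosses the lock and the data is worked on in place.

use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

const BLOCK: usize = 4096;

// Stand-in for a short per-block job, e.g. encrypting the block in place.
fn work(block: &mut [u8]) {
    for b in block.iter_mut() {
        *b ^= 0xA5;
    }
}

// Variant A: the queue owns the blocks by value, so every push and pop moves
// the whole block through the queue -- that is the extra copy.
fn queue_of_copies(blocks: Vec<[u8; BLOCK]>) {
    let queue = Mutex::new(blocks.into_iter().collect::<VecDeque<_>>());
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| loop {
                let item = queue.lock().unwrap().pop_front(); // copies BLOCK bytes out
                match item {
                    Some(mut block) => work(&mut block),
                    None => break,
                }
            });
        }
    });
}

// Variant B: the data stays in one buffer and the queue only hands out
// &mut slices into it, so just a pointer and a length cross the lock.
fn queue_of_slices(data: &mut [u8]) {
    let queue = Mutex::new(data.chunks_mut(BLOCK).collect::<VecDeque<_>>());
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| loop {
                let item = queue.lock().unwrap().pop_front();
                match item {
                    Some(chunk) => work(chunk),
                    None => break,
                }
            });
        }
    });
}

Either way the single lock is still a point of contention; the slice variant only removes the per-item copy.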