I've seen the talk! The issue with using a global lock on a global work queue is that, unless the work items have drastically different compute times, there _will_ be high contention on the lock.
I ran a benchmark [1], which shows that this is correct:
Results on quad-core Intel Linux box:

$ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
Benchmark 1: target/release/testit
  Time (mean ± σ):      2.526 s ±  0.139 s    [User: 4.709 s, System: 11.425 s]
  Range (min … max):    2.391 s …  2.730 s    10 runs

Benchmark 2: env USE_RAYON=1 target/release/testit
  Time (mean ± σ):     174.1 ms ±   0.9 ms    [User: 212.1 ms, System: 121.1 ms]
  Range (min … max):   173.1 ms … 175.4 ms    16 runs

Summary
  env USE_RAYON=1 target/release/testit ran
    14.51 ± 0.80 times faster than target/release/testit

Results on M1 Pro:

$ hyperfine target/release/testit 'env USE_RAYON=1 target/release/testit'
Benchmark 1: target/release/testit
  Time (mean ± σ):     692.2 ms ±   8.3 ms    [User: 491.4 ms, System: 5693.6 ms]
  Range (min … max):   683.2 ms … 704.5 ms    10 runs

Benchmark 2: env USE_RAYON=1 target/release/testit
  Time (mean ± σ):      63.0 ms ±   2.1 ms    [User: 97.7 ms, System: 47.0 ms]
  Range (min … max):    61.0 ms …  71.2 ms    44 runs

Summary
  env USE_RAYON=1 target/release/testit ran
    10.99 ± 0.39 times faster than target/release/testit
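
For context, the two code paths have roughly this shape. This is a simplified sketch of the idea rather than the exact code at [1]: `work` is just a stand-in for the real per-item job, and `rayon` is the only external dependency.

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

use rayon::prelude::*;

// Stand-in for the real per-item job.
fn work(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        *b = b.wrapping_mul(31).wrapping_add(7);
    }
}

// One global Mutex-protected queue: every dequeue takes the same lock, so
// with cheap, uniform work items the threads mostly queue up on it.
fn with_global_lock(items: Vec<Vec<u8>>) {
    let queue = Arc::new(Mutex::new(items.into_iter().collect::<VecDeque<_>>()));
    let nthreads = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let handles: Vec<_> = (0..nthreads)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                // The lock is held only for the pop, but short work items
                // bring every thread straight back here.
                let item = queue.lock().unwrap().pop_front();
                match item {
                    Some(mut buf) => work(&mut buf),
                    None => break,
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}

// Rayon's work-stealing pool: each thread works on its own part of the input
// and only synchronizes when it runs out of local work.
fn with_rayon(mut items: Vec<Vec<u8>>) {
    items.par_iter_mut().for_each(|buf| work(buf));
}

fn main() {
    // Mirrors the USE_RAYON=1 switch from the hyperfine command line.
    let items: Vec<Vec<u8>> = (0..200_000).map(|_| vec![1u8; 4096]).collect();
    if std::env::var_os("USE_RAYON").is_some() {
        with_rayon(items);
    } else {
        with_global_lock(items);
    }
}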
Ah, yes. It's good that you ran a benchmark. However, if I read the code correctly, your version with a lock does an extra memcpy per item. When the work function is as short as AES encryption of a block of memory, that copy is a significant cost on its own, and the queue is going to be heavily contended on top of it.
Fedor Pikus has a good talk on this from CppCon 2019.
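
To make the copy concrete, here is a rough sketch of the difference I mean, not the code from [1], just the general shape: in variant A every push and pop moves the block itself through the queue, while in variant B only a slice header crosses the lock and the data is worked on in place.

use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

const BLOCK: usize = 4096;

// Stand-in for a short per-block job, e.g. encrypting the block in place.
fn work(block: &mut [u8]) {
    for b in block.iter_mut() {
        *b ^= 0xA5;
    }
}

// Variant A: the queue owns the blocks by value, so every push and pop moves
// the whole block through the queue -- that is the extra copy.
fn queue_of_copies(blocks: Vec<[u8; BLOCK]>) {
    let queue = Mutex::new(blocks.into_iter().collect::<VecDeque<_>>());
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| loop {
                let item = queue.lock().unwrap().pop_front(); // copies BLOCK bytes out
                match item {
                    Some(mut block) => work(&mut block),
                    None => break,
                }
            });
        }
    });
}

// Variant B: the data stays in one buffer and the queue only hands out
// &mut slices into it, so just a pointer and a length cross the lock.
fn queue_of_slices(data: &mut [u8]) {
    let queue = Mutex::new(data.chunks_mut(BLOCK).collect::<VecDeque<_>>());
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| loop {
                let item = queue.lock().unwrap().pop_front();
                match item {
                    Some(chunk) => work(chunk),
                    None => break,
                }
            });
        }
    });
}

Either way the single lock is still a point of contention; the slice variant only removes the per-item copy.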