
Show HN: Shishua – Fast pseudo-random generator - espadrine
https://espadrine.github.io/blog/posts/shishua-the-fastest-prng-in-the-world.html
======
nullc
The design approach of just iterating until it passes a suite of statistical
tests is ... not the best.

The issue is that the common suites of statistical tests don't include every
credible statistical test at their level of complexity.

Essentially, as you retry against the tests you expend their utility,
producing a result that is guaranteed to pass them as a byproduct of your
process, but that may in fact be extraordinarily weak to yet another
reasonable statistical test (and/or broken in some practical application)
that no one has yet thought to include in the test suite.
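
As a toy illustration of that point (not taken from the article, and far
simpler than anything in a real test suite), here is a sketch in Python where
a plainly non-random bit stream scores perfectly on a frequency test yet fails
a runs test. The function names and the 4-sigma threshold are assumptions
chosen for this example.

```python
import math

def monobit_z(bits):
    """Z-score of the number of ones against the expected n/2."""
    n = len(bits)
    ones = sum(bits)
    return (ones - n / 2) / math.sqrt(n / 4)

def runs_z(bits):
    """Z-score of the number of runs (Wald-Wolfowitz runs test)."""
    n1 = sum(bits)
    n0 = len(bits) - n1
    n = n0 + n1
    # A new run starts wherever two neighbouring bits differ.
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    mean = 2 * n0 * n1 / n + 1
    var = 2 * n0 * n1 * (2 * n0 * n1 - n) / (n * n * (n - 1))
    return (runs - mean) / math.sqrt(var)

def passes(z, threshold=4.0):
    """Arbitrary pass/fail cut-off, assumed for this sketch."""
    return abs(z) < threshold

# 0, 1, 0, 1, ... has exactly as many ones as zeros, but far too many runs.
alternating = [i % 2 for i in range(10_000)]
print("monobit passes:", passes(monobit_z(alternating)))
print("runs passes:   ", passes(runs_z(alternating)))
```

A design tuned only against the first test would look fine until someone
thought to run the second.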

~~~
espadrine
Very true. Sadly, this is the main approach the non-cryptographic PRNG
community has at the moment.

On the plus side, those tests are fairly extensive and brutal nowadays. They
detect anomalies in cryptographic algorithms that were still used a decade
ago, such as arc4random().

But one thing I hint at at the end is that I hope, and anticipate, to see more
accurate quality measurement tools in the future.

~~~
samatman
A partial workaround is to reserve one or two of the tests until your PRNG
passes the others.

If it passes the remainder on the first try, you might be on to something.

If it fails, you've got a problem: tweaking until it passes isn't going to
help you.
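
A minimal sketch of that protocol in Python, assuming two toy tests (a byte
frequency chi-square used for tuning, and a lag-1 correlation check held in
reserve). The test functions, their limits, and the verdict strings are all
hypothetical and only serve to show the order of operations.

```python
from collections import Counter

def byte_chi2_ok(data, limit=400.0):
    """Chi-square of byte frequencies against uniform (255 degrees of freedom)."""
    expected = len(data) / 256
    counts = Counter(data)
    chi2 = sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))
    return chi2 < limit

def serial_corr_ok(data, limit=0.05):
    """Lag-1 correlation between consecutive bytes should be near zero."""
    xs, ys = data[:-1], data[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov / (vx * vy) ** 0.5) < limit

def evaluate(candidate, tuning_tests, holdout_tests):
    """Consult the held-out tests only after the tuning tests pass."""
    if not all(test(candidate) for test in tuning_tests):
        return "keep tweaking"
    if all(test(candidate) for test in holdout_tests):
        return "promising"
    return "rethink the design"

# A byte counter has perfectly uniform frequencies but is highly predictable.
counter_bytes = [i % 256 for i in range(256 * 64)]
verdict = evaluate(counter_bytes, [byte_chi2_ok], [serial_corr_ok])
print(verdict)
```

Once the reserved test has been looked at, it is spent: tweaking the design
until it passes turns it into one more tuning test.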

------
nkurz
> Indeed, AVX2 has a bit of a quirk where regular registers (%rax and the
> like) cannot directly be transferred to the SIMD ones with a MOV; it must go
> through RAM (typically the stack), which costs both latency and two CPU
> instructions (MOV to the stack, VMOV from the stack).

I don't think this is quite correct. You can move efficiently from r64 to
XMM/YMM without going through memory with MOVQ/_mm_cvtsi64_si128. I'm not able
to look into it more closely right now, but these links should give insight:

[https://www.felixcloutier.com/x86/movd:movq](https://www.felixcloutier.com/x86/movd:movq)

[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=movq&expand=1885)

My vague recollection is that you might be right that this is clumsy with
AVX2. Maybe it's a case where you have to take advantage of the fact that XMM1
and YMM1 are the same register, and cast the YMM to XMM? Or just drop to
inline assembly.

But thanks for the writeup, and I hope to be able to look at Shishua more
closely at some point in the future.

~~~
BeeOnRope
Yes, most instructions that modify only the bottom element(s) didn't get a ymm
(256-bit) version since it would serve no purpose as it would produce the same
result as the xmm one, and the corresponding intrinsics mostly follow the same
pattern. So there is no int64 -> ymm intrinsic.

An intrinsic cast works fine though:

[https://godbolt.org/z/M9XWCb](https://godbolt.org/z/M9XWCb)

Intel even says about the cast:

> Cast vector of type __m128i to type __m256i; the upper 128 bits of the
> result are undefined. This intrinsic is only used for compilation and does
> not generate any instructions, thus it has zero latency.

That seems a bit beyond their mandate since what the compilers generate is
mostly up to them, and in fact it doesn't seem true: at -O0, both gcc and
clang generate a few extra instructions for the cast. With optimization on,
it's all good though.

------
jedisct1
If you need something fast and with stronger security guarantees, Google
Randen remains a solid choice.
[https://github.com/google/randen](https://github.com/google/randen)

~~~
lern_too_spel
The article mentions Randen is slower than ChaCha8.

~~~
grantwu
Where? CTRL-F Randen doesn't show anything, and the Randen repo claims it's
faster than ChaCha8.

~~~
espadrine
It is not directly in the article, but in a link to a tweet by djb, the
creator of ChaCha8. He believes that the cpb listed in the Randen comparison
is off:

[https://twitter.com/hashbreaker/status/1023965175219728386](https://twitter.com/hashbreaker/status/1023965175219728386)

He mentions that perhaps the implementation of ChaCha8 for the benchmark is
done by hand and unoptimized. And it is true from what I saw that a lot of
benchmarks with ChaCha8 are implemented with none of the tweaks that make it
fast.

In this instance, it looks like the Randen author didn’t reimplement it from
scratch, but they used an SSE implementation, not an AVX2 one, which would
have been faster:
[https://github.com/google/randen/blob/1365a91bafc04ba491ce79...](https://github.com/google/randen/blob/1365a91bafc04ba491ce79b88968744726225cf1/engine_chacha.h)

------
galkk
This looked ugly and so much out of place...

> Or, as Pearson puts it:

> From this it will be more than ever evident how little chance had to do with
> the results of the Monte Carlo roulette in July 1892.

> (Not sure why his academic paper suddenly becomes so specific; maybe he had
> a gambling problem on top of being a well-known racist.)

~~~
bjoli
I found it quite funny. And I don't really find calling a well-known eugenics
advocate racist to be even remotely provoking. Maybe out of style compared to
the rest of the text, but apart from that I didn't find it conspicuous in the
least.

------
kortex
> One of the not-ideal aspects of the design is that SHISHUA is not
> reversible.

Why is irreversibility a bad thing? Is it just the loss of useful state which
reduces the internal entropy?

Is irreversibility ever useful? Wouldn't some amount of discarded bits make it
harder for an adversary to infer the state?

~~~
KMag
CSPRNGs are mentioned a few times in TFA, but cryptographic security doesn't
appear to be among the design criteria for SHISHUA. There's no mention of
linear or differential cryptanalysis, which is table stakes for a CSPRNG.

~~~
espadrine
I added a warning in the GitHub README. SHISHUA is not to be used for
cryptographic purposes.

You mentioned the lack of cryptanalysis; there is also the lack of rounds
(which prevents researchers from breaking partial versions to ease their
study, and allows setting a security margin).

------
gok
What is a real world use case where CSPRNGs are too slow on modern hardware?

~~~
mschuetz
Shuffling 100 million points per second (probably much more if it weren't
limited by both disk I/O and CPU->GPU transfer), and shuffling in general if
you want to maintain high throughput, e.g.:
[https://bit.ly/2zcUzJq](https://bit.ly/2zcUzJq)
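
To make the dependency on PRNG speed concrete: a standard Fisher-Yates shuffle
needs roughly one random draw per element, so shuffling 100 million points per
second means about 100 million draws per second. The sketch below is a generic
version in Python, not the linked project's implementation, and uses the
standard library's `random.Random` purely as a stand-in generator.

```python
import random

def fisher_yates(items, rng):
    """Shuffle items in place, drawing one bounded integer per swap."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randrange(i + 1)  # uniform index in [0, i]
        items[i], items[j] = items[j], items[i]
    return items

points = list(range(10))
shuffled = fisher_yates(points[:], random.Random(1234))
print(shuffled)
```

The loop body is a single swap, so the cost of each random draw makes up a
large share of the work per element.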

------
closed
This is not a bad thing, and maybe I'm alone in this, but I spent my first
minute thinking over whether "Shishua" was some kind of romanized Mandarin or
Mandarin-esque word!

(I speak Cantonese, though, so maybe Mandarin speakers wouldn't fall into this
trap)

~~~
yorwba
It's a "stone brush" 石刷 _shíshuā_. (Alternatively "It's a number!" 是数啊 _shì
shù a_ , but then pinyin orthography requires writing it as "Shishu'a" with an
apostrophe in front of the third syllable.)

------
azhenley
Will everyone soon be using Shishua then? (I don't know much about PRNGs.)

~~~
espadrine
Good question.

SHISHUA is a strong design, suitable for many use-cases. But the decision of
which PRNG is right for you depends on platform, context, comfort, and ease of
access.

If you are making a simple video game, it might be fine to just use your
standard library; or to write one of the simple ones that you memorized and
that don’t have too terrible an output.

If your video game has more severe stakes, if it is the backbone of a virtual
economy in a massively multiplayer game where servers feed on a large amount
of randomness that needs high quality to ensure players don’t find flaws and
abuse the system for their own gain, maybe SHISHUA is right for you.

If you are targeting a platform that does not support 64-bit computation,
SHISHUA can’t do it well, and other PRNGs such as sfc32 can be good there.
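
For reference, here is a sketch of the sfc32 step function as I recall it from
PractRand, written in Python with explicit 32-bit masking to show that it only
needs 32-bit adds, shifts, and a rotate. The class name is made up for this
example, and the seeding routine is an assumption; check the PractRand source
before relying on either.

```python
MASK32 = 0xFFFFFFFF

class Sfc32:
    """Illustrative sfc32: three 32-bit state words plus a 32-bit counter."""

    def __init__(self, seed):
        # Seeding is an assumption for this sketch: split a 64-bit seed
        # across two state words, then discard a few outputs to mix it in.
        self.a = 0
        self.b = seed & MASK32
        self.c = (seed >> 32) & MASK32
        self.counter = 1
        for _ in range(12):
            self.next_u32()

    def next_u32(self):
        out = (self.a + self.b + self.counter) & MASK32
        self.counter = (self.counter + 1) & MASK32
        self.a = self.b ^ (self.b >> 9)
        self.b = (self.c + (self.c << 3)) & MASK32
        rotated = ((self.c << 21) | (self.c >> 11)) & MASK32  # rotate left 21
        self.c = (rotated + out) & MASK32
        return out

rng = Sfc32(42)
sample = [rng.next_u32() for _ in range(4)]
print(sample)
```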

If you are working on machine learning that feeds on heaps of randomness over
the course of months of computation, and that needs quality to ensure
unbiased, even learning, SHISHUA can be right for you.

There was a neat article recently that contrasted a large number of PRNGs for
use on video games, in particular on consoles:
[https://rhet.dev/wheel/rng-battle-royale-47-prngs-9-consoles...](https://rhet.dev/wheel/rng-battle-royale-47-prngs-9-consoles/)

It predates the publication of SHISHUA, but the insights there are
interesting.

~~~
petermcneeley
I couldn't find the license. Is this MIT?

~~~
espadrine
CC0 it is! I added the legal information in the GitHub project.

