That's quite curious: I'm sure pure Python could have been heavily optimized to reach 2 minutes as well, though. Random number generation in Python is C-based, so while the pseudo-random generators from Python's random module might be slow, that's not the fault of Python itself (https://docs.python.org/3/library/random.html is a different implementation from https://man7.org/linux/man-pages/man3/random.3.html).
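A quick way to convince yourself of that (just a timeit sketch; the numbers will obviously vary by machine): a call into the C-implemented Mersenne Twister costs roughly as much as a call to a do-nothing Python function, so the per-call interpreter overhead, not the generator itself, is where the time goes.

    import timeit

    def noop():
        return 0.0

    n = 1_000_000
    # A million calls into the C-implemented Mersenne Twister...
    t_rand = timeit.timeit("random.random()", setup="import random", number=n)
    # ...versus a million calls to a do-nothing Python function.
    t_noop = timeit.timeit("noop()", globals={"noop": noop}, number=n)
    print(f"random.random(): {t_rand:.3f}s   no-op Python call: {t_noop:.3f}s")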
I'm reasonably sure the PRNG used in the Python version came from numpy and was implemented in C (or other native code, not Python). The problem was that the necessary control flow and varying parameters around it meant you had to call it once per value from Python (and you had to generate a lot of values).
And if I recall correctly there was no allocation in the hot loop: a single large array was initialized via numpy beforehand to store the values. Certainly that's one of the first things I would think to fix.
I was strongly convinced at the time that there was no significant improvement left in Python, with >99% of the time being spent in this one function and no way to move the loop into native code given the primitives available from numpy. Admittedly I could have been wrong, and I'm not about to revisit the code now; it has been years and it is no longer in use, so everything I'm saying is based on years-old memories.
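To illustrate the kind of thing I mean (a made-up sketch, not the real code; the distributions and names are invented): a single numpy array allocated up front, filled one value at a time from Python because each draw's parameters depend on per-element state.

    import numpy as np

    rng = np.random.default_rng(0)

    def fill(scales):
        # Single large allocation up front; nothing is allocated in the hot loop.
        out = np.empty(len(scales))
        # The loop itself has to stay in Python: each draw's parameters depend
        # on per-element state, so it's one call into native code per value.
        for i, s in enumerate(scales):
            if s > 1.0:
                out[i] = rng.normal(0.0, s)
            else:
                out[i] = rng.exponential(s)
        return out

    values = fill(np.abs(rng.normal(1.0, 0.5, size=10_000)))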
Sure, numpy introduces its own set of restrictions. I was mostly referring to taking a different approach before turning to numpy, but it could very well be true.
In essence, doing what you did is the way to get performance out of Python when nothing else works.
> The problem was that the necessary control flow and varying parameters around it meant you had to call it once per value from python (and you had to generate a lot of values).
Call overhead and loop overhead are pretty big in Python, though. The way to work around that is to use C-based "primitives": the stuff from itertools and the builtins for set/list/hash processing (thus avoiding the n^2 case in pure Python). And when memory is an issue (preallocating large data structures can be slow as well), iterators! (E.g., compare using range() in newer Python with using list(range()).)
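Roughly the kind of thing I mean (toy data and invented names, just to show the shape of it):

    from itertools import islice

    haystack = list(range(10_000))
    needles = range(0, 10_000, 7)

    # Quadratic in pure Python: a linear scan of haystack for every needle.
    slow_hits = [x for x in needles if x in haystack]

    # Near-linear: build a set once, then each membership test is a C-level lookup.
    lookup = set(haystack)
    fast_hits = [x for x in needles if x in lookup]
    assert slow_hits == fast_hits

    # Iterators vs. materialized lists: range(n) is lazy, list(range(n)) allocates
    # everything up front, and islice consumes just a prefix of a lazy generator.
    first_ten_squares = list(islice((x * x for x in range(10**9)), 10))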