upd: apparently reasons like:
> So I guess for most of the case loading or storing i128, the data will be used by some library functions running on cores instead of NEON, so storing i128 to two GPR64 is more general.
Thats polynomial multiply. Its (almost) a multiplication in GF2 for elliptical curves. Thats not a "normal" multiply.
"PMULL" is basically a bitshift and XOR. Your traditional "MUL" is bitshift and ADD. Its called "polynomial multiply" because bitshift-and-xor has very similar properties to bitshift-and-add (it distributes over XOR, associative, communative, etc. etc).
Bitshift-and-xor has a few features that are better for cryptography. But its NOT the multiplication you are taught in grade school.
EDIT: With that being said... those "better features" for cryptography would make PMULL probably a better function for random-number generation. PMULL will return a different result than the real multiplication, but you'll have an easier time making a field (aka: reversable 1-to-1 bijections) out of PMULL than MUL...