I hope LLVM gets better at using AVX-512 instructions efficiently. That looks like the main pain point found here. It's exciting to see that, for the most part, Julia roughly matches the performance of the mature HPC solutions.
I think the issue with vectorization is that the compiler must know that the iterations of a given loop can safely run out of order. I don't know whether LLVM can establish that reliably on its own.
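
To make the "out of order" point concrete, here is a minimal sketch in plain Python (not actual SIMD code; `sum_in_order` and `sum_in_lanes` are hypothetical names used only for illustration). It sums the same floating-point values once strictly left to right and once with independent per-lane accumulators, which is roughly the order a vectorized reduction would use. Because floating-point addition is not associative, the two orders can give different results, so a compiler cannot reorder such a loop unless it has been told that doing so is acceptable.

```python
def sum_in_order(xs):
    """Strict left-to-right sum: each addition depends on the previous one."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_in_lanes(xs, lanes=4):
    """Sum with `lanes` independent accumulators, then combine them.

    This mimics the evaluation order of a SIMD reduction. It is ordinary
    scalar Python, used here only to show the change in ordering.
    """
    partial = [0.0] * lanes
    for i, x in enumerate(xs):
        partial[i % lanes] += x  # element i goes to lane i mod lanes
    total = 0.0
    for p in partial:
        total += p
    return total


# Values chosen so that rounding makes the evaluation order visible.
xs = [1e16, 1.0, -1e16, 1.0]
print(sum_in_order(xs))           # left-to-right order
print(sum_in_lanes(xs, lanes=2))  # two-lane order
```

With these inputs the left-to-right sum gives 1.0, because the first 1.0 is lost to rounding when added to 1e16, while the two-lane sum gives 2.0. As far as I understand it, this is why Julia offers the `@simd` annotation: it lets the programmer state that reordering a reduction like this is acceptable, rather than relying on the compiler to prove it.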