The Fortran implementation that I see on GitHub is in no way the most optimized implementation one could write, or would write for any serious heavy numerical computation. Yet it is amazing to see how well Fortran performs given the lack of optimization in the benchmark code. Also, performance depends heavily on the compiler and its optimization flags, and there is no mention of the flags used for the Fortran compiler. And remember, optimizations do not end with the "-O3" flag in Fortran (and C). A Julia benchmark written by Julia developers showing that Julia outperforms any other language does not seem to be the best way of assessing languages' performance. Did you have experts in each of the other languages implement the respective codes?
From what I understand, the point of these benchmarks was to write relatively straightforward, idiomatic code in all the languages. The Julia devs were well aware that their knowledge of Julia would make benchmark hacking straightforward, so they purposefully passed up many of the optimization opportunities available to them. They did the same with the Fortran and C benchmarks.
These things are all pretty subjective, though, when it comes to what counts as a fair benchmarking methodology. Another valid approach would be to have experts in each language try to write as optimal an implementation as possible, but this would quickly become a bit of an arms race.
But even "relatively straightforward idiomatic code" is highly opinionated and developer dependent. For example, the array operation `a(:) = b(:) + c(:)` is very well known to be slower (and much uglier) than `a = b + c` in Fortran. Yet many developers write in the former style, and I see instances of it in the Fortran code of the Julia benchmark. Also, Fortran is a natively parallel, concurrent, vectorized language. With the `do concurrent` construct, a loop is declared free of cross-iteration dependencies, which lets the compiler parallelize or vectorize it automatically. The fact that other languages, and perhaps Julia, lack such native capabilities does not mean these features of Fortran should be excluded from a benchmark. Otherwise, what is really the point of a benchmark?
The point of the benchmark is to measure serial loops and the compilers' automatic SIMD, so no, parallelizing it doesn't make sense. In fact, Julia has native multithreaded and distributed parallelism (which can utilize MPI for the interconnects). Testing native distributed parallelism would be a separate benchmark, and not a very interesting one, since any language with these same constructs (so Julia, C++, Fortran) will end up bandwidth limited at the same speed. It is something to give students as a homework problem, though, since it's easy to demonstrate and teaches how to use the parallel constructs.
It depends on what's going on. JIT compilers have more information to optimize on, so they can do surprising things. For example, FFI calls into shared libraries are generally faster in fast JIT-compiled languages.
This is one reason why you could see Julia outperforming Fortran in some cases where FFI speed matters. Fortran does have easier aliasing analysis (the language rules let the compiler assume that procedure arguments don't alias), which helps there, but other than that most of the compiler passes are pretty much the same.
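
To make the aliasing point concrete, here is a minimal sketch in C rather than Fortran. It is not taken from the benchmark code, and the function names `add_may_alias` and `add_no_alias` are made up for illustration. In C the compiler has to assume that pointer arguments may overlap unless the programmer promises otherwise with `restrict` (C99), whereas a Fortran compiler gets roughly that promise for free on ordinary array arguments.

```c
#include <stddef.h>

/* No promise about overlap: out may point into a or b, so the compiler
   must either keep the loop order or emit a runtime overlap check
   before using a vectorized version. */
void add_may_alias(double *out, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* restrict promises that out, a and b do not overlap for the duration
   of the call. This is about what a Fortran compiler may assume for
   dummy array arguments, so the loop can be vectorized without a check.
   Calling this with overlapping arrays is undefined behavior. */
void add_no_alias(double *restrict out, const double *restrict a,
                  const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

With non-overlapping arrays both functions give the same result. A call such as `add_may_alias(a + 1, a, b, 3)` is legal and makes each iteration depend on the previous one, which is exactly the case the compiler has to guard against when it cannot rule out aliasing. Whether a given compiler actually generates different code for the two versions depends on the compiler and flags, so treat this as an illustration of the rule, not a performance claim.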