Pipelining also played a role. The Quake inner rasterization loop has a decent amount of non-division math in it as well, which leverages the Pentium's ability to execute FP adds/multiplies at 1 per cycle. The K6 and 6x86 FPUs were considerably slower -- 2 and 4 cycles respectively, non-pipelined (http://www.azillionmonkeys.com/qed/cpuwar.html).
Additionally, the FXCH instructions required to optimally schedule FPU instructions on the Pentium hurt 486/K6/6x86 performance even more since they cost additional cycles. Hard for the 6x86 to keep up when it takes 7 cycles to execute an FADD+FXCH pair vs. 1 for the Pentium.
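To put rough numbers on that gap, here is a minimal back-of-the-envelope sketch using only the per-pair figures quoted above (1 cycle on the Pentium, 7 on the 6x86). The function names and the 16-pair run length are hypothetical, picked purely for illustration, and the model ignores latency, pipeline fill, clock speed differences and everything else in the loop.

```python
# Cycles per FADD+FXCH pair, taken from the figures quoted above.
# This is a steady-state throughput model only (an assumption for illustration).
PAIR_CYCLES = {"Pentium": 1, "6x86": 7}


def cycles_for_pairs(cpu: str, pairs: int) -> int:
    """Total cycles to issue `pairs` back-to-back FADD+FXCH pairs on `cpu`."""
    return PAIR_CYCLES[cpu] * pairs


def slowdown(cpu: str, baseline: str = "Pentium") -> float:
    """How many times more cycles `cpu` needs per pair than `baseline`."""
    return PAIR_CYCLES[cpu] / PAIR_CYCLES[baseline]


if __name__ == "__main__":
    pairs = 16  # hypothetical run length, not a measured figure from Quake
    for cpu in PAIR_CYCLES:
        print(cpu, cycles_for_pairs(cpu, pairs), "cycles")
    print("6x86 slowdown per pair:", slowdown("6x86"))
```

Even in this simplified model, the FP-add portion of a loop scheduled for the Pentium costs the 6x86 seven times as many cycles, before any difference in clock speed is considered.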