
I thought the M1 had dedicated JS instructions.



There are several theories as to why the M1 does so well on JavaScript benchmarks.

>Firestorm can do 4 FADDs and 4 FMULs per cycle with respectively 3 and 4 cycles latency. That’s quadruple the per-cycle throughput of Intel CPUs and previous AMD CPUs, and still double that of the recent Zen3, of course, still running at lower frequency. This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...

>Apple has confirmed that it’s a massive 192KB instruction cache. That’s absolutely enormous and is 3x larger than the competing Arm designs, and 6x larger than current x86 designs, which yet again might explain why Apple does extremely well in very high instruction pressure workloads, such as the popular JavaScript benchmarks.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...


> "There are several theories as to why the M1 does so well on on JavaScript benchmarks."

A major reason it does so well is the native Javascript support added in ARMv8.3-A.

Specifically, FJCVTZS (Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero)[1] which is a Javascript-specific variant of FCVTZS implementing the overflow and exception handling behaviour that Javascript wants.

Javascript needs to do these conversions a lot, since it doesn't have integer types, and it's much faster to have it implemented in silicon rather than using the old instructions and having to handle the overflow/error checking.

[1] https://developer.arm.com/documentation/100076/0100/a64-inst...


People who were looking for a root cause of the iOS JavaScript performance advantage have been pointing to that instruction as "the" reason before it was even used by Safari.

https://mobile.twitter.com/saambarati/status/104920213252247...

I'm sure it doesn't hurt, but the performance advantage predates the instruction's use.

Here's another theory.

>Finally doing perf counters on a dedicated test bench... 9900K having 60% worse branch misprediction than Apple's A12.

https://twitter.com/andreif7/status/1307420010177007625


The second theory seems better in my experience. I pulled a few JS benchmarks and ran them in native Safari on the M1 vs Node under Rosetta 2, and Rosetta 2 + Node is significantly faster, suggesting the instruction set is not the key.


As far as I understand, that instruction is needed to get x86 FP semantics on ARM. I.e. it doesn't make JS faster, it just makes it not as slow as it would otherwise be.

Also, apparently most JS engines do not actually use the FP unit but can do most computations on the integer units (unless it's really FP math), which normally have much lower latency (but also lower bandwidth).
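This works because engines box values in NaN space: an int32 payload lives inside an otherwise-unused quiet-NaN bit pattern, so "number" arithmetic can stay on the integer ALU most of the time. A hypothetical sketch (the tag constants here are made up, loosely in the style of JavaScriptCore/SpiderMonkey):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical NaN-boxing sketch: doubles are stored as their raw bit
   pattern, int32 payloads are tagged inside an unused quiet-NaN range. */
typedef uint64_t Value;

#define TAG_MASK UINT64_C(0xFFFF000000000000)
#define TAG_INT  UINT64_C(0x7FFC000000000000)

static Value   box_int(int32_t i)   { return TAG_INT | (uint32_t)i; }
static int     is_int(Value v)      { return (v & TAG_MASK) == TAG_INT; }
static int32_t unbox_int(Value v)   { return (int32_t)(uint32_t)v; }

static Value  box_double(double d)  { Value v; memcpy(&v, &d, sizeof v); return v; }
static double unbox_double(Value v) { double d; memcpy(&d, &v, sizeof d); return d; }

static double as_double(Value v) {
    return is_int(v) ? (double)unbox_int(v) : unbox_double(v);
}

static Value add_values(Value a, Value b) {
    if (is_int(a) && is_int(b))     /* integer fast path: no FP unit */
        return box_int(unbox_int(a) + unbox_int(b));
    return box_double(as_double(a) + as_double(b));  /* FP slow path */
}
```

A real engine also has to handle int32 overflow on the fast path; that's omitted here.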


I've been thinking about the instruction cache of the M1 lately.

Another difference between x86 and ARM is that, for historical reasons, on x86 there is no need to invalidate the instruction cache explicitly when writing instructions to memory. That is, on x86, when a core writes to memory, the hardware has to invalidate the corresponding line in the instruction cache. Since the instruction cache is usually VIPT for performance reasons, it has to be indexed only by bits which don't change in the virtual-to-physical mapping, otherwise there's a risk of cache aliases. For an instruction cache, an alias should not be a problem (it just wastes space with duplicated data), except that all aliases have to be flushed when invalidating by physical address.

IIRC, in 64-bit ARM user space (EL0) the only available instruction to invalidate the instruction cache is an "invalidate by virtual address" instruction. Since calling that instruction (after calling an instruction to flush the data cache to the point of unification) is required on ARM, there's no need to be able to invalidate all aliases of a physical address, like would be required on x86. That means it would be easier on ARM to have much larger instruction caches than on x86.
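Concretely, a JIT that writes machine code has to do this maintenance explicitly on AArch64, whereas on x86 a plain store suffices. A minimal sketch using the GCC/Clang builtin (which expands to the DC CVAU / IC IVAU sequence on AArch64 and to nothing on x86):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of JIT-style code patching: copy new instructions into an
   (executable) buffer, then make them visible to the fetch unit. */
void patch_code(uint8_t *code, const uint8_t *insns, size_t len) {
    memcpy(code, insns, len);
#if defined(__aarch64__)
    /* Flush dcache to the point of unification and invalidate the
       icache by virtual address for the patched range. */
    __builtin___clear_cache((char *)code, (char *)code + len);
#endif
    /* On x86 the hardware keeps the icache coherent; nothing to do. */
}
```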


It's just part of the ARM instruction set.


Yeah, but any guesses as to which ARM architecture licensee submitted these instructions to the ISA? It was probably the one who was first, by several years, to implement them, right? ;)


Where'd you get that from?

The m1 _does_ have one weird non-ARM extension; it can adopt an x86 memory model on demand.


It's got better/more floating-point capacity, which makes it faster at JS because all numbers in JS are floating-point numbers.


Which are part of arm ISA


What are js instructions and what does javascript(?) have to do with ISA?


IIRC, the standard ARM instruction set does include some instructions with "javascript" in the name. I think they're floating-point instructions that handle some edge case (NaNs?) in the same way that x86, and therefore the JS spec, do.


See here:

https://developer.arm.com/documentation/dui0801/g/A64-Floati...

Not Apple Silicon specific though


This raises so many questions for me that are predicated on this being the case, but this is way out of my league so I could be off the mark. Incoming geeking about Universal Programming Languages.

1. If you can increase the performance of a language by adding language specific instruction sets to the processor, how much of a boost would we see on average?

2. Instead of building in instructions for, say, Python, C#, Java, etc., what if you built them for a language designed to create other languages, i.e. Racket? (I like Racket, but another language that specializes in this would be fine.) Since languages built on Racket ultimately compile to valid Racket code, you'd still get the speedup, and you could program with the features and syntax you want (within reason, obviously constrained by the limitations of the base language).

3. I wasn't around for this, but isn't this what made Lisp faster on Lisp Machines? Since they had dedicated hardware for interpreting Lisp instructions?


You can certainly speed up a dynamic language by building a VM which is a more or less ideal translation target for that language, and then implementing that VM in real hardware.

Hardware can parallelize type checks. For instance, you can have an add instruction which proceeds on the assumption that the two arguments are numbers. In parallel, a type checking unit in the hardware can abort that instruction and cause a branch to some handler if the operand types are wrong.
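In software the guard is a branch; hardware could run the add and the tag check in the same cycle and squash the add on a mismatch. A sketch of the software version, using the classic low-tag-bit scheme for small ints (the tagging here is illustrative, not any particular engine's):

```c
#include <stdint.h>

/* Tagged small ints: value x is stored as (2x + 1), so the low bit 1
   marks "int". Anything with low bit 0 goes to a generic slow path. */
typedef intptr_t Value;

static Value    tag_int(intptr_t i) { return (i << 1) | 1; }
static intptr_t untag_int(Value v)  { return v >> 1; }

/* Stand-in for the out-of-line generic handler (doubles, strings...). */
static Value slow_add(Value a, Value b) {
    return tag_int(untag_int(a) + untag_int(b));
}

static Value fast_add(Value a, Value b) {
    if ((a & b & 1) != 0)     /* type guard: both operands tagged ints */
        return a + b - 1;     /* (2x+1) + (2y+1) - 1 == 2(x+y) + 1 */
    return slow_add(a, b);    /* guard failed: branch to the handler */
}
```

The point of doing the check in hardware is that the guard branch disappears from the hot path entirely.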

Function calls in dynamic languages can be expensive partly due to the dynamic checking that there are not too few or too many arguments. This affects code that doesn't otherwise require type checking (like logic that is doing nothing more than just passing arguments through several layers of functions and capturing return values). Hardware can help here also.

The industry moved toward general-purpose hardware. The Lisp people figured out ways to compile Lisp well to general-purpose hardware.

The thing about hardware is that it also needs optimization: a machine designed for ideal execution of Lisp is going to be expected to produce new revisions that are faster and faster, keeping up with advances in general-purpose machines. If it fails to do so, its performance will be overtaken by Lisp that is compiled to the general-purpose machines.



