
Speed Without Wizardry - mnemonik
http://fitzgeraldnick.com/2018/02/26/speed-without-wizardry.html
======
austincheney
The simple rule I have found for achieving superior performance in high-level
languages, particularly JavaScript, is to simply _do less_. It isn't that
simple though.

Doing less really means less total code at the current compilation target,
essentially feeding fewer instructions to the compiler. This means no
frameworks and minimal abstractions. It means having a clear appreciation for
the APIs you are writing to. It means minimizing use of nested loops, which
exponentially increase statement count.

Sometimes caching groups of instructions in functions can allow for cleaner
code with a positive performance impact.

V8 cannot compile arithmetic assignment operators, which it calls left-side
expressions, so you can see a rapid speed boost in V8 when you replace
something like _a += 1_ with _a = a + 1_.

The side benefit of less code is generally clearer and cleaner code to read.
There isn't any wizardry or black magic. No tricks or super weapon utilities.

As an example I wrote a new diff algorithm last year that I thought was really
fast.
https://news.ycombinator.com/item?id=13983085
This algorithm is only fast because it does substantially less than other
algorithms. I only wrote it because I could not wrap my head around the more
famous Myers' O(ND) algorithm. A side benefit, in this case, of doing less is
an algorithm that produces substantially more accurate results.

~~~
whatyoucantsay
> "It means minimizing use of nested loops, which exponentially increase
> statement count."

Nesting two loops has an n^2 cost and nesting 3 levels deep costs n^3. At no
point does it ever cost 2^n or any x^n.

It's polynomial, not exponential.
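
To make the difference concrete, using the 1000-iteration figure that comes up
below:

    nesting depth 2: n^2 = 1000^2 = 1,000,000 iterations
    nesting depth 3: n^3 = 1000^3 = 1,000,000,000 iterations
    exponential:     2^n = 2^1000 ≈ 1.07 * 10^301 operations

Polynomial growth raises n to a fixed power; exponential growth puts n in the
exponent, which is an entirely different order of thing.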

~~~
austincheney
Before I begin just let me say I am not a mathematician. I program so that I
don't have to do complex math.

I know in reality the frequency of iterations varies considerably but for
simplicity of discussion let's remove variability.

Say we have a loop with 1000 iterations. That is at minimum 1000 statements in
the loop body plus expression overhead from the loop itself. If this loop is
nested once with a same sized loop there are now 1,000,000 iterations plus
some expression overhead per iteration. If it is nested twice deep there are
now 1 billion iterations.

That example is exponential of 1000. Given that there is overhead associated
with operation of a loop it is actually greater than exponential. It may not
be quite so dramatic as a logarithmic growth curve though.

I completely concede that in reality loops vary in iteration count and so
nested loops aren't likely perfectly exponential unless their iteration counts
are identical. The increase of iterations from nesting loops does increase
more dramatically than a simple multiplicative operation as the depth of loop
nesting increases, such that the growth of total iterations is a curve on a
graph. A polynomial growth operation when graphed should present a straight
diagonal line without the presence of a third variable.

~~~
tachyoff
I wouldn’t call time complexity particularly complex math. Imagine a loop. It
does something n times. Within that loop, you do something n times. For each
outer loop through n, you do an inner loop through n. Trivially, this is n*n,
or n^2, also known as quadratic time. If you nest another loop, it becomes
n^3, or cubic time.

Anyway, there might be some confusion of terms, perhaps. Exponential time in
the algorithmic complexity sense is any algorithm that takes 2^n operations to
complete. If you’re talking about the number of iterations after you loop
through n things n times, then that increase itself would not be exponential.
But that delta in iterations is unrelated to exponential and polynomial time
complexity.

~~~
austincheney
I am sooo not a math person.

Let's not forget there is overhead to loops. At a minimum let's assume there
is a single statement in the loop body, an increment statement and a terminal
condition. In a single loop of 1000 iterations there are 3000 things to
evaluate. The math becomes:

(n * 3)^x

If a simple loop is nested twice (three levels deep) there would be 1 billion
iterations but about 27 billion evaluations. Would it be correct to say that
is just slightly faster polynomial growth?
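
Taking the (n * 3)^x formula at face value with n = 1000 and x = 3:

    (3n)^3 = 27 * n^3 = 27,000,000,000 evaluations

The constant 27 makes the total larger, but it does not change the growth
class: 27 * n^3 is still on the order of n^3, i.e. cubic and therefore
polynomial, not exponential.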

------
nickm12
I've got to admire the graciousness in this response. It's making the point
that mraleph's “Maybe you don’t need Rust and WASM to speed up your JS”
article completely neglected code maintainability as a factor, but it does so
without turning the whole thing into a pissing match. It's all been
fascinating to read.

~~~
mraleph
> completely neglected code maintainability

Where did I neglect maintainability as a factor? The only optimization that
potentially affects maintainability is manually allocating Mapping-s in the
typed array. And there I openly acknowledged that it affects readability and
makes the code error-prone. None of the other optimizations affect
maintainability in any way.

Even the typed array optimization is confined purely to the library
internals... On the other hand, WASM spills out of the library by requiring
users to explicitly destroy SourceMapConsumer.

~~~
ghusbands
Turning a function into a string and back is also a reduction in
maintainability, as programmers now have to be sure it doesn't use any
captured variables or such, in future edits. Using a Uint8Array and integer
constants rather than a string and character constants makes the code harder
to read. Separating the sorting pass and doing some of it lazily (for good
performance reasons) makes the final ordering slightly less clear.

But "completely neglected code maintainability" is definitely unfair. While
you made changes that reduced maintainability, you weren't neglectful.

------
Felz
Funny, I was using the source-map library under Nashorn. The performance was
poor enough that I had to switch to embedding V8; I'm not sure whether that
was a consequence of Nashorn itself being too slow, or the Javascript
optimizations intended for V8/Firefox just completely missing their mark.

Not that the WASM version of the library would've helped, since Nashorn
doesn't do WASM at all. But maybe the performance would've been decent if it
had.

------
hobofan
> But a distinction between JavaScript and Rust+WebAssembly emerges when we
> consider the effort required to attain inlining and monomorphization, or to
> avoid allocations.

I'm not sure that is true. Having worked/interacted with a lot of people
working with Rust at different experience levels, most of them (that includes
me) don't have a deep knowledge of which Rust concept maps to which underlying
concept, or of the performance implications. And if they do, it's often only
partial. I'd say that right now, only very few people who don't work on the
Rust compiler have a broad knowledge in that area. Sure, it's much better to
have the result of the optimization expressed in the code itself, but I'd say
that the amount of knowledge and effort required to get to such a level of
optimization is similar to optimizing Javascript.

I also found the mention of `#[inline]` annotations a bit disingenuous. In
the end they are just _suggestions_, and you are just as much at the mercy of
the Rust/LLVM optimizer to accept them as you are with a Javascript JIT.
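
For readers who haven't seen the attribute, a minimal sketch of what such a
hint looks like (the function names here are made up for illustration; only
the attributes themselves come from Rust):

    // `#[inline]` is a hint, not a command: rustc/LLVM may still decide not
    // to inline this function, much as a JS JIT follows its own inlining
    // heuristics.
    #[inline]
    fn column_delta(generated: u32, original: u32) -> u32 {
        generated.abs_diff(original)
    }

    // `#[inline(never)]` leans the other way, and it too is only a suggestion.
    #[inline(never)]
    fn column_delta_no_inline(generated: u32, original: u32) -> u32 {
        generated.abs_diff(original)
    }

    fn main() {
        println!("{}", column_delta(10, 3) + column_delta_no_inline(7, 2));
    }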

I'm a big fan of Rust, and I'm a big fan of Rust+WebAssembly (working with it
is the most fun I've had programming in a long time!). Generally I think that
Rust has one of the better performance optimization stories, I just don't
agree with some of the sentiments in the post. There are also enough other
reasons to love Rust+WebAssembly besides just the performance!

~~~
lambda
> I'd say that the amount of knowledge and effort required to get to such a
> level of optimization is similar to optimizing Javascript.

Really? After reading all of the articles in this series (the original about
porting to Rust, the rebuttal about optimizing JavaScript, and this one)?

I'm more familiar with Rust than JavaScript, but I found that other than the
algorithmic optimization, the rest of the JavaScript optimizations were very
non-obvious in order to achieve an effect that is entirely natural in Rust.

There's no need to be careful to avoid particular types of function calls,
since there are no dynamic variadic function calls in Rust. There's no need to
resort to using the equivalent of `eval` for monomorphization, it's a natural
feature of the language. There's no need to do manual memory management by
encoding values into typed arrays of integers; you can simply allocate vectors
of structs, borrow references, and so on.

These are all things that had significant cost in JS, and needed to be worked
around via some intensive profiling and knowledge of how JIT engines work, but
just doing things naturally in Rust leads to a pretty much zero-cost solution.
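
A minimal sketch of what that looks like in practice (the Mapping fields and
the helper below are invented for illustration, not taken from the actual
source-map code):

    // A plain struct; the JS version had to pack the same data by hand into
    // a typed array of integers to avoid allocation overhead.
    struct Mapping {
        generated_line: u32,
        generated_column: u32,
        original_line: u32,
        original_column: u32,
    }

    // A generic function: the compiler monomorphizes it, emitting a
    // specialized copy per concrete key type. No eval-style string tricks.
    fn sort_mappings<K: Ord>(mappings: &mut [Mapping], key: impl Fn(&Mapping) -> K) {
        mappings.sort_by_key(|m| key(m));
    }

    fn main() {
        let mut mappings = vec![
            Mapping { generated_line: 2, generated_column: 0,
                      original_line: 1, original_column: 4 },
            Mapping { generated_line: 1, generated_column: 0,
                      original_line: 1, original_column: 0 },
        ];
        // Borrow the Vec's contents and sort in place.
        sort_mappings(&mut mappings, |m| (m.generated_line, m.generated_column));
        println!("first generated line: {}", mappings[0].generated_line);
    }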

It's true that if you want to really get into the nitty-gritty of micro-
optimization in Rust, you need to learn some more and do things like profiling
and inspecting the generated code to see what optimizations the compiler was
able to apply or not.

But the Rust rewrite of source maps did none of that; they just rewrote it in
fairly idiomatic Rust, and achieved a similar speedup to what a JIT engineer
armed with a profiler, a deep knowledge of what affects how well a JIT works,
and a willingness to do manual memory management in typed arrays was able to
do.

> Having worked/interacted with a lot of people working with Rust on different
> experience levels, most of them (that includes me) don't have a deep
> knowledge of what Rust concept maps to a specific concept with which
> performance implications

What parts of Rust do you feel like you don't understand the performance
implications of? I feel like the mental model for performance is relatively
similar to C or C++, with relatively few things that would surprise you if
you're familiar with modern C++ (in fact, a lot fewer performance surprises
than modern C++ offers, in addition to the extra safety).

------
DannyBee
I really don't understand this article, and the claims really rub me the wrong
way.

The main point it makes is, again "He perfectly demonstrates one of the points
my “Oxidizing” article was making: with Rust and WebAssembly we have reliable
performance without the wizard-level shenanigans that are required to get the
same performance in JavaScript."

This doesn't make a lot of sense as a claim.

Why? Because underneath all that rust .... is an optimizing compiler, and it
happens the author has decided to stay on the happy path of that. There is
also an unhappy path there. Is that happy path wider? Maybe. It's a
significantly longer and more complex optimization pipeline just to wasm
output, let alone the interpretation of that output. I have doubts it's as
"reliable" as the author claims (among other things, WebAssembly is still an
experimental target for LLVM). Adding the adjective "reliable" repeatedly does
not make it so.

Let's ignore this though, because there are easier claims to pick a bone with.

It also tries to differentiate optimizations between the two in ways that
don't make sense to me: "In some cases, JITs can optimize away such
allocations, but (once again) that depends on unreliable heuristics, and JIT
engines vary in their effectiveness at removing the allocations."

I don't see a guarantee in the rust language spec that these allocations will
be optimized away. Maybe i missed it. Pointers welcome.

Instead, i have watched plenty of patches to LLVM go by to try to improve its
_heuristics_ (oh god, there's that evil word they used above!) for removing
allocations for rust. They are all heuristic-based; they deliberately do not
guarantee attempting to remove every allocation (for a variety of reasons). In
general, it can be proven this is a statically undecidable problem for a
language like rust (and most languages), so i doubt rustc has it down either
(though i'm sure it does a great job in general!)
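
A small illustration of the kind of thing being discussed (a made-up function,
not from the source-map code): nothing in the Rust language definition
promises that the temporary allocation below is optimized out; if it
disappears, that is LLVM's heuristics at work.

    fn sum_plus_len(values: &[u32]) -> u64 {
        // `to_vec` heap-allocates a copy that is only used locally. Whether
        // the optimizer can elide this allocation is not guaranteed by the
        // language; it depends on the heuristic passes described above.
        let copy: Vec<u32> = values.to_vec();
        copy.iter().map(|&v| v as u64).sum::<u64>() + copy.len() as u64
    }

    fn main() {
        println!("{}", sum_plus_len(&[1, 2, 3]));
    }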

The author also writes the following: "WebAssembly is designed to perform well
without relying on heuristic-based optimizations, avoiding the performance
cliffs that come if code doesn’t meet those heuristics. It is expected that
the compiler emitting the WebAssembly (in this case rustc and LLVM) already
has sophisticated optimization infrastructure,"

These two sentences literally do not make sense together. The "sophisticated
optimization infrastructure" is also using heuristics to avoid expensive
compilation times, pretty much all over the place. LLVM included. Even in
basic analysis, where it still depends on quadratic algorithms in basic
things.

If you have a block with 99 stores, and ask LLVM's memory dependence analysis
about the dependency between the first and the last, you will get a real
answer. If you have 100 stores, it will tell you it has no idea.

What happened to reliable?

Why does this matter? For example: Every time rust emits a memcpy (which is
not infrequent), if there are more than 99 instructions in between them in the
same block, it will not eliminate it, even if it could. Whoops. That's a
random example. These things are endless. Because compilers make tradeoffs
(and because LLVM has some infrastructure that badly needs
rewriting/reworking).

These "sophisticated optimization infrastructures" are not different than JITs
in their use of heuristics. They often use the same algorithms. The only
difference is the time budget allocated to them and how expensive the
heuristics let things get.

There may be good reasons to want to write code in rust and good reasons to
believe it will perform better, but they certainly are _not_ the things
mentioned above.

Maybe what the author really wants to say is "we expect the ahead of time
compiler we use is better and more mature than most JITs and can spend more
time optimizing". But they don't.

Maybe it would also surprise the author to learn that there are JITs that beat
the pants off LLVM AOT for dynamic languages like javascript (they just don't
happen to be integrated into web browsers).

But instead, they make ridiculous claims about heuristics and JITs. Pretending
the compiler they use doesn't also depend, all over the place, on heuristics
and other things is just flat out wrong. At least to me (and i don't really
give a crap about what programming language people use), it makes it come off
as rampant fanboism. (Which is sad, because i suspect, had it been written
less so, it might be actually convincing)

~~~
tedmielczarek
> Why? Because underneath all that rust .... is an optimizing compiler, and it
> happens the author has decided to stay on the happy path of that.

There are two big differences here: 1) You're comparing "staying on the happy
path" of a JIT compiler in the JS case, vs. an optimizing compiler in the Rust
case. With the latter you can just compile your code and see what comes out
and it tends to be fairly predictable. With the former, I'm not even sure
there are tools to inspect the generated JIT code, and you're constantly
walking the line of JS engines changing their heuristics and throwing you off
the fast path. This was one of the primary motivations for the
asm.js/WebAssembly work: the ability to get predictable performance.

2) Many of the optimizations mraleph performed were tricks to avoid allocation
(which is normal optimization stuff, but more of a pain in GCed languages). In
JS he winds up having to effectively write C-in-JS which looks pretty hairy.
In Rust controlling allocation is a built-in language feature, so you can
write very idiomatic code without heap allocation.
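
A tiny sketch of what that can look like (an invented example, not code from
the library under discussion): everything below stays on the stack, with no
allocator calls, and it still reads like ordinary Rust.

    // Columns are (generated, original) pairs. A fixed-size array lives on
    // the stack, and iterating over a borrowed slice allocates nothing.
    fn widest_column_gap(columns: &[(u32, u32)]) -> u32 {
        columns
            .iter()
            .map(|&(generated, original)| generated.abs_diff(original))
            .max()
            .unwrap_or(0)
    }

    fn main() {
        let columns = [(10, 2), (4, 3), (9, 9)];
        println!("{}", widest_column_gap(&columns));
    }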

~~~
hobofan
> what comes out and it tends to be fairly predictable

Predictable as long as you stay on the same version of the compiler (yes, I
know that there are crater runs to prevent regressions). Also, how much
can/does the output for different target architectures differ in performance?
Couldn't that be likened to trying to optimize for multiple JS engines?

------
IncRnd
> _This is a factor of 4 improvement!_

This is a common mistake. It should read, "This is a factor of 3 improvement!"

x+x+x+x is an improvement over x of 3x not of 4x. The improvement factor is 3.

~~~
recursive
It's 3x faster _than_, but 4x _as fast as_.

~~~
IncRnd
Yes

