Speeding Up the Webcola Graph Viz Library with Rust and WebAssembly

miohtama · on July 4, 2021

This is quite a hardcore trip down to optimising single CPU instructions. I need to tip off my hat to the author. Though not sure what kind of motivation and project goals warrant this deep and detailed problem tackling.

Ameo · on July 4, 2021

Author here - I was on a week-long vacation at the time and had been working on the visualization as a side project before that. I started really getting into the deep end of the optimization work while on that week-long break.

The thing about optimization work is that it's incredibly addicting once you get into it. Seeing consistent, measurable improvement and watching the frame times shrink is so satisfying, plus the process itself is really enjoyable in and of itself.

I definitely agree that the optimizations here are overkill for the actual application itself. However, it served as a great learning experience for me personally and allowed me to write this blog post as well!

tyfon · on July 4, 2021

> I definitely agree that the optimizations here are overkill for the actual application itself. However, it served as a great learning experience for me personally and allowed me to write this blog post as well!

While this might be a bit overkill, I wish more developers spent at least 10% as much effort as here, the modern world of software really feels bloated often. The core engines are getting faster and the cpus are getting faster but the web feels slower.

And another important factor is that more and more software runs on battery powered portable devices in some form. Instructions saved is battery saved = happier users :)

rmnoon · on July 5, 2021

Did you look at any alternative graph layout algorithms? There are non-quadratic ones that are quite good (like Barnes-Hut and its variants). I did a Typescript port of ForceAtlas2 a couple years ago that could run on a Web Worker (so as to not block the interactivity of the graph) but never got around to open sourcing it.

I'd be amazed to see what this level of profiling skill could do on a fancier algorithm.

miohtama · on July 5, 2021

Thank you. I believe VM developers will learn tons of stuff from this post. Also it debunks the myth "there is no point to optimise, JavaScript is fast."

jazzyjackson · on July 5, 2021

Being able to write a blog about it is a great reason to try something, cheers

devwastaken · on July 5, 2021

I've been delving deep into SIMD for wasm and have had a few findings: Clang/llvm doesn't use the wat format. Its similar to gas preprocessor, and supports relocation, so you can write your wasm assembly similar to how you would for arm and integrates in your makefile/C code as normal. This also makes it easier to port SIMD from other platforms to wasm simd.

The SIMD spec is quite lacking. Many algorithms take advantage of the chips ability to do multiple things at once in one instructions. Rounding, saturation, and truncation are the main ones I see that wasm does not support out of box, and you have to emulate them. Minus add_sat, that one does saturation.

So far aarch64 is lovely to port, but x86 is terrible. As noted in OP's blog here one single f64.max instruction ends up being 5-6 instructions on x86. While aarch64 and armhf just package it in a singular instruction.

I hope there is more work being done to remedy this so we can bring more algorithms at great performance in the browser. compiler work is above my abilities currently.

ngzhian · on July 10, 2021

If you are using Emscripten, porting x86 should be easy (see https://emscripten.org/docs/porting/simd.html#compiling-simd...), but you will take some performance hits (like float min/max) and these are all documented in the links above.

If you are interested to bring more SIMD instructions, you can participate in the SIMD subgroup. What's needed is interest in suggesting new instructions, working on a proposal (following the process) and pushing it through. Compiler expertise not required :)

viraptor · on July 5, 2021

I just spotted something about the arrays and wonder how much of a difference it would make in practice / is that where the TypedArrays don't get the expected speedup. The loops in computeDerivatives are mostly in this form:

    for (i=0; i<...; i++) {
        this.g[i][u]...
        H[i][u][v]...

I expect JS can't realistically invert the access order when JIT-ing and this completely destroys the cache.

If this can be inverted to:

        this.g[u][i]...
        H[u][v][i]...

over a TypedArray, that would scan a continuous array instead. (maybe even extract the common `this.g[u]` and `H[u][v]` out of the loop explicitly)

The rust code may have the same potential since it's iterating `array[i*n+...]` and `array[i*n*n...]` in a few places.

jarym · on July 4, 2021

"This whole experience served to reinforce my confidence in the power of the modern web as a flexible and fully-featured application platform." !! Yes, congrats to author for taking the considerable time and dedication necessary for the exercise at hand.

leeoniya · on July 5, 2021

super cool!

i've done my share of optimizations and benchmarking of various charting libs [1] but this goes deeper still.

SVG is not a good choice for large raw datasets when you need performance.

[1] https://github.com/leeoniya/uPlot#performance

Ameo · on July 5, 2021

Ooh, very nice!! I'm a heavy user of charting libraries; I've used highcharts in the past and ECharts more recently (including on the Spotifytrack site where this graph visualization lives).

I'll definitely be trying out uPlot for my next project - those performance metrics are stunning!

msoad · on July 5, 2021

Fascinating read! Well written and super interesting! One additional performance boost to this, now that it is using <canvas> for rendering is OffscreenCanvas[1] although it is not well supported in all browsers yet. It can use feature detection to fallback to main thread canvas renderer.

[1] https://developer.mozilla.org/en-US/docs/Web/API/OffscreenCa...

etimberg · on July 4, 2021

Great writeup! I love the level of detail and that the author touched on strategies they tried that didn't end up working out.

underwater · on July 4, 2021

When clicking onto another data screen on spotifytrack.net, and then returning to the graph, the labels are messed up. They render as plain black text (on a black background). Maybe that Sprite caching trick has some failure modes?

The sprites also look janky when zooming in, because they remain low res.

Ameo · on July 5, 2021

You're right - that is something I'll have to fix! Thanks for pointing that out

acidburnNSA · on July 4, 2021

Great post.

In physics simulations when we're dealing with the distance formula we often just skip the square root part. This works if you're dealing in relative distances. I bet this algorithm could skip the sqrt altogether.

Ameo · on July 4, 2021

That is a good point. The distance computation and displacement checking indeed could be changed to use the squared forms. However, I was unable to remove the square rooting because of their use in the core math.[1]

There are components of it like this:

   2. * weight * (distance - ideal_distance) / (ideal_distance_squared * distance)

I don't think there's a way to change this math to work without taking the square root to get `distance`. It's possible that there is a way to change that to work with the squared forms, but I don't grasp the actual math going on well enough to determine that unfortunately.

Looking at the perf metrics for the full function's disassembly[2], square root-related instructions only account for a tiny amount of the program's runtime anyway.

That all being said, thank you so much for taking the time to read the post closely enough to come up with this idea!

[1] https://github.com/Ameobea/webcola-wasm/blob/master/src/wasm... [2] https://ameo.link/u/91u.html

e12e · on July 4, 2021

I'm not entirely clear on what's going on with gs and hs - but at least gs seem to only be used with an additional multiply with distance - but I don't think that immediately allows to simplify away the sqrt by squaring both sides of the divide in the definition?

    let gs =
      2. * weight * (distance - ideal_distance) /
      (ideal_distance_squared * distance);

    let distance_cubed = 
      distance_squared * distance;


    unsafe {
      *self.g.get_unchecked_mut(
        i * n + u) +=  distance * gs };
    }

But hs seems to simplify to 0 in how it's used?:

    let hs =
      2. * -weight /
      (ideal_distance_squared * distance_cubed);

  let idk = hs
    * (2. * distance_cubed
       + ideal_distance * (distance * distance - distance_squared));

Unless the intent is some kind of float precision error handling?

    (distance * distance - distance_squared) =
   (distance_squared - distance_squared) = 0

Or is hs and gs used for something I'm missing?(could be, I'm on mobile)

lmeyerov · on July 5, 2021

great writeup!

in the same spirit of the above comment, the webcola core seems to be doing O(n^2) comparisons. probably somewhere in the top 10 for cool foundational scientific computing algorithms that people like here are barnes hut, which cuts the comparisons to O(n log n), and fancier, fast multipole. basically, no need to precisely compare far apart things, which are most of them. by the time of something like interactive UIs over 100K+ nodes, the asymptotics costs start dominating everything, and 2-8X C/SIMD speedups get overwhelmed if you don't scale the core alg. this is at the heart of many modern ML algorithms as well: many need to iteratively compare vectors, such as knn and umap.

pretty neat as well is that barnes hut + multipole are also SIMD friendly, and even better, GPU friendly. it is a bit trickier b/c now dealing with ideas like SIMD/GPU tree construction, but still works! for graphistry, our engine prototype had a multicore wasm frontend impl, gpu webcl frontend impl, and gpu opencl backend impl, and just by adding that, got some pretty cool #'s. nowdays we do a bit wilder and focus more on other layers like interactive analytics, bigger-than-memory, connectors/collaboration/embedding/etc, but all are logical steps building on what is being described here. bravo!

edit: it's good the blogpost included rendering. rendering gets pretty fun with graphs b/c while points are ~easy, edges get at (a) dynamic geometry (b) that can span the screen and (c) blow up in count pretty quick. ouch! the pathfinder project for vector text rendering is a fun related effort for what it takes to speed that up with simd/gpu. if someone is interested in a webgl contract around helping our users explore bigger interactive graphs (webgl/react client <-apache arrow-> python/CUDA server), ping build@graphistry.com !

udbhavs · on July 5, 2021

What's the fastest force-directed graph library for the web right now? I am using Vis Network for an app of mine and although it is smooth it struggles to start up with ~300 nodes

leeoniya · on July 5, 2021

probably this one: https://nblintao.github.io/ParaGraphL/

asimjalis · on July 4, 2021

Is there a command line version of WebCola or an online demo version?

tpetry · on July 5, 2021

Are you going to open-source your optimized version?

Ameo · on July 5, 2021

It's all here! https://github.com/Ameobea/webcola-wasm/blob/master/src/wasm...