This is quite a hardcore trip down into optimising single CPU instructions. I have to tip my hat to the author, though I'm not sure what kind of motivation and project goals warrant such deep and detailed problem-tackling.
Author here - I had been working on the visualization as a side project for a while, and I really got into the deep end of the optimization work during a week-long vacation.
The thing about optimization work is that it's incredibly addictive once you get into it. Seeing consistent, measurable improvement and watching the frame times shrink is so satisfying, and the process itself is genuinely enjoyable.
I definitely agree that the optimizations here are overkill for the actual application itself. However, it served as a great learning experience for me personally and allowed me to write this blog post as well!
> I definitely agree that the optimizations here are overkill for the actual application itself. However, it served as a great learning experience for me personally and allowed me to write this blog post as well!
While this might be a bit overkill, I wish more developers spent even 10% as much effort as this; the modern world of software often feels bloated. The core engines are getting faster and the CPUs are getting faster, but the web feels slower.
And another important factor is that more and more software runs on battery-powered portable devices in some form. Instructions saved are battery saved = happier users :)
Did you look at any alternative graph layout algorithms? There are non-quadratic ones that are quite good (like Barnes-Hut and its variants). I did a Typescript port of ForceAtlas2 a couple years ago that could run on a Web Worker (so as to not block the interactivity of the graph) but never got around to open sourcing it.
I'd be amazed to see what this level of profiling skill could do on a fancier algorithm.
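For readers who haven't seen it: Barnes-Hut puts the nodes in a quadtree and approximates each far-away cell by its centre of mass, only opening a cell when it's close relative to its size, which cuts the O(n^2) pairwise work to O(n log n). A minimal, hypothetical sketch of the idea (names and shapes are made up, not taken from ForceAtlas2 or webcola):

```typescript
type Pt = { x: number; y: number };

class QuadNode {
  mass = 0;
  comX = 0; comY = 0;                          // running centre of mass
  point: Pt | null = null;                     // set only on leaves
  children: (QuadNode | null)[] | null = null; // set only on internal nodes
  constructor(readonly x: number, readonly y: number, readonly size: number) {}

  insert(p: Pt): void {
    // fold p into this cell's centre of mass
    this.comX = (this.comX * this.mass + p.x) / (this.mass + 1);
    this.comY = (this.comY * this.mass + p.y) / (this.mass + 1);
    this.mass += 1;
    if (this.mass === 1) {
      this.point = p; // empty leaf: just store the point
      return;
    }
    if (this.children === null) {
      // occupied leaf: split and push the old point down
      this.children = [null, null, null, null];
      const old = this.point!;
      this.point = null;
      this.childFor(old).insert(old);
    }
    this.childFor(p).insert(p);
  }

  private childFor(p: Pt): QuadNode {
    const half = this.size / 2;
    const ix = p.x >= this.x + half ? 1 : 0;
    const iy = p.y >= this.y + half ? 1 : 0;
    const idx = iy * 2 + ix;
    if (this.children![idx] === null) {
      this.children![idx] = new QuadNode(this.x + ix * half, this.y + iy * half, half);
    }
    return this.children![idx]!;
  }

  // accumulate an inverse-square repulsion acting on p into `out`
  force(p: Pt, theta: number, out: Pt): void {
    if (this.mass === 0 || this.point === p) return;
    const dx = this.comX - p.x;
    const dy = this.comY - p.y;
    const dist = Math.sqrt(dx * dx + dy * dy) || 1e-9;
    if (this.children === null || this.size / dist < theta) {
      // cell is far enough away: treat its whole mass as one body
      const f = this.mass / (dist * dist * dist);
      out.x -= f * dx; // repulsive: push p away from the cell
      out.y -= f * dy;
      return;
    }
    for (const c of this.children) if (c) c.force(p, theta, out);
  }
}
```

With theta = 0 a cell is never approximated, so the traversal degenerates to the exact pairwise sum, which is handy for sanity-checking; typical production values are somewhere around 0.5-1.2, trading accuracy for speed.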
Thank you. I believe VM developers will learn tons from this post. It also debunks the myth that "there's no point in optimising, JavaScript is fast."
I've been delving deep into SIMD for wasm and have had a few findings:
Clang/LLVM doesn't use the wat format. Its assembly syntax is similar to the GAS preprocessor and supports relocation, so you can write your wasm assembly much as you would for ARM and integrate it into your makefile/C code as normal. This also makes it easier to port SIMD code from other platforms to wasm SIMD.
The SIMD spec is quite lacking. Many algorithms take advantage of the chip's ability to do multiple things at once in one instruction. Rounding, saturation, and truncation are the main ones I see that wasm does not support out of the box, so you have to emulate them - except add_sat, which does saturate.
So far aarch64 is lovely to port to, but x86 is terrible. As noted in OP's blog here, one single f64.max instruction ends up being 5-6 instructions on x86, while aarch64 and armhf package it in a single instruction.
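For the curious, the expansion comes from semantics rather than codegen quality: wasm's f64.max (and f64x2.max per lane) must propagate NaNs and treat -0 as less than +0, while x86's maxpd simply returns its second operand on NaN or ties. A scalar TypeScript sketch of the mismatch, purely illustrative:

```typescript
// x86 maxpd semantics: dst = (a > b) ? a : b -- on NaN or on ties
// (including +-0) it just returns the second operand.
function x86MaxPd(a: number, b: number): number {
  return a > b ? a : b;
}

// wasm f64.max semantics: NaN-propagating, and max(-0, +0) must be
// +0 -- emulating these rules is what costs the extra x86 instructions.
function wasmF64Max(a: number, b: number): number {
  if (Number.isNaN(a) || Number.isNaN(b)) return NaN;
  if (a === 0 && b === 0) return Object.is(a, -0) ? b : a;
  return a > b ? a : b;
}
```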
I hope there is more work being done to remedy this so we can bring more algorithms to the browser at great performance. Compiler work is above my abilities currently.
If you are interested in bringing in more SIMD instructions, you can participate in the SIMD subgroup. What's needed is interest in suggesting new instructions, working on a proposal (following the process), and pushing it through. Compiler expertise not required :)
I just spotted something about the arrays and wonder how much of a difference it would make in practice - is this why the TypedArrays don't get the expected speedup? The loops in computeDerivatives are mostly of this form:
for (i=0; i<...; i++) {
this.g[i][u]...
H[i][u][v]...
I expect JS can't realistically invert the access order when JIT-ing, and this access pattern completely destroys the cache.
If this can be inverted to:
this.g[u][i]...
H[u][v][i]...
over a TypedArray, that would scan a contiguous array instead. (Maybe even extract the common `this.g[u]` and `H[u][v]` out of the loop explicitly.)
The rust code may have the same potential since it's iterating `array[i*n+...]` and `array[i*n*n...]` in a few places.
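A toy sketch of the inversion being suggested, with made-up shapes (n points, a handful of dimensions), flattening the nested arrays into one TypedArray so the inner loop over i is a contiguous scan:

```typescript
// Hypothetical flattened layout: g[u][i] becomes a single
// Float64Array indexed as g[u * n + i]. The inner loop then walks
// adjacent memory instead of chasing nested array pointers.
const n = 4;
const dims = 2;
const g = new Float64Array(dims * n);

function accumulate(contrib: (i: number, u: number) => number): void {
  for (let u = 0; u < dims; u++) {
    const base = u * n; // hoist the `this.g[u]` lookup out of the inner loop
    for (let i = 0; i < n; i++) {
      g[base + i] += contrib(i, u); // contiguous, cache-friendly writes
    }
  }
}

accumulate((i, u) => i + u);
```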
"This whole experience served to reinforce my confidence in the power of the modern web as a flexible and fully-featured application platform." Yes! Congrats to the author for putting in the considerable time and dedication necessary for the exercise at hand.
Ooh, very nice!! I'm a heavy user of charting libraries; I've used highcharts in the past and ECharts more recently (including on the Spotifytrack site where this graph visualization lives).
I'll definitely be trying out uPlot for my next project - those performance metrics are stunning!
Fascinating read! Well written and super interesting! One additional performance boost, now that it's using <canvas> for rendering, is OffscreenCanvas[1], although it's not yet well supported in all browsers. You can use feature detection to fall back to a main-thread canvas renderer.
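A hedged sketch of that feature detection (not the site's actual code; the API names are the standard ones, the wiring is hypothetical):

```typescript
// Detect OffscreenCanvas support via globalThis rather than bare DOM
// identifiers, so the check also parses and runs outside a browser,
// where it simply returns false.
function supportsOffscreenCanvas(): boolean {
  const g = globalThis as any;
  return (
    typeof g.OffscreenCanvas === "function" &&
    typeof g.HTMLCanvasElement === "function" &&
    "transferControlToOffscreen" in g.HTMLCanvasElement.prototype
  );
}
```

When it returns true you can call `canvas.transferControlToOffscreen()` and post the resulting OffscreenCanvas to a Web Worker (listing it in the transfer list); otherwise keep rendering on the main thread as before.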
When clicking onto another data screen on spotifytrack.net, and then returning to the graph, the labels are messed up. They render as plain black text (on a black background). Maybe that Sprite caching trick has some failure modes?
The sprites also look janky when zooming in, because they remain low res.
In physics simulations when we're dealing with the distance formula we often just skip the square root part. This works if you're dealing in relative distances. I bet this algorithm could skip the sqrt altogether.
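The classic form of the trick, for reference (illustrative only):

```typescript
// Ordering comparisons on distances survive squaring, because sqrt is
// monotonic -- so radius checks and nearest-neighbour picks never need
// Math.sqrt at all.
function withinRadius(dx: number, dy: number, r: number): boolean {
  return dx * dx + dy * dy <= r * r; // compare squared forms, no sqrt
}
```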
That is a good point. The distance computation and displacement checking indeed could be changed to use the squared forms. However, I was unable to remove the square rooting because of their use in the core math.[1]
I don't think there's a way to change this math to work without taking the square root to get `distance`. It's possible it could be reworked to use the squared forms, but I don't grasp the actual math going on well enough to determine that, unfortunately.
Looking at the perf metrics for the full function's disassembly[2], square-root-related instructions only account for a tiny fraction of the program's runtime anyway.
That all being said, thank you so much for taking the time to read the post closely enough to come up with this idea!
I'm not entirely clear on what's going on with gs and hs, but gs at least seems to only be used with an additional multiply by distance. I don't think that immediately lets you simplify away the sqrt by squaring both sides of the divide in its definition, though:
let gs = 2. * weight * (distance - ideal_distance)
    / (ideal_distance_squared * distance);
let distance_cubed = distance_squared * distance;
unsafe {
    *self.g.get_unchecked_mut(i * n + u) += distance * gs;
}
In the same spirit as the above comment: the webcola core seems to be doing O(n^2) comparisons. Probably somewhere in the top 10 of cool foundational scientific computing algorithms that people here would like are Barnes-Hut, which cuts the comparisons to O(n log n), and, fancier, fast multipole. Basically, there's no need to precisely compare far-apart things, which are most of them. By the time you get to something like interactive UIs over 100K+ nodes, the asymptotic costs start dominating everything, and 2-8X C/SIMD speedups get overwhelmed if you don't scale the core algorithm. This is at the heart of many modern ML algorithms as well: many need to iteratively compare vectors, such as knn and umap.
Pretty neat as well is that Barnes-Hut and fast multipole are also SIMD friendly and, even better, GPU friendly. It's a bit trickier because you're now dealing with ideas like SIMD/GPU tree construction, but it still works! For Graphistry, our engine prototype had a multicore wasm frontend impl, a GPU WebCL frontend impl, and a GPU OpenCL backend impl, and just by adding that, got some pretty cool numbers. Nowadays we go a bit wilder and focus more on other layers like interactive analytics, bigger-than-memory, connectors/collaboration/embedding/etc, but these are all logical steps building on what is being described here. Bravo!
edit: it's good the blogpost included rendering. rendering gets pretty fun with graphs b/c while points are ~easy, edges get at (a) dynamic geometry (b) that can span the screen and (c) blow up in count pretty quick. ouch! the pathfinder project for vector text rendering is a fun related effort for what it takes to speed that up with simd/gpu. if someone is interested in a webgl contract around helping our users explore bigger interactive graphs (webgl/react client <-apache arrow-> python/CUDA server), ping build@graphistry.com !
What's the fastest force-directed graph library for the web right now? I am using Vis Network for an app of mine and although it is smooth it struggles to start up with ~300 nodes