
Weld: Fast parallel code generation for data analytics frameworks - heydenberk
https://weld-project.github.io/
======
xiphias
Wow, it's amazing how easily fusion can be done with just a few linearly
typed primitives.

The cool part is that if you have a for loop of vecbuilders and inside the for
loop you have vecbuilders as well, the code can be fused quite easily. This is
very hard to do with other data structures.
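
To make the nested case concrete, here is a rough sketch in plain Python. It is not Weld's IR or API: the `VecBuilder` class and the `unfused`/`fused` functions are hypothetical stand-ins, only loosely modeled on the write-only builders the paper describes. The point is that when the inner loop's output is only ever consumed by the outer loop, the inner builder can be dropped and its values merged straight into the outer one.

```python
class VecBuilder:
    """Hypothetical write-only builder: merge values in, then read once via result()."""

    def __init__(self):
        self._items = []
        self._finished = False

    def merge(self, value):
        # A builder cannot be written to after its result has been taken.
        if self._finished:
            raise RuntimeError("builder already consumed")
        self._items.append(value)

    def result(self):
        self._finished = True
        return self._items


def unfused(rows):
    """The inner loop materializes a temporary vector on every outer iteration."""
    outer = VecBuilder()
    for row in rows:
        inner = VecBuilder()
        for x in row:
            inner.merge(x * 2)
        for y in inner.result():  # second pass over the temporary
            outer.merge(y + 1)
    return outer.result()


def fused(rows):
    """Same computation, with the inner builder eliminated."""
    outer = VecBuilder()
    for row in rows:
        for x in row:
            outer.merge(x * 2 + 1)
    return outer.result()


if __name__ == "__main__":
    data = [[1, 2], [3]]
    print(unfused(data))
    print(fused(data))
```

Because nothing can read a builder until `result()` is called, the rewrite from `unfused` to `fused` cannot change what any other code observes, which is roughly why this kind of fusion is mechanical here.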

The paper is much better than the site and very easy to understand:
[https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf](https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf)

------
wyldfire
I didn't quite get it from the initial skim of this page.

But here's the key bit:

> [existing solutions slow] ... due to extensive data movement across the
> functions. Weld’s take on solving this problem is to lazily build up a
> computation for the entire workflow, optimizing and evaluating it only when
> a result is needed.
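
As a toy illustration of that idea (not Weld's actual API; the `Lazy` class and its methods are made up for this sketch), each call below only records an operation, and nothing runs until `evaluate()` is called. At that point the whole recorded chain is applied in a single pass over the data, with no intermediate vectors handed from one function to the next.

```python
class Lazy:
    """Hypothetical lazy vector: records element-wise ops instead of running them."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)

    def map(self, fn):
        # Record the op and return a new Lazy; no data is touched yet.
        return Lazy(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return Lazy(self._data, self._ops + (("filter", pred),))

    def evaluate(self):
        # One pass over the input, applying the whole recorded chain per element.
        out = []
        for x in self._data:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    x = fn(x)
                elif not fn(x):  # kind == "filter"
                    keep = False
                    break
            if keep:
                out.append(x)
        return out


if __name__ == "__main__":
    pipeline = (
        Lazy([1, 2, 3, 4])
        .map(lambda x: x * 2)
        .filter(lambda x: x > 4)
        .map(lambda x: x + 1)
    )
    print(pipeline.evaluate())
```

A real system would compile the recorded chain to optimized machine code rather than interpret it per element as this sketch does; the sketch only shows the deferred, whole-workflow shape of the approach.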

The "Background" [1] section of the tutorial has more info too.

I've done lots of optimizations before with CPU/GPU computations and I've
definitely noticed that minimizing trips across the bus and doing computation
during bus activity is critical to saturating the processors/memory. But would
we see the orders-of-magnitude improvements shown on this page for CPU-only
work? I can't quite tell what was measured in the graphs.

[1] [https://github.com/weld-project/weld/blob/master/docs/tutori...](https://github.com/weld-project/weld/blob/master/docs/tutorial.md#weld-background)

