Co-Dfns v5.7.0 (github.com/co-dfns)
39 points by Tomte 26 days ago | 7 comments



Huh. This is the commit that introduces the register model: https://github.com/Co-dfns/Co-dfns/commit/f89919144f22441f21...

In the compiler, it's working with dense adjacency matrix representations, so this will be at best O(n^2) in whatever the relevant sort of node is (expressions? functions?). "at present is not optimized" seems a bit of an understatement here: I've never heard anything about these sorts of graph algorithms being possible with good asymptotics in an array style. In practice, I think any value of Co-dfns would be more in the fact that it emits GPU code than that it does it quickly, but this calls into question the value of writing it in APL, I would say (for what it's worth, I believe Co-dfns is still not able to compile itself).
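For concreteness, here's roughly what a pass over a dense adjacency matrix looks like (my own sketch in plain Python, not the compiler's actual code): computing reachability this way touches all n×n entries on every sweep, so it's at least O(n^2) in the number of nodes no matter how sparse the graph is.

```python
# Transitive closure over a dense 0/1 adjacency matrix, the
# array-friendly formulation. Every sweep scans all n*n entries,
# so the cost is at least O(n^2) regardless of edge count.
def reachable(adj):
    """adj: n x n list of 0/1 rows. Returns the transitive closure."""
    n = len(adj)
    reach = [row[:] for row in adj]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                if not reach[i][j] and any(
                    reach[i][k] and reach[k][j] for k in range(n)
                ):
                    reach[i][j] = 1
                    changed = True
    return reach

edges = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]        # 0 -> 1 -> 2
closure = reachable(edges)  # closure[0][2] becomes 1
```

A sparse or worklist-based formulation would scale with the edge count instead, but that's exactly the kind of irregular, pointer-chasing code that doesn't map cleanly onto the array style.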

The phrase "allocate all intermediate arrays" seems fairly confusing: what's actually being allocated must be space for the pointers to these arrays and other metadata, not the data of the array itself. As the data is variable-size, it can't be fully planned out at compile time, and I'm fairly sure the register allocation is done when there's not enough shape or type information to even make an attempt at data allocation. This change can only improve constant overhead for primitive calls, and won't be relevant to computations where most of the work is on large arrays. I think it's helpful for Co-dfns in a practical sense, but not too exciting as it only helps with bad cases: if this is important for a user then they'd probably end up with better performance by switching to a statically-typed language when it comes time to optimize.

It does mention that "For most small array values, we will use static allocation types, which prevents the need for allocating additional space for an array above and beyond its header.", so small arrays are taken care of by register allocation. There's a tradeoff there between being able to fit larger arrays in this space and wasting space for values that don't fill the statically allocated slot.
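My reading of that tradeoff, as a sketch (the names and the cutoff are mine, not Co-dfns internals): a fixed-size register slot whose payload either holds small array data inline or stands in for a pointer to separately allocated storage.

```python
# Sketch of the static-allocation idea: small arrays live inside the
# fixed-size slot itself; large ones spill to separate storage.
# INLINE_CAPACITY is a hypothetical cutoff, not Co-dfns's actual value.
INLINE_CAPACITY = 4  # elements

class Slot:
    def __init__(self, values):
        values = list(values)
        self.shape = (len(values),)
        if len(values) <= INLINE_CAPACITY:
            self.inline = True
            self.payload = tuple(values)   # stored inside the slot
        else:
            self.inline = False
            self.payload = values          # stands in for a heap pointer

    def data(self):
        return list(self.payload)

small = Slot([1, 2, 3])     # fits inline: no extra allocation needed
big = Slot(range(100))      # would require a separate allocation
```

The waste the comment mentions is visible here: a one-element array still occupies an INLINE_CAPACITY-sized slot.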

Constant lifting within the compiler is pretty cool, I'll have to look into that.


> I've never heard anything about these sorts of graph algorithms being possible with good asymptotics in an array style

yeah, idk graph algos really, but i have heard that parallelising combinatorial search in general (eg sat) is hard because forks and joins happen heterogeneously and erratically. this 2001-vintage paper has a bunch of completely sequential graph colouring algorithms in j https://dl.acm.org/doi/pdf/10.1145/570406.570416 (and j at least has sparse matrices!)

> Constant lifting within the compiler is pretty cool, I'll have to look into that.

hrm, it seems to refer to 2 things: (1) constants allocated in a special space; (2) interning. (2) is obviously worthwhile; (1) i would guess is related to memory being weird on the gpu?


"High-performance, Reliable, and Parallel APL" -github

because i had no clue what co-dfns is


Still not further enlightened.


It's a compiler for the APL language that generates GPU code. What's interesting about it is that it's also hosted on the GPU, and it has a very original architecture that represents trees as flat arrays of indices rather than pointer structures.


This is a very valuable contribution to computer science that has been overlooked because of how obscure the implementation language is.

The idea of flattening a tree into an array, with relative indexing to represent the structure, opens up a whole bag of performance tools.

Parallel processing is just one (major!) aspect. It also enables GPU processing and unlocks SIMD instructions. Not to mention that instead of millions of tiny inefficient mallocs, this can use large contiguous arrays.
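To make the flattening concrete, here is one common encoding in that spirit (a parent-index vector, as in Aaron Hsu's array-based ASTs, though the real details differ): every node stores the index of its parent, and whole-tree passes become elementwise array operations. Computing all depths via pointer doubling, for example:

```python
# A tree flattened into a parent-index vector: parent[i] is the index
# of node i's parent, and the root points at itself.
#
#        0
#       / \
#      1   2
#         / \
#        3   4
parent = [0, 0, 0, 2, 2]

def depths(parent):
    """Depth of every node at once, via pointer doubling: each step
    doubles how far up the tree every node 'sees', so it converges
    in O(log depth) whole-array passes."""
    n = len(parent)
    d = [0 if parent[i] == i else 1 for i in range(n)]
    p = list(parent)
    while True:
        new_d = [d[i] + d[p[i]] for i in range(n)]   # add ancestor's count
        new_p = [p[p[i]] for i in range(n)]          # jump two links at once
        if new_p == p:                               # everyone reached root
            return d
        d, p = new_d, new_p
```

Each pass is a uniform map over the whole array, which is exactly the shape of work that SIMD units and GPUs are good at, and the entire tree lives in one contiguous allocation.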

An itch I’ve been meaning to scratch is implementing something like Equality Saturation on top of this data structure in the style of EGG (e-graphs good).

Combine that with the Category Theory work done recently in the Haskell world for arbitrary structure-to-structure functional maps, and you could make a compiler that takes a bunch of term rewrite rules and spits out parallel GPU code that can apply optimisation rewrites at enormous speed.
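The core of an e-graph, for anyone unfamiliar, is just a union-find over term ids: applying a rewrite means merging two equivalence classes rather than destructively replacing a term. A toy version (a real equality-saturation engine like egg also needs hash-consing, congruence closure, and an extraction pass, all of which this skips):

```python
# Toy e-graph core: a union-find over term ids with path halving.
# Rewrites merge classes instead of rewriting terms in place.
class UnionFind:
    def __init__(self):
        self.parent = []

    def make(self):
        """Allocate a fresh singleton class and return its id."""
        self.parent.append(len(self.parent))
        return len(self.parent) - 1

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
        return ra

uf = UnionFind()
mul2 = uf.make()        # id standing for the term  x * 2
add = uf.make()         # id standing for the term  x + x
uf.union(mul2, add)     # apply the rewrite x*2 -> x+x by merging classes
```

Both forms now coexist in one class, so a later extraction pass could pick whichever is cheaper; the open question is how well the merge-heavy workload maps onto a flat-array representation.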

“The best thing about the computer revolution is that it hasn’t happened yet.”


Cool - I was just installing the drivers and ArrayFire needed to run Co-Dfns yesterday. When I get back to it later I'll pull this new version. Thanks for posting!



