Hacker News
Efficient parallelization of an ubiquitous sequential computation (arxiv.org)
89 points by cs702 on Dec 7, 2023 | 33 comments



It's worth noting that you can solve these linear recurrences, `x(t) = a(t)x(t-1) + b(t)`, using a single parallel prefix sum where the elements are the input tuples `(a(t), b(t))`.

The following operator called `combine` works. It's associative over these tuples; the final result will be the final value of the second component. Altogether, this gives you O(n) work and O(log n) span, but using just a single parallel prefix kernel, which may be more efficient in practice.

    combine ((a1, b1), (a2, b2))  =  (a1 * a2, b1 * a2 + b2)
For example, here's a C++ implementation: https://github.com/MPLLang/parallel-ml-bench/blob/main/cpp/l... And, here's an implementation in a functional language: https://github.com/MPLLang/parallel-ml-bench/blob/main/mpl/b...
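For concreteness, here's a minimal sequential sketch of that single-scan approach (plain Python of my own; a real implementation would replace `scan` with a parallel prefix kernel like the ones linked above):

```python
# Solving x(t) = a(t)*x(t-1) + b(t) with one prefix scan.
# Each element is a tuple (a, b) representing the affine map x -> a*x + b.

def combine(p, q):
    """Associative composition of two affine maps (apply p first, then q)."""
    (a1, b1), (a2, b2) = p, q
    return (a1 * a2, b1 * a2 + b2)

def scan(xs, op):
    """Sequential inclusive prefix scan (stand-in for a parallel scan kernel)."""
    out = [xs[0]]
    for x in xs[1:]:
        out.append(op(out[-1], x))
    return out

a = [0.5, 2.0, 0.25]
b = [1.0, 3.0, 2.0]
prefix = scan(list(zip(a, b)), combine)
# The second component of each tuple is x(t), here taking x(-1) = 0
# so the initial value is carried by b(0).
xs = [bb for _, bb in prefix]
```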

I'm pretty sure this generalizes, too, to abstract multiplications and additions in any field (at first glance, seems like it should but I haven't done the formal proof yet).

Anyway, it would be interesting to compare this against the solution in the arxiv paper.

----

EDIT: ah, good, this is already being discussed here: https://github.com/glassroom/heinsen_sequence/issues/1

And, for reference, I learned this algorithm from Guy Blelloch; see Sec 1.4.1 of https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf


Yes! That's pretty much how I would have thought about this with my puny little brain.

But... I'm not sure 1 parallel scan with 2 floating-point mults and 1 sum per step is faster than 2 parallel scans with 1 sum per step. I don't know which is better in theory or in practice. And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?

Or am I missing something?


On modern multicore hardware this will be memory-bound; the amount of computation per byte is pretty small (just a few arithmetic instructions on average). My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

And yes, definitely, the numerical accuracy thing could be a problem. I suspect it wouldn't be too difficult to work around, but I can't say for sure off the top of my head.
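One common workaround (just a sketch of the general idea, not necessarily what the paper does) is to carry the cumulative product in log-space, which assumes the coefficients are positive; signs would need separate tracking:

```python
import math

def cumprod_logspace(a):
    """Cumulative products via sums of logs, avoiding over/underflow
    in the intermediate products. Assumes every element of a is positive."""
    out, s = [], 0.0
    for x in a:
        s += math.log(x)
        out.append(math.exp(s))
    return out
```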


Memory pressure is even worse on GPUs. I did some work to generalize Blelloch to 2D parallel prefix sums for integral image computation back in 2008 [1], and the number of memory accesses really dominates. On a GPU, for sufficiently small problems the number of passes matters more, and it is worth using a simpler, non-work-efficient algorithm to reduce setup overheads.

[1] https://people.xiph.org/~tterribe/pubs/gpusurf.pdf Section III.A


Thank you, this is helpful... although my initial thought was this may be useful for a rather different type of application: These new AI models, "linear RNNs," that have many layers, each layer processing large batches of sequences in parallel, each sequence with potentially up to millions of tokens, each token with thousands of features. Definitely not small-scale. Hard to reason about, at least for me.


> My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

Thank you, that's helpful. Your intuition may be right, but I'm not sure either. Too hard to reason about, at least for me. Maybe the thing to do is test both... unfortunately that would involve the hassle of writing code for executing the single scan efficiently on GPUs.


> And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?

I think so. And also potential issues with all the sums. :) (see: "compensated summation") god I love this shit. :)


This is very very well known. Cf https://en.wikipedia.org/wiki/Affine_group

I don't see how people should glorify this with the word "algorithm". It is a trivial undergrad homework exercise, once you give the hint "use parallel reduce / fold / prefixsum".

This may involve more interesting tradeoffs if you deal with large or sparse matrices or matrix-free operators.


If you know more than others do, that's great, but instead of posting putdowns, please share some of what you know so the rest of us can learn.

The trouble with comments like this is that they degrade the discussion and put others down without really teaching us anything.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...

https://news.ycombinator.com/newsguidelines.html


Is this equivalent to treating the linear recurrence as a convolution and computing it with a Fourier transform? See https://www.youtube.com/watch?v=OpJMn8T7Z34&t=840s


I had the same thought. If a were constant I believe the solution would just be a convolution of b_t with an exponential filter. I think this is different because a_t depends on time, so it's equivalent to convolving b_t with a filter whose timescale changes at every timestep, which doesn't seem like something that's obviously parallelizable with FT.
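A quick sanity check of the constant-`a` case (my own sketch): the recurrence unrolls to x(t) = a^t x(0) + sum_k a^(t-k) b(k), i.e. b convolved with an exponential kernel, as described above:

```python
# Unrolling x(t) = a*x(t-1) + b(t) with constant a gives
#   x(t) = a^t * x(0) + sum_{k=1..t} a^(t-k) * b(k),
# i.e. b convolved with the exponential kernel a^j, plus the decayed initial state.
a, x0 = 0.5, 0.25
b = [1.0, 3.0, 2.0]

# Direct recurrence.
direct, x = [], x0
for bt in b:
    x = a * x + bt
    direct.append(x)

# Convolution form.
conv = [a ** (t + 1) * x0 + sum(a ** (t - k) * b[k] for k in range(t + 1))
        for t in range(len(b))]
```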


As a super minor grammar point for the author, "ubiquitous" is a rare example of a word that starts with a vowel but should be preceded by "a" instead of "an".


I think that's got to be a great test for foreign spies!

"ubiquitous" starts with the sound of "you-biquitous" and so the suffix -n is a duplicated non-vowel. ("y" is probably a vocalic glide, but still not in {a,e,i,o,u}.)

I bet the real rule is some reality about feasibility of pronunciation, even though native english speakers see the rules explained in terms of spellings.


> I bet the real rule is some reality about feasibility of pronunciation, even though native english speakers see the rules explained in terms of spellings.

The rule as often given in English classes is to use "an" if a word starts with a "vowel sound", rather than starting with a vowel. So, it's "an herb" (since the 'h' is silent in (American) English), but "a ubiquitous".

Relatedly, you can infer whether someone likely pronounces "URL" by spelling it out (like "you are ell") or as a word (like "earl") based on whether they write "a URL" or "an URL".


As someone who uses a dialect/accent of English that means I often pronounce these "silent" letters (e.g. I would pronounce the "h" in herb), I've often wondered if selection of "a" vs "an" is supposed to be accent specific.


I think so, yes; if you say "herb" with a non-silent 'h', you'd write "a herb".

By way of example, the name Herb is commonly pronounced with a non-silent H, so even in a dialect where "herb" is pronounced "erb", you'd write "A Herb picked up an herb.".


> I bet the real rule is some reality about feasibility of pronunciation

YUP! Sorry I don't have a good citation handy, but lots of English grammar happens as a result of misaligned word boundaries. "napron" (from the French naperon) became "an apron". Orange (from the Arabic naranj)... "an ewt" became "a newt". etc.


It's regional depending on silent or transmuted consonants in the local vernacular. An 'ospital, an 'orse, a nuncle, it's a yewbiquitous phenomenon.


Oy, keep yer spillage off my napron!


A union

A urinal

...


Too early to tell, but this looks like it could both simplify implementation and speed up execution of linear RNNs like RWKV, Mamba, retnet, etc.


They already use closed form summations. That's their whole point


Yes, but this uses only two plain-vanilla parallel scans, shown as a* and b* in the pdf.

In the sample code, that's two calls to PyTorch's built-in API.

That looks very efficient to me, almost like it shouldn't be possible. But I tested the sample code[a] and it works.

Not an expert on this, but I don't think I've ever seen that before. Have you?

[a] https://github.com/glassroom/heinsen_sequence


There's a GitHub issue on the repo that explores this!


Thank you. It seems only testing will answer the question.


It seems like they must require (or maybe I missed something) that none of their a's are zero (I guess this would be easy to check and avoid). It seems like it would be harder to know beforehand that none of the b*'s are equal to -x_0 and none of their b's are equal to the corresponding a*. But maybe this is just unlikely enough not to care about.


I wish there were more explanation for why this works. There is a proof, but a more opinionated description would also help. My summary of it: parallel prefix sum is work-efficient for computing general recurrences of the form x_i = x_{i-1} + y_i. Furthermore, a product of elements can be expressed as a sum by taking the sum of the logs of the elements and exponentiating the result. These are two super useful properties, but how merging the two operations yields this general formulation is not clear. Is it simply: sum the logs, sum the addends via prefix sums, and then combine those two results? That seems pretty straightforward, so I suspect I am missing something that makes the solution less obvious and more complex. Is there some fancy fixed-point combinator, or a solution to a differential equation of the linear system for the recurrence?
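My best guess at what the two scans compute (a sequential sketch, under the simplifying assumption that all a's are positive; each loop would map to one parallel prefix scan): let A_t = sum_{k<=t} log a_k, so that exp(A_t) is the cumulative product; then x_t = exp(A_t) * (x_0 + sum_{k<=t} b_k * exp(-A_k)).

```python
import math

def linear_recurrence(a, b, x0):
    """Sequential sketch of x_t = exp(A_t) * (x0 + sum_{k<=t} b_k * exp(-A_k)),
    where A_t = sum_{k<=t} log a_k. Each loop is one prefix scan."""
    A, s = [], 0.0
    for ak in a:               # scan 1: cumulative sum of logs (the cumulative product)
        s += math.log(ak)
        A.append(s)
    xs, acc = [], x0
    for At, bt in zip(A, b):   # scan 2: cumulative sum of rescaled b's
        acc += bt * math.exp(-At)
        xs.append(math.exp(At) * acc)
    return xs
```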


Interesting. They don't explicitly state it, but it looks like the restriction on a_t and b_t is that while they may vary in time, they must have a closed-form solution which is independent of x.


I think I've seen something similar for computing Fibonacci values by converting the sequence into a series of matrix multiplications, which can then be done in parallel because they are associative. Each matrix is 2x2 with 1 in three cells and 0 in the last, representing the first Fibonacci numbers. Very cool transformation


Transforming the recurrence to a matrix, applying the matrix as a transformation of itself can be expressed as multiplication. But AAAAAA doesn't really save you much to get the 6th Fibonacci number. The power is in powers of the matrix. Now you can say AA = A^2, and A^2*A^2 = A^4; still not computationally helpful, but A^4*A^4 = A^8, A^8*A^8 = A^16, ...32, ...64, and then you've got something useful.
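A sketch of that repeated-squaring idea (plain Python, my own; note the speedup relies on associativity of matrix multiplication, since it's certainly not commutative):

```python
def matmul2(A, B):
    """Multiply two 2x2 matrices represented as nested tuples."""
    return ((A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]),
            (A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]))

def fib(n):
    """n-th Fibonacci number via fast exponentiation of [[1,1],[1,0]]:
    [[1,1],[1,0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]]."""
    result = ((1, 0), (0, 1))  # identity matrix
    A = ((1, 1), (1, 0))
    while n:
        if n & 1:
            result = matmul2(result, A)
        A = matmul2(A, A)      # repeated squaring: A, A^2, A^4, A^8, ...
        n >>= 1
    return result[0][1]
```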


Unfortunately it doesn't work for matrix valued coefficients. Unless I am mistaken?


The math works (I think, don't really see why it shouldn't); the intermediate of the scan is matrix-valued, though.


Sure, the math works, but it's not efficient: matrix-matrix multiplication instead of matrix-vector multiplication... and then throw a matrix exponential in the mix for good measure.



