It's worth noting that you can solve these linear recurrences, `x(t) = a(t)x(t-1) + b(t)`, using a single parallel prefix sum where the elements are the input tuples `(a(t), b(t))`.
The following operator called `combine` works. It's associative over these tuples; the final result will be the final value of the second component. Altogether, this gives you O(n) work and O(log n) span, but using just a single parallel prefix kernel, which may be more efficient in practice.
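In Python-ish form, the operator is just affine-map composition (a minimal sketch; the function names and the sequential `accumulate` driver are mine, for illustration):

```python
from itertools import accumulate

# Each tuple (a, b) represents the affine map x -> a*x + b.
# Composing two such maps yields another map of the same form:
#   x -> a2*(a1*x + b1) + b2 = (a1*a2)*x + (a2*b1 + b2)
def combine(left, right):
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def solve_recurrence(a, b, x0):
    # Seed the scan with (1, x0) so the second component of the running
    # tuple is x(t). `accumulate` is sequential; a real implementation
    # would feed the same operator to a parallel prefix-sum (scan) kernel.
    pairs = [(1.0, x0)] + list(zip(a, b))
    return [x for (_, x) in accumulate(pairs, combine)][1:]

# x(t) = 0.5*x(t-1) + 1 with x(0) = 0  ->  [1.0, 1.5, 1.75, 1.875]
print(solve_recurrence([0.5] * 4, [1.0] * 4, 0.0))
```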
I'm pretty sure this generalizes, too, to abstract multiplications and additions in any field (at first glance it seems like it should, but I haven't done the formal proof yet).
Anyway, it would be interesting to compare this against the solution in the arxiv paper.
Yes! That's pretty much how I would have thought about this with my puny little brain.
But... I'm not sure 1 parallel scan with 2 floating-point mults and 1 sum per step is faster than 2 parallel scans with 1 sum per step. I don't know which is better in theory or in practice. And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?
On modern multicore hardware this will be memory-bound; the amount of computation per byte is pretty small (just a few arithmetic instructions on average). My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.
And yes, definitely, the numerical accuracy thing could be a problem. I suspect it wouldn't be too difficult to work around, but I can't say for sure off the top of my head.
Memory pressure is even worse on GPUs. I did some work generalizing Blelloch's scan to 2D parallel prefix sums for integral image computation back in 2008 [1], and the number of memory accesses really dominates. On a GPU, for sufficiently small problems the number of passes matters more, and it is worth using a simpler, non-work-efficient algorithm to reduce setup overheads.
Thank you, this is helpful... although my initial thought was this may be useful for a rather different type of application: These new AI models, "linear RNNs," that have many layers, each layer processing large batches of sequences in parallel, each sequence with potentially up to millions of tokens, each token with thousands of features. Definitely not small-scale. Hard to reason about, at least for me.
> My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.
Thank you, that's helpful. Your intuition may be right, but I'm not sure either. Too hard to reason about, at least for me. Maybe the thing to do is test both... unfortunately that would involve the hassle of writing code for executing the single scan efficiently on GPUs.
I don't see why people should glorify this with the word "algorithm". It is a trivial undergrad homework exercise, once you give the hint "use parallel reduce / fold / prefix sum".
This may involve more interesting tradeoffs if you deal with large or sparse matrices or matrix-free operators.
I had the same thought. If a were constant, I believe the solution would just be a convolution of b_t with an exponential filter. I think this is different because a_t depends on time, so it's equivalent to convolving b_t with a filter whose timescale changes at every timestep, which doesn't seem like something that's obviously parallelizable with a Fourier transform.
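To make that concrete, here's the unrolled recurrence in both cases (my notation):

```latex
% Constant a: unrolling x_t = a x_{t-1} + b_t gives
x_t = a^t x_0 + \sum_{k=1}^{t} a^{t-k} b_k ,
% i.e., b convolved with the causal exponential filter (1, a, a^2, \dots).
% Time-varying a_t: the weight on each b_k becomes a running product,
x_t = \Big( \prod_{j=1}^{t} a_j \Big) x_0
    + \sum_{k=1}^{t} \Big( \prod_{j=k+1}^{t} a_j \Big) b_k ,
% which depends on t and k jointly, so it's no longer a single convolution.
```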
As a super minor grammar point for the author, "ubiquitous" is a rare example of a word that starts with a vowel but should be preceded by "a" instead of "an".
I think that's got to be a great test for foreign spies!
"ubiquitous" starts with the sound of "you-biquitous" and so the suffix -n is a duplicated non-vowel. ("y" is probably a vocalic glide, but still not in {a,e,i,o,u}.)
I bet the real rule is actually about feasibility of pronunciation, even though native English speakers see the rules explained in terms of spelling.
> I bet the real rule is actually about feasibility of pronunciation, even though native English speakers see the rules explained in terms of spelling.
The rule as often given in English classes is to use "an" if a word starts with a "vowel sound", rather than starting with a vowel. So, it's "an herb" (since the 'h' is silent in (American) English), but "a ubiquitous".
Relatedly, you can infer whether someone likely pronounces "URL" by spelling it out (like "you are ell") or as a word (like "earl") based on whether they write "a URL" or "an URL".
As someone who uses a dialect/accent of English in which I often pronounce these "silent" letters (e.g. I would pronounce the "h" in herb), I've often wondered whether the choice of "a" vs "an" is supposed to be accent-specific.
I think so, yes; if you say "herb" with a non-silent 'h', you'd write "a herb".
By way of example, the name Herb is commonly pronounced with a non-silent H, so even in a dialect where "herb" is pronounced "erb", you'd write "A Herb picked up an herb."
> I bet the real rule is actually about feasibility of pronunciation
YUP! Sorry I don't have a good citation handy, but lots of English grammar happens as a result of misdivided word boundaries: "a napron" (from the French naperon) became "an apron"; "orange" (from the Arabic naranj) lost its initial "n" the same way; "an ewt" became "a newt"; etc.
It seems like they must require (or maybe I missed something) that none of the a's are zero (I guess this would be easy to check and avoid). It seems like it would be harder to know beforehand that none of the b*'s are equal to -x_0 and none of the b's are equal to the corresponding a*. But maybe this is just unlikely enough not to care about.
I wish there were more explanation for why this works. There is a proof, but a more opinionated description would also help.
My summary of it is that parallel prefix sum is work-efficient for computing general recurrences of the form `x_i = x_{i-1} + y_i`. Furthermore, a product of elements can be expressed as a sum by taking the sum of the logs of the elements and exponentiating the result. These are two super useful properties, but how to merge the two operations into this general formulation is not clear.
Is it simply: sum the logs via one prefix sum, sum the addends via another, and then combine the two results? That seems pretty straightforward, so I suspect I'm missing something that makes the solution less obvious and more complex.
Is there some fancy fixed-point combinator, or a solution to a differential equation for the linear system underlying the recurrence?
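For what it's worth, here's the closed form I believe the paper is scanning over, as a plain-numpy sketch (the function name is mine; the paper additionally does both scans in log space, via a log-cumsum-exp, to keep the cumulative products from over/underflowing):

```python
import numpy as np

def solve_two_scans(a, b, x0):
    # Solves x_t = a_t * x_{t-1} + b_t for t = 1..n, assuming all a_t != 0.
    a_star = np.cumprod(a)                 # scan 1: a*_t = a_1 * ... * a_t
    partial = x0 + np.cumsum(b / a_star)   # scan 2: x_0 + sum_{k<=t} b_k / a*_k
    return a_star * partial                # x_t = a*_t * (x_0 + sum_{k<=t} b_k / a*_k)

# x_t = 0.5*x_{t-1} + 1 with x_0 = 0  ->  [1.0, 1.5, 1.75, 1.875]
print(solve_two_scans(np.array([0.5] * 4), np.array([1.0] * 4), 0.0))
```

So it isn't two independent prefix sums added together at the end: the summands of the second scan are divided by the running products from the first, and the result is rescaled by them again.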
Interesting. They don't explicitly state it, but it looks like the restriction on a_t and b_t is that while they may vary in time, they must have a closed form which is independent of x.
I think I've seen something similar for computing Fibonacci values by converting the sequence into a series of matrix multiplications, which can then be done in parallel because matrix multiplication is associative. Each matrix is 2x2 with 1 in three cells and 0 in the last, representing the first Fibonacci numbers. Very cool transformation.
Transforming the recurrence to a matrix, then applying the matrix as a transformation of itself, can be expressed as multiplication. But AAAAAA doesn't really save you much to get the 6th Fibonacci number. The power is in powers of the matrix. Now you can say
AA = A^2, and A^2 A^2 = A^4
still not computationally helpful, but
A^4 A^4 = A^8
A^8 A^8 = A^16
...A^32
...A^64
and then you've got something useful.
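A minimal sketch of that doubling trick in plain Python (with F(n) read off A^n):

```python
def mat_mul(X, Y):
    # 2x2 matrix product.
    return [[X[0][0]*Y[0][0] + X[0][1]*Y[1][0], X[0][0]*Y[0][1] + X[0][1]*Y[1][1]],
            [X[1][0]*Y[0][0] + X[1][1]*Y[1][0], X[1][0]*Y[0][1] + X[1][1]*Y[1][1]]]

def fib(n):
    # A is the Fibonacci step matrix: A^n = [[F(n+1), F(n)], [F(n), F(n-1)]].
    A = [[1, 1], [1, 0]]
    result = [[1, 0], [0, 1]]            # identity
    while n > 0:
        if n & 1:
            result = mat_mul(result, A)  # fold in the current power of A
        A = mat_mul(A, A)                # A, A^2, A^4, A^8, ... as above
        n >>= 1
    return result[0][1]                  # F(n) in O(log n) matrix products

print([fib(n) for n in range(10)])       # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```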
Sure, the math works, but it's not efficient: matrix-matrix multiplication instead of matrix-vector multiplication... and then throw a matrix exponential into the mix for good measure.
For example, here's a C++ implementation: https://github.com/MPLLang/parallel-ml-bench/blob/main/cpp/l... And here's an implementation in a functional language: https://github.com/MPLLang/parallel-ml-bench/blob/main/mpl/b...
----
EDIT: ah, good, this is already being discussed here: https://github.com/glassroom/heinsen_sequence/issues/1
And, for reference, I learned this algorithm from Guy Blelloch; see Sec 1.4.1 of https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf