
Non-linear Thinking with CUDA - StylifyYourBlog
http://viralfsharp.com/2015/09/08/non-linear-thinking-with-cuda/
======
rielfowler
Why is the naive solution O(n^3)? Generating one partial sum costs O(n), and you
need to do it n times, so O(n^2). Even if you need to store all the intermediate
sums, you get them for free while summing the full interval. No memoization
needed. What am I missing here?

~~~
ColinWright
Naively:

    
    
       There are n starting points
       There are n ending points
       The interval is n/2 in length on average
       Thus it requires n * n * (n/2) additions
    

You said:

    
    
        Generating one partial sum costs n, and
        you need to do it n times, ...
    

I think you need to do it n^2 times.
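The counting argument above can be sketched in code. This is a minimal Python sketch (my own, not from the article): n^2 (start, end) pairs, and each interval is re-summed from scratch at O(n) cost, giving the naive O(n^3).

```python
def all_interval_sums_naive(xs):
    """Sum every contiguous interval by re-summing it from scratch:
    ~n^2 intervals, O(n) work each, so O(n^3) total."""
    n = len(xs)
    sums = {}
    for start in range(n):            # n starting points
        for end in range(start, n):   # n ending points
            # O(n) additions per interval -- this is the redundant work
            sums[(start, end)] = sum(xs[start:end + 1])
    return sums

print(all_interval_sums_naive([1, 2, 3]))
# {(0, 0): 1, (0, 1): 3, (0, 2): 6, (1, 1): 2, (1, 2): 5, (2, 2): 3}
```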

~~~
rielfowler
There are n^2 results, right?

~~~
ColinWright
Yes, and naively each of those results takes O(n) to compute.

~~~
rielfowler
Let me correct myself: there are n^2 results in total when the algorithm
ends. The naive algorithm does n times more work than that. That work is
redundant, because there is clearly an overlap between the partial sums.

~~~
ColinWright
OK, now I don't understand what you are trying to say.

There are n^2 results, each result takes O(n) to compute, so the naive
algorithm is O(n^3). That's what you asked - why does the naive algorithm take
O(n^3) - and that seems to answer your question.

Yes, clearly there is an overlap of some partial sums, and that's why a less
naive algorithm is sub-O(n^3).

So, what are you asking, or what additional point are you making?

------
amelius
I'm wondering if this problem can be approached as a standard convolution
problem.

~~~
pavanky
Especially if you are doing it in frequency space, all you would need to do is
1 forward FFT and n reverse FFTs.

Alternatively you can just perform a prefix-sum pass and, in the kth
iteration, subtract the (n - k)th prefix element from the nth to get the
results. This could be even faster.
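The prefix-sum idea can be sketched as follows (a minimal Python sketch under one plausible reading of the comment; the helper name and index convention are my own): after a single O(n) prefix-sum pass, the sum of any interval falls out of one subtraction, so all n^2 results cost O(n^2) total instead of O(n^3).

```python
from itertools import accumulate

def all_interval_sums_prefix(xs):
    """After one prefix-sum pass, sum(xs[start..end]) is
    prefix[end + 1] - prefix[start]: O(1) per interval."""
    prefix = [0] + list(accumulate(xs))  # prefix[i] == sum(xs[:i])
    n = len(xs)
    return {(start, end): prefix[end + 1] - prefix[start]
            for start in range(n)
            for end in range(start, n)}

print(all_interval_sums_prefix([1, 2, 3]))
# {(0, 0): 1, (0, 1): 3, (0, 2): 6, (1, 1): 2, (1, 2): 5, (2, 2): 3}
```

On a GPU the prefix sum itself parallelizes well (it is the classic scan primitive), which is presumably why the commenter expects this to beat the FFT route.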

