There's a limit to how much work can be completed in one cycle. There's also a limit to the complexity of the function, since the programmable datapath has a fixed size.
I also wonder how well this approach would work for an algorithm where the data access pattern is as important as the CPU work.
It doesn't help. For that you need a programmable memory controller.