Nice simple introduction. It would be interesting for someone here with access to HLS tools, esp. C-to-Verilog stuff, to tell us what speedup they get straight-up using the C code. As in, with very little input past what the HLS tools come up with.
Faster, potentially cheaper, and more expensive to produce?
The history of the bitcoin miner has details and is a real-world example of the software on x86 ASIC -> FPGA -> custom ASIC process. It's easy to find the relative performance of the bitcoin miner running on everything from Raspberry Pis to CUDA clusters [1].
Note that the article is using the very flexible DE2-115 and there are lots of interesting trade-offs made to fit a bitcoin miner in only 115,000 logic elements... iirc, if you have 250k logic elements, it can run 4x (???) faster due to optimizations during synthesis.
I guarantee it would smoke it by similarly ridiculous numbers. Assembler will inherently be doing ops sequentially while also waiting on memory accesses in between them where not cached. An expected speedup might factor in the clock difference between it and yours plus the number of cores. Yet you're not going to get the kind of parallelism and simple operation you have with custom HW. It's the lasting drawback of general-purpose CPUs.
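To put rough numbers on the comment above, here is a back-of-envelope sketch in Python. Every figure in it is a made-up assumption for illustration (clock rates, core count, cycles per result), and the function names are hypothetical; none of it comes from the article. It only shows why a faster clock and more cores do not automatically beat a slower but fully pipelined design.

```python
def cpu_throughput(clock_hz: float, cores: int, cycles_per_result: float) -> float:
    """Results per second for a CPU that needs several sequential
    cycles (ops plus memory stalls) to produce each result."""
    return clock_hz * cores / cycles_per_result


def pipelined_hw_throughput(clock_hz: float, results_per_cycle: float) -> float:
    """Results per second for custom hardware that, once its pipeline
    is full, emits a fixed number of results every clock cycle."""
    return clock_hz * results_per_cycle


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
cpu = cpu_throughput(clock_hz=3e9, cores=4, cycles_per_result=3000)
fpga = pipelined_hw_throughput(clock_hz=50e6, results_per_cycle=1)

print(f"CPU:  {cpu:,.0f} results/s")
print(f"FPGA: {fpga:,.0f} results/s")
print(f"FPGA / CPU: {fpga / cpu:.1f}x")
```

With these assumed inputs the CPU has a 60x clock advantage and four cores, yet the pipelined design still comes out ahead, because the CPU spends thousands of cycles on each result. Change `cycles_per_result` and the picture changes with it, which is why measured numbers matter more than this kind of estimate.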
And why Intel is buying Altera. Stuff like this article will get easier, with even bigger speedups, in the near future. Just wait. :)
That's the problem, I think. He's changed from a shared pixel memory in the reference design to a non-shared one in his HW-accelerated design. That's the impression I get from the diagrams.
It's somewhat moving the goal posts IMO. Guess it's ok since it's a student project.
Pixels are mapped to memory locations, but they don't have to be, if you can access the map directly. I don't exactly know what I'm talking about here, just a thought.
What would be the difference between writing to shared hardware pixels versus a shared memory, performance-wise?
I mean, pixels are just like a memory except that they glow. They hold their value and can be written a new value in sync with a clock (which is normally 60 Hz for most monitors, which is MUCH slower than on-chip memory). There could be no performance benefit.