
Recurrent Neural Networks Hardware Implementation on FPGA - Katydid
http://arxiv.org/abs/1511.05552v1
======
ilurk
> We implemented a RNN with 2 layers and 128 hidden units in hardware and it
> has been tested using a character level language model. The implementation
> is more than 21× faster than the ARM CPU embedded on the Zynq 7020 FPGA.
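
For context, the per-timestep work being accelerated here is the standard LSTM cell update: two matrix-vector products plus elementwise gates. A minimal NumPy sketch (the 128 hidden units match the paper; the vocabulary size and the random weights are made-up stand-ins):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H, V = 128, 65                          # 128 hidden units per the paper; vocab size is a placeholder
Wx = np.random.randn(4 * H, V) * 0.01   # input-to-gates weights (random stand-ins)
Wh = np.random.randn(4 * H, H) * 0.01   # hidden-to-gates weights
b  = np.zeros(4 * H)

def lstm_step(x, h, c):
    """One LSTM timestep: two matrix-vector products plus elementwise gates."""
    z = Wx @ x + Wh @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_new = f * c + i * g               # cell state update
    h_new = o * np.tanh(c_new)          # hidden state output
    return h_new, c_new

# one character-level step: a one-hot input vector
x = np.zeros(V)
x[7] = 1.0
h, c = lstm_step(x, np.zeros(H), np.zeros(H))
print(h.shape, c.shape)                 # (128,) (128,)
```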

I'm left curious about the performance gain factor when scaling the network in
terms of layers and units. Would the performance gap widen as the RNN grows?

~~~
leonardt
> Figure 8: The execution time is projected to decrease with the increase of
> number of LSTM cells running in parallel. This can lead to significant
> performance improvement.

They do say in the text that "Figure 8 shows the expected speed up, assuming
the data throughput is high enough to handle the parallel processing", so take
it with a grain of salt. There could, and most likely will, be other factors
(as there always are) that prevent ideal scaling.
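
As a back-of-the-envelope illustration of the kind of non-ideal scaling they're hedging against: a toy model where compute divides across P parallel LSTM cells but weight traffic is capped by a fixed memory bandwidth (all numbers invented):

```python
# Toy model: per-timestep compute divides across P parallel LSTM cells,
# but streaming the weights is bounded by a fixed memory bandwidth.
compute_time_1cell = 1.0e-3   # seconds per timestep with one cell (made up)
weight_bytes       = 8.0e5    # bytes of weights streamed per timestep (made up)
bandwidth          = 1.6e9    # bytes/s of memory bandwidth (made up)

for P in (1, 2, 4, 8, 16, 32):
    t = max(compute_time_1cell / P, weight_bytes / bandwidth)
    print(f"P={P:2d}  time/step={t*1e3:6.3f} ms  speedup={compute_time_1cell/t:5.1f}x")
```

With these made-up numbers the speedup saturates at 2x once the datapath is memory-bound, no matter how many cells run in parallel.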

------
iaw
Is this unexpected? I feel like we're heading towards more hardware
implementations of Machine Learning to work around the current bottlenecks.

~~~
spooningtamarin
There are a great many hardware-based implementations. Any reasonably
sophisticated TV upscaler (certainly Sony's 4K TVs) probably has a neural
network embedded in it.

Algorithms working on images often need FPGA implementations to be fast
enough, and there are a lot of convnets in use there.

There's a chapter on convnets here:
[http://www.cambridge.org/us/academic/subjects/computer-scien...](http://www.cambridge.org/us/academic/subjects/computer-science/pattern-recognition-and-machine-learning/scaling-machine-learning-parallel-and-distributed-approaches?format=HB)

RNNs are nothing special in particular, at least not those with a small number
of layers and nodes.

HMMs or CRFs are easier to handle for sequences and would probably work well,
and there are FPGA implementations of these models all over the place.
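
For what it's worth, the reason HMMs map so nicely to hardware is that Viterbi decoding is a fixed max-plus recurrence per observation. A toy NumPy sketch (all parameters invented):

```python
import numpy as np

# Toy 3-state HMM in log space. Each timestep is the same add/max/argmax
# pattern over a small state space, which pipelines trivially in hardware.
log_trans = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.2, 0.2, 0.6]]))
log_emit  = np.log(np.array([[0.9, 0.1],    # P(obs | state)
                             [0.5, 0.5],
                             [0.1, 0.9]]))

def viterbi(obs, log_init):
    score = log_init + log_emit[:, obs[0]]
    back = []
    for o in obs[1:]:
        cand = score[:, None] + log_trans          # all previous-state candidates
        back.append(cand.argmax(axis=0))           # remember the best predecessor
        score = cand.max(axis=0) + log_emit[:, o]  # max-plus step
    path = [int(score.argmax())]                   # trace back the best path
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1, 1], np.log(np.full(3, 1 / 3))))
```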

~~~
p1esk
Any links to TV upscaler chips using NN algorithms? I couldn't find any.

Also, my impression was that FPGAs are much slower than GPUs for neural nets.
Unless you're talking about really high end chips like Stratix 10 from Altera,
which cost over $30k. Power consumption is a different matter though.

~~~
spooningtamarin
[https://community.sony.co.uk/t5/blog-news-from-sony/inside-4...](https://community.sony.co.uk/t5/blog-news-from-sony/inside-4k-what-is-x-reality-pro/ba-p/1346224)

They won't tell you it's NNs, but it is. Sony distributes a lot of movies;
they use their database of movies to train the upscaling models (which are
obviously NNs: [https://github.com/nagadomi/waifu2x](https://github.com/nagadomi/waifu2x))
and then put the chip in the TV.

It's something almost equivalent to storing Pride and Prejudice and Zombies in
your TV in 4K, and then reproducing it when they match it with what's playing
on TV.

~~~
p1esk
That link doesn't mention anything about neural networks. How do you know? I
looked up that Sony chip, and I can't find anything "neural" about it either.

I don't quite understand what you mean by: "It's something almost equivalent
to storing Pride and Prejudice and Zombies in your TV in 4K, and then
reproducing it when they match it with what's playing on TV."

Can you explain? What exactly do they have to store in the TV?

~~~
spooningtamarin
Well, Sony Pictures is huge. Their database of movies is enormous. (Pride and
Prejudice and Zombies is one of the movies distributed by Sony)

Imagine they stored every movie they distribute, in 4K, in your TV. Whenever a
movie is displayed, they just find it in the database and reproduce the 4K
version.

Of course, that is infeasible; it would require far too much storage.

What they actually do is something like the waifu2x I linked: they train a
neural network to learn how to properly upscale movies, put the "algorithm" on
the chip, and it's fast enough for real-time use.
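
Roughly, the waifu2x/SRCNN recipe is cheap interpolation followed by a small trained convnet. A toy sketch of the structure (a single untrained 3x3 filter standing in for the learned layers):

```python
import numpy as np

def upscale2x(img, kernel):
    """Nearest-neighbor 2x upscale followed by one 3x3 convolution.
    Real systems (waifu2x, SRCNN) stack several trained conv layers;
    this single hand-picked filter is just a stand-in."""
    big = np.kron(img, np.ones((2, 2)))  # cheap 2x interpolation
    pad = np.pad(big, 1, mode="edge")
    out = np.zeros_like(big)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * pad[dy:dy + big.shape[0], dx:dx + big.shape[1]]
    return np.clip(out, 0.0, 1.0)

img = np.random.rand(4, 4)               # toy grayscale patch
sharpen = np.array([[0.0, -0.25, 0.0],   # made-up filter; a trained
                    [-0.25, 2.0, -0.25], # network would learn these weights
                    [0.0, -0.25, 0.0]])
print(upscale2x(img, sharpen).shape)     # (8, 8)
```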

~~~
p1esk
It seems like you're speculating. Unless you're familiar with the details of
their chip, your guess is as good as mine.

~~~
spooningtamarin
I don't have to be familiar with it. You can know the image quality of
state-of-the-art deterministic upscalers, and the image quality of
state-of-the-art statistical upscalers.

When you see these upscalers side by side, it's obvious deterministic
upscaling isn't that good.

------
nuand
Seems like an interesting approach. However, the 21x speedup seems a little
underwhelming considering the speed of the Zynq's programmable logic fabric and
how parallelizable NNs are. They're quoting a 21x speedup over the ARM
processor on the Zynq 7020, which is on par with what powers the Raspberry
Pi 2. My guess is they didn't pipeline their design enough, or appropriately,
and one of the datapaths significantly limits their throughput.
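
To make the pipelining point concrete, a toy throughput model (clock rate and stage count invented): an unpipelined datapath finishes one operation before starting the next, while a pipelined one accepts a new operation every cycle once the pipeline is full.

```python
# Toy model of the pipelining point: without pipelining, throughput is gated
# by the full datapath latency; with pipelining, by the slowest single stage.
clock_hz = 100e6      # invented fabric clock
stages   = 8          # invented datapath depth (cycles)
n_ops    = 1_000_000  # multiply-accumulates to issue

unpipelined = n_ops * stages / clock_hz        # one op in flight at a time
pipelined   = (stages + n_ops - 1) / clock_hz  # new op every cycle once full

print(f"unpipelined: {unpipelined*1e3:8.2f} ms")
print(f"pipelined:   {pipelined*1e3:8.2f} ms  ({unpipelined/pipelined:.1f}x)")
```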

------
dang
Url changed from [http://hgpu.org/?p=14968](http://hgpu.org/?p=14968), which
points to this.

