
Reverse Engineering the Comtech AHA363 PCIe Gzip Accelerator Board - todsacerdoti
https://tomverbeure.github.io/2020/06/14/AHA363-Reverse-Engineering.html
======
triangleman
So how do you get a web server to take advantage of a hardware gzip
accelerator? Is there like a custom nginx plugin that points to a custom
driver? Seems very cool and I've never heard of something like this.

~~~
tverbeure
According to their press release, it was an Apache plugin.

[https://www.businesswire.com/news/home/20081008005207/en/Com...](https://www.businesswire.com/news/home/20081008005207/en/Comtech-AHA-Announces-Apache-GZIP-Compression-Module)

~~~
mng2
They mention APIs for Windows, Linux, OpenSolaris, a ZLIB, and the Apache
module.

[http://www.aha.com/DrawProducts.aspx?Action=GetProductDetail...](http://www.aha.com/DrawProducts.aspx?Action=GetProductDetails&ProductID=14)

------
tachyonbeam
I'm curious to know how much faster than Gzip compression on a modern
multicore CPU this is. The AHA webpage says "Compresses and decompresses at a
throughput rate over 5.0 Gbits/sec" (that's about 0.625 GB/s). How fast can you gzip
compress on a 16-core Ryzen CPU, for example?
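One way to get a rough baseline is to time zlib (the library behind gzip) on a single core; parallel tools like pigz then scale roughly linearly with core count. A minimal sketch in Python (the payload size and repeated-text content are arbitrary choices, so real-world throughput will vary with the data):

```python
import time
import zlib

# Build roughly 69 MB of moderately compressible test data.
data = (b"The quick brown fox jumps over the lazy dog. " * 1000) * 1500
size_mb = len(data) / 1e6

start = time.perf_counter()
compressed = zlib.compress(data, level=6)  # gzip's default compression level
elapsed = time.perf_counter() - start

print(f"input:  {size_mb:.0f} MB")
print(f"output: {len(compressed) / 1e6:.1f} MB")
print(f"single-core throughput: {size_mb / elapsed:.0f} MB/s")
```

Multiply the single-core figure by the core count for an optimistic upper bound on what a pigz-style parallel run could do.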

~~~
masklinn
According to
[https://rachaellappan.github.io/pigz/](https://rachaellappan.github.io/pigz/)
pigz did about 360MB/s on a 96-core machine, though that was 3 years ago.

~~~
Ballas
I regularly get 80-200MB/s on my 8 core 7700HQ, though it is likely limited by
disk speed.

------
jtchang
Wow. I know very little about FPGAs and hardware and this is pure magic to me.
Debugging these things must be a royal pain.

~~~
pjc50
Yes, it is :)

The normal development flow for FPGA work is similar to ASIC in that people
focus on testing as many elements as possible in software simulation, to a
very high level of coverage, before even downloading to the FPGA. Once in
there, you're reliant on JTAG (a serial debug interface) to read out values
from the target device.

Tools like ChipScope can let you see what's going on and set "breakpoints".
[http://web.mit.edu/6.111/www/labkit/chipscope.shtml](http://web.mit.edu/6.111/www/labkit/chipscope.shtml)

All of this is much harder when it's on a board you didn't design!

------
_sbrk
Nice development board for $20, even if it takes some work. Bought one!

~~~
tverbeure
It's too late now since you've already bought one, but right now, it's still a
doorstop that can't be used for anything: I haven't even been able to get one
of the LEDs blinking, and that's just the Hello World of the FPGA hobbyist.

~~~
_sbrk
No worries. As the saying goes "this isn't my first rodeo". I'll let you know
what I discover.

~~~
tverbeure
Get in touch!

One thing that didn't make it into the blog post was that a strategically
soldered wire shortened the JTAG chain to include only the 2 Intel chips and
bypass the AHA chips.

The power to the AHA chips gets cut off at some point, breaking the JTAG
chain, but the FPGAs stay active.

So by bypassing, you keep the chain alive.
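A JTAG chain behaves like one long shift register: each powered device in BYPASS contributes a single flip-flop stage, so a single unpowered device anywhere in the chain kills the TDI-to-TDO path for everyone. A toy model in Python (the device names and one-bit-per-device behavior are illustrative, not the board's actual topology):

```python
def shift_through_chain(devices, bits):
    """Shift a bit sequence through a JTAG chain. Every powered device
    in BYPASS adds one register stage of delay; a powered-off device
    breaks the chain entirely (nothing ever reaches TDO)."""
    if any(not powered for _, powered in devices):
        return None  # broken chain: no data comes out of TDO
    delay = len(devices)  # one bypass flip-flop per device
    return [0] * delay + bits  # data emerges 'delay' clock cycles later

# Hypothetical topology: two FPGAs plus two AHA chips that lose power.
full_chain = [("FPGA1", True), ("FPGA2", True),
              ("AHA1", False), ("AHA2", False)]
# The soldered wire routes around the AHA chips entirely.
shortened = [("FPGA1", True), ("FPGA2", True)]

print(shift_through_chain(full_chain, [1, 0, 1]))  # None: chain is dead
print(shift_through_chain(shortened, [1, 0, 1]))   # [0, 0, 1, 0, 1]
```

The shortened chain still works because every remaining device stays powered; the data just arrives two clock cycles late instead of not at all.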

------
happycube
Are there any free drivers for it with the native hardware? There are ones on
their site... but behind a registration-wall.

It might be pretty nifty to run swap through this...

------
battery423
I'm super excited about what will come in this direction after the PS5
announcement and Microsoft mentioning DirectStorage.

The PS5 has an I/O chip and extra architecture for decompressing textures
(hence the relevance of this gzip accelerator) and more features.

Mark Cerny said they needed this chip because doing the decompression in
software would use up all the CPU cores. NVMe drives are now so fast that an
I/O co-processor is feasible again.

~~~
rasz
PC SSDs started out with built-in compression. This is why most test
software has separate "incompressible data" graphs. It's usually something you
don't actually want, akin to fake streamer tape capacities/speeds that assume
2x compression.
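The "incompressible data" caveat is easy to demonstrate: random (or already-compressed) bytes gain essentially nothing from a second compression pass, while repetitive data shrinks dramatically. A quick Python check (the 1 MB payload sizes are arbitrary):

```python
import os
import zlib

text = b"AAAA" * 250_000              # highly compressible, 1 MB
random_bytes = os.urandom(1_000_000)  # effectively incompressible, 1 MB

for label, payload in [("repetitive", text), ("random", random_bytes)]:
    out = zlib.compress(payload, level=6)
    print(f"{label}: {len(payload)} -> {len(out)} bytes "
          f"(ratio {len(payload) / len(out):.2f}x)")
```

The repetitive payload compresses by orders of magnitude; the random one actually grows slightly, which is why drives with built-in compression look great on benchmarks but not on real, mostly-compressed user data.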

> IO chip and extra architecture for decompressing textures

You don't want uncompressed textures in your GPU memory, and compressing
DXT-compressed textures is not ideal, to say the least: you can count on a
30-50% compression ratio, not the 2x peak number Sony's marketing was throwing
around.

>would use all CPU cores

is a marketing exaggeration; LZ4 decompression can achieve ~3 GB/s per core.
How it looks today: [https://www.jonolick.com/home/oodle-and-ue4-loading-time](https://www.jonolick.com/home/oodle-and-ue4-loading-time)
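The per-core throughput claim is easy to sanity-check with a measurement like the one below, using Python's stdlib zlib as a stand-in (DEFLATE decompression is considerably slower than LZ4, so treat this as the measurement method rather than the ~3 GB/s figure itself):

```python
import time
import zlib

# Roughly 82 MB of repetitive, texture-ish test data (arbitrary payload).
data = (b"texture-ish payload with some repetition " * 1000) * 2000
compressed = zlib.compress(data, level=6)

start = time.perf_counter()
restored = zlib.decompress(compressed)
elapsed = time.perf_counter() - start

assert restored == data
print(f"decompressed {len(data) / 1e6:.0f} MB in {elapsed * 1000:.0f} ms "
      f"({len(data) / 1e6 / elapsed:.0f} MB/s on one core)")
```

Swapping in an LZ4 binding for the zlib calls would reproduce the per-core numbers the parent comment cites.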

~~~
battery423
When I look at that data, it's capped: despite PC transfer speeds of over
512 MB/s, loading hits its speed limit much sooner, which does show that
something is missing.

~~~
rasz
Diminishing returns. The cap is most likely caused by the game engine's fixed
costs; at some point you can't use more speed, no matter what, without a
rewrite (initialization, deserialization, etc.).

------
shawnz
Why would they design it to require two ASICs AND an FPGA? Couldn't they just
have built the FPGA program into the ASICs?

~~~
Palomides
pure speculation:

the FPGA does the PCIe and all necessary data marshaling, which allows a lot
of flexibility for updates and bug fixing on the most finicky parts of
hardware

two ASICs because once you have one manufactured, most of the costs are sunk
and per unit it's cheap to stick a second on the board

~~~
tachyonbeam
You're probably right. The FPGA is probably the most expensive part of this
board by far, and maybe they figured the FPGA and the PCIE bus can handle
enough traffic to keep two compression chips busy.

~~~
tyingq
The FPGA is also potentially insurance for bugs on the ASIC that aren't
economical to fix. Catch the bug and fix it on the way out of the card.

