
AWS EC2 FPGA Hardware and Software Development Kit - ktta
https://github.com/aws/aws-fpga
======
patrickg_zill
To save you the digging, the chips are Xilinx VU9P UltraScale+ series:
[https://www.xilinx.com/products/silicon-
devices/fpga/virtex-...](https://www.xilinx.com/products/silicon-
devices/fpga/virtex-ultrascale-plus.html)

(retracted, can't find exact part on digikey) Quantity 1 pricing is about
$2000/chip ; however, that price would no doubt drop substantially at even a
1000 chip order. A typical discount (appears to be) 20% at qty 100; so at qty
1000, maybe 33%? So 1 of these chips might be $1200 after you figure a bunch
of other discounts?

EDIT:

[https://www.xilinx.com/products/silicon-
devices/fpga/virtex-...](https://www.xilinx.com/products/silicon-
devices/fpga/virtex-ultrascale-plus.html#productTable)

^^^ the product table.

The XCVU9P is listed as having ~6800 DSP slices and 2,586K (= ~2.5 million)
system logic cells. That pretty much matches what Amazon describes them as
using.

(The DDR-4 RAM is external - so it doesn't help narrow down the device.)

However there are multiple, very similarly named parts, and it doesn't seem
that the XCVU9P is listed on Digikey. Other parts named UltraScale are - thus
the confusion.

~~~
aseipp
These aren't the basic VU9x's. The specs put them somewhere around the highest
end card on that series: the XCVU37P, with 64GB DDR4 and 2.8 million cells/7k
DSPs.

That isn't a $2k card attached over fabric... It's probably closer to a
$40,000 card.

~~~
bisrig
I think the parent comment to yours (or at least the edited version) is right:
they are almost surely XCVU9Ps based on the logic element and DSP counts. The
RAM numbers listed in the product table are embedded memories (BlockRAMs in
Xilinx-speak); DDR4 wouldn't be a spec feature of the FPGA as it's external to
the part on the PCB.

Avnet web pricing is $27k for the -1 speed grade, extended (not industrial)
with 12-week lead. Safe to say Amazon is getting a better deal on both counts.

I suspect the add-in card itself is either this or something similar to it,
based on specs etc.
[http://www.bittware.com/xilinx/product/xupp3r/](http://www.bittware.com/xilinx/product/xupp3r/).
I didn't see pricing info readily available for that, but a similar card with
less on-board DDR4 runs about $7k straight from Xilinx:
[https://www.xilinx.com/products/boards-and-
kits/ek-u1-vcu118...](https://www.xilinx.com/products/boards-and-
kits/ek-u1-vcu118-es1-g.html)

Edit: Oh yeah, I forgot to add: I thought it was funny that Amazon's page
referred to "logic elements" given that this was traditionally an Altera term
- Xilinx preferred/prefers "logic cells".

~~~
aseipp
Thanks for your correction, you're totally right. I have no idea why I
confused BRAMs with DDR4 on the board...

------
wjakob
I'm curious as to how AWS plans to prevent users from generating malicious
FPGA bitcode that physically damages the FPGA itself and/or the host machine
over PCI Express. The possibility of instantiating arbitrary logic gates in
the cloud seems very dangerous.

~~~
bedros
I have not read the whole FPGA HDK guide yet, but I doubt Amazon will give
the designer access to the pins on the FPGA; they would give access to
interfaces or IO blocks which connect to the pins.

your design sits inside a wrapper with access to IO blocks.

inside the FPGA logic, you cannot do electrical damage, no matter how hard you
try.

~~~
mikejmoffitt
You could have a lot of combinational logic dependent on floating/high-Z
inputs, which is not good for CMOS circuitry.

~~~
dahtguy
What's so bad about high-Z inputs? Designing around a floating input is the
same as designing around an undefined variable. Bad design but you can't
"hurt" the hardware.

~~~
femto
An undefined variable is an abstract construct. You can model a floating input
as being analogous to an undefined variable and the model will fit reality
most of the time, but the underlying reality is still governed by the
complexities of physics, with potential for the model to break down.

Damage is probably more of a risk for an external I/O, but it's still a
possibility for an internal I/O. Probably more of a certainty with a malicious
programmer.

Who's going to be the first person to design an on-die switched capacitor
voltage multiplier, using parasitic capacitance between gates as the energy
storage, and use it to drive an internal high impedance input to a damaging
level?

------
patrickg_zill
As a side note, there is this board (and others):

[https://www.olimex.com/Products/FPGA/iCE40/iCE40HX1K-EVB/ope...](https://www.olimex.com/Products/FPGA/iCE40/iCE40HX1K-EVB/open-
source-hardware)

which has a much-smaller FPGA on it. However the advantage of this board is
that the entirety of the toolchain you can use to program it, is open source.

This board is a little more powerful (but still way less powerful than the AWS
FPGA), and will use the same basic toolset from Xilinx (EDIT: see comment and
recommendation for a better board, below from aseipp):

[http://numato.com/mimas-v2-spartan-6-fpga-development-
board-...](http://numato.com/mimas-v2-spartan-6-fpga-development-board-with-
ddr-sdram/#feaandspec)

The LISPM FPGA implementation, _may_ work on this board, but I have not bought
one yet to test it.

----

Looking at it more, I think I was sure that it was a $2K part because I
thought the AWS instance price was under $2 per hour - so there was no way
they could make their money back!

Now that I see the price is about $14 per hour, it makes sense that (making a
guess they won't always be utilized) they could in fact recoup their costs
after about 18 months or 2 years.
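
A rough way to sanity-check a payback estimate like this is to divide the
hardware cost by the rental revenue at a given utilization. The sketch below is
only illustrative: the function name, the count of eight FPGAs per f1.16xlarge,
and the choice of the $27k Avnet web price and the $13.20 hourly rate quoted
elsewhere in this thread as inputs are all assumptions, and it ignores power,
hosting, staff, and the rest of the server.

```python
HOURS_PER_MONTH = 8760 / 12  # average hours per month in a 365-day year


def break_even_months(hardware_cost, hourly_price, utilization):
    """Months of rental needed for revenue to equal the hardware cost.

    Ignores every running cost, so the result is an optimistic lower bound.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    revenue_per_month = hourly_price * utilization * HOURS_PER_MONTH
    return hardware_cost / revenue_per_month


# Hypothetical inputs: eight FPGAs at the $27k web price mentioned in the
# thread, rented together as one f1.16xlarge at $13.20 per hour.
hardware_cost = 8 * 27_000
full_use = break_even_months(hardware_cost, 13.20, utilization=1.0)
half_use = break_even_months(hardware_cost, 13.20, utilization=0.5)

print(f"100% utilized: {full_use:.1f} months")
print(f" 50% utilized: {half_use:.1f} months")
```

Halving the utilization doubles the payback time, so the result depends far
more on the price AWS actually pays and on how busy the instances are than on
the list prices used here.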

~~~
aseipp
FWIW, that Spartan-6 Mimas v2 chip will _not_ use the same toolset as the one
on AWS. Spartan-6 FPGAs are older and require the "Xilinx ISE" EDA tools,
which are the older, no-longer-developed tools they created. All modern Xilinx
FPGAs are the "-7 series" FPGAs and they use "Xilinx Vivado" for development.
This includes the combination ARM + FPGA devices, the 'Zynq series'.

This sounds like nitpicking but it's a very important distinction to make,
since ISE is no longer maintained and much worse than Vivado (though I never
used it much). Plus, although it's not the dominant concern for many people
and needs -- 7-series boards are, far and away, much more powerful than the
Spartan series. If you wanted to do something like use Linux on an FPGA
softcore, that's where you want to be (granted, the Mimas specifically _can_
handle embedded Linux, like the J2 core!)

A much better Xilinx board IMO is the Digilent 'Arty', which is an Artix-7
with 4x the amount of SDRAM, 3x the logic and many more peripherals like
ethernet. This thing is powerful and will be good for a lot of tasks, and the
Vivado license is completely free:

[http://store.digilentinc.com/arty-artix-7-fpga-
development-b...](http://store.digilentinc.com/arty-artix-7-fpga-development-
board-for-makers-and-hobbyists/)

~~~
GrumpyYoungMan
You mention the Zynq, so I'd like to add that Digilent seems to have recently
released their Zynq-based 'Arty Z7' boards, which provide a dual-core ARM on
the same chip as the FPGA and are also supported by the free Vivado design
tools.

[http://store.digilentinc.com/arty-z7-apsoc-
zynq-7000-develop...](http://store.digilentinc.com/arty-z7-apsoc-
zynq-7000-development-board-for-makers-and-hobbyists/)

~~~
wincy
For a software engineer who has never done anything with FPGAs or embedded
devices but is interested in learning, do you know of any good resources?

~~~
heathjohns
With apologies for plugging my own project:
[https://www.blinklight.io/](https://www.blinklight.io/) starts right at the
ground level.

~~~
MrBuddyCasino
That's a pretty nice project you got there. Worth checking out!

------
banjo_milkman
Another application is accelerating DNA sequencing, the first FPGA-based cloud
service on AWS?: [http://www.edicogenome.com/applications/dragen-on-amazon-
web...](http://www.edicogenome.com/applications/dragen-on-amazon-web-
services/) [https://www.nextplatform.com/2016/12/07/configuring-
future-f...](https://www.nextplatform.com/2016/12/07/configuring-future-fpgas-
genomics/)
[http://www.edicogenome.com/news/dragenblog/](http://www.edicogenome.com/news/dragenblog/)

------
ktta
Huh, it surprised me that this got the attention rather than the blog post.
You should check that out[1] since that is the more appropriate place for
general discussion.

Here's a link[2] to previous discussion from when it was first announced.

[1]:
[https://news.ycombinator.com/item?id=14149538](https://news.ycombinator.com/item?id=14149538)

[2]:
[https://news.ycombinator.com/item?id=13072432](https://news.ycombinator.com/item?id=13072432)

------
dooglius
Cool, but I think it's very unfortunate that the HDK license is so unfree --
it looks like you can't ever use the HDK outside AWS
([https://github.com/aws/aws-
fpga/blob/master/hdk/LICENSE.txt#...](https://github.com/aws/aws-
fpga/blob/master/hdk/LICENSE.txt#L96-L98)).

------
scott00
I'd love to hear what people are using the f1 instances for. I would have
guessed that most systems with the scale to make FPGA development economical
would also make operating a datacenter economical and thus not be running on
AWS. (But I don't know very much about FPGAs, and Amazon biz devs are no
dummies, so I'm sure there are plenty of use cases.)

~~~
sidmontu
My first guess would be inference for various types of deep neural nets.

------
webaholic
Now all we need is for someone to port chisel3 to work on these fpgas.

~~~
Cyph0n
Doesn't Chisel (in theory) support all FPGA tools out of the box? Chisel
compiles to Verilog[1], so all you would need to do is import the resulting
Verilog into the Xilinx toolchain, then test and synthesize it.

[1]: The full process is Chisel -> Firrtl -> Verilog, which is analogous to
C++ -> LLVM IR -> ASM.

~~~
aseipp
Yeah, Chisel shouldn't have a problem with that part. I use Clash (the Haskell
equivalent, but not a DSL) quite a lot with a variety of FPGA tools and it
tends to work pretty well.

The real task, of course, is binding up all those IP interfaces into nice
type-safe Chisel interfaces for users... That's always a huge pain, especially
for a device of this class -- where it's going to be PCIe and ethernet
interfaces you want to use...

~~~
ktta
Have you used Chisel?

I've only worked with Verilog but Chisel and Clash are fascinating to me. A
lot of people seem to use Chisel, including the people related to RISC-V
development.

How do you suggest getting started with Clash?

------
nraynaud
Does anyone know how they prevent customers from destroying the FPGA with
malicious bitstream?

~~~
AWS_F1
Refer to jeffbarr's answer: Great question! I checked in with the team and
this is what they told me: "The developer FPGA code is enclaved inside AWS
FPGA Shell, to prevent malicious FPGA code from damaging the hardware and to
provide the necessary protection for PCI Express and the host machine. The pin
assignment of the FPGA is controlled by AWS." And "AWS infrastructure monitors
the thermals as well and the F1 hardware was designed to sustain high power
consumption to enable developers to utilize the maximal available FPGA
resources and frequency."

------
mars4rp
Where are the pins connected to? Is the FPGA connected to any disk that can
hold big data?

~~~
AWS_F1
The FPGA pins are connected to the host CPU via PCIe Gen3 and to 4 local DDR4
channels for each FPGA, and if you are using the f1.16xlarge, there are also
pins connecting the FPGAs to each other.

Both f1.2xlarge and f1.16xlarge have NVMe SSD storage, attached as a PCIe
device to the host and not connected directly to the FPGA. One could consider
using standard Linux NVMe drivers or SPDK user space drivers for high
throughput and low latency data movement between the NVMe SSD and the FPGA.

------
mikek
Can someone ELI5 this to me? I am not a hardware expert but am a programmer.

~~~
GrumpyYoungMan
ELI5, hmmm? Okay, I'll take a stab at it.

For some types of processing, it's vastly more efficient to implement
specially-designed digital circuits to do the work instead of using a regular
CPU. If the need is high-volume enough, these can be fabricated on custom
silicon chips. Common examples are DSPs, GPUs, and custom Bitcoin mining
chips.

For low-volume applications where it's not cost efficient to fabricate custom
chips, there's a specialized type of "generic" chip known as a field-
programmable gate array or FPGA. These FPGAs contain a grid of digital logic
gates (to oversimplify a bit) and programmable interconnections that allow
them to be configured to create any type of digital circuit. While it can't
run as fast as a fully custom silicon chip, it's still fast enough to get a
tremendous speedup where a custom digital circuit design is beneficial.

Now that Amazon has made EC2 instances with FPGA accelerator cards available,
ordinary users with the need for these custom digital circuits now have access
to these specialized devices without having to make the enormous upfront cash
investment of purchasing and operating servers with these FPGA accelerator
cards themselves.

Addendum: It's also worth noting that, as Moore's law slowly grinds to a halt,
building custom digital circuits using FPGAs for specific processing
needs is one of the few promising ways remaining for getting a big boost in
compute-intensive app performance in the future. By making these available to
a wide audience, Amazon is effectively accelerating the speed at which this
type of technology will make it to ordinary desktop computing.
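
To make the "configurable circuit" idea a little more concrete: much of an
FPGA's logic is built from small lookup tables (LUTs) whose stored bits decide
which logic function they compute. The Python model below is a toy sketch of
that single idea, with made-up names (`make_lut`, `and4`, `xor4`) and an
assumed 4-input LUT; a real device adds flip-flops, routing, DSP and memory
blocks, and is configured through vendor tools rather than anything like this.

```python
def make_lut(truth_table_bits, num_inputs=4):
    """Return a function that behaves like a LUT loaded with the given bits.

    Bit i of truth_table_bits is the output for the input combination whose
    binary value is i (the first input is the least significant bit).
    """
    table_size = 1 << num_inputs
    if not 0 <= truth_table_bits < (1 << table_size):
        raise ValueError("truth table does not fit this LUT")

    def lut(*inputs):
        if len(inputs) != num_inputs:
            raise ValueError(f"expected {num_inputs} inputs")
        # Pack the input bits into an index, then read that bit of the table.
        index = sum((1 if bit else 0) << pos for pos, bit in enumerate(inputs))
        return (truth_table_bits >> index) & 1

    return lut


# The same "hardware" becomes a different gate depending on its configuration.
and4 = make_lut(0b1000_0000_0000_0000)  # 1 only when all four inputs are 1
xor4 = make_lut(0b0110_1001_1001_0110)  # 1 when an odd number of inputs are 1

print(and4(1, 1, 1, 1), and4(1, 0, 1, 1))
print(xor4(1, 0, 0, 0), xor4(1, 1, 0, 0))
```

Reprogramming the FPGA amounts to loading new table contents and new wiring
between many such elements, which is why one chip can stand in for many
different circuits.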

~~~
jononor
The primary competitor for FPGAs at the moment is not so much CPUs as GPUs.
Will be interesting to see what kind of systems will end up using FPGAs in
datacenters.

Low-latency inference using neural networks could be one. Especially if the
practice of "quantizing" the networks (using <32-bit integers), as Google does
with their TF chips, takes off.
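
For anyone unfamiliar with quantizing, here is a minimal sketch of one common
scheme, symmetric linear quantization to 8-bit integers with a single shared
scale factor. It is not a description of what Google or any framework actually
does; the function names and the example weights are made up for illustration.

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.0, 0.61]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("worst round-trip error:", worst_error)
```

The appeal for FPGAs is that small integer multipliers and adders take far
less logic than 32-bit floating-point units, so many more of them fit on one
chip, at the cost of a rounding error of up to half a step per weight.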

------
Briney
Any word on F1 pricing yet?

~~~
luhn
Yeah, F1s are generally available in US East now, so the pricing is listed.

f1.2xlarge: $1.65 per hour

f1.16xlarge: $13.20 per hour

------
fabmilo
I have limited experience with FPGA programming, but I wonder if you could
translate a TensorFlow model (or part of it) using XLA to FPGA bytecode and
get a serious speedup compared to using GPUs.

~~~
jacquesm
You might be able to create something along the lines of Google's TPU in a
large enough FPGA (with a large enough memory bank attached), but it would
cost a small fortune to run, likely without enough benefit over renting a GPU
instance instead. But it would be very interesting to see how far that could
be pushed and what size models could be run on it.

GPUs are hard to beat on price, and the only reason Google made the TPU in the
first place is because it is an ASIC, which has a _much_ better
price/performance ratio (once you make enough of them) than either a GPU or an
FPGA, at the cost of not being able to change the design easily.
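
The "once you make enough of them" part can be sketched as a simple break-even
calculation: an ASIC has a large one-time design cost but a low cost per chip,
so it wins only above some volume. Everything in the example below is a
hypothetical placeholder (the function name and all dollar figures), and it
assumes one ASIC replaces exactly one GPU or FPGA of equal performance.

```python
import math


def asic_break_even_units(nre_cost, asic_unit_cost, alternative_unit_cost):
    """Smallest unit count at which the ASIC's total cost is no higher.

    ASIC total        = nre_cost + n * asic_unit_cost
    Alternative total = n * alternative_unit_cost
    Returns None if the ASIC is not cheaper per unit, since it then never
    recovers its one-time cost.
    """
    saving_per_unit = alternative_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        return None
    return math.ceil(nre_cost / saving_per_unit)


# Hypothetical figures: $20M one-time design cost, $500 per ASIC,
# $5,000 per off-the-shelf accelerator.
units = asic_break_even_units(
    nre_cost=20_000_000,
    asic_unit_cost=500,
    alternative_unit_cost=5_000,
)
print("ASIC pays off from", units, "units")
```

Below that volume the off-the-shelf part is cheaper overall, which fits the
point that a custom chip only makes sense for someone deploying at very large
scale.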

