
Intel Gears Up for FPGA Push
https://www.nextplatform.com/2017/10/02/intel-gears-fpga-push/
======
nimish
Tooling tooling tooling. FPGA hardware is cool but the tooling fucking sucks.
It's like stepping back to a time when you needed to pay thousands of dollars
to get decent compilers and debuggers.

The ability for anyone to develop software for CPUs at zero cost is an amazing
freedom. You literally cannot do that on certain FPGAs: not only do the chips
cost thousands of dollars, so do the tools needed to actually create a working
design.

Until that changes, FPGAs will always be niche.

~~~
nakedrobot2
FPGAs are extremely powerful, but the toolchain is absolute dog shit. Xilinx,
shame on you. Intel, will you change anything here?

The compounding problem in practice is that the pool of people with the right
strain of semi-insanity to do this really well is very small. And most of them
are tied up with a massive salary from some aerospace company such as
Raytheon, GE, or Honeywell, or from someone like Philips (building medical
devices).

Add to that compile times of an hour or a day, and you have a glacial
development cycle on your hands.

~~~
jalons
HFT also still pays top dollar for competent FPGA developers. I'd wager
they're willing to pay more than the listed industries too, especially if you
factor in profit sharing and bonus structures.

------
burntrelish1273
The biggest problem with reconfigurable computing is that it's either an
afterthought, locked behind tools that are too narrow/proprietary, or relegated
to an extra add-in card.

Add a million LUTs and several thousand special-purpose blocks (ALUs, CAM,
SRAM, DSPs, etc.) on the CPU die that can be reconfigured within a few tens of
thousands of cycles (i.e., on the order of a process context switch), and
future AI-enabled optimizing compilers could incrementally profile and
accelerate applications with reconfigurable resources in new, creative and
interesting ways. IRAM and similar zillion-core designs are another way to
attack traditional performance-locality bottlenecks: distribute processing,
interconnect and RAM amongst each other. It makes the most sense to co-evolve
such a radical undertaking gradually with compiler vendors and large customers
so that support and usability are working from launch, rather than just
throwing silicon over the wall and using the "hail mary" method of product
(non)design.

PS: Intel could push almost any new on-die CPU technology/service/product into
the mainstream right now on the server side because of their de facto
near-monopoly there. I don't think they will, as there is immense
organizational pressure not to innovate _too much._

~~~
fulafel
I thought FPGAs could only be reprogrammed a limited number of times, enough
to preclude this level of dynamism?

~~~
sannee
Not really. The FPGA fabric itself is SRAM-based and can be reconfigured
essentially without limit. FPGAs tend to be accompanied by a flash memory
storing the configuration, which of course has limited erase cycles, but you
can also load the configuration through other means.

------
jackyinger
If they do release something in this space they oughta make a SKU that can
plop into a standard desktop socket on a standard motherboard.

Electrical and computer engineers want cheap, big FPGAs too!
(Speaking for myself of course.)

Or just release a cheap PCIe card! C'mon guys! Oh wait you've forgotten how
important it is to get a developer community started around your hardware
cause it's been so long since the x86 was new!

~~~
spaceseaman
> Oh wait you've forgotten how important it is to get a developer community
> started around your hardware cause it's been so long since the x86 was new!

Intel: Huh? Developers? But they don't buy servers full of chips. Why do I
have to sell to developers?

~~~
StillBored
That is pretty much the ARM and POWER problem too: they think they can grow
into new markets with these hugely expensive machines sold in quantities of
ten racks.

Both would have significantly more traction if they offered a reasonable
desktop-class machine, but they don't seem to be able to do it. For some
reason there are dozens of RPi-type devices, but making a $200-$300 device
with a reasonable set of expansion ports (SATA + USB 3 + PCIe Gen3 + M.2 +
10G Ethernet, etc.) seems all but impossible. POWER has a similar problem,
with the only inexpensive devices being old NXP parts (yeah, NXP is still
selling a core that turns 20 years old this year
[https://www.nxp.com/products/microcontrollers-and-processors...](https://www.nxp.com/products/microcontrollers-and-processors/power-architecture-processors/powerquicc-processors/powerquicc-ii:PCPPCMPC82XX);
that's great if you're still selling car parts from 20 years ago, not so
great if you're designing a part today).

~~~
pm215
The underlying problem here is that the cost of making an SoC is enormous. So
to make, say, an ARM desktop class machine you have only a few choices:

1) Make your own SoC designed for the purpose. You can tailor it to meet your
requirements precisely, but given the low volumes you'll be selling, your
system will be at least $10,000 a box, likely more.

2) Use a designed-for-mobile SoC. This will hit your $200-$300 price point
pretty easily (perhaps even sailing under it by a big margin), but the I/O
options will be bad, because mobile SoCs don't need SATA or PCIe. CPU
performance is likely to be underwhelming because mobile SoCs are designed to
hit a power consumption level that won't drain mobile batteries or make mobile
devices overheat. Almost all the cheap devboards you can get today are in this
category.

3) More recently we're starting to see server or networking SoCs, which give
you an option a bit like 2 but with different cost and expansion tradeoffs.
Generally a bit more expensive than 2 because networking won't hit mobile
volumes, but better I/O capability. CPU power may still be less than you might
like. Examples here are the Macchiatobin board and the dev box that Socionext
announced last week.

I don't think anybody disagrees that a proper desktop class machine for these
architectures would be great; but there are huge economic barriers to getting
there. Personally if I was looking at getting a new ARM setup I'd try
something in class 2, likely the Socionext box when it becomes available (end
of the year, I think they said).

~~~
bradfa
Raptor Computing Systems will sell you a 4-core POWER9 CPU for $340 [1]. A
desktop mobo with the kinds of features discussed shouldn't be more than a few
hundred dollars if slimmed down to one socket and a reasonable but limited set
of peripherals (the Talos mobos go a bit over the top with features and hence
are >$1k). Someone just has to design one, which is not easy or inexpensive,
and then hope that developers actually buy it.

[1]:[https://secure.raptorcs.com/content/CP9M01/purchase.html](https://secure.raptorcs.com/content/CP9M01/purchase.html)

~~~
pm215
It's a few hundred dollars _if_ you have the market volume to sell it at that
price. IBM have the volume (presumably mostly for servers) so they can do it.
You can't spend the amount it costs to make a new SoC based only on the "hope"
that people buy it; you need a business plan that says where the volume will
come from, and there aren't enough customers for a developer box alone to
provide that volume, so the dev box uses will always be a sideline from
something else.

~~~
wmf
I think you're the only one in this thread asking for a new SoC. Other people
are asking for affordable developer boards for existing SoCs.

~~~
pm215
You can get lots of dev boards for existing SoCs. They have the problems of no
PCIe, no SATA, etc. that the earlier poster was complaining about. If you want
those features you first have to identify an SoC that has them; my claim is
essentially that pre-existing SoCs with the kind of features you want in a
"developer box" are so thin on the ground as to be pretty nearly nonexistent.
"Use a pre-existing SoC" is what my classes 2 and 3 are.

------
sitkack
The whole thing smells like a monopoly dragging its heels. It bought the
competition to slow adoption, claiming that CPU-FPGA hybrids were just around
the corner. What we need is a new FPGA-based fabric that has the major
building blocks to create any number of RISC-V cores: tiles that can be RISC-V
cores, or, if "sacrificed", turn into clusters of LABs for general-purpose
use. And the chip company with the software vision for this is AMD with HSAIL.

------
zw123456
FPGAs are the shit, but the software is the key. To make them really fly we
need some really brilliant software to configure them. Someone is going to
disrupt this space soon, and whoever does will, I think, build a unicorn. Just
my guess.

~~~
chisleu
They bought one of the best hardware makers with one of the worst software
platforms. I hope to all that is holy that they replaced the software
entirely. What a nightmare those classes were...

~~~
ngsayjoe
Yes Quartus is complete crap! I wrote some of it.

~~~
davrosthedalek
To be fair, Quartus was way ahead of Xilinx's offering until Xilinx
leapfrogged it with Vivado.

------
DesiLurker
What I am interested in is a programmable vector fabric that I can reconfigure
fast (in, say, L2/L3 access times). Right now there are hundreds of AVX2
instructions, and as the number of HPC applications grows this is only going
to explode. My problem is that when you are using one vector instruction, the
silicon for the rest is just sitting around, when it could perhaps be used to
make a much wider SIMD unit for just the instructions I'll be using. If they
can get just that right, IMO it'll be a huge success.
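
For contrast, this is what the fixed-width path looks like today: an AVX SAXPY
kernel where the lane count is hard-wired at eight floats, no matter which
other vector silicon sits idle. The intrinsics are the standard ones from
`immintrin.h`; this assumes an AVX-capable x86 CPU and something like
`gcc -O2 -mavx`.

```c
/* Fixed-width SIMD as it exists today: 8 float lanes per AVX register,
 * regardless of what other vector hardware is idle on the die.
 * Assumes an AVX-capable x86 CPU; build with e.g. gcc -O2 -mavx. */
#include <immintrin.h>
#include <stdio.h>

/* y[i] += a * x[i]; n is assumed to be a multiple of 8 for brevity */
static void saxpy_avx(float a, const float *x, float *y, size_t n) {
    __m256 va = _mm256_set1_ps(a);
    for (size_t i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_add_ps(vy, _mm256_mul_ps(va, vx)));
    }
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    saxpy_avx(2.0f, x, y, 8);
    for (int i = 0; i < 8; i++)
        printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```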

IMO the problem with FPGA or even silicon dev is mostly tooling. I often say
this: the biggest contribution to the open source movement is not Linus/Linux,
it's GCC. Just imagine if "they" could tangle up every bit of open source code
in IP litigation emerging from proprietary compilers.

------
amelius
I think Intel or somebody else should put more effort into whetting our
appetite with possible applications.

What long list of applications can CPU+FPGA bring us that a CPU or GPU can't?

~~~
gcp
The applications are the same, but an FPGA allows extremely narrow
optimization. Got something that doesn't parallelize all that well, but can be
pipelined, or benefits from a wide datapath? Design that. Got something that
parallelizes well and only needs a tiny datapath? Make it a 1000-wide vector
machine. It needs difficult control flow? No problem; we're flexible, as we're
not a GPU, etc.

The optimization needs to beat out the inefficiency of the FPGA compared to an
ASIC though (not to speak of development time). But sometimes, you can do just
that.

------
AriaMinaei
In the case of Lisp and Smalltalk and other highly dynamic languages, would
these on-die FPGAs open new compiler optimisation opportunities?

~~~
simias
The problem is that tailor-made FPGA designs won't run nearly as fast as a
modern ASIC CPU. That's why, for instance, FPGA emulators are not more common:
for older hardware, software implementations are accurate enough and manage to
run in real time, so the additional cost and complexity of an FPGA is not
warranted.

For more modern hardware, software implementations can't reach full accuracy
in real time, but FPGAs big and fast enough to handle those designs are
prohibitively expensive (if they even exist in the first place). That's why
FPGA Game Boy emulators are not popular and you probably won't see an FPGA PS3
emulator any time soon (unless pricing changes dramatically in the near
future).

FPGAs shine when you need to process very high-throughput data with low (or at
least constant) latency, or for special-purpose algorithms with no hardware
acceleration available on a CPU or GPU (video codecs, crypto, computer vision,
etc.). But in general, when an algorithm becomes popular enough (AES, SHA-256,
H.264...), hardware support eventually gets baked into the ASIC, with much
better performance and power consumption than an FPGA.
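
A rough sketch of how software typically picks up that baked-in support once
it ships: probe the CPUID feature bit and dispatch. The `__builtin_cpu_supports`
probe is a real GCC/Clang builtin on x86; the two AES routines below are empty
stand-ins, not real implementations.

```c
/* Feature-probe-and-dispatch, the usual way libraries adopt new dedicated
 * silicon. __builtin_cpu_supports is a GCC/Clang builtin on x86; the two
 * "AES" routines are placeholders, not real crypto. */
#include <stdio.h>
#include <string.h>

static void aes_encrypt_sw(const unsigned char in[16], unsigned char out[16]) {
    memcpy(out, in, 16);   /* placeholder for a table-based software AES */
}

static void aes_encrypt_ni(const unsigned char in[16], unsigned char out[16]) {
    memcpy(out, in, 16);   /* placeholder for the AES-NI instruction path */
}

int main(void) {
    int have_aesni = __builtin_cpu_supports("aes");
    void (*encrypt)(const unsigned char[16], unsigned char[16]) =
        have_aesni ? aes_encrypt_ni : aes_encrypt_sw;

    unsigned char in[16] = {0}, out[16];
    encrypt(in, out);
    printf("dispatched to the %s path\n", have_aesni ? "AES-NI" : "software");
    return 0;
}
```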

I see potential for FPGAs in CPUs for professional applications where you want
to process big data streams without having to rely on external hardware. For
instance, I work in broadcast video transmission, where we routinely handle
uncompressed HD or even 4K streams; being able to prototype directly on my
workstation's CPU would be pretty cool.

As for general-purpose programming language optimization, I have a hard time
imagining what it would look like. The problem is that you don't code for an
FPGA the way you code for a CPU. To put it very simply, on a CPU serial
execution is cheap while forking and synchronizing threads is expensive; on an
FPGA it's basically the other way around. Automatically transforming one form
into the other in a compiler or JIT sounds very much non-trivial. Maybe I just
don't have enough imagination.

~~~
AriaMinaei
I remember a talk by Alan Kay where he mentioned a number of problems commonly
solved serially, such as 2D layout and typesetting, and then he demonstrated a
parallel algorithm for each problem. If I remember and understand correctly (I
may not), one of his points was that such algorithms can become trivially
implementable in a dynamic language and would run fast too, if the hardware
supported some of the primitives of that dynamic language natively. He
expressed hope that FPGAs may create such an opportunity.

~~~
selimthegrim
Do you remember the title or have a link?

~~~
morphle
Alan Kay gave a talk at Qualcomm in San Diego, on October 30, 2013. "Is it
really complex or did we just make it complicated?"
[https://vimeo.com/82301919](https://vimeo.com/82301919)

~~~
AriaMinaei
Was just going to post that link. It's a very nice talk. Highly recommended!

------
tim--
What's new here anyway? Intel bought Altera, and Altera have had development
kits similar to this for a long time.

What is the price of the new FPGA add-on cards? That is what will make the
difference here for a 'new push'.

~~~
makomk
The new thing is that the embedded hard CPU core is some flavour of Xeon
rather than ARM or MIPS.

------
murph-almighty
Does this mean we can develop using some not-Xilinx proprietary crapware?

~~~
ngsayjoe
You can always use third-party EDA tools if you have the money, rather than
the vendor-bundled free software. E.g. Cadence, Mentor Graphics, etc.

~~~
gcp
Don't you need the vendor-specific backend for the physical P&R?

~~~
LoweD
Yes, you do need the vendor's place-and-route. PAR performance is as good as
you could expect though.

Most complaints about vendor software are closer to the front of the flow.
Examples...

* Poor language support (applies for both SystemVerilog and VHDL)

* Not enough transparency and access to primitives and IP blocks, resulting in poor ability to automate

* Generally buggy elaboration and synthesis results, sometimes even causing the tool to crash

My opinion is that the FPGA companies spend too much money improving the HLS
(c-to-gates) and IP wizard experience, in an attempt to make their devices
more accessible to the mythical software engineer who wants to use an FPGA.

They should have spent that time and money supporting language standards, and
improving the RTL experience, which is how most engineers use their products.
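
For anyone who hasn't seen the HLS (C-to-gates) flow being criticized above,
its input looks roughly like this: plain C annotated with vendor pragmas. The
pragma spellings below follow Xilinx Vivado HLS; an ordinary C compiler simply
ignores them, and the FIR filter is only a toy example.

```c
/* Toy 8-tap FIR filter in the style an HLS (C-to-gates) tool consumes.
 * The pragmas follow Xilinx Vivado HLS spelling and are ignored by gcc,
 * so this still compiles and runs as ordinary C. */
#include <stdio.h>

void fir8(const int x[64], const int coeff[8], int y[64]) {
    for (int n = 7; n < 64; n++) {
#pragma HLS PIPELINE II=1       /* ask the tool for one output per clock */
        int acc = 0;
        for (int k = 0; k < 8; k++) {
#pragma HLS UNROLL              /* unroll into 8 parallel multiply-adds */
            acc += coeff[k] * x[n - k];
        }
        y[n] = acc;
    }
}

int main(void) {
    int x[64], coeff[8] = {1, 1, 1, 1, 1, 1, 1, 1}, y[64] = {0};
    for (int i = 0; i < 64; i++)
        x[i] = i;
    fir8(x, coeff, y);
    printf("y[63] = %d\n", y[63]);   /* sum of x[56..63] = 476 */
    return 0;
}
```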

------
agumonkey
I can't help but think that FPGAs would marry well with highly dynamic, meta-
loving languages like CL (or similar).

------
ww520
Could a JIT target an FPGA? Use the JIT's profiling info to reconfigure the
chip on the fly. Would there be enough speedup to justify the effort?

~~~
adrianratnapala
It sounds like it would be little different from a CPU implemented on an FPGA.
True, you want to use run-time info to optimize on the fly, but my guess is
that modern branch predictors, caches, etc. already do as much of this stuff
as we know how to do.

In the unforeseeable future, who knows, maybe the predictors will become so
flexible that even conventional CPUs evolve into something basically like
FPGAs.

------
katastic
I've honestly been waiting for this for YEARS.

I've had a vision for "the future of computing": FPGAs that reconfigure
themselves (these exist) to become whatever macro-level hardware assets your
computer needs. Running tons of SHA-256 hashes per second? The CPU/OS/OpenSSL
(or whatever) detects this condition and switches from hand-coded software
routines to an IP core that ships with the CPU. The CPU flashes the FPGAs to
become SHA-256 "cores" and now you're running 4096x the output with less heat.
(CPUs are designed for one thing, "few, large, complex cases," while FPGAs are
perfect for "many, parallel, simple cases," even more so than a GPU.) Now you
shut down your hashing and switch to video encoding, or Doom 2019, and your
CPU reflashes the fabric (Altera specialized in PARTIALLY reconfigurable
FPGAs, so you don't have to nuke the entire FPGA, only sections) and adds
cores for video, or physics, or "shader units".

This would be hard for a single person, but any large company could handle
building it. You can even do it with off-the-shelf FPGAs. The biggest problems
are: 1) bandwidth/latency: the "macro" function needs to be big enough to
outweigh the latency hit of asking the FPGA rather than computing it locally
(Intel's on-CPU FPGA would offer insanely fast access); and 2) how do you get
people to use it? The simplest approach is, of course, only supporting code
that explicitly requests it. But you can also take libraries that perform
common, encapsulated macro functionality, like OpenGL, a physics library, or
OpenSSL, where people don't care about the inner code ("how it gets done"),
only the result. Asking for floats to be multiplied would be bad; asking for a
cross product would be much better; asking for an SHA-512 digest would be
super.
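
A back-of-envelope model of that granularity argument. Every number below is
made up purely for illustration; the point is just that a call only wins on
the fabric when its work outweighs the fixed round-trip cost.

```c
/* Toy model: offloading pays off only when per-call work outweighs the
 * fixed round trip to the fabric. Every constant here is invented. */
#include <stdio.h>

#define FABRIC_ROUND_TRIP_NS 500.0   /* hypothetical request/response latency */
#define FABRIC_SPEEDUP        50.0   /* hypothetical throughput advantage */

static double fabric_time_ns(double cpu_time_ns) {
    return FABRIC_ROUND_TRIP_NS + cpu_time_ns / FABRIC_SPEEDUP;
}

int main(void) {
    struct { const char *op; double cpu_ns; } ops[] = {
        { "one float multiply",             1.0 },      /* too fine-grained */
        { "batch of 1,000 cross products",  10000.0 },
        { "SHA-512 over a 4 KB block",      6000.0 },
    };
    for (int i = 0; i < 3; i++) {
        double f = fabric_time_ns(ops[i].cpu_ns);
        printf("%-32s cpu %9.1f ns   fabric %9.1f ns   -> %s\n",
               ops[i].op, ops[i].cpu_ns, f,
               f < ops[i].cpu_ns ? "offload" : "keep on CPU");
    }
    return 0;
}
```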

And the benefit here is that you don't have to hardcode that functionality
into the CPU. The FPGA can have NEW or improved IP cores downloaded with
Windows Update every week.

Back when I was in college, I actually bought a Lattice dev kit with a PCI
Express card, dual gigabit Ethernet, DDR3, and a near top-of-the-line FPGA on
the board, and it cost a mere $100. Unfortunately, I was more of a software
guy so I really got in over my head (plus health issues set me back and have
never let up since), so I never got a working prototype built.

But it's still there! A huge opportunity waiting to be seized that could
really become another tier of "the standard PC," in the same way we think of
SSDs as "almost RAM" scratchpads, or GPUs as "CPUs for massive amounts of
simple decisions." Well, an FPGA is the "GPU of GPUs": even simpler decisions,
and insanely fast and parallel even at "low" (by CPU standards) clock rates of
400 MHz.

Here's an older (2009) project/research article that inspired me called the
512 FPGA cube.

[http://cc.doc.ic.ac.uk/projects/prj_cube/Welcome.html](http://cc.doc.ic.ac.uk/projects/prj_cube/Welcome.html)

[http://cc.doc.ic.ac.uk/projects/prj_cube/spl09cube.pdf](http://cc.doc.ic.ac.uk/projects/prj_cube/spl09cube.pdf)

And here's a direct link to the data table results between FPGA, FPGA cube,
and Xeon (and cluster of Xeons) trying to do the same work:

[https://i.imgur.com/byjmEDG.png](https://i.imgur.com/byjmEDG.png)

Those are massive differences in both power efficiency and compute rate:
72,000 watts of Xeons to match the speed of a single 832-watt cube. That's two
orders of magnitude!

I mean, imagine a world where they bothered to make FPGAs you could plug into
Ethernet, and have them configure themselves via a simple programming tool
that was "easy" for normal programmers to exploit, instead of requiring an
intense understanding of logic gates, propagation delay, and so on. A tool
that wasn't "as good as" a dedicated engineer, but 90% (or even 70%) as good,
at near-zero cost and effort. All of a sudden you could run tons of programs
and macro-sized functions as if you had personally stamped them into a printed
circuit yourself, but without spending millions on development.

I'm honestly not sure why this hasn't already happened. I can't be the only
"smart" person who came up with this idea. And the research (and practice with
Bitcoin miners) all points to a huge opportunity to be exploited if they could
lower the knowledge barrier to entry so you can basically "push a button" and
unleash an FPGA at a problem. Imagine LAPACK and BLAS with FPGA support.

~~~
booblik
> CPUs are designed for one thing, "few, large, complex cases," while FPGAs
> are perfect for "many, parallel, simple cases," even more so than a GPU.

FPGAs are quite good at parallel simple cases, that is correct, but they would
lose to GPUs in performance/watt in most cases. Where FPGAs really shine is in
parallel complex, non-uniform cases, especially cases that don’t map well to
the classic CPU instructions, but can easily be performed with small latency
on FPGAs.

~~~
scottlegrand2
FPGAs own low-latency computation (less than 1 microsecond) because GPUs
really need 3-20 microseconds to initialize after a kernel launch. This is why
they're used instead of GPUs at the front line of high-frequency trading. When
I was at a hedge fund, I tried in vain to get Nvidia to do something about
this, based on the unofficial work of another former Nvidia employee implying
it could be improved dramatically.

All that said, these are golden years to be a low-level programmer who
understands parallel algorithms whether you work in Tech or at a hedge fund
because there just aren't that many of us.

But the real problem with FPGAs is that even if they find another lucrative
application where they excel relative to GPUs, Nvidia can simply dedicate
transistors in their next GPU family to erasing that advantage, as they did
with 8-bit and 16-bit MAD instructions in Pascal and with the tensor cores in
Volta. Too bad they don't care about latency; otherwise I believe they could
displace FPGAs from HFT within a year or two of someone starting to use GPUs
there and winning.

------
cptskippy
I wonder if this will receive the Edison treatment.

------
m_mueller
... and presumably they are also preparing to drop that program in about 3-4
years.

~~~
nullc
If they ever release it in the first place (or if they ever let it get past a
point where you need to specifically contract with intel to get access to it).

------
0xbear
This is so obviously not going to fly unless they either start bundling with
Xeons, or offer something Xilinx can’t do. Cloud providers aren’t stupid, and
there are enough eggs in the Intel basket already.

~~~
pratumlabs
One of the key benefits of Intel's solution is that the CPU and FPGA share the
same RAM, avoiding the O(N) cost of moving data to and from the device. This
kind of zero-copy transaction can enable very high-performance applications
compared with dedicated discrete cards.
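
A crude way to see that O(N) point on the host side: time a pointer hand-off
against a staging copy of the same buffer. The memcpy merely stands in for the
copy a discrete card would need (a real PCIe transfer is slower still). Uses
POSIX clock_gettime, so it assumes Linux or similar.

```c
/* Crude illustration of the O(N) point: sharing a pointer with the device is
 * O(1), while staging a copy (as a discrete card requires) scales with the
 * buffer. Only host-side memcpy is timed here, as a stand-in for the copy. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    size_t n = 256UL * 1024 * 1024;              /* 256 MB working set */
    char *shared = malloc(n), *staging = malloc(n);
    if (!shared || !staging)
        return 1;
    memset(shared, 1, n);

    double t0 = now_ms();
    volatile char *device_view = shared;         /* "zero copy": hand off pointer */
    double t1 = now_ms();
    memcpy(staging, shared, n);                  /* staging copy, O(N) */
    double t2 = now_ms();

    printf("pointer hand-off: %.3f ms   staging copy: %.3f ms   (%zu bytes)\n",
           t1 - t0, t2 - t1, n);
    return device_view[0] - staging[0];          /* both are 1, so exit code 0 */
}
```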

~~~
dragontamer
AMD already does "zero-copy transfers" (the on-chip cache!!) with its "Fusion"
APUs (e.g. the A10-7850K) for CPU <--> GPU.

I'm not really seeing loads of people taking advantage of the feature however.
The platform is cheap and the technology is available, but it's just way too
weird an architecture to become mainstream.

There are numerous benefits: the CPU can create a linked list or graph, and
the memory will still be valid on the GPU. CPU / GPU atomics are unified, and
GPUs can even call CPU functions under AMD's HSA platform.

* [https://images.anandtech.com/doci/7677/20%20-%20HSA%20Use%20...](https://images.anandtech.com/doci/7677/20%20-%20HSA%20Use%20Cases.jpg)

* [http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf](http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf)

* [https://www.anandtech.com/show/7677/amd-kaveri-review-a8-760...](https://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/6)

---------

I think Intel had a similar technology implemented on their "Crystalwell"
chips, which were basically an L4 cache providing a high-bandwidth link
between the CPU and GPU (although not quite as flexible).

No, it's not an FPGA, but OpenCL / GPGPU compute seems to be a bit more
mainstream than FPGA compute at the moment. I haven't seen too much excitement
in general for this feature, however.

~~~
imtringued
> I'm not really seeing loads of people taking advantage of the feature
> however. The platform is cheap and the technology is available, but it's
> just way too weird an architecture to become mainstream.

AMD sells a consumer product. For most consumers even a smartphone offers
enough CPU and GPU performance. The content producers who care about
performance usually buy the best CPU and GPU. HSA isn't available on AMD's
Ryzen or Threadripper processors.

Intel is trying to sell to datacenters where performance or energy efficiency
is a major selling point.

~~~
dragontamer
> The content producers who care about performance usually buy the best CPU
> and GPU. HSA isn't available on AMD's Ryzen or Threadripper processors.

Raven Ridge will be based on Zen CPU cores and Vega GPU cores. But naturally,
Raven Ridge will be slower than Threadripper because the GPU will take up some
space (that otherwise would have been additional CPU cores).

Rumored specs for the Raven Ridge APU are 4 CPU cores and 11 GPU compute
units. In contrast, Threadripper has up to 16 CPU cores and Vega 64 has 64 GPU
compute units, separated by a PCIe x16 link.

So basically, it's the price you pay for sticking so many things onto a single
package. There are thermal limits, as well as manufacturing limits (i.e.
practical yield sizes), on how large these chips can be.

If you want the best of both worlds, like an EPYC CPU with Vega 64 or a high-
end NVidia Pascal / Volta chip, you'll need to buy a dedicated GPU and a
dedicated CPU. True, a hybrid chip like Raven Ridge (or any of the AMD HSA
stuff) has benefits with regards to communication, but the penalty to CPU
speed and/or GPU speed seems to be huge.

-----------

I personally expect that if any "mainstream" FPGA solutions come out, they'll
be connected over PCIe and not merged into the CPU. There seem to be too many
heat and manufacturing issues with a merged product compared to standard PCIe
x16, which is quite fast.

Alternatively, certain tasks (like cryptography) can be accelerated using
dedicated instructions, like the Intel AES-NI instruction set, or Intel's
Quick Sync H.264 encoding solution. Fully dedicated chip space (like AES-NI)
is way faster and more power-efficient than an FPGA, after all.
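
For reference, this is what that dedicated chip space looks like from
software: each of the ten AES-128 rounds maps to a single AES-NI instruction.
The round keys below are placeholders rather than a real key schedule (a real
implementation expands them, e.g. with `_mm_aeskeygenassist_si128`), so the
output isn't a valid AES ciphertext; it just shows the instruction-per-round
shape. Assumes an AES-NI-capable x86 CPU and something like `gcc -O2 -maes`.

```c
/* AES-128 encryption skeleton using AES-NI: one instruction per round.
 * Round keys are placeholders, NOT a real key schedule, so the output is
 * not valid AES ciphertext. Assumes AES-NI hardware; build with gcc -O2 -maes. */
#include <wmmintrin.h>   /* AES-NI intrinsics */
#include <stdio.h>

static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
    block = _mm_xor_si128(block, rk[0]);              /* initial AddRoundKey */
    for (int i = 1; i < 10; i++)
        block = _mm_aesenc_si128(block, rk[i]);       /* one full round per instruction */
    return _mm_aesenclast_si128(block, rk[10]);       /* last round, no MixColumns */
}

int main(void) {
    __m128i rk[11];
    for (int i = 0; i < 11; i++)
        rk[i] = _mm_set1_epi32(i);                    /* placeholder round keys */

    unsigned char pt[16] = "example plaintxt", ct[16];
    __m128i c = aes128_encrypt_block(_mm_loadu_si128((const __m128i *)pt), rk);
    _mm_storeu_si128((__m128i *)ct, c);

    for (int i = 0; i < 16; i++)
        printf("%02x", ct[i]);
    printf("\n");
    return 0;
}
```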

