
Intel Marrying FPGA, Beefy Broadwell for Open Compute Future - walterbell
http://www.nextplatform.com/2016/03/14/intel-marrying-fpga-beefy-broadwell-open-compute-future/
======
trengrj
Does an Open Compute Future include Intel's ME?

Intel's Management Engine (which is on all modern Intel chips) acts as an
unverifiable second processor with memory and network access, running closed,
proprietary software. Its existence precludes any real security against
state-level actors.

~~~
winter_blue
Woah, I didn't know this existed! This is really serious. Why aren't people
talking more about this?

~~~
beeboop
Intel ME has very useful, legitimate purposes. It's a very powerful tool for
some people, and most people would call you a conspiracy theorist for
suggesting that Intel, or someone who manages to get security credentials
from Intel, would misuse them to break into your computer. I have also had
allegedly advanced systems engineers tell me that there is really no risk of
an outside party compromising your system through this technology, but I
remain pretty skeptical of that claim.

In short, the overlap between people who know about it and people who are
security conscious is pretty small. There are also dozens of other things to
be more concerned about in terms of a corporate or state actor gaining
unauthorized access to your computer.

~~~
PaulHoule
I am much more afraid of non-state actors getting access to back doors than
of state actors. The people at the NSA might not be angels, but they have
some sense of ethics and some controls; other people out there don't. What if
someone like Snowden or the Walker brothers gets access to it and makes it
open source, or sells it to ISIS?

------
pjc50
The critical issue here is whether this is really "open".

The x86 platform, like most processors, has a documented instruction set and
software loading process. (There are undocumented corners, but the "front
door" is open.) Historically, by contrast, almost all FPGAs have had fully
closed bitstream formats and loading procedures. This necessitates the use of
the manufacturer's software, which is (a) often _terrible_ and (b) usually
restricted to Verilog and VHDL.

If Intel ship a genuinely open set of tools, then all manner of wonderful
things could be built on the FPGA, dynamically. That requires being open down
to the bitstream level, which _also_ requires that the system is designed so
that no bitstream can damage the FPGA.

To me this is most interesting not at the server level but at the "IoT"
level, if they start making Edison or NUC boards that expose the FPGA to a
useful extent.

~~~
bravo22
(a) You don't have to use any vendor's IDE, only their placement and
synthesis tools, which all support the command line. (b) Verilog and VHDL are
the two dominant HDLs; what other language would you want to program in?

~~~
HanW
b) Bluespec

~~~
bravo22
Bluespec is a set of SystemVerilog extensions, so it still falls under
Verilog support. The Bluespec compiler emits Verilog RTL that can be fed into
other tools, like Xilinx's synthesizer. So yes, you can use Bluespec with
Xilinx/Altera tools today.

It's also new, and thus not known and used by thousands of RTL coders every
day.

------
sitkack
AMD HSA is more compelling. These look like they will be just as hard to
program as an add-in accelerator card; only the bandwidth between the CPU and
the FPGA is higher. Everything is merging into an amorphous blob: FPGAs have
been adding hard blocks for years, and GPUs have been adding scalar
accelerators. The vector and the scalar, the hardwired and the adaptable, are
all becoming one. Hell, even the static languages are adding dynamism and the
dynamic languages are adding types. Floating point is becoming succinct [0].
Computation is becoming a continuum.

[0] [http://johngustafson.net/unums.html](http://johngustafson.net/unums.html)

~~~
btown
That Gustafson link is intriguing - efficient floating point without the
drawbacks. Previous HN discussion:
[https://news.ycombinator.com/item?id=9943589](https://news.ycombinator.com/item?id=9943589)

------
arc776
Programmable logic on chips will be INCREDIBLE for intrinsic hardware
evolution, a slowly emerging science. This is huge for AI and electronics.

I've been waiting to see this kind of thing for years, ever since I read
Adrian Thompson's work on evolution with FPGAs, in which he:

"Evolved a tone discriminator using fewer than 40 programmable logic gates and
no clock signal in a FPGA" (slides:
[https://static.aminer.org/pdf/PDF/000/308/779/an_evolved_cir...](https://static.aminer.org/pdf/PDF/000/308/779/an_evolved_circuit_intrinsic_in_silicon_entwined_with_physics.pdf))

EDIT: Full paper:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.9691&rep=rep1&type=pdf)

The field has crawled along pretty slowly since then as far as I can tell.

However, this could be a HUGE thing for computing; developers would finally
have a way to create hardware that interacts directly with the physical world
in ways we haven't thought of yet. As a small example, Thompson's work
revealed gates that were not connected to the main circuit path, yet were
electromagnetically coupled to it in novel ways and were required for the
circuit to work. Using evolution, in time we should be able to come up with
unique solutions to hardware problems not envisaged by human designers.

This is really exciting.

------
mchahn
For many years, any hardware acceleration card failed pretty quickly, because
CPUs were advancing so rapidly that dedicated hardware could not keep up.
Apparently we have reached the end of that era. With GPUs, and things like
this FPGA integration, hardware matters again.

~~~
0x07c0
+1

Before, the difference between highly optimized code and OK code was maybe a
2-3x speedup: roughly one Moore's-law doubling. With heterogeneous computing
it is more like 20-30x or more. And Moore's law is dead! This will change a
lot in the IT world ("more servers are the solution", "developer time is more
expensive than computer time", etc.). Learn C; down-on-the-metal programming
is back. The future is heterogeneous parallel computing.

~~~
sitkack
No, native won't help you. Native is a red herring. C is not the correct
abstraction level for taking advantage of heterogeneous parallel hardware.
The advantage of Rust isn't that it is native; the advantage is that it
removes the GC, and with it the pressure on the memory subsystem and the
latencies involved in compaction. The HotSpot JIT produces code as fast as or
faster than a C compiler. One could design a language that is high level and
removes the GC through an affine type system. I predict there will be a
hybrid language with gradual-affine typing that marries a GC, escape
analysis, and use-at-most-once semantics.

~~~
0x07c0
I wish that were true, but it's not what I'm seeing (I'm doing HPC). It's not
about native C performance vs. some other language; it's about the low-level
stuff you can do in C. You use AVX (the compilers are supposed to help here,
but don't do it very well, so you have to use intrinsics or asm), then memory
stuff: cache blocking, alignment, non-temporal loads and stores. Same for
CUDA: the compiler doesn't get you that much performance. You have to think
about all the low-level stuff, usually memory (alignment, whether to use
shared memory, cache line size, etc.). And then you are using multiple GPUs,
with no help from the compiler; you have to do it all yourself. It would be
nice if the compiler did it, and there are some compilers that help. But you
don't get max performance that way, and with some effort the performance you
get by hand-coding all this is much greater than what compilers can give you.
And that advantage is increasing.
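
As a minimal sketch of the kind of thing I mean (my own toy example: it
assumes 32-byte-aligned buffers and an AVX-capable CPU, compile with -mavx),
scaling an array with intrinsics and non-temporal stores:

    /* Scale an array with AVX intrinsics. Both pointers are assumed to be
       32-byte aligned; the stores are non-temporal so the output doesn't
       pollute the cache when we won't read it again soon. */
    #include <immintrin.h>
    #include <stddef.h>

    void scale(float *dst, const float *src, float k, size_t n) {
        __m256 vk = _mm256_set1_ps(k);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_load_ps(src + i);              /* aligned load */
            _mm256_stream_ps(dst + i, _mm256_mul_ps(v, vk)); /* NT store */
        }
        for (; i < n; i++)                                   /* scalar tail */
            dst[i] = src[i] * k;
        _mm_sfence(); /* make the streaming stores globally visible */
    }

Nothing here is exotic, but good luck getting a compiler to pick the
non-temporal store for you.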

~~~
sitkack
OK, maybe it isn't a question of C and native, but of access to low-level
semantics: memory layout and specialized instructions. The majority of
programs and programmers are better served by higher-level,
easier-to-parallelize semantics than by dropping down to
architecture-specific features. I am thinking Grand Central Dispatch vs.
assembler.

I would argue that the low level work you are doing _should_ be done in a
macro or compiler.

[http://www.graphics.stanford.edu/~hanrahan/talks/dsl/dsl1.pd...](http://www.graphics.stanford.edu/~hanrahan/talks/dsl/dsl1.pdf)

[http://www.graphics.stanford.edu/~hanrahan/talks/dsl/dsl2.pd...](http://www.graphics.stanford.edu/~hanrahan/talks/dsl/dsl2.pdf)

Pat Hanrahan makes a compelling argument for using special purpose DSLs to
construct efficient performant code that takes advantage of heterogeneous
hardware.

See the Design of Terra:
[http://terralang.org/snapl-devito.pdf](http://terralang.org/snapl-devito.pdf)

~~~
0x07c0
Thanks! These are really useful. (I'm actually making a small DSL right now
for distributing work across accelerators.)

I personally really like the idea from the Halide language of having one
language for the algorithm and another for how the computation is done. If
something like that could be made general purpose, it would be very useful.

[http://halide-lang.org/](http://halide-lang.org/)
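
Roughly the idea, as a plain-C analogy (this is just an illustration, not
Halide's actual API): write the algorithm (the math) once, and vary only the
schedule (the loop structure):

    #include <stddef.h>

    /* Algorithm: a 3-point blur over the interior points. */
    static inline float blur_at(const float *in, size_t i) {
        return (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }

    /* Schedule 1: plain serial loop. */
    void blur_serial(const float *in, float *out, size_t n) {
        for (size_t i = 1; i + 1 < n; i++)
            out[i] = blur_at(in, i);
    }

    /* Schedule 2: same math, blocked for cache locality. */
    void blur_blocked(const float *in, float *out, size_t n, size_t block) {
        for (size_t b = 1; b + 1 < n; b += block)
            for (size_t i = b; i < b + block && i + 1 < n; i++)
                out[i] = blur_at(in, i);
    }

In Halide the two schedules would be one-line annotations on the same
function, instead of two hand-written loop nests.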

> should be done in a macro...

Encouraging C programmers to use macros is like encouraging alcoholics to
drink :) But I guess you weren't thinking of preprocessor macros.

~~~
sitkack
I find Halide really interesting. It's like the split between control and
data planes: it made me realize we conflate things without even realizing
that they _can_ be separated.

> ...macro

Yeah, I didn't have preprocessor macros in mind, but wonderful, AST-slinging
hygienic macros!

Take a look at [http://aparapi.github.io/](http://aparapi.github.io/); it's
one of the best examples of making OpenCL a first-class citizen in Java.

------
deadgrey19
Programmable logic on the die sounds like a great thing in principle, but the
place where it really comes into its own is I/O work: network/disk
acceleration, offload, encryption. This is where hardware that is slow and
wide, but reconfigurable over the software lifecycle (e.g. for protocols and
file systems, which change rapidly), would be a benefit. So the real question
is: what is the I/O capability of one of these things? Will the high-speed
transceivers be exposed in a way that lets I/O devices talk directly to the
FPGA, or will everything have to go through a slow, high-latency PCIe
interconnect? If the latter, then I would predict a chocolate teapot in the
making.

~~~
srcmap
One can program the NIC to DMA packets directly into the address space
allotted to the FPGA. Once set up, the FPGA should be able to get hold of the
packets and start processing without a single CPU cycle spent on the data
plane.

~~~
deadgrey19
Sure. This is a possibility, although it is a bit round about and there would
be an interesting song and dance in the NIC driver. NICs typically are told
where to DMA to using descriptor tables programmed into the NIC by the driver.
To do this truly without CPU intervention, you would need to write a hardware
driver in the FPGA to program the NICs descriptor tables (can't even imagine
what a nightmare that would be). Otherwise, you would have to have the CPU
involved in setting up and negotiating transfers between the NIC and FPGA and
a second driver between the FPGA and software. It's pretty messy either way.
And given the proliferation of cheap FPGA enabled NIC's it seems like a non-
starter. If the FPGA transceivers are broken out directly, then a simple
adapter board would allow the FPGA to talk directly the network and/or memory
device.
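
To make the "nightmare" concrete, here is a purely hypothetical RX descriptor
layout (made up for illustration; real NIC formats differ); the FPGA-side
"driver" would have to allocate and maintain a ring of these, and bang the
NIC's head/tail registers, entirely on its own:

    #include <stdint.h>

    /* Hypothetical RX descriptor: the driver fills in a buffer address and
       hands ownership to the NIC; the NIC DMAs a packet in and flips the
       ownership bit back. */
    struct rx_desc {
        uint64_t buf_addr; /* physical address to DMA the packet into */
        uint16_t buf_len;  /* capacity of that buffer */
        uint16_t flags;    /* bit 0: 1 = owned by NIC, 0 = owned by host */
        uint32_t status;   /* written by NIC: actual length, error bits */
    };

    #define RING_SIZE 256
    /* The ring lives in DMA-able memory; head/tail indices live in NIC
       registers that the driver (here: the FPGA) must program. */
    struct rx_desc rx_ring[RING_SIZE];

And that's before error handling, link negotiation, and all the quirks a real
driver papers over.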

~~~
gricardo99
> you would have to have the CPU involved in setting up and negotiating
> transfers between the NIC and FPGA and a second driver between the FPGA and
> software

Plenty of "kernel bypass" and RDMA type functions use shared/user-space memory
for "zero-copy" (in reality one copy), operations between NIC and software. If
a similar scheme can be used with the FPGA then it would not have too much
overhead. I agree, not as direct/efficient as having FPGA serdes I/O go
directly to some SPF+/network transceiver, but then you'd also be taking up
valuable FPGA gate capacity to run NIC PHY/MAC and standard L2/L3 processing
functions that you get from a NIC.
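
A minimal sketch of that pattern in C (the /dev/fpga_nic0 device node is made
up for illustration; real kernel-bypass stacks like DPDK or ibverbs wrap this
in their own APIs):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical device exposing a DMA-able packet region. */
        int fd = open("/dev/fpga_nic0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        size_t len = 1 << 20; /* 1 MiB packet buffer region */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... poll buf for packets the NIC has DMA'd in: no syscall, and
           no copy, per packet ... */

        munmap(buf, len);
        close(fd);
        return 0;
    }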

~~~
deadgrey19
RDMA/kernel-bypass NICs work by mapping chunks of RAM and then automatically
DMA'ing packets into those chunks. Again, it would be a pretty roundabout way
to give the FPGA access to packets: copy data to RAM, then copy it down to
the FPGA, then copy it back up to RAM. It is much simpler (and better) to let
the data stream through the FPGA from/to the wire. In addition, the PHY/MAC
layers these days are pretty thin for Ethernet-style devices, and modern
FPGAs are huge by comparison. I'm not saying it _can't_ be done; I'm just
saying it seems sub-optimal when FPGAs already have a ton of I/O resources
and are already used as NICs. The salient question is whether those resources
are exposed to the outside world.

------
CoffeeDregs
Finally, this technology is gaining acceptance. Leopard Logic and others
tried this about 15 years ago, but Moore's Law and Dennard scaling were still
going strong, so CPU+FPGA didn't take hold. I'm not sure exactly how Intel is
going to implement this, but the predecessors had multiple configuration
planes, so the FPGA could be switched from Config1 to Config2 in nanoseconds
(e.g. TCP offload, then neural-network calculation, then TCP offload again),
and they had some automatic compiler support.

------
csense
My question is what market is going to be driving this? Who will want to buy
this, and how deep are their pockets? Is this a niche product for a handful of
applications, or something we'll see in every PC in 5 years?

The GPU was successful because it had a killer app: Gaming. What's the killer
app for the FPGA going to be?

~~~
extrapickles
Server-side things: machine learning, an on-die network switch, various forms
of offloading (SSL, compression, possibly hypervisor stuff).

It will be a while before it shows up in consumer gear, as the use cases are
not there yet. Consumers may still benefit: when someone figures out
something amazing for it to do, they will eventually get a hardened version
of it.

~~~
creshal
> various forms of offloading (SSL, compression, possibly hypervisor stuff).

I wonder whether Intel will allow that. Better hardware offloading for
various algorithms (SHA, RSA, AES, …) and hypervisor acceleration (VT-x,
VT-d, EPT, APICv, GVT, VT-c, SR-IOV, …) have been among the main selling
points of new CPU generations. An FPGA would render most of them moot by
allowing operators to configure whatever offloading they need without buying
new, expensive Intel chips.
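
For instance, the AES "offload" on current Intel chips is just the AES-NI
instructions; one round of encryption is a single intrinsic (a fragment, not
a full implementation: key expansion is omitted, compile with -maes):

    #include <wmmintrin.h>

    /* One AES encryption round: ShiftRows, SubBytes, MixColumns, then XOR
       with the round key, all in a single instruction. */
    __m128i aes_round(__m128i state, __m128i round_key) {
        return _mm_aesenc_si128(state, round_key);
    }

An on-die FPGA could implement this class of primitive for whatever algorithm
comes along next, instead of waiting for Intel to harden it.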

~~~
wtallis
The virtualization features aren't offloadable. They're a bunch of invasive
changes to the memory and I/O paths of the processor core, not something that
could be handed off to a coprocessor.

~~~
Dylan16807
You could route a lot of the I/O paths through the FPGA if it were suitably
wired: a slight latency bump, in exchange for aspects of virtualization never
requiring any CPU cycles to handle them.

------
jstoja
I think that alongside the hardware problem of integrating both chips on the
same die, the other problem is programming them. We have pretty advanced
abstractions for CPUs nowadays, but looking at some FPGA code, it's clearly
not that simple for developers to enter this world.

~~~
Ericson2314
Actually, I'd argue x86 is huge and scary, and Windows/Unix on x86 huger and
scarier, but plain sequential circuits are quite simple.

The actual problem (as stated in the other comments) is that the tooling is
all proprietary (huge and scary) and has received NO love.

~~~
adwn
"Plain sequential circuits" are also quite slow and therefore useless as
accelerators alongside modern x64 CPUs.

------
PaulHoule
I think "beefy" and "broadwell" is a contradiction.

