
Intel Xeon processor with FPGA now shipping - chclau
https://fpgaer.wordpress.com/2018/05/24/intel-xeon-processor-with-fpga-now-shipping/
======
sehugg
From The Next Platform ([https://www.nextplatform.com/2018/05/24/a-peek-
inside-that-i...](https://www.nextplatform.com/2018/05/24/a-peek-inside-that-
intel-xeon-fpga-hybrid-chip/)):

 _The initial workload that Intel is targeting is putting Open Virtual Switch,
the open source virtual switch, on the FPGA, offloading some switching
functions in a network from the CPU where such virtual switch software might
reside either inside a server virtualization hypervisor or outside of it but
alongside virtual machines and containers. This obviates the need for certain
Ethernet switches in the kind of Clos networks used by hyperscalers, cloud
builders, telcos, and other service providers and also frees up compute
capacity on the CPUs that might otherwise be managing virtual switching. Intel
says that by implementing Open Virtual Switch on the FPGA, it can cut the
latency on the virtual switch in half, boost the throughput by 3.2X, and crank
up the number of VMs hosted on a machine by 2X compared to just running Open
Virtual Switch on the Xeon portion of this hybrid chip. That makes a pretty
good case for having the FPGA right close to the CPU – provided this chip
doesn’t cost too much._

~~~
mtgx
That seems like a weak benefit. I don't think the FPGA part will be cheap; in
fact, it will probably be significantly more expensive than the CPU. So why
not buy a chip with twice the cores, if running twice the VMs is what you want
to do? Also, you can buy an AMD EPYC chip with twice the cores for about the
same price you'd pay for just the CPU part of this chip.

~~~
baybal2
>Also, you can buy an AMD EPYC chip with twice the cores for about the same
price you'd pay for just the CPU part of this chip.

Indeed. What benefits most from being turned into an FPGA circuit is
stateless or minimally stateful logic, like media decoders/encoders.

Complex stuff like fancy routers, classifiers, and internet protocol
inspection/handling will not see the hundredfold speedups that the workloads
above do. This is why cheap x86-based routers are still a thing.

At the moment, I am involved with a cloud provider in China that is betting
big on cheap FPGAs and RDMA. AWS and Azure can be defeated in detail.

The plan is following: provide hardware accelerated "building blocks" of any
modern dotcom business.

Need memcached? We have it running on RDMA, from a bare metal ASIC, 10 times
faster than any x86.

Need transcoding? We have it available over RDMA, from a bare metal ASIC, 10
times cheaper per megabyte than any AWS instance.

Need an API proxy for TLS/gzip with gigabytes per second of throughput? We
have it running on RDMA, from a box with 4 PCIe accelerators, and AWS has
nothing to offer for this use case other than "buy a hundred top-tier
high-performance instances and put them behind load balancers".

~~~
guiriduro
Pretty sure AWS has had f1 instances for some time already.

~~~
baybal2
Yeah, but for those you still have to do all the HDL programming yourself.
Alibaba, on the other hand, aims to provide everything ready to use over a
cute RDMA API.

------
chx
On a whole different level, if you are merely interested in FPGAs in your
computer then the PicoEVB neatly goes in the same M.2 slot as wifi cards do
and communicates over PCIe, even if just PCIe Gen 2 x1. As far as I am aware
this is, by far, the cheapest way to get an FPGA on the PCIe bus.

~~~
kristianp
Don't new laptops only have one of those, for their hard drives?

~~~
agumonkey
This confuses me. Wifi cards don't need an M.2 slot; I thought that was a
standard made for SSDs. Usually wifi cards use plain mini-PCIe sockets; even
on old (2006 and later) 12" laptops you can find two, albeit sometimes only
one socket is populated (the other is empty space with pads; some people have
even soldered one on manually).

~~~
loa-in-backup
Directly from Wikipedia:

Buses exposed through the M.2 connector are PCI Express 3.0, Serial ATA (SATA)
3.0 and USB 3.0, which is backward compatible with USB 2.0. As a result, M.2
modules can integrate multiple functions, including the following device
classes: Wi-Fi, Bluetooth, satellite navigation, near field communication
(NFC), digital radio, Wireless Gigabit Alliance (WiGig), wireless WAN (WWAN),
and solid-state drives (SSDs).

------
pjc50
Note that this is a ~$2500 processor with a ~$5000 FPGA stuck to it.

------
Keyframe
Depending on the price, this will be interesting for I/do-something/O
workloads like video encoding/decoding. Depending on the price, looking
forward to it. Of course, depending on the price. :)

~~~
thesz
You won't have enough resources for video encoding, most probably. The Arria
10 FPGA has 67M _bits_ of memory in total (about 8.4M _bytes_ ), including
memory simulated with registers. Just storing a 1920x1080 (HD) frame in 4:2:0
mode takes about 3M bytes (1.5 bytes per pixel), over a third of that memory,
and 4K requires four times as much.
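For concreteness, the arithmetic (8-bit 4:2:0 video is 1.5 bytes per pixel, since the full-resolution luma plane is joined by two quarter-size chroma planes):

```python
def frame_bytes(width, height, bytes_per_pixel=1.5):
    """Size of one raw 8-bit 4:2:0 frame: luma plus two quarter-size chroma planes."""
    return int(width * height * bytes_per_pixel)

arria10_bytes = 67_000_000 // 8  # ~67 Mbit of on-chip RAM ≈ 8.4M bytes

hd = frame_bytes(1920, 1080)    # 3,110,400 bytes ≈ 3 MB
uhd = frame_bytes(3840, 2160)   # 12,441,600 bytes

print(hd / arria10_bytes)   # ≈ 0.37: one HD frame eats over a third of on-chip memory
print(uhd > arria10_bytes)  # True: a single 4K frame doesn't fit at all
```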

You may do some on-line processing like running neural net on the video
content, but that's about it. Don't expect anything super exciting from that
chip.

Still, it will bring joy to high-frequency traders. Their systems do most of
the work in the FPGA, including UDP/TCP/IP packet processing, and offload
some work to the CPU (broadcasts about network topology are handled on the
CPU, for example). They also want to receive CPU computation results as fast
as possible, and this chip delivers exactly that.

~~~
pjc50
You have more than enough resources for video encoding, because all modern
codecs have a macroblock structure. You don't need to keep the whole current
frame in memory at once; you can slide a window around it. (Conversely, you
_do_ need to keep the matching area of the previous frame, and maybe even the
next frame, in order to do proper inter-prediction.) That said, I would
assume GPUs are more suited to this task.
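A rough sketch of why the window is so much smaller than the frame, assuming a 16x16 macroblock and a hypothetical ±64-pixel motion-search range (both numbers are illustrative, not from any particular codec):

```python
def search_window_bytes(block=16, search_range=64, bytes_per_pixel=1.5):
    """Reference-frame pixels needed to motion-search one macroblock:
    the block itself plus the search range on every side."""
    side = block + 2 * search_range
    return int(side * side * bytes_per_pixel)

full_frame = int(1920 * 1080 * 1.5)   # 3,110,400 bytes for a whole HD frame
window = search_window_bytes()        # 144 * 144 * 1.5 = 31,104 bytes

print(window / full_frame)  # ≈ 0.01: the window is about 1% of the frame
```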

By FPGA standards this thing is _enormous_.

~~~
thesz
You are also limited by the number of access paths into the memory that
holds, say, a macroblock. I forgot about this issue, sorry.

For block RAM in previous-generation FPGAs from Altera, there was one read
path and one write path. To get two read paths, you have to duplicate the
block RAM as many times as you need read paths. This means that if you search
for block content inside a macroblock with N parallel accesses, you need N
copies of the macroblock stored.
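In other words (toy numbers: an 8-bit 16x16 macroblock in single-read-port block RAM):

```python
def bram_bits_for_parallel_reads(n_readers, block=16, bits_per_pixel=8):
    """With one read port per block RAM, every extra concurrent reader
    needs its own complete copy of the macroblock."""
    one_copy = block * block * bits_per_pixel  # 2048 bits per copy
    return n_readers * one_copy

for n in (1, 4, 16):
    print(n, bram_bits_for_parallel_reads(n))  # storage grows linearly with read paths
```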

Tabula's time-shifting tech allowed for up to, I believe, twelve paths into
the block RAM, six for reads and six for writes (they were time-scheduled
onto 1-read-1-write block RAM operating at six times the frequency). I
thought this would pave the way for superscalar FPGA CPUs, but Tabula closed
down.

You can imagine not using RAM at all, but then you will spend other resources
in the FPGA.

These are the problems I see with video compression here. I think they are
substantial but not insurmountable, and solving them requires striking a
balance. It's just that, as I see it, the balance is not in favour of the
FPGA.

~~~
cbHXBY1D
Such a shame that Tabula closed down. They had some of the most interesting
tech -- hopefully someone is able to pick it up someday. As someone who works
with FPGAs, I think a solution like theirs is the only way FPGAs will become
mainstream.

~~~
nxmehta
Altera took many of the employees (although quite a few are now at AWS) and
it's rumored they also bought the IP. A lot of the core ideas will live on in
Stratix, hopefully.

------
arcanus
Could this be a hint at Intel's supercomputing and AI strategy? They cannot
compete with GPUs on flops with Xeon alone, but an embedded FPGA might get
them closer.

It is a risky strategy, however. Even if they can attain similar performance,
which I doubt, programmability remains the big problem for FPGAs. I know
Intel is pushing OpenCL, but it simply does not have a software ecosystem
right now, and it remains to be seen whether they can even enable much of
OpenCL's feature set on an FPGA.

~~~
vegetablepotpie
>programmability remains the big problem for FPGAs

I agree, because there are added levels of complexity with HDLs that software
doesn't have to deal with. What would be nice is a tool where you could
declare your problem (functions) and your constraints (throughput, LUTs
available), and it would figure out the memory, ALUs, and pipelining needed
to solve the problem.
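The core arithmetic such a tool would have to do can be sketched in a few lines (all numbers here are made up for illustration): given a throughput target and a clock, how many parallel ALU lanes are forced, and do they fit in the LUT budget?

```python
import math

def plan(throughput_msamples, clock_mhz, luts_per_alu, luts_available):
    """Toy model of constraint-driven synthesis: the throughput target
    dictates how many samples must be consumed per clock cycle, which
    dictates lane count, which dictates LUT cost."""
    lanes = math.ceil(throughput_msamples / clock_mhz)  # samples per cycle needed
    luts_needed = lanes * luts_per_alu
    return lanes, luts_needed, luts_needed <= luts_available

print(plan(600, 200, 1500, 10_000))  # (3, 4500, True): constraints are satisfiable
print(plan(1000, 200, 1500, 5_000))  # (5, 7500, False): not enough LUTs
```

A real tool would also schedule pipeline registers to hide ALU latency, but the feasibility check above is the heart of it.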

~~~
meuk
That is called high-level synthesis, and there are some tools available for
it (I know of at least one company that produces a proprietary toolchain,
including hardware, for this).

In my experience, high-level synthesis makes things a lot easier for the
programmer, but you still have to be aware of very low-level details (and you
still can get bitten by abstractions that you don't fully understand).

~~~
exikyut
That company might fruitfully make their system available in a datacenter /
"VPS"-type context, e.g. the way Cloud9 does it
([https://c9.io/](https://c9.io/): you make a free account and instantly get
a web-based editor connected to a Debian VM).

If their system accepts "mostly ordinary" code in, e.g., C, a fair amount of
stuff would probably work with it, so it could scale beyond educational
exploration/tinkering, too.

~~~
alde
Here is a service I know of:
[https://reconfigure.io/](https://reconfigure.io/)

------
sddfd
This is great! I think that some time in the future, FPGAs will be a vital
part of data processing.

In theory, an application with a computation-heavy task could program the FPGA
to provide part of that task in hardware (think the hot innermost loop body).

What I am worried about is the infrastructure that is needed to make this
happen: Is there even support for this in our compilers? What would support
look like?

~~~
pjc50
> In theory, an application with a computation-heavy task could program the
> FPGA to provide part of that task in hardware (think the hot innermost loop
> body).

This is absurdly far away; it reminds me of the decades of assumption that
4GLs are going to obsolete programmers or the decades of trying to cross-
compile C to FPGAs, badly.

It doesn't help that there's a huge infrastructure barrier caused by closed
tools. Imagine if Intel brought out a processor with a proprietary instruction
set where you were only allowed to use their FORTRAN compiler (no C, let alone
anything more modern or JIT) with a per-seat license. That's where FPGA tools
are.

We won't see a Cambrian explosion in FPGA tooling until they are made properly
open. Building using open-source tools needs to be actively supported by the
manufacturer.

There are also conceptual obstacles; FPGAs are sufficiently different that
programmers have to re-learn and re-write idioms in order to get usable
results. It's as big a jump as going from Javascript to CUDA.

~~~
pasabagi
With the end of Moore's law, won't this kind of thing become increasingly
necessary, though?

I agree with you about the problems of closed-source - but I feel like the end
of Moore's law will encourage creative solutions to performance problems -
which FPGAs would at least make technically possible. And, even if it's like
the old days when people bought compilers and access to source, people will
still do it - since it'll be the best way to get an edge on the competition.

------
Someone
I thought FPGAs were limited to either cases where you need some custom
hardware _now_ or where demand is low because, if demand is high enough and
you can wait, custom ASICs are cheaper, lower power, and faster.

What has changed here? Are FPGAs faster now or are many more people
experimenting, creating sufficient demand for custom hardware at short notice?

~~~
slededit
CPU performance increases are a lot slower now, so it makes sense to spend
time on custom RTL. In the past you would just wait 18 months and buy a new
CPU twice as fast.

FPGAs are a waste of silicon compared to ASICs, but for fixed applications
they are less wasteful than CPUs.

------
exabrial
This is cool! I have so many questions, like how does it work with context
switching? How do they protect access to the cache?

~~~
ManuelKiessling
They don't, it's an Intel CPU.

------
kbradero
Does this mean you can load your own designs onto that FPGA? Where can we
find out what features are exposed to software?

------
jdubs
Looks like a future AWS instance type.

