Intel Shows Xeon Scalable Gold 6138P with Integrated FPGA, Shipping to Vendors (anandtech.com)
140 points by dmmalam on May 17, 2018 | hide | past | favorite | 109 comments

One stark flailure of the Altera acquisition is that there has been little by way of tool chain integration. This is a CPU with a bag on the side, and that bag needs people who write HDL and understand computer architecture to make good use of it. It's not really a harsh critique, but a warning: the lower you go, the more you depend on an increasingly smaller number and expensive lot of folks. Intel's marketing arm can't distort that reality, and their product development arm seems too taxed to bridge the gap.

> increasingly smaller number and expensive lot of folks.

There is this sentiment on HN that people in EE/embedded space don't really get paid well (compared to the average Javascript-slinging frontend dev). Is this not true for HDL folks?

We get paid bank, there are very few of us, and the older you get and the more horror stories you've seen the more you're worth (unlike software). Gray in your beard is a feature, not a flaw.

You're responsible for time critical and/or DSP things that are critical system architecture, and will never be outsourced (it's key IP, you'd be stupid to outsource that).

Don't let anyone know, we have a good thing going on.

Hey, thank you for that. I really needed this motivation at this point in my life as an EE (RF, RTL and PCB level).

It does get outsourced. I worked on a team that did a DSP/crypto multi-core chip. The funny part is 90% of the team has now moved to the US because they got paid better.

> it's key IP, you'd be stupid to outsource that

That doesn't usually stop companies. At best it creates an opening for new companies to recapture some high-value markets once the dust from the off-shoring stampede settles.

Would you like to share how does one get into this line of work?

I don't suppose there's an appropriate tutorial on egghead.io.

The standard path is more trodden than software development jobs -- ECE grad at a school with a good program.

I know I took a digital logic course and did some fpga implementations in it, man that was hard. I am petrified of having to deal with HDL in the near future for my research, maybe I'll take a class or something.

In school they make you do it the hard way. I find verilog to be very similar to my process of how I write embedded C code.... short version is if you can make block diagrams and write good C, you can learn verilog.

If you want to learn, there’s a ton of free training on the major vendors’ sites (Intel FPGA and Xilinx), and a development board with everything you need might run you a few hundred. But I don’t know what your job prospects would be without a degree in EE/CompE or experience in the hardware industry...

interested to hear what bank is here too

How do you define "bank"?

I get paid 130k with a phd EE from tier 3 with <3 years industry experience. If you want to get in get ECE and study computer architecture and RTL. I am doing hardware modeling to predict future performance. It’s difficult to get in the industry from a tier 3 w/o experience but it’s worth it. My counterparts in FAANG are no doubt getting 300k/year. Sometimes I think maybe I should leetcode and get a software job @ FAANG because HW is difficult.

all of my offers out of undergrad were more than that and I’m considered below average ¯\_(ツ)_/¯

Getting paid bank means getting paid a lot of money.

EDIT: You may also hear the phrase "making bank" to mean making (earning, not printing) a lot of money.

I think they were probably looking for a salary range.

lol, you might be right. I thought it was an honest question from someone of a different native tongue.

How would you recommend getting into the field?

Electrical Engineering college degree. That's the only option.

I've never seen non-college degree electrical engineers.

Allow me to introduce myself... not really, because anonymity. But I’m effectively an EE at my company with no degree. Hardware, firmware, software, sourcing, production, validation/testing, and low volume assembly, all self taught. I’ll get into FPGA when I have a use for it, but it certainly won’t be FFTs and DSPs, but still.

I’m glad I don’t have to job surf, it would be hard without paper, but my experience would be very valuable in my tiny field.

For sure, technicians who do these things are real for finishing work; they may have AS degrees instead of a BS EE. I've worked with them as well. They were always referred to as technicians that EEs hand off more of the board-level debugging to (or even GDS II for ASIC design..). The core datapath processing algorithms were always done by EEs, though.

I'm another one. Embedded Engineer, no degree, works on core datapath and high level design, in addition to debug work.

We outsourced HDL validation.

I work on operating systems, which is similarly niche (but probably overall easier to learn and low risk vs fabbing ASICs) and it is very bi-modal. Sr. circuit designers can write their ticket, but you have a lot of warm bodies surrounding them at large companies.

Hi, does your team have an opening? I work on performance modeling and am always curious if I could make the jump to low-level system software and learn a ton of stuff for a while.

Say EE/embedded space people are paid the same as the average Javascript-slinging frontend dev, would you call this paid well?

I'm not even sure they're going to let you program the FPGA directly (with HDL)?

I expect the main workflow will be using OpenCL to offload arbitrary work to the coprocessor along with a few Intel-provided modules capable of common tasks.

The great thing about having the FPGA on-die via UPI is that the cache-coherency, decreased latency and massive bandwidth will allow much more granular offloading of work. This is as compared with a PCIe coprocessor, where it only makes sense to offload larger chunks of work and minimise the communication and data passing between the two.

The greater the granularity of work that we can offload, the more viable the OpenCL/high-level synthesis/heterogeneous computing type stuff will be, as it will integrate more seamlessly into existing software development methods. This is the holy grail at the moment for FPGA vendors: to get to the point where software developers can program them on their own.

As to your point though I guess we'll find out soon what the dev tools for this will actually look like.
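To make that break-even point concrete, here's a toy model of when offloading pays off. Every latency, bandwidth, and FLOPS number below is an illustrative assumption, not a measured figure for UPI or PCIe:

```python
# Toy break-even model for offloading a kernel to a coprocessor.
# Every latency/bandwidth/FLOPS number here is an illustrative assumption.

def offload_wins(work_flops, bytes_moved, cpu_gflops, fpga_gflops,
                 link_latency_s, link_gbps):
    """True if (transfer + FPGA compute) beats just running on the CPU."""
    cpu_time = work_flops / (cpu_gflops * 1e9)
    transfer = link_latency_s + bytes_moved * 8 / (link_gbps * 1e9)
    fpga_time = transfer + work_flops / (fpga_gflops * 1e9)
    return fpga_time < cpu_time

small_job = dict(work_flops=1e5, bytes_moved=8192,
                 cpu_gflops=50, fpga_gflops=500)

# A PCIe card pays microseconds of latency per round trip...
print(offload_wins(**small_job, link_latency_s=5e-6, link_gbps=100))    # False
# ...while an on-package coherent link makes even small jobs worthwhile.
print(offload_wins(**small_job, link_latency_s=0.5e-6, link_gbps=160))  # True
```

The point of the sketch: shrinking the fixed transfer cost moves the break-even job size down, which is exactly what makes fine-grained offload viable.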

Using OpenCL to program an FPGA is a significantly more difficult task than programming in HDL. At least for what would be relevant to a Xeon co-processor. The OpenCL flow is just terrible at getting to the performance levels you need to realistically offload anything from a Xeon. Intel are certainly working on it, but that's not a realistic proposition for the next 5 years, and if the Xeon+FPGA isn't already successful in that time frame, it'll be canned long before OpenCL is a solution.

From what I've seen the only applications for this will be pre-canned FPGA images that were written in-house by Intel for things like encryption or FEC.

Video encoding would be another good application for the FPGA.

I'm not so sure. It's a fantastic fit for GPUs; FPGAs make sense for stream processing for video encoding, but that's more of a discrete-device play where you can plug the video feeds directly into the FPGA daughter board.

Learning an HDL isn't that hard. I was productive within a few weeks of starting to use VHDL.

A former colleague is now an Altera/Intel FAE, I will ask how things are going the next time we meet up for a beer.

Learning an HDL isn't that hard. But it's a very small part of the overall flow for real development, and arguably it's one of the easier parts anyway... That's sort of the "problem", more or less.

Even very well-oiled products like the Amazon F1 and Intel's new acceleration cards are pretty non-trivial to use unless you're an experienced engineer, and they clearly spent a lot of time polishing off the roughest edges to make it as approachable as possible. Amazon more or less paid a team of seriously experienced people to make the whole flow as painless as possible (including a pretty good software SDK, and a lot of tooling around Vivado), and it's still non-trivial!

Some of the ARM/FPGA combos are a lot more approachable overall, but the tooling, BSPs etc are normally huge pains the moment you want to get creative, and the SDKs are comparatively worse, vs the "high end" ones listed above, in my opinion. Normally I just end up replacing them with my own Linux BSP, more or less, if I need the actual host side.

A really annoying aspect of all this though is that most kits which could offer features like high-speed peripherals (PCIe, etc) are pretty expensive for hobbyist developers to acquire and use, so there's certainly a bit of a self-fulfilling-prophecy going on here.

I have an ARM+FPGA system (Xilinx Zynq), I don't use the supplied BSP either.

Writing HDL isn't hard, but doing it right (much like with C) is the tricky part, I would argue.

The translation from Software -> Flip-Flops isn't nearly as natural and it's easy to try and apply SE techniques that while possible are totally unsuited for an FPGA.

Witness the many systems that say they can translate something high level to HDL and how poorly they perform compared to an expert. It's so easy to produce logic that will run slowly or be glitchy, and most EE graduates have very limited exposure to designing logic.

Yeah, I recall a thread earlier this week talking about a large divergence in how resets were handled between FPGAs and ASICs.

It's an interesting space for sure and I'm definitely curious how these hybrid systems shake out.

There are a multitude of foot bullets that you can generate when synthesizing valid HDL to real world hardware. These languages were meant for specification and simulation and as such allow you to do things that are plainly wrong for synthesis in the real world. It's very easy to generate a circuit that works 99.9% of the time and causes Heisenbugs if you approach it like software programming.

HDLs are fine. I've found the tooling around them to be quite atrocious (slow, buggy, opaque, gui-based) though. Be it for synthesis, simulation, or even just compiler errors it's pretty bad compared to software.

I’ve started using SpinalHDL, https://github.com/SpinalHDL/SpinalHDL. It’s a Scala DSL that spits out Verilog or VHDL for traditional synthesis tools. But, unlike Chisel or MyHDL, in my opinion it’s a great experience. And, now it has seamless integration with Verilator for simulation, and the open-source Verilator project is very capable—they claim it beats commercial simulators: https://www.veripool.org/wiki/verilator. Since Scala is quite a bit faster than Python, the simulations run much faster than something like Cocotb too!

I have my hobby project in SpinalHDL up at https://craigjb.com

Edit: also, GTKWave is pretty good! It’s a simple and straightforward waveform viewer that works on all platforms.

Are you writing your testbench code in C++? (that's what verilator wants, isn't it?)

Luckily there is (slow) progress on making good open source tools[1]. The Xilinx 7-series FPGAs were recently reverse-engineered too[2].

[1] https://rwmj.wordpress.com/2018/03/17/playing-with-picorv32-...

[2] https://github.com/SymbiFlow/prjxray

Have to agree this is a major pain point for wider adoption. The current tools are mainly developed by a few big EDA vendors. I'm hoping, through the growth of open hardware communities, we'll start to see more friendly tools making progress.

How do you get into HDL and FPGAs? Seems like it will become a more and more important skill since all other avenues in performance improvement have come to a halt.

I like http://www.fpga4fun.com/ as a decent high level overview + a bunch of great tutorials.

Grab a Lattice CPLD/FPGA demo board ($25-$50) and have at it!

I just picked up a dev board and started making it flash LEDs, put patterns on the seven-segment display, then had a play with soft CPUs. This gave me ideas on how to use the technology in my job - embedded networking.

If you are coming from a software background then maybe look at getting a board with one of the ARM+FPGA hybrid chips, you can run a proper OS on the ARM CPU and I would guess that the bus interface to the FPGA part will be simpler than using Intel's UPI.

Could try something like Icarus Verilog for an open source compiler and start building some toy designs. There are some good communities and blogs to find help.

I guess that is what you get when the entire software engineering field turns to "Javascript and a distributed database" as a solution to everything. There is a lot more out there and people should spend the time learning it.

> little by way of tool chain integration

Have a look at nGraph: http://ngraph.nervanasys.com/docs/latest/optimize/generic.ht...

Co-design for hardware+software is tough, for sure. But the reality is that hardware has to be present to build the software on top of it. People need something to play with. So the "bag on the side" of FPGAs here is kind of like Lego blocks for cache / acceleration. If you are running a DNN for inference, for example, cache is usually your bottleneck. Rent GPUs to train the model, figure out your bottleneck, and build your own isolated and local system for the "expensive lot of folks" to create the valuable IP.

> that bag needs people that write HDL and understand computer architecture to make good use of

Over time I suppose they could have some machine learning that automatically configures the FPGA based on which programs you are running and the types of computations they have historically used.

There’s a pretty cool story about using evolutionary algorithms to program an FPGA for a specific task: https://www.damninteresting.com/on-the-origin-of-circuits. Probably not ML in the sense you were talking but very cool nonetheless. In the end the final result was hyperoptimized FPGA code that only ran on a specific board and used a bunch of nonsensical structures that seemed to be doing nothing, yet would stop the circuit from working if removed!

I remember this, thanks for bringing it up: I had read it years ago and it always haunted me.

The evolved solution involved using only 37 out of the 100 gates available, no clock, and some units were logically disconnected yet disabling them caused the chip to fail at its task, indicating it was relying on EM effects particular to the chip. Truly amazing (and perhaps a bit creepy too).

Machine learning that automatically writes HDL programs matching the functionality of x86 programs, and that are more efficient when executed on an FPGA than on a general-purpose CPU, hasn't yet been invented...

Nor do I think they will be invented soon. Machine learning is bad at doing discrete things, bad at things requiring zero mistakes, and bad at problems for which the answer can't be perfectly verified (due to needing to solve the halting problem).

I think SpiralGen can also do FPGA (or could if needed), i.e. automatically generate highly optimized numerical code.


I wouldn't be surprised if Intel came up with tools which help make use of the FPGA using higher languages like C.

Xilinx is already pushing its Vivado HLx, which is pretty much that. I'm not aware of Altera's current offerings, but they won't overlook this so easily.

Well there you go. So GP's concerns are largely addressed. Intel can just work on this tooling, and people who are going to use HDL are going to continue using them.

Good idea in theory, in practice it's a shit show. The current level of language support is "As long as you know how to write what you want in HDL, then you can write some C that uses custom pragmas and design patterns and attributes that will badly reproduce what you could have written in HDL in half the time".

> I wouldn't be surprised if Intel came up with tools which help make use of the FPGA using higher languages like C.

These (almost by necessity) make you write code that doesn't look like any other C code. You can't just use some pre-existing C library and compile it to use FPGA resources. At that point, why not go a step further and just use a proper HDL?

>At that point, why not go a step further and just use a proper HDL?

You're assuming that you're starting with an FPGA-only project. In reality, most projects that are going to be accelerated with these Xeon+FPGA systems are existing software where only a handful of hotspots will have to be ported to the limited C language, which is significantly less effort than rewriting the entire algorithm in HDL.

Because although HDL doesn't look difficult syntax wise, it requires a very good understanding of logic gates. People need to pretty much take a course in digital design to be able to write something meaningful.

It took me a while to grok HDL and write working code (good, maintainable code aside) even though I took digital design courses.


FPGAs don’t get firmware.

HDL is a description of hardware.

There's no need for all caps.

HDL can be referred to as code. And I do know what HDL is and what FPGAs are, thank you.

>HDL can be referred to as code.

No, it can't. That's why the caps are there and I'm sticking to it. You basically justified it ;)

Code is something that you design to run sequentially. HDL is a description of gates and logic that runs all at the same time.

If you wanted to argue that the bitstream is firmware - that's a tougher discussion.

If code is inherently sequential, do you think that there is no such thing as "code" in functional programming languages?

Intel recently released their own High Level Synthesis compiler for Quartus; like HLx, it's free and you can use it now on whatever device you want. You still have to do the interconnection to the CPU, if you have one.

Unlike Xilinx, though, Intel (very very recently) just started offering their OpenCL-for-FPGA SDK for free, and it works on most of their device families, including Arria/Xeon and Cyclone/ARM. I always found it disappointing that the OpenCL SDKs were normally licensed, since they're more-or-less a logical extension of HLS support. So that's nice of Intel.

They don't have any equivalent to Xilinx SDSoC though, but for datacenter targets they're shipping a different set of SDKs anyway (called "OPAE"), so maybe in the future they'll build something on top of the OPAE and OpenCL support (e.g. a single-source model, like SYCL)

Altera/Intel has had OpenCL support for years now

OpenCL support is different from this.

It's a different approach to HLS. Arguably much better than the Vivado approach

The FPGAs are programmable using OpenCL. That counts?


I don't know if I'd call that programmable, isn't it just sequencing work to predefined logic blocks? Can you do things that were not contemplated by the opencl design?

OpenCL provides a standardization for interacting with accelerators. It is up to the manufacturers of FPGAs to adopt the OpenCL standard. For a while, Nvidia didn't get on board because they already had CUDA and felt OpenCL was letting ATI back in the game. Xilinx makes good learner FPGAs that are compatible with the OpenCL standard.

I don't know whether it was deliberate, but I rather like that as a word: Flailure. (Although pronunciation and perception of it are not the easiest.)

A lot of flailing about, as the ship goes down.

(Speaking simply generally, about this "word". Not about the topic(s) in this thread.)

They've got an OpenCL SDK for it now, that brings it under the same interface that a lot of folks have been using for the Phi and other many core tech Intel has been pushing for HPC.

It's not really that hard to get rolling at this point.

It's really up to Intel to start funding education for architecture and HDL design.

"Price: Arm, Leg" ... hard to miss the sarcasm.

The Arria 10 FPGA is kind of mid-range. I think they go for around 500 bucks, and you can buy the dev board for 5K (https://www.altera.com/products/boards_and_kits/dev-kits/alt...)

Why are dev boards that much more expensive in general? I can't believe that it costs 10x as much to "just" put the chip on a pcie board.

These are usually full-custom boards, very low-volume, with many layers and fine pitch traces. The parts themselves may have thousands of pins, routed across a dozen layers or more.

So the non-recurring engineering (NRE) costs are very high, and you don't get to recover them over much volume. Even at 10x the cost, the vendor is still losing money.

Funnily, it also applies down the chain: even dollar ESP chips' dev boards are 10x the cost of the bare PCB :)

I believe it's a technique for filtering out everyone but "serious" customers, because whoever buys one and tries to make it work is inevitably going to ask questions about it.

Um, the FPGA chip on that board retails for $10K according to findchips.com so ... (of course, anyone really ordering these class of chips is going to order them directly from Intel or Xilinx--but that doesn't mean the FPGA will be cheap)

Intel isn't making a lot of money on those boards if they aren't actually taking a loss.

1) A PCI-e board is non-trivial. It requires some engineer to sit down and do some serious signal integrity work.

2) A board with a fine-pitch BGA is non-trivial. These boards generally have via-in-pad and blind vias on the top two layers. The boards are also generally 8-12 layers.

3) The support circuitry on those boards is non-trivial. There are high-speed, fine-pitch connectors that are probably in the $100-each range, themselves.

4) Really, you're paying for customer support. You will hit bugs. You will have to communicate with Intel. You probably won't order enough volume to make it worthwhile for Intel.

If you are seriously interested in using a $500 chip in any "interesting" volume, shelling out $5k for a dev board is nothing you would care about.

I have an Arria 10 SoC dev board (not the PCI one, the stand-alone version). It's a very complex and fully featured board, not just a "breakout board" for the FPGA. It's also clearly aimed at companies looking to integrate an Arria 10 into their designs, not hobbyists looking for a cheap FPGA board, and the pricing reflects that.

I remember Faggin (of microprocessor fame) dabbling with the idea of creating an all-FPGA ’hypercomputer’ back in the late 1990s, and even going as far as launching a start-up called Starbridge Systems. Nothing much came of it then, but here’s a WIRED period piece with all the de rigueur gushing optimism of the era.


I wonder if AMD could produce Epyc with TSMC and Xilinx.

Xilinx is a much better choice in FPGAs

What I'm really interested to know is how this looks from a power-performance perspective. An Arria 10 can draw 20-40 watts (I'm not sure that 90W is really realistic), and the TDP for the Xeon is 165W. When I saw the discussion about this in the past with people who work at Intel, the discussion wasn't 165+90 = 255W TDP. It was actually "Oh, so you want to use the FPGA? Better turn off 6 of those Xeon cores!"

That could be a real problem for getting performance out of these systems.

Intel Scalable CPUs are already at 205W TDP, and also see: https://en.wikichip.org/wiki/intel/xeon_platinum/8124

> Also in the announcement was a mention of Intel’s desire to offer a discrete FPGA solution with a faster high-bandwidth coherent connection, although details of this interconnect were not provided

Sounds a lot like what OpenCAPI (https://opencapi.org) offers.

(disclosure: work for IBM on CAPI/OpenCAPI firmware enablement)

So what can you do with the 1.15M logic gates that are in that FPGA? Can you, say, multiply two 256x256 FP32 matrices?

Five top comments in and you're the first to ask this question. Given how exotic this piece of silicon is, I'd love to know too. This is the Intel webpage I found[0]. From there, there's a link to an Altera page[1]. This is the marketing blurb that AnandTech got some of their info from[2].

A number of obvious interesting applications jump out at me: crypto acceleration for mining, machine learning acceleration, JIT acceleration for interpreted languages… How doable is all this? Would you have to roll your own code or are there libraries?

Wrt interpreted languages, I use Ruby fairly often and now that there's MRuby[3] I wonder if Ruby could be made run blazingly fast on something like this Xeon+FPGA thing?

Oh to have a spare $€£¥ to get me one of these.

[0] https://www.intel.com/content/www/us/en/servers/accelerators...

[1] https://www.altera.com/solutions/acceleration-hub/accelerati...

[2] https://itpeernetwork.intel.com/intel-processors-fpga-better...

[3] https://github.com/mruby/mruby

Top of the line Arria-10 FPGAs have about 1500 floating point MACs [1]. Using such a device, Intel claims ~1 TFLOP sustained for GEMM, the standard matrix multiply operation [2].

[1] https://www.altera.com/content/dam/altera-www/global/en_US/p...

[2] https://www.altera.com/content/dam/altera-www/global/en_US/p...
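As a rough sanity check on that figure (the clock rate below is an assumed round number, not Intel's spec; real Fmax depends on the design):

```python
# Back-of-envelope peak for the Arria 10 numbers cited above.
macs = 1500          # hardened floating-point DSP blocks
clock_hz = 400e6     # assumed achievable clock; real Fmax is design-dependent
flops_per_mac = 2    # a multiply-accumulate counts as 2 floating-point ops
peak_tflops = macs * flops_per_mac * clock_hz / 1e12
print(f"{peak_tflops:.1f} TFLOPS peak")  # 1.2 TFLOPS peak
```

So ~1 TFLOP sustained is at least in the right ballpark for the hardened-DSP count, before any of the decomposition overhead discussed below the datasheet.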

I would give a very strong health warning about taking those numbers seriously. What that 2nd reference doesn't make clear is that it is NOT a standard Matrix Multiply Operation, it's an 11 row by 16 column Matrix Multiplication.

This is important because, unlike in software where performance scales well to larger matrices, on an FPGA you would have to decompose every matrix multiplication into 11x16-style multiplies. They don't mention this overhead in their specs.
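To put a rough number on that decomposition, here's the tile bookkeeping for a 256x256 multiply. This only counts tiles and edge padding; it says nothing about how the real design schedules them:

```python
import math

def tile_count(rows, cols, tile_rows=11, tile_cols=16):
    """Tiles needed to cover a matrix with fixed 11x16 blocks,
    plus the fraction of work wasted on zero-padding the edges."""
    r_tiles = math.ceil(rows / tile_rows)
    c_tiles = math.ceil(cols / tile_cols)
    padded = (r_tiles * tile_rows) * (c_tiles * tile_cols)
    waste = 1 - (rows * cols) / padded
    return r_tiles * c_tiles, waste

tiles, waste = tile_count(256, 256)
print(tiles, f"{waste:.1%}")  # 384 tiles, ~3% padded away
```

The padding waste is small here; the real cost is the control logic and data movement to feed hundreds of tiles through the fixed-size unit, which is the overhead the spec sheet glosses over.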

If anyone is interested in using C# on an FPGA to learn, take a look at https://hastlayer.com/project

Wow, someone managed to make something even worse than the "compile your C to an FPGA" tools you see occasionally.

The model is just too different for these things to make sense 99% of the time. It's like trying to run general purpose code on a GPU, but even worse.

For those of you missing context to understand why it's a bad idea...

Typical programming languages run code one line at a time from the top to the bottom, occasionally calling functions or looping.

FPGAs instead have code which specifies a kind of circuit diagram of what is connected to what.

You can make something to translate one to the other, but by translating a programming language to a circuit diagram you usually end up creating bits of circuitry for each line of code, and then extra circuitry which makes each bit be activated in sequence.

That typically leaves 99.9% of the circuitry unused (deactivated) at any point in time. FPGAs don't have much space for circuitry, so by translating a program from C to an FPGA directly you'll usually end up with a huge design with very low throughput, since the vast majority of the circuit is sitting idle most of the time.

Real FPGA designs will typically aim to use circuitry very sparingly, and keep as much of it in use as possible all the time to get maximal performance. Complex, sequential and non-performance critical tasks are typically not a good fit for an FPGA and are usually offloaded to a programmable processor instead.
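The idle-circuitry argument can be put in toy-model form. The per-statement LUT figure below is made up purely for illustration:

```python
# Toy model of a naive line-by-line C-to-FPGA translation: one circuit
# block per statement, with only one block active per cycle, like a
# program counter stepping through the design. Numbers are illustrative.

def naive_translation(n_statements, luts_per_stmt=100):
    area = n_statements * luts_per_stmt  # total LUTs consumed
    utilization = 1 / n_statements       # fraction of blocks busy per cycle
    return area, utilization

def hand_pipelined(n_stages, luts_per_stage=100):
    # A hand-built pipeline keeps every stage busy on every cycle.
    return n_stages * luts_per_stage, 1.0

area, util = naive_translation(1000)
print(area, util)  # 100000 LUTs, but only 0.1% of them doing work each cycle
```

Area grows linearly with program length while throughput stays flat, which is exactly the inversion of what a hand-written pipeline gives you.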

Yeah, I don't understand why so many people keep trying to write control-flow to FPGA compilers. The only way these things are going to be practical is if you understand how both FPGAs and the compiler work really well, and then write control-flow code in a really specific way that the compiler can handle. At which point you'd be better off using a tool that can model what you want directly.

The more useful way to write hardware using programming languages is to create a model of hardware and then use the language features to manipulate that model, and use it to create abstractions for actual hardware patterns.

Why would anyone in their right mind try (or even want to try, except as a lark) this?

"...The input of Hastlayer can be a program written in dozens of programming languages (including several of the most popular ones as C#, C++, Python, PHP..."

Wait until you see the python-to-FPGA tool :P

This one does that thanks to IronPython, lol.

writings fpga.el right now ><

What's the use case of an integrated FPGA?

I can think of two possibilities: 1. Hardware accelerated custom crypto algorithms. 2. Specialized co-processor. You could implement an exotic processing architecture on the FPGA to off-load computations to, like a complex number processor with 52-bit words in a custom floating point format for a specific application.
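A custom float format like that can be prototyped in software before committing it to gates. Here's a minimal sketch; the 1/11/40-bit field split and the no-hidden-bit mantissa are my own illustrative choices, not any real design:

```python
import math

# Toy 52-bit float: 1 sign bit, 11 exponent bits, 40 mantissa bits,
# stored without a hidden bit. Field widths are illustrative only.
EXP_BITS, FRAC_BITS = 11, 40
BIAS = (1 << (EXP_BITS - 1)) - 1

def encode(x):
    """Pack a Python float into the custom 52-bit word."""
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))           # abs(x) == m * 2**e, 0.5 <= m < 1
    frac = round(m * (1 << FRAC_BITS))  # mantissa kept with its leading bit
    if frac == 1 << FRAC_BITS:          # rounding overflowed: renormalize
        frac, e = frac >> 1, e + 1
    return (sign << (EXP_BITS + FRAC_BITS)) | ((e + BIAS) << FRAC_BITS) | frac

def decode(word):
    """Unpack the 52-bit word back into a Python float."""
    sign = word >> (EXP_BITS + FRAC_BITS)
    e = ((word >> FRAC_BITS) & ((1 << EXP_BITS) - 1)) - BIAS
    m = (word & ((1 << FRAC_BITS) - 1)) / (1 << FRAC_BITS)
    return -math.ldexp(m, e) if sign else math.ldexp(m, e)

assert decode(encode(-2.5)) == -2.5  # exactly representable values round-trip
```

Once the format behaves correctly in software, the same field layout maps directly onto a datapath in HDL.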

I don't know how fast the FPGA can reconfigure itself or part of it, but you could also have per process bare metal JIT

It takes in the order of seconds to reconfigure. JIT is not really viable.

Nope. Reconfiguration takes milliseconds. Also, Arria 10 supports partial reconfiguration which means that the FPGA can keep operating while some of the logic is reconfigured via bitstream.

we need to jit the fpga jit jit then

Another use case is if you have a large amount of asynchronous I/O and real time constraints.

Insert some random Moore's law comment here
