
Corundum: Open-source, high performance, FPGA-based NIC - lelf
https://github.com/ucsdsysnet/corundum
======
sebastianconcpt
_Corundum is an open-source, high-performance FPGA-based NIC. Features include
a high-performance datapath, 10G/25G/100G Ethernet, PCI Express gen 3, a
custom, high-performance, tightly-integrated PCIe DMA engine, many (1000+)
transmit, receive, completion, and event queues, MSI interrupts, multiple
interfaces, multiple ports per interface, per-port transmit scheduling
including high-precision TDMA, flow hashing, RSS, checksum offloading, and
native IEEE 1588 PTP timestamping. A Linux driver is included that integrates
with the Linux networking stack. Development and debugging are facilitated by
an extensive simulation framework that covers the entire system, from a
simulation model of the driver and PCI Express interface on one side to the
Ethernet interfaces on the other.

Corundum has several unique architectural features. First, transmit, receive,
completion, and event queue states are stored efficiently in block RAM or
ultra RAM, enabling support for thousands of individually-controllable queues.
These queues are associated with interfaces, and each interface can have
multiple ports, each with its own independent scheduler. This enables
extremely fine-grained control over packet transmission. Coupled with PTP time
synchronization, this enables high precision TDMA._

~~~
throwaway15846
FPGA = field-programmable gate array. NIC = network interface card. PTP =
Precision Time Protocol. TDMA = time-division multiple access. I'm trying to
understand what it is and what makes it special.

~~~
ncmncm
What makes TDMA special? It seems to be a provisioning scheme to reserve queue
slots to enable guaranteed bandwidth for specific connections.

What makes Corundum special? Mainly that you can add client code into it and
maybe free up an isolated core, or stage packets to send on short notice.

Mainly of interest for low-latency finance and for high-performance compute
clusters.

~~~
yaantc
Regarding TDMA only: it's part of IEEE time-sensitive networking (TSN), which
is intended to make Ethernet suitable for industrial applications where short
latencies and deterministic behavior are critical, and not guaranteed with
stock Ethernet.

Supporting critical traffic with TSN is a two-step process. First, you
synchronize all the participating network nodes. For this you can use PTP
(IEEE 1588), which is like an Ethernet-level NTP (grossly oversimplified, but
you get the idea). Once all the nodes are in sync, they can use time-aware
scheduling (TAS), where a TDM frame is overlaid across the whole LAN and
Ethernet traffic classes (TC) are assigned to specific ranges. In other words,
you define a repeating pattern, split into different sequential zones, and
traffic classes are aligned to some of those zones. The goal is to define
repeating ranges dedicated to specific traffic classes, where one can control
the load and make sure there is no contention, so traffic goes through with
deterministic latency.
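
To make the "repeating pattern" concrete, here is a minimal C sketch of such a
gate schedule. It is purely illustrative: the struct and function names are
made up, not taken from 802.1Qbv or any real API.

    /* Illustrative sketch of a time-aware schedule: a repeating cycle is
       split into windows, each open to a subset of traffic classes.
       All names are invented for illustration, not from any TSN API. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct tas_window {
        uint64_t offset_ns;    /* start of window within the cycle */
        uint64_t length_ns;    /* duration of the window */
        uint8_t  tc_gate_mask; /* bit i set => traffic class i may send */
    };

    struct tas_schedule {
        uint64_t cycle_ns;     /* length of the repeating pattern */
        size_t num_windows;
        const struct tas_window *windows;
    };

    /* May traffic class tc transmit at PTP-synchronized time now_ns? */
    static bool tas_gate_open(const struct tas_schedule *s,
                              uint64_t now_ns, int tc)
    {
        uint64_t phase = now_ns % s->cycle_ns; /* position in the cycle */
        for (size_t i = 0; i < s->num_windows; i++) {
            const struct tas_window *w = &s->windows[i];
            if (phase >= w->offset_ns &&
                phase < w->offset_ns + w->length_ns)
                return (w->tc_gate_mask >> tc) & 1;
        }
        return false; /* outside every window: gate closed */
    }

With, say, a 1 ms cycle and a 100 us window reserved for one real-time traffic
class, that class sees bounded latency as long as its load fits its window,
while everything else shares the remaining zones.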

All this could be used in a plant, to support both best-effort traffic and
sensitive real-time traffic for automation, while protecting the latter.

TSN started out for media applications (broadcasting) over Ethernet, but is
getting into industrial applications (see
[https://opcfoundation.org/](https://opcfoundation.org/)).

Support for TSN is planned for 5G (NR) release 16, to support industrial
applications.

All this area is in flux, so having a flexible programmable platform can be
interesting.

------
jdsnape
I've used NetFPGA ([https://netfpga.org/](https://netfpga.org/)) before, which
seems a little more complete; it will be interesting to see how this compares.

~~~
alexforencich
NetFPGA is a toolbox for network-based packet processing. It is not a NIC, and
their NIC reference designs leave a lot to be desired. Corundum is
specifically a NIC.

~~~
musicale
If NetFPGA gives you access to the PCI bus, it should be possible to make it
into a NIC.

Simple matter of some Verilog and a Linux driver. ;-)

Of course, we also have non-FPGA smart NICs from the likes of Netronome, etc.,
which can do things like accelerate eBPF or run P4.

~~~
alexforencich
Well, we're planning on porting Corundum to the NetFPGA SUME hardware at some
point in the near future. It should be relatively straightforward, as the PCIe
interface on the Virtex-7 is the same as on the UltraScale parts.

NetFPGA does have a NIC reference design, but AFAIK it's just the Xilinx XDMA
core connected to a Xilinx 10G MAC. No accessible transmit scheduler, no
offloading of any kind, etc. Just about as spartan as you can get, and it's
built from completely closed components so you can't really make many
modifications to it.

For what we're doing, we can't use any existing commercial NICs or smart NICs
because they can't provide the precision we need in terms of controlling
transmit timing. We don't care about eBPF, P4, etc. We care about
PTP-synchronized packet transmission with microsecond precision.
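
For reference, stock Linux does expose a software launch-time mechanism: the
SO_TXTIME socket option (kernel 4.19+) plus the etf qdisc. A rough sketch is
below; how precisely the launch time is honored depends entirely on the NIC
and qdisc, which is exactly the gap an FPGA NIC can close.

    /* Rough sketch of Linux SO_TXTIME (kernel >= 4.19): attach a launch
       time to each packet; an etf qdisc (optionally hardware-offloaded)
       holds it until then. Generic Linux mechanism, not Corundum's API. */
    #include <stdint.h>
    #include <string.h>
    #include <time.h>              /* CLOCK_TAI */
    #include <sys/socket.h>        /* SO_TXTIME, SCM_TXTIME (recent glibc) */
    #include <linux/net_tstamp.h>  /* struct sock_txtime */

    int enable_txtime(int fd)
    {
        struct sock_txtime cfg = {
            .clockid = CLOCK_TAI, /* clock the launch times refer to */
            .flags   = 0,
        };
        return setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));
    }

    ssize_t send_at(int fd, const void *buf, size_t len, uint64_t launch_ns)
    {
        char cbuf[CMSG_SPACE(sizeof(uint64_t))] = { 0 };
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type  = SCM_TXTIME;
        cm->cmsg_len   = CMSG_LEN(sizeof(uint64_t));
        memcpy(CMSG_DATA(cm), &launch_ns, sizeof(launch_ns));
        return sendmsg(fd, &msg, 0); /* released to the wire at launch_ns */
    }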

------
pedrocr
Really interesting project. Couldn't find the motivation explained. Is this
just for research? Is it usable in production running in an FPGA? Are there
plans to produce hardware?

~~~
crispyambulance
It's from a group at UCSD, so yes, this is research.

The applications for these kinds of things range from SDN (software-defined
networking) where low-latency is a concern and to applications in network
monitoring. One could, for example, put together a system that performs line-
rate TLS decryption at 10Gbps. You need an FPGA (a big one) for something like
that.

There are commercial vendors for this kind stuff (selling closed source IP and
hardware). It is not _yet_ in Open Compute networking projects, but I expect
that's coming soon.

You can now buy "whitebox" switches that run open network linux and put your
own applications on them. In the not-too-distant future those "applications"
will also extend to stuff that can run on FPGA hardware .

~~~
wbl
Nope! Netflix does 10G TLS on commodity hardware in kernel space. CPUs can do
a lot.

~~~
shaklee3
I believe they are doing 100Gbps now:
[https://t.co/cbb7NA9vJf?amp=1](https://t.co/cbb7NA9vJf?amp=1)

It's hard for me to see the use case of an FPGA NIC. The reasons outlined
above don't seem compelling when commodity NICs like Mellanox's already do so
much more.

~~~
wmf
It looks like the UCSD team are exploring data center TDMA, which no
commercial NIC supports.
[http://cseweb.ucsd.edu/~snoeren/papers/tdma-eurosys12.pdf](http://cseweb.ucsd.edu/~snoeren/papers/tdma-eurosys12.pdf)

~~~
alexforencich
The group web page is here:
[https://circuit-switching.sysnet.ucsd.edu/](https://circuit-switching.sysnet.ucsd.edu/)

Corundum was originally geared more towards optical circuit switching
applications, but it's certainly not limited to that. Since it's open source,
the transmit scheduler can be swapped out for all sorts of NIC and protocol
related research.

------
MeteOzturk
Might be a silly question, but is there a technique to rapidly program FPGAs
without interrupting other processes? Say I have multiple soft CPUs and only
want to use my gates to enable Ethernet once it's needed or the user plugs in.

~~~
aylons
Not a silly question, and actually a very powerful, seldom-used feature:
partial reconfiguration.

There are, however, several limitations to it. The clock cannot be changed,
for example, and usually neither can I/O, especially high-speed transceivers.
This has been improving (Xilinx UltraScale allows for reconfiguring I/O), but
you still have to reserve area for reconfiguration (meaning literal area, as
in a geographic region of the FPGA).

However, I/O versatility as you suggest has very few advantages. You need the
logic reserved for Ethernet to be programmed at the moment you plug in, so why
would you leave it unprogrammed? If it is simply disabled, it won't draw any
extra power, and your soft CPUs won't be able to take advantage of those
resources in the meantime anyway. Maybe you could use the area for new soft
CPUs, but then you'll hit the problem of over-segmenting your design and
allowing for less optimization. This would inevitably hurt timing closure and
area usage.

Also, FPGA programming may take minutes to finish, and always at least a few
seconds. This will be very noticeable to a user and not very efficient if it
has to be done frequently.

There are, of course, good uses for partial reconfiguration. But doing it
right takes a lot of effort, and you always risk overdoing it.

~~~
aseipp
Is programming speed really that bad for the ultra-high-end devices? Minutes?
I don't remember it being that bad for the Amazon F1 when I ported a Xilinx
build to use the F1 SDK (I didn't spend lots of time with our prior one, so I
wouldn't know). Of course, their programming strategy is extremely customized,
but even for very high-utilization images it was only ever on the order of
seconds. Vivado is terribly slow though, no matter what you do or what device
you use. (Not to mention if you want to use the ILA support over the
internet...)

Also, for some designs you can mitigate the reconfiguration time issue by
having two regions and draining requests to one of them, before doing an
update. Most of the Xilinx tooling for OpenCL does this kind of thing by
default (4-6 "opencl kernel" regions.) But of course it's not always an option
to give up that much space...

~~~
alexforencich
It depends on the programming interface. JTAG is bit serial and rather slow,
so it can take quite a while to load a large FPGA via JTAG. However, there are
several other interfaces that can be used, including QSPI, dual QSPI, parallel
flash, and a simple parallel interface from some other controller. These can
run at many MHz and can load a configuration into a large FPGA in less than a
second.
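
The arithmetic here is simple: configuration time is roughly bitstream size
divided by interface throughput. A back-of-the-envelope sketch, with made-up
but plausible numbers:

    /* Back-of-the-envelope FPGA configuration times: bitstream bits
       divided by configuration interface throughput. All numbers are
       illustrative assumptions, not from any datasheet. */
    #include <stdio.h>

    int main(void)
    {
        double bits = 8e8; /* assume a ~100 MB bitstream for a large FPGA */

        struct { const char *name; double width_bits, clock_hz; } ifc[] = {
            { "JTAG (serial, 30 MHz)",  1,  30e6 },
            { "QSPI x4 (100 MHz)",      4, 100e6 },
            { "dual QSPI x8 (100 MHz)", 8, 100e6 },
        };

        for (int i = 0; i < 3; i++)
            printf("%-24s %6.2f s\n", ifc[i].name,
                   bits / (ifc[i].width_bits * ifc[i].clock_hz));
        return 0;
    }

Under those assumptions JTAG comes out at around half a minute, while the wide
flash interfaces land at a second or two, matching the ballpark above.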

------
archi42
The hobbyist in me is disappointed: even the ExaNIC X10 seems to go for
US$350 used, and forget about the four-figure UltraScale boards :( Of course,
the FPGA can do a lot of other stuff in addition, so for HPC this might be
really nice for offloading application-specific work really early (because
sustained 2x10G traffic is bound to go somewhere, e.g. the CPU[s]).

~~~
ncmncm
$350 is cheap for such a powerful NIC. You could pay $10k for a Napatech, or
$2k for a Solarflare, and not get this level of programmability.

libexanic is a remarkably clean user-space kernel-bypass library that allows
you to do processing on early fragments of a packet while the rest of it is
still being received.

~~~
imtringued
Comparing used prices to new prices is dishonest. It costs $2,961.00 new.

[0]
[https://www.shi.com/Products/ProductDetail.aspx?SHISystemID=...](https://www.shi.com/Products/ProductDetail.aspx?SHISystemID=ShiCommodity&ProductIdentity=33645693)

~~~
alexforencich
I don't know where the heck that price came from. The cards are more like
$1200 new.

[https://www.cdw.com/product/exablaze-exanic-x10-network-adap...](https://www.cdw.com/product/exablaze-exanic-x10-network-adapter/3981568?enkwrd=exanic+x10)

~~~
harry8
[https://m.aliexpress.com/item/4000523308171.html](https://m.aliexpress.com/item/4000523308171.html)

UltraScale. Significantly cheaper. Could it be made to work, or is it subject
to your comments about the Kintex PCIe straddling?

Aside from that, did you know you were going to do this when you did
verilog-ethernet etc.?

------
alexforencich
Author of Corundum here--if you have any questions, ask away.

~~~
nullc
Are you aware of the extraordinarily good deal on huge Kintex (K420T)
NIC-like dev boards on AliExpress?

There are boards with 4 SFP+, ones with 2 SFP+ and 2 QSFP+, and even one with
4 QSFP28 (and an UltraScale+ XCVU9P)...

[https://www.aliexpress.com/store/group/FPGA-DEV/620372_25030...](https://www.aliexpress.com/store/group/FPGA-DEV/620372_250309057.html?spm=2114.12010612.pcShopHead_35478622.1_1)

they sound like great targets for your work...

~~~
alexforencich
Yes, I am aware of those. However, the Kintex PCIe interface is a bit of a
pain, as it has a TLP straddling mode that can't be disabled, so it will be
some time before it's supported; it will require some significant reworking
of the PCIe interface modules. I am planning on supporting straddling
eventually, as this will improve PCIe link utilization on the UltraScale and
UltraScale+ parts. If someone wants to donate a board, I can look into
supporting it.

~~~
hobo_mark
Interesting, I'd never heard of straddling. What is it supposed to achieve?

~~~
alexforencich
Straddling is an artifact of very wide interfaces. On the UltraScale+ parts,
the PCIe gen 3 x16 interface comes out as a 512-bit-wide interface. Every
cycle of the 250 MHz PCIe user clock transfers 64 bytes of data. The issue has
to do with how packets are moved over this type of interface. If your packets
are all a multiple of 64 bytes, no problem, you get 100% throughput. However,
if your packets are NOT a multiple of 64 bytes in length, you have a problem.
What byte lane do packets start and end in? The simplest implementation is to
always start packets in byte lane 0. The interface logic for this is the
simplest - the packets always start in the same place, so the fields always
end up in the same place. However, if your packet is 65 bytes long, the
utilization is horrible - it doesn't fit in one cycle, so you have to add an
extra cycle for every packet, and bus utilization falls to 50% as you have 63
empty byte lanes after every packet.

Straddling is an attempt to mitigate this issue. Instead of only starting
packets in lane 0, the interface is adjusted to support starting packets in
several places. Say, byte lanes 0 and 32. Or 0, 16, 32, and 48. Now, when you
have a packet end in byte lane 0, you can start the next packet in the same
clock cycle, but in byte lane 16 or 32. This increases the interface
utilization. The trade-off is now the logic has to deal with parts of two
packets in the same clock cycle, and it has to deal with multiple possible
packet offsets.

The specific annoyance with PCIe packets is that the max payload size is
usually 256 bytes, but every packet has a 12 or 16 byte TLP header attached,
which really screws things up when combined with the small max payload size.
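
The utilization numbers fall out of simple arithmetic; here's a short sketch
that reproduces them (illustrative only, not the PCIe hard core's logic):

    /* Reproduces the utilization arithmetic above for a 512-bit (64-byte)
       interface, with and without straddling. Illustrative only. */
    #include <stdio.h>

    int main(void)
    {
        const long bus = 64; /* bytes per cycle on a 512-bit interface */
        long pkt = 65;       /* packet length in bytes */

        /* No straddling: every packet starts in lane 0 and occupies
           ceil(pkt / 64) full cycles. */
        long cycles = (pkt + bus - 1) / bus;
        printf("no straddling:   %5.1f%%\n",
               100.0 * pkt / (double)(cycles * bus));

        /* Straddling with start lanes every gran bytes: the next packet
           can begin in the same cycle, so each packet effectively costs
           its length rounded up to the start-lane granularity. */
        for (long gran = 32; gran >= 16; gran /= 2) {
            long eff = (pkt + gran - 1) / gran * gran;
            printf("straddle @ %2ldB:  %5.1f%%\n", gran,
                   100.0 * pkt / (double)eff);
        }
        return 0;
    }

For 65-byte packets this prints roughly 51% with no straddling, 68% with
start lanes at 0 and 32, and 81% with start lanes every 16 bytes.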

~~~
hobo_mark
Fantastic explanation, thanks.

------
LinuxBender
This looks very interesting. If this were included in the Linux kernel, would
it read and utilize existing sysctl memory and qlen values as well as have
its own sysctl settings, or would all the settings be derived at module load
from the modprobe parameters? The reason I'm asking is that I currently
disable TOE (TCP offload engine) on all my NICs, as they have their own
buffer and retry settings and ignore the OS network settings.

~~~
alexforencich
It's still in development at the moment, so we'll see about the interface.
But there are no plans to implement any segmentation offloads or TOE in
Corundum; that will be left up to the network stack. However, scatter/gather
DMA support is planned so that software GSO will work. Right now, most of the
low-level twiddling is done from a user-space app that directly accesses
device registers. For a research device that's fine, but it would obviously
have to be improved for a commercial product.
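
(For the curious: "directly accesses device registers" on Linux typically
means mmap()ing a PCIe BAR exposed through sysfs. A generic sketch follows;
the device path and register offset are placeholders, not Corundum's actual
register map.)

    /* Generic sketch of user-space register access on Linux: map a PCIe
       BAR exposed via sysfs and read a register. The device address and
       register offset are placeholders, not Corundum's real layout. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* placeholder PCI address; find the real one with lspci */
        const char *bar = "/sys/bus/pci/devices/0000:01:00.0/resource0";

        int fd = open(bar, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* hypothetical read of the first 32-bit register in the BAR */
        printf("reg[0] = 0x%08x\n", regs[0]);

        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }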

------
imtringued
Although this is cool, the downside is that it only works on specific FPGAs.
Are hardware-independent designs possible, so that one could run it on any
sufficiently large FPGA?

~~~
aseipp
Making RTL portable, so it can be used with different toolchains, is
certainly possible -- and how/why you do this is, like everything, an
engineering tradeoff. Sometimes when you write C code it's easier to just use
Linux features and not care about portability! Sometimes it's very desirable
to keep your code portable, which might be easy (use the standard library) or
hard (use lots of #ifdefs or whatever).

But, unlike portable C code, to run designs like this on real hardware, you
need to do things like describe how the physical pins on the FPGA are
connected to the board peripherals (for instance, describing which pin might
be connected to an LED, vs a UART). This generally requires a small amount of
glue, and depending on how the project is structured, some amount of
Verilog/VHDL code, as well. It's not like saying "cc -O2 foo.c" with your
ported C compiler that has a POSIX standard library.

And that's just the case where you're using the same base FPGA with different
board layouts. With different FPGAs (for example, a current-gen FPGA from
vendor XYZ vs. XYZ's gen N-1), and especially when porting between vendors,
the details can become vastly more complex very quickly.

------
JackRabbitSlim
TDMA in a data center environment: anyone else getting some heavy deja vu?

~~~
alexforencich
That's something we joke about quite a bit in the research group - "back to
the future!"

------
lightedman
I really wish people would try making original names for their products. Not
only am I now going to have to sift through Steven Universe stuff when I look
up corundum for people, I'll have to filter this out as well.

Very annoying for geology hobbyists.

~~~
alexforencich
There are two hard things in computer science: cache invalidation, naming
things, and off-by-one errors.

------
appleflaxen
I vaguely remember a Corundum project related to the Ruby (and Rust?)
languages, but search engines fail because the terms are all related to
mineralogy.

FWIW: this appears to be different from / unrelated to Ruby or Rust.

