
Reinventing the Network Stack for Compute-Intensive Applications - shaklee3
https://www.darpa.mil/news-events/2019-09-26
======
ChuckMcM
From the article -- _“The true bottleneck for processor throughput is the
network interface used to connect a machine to an external network, such as an
Ethernet, therefore severely limiting a processor’s data ingest capability,”
said Dr. Jonathan Smith, a program manager in DARPA’s Information Innovation
Office (I2O). “Today, network throughput on state-of-the-art technology is
about 10^14 bits per second (bps) and data is processed in aggregate at about
10^14 bps. Current stacks deliver only about 10^10 to 10^11 bps application
throughputs.”_

I don't disagree with this, but I see the challenge not so much as the
transmission speed between nodes as the 'semantics' for expressing addressing
and placement. (And yes, I am a big fan of RDMA :-))

One of the things I helped invent/design back when I was at NetApp was a
network-attached memory, or NAM. NetApp was among a number of companies that
realized that as memory sizes got larger, having memory associated with a
single node made less and less sense from a computational perspective.

One can imagine a "node" which lives in a 64-bit address space and has, say,
16 GB of "L3" cache shared among anywhere from 12 to 12,000 instruction
execution or 'compute' elements. More general-purpose compute would look more
like a processor of today; more specialized compute would look like a tensor
element or a GPU element.

"RAM" or generally accessible memory would reside in its own unit and could be
'volatile' or 'involatile' (backed by some form of storage). With attributes
like access time, volatility, redundancy, etc.
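
A sketch of what I mean (hypothetical Python; MemoryPool, Node, and attach
are names I'm making up for illustration, not a real fabric API):

    # Illustrative only: fabric-attached memory with declared attributes.
    from dataclasses import dataclass, field

    @dataclass
    class MemoryPool:
        size_bytes: int      # capacity of the pool
        access_ns: float     # typical access latency
        volatile: bool       # False => backed by some form of storage
        redundancy: int      # number of replicas on the fabric

    @dataclass
    class Node:
        l3_cache_bytes: int
        compute_elements: int
        mapped: list = field(default_factory=list)

        def attach(self, pool: MemoryPool):
            # A node "maps" memory from the fabric instead of owning DIMMs.
            self.mapped.append(pool)

    node = Node(l3_cache_bytes=16 << 30, compute_elements=12)
    node.attach(MemoryPool(1 << 40, 250.0, False, 2))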

That sort of architecture starts to look more like a supercomputer (albeit a
non-uniform one) than your typical computer, with nodes "booting" by going
through a self-test, then mapping their memory from the fabric and restarting
from where they left off.

I thought I might get a chance to build that system at Google, but realized
that, for the first time in my career that I was aware of, I was being blocked
by my lack of a PhD. And what was worse, given the politics at the time, no
one with a PhD would sign on as a 'figurehead' (I reached out to a couple and
we had some really great conversations around this) because of the way in
which Google was evaluating performance at the time (think grad school++, but
where only one author seems to get all the credit).

Now that I've had some free time I've built some of it in simulation, to
explore the trade-offs and to see how well performance can be predicted from
the characteristics of the various channels. Slow going, though.
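
Roughly the kind of model I mean (a toy sketch, far simpler than the real
thing):

    # Toy channel model: time to move a working set across a fabric channel.
    def transfer_time_s(bytes_moved, channel_gbps, latency_ns, messages):
        wire = (bytes_moved * 8) / (channel_gbps * 1e9)  # serialization time
        setup = messages * latency_ns * 1e-9             # per-message overhead
        return wire + setup

    # Moving 1 GB over a 400 Gbps channel: message size dominates.
    for msg_size in (4096, 1 << 20):
        n = (1 << 30) // msg_size
        print(msg_size, round(transfer_time_s(1 << 30, 400, 800, n), 4), "s")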

~~~
jacquesm
It's very sad that the lack of a PhD would drive such a decision, rather than
the idea being debated on its merits.

~~~
ChuckMcM
Or that the merits of a suggestion were inversely proportional to an employee
number; but cultures such as those frequently emerge in engineering-driven
organizations with a lot of smart people. I speculate that it is an offshoot
of the impostor-syndrome effect: when someone is feeling like an impostor and
they are being asked to make a big decision, they search for externally
generated metrics that would increase their confidence in the decision.

Degrees, low employee numbers, and industry awards are all signals that
"someone else" thought this person was "good". Not even super-geniuses
understand everything about everything, and they still have to make decisions.

Sad or not, in my experience working around gifted engineering teams it is not
uncommon.

~~~
pm90
> Sad or not, in my experience working around gifted engineering teams it is
> not uncommon.

OK, so I had hoped that this was a problem unique to Google, but I guess it's
not.

I just don't get this kind of thinking. I understand that a PhD is a good
marker for intelligence... but not for ingenuity (maybe a tiny bit). I've
worked with both PhDs and non-PhDs... while PhDs are amazing to talk through
different ideas with, what may or may not work, the people I was most
productive with have universally been people who get _excited_ about
technology (not saying the two are mutually exclusive). Mind, not excited
about _fads_, but those who can't wait to code a prototype to see if it really
works as expected. Those who will jump into the middle of an outage and are
excited to learn about what might have gone wrong.

Those are the attributes that I now look for whenever I have to make a choice
between teams, between companies, whatever. The people that get excited by new
designs and architectures, who love to talk through them, are the best people
to work with.

I guess I do understand why Google is that way... I just hope that this kind
of thinking doesn't infect other workplaces.

~~~
vlovich123
Google is a big company, and maybe this was true, but it feels false (having
worked there and at other large tech companies) that the reason for failing to
get that project going was the lack of a PhD. Having top engineers on board
probably is key, but again, that isn't tied to any advanced degree; it's due
to Google's hiring & promo committees, which are intended to reward merit.

The value of this is multi-faceted. It shows technical validation by trusted
engineering talent. Those engineers are able to expend their own political
capital on the project. Those engineers have demonstrated an ability to
navigate corporate hierarchy and deliver large-scale projects. One of the key
requirements at higher levels of promotion is that you focus on strategic
goals for the company rather than on strictly complex/important technical
products.

Having interesting & exciting conversations is nothing. Everyone is smart and
happy to nerd out about complex and interesting abstract concepts/designs.
That doesn't mean they think those ideas are worth pursuing, or that they're
worth pursuing right now.

------
zamadatix
8 dual-port 100G NICs (available for a while) or 2 dual-port 400 gigabit NICs
(not yet available) would out-bandwidth the memory controller on an Epyc 7742;
how are NICs such a bottleneck that they need to be increased 100x to keep up
when DDR5 only doubles the bandwidth?

I forget what the bandwidth of CPU cache is, but I'm guessing it's not 10
terabit/second either.
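
For scale, the rough arithmetic (assuming eight DDR4-3200 channels on the
7742):

    # Back-of-the-envelope: aggregate NIC bandwidth vs EPYC 7742 memory.
    nic_gbps = 8 * 2 * 100                  # 8 dual-port 100G NICs
    mem_gbps = 8 * 3200e6 * 8 * 8 / 1e9     # 8 ch x 3200 MT/s x 8 bytes
    print(nic_gbps, round(mem_gbps))        # 1600 vs ~1638: comparable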

~~~
mrb
« _how are NICs such a bottleneck_ »

Simple: when the packet data does _not_ need to be processed by the CPU. For
example, a router forwarding network packets at 10 Tbit/s. The data can stay
in the NIC cache as it is being forwarded. No PCIe/CPU/RAM bottleneck there.

Also, EPYC Rome has 1.64 Tbps of RAM bandwidth today (eight DDR4-3200
channels). 10 Tbps is less than three doublings away. It's conceivable that
server CPUs can reach this bandwidth in 4-6 years.
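
The doubling math checks out:

    # ~1.64 Tbps today; how many doublings to reach 10 Tbps?
    import math
    bw_tbps = 8 * 3200e6 * 8 * 8 / 1e12     # eight DDR4-3200 channels
    print(math.log2(10 / bw_tbps))          # ~2.6, under three doublings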

~~~
zamadatix
32-port 400G (12.8 terabit/s) 1U routers already exist today; the conversation
is about the NIC at the server <-> network boundary, not switching ASICs. The
only reason you don't see 400G NICs in servers is the lack of the server's
ability to put that much bandwidth over PCIe (the real bottleneck location).

------
gumby
I’ve long had the fantasy that machines that communicate a lot amongst
themselves could develop an optimized “argot”, just as humans do. For example,
if we’re on the local LAN we could dispense with the fragmentation
infrastructure. Or, if there are only six of us, we could use three-bit
“nicknames” instead of 48 bits of MAC. Etc.

Perhaps some nice adaptive machine learning could drive this.
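
A crude sketch of the nickname idea (illustrative only; header-compression
schemes like ROHC do something related today):

    # Toy "argot": peers on a small segment agree on short nicknames
    # for 48-bit MAC addresses. Illustrative, not a real protocol.
    class Argot:
        def __init__(self):
            self.mac_to_nick, self.nick_to_mac = {}, {}

        def learn(self, mac: bytes) -> int:
            if mac not in self.mac_to_nick:
                nick = len(self.mac_to_nick)    # 3 bits covers 8 peers
                assert nick < 8, "fall back to full MACs"
                self.mac_to_nick[mac] = nick
                self.nick_to_mac[nick] = mac
            return self.mac_to_nick[mac]

    a = Argot()
    print(a.learn(bytes.fromhex("aabbccddeeff")))  # nickname 0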

------
virtuallynathan
That’s all well and good, but the limitation here is PCIe or other bus
bandwidth, and memory bandwidth. PCIe 5.0 gets us to 1 Tbps per x16 slot. The
top-of-the-line CPUs max out at 6.4 Tbps of memory bandwidth (Power10, when it
is released).

This would require a large ASIC or FPGA to be useful, loaded with high-clocked
HBM2.
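
For reference, the PCIe 5.0 arithmetic (32 GT/s per lane, 128b/130b encoding):

    # PCIe 5.0 x16 usable bandwidth.
    per_lane_gbps = 32 * 128 / 130        # ~31.5 Gbps per lane
    x16_gbps = per_lane_gbps * 16         # ~504 Gbps per direction
    print(round(x16_gbps), round(2 * x16_gbps))  # ~1 Tbps both directions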

~~~
AdrianB1
They are looking at the entire system and they are willing to challenge
(improve or replace) everything, including buses like PCIe. This is also an
effort meant not for regular servers but for the very high end. We already
have 128 PCIe lanes or more in a single server, which means 8 Tbps; not in a
single connection or slot, but aggregated.

Also, dual-socket CPUs are very popular and more than that is still
attainable; multiply the memory bandwidth by the number of sockets. Think
about total throughput, not per bus, per socket, per NIC, etc.

~~~
wmf
_They are looking at the entire system and they are willing to challenge
(improve or replace) everything_

That's also what I thought; just put the NIC inside the processor and connect
it to the internal fabric. (This still leaves plenty of software challenges.)
But then DARPA says "The hardware solutions must attach to servers via one or
more industry-standard interface points, such as I/O buses, multiprocessor
interconnection networks, and memory slots, to support the rapid transition of
FastNICs technology." Even if your "NIC" uses all 128 lanes of PCIe 5, that's
only 4+4 Tbps. If you get rid of serdes and use something like IFOP, at ~600
Gbps per port, you'd still need something like 16 of those links.
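
Rough numbers behind that last estimate (assuming ~600 Gbps per IFOP-class
link):

    # Links needed to carry 10 Tbps at ~600 Gbps per link.
    import math
    print(math.ceil(10_000 / 600))   # 17, i.e. "something like 16"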

------
berbec
400Gbps switches exist today [1]. They cost the annual budget of the NFL, but
they do exist. 10 Tbps "just" requires a 25x increase.

Now getting this in consumer-level hardware...

1: [https://www.router-switch.com/n9k-c9316d-gx.html](https://www.router-switch.com/n9k-c9316d-gx.html)

~~~
noodlesUK
How much do they actually cost on a per-unit basis? Roughly?

~~~
rhinoceraptor
I'm guessing since you need to get a quote, the answer is, "if you need to
ask..."

~~~
GhettoMaestro
If you need to ask... you are attempting to gather information to make a
rational decision. I really dislike that [gatekeeping] meme. People who rely
on "if you need to ask..." canards are most likely not acting in your
interest.

~~~
ZWoz
That is not really about the buyer, but the seller. If a seller hides the
price, that is usually a good indicator: they know that they don't compete on
price, and that showing the price would frighten potential clients.

~~~
wmf
There are some markets (like 400G switches) where all the sellers say "call
for pricing" so that doesn't really give you any information.

------
nwmcsween
Before 100GbE or anything remotely close arrives on consumer systems, most
(all?) in-kernel networking stacks are going to need to be scrapped.

By then you might as well redesign the OS into something less painful to use.

------
peter_d_sherman
Excerpt: "Enabling this significant performance gain will require a rework of
the entire network stack – from the application layer through the system
software layer, down to the hardware."

Let's start with TCP.

TCP is an abstraction layer over IP, creating the _abstraction_ of reliable,
ordered, guaranteed data delivery -- over an unreliable network, which does
not guarantee that any given packet will arrive, much less in which order it
will arrive, if it arrives at all.

That _abstraction_ is called a "Connection".

But connections come with a high maintenance overhead.

The network stack must periodically check whether a connection is still
active, send keep-alive packets on those that are, allocate and deallocate
memory for each connection's buffer, reorder incoming packets in each
connection's buffer, do I/O to whatever subsystem(s) communicate with the
network stack, etc.
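
Even a minimal sketch of that per-connection state (illustrative Python,
nowhere near a real stack) shows the bookkeeping:

    # Per-connection bookkeeping, heavily simplified for illustration.
    import time

    class Connection:
        def __init__(self, peer):
            self.peer = peer
            self.recv_buffer = {}    # seq -> payload, awaiting reorder
            self.next_seq = 0
            self.last_seen = time.monotonic()

        def on_segment(self, seq, payload):
            self.last_seen = time.monotonic()
            self.recv_buffer[seq] = payload
            data = b""
            while self.next_seq in self.recv_buffer:  # deliver in order
                data += self.recv_buffer.pop(self.next_seq)
                self.next_seq += 1
            return data

        def needs_keepalive(self, idle_s=75.0):
            return time.monotonic() - self.last_seen > idle_s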

Speeding up that infrastructure would mean rethinking all of it... Here are
some of the most fundamental questions for that thought process:

1) What new set of criteria will constitute a Connection -- on top of IP's
packet-based, connectionless nature?

2) Who will be permitted to connect?

3) How will you authenticate #2?

4) Where (outside NIC, inside NIC, computer network stack, ?) will you perform
the algorithmic tasks necessary for #2, #3?

5) What are you willing to compromise for faster speed? E.g., you could use
raw datagrams, but not only are they not guaranteed to arrive, their source
can be spoofed... how do you know that a datagram is from the IP address it
claims to be from, without further verification, without the Connection (and
verification at the Connection level, like SSL/certificates, etc.)?

In other words, rethinking TCP/IP brings with it no shortage of potential
problems and security concerns...

It might be simpler to just make the NICs faster, as the article
discusses...

Or have the client or server software be more selective about what data they
send or receive... or what they accept as Connections, from whom, and why...

That is, maybe it's not a speed problem... maybe it's a selectivity problem...

Still, I'm all for faster hardware if DARPA can realize it. :-)

~~~
chongli
It sounds like they don't go far enough. Perhaps the way to go is to redesign
the CPU as well. CPUs could consume raw datagrams directly off the wire. To
authenticate them we could use HMAC [1], which would presumably be built right
into the CPU cache.

[1] [https://en.wikipedia.org/wiki/HMAC](https://en.wikipedia.org/wiki/HMAC)
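
In software, the check itself is cheap to sketch (Python's standard hmac
module; distributing the shared key, the hard part, is assumed away):

    # Authenticate a raw datagram with HMAC-SHA256 over a pre-shared key.
    import hashlib, hmac, os

    key = os.urandom(32)                  # pre-shared out of band
    payload = b"raw datagram payload"
    tag = hmac.new(key, payload, hashlib.sha256).digest()

    # Receiver recomputes and compares in constant time; a spoofed
    # source cannot forge the tag without the key.
    print(hmac.compare_digest(
        tag, hmac.new(key, payload, hashlib.sha256).digest()))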

------
diffserv
I am genuinely curious to see what types of "compute-intensive" applications
fit the bill here. Outside of storage workloads (syncing data, etc.), why
would you need a 100x improvement in data transfer rates between the machines?
(We have TPUs and their specialized network architecture for ML-like workloads
...)

Physical distance between the machines in a DC prevents RAM-style "shared-
memory" architectures, at least ones that aim for 30-60 ns access times
(10-20 meters, roughly the distance light travels in that time). Unless there
are new paradigms for computation in a distributed setting, I don't see the
benefit of this...

Also, what are the fundamental limitations/research problems of today's
hardware that prevent us from building a 400G NIC? I cannot think of anything
other than the PCIe bus getting saturated. We already have 400G ports on
switches...

~~~
scottlocklin
> I am genuinely curious to see what types of "compute-intensive"
> applications fit the bill here.

Well, for example, physical simulations using finite element or boundary value
approaches. Pretty much anything you'd use with MPI or do on a supercomputer
is going to run better on a machine with a nice network stack like this.

Even large-scale storage (think backtesting on petabytes of options data) that
uses a map-reduce paradigm and is properly sharded for the data access paths
and aggregates would benefit from something like this.

~~~
diffserv
Do you have numbers or papers that support your argument? That these
applications are bottlenecked by the network?

There is a 2015 paper [1] that argues that improving network performance isn't
gonna help MapReduce/data analytics type of jobs much:

" .. none of the workloads we studied could improve by a median of more than
2% as a result of optimizing network performance. We did not use especially
high bandwidth machines in getting this result: the m2.4xlarge instances we
used have a 1Gbps network link."

Granted, things might have changed by now, but I am curious to see how and by
how much.

[1]: [https://www.usenix.org/system/files/conference/nsdi15/nsdi15...](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf)

~~~
scottlocklin
The 2015 paper is obviously wrong or selling something; these workloads are
virtually always I/O bound. Yes, I know many people assert otherwise; they're
wrong.

Some map-reduce loads, especially the kind that people running Spark clusters
want to do, end up moving a lot of data around -- either because the end user
isn't thinking about what they're doing (95% of the time they're some DS dweeb
who doesn't know how computers work), or because they need to solve a problem
they didn't think of when they laid their data down.

I guess I cite myself, having done this sort of thing any number of times and
helped write a shardable columnar database engine which deals with such
problems. If you don't want to cite me, go ask Art Whitney, Stevan Apter or
Dennis Shasha, whose ideas I shamelessly steal. FWIW, around that timeframe I
beat an 84-thread Spark cluster grinding on Parquet files with 1 thread in J
(by a factor of approximately 10,000; the Spark job ran for days and never
completed), basically because I understand that, no matter how many papers get
written, data science problems are still I/O bound.

~~~
gnufx
There are references somewhere under
[http://nowlab.cse.ohio-state.edu/](http://nowlab.cse.ohio-state.edu/) for
instance.

------
chrisweekly
Slight tangent, but FYI for an accessible, authoritative, useful "book" on
networking from the perspective of a modern web/app developer, I highly
recommend "High-Performance Browser Networking".
[https://hpbn.co/](https://hpbn.co/)

~~~
chrisweekly
OP is about reinventing the network stack. My comment was a pointer to a great
reference about the current network stack. Someone decided that was downvote-
worthy (why?)... so its visibility is reduced to 0. I don't care about my
karma points per se (may well lose more for this one), but I feel compelled to
complain about having been muted/censored after posting a concise, relevant
comment. :/

------
rsmets
Napatech also has some state-of-the-art programmable NIC offerings.

[https://www.napatech.com/products/napatech-smartnics/](https://www.napatech.com/products/napatech-smartnics/)

I know a number of cloud providers are customers of theirs. Nice to see
practical use cases for FPGAs.

~~~
shaklee3
Please don't post advertisements disguised as a real post. If you work there,
it should be disclosed.

