Reinventing the Network Stack for Compute-Intensive Applications (darpa.mil)
171 points by shaklee3 15 days ago | 76 comments

From the article -- “The true bottleneck for processor throughput is the network interface used to connect a machine to an external network, such as an Ethernet, therefore severely limiting a processor’s data ingest capability,” said Dr. Jonathan Smith, a program manager in DARPA’s Information Innovation Office (I2O). “Today, network throughput on state-of-the-art technology is about 10^14 bits per second (bps) and data is processed in aggregate at about 10^14 bps. Current stacks deliver only about 10^10 to 10^11 bps application throughputs.”

I don't disagree with this, but see the challenge not so much as the transmission speed between nodes, rather I see it as the 'semantics' for expressing addressing and placement. (and yes, I am a big fan of RDMA :-))

One of the things I helped invent/design back when I was at NetApp was some network attached memory or NAM. NetApp was among a number of people who realized that as memory sizes got larger, having them associated with a single node made less and less sense from a computational perspective.

One can imagine a "node" which lives in a 64 bit address space, has say 16GB of "L3" cache shared among anywhere from 12 to 12,000 instruction execution or 'compute' elements. More general purpose compute would look more like a processor of today, more specialized compute looks like a tensor element or a GPU element.

"RAM" or generally accessible memory would reside in its own unit and could be 'volatile' or 'involatile' (backed by some form of storage). With attributes like access time, volatility, redundancy, etc.

That sort of architecture starts to look more like a super computer (albeit a non-uniform one) than your typical computer. With nodes "booting" by going through a self-test and then mapping their memory from the fabric and restarting from where they left off.

I thought I might get a chance to build that system at Google, but realized that, for the first time in my career that I was aware of, I was prevented by my lack of a PhD. And what was worse, given the politics at the time, no one with a PhD would sign on as a 'figurehead' (I reached out to a couple and we had some really great conversations around this) because of the way in which Google was evaluating performance at the time (think grad school++, but where only one author seems to get all the credit).

Now that I've had some free time I've built some of it in simulation to explore the trade-offs and the ability to predict performance based on the various channels. Slow going though.

Wow, a NetApp employee who's actually heard of performance as a feature!

I'm saying this because every time I hear that a customer has a NetApp I get giddy as a school girl, because their filer products have a literal turbo button that I get to press.

Thanks to NetApp I've had some career highlights of boosting storage performance of entire enterprises by at least a factor of two, and occasionally as high as tenfold or more.

I'm not saying that I'm some sort of storage performance tuning wizard! There is literally a single parameter that can be changed to at least double the performance of every NetApp filer that has been shipped in about two decades: Just increase the TCP Window size from the original 1990s era default of 17KB to something sane like 256KB and then everything gets magically faster.

This is why it's so bizarre hearing you talking about performance and working at NetApp. It's like hearing that someone was on the Trabant F1 team.

The default for both CIFS and NFS is 64k since ONTAP 8.2.1, released in 2013.


That's still way too low. A window size of 64KB will be a bottleneck on any 10 Gbps Ethernet connection, except perhaps a point-to-point cable. Any real-world network with switches, routers, and firewalls will require on the order of 256-512KB window sizes for decent performance. You need megabytes for any link that leaves the building!
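The claim above follows from the bandwidth-delay product: the TCP window must cover bandwidth × RTT or the sender stalls waiting for ACKs. A quick sketch (the RTT figures below are illustrative assumptions, not measurements):

```python
# Bandwidth-delay product: minimum TCP window needed to keep a link busy.
def required_window_bytes(bandwidth_bps, rtt_seconds):
    """Bytes of window needed so the pipe never drains while awaiting ACKs."""
    return bandwidth_bps * rtt_seconds / 8

# 10 Gbps link with a 0.5 ms RTT (plausible for a switched/routed LAN path):
lan = required_window_bytes(10e9, 0.5e-3)
print(f"10 GbE LAN path needs ~{lan / 1024:.0f} KB of window")   # ~610 KB

# Same link across a 10 ms WAN path ("any link that leaves the building"):
wan = required_window_bytes(10e9, 10e-3)
print(f"10 GbE WAN path needs ~{wan / 2**20:.1f} MB of window")  # ~11.9 MB
```

A 64 KB window on that LAN path caps throughput at roughly 64 KB / 0.5 ms ≈ 1 Gbps, which is the bottleneck the comment describes.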

For reference, 64KB was the default in Windows 2000-XP, and is dynamically raised up to 2MB in Vista and later, and up to 16MB on recent Windows Server versions.

Perhaps to avoid support tickets for people with equipment that doesn't properly support window scaling. The 64k setting is as high as you can go and avoid that.

What planet are you from where there is still such a unique and special flower that talks only to NetApp and no other operating system made in the last twenty years?

The planet where cheap or old or buggy routers, VPN software, etc don't properly support tcp window scaling: https://en.wikipedia.org/wiki/TCP_window_scale_option

I didn't say anything like what you're describing. Also not a fan of your unnecessary cheap shot.

How would you get around speed-of-light delays? The speed of light is about 1 foot/nanosecond, so maintaining a 1-cycle access time @ 3 GHz gives you about 4 inches to play with. Current server designs have some headroom (100 cycles = 33 feet, while the actual physical distance is more like 1-2 feet), but as soon as you start going to another rack and want to maintain normal RAM access speeds you run up against physical limits pretty quickly.
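The parent's numbers check out; here is the arithmetic as a quick illustrative sketch (using the ~1 ft/ns vacuum figure the comment uses, as an upper bound):

```python
# Distance budget for a memory access: how far a signal can travel
# within a given number of CPU cycles, at ~1 foot per nanosecond.
C_FT_PER_NS = 1.0  # speed of light in vacuum, approx. 1 ft/ns

def distance_budget_ft(clock_ghz, cycles):
    """Feet of travel available in `cycles` of a `clock_ghz` clock."""
    return C_FT_PER_NS * cycles / clock_ghz

print(distance_budget_ft(3.0, 1) * 12)  # ~4 inches for a 1-cycle access @ 3 GHz
print(distance_budget_ft(3.0, 100))     # ~33 ft for a 100-cycle access
```

Signals in copper or fiber travel at roughly 2/3 of this, so the real budgets are tighter still.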

I assume you mean the speed of light in copper or silicon instead of a vacuum? If so, may I warp your assumptions by suggesting silicon photonics, and

[1] https://en.wikipedia.org/wiki/Photonic-crystal_fiber

thereby gaining 30% more speed?

Some article from 2013:

[2] https://www.extremetech.com/extreme/161687-darpa-creates-hol...

Some company from now:

[3] https://www.nktphotonics.com/lasers-fibers/product-category/...

It's very sad that lack of a PhD would cause such a decision to be made rather than that it would get debated on its merits.

Or that the merits of a suggestion were inversely proportional to an employee number. Cultures such as those frequently emerge in engineering-driven organizations with a lot of smart people. I speculate that it is an offshoot of the impostor-syndrome effect: when someone feels like an impostor and is asked to make a big decision, they search for externally generated metrics that would increase their confidence in the decision.

Degrees, low employee numbers, and industry awards are all signals that "someone else" thought this person was "good". Not even super geniuses understand everything about everything, and they still have to make decisions.

Sad or not, in my experience working around gifted engineering teams it is not uncommon.

> Sad or not, in my experience working around gifted engineering teams it is not uncommon.

OK, so I had hoped that this was a problem unique to Google, but I guess it's not.

I just don't get this kind of thinking. I understand that a PhD is a good marker for intelligence... but not for ingenuity (maybe a tiny bit). I've worked with both PhDs and non-PhDs... while PhDs are amazing for talking through different ideas and what may or may not work, the people I was most productive with have universally been people who get excited about technology (not saying they're mutually exclusive). Mind, not excited about fads, but those who can't wait to code a prototype to see if it really works as expected. Those who will jump into the middle of an outage and are excited to learn about what might have gone wrong.

Those are the attributes that I now look for whenever I have to make a choice between teams, between companies, whatever. The people that get excited by new designs and architectures, who love to talk through them, are the best people to work with.

I guess I do understand why Google is that way... I just hope that this kind of thinking doesn't infect other workplaces.

Google is a big company and maybe this was true but it feels false (having worked there & other large tech companies) that the reason for failing to get that project going was the lack of a PhD. Now having top engineers on board probably is key, but again that isn't tied to any advanced degree but due to Google hiring & promo committees which are intended to reward merit.

The value of this is multi-faceted. It shows technical validation by trusted engineering talent. Those engineers are able to expend their own political capital on the project. Those engineers have demonstrated an ability to navigate corporate hierarchy and deliver large-scale projects. One of the key requirements of higher levels of promotion is that you are focusing on strategic goals for the companies rather than strictly complex/important technical products.

Having interesting & exciting conversations is nothing. Everyone is smart & happy to nerd out about complex and interesting abstract concepts/designs. Doesn't mean that they think those ideas are worth pursuing or that they're worth pursuing right now.

Same here. What I've noticed is that PhDs are better candidates for research than for coding and finishing a product. They've spent their careers on research, tackling a completely different set of problems. Much of the time they aren't aware of industry tools and practices like VCS and coding styles, lack the drive to learn the context, and are more focused on (re)inventing something.

On the other side, engineers are good at finishing products; they learn many half-assed tools and tackle a different set of problems every day, which builds good intuition. But much of the time they lack theory, and they berate academics (at least in their early career days).

Now, this is not universally applicable to everyone and may be wrong. I have seen some good people who rock both worlds.

A PhD is generally seen as "proof" that you can take a vague concept and perform high quality research. Obviously a PhD doesn't necessitate this and research is a VERY different skill than class work, but a PhD is likely seen to be lower risk for an employer (where any research is generally considered high risk).

These types of multi-tier compute systems/MITM processors are now being implemented. Microsoft publicly discusses their efforts here:


That looks like a lot of fun. I wonder if they would let me do a 1 year residency with that team. HP Enterprise was doing something as well with "the machine" (very large memory systems) which generated some interesting papers.

I feel like we do not yet have a calculus for analyzing the mix of "entanglement" of data across transactional expressions. It is a problem I've puzzled over since about 2001 when Steve Kleiman asked me to scale file systems without speeding up the processor.

As I recall the solution then was "buy Spinnaker". :) Join us at HPE storage, we'll do it right this time!

If you want to get a PhD there's definitely a lot of research going into networks. Especially at the big computing labs (read DOE). So like LLNL, ORNL, Sandia, Argonne. I don't work in this area, but I do work on/with HPC stuff so there's a lot of talk about all this stuff. It definitely seems to be a big part of the ECP movement/funding. Data transport and memory storage are two of the biggest problems I hear being talked about in ECP.

This is kind of what nvswitch does, albeit in a proprietary architecture. All 16 GPUs in an NVSwitch can see the memory space of all others, and any reads/writes are transparently performed on the correct GPU. This effectively gives a 512GB address space of HBM2 memory.

Better: this works between/towards POWER 9 sockets, including knowing to execute atomics right at the destination.

That sort of node exists as KNL, for numbers of compute elements up to ~200 with threading on. Unfortunately the on-chip OPA (RDMA) interface didn't materialize as far as I know, and KNL is dead (unless you're Down Under Geosolutions?).

8 dual-port 100G NICs (available for a while) or 2 dual-port 400 gigabit NICs (not yet available) would out-bandwidth the memory controller on an Epyc 7742; how are NICs such a bottleneck that they need to be increased 100x to keep up when DDR5 only doubles the bandwidth?

I forget what the bandwidth of CPU cache is but I'm guessing it's not 10 terabit/second either.

HPC needs HBM or TCI memory, not DDR5. Systems using HBM[1] and TCI[2] can already push an aggregate bandwidth of 8 Tbps.

1. https://en.wikipedia.org/wiki/High_Bandwidth_Memory

2. https://www.hotchips.org/wp-content/uploads/hc_archives/hc26...

It's a good point that HBM has high aggregate bandwidth but I still don't think it makes sense to call it a 100x server/network boundary bottleneck when the already available NIC is faster than NVlink in the first place.

What "server/network boundary" is in this case might not be the classical boundary though so maybe they also mean the same thing I'm saying just from a different perspective.

«how are NICs such a bottleneck»

Simple: when the packet data does not need to be processed by the CPU. For example a router forwarding network packets at 10 Tbit/s. The data can stay in the NIC cache as it is being forwarded. No PCIe/CPU/RAM bottleneck here.

Also, EPYC Rome has 1.64 Tbps of RAM bandwidth today (eight DDR4-3200 channels). 10 Tbps is less than three doublings away. It's conceivable server CPUs can reach this bandwidth in 4-6 years.
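The arithmetic behind those figures, as a quick sketch (eight DDR4-3200 channels, 8 bytes per transfer):

```python
import math

# EPYC Rome memory bandwidth: channels * transfer rate * bus width.
channels, transfers_per_sec, bytes_per_transfer = 8, 3200e6, 8
ram_tbps = channels * transfers_per_sec * bytes_per_transfer * 8 / 1e12
print(f"EPYC Rome RAM bandwidth: {ram_tbps:.2f} Tbps")  # 1.64 Tbps

# How many doublings until 10 Tbps?
doublings = math.log2(10 / ram_tbps)
print(f"doublings needed: {doublings:.1f}")  # ~2.6, i.e. "less than three"
```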

32 port 400G (12.8 terabit/s) 1u routers already exist today, the conversation is about the NIC at the server <-> network boundary not switching ASICs. The only reason you don't see 400G NICs in servers is the lack of the servers ability to put that much bandwidth over PCIe (the real bottleneck location).

The ASR-9000 series can handle up to 3.2 Tbps of L3 traffic per linecard, but this is only achievable because of the dedicated routing ASICs.

You still need to account for NIC-to-NIC packet transfers for scenarios where a packet arrives on physical NIC A and needs to egress via physical NIC B. Obviously there are better options than just PCIe transport these days.

This is (sort of) a forget everything you know about ______ scenario. But I think the short answer is multi-die interconnects.



> I forget what the bandwidth of CPU cache is but I'm guessing it's not 10 terabit/second either.

L2/L3 have bandwidths around 1-1.5 TB/s these days. Which pretty much is 10 TBit/s ;)

Surely you don't want to stream through cache, though.

In today's processors all data goes through the cache. There isn't really any other alternative on the horizon.

The radeon VII already has 1TB/s memory bandwidth, using HBM2, with HBM2E offering almost double the bandwidth.

Also, if we're looking a bit forward, Intel recently demoed, with Ayar Labs, a 2.5D chip with a photonic chiplet that can do optical I/O at 1 Tbps/mm^2.

This is what RDMA solves. You are only limited by the number of PCIe switches you stack up in your topology, and not at all by the processor anymore. All of your data is either handled directly by the NIC or offloaded to an accelerator card. Modern systems can support about 1.6 Tbps (100G NICs). When PCIe 4 comes out, this should double.

You forget we need network performance to be wasted by badly designed software.

In principle you can use P2P-DMA to shunt the bulk of the data to a specialized device (e.g. GPU, FPGA, storage) without it ever touching main memory or the CPU.

I’ve long had the fantasy that machines that communicate a lot amongst themselves could develop an optimized “argot” just as humans do. For example we’re on the local lan and could dispense with the fragmentation infrastructure. Or there are only six of us so could we just use three-bit “nicknames” instead of 48 bits of MAC. Etc.

Perhaps some nice adaptive machine learning could drive this.
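The "nickname" idea above could look something like the following hypothetical sketch: peers on a small LAN agree on a shared table mapping 48-bit MAC addresses to 3-bit indices, shrinking per-frame addressing overhead. Nothing here is an existing wire format; real protocols such as ROHC do a related kind of header compression for IP:

```python
# Hypothetical "argot" table: 3-bit nicknames standing in for 48-bit MACs.
class NicknameTable:
    def __init__(self, macs):
        if len(macs) > 8:
            raise ValueError("only 8 nicknames fit in 3 bits")
        self.to_nick = {mac: i for i, mac in enumerate(macs)}
        self.to_mac = dict(enumerate(macs))

    def compress(self, mac):
        return self.to_nick[mac]  # 3 bits on the wire instead of 48

    def expand(self, nick):
        return self.to_mac[nick]

table = NicknameTable(["aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02"])
nick = table.compress("aa:bb:cc:dd:ee:02")
print(nick, table.expand(nick))  # 1 aa:bb:cc:dd:ee:02
```

The hard part, which the sketch ignores, is keeping the table consistent across peers as machines join and leave.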

That’s well and good, but the limitation here is PCIe or other bus bandwidth, and memory bandwidth. PCIe 5.0 gets us to 1 Tbps per x16 slot. The top-of-the-line CPUs max out at 6.4 Tbps of memory bandwidth (Power10, when it is released).

This would require a large ASIC or FPGA to be useful, loaded with high-clocked HBM2.

They are looking at the entire system and they are willing to challenge (improve or replace) everything, including buses like PCIe. This is also an effort not meant for regular servers, but very high end. We already have 128 PCIe lanes or more in a single server, that means 8 Tbps; not in a single connection or slot, but aggregated.

Also, dual-socket CPUs are very popular and more than that is still accessible; multiply the memory bandwidth by the number of sockets. Think about total throughput, not per bus, per socket, per NIC, etc.

> They are looking at the entire system and they are willing to challenge (improve or replace) everything

That's also what I thought; just put the NIC inside the processor and connect it to the internal fabric. (This still leaves plenty of software challenges.) But then DARPA says "The hardware solutions must attach to servers via one or more industry-standard interface points, such as I/O buses, multiprocessor interconnection networks, and memory slots, to support the rapid transition of FastNICs technology." Even if your "NIC" uses all 128 lanes of PCIe 5 that's only 4+4 Tbps. If you get rid of serdes and use something like IFOP that's ~600 Gbps per port then you'd still need something like 16 of those links.

You would also need to replace the operating system and all the software. The traditional operating system driver stack does way too much copying for any of this to work. Perhaps a better model would involve some way of time-sharing or multiplexing direct access to the hardware by user applications. I'm not an operating system expert so take my comments with a grain of salt.

The standards get ratified years (literally) before the first implementations ship out, but note that PCIe 6.0 is already slated to provide 4 TB/s in a x16 slot.

TBH, I’m kind of surprised x32 didn’t happen in the time between PCIe 3.0 and 4.0 (or maybe it did and I just didn’t hear about it), as there are now “plenty” of enterprise-class chips that have sufficient lanes to make it feasible to saturate such a pipe, although I’m guessing custom silicon already makes sense at that level of specialization where you can do x32 if you want to without waiting on a formalized interface.

No, it is just 128 GB/s in each direction for a x16 slot.

Well, yes. I guess it depends on the context but I suppose you’re right since when you’re sending or receiving data that fast, you likely are going point-to-point and not switching/routing it, meaning you’ll end up mostly doing more of one than the other.

I was saying you miscalculated the PCIe bandwidth.

I don’t know what I was thinking. I completely misread and didn’t sanity check the numbers.

400Gbps switches exist today [1]. They cost the annual budget of the NFL, but they do exist. 10Tbps "just" requires a 25x increase.

Now getting this in consumer-level hardware...

1: https://www.router-switch.com/n9k-c9316d-gx.html

In switching, not NICs (yet).

They are only available switch<->switch? There are no NICs? What's InfiniBand up to now?

Looks like it's up to 1.2Tbps https://en.m.wikipedia.org/wiki/InfiniBand

Nobody has made 400G NICs because a PCIe 4.0 x16 slot is only 256 Gbps. Infiniband is currently at 200 Gbps because 8x and 12x modes are no longer used.

Thanks for the info. I only had quickipedia reference.

400Gbps interface, or total capacity? I assume you mean per-interface. Yeah they are crazy expensive currently.

Yeah I think it has 6.4Tb switching capacity.

How much to they actually cost on a per unit basis? Roughly?

If you're going by retail sucker pricing, a 32x100G switch is ~$10K so 32x400G should be <$40K. Hyperscale pricing should be under $10K.

I'm guessing since you need to get a quote, the answer is, "if you need to ask..."

If you need to ask... you are attempting to gather information to make a rational decision. I really dislike that [gatekeeping] meme. People who rely on "if you need to ask..." canards are most likely not acting in your interest.

That's not really about the buyer, but the seller. A seller hiding prices is usually a good indicator: they know they don't compete on price, and that showing the price would frighten off potential clients.

There are some markets (like 400G switches) where all the sellers say "call for pricing" so that doesn't really give you any information.

Before 100GbE or anything remotely close arrives on consumer systems, most (all?) in-kernel networking stacks are going to need to be scrapped.

By then you might as well redesign the OS to something less painful to use.

Excerpt: "Enabling this significant performance gain will require a rework of the entire network stack – from the application layer through the system software layer, down to the hardware."

Let's start with TCP.

TCP is an abstraction layer over IP, creating the abstraction of reliable, ordered, guaranteed data delivery -- over an unreliable network, which does not guarantee that any given packet will arrive, much less which order it will arrive in, if it even does arrive.

That abstraction is called a "Connection".

But connections come with a high maintenance overhead.

The network stack must check periodically whether a connection is still active, must periodically send out keep-alive packets on connections that are, must allocate and deallocate memory for each connection's buffer, must order packets as they arrive in each connection's buffer, must do I/O to whatever subsystem(s) communicate with the network stack, etc.

Speeding up that infrastructure would mean rethinking all of that... Here are some of the most fundamental questions to that thought process:

1) What new set of criteria will constitute a Connection -- on IP's packet-based, connectionless nature?

2) Who will be permitted to connect?

3) How will you authenticate #2?

4) Where (outside NIC, inside NIC, computer network stack, ?) will you perform the algorithmic tasks necessary for #2, #3?

5) What are you willing to compromise for faster speed? E.g, you could use raw datagrams, but not only are they not guaranteed to arrive, but their source can be spoofed... how do you know that a datagram is from the IP address it claims to be without further verification, without the Connection (and further verification of the Connection level, like SSL/Certificates,etc.)?

In other words, rethinking TCP/IP brings with it no shortage of potential problems and security concerns...
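For question 5 specifically, one hedged answer is to authenticate connectionless datagrams with a MAC over a pre-shared key, so spoofed-source packets fail verification with no connection state at all. A toy sketch using Python's stdlib `hmac` (key distribution is assumed out of band, and replay protection is omitted):

```python
import hmac, hashlib

KEY = b"pre-shared secret"  # assumed distributed out of band

def seal(payload):
    """Append a 32-byte HMAC-SHA256 tag to the payload."""
    tag = hmac.new(KEY, payload, hashlib.sha256).digest()
    return payload + tag

def open_datagram(wire):
    """Return the payload if the tag verifies, else None."""
    payload, tag = wire[:-32], wire[-32:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None

wire = seal(b"hello")
print(open_datagram(wire))                              # b'hello'
tampered = wire[:-1] + bytes([wire[-1] ^ 1])            # flip one tag bit
print(open_datagram(tampered))                          # None: rejected
```

This only answers authenticity, not the ordering and delivery guarantees the rest of the comment is about.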

It might be faster to simply make the NIC cards faster, as the article talks about...

Or have the Clients or Server software be more selective about what data they send or receive... or what they accept as Connections, from whom, and why...

That is, maybe it's not a speed problem... maybe it's a selectivity problem...

Still, I'm all for faster hardware if DARPA can realize it. :-)

It sounds like they don't go far enough. Perhaps the way to go is to redesign the CPU as well. CPUs could consume raw datagrams directly off the wire. To authenticate them we could use HMAC [1], which would presumably be built right into the CPU cache.

[1] https://en.wikipedia.org/wiki/HMAC

As a note, TCP doesn't periodically check if a connection is still alive, or send out keep-alives

(unless you explicitly turn it on)
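Right: keep-alive is a per-socket opt-in, which can be shown directly with the sockets API:

```python
import socket

# TCP keep-alive is off unless explicitly enabled on the socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # 0: off by default
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # nonzero: enabled
s.close()
```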

You don't use TCP for communication in serious compute-intensive work -- you use RDMA technologies like Infiniband, with ~1μs latency.

>4) Where (outside NIC, inside NIC, computer network stack, ?) will you perform the algorithmic tasks necessary for #2, #3?

Any solution that doesn't perform this in software is doomed to failure. Imagine if upgrading the supported TLS version required a firmware upgrade, or purchasing new hardware.

I am genuinely curious to see what types of "compute-intensive" applications fit the bill here. Outside of storage workloads (syncing data, etc.), why would you need a 100x improvement in data transfer rates between the machines? (We have TPUs and their specialized network architecture for ML-like workloads ...)

Physical distance between the machines in a DC prevents RAM style "shared-memory" architectures, at least ones that aim to have 30~60ns access times (10-20 meters). Unless there are new paradigms for computation in a distributed setting, I don't see the benefit for this ...

Also, what are the fundamental limitations/research problems of todays hardware that prevent us from building a 400G NIC? I cannot think of anything outside of PCI-e bus getting saturated. We already have 400G ports on switches ...

>I am genuinely curious to see what types of "compute-intensive" applications fit the bill here.

Well, for example, physical simulations using finite element or boundary value approaches. Pretty much anything you'd use with MPI or do on a supercomputer is going to run better on a machine with a nice network stack like this.

Even large scale storage (think backtesting on petabytes of options data) that uses a map-reduce paradigm and is properly sharded for the data access paths and aggregates would benefit from something like this.

Do you have numbers or papers that support your argument? That these applications are bottlenecked by the network?

There is a 2015 paper [1] that argues that improving network performance isn't gonna help MapReduce/data analytics type of jobs much:

" .. none of the workloads we studied could improve by a median of more than 2% as a result of optimizing network performance. We did not use especially high bandwidth machines in getting this result: the m2.4xlarge instances we used have a 1Gbps network link."

Granted things might have changed by now, but I am curious to see how and by how much?

[1]: https://www.usenix.org/system/files/conference/nsdi15/nsdi15...

The 2015 paper is obviously wrong or selling something; it's virtually always IO bound. Yes, I know many people assert otherwise; they're wrong.

Some map reduce loads, especially the kind that people running spark clusters want to do, end up moving a lot of data around. Either because the end user isn't thinking about what they're doing (95% of the time they're some DS dweeb who doesn't know how computers work), or because they need to solve a problem they didn't think of when they laid their data down.

I guess I cite myself, having done this sort of thing any number of times, and having helped write a shardable columnar database engine which deals with such problems. If you don't want to cite me, go ask Art Whitney, Stevan Apter or Dennis Shasha, whose ideas I shamelessly steal. FWIW, around that timeframe I beat an 84-thread Spark cluster grinding on Parquet files with 1 thread in J (by a factor of approximately 10,000 -- the Spark job ran for days and never completed), basically because I understand that, no matter how many papers get written, data science problems are still IO bound.

There are references somewhere under http://nowlab.cse.ohio-state.edu/ for instance.

I'm not sure about the typical demands on the fabric of FE-type codes, but a typical HPC cluster (university or national) is likely to spend much of its time on materials science work using DFT, which requires low latency rather than high bandwidth for short messages. Somewhere under http://archer.ac.uk there are usage statistics as an indication of the workload, though it varies month-to-month.

Slight tangent, but FYI for an accessible, authoritative, useful "book" on networking from the perspective of a modern web/app developer, I highly recommend "High-Performance Browser Networking". https://hpbn.co/

OP is about reinventing the network stack. My comment was a pointer to a great reference about the current network stack. Someone decided that was downvote-worthy (why?)... so its visibility is reduced to 0. I don't care about my karma points per se (may well lose more for this one), but I feel compelled to complain about having been muted/censored after posting a concise, relevant comment. :/

Napatech also has some state-of-the-art programmable NIC product offerings.


I know a number of cloud providers are customers of theirs. Nice to see practical use cases for FPGAs.

Please don't post advertisements disguised as a real post. If you work there, it should be disclosed.
