I don't disagree with this, but I see the challenge not so much as the transmission speed between nodes as the 'semantics' for expressing addressing and placement. (And yes, I am a big fan of RDMA :-))
One of the things I helped invent/design back when I was at NetApp was network-attached memory, or NAM. NetApp was among a number of companies that realized that as memory sizes got larger, having memory associated with a single node made less and less sense from a computational perspective.
One can imagine a "node" which lives in a 64 bit address space, has say 16GB of "L3" cache shared among anywhere from 12 to 12,000 instruction execution or 'compute' elements. More general purpose compute would look more like a processor of today, more specialized compute looks like a tensor element or a GPU element.
"RAM" or generally accessible memory would reside in its own unit and could be 'volatile' or 'involatile' (backed by some form of storage). With attributes like access time, volatility, redundancy, etc.
That sort of architecture starts to look more like a supercomputer (albeit a non-uniform one) than your typical computer, with nodes "booting" by going through a self-test, then mapping their memory from the fabric and restarting from where they left off.
I thought I might get a chance to build that system at Google, but realized that, for the first time in my career that I was aware of, I was being held back by my lack of a PhD. What was worse, given the politics at the time, no one with a PhD would sign on as a 'figurehead' (I reached out to a couple and we had some really great conversations around this) because of the way Google was evaluating performance at the time (think grad school++, but where only one author seems to get all the credit).
Now that I've had some free time I've built some of it in simulation to explore the trade-offs and the ability to predict performance based on the various channels. Slow going though.
I'm saying this because every time I hear that a customer has a NetApp I get giddy as a school girl, because their filer products have a literal turbo button that I get to press.
Thanks to NetApp I've had some career highlights of boosting storage performance of entire enterprises by at least a factor of two, and occasionally as high as tenfold or more.
I'm not saying that I'm some sort of storage performance tuning wizard! There is literally a single parameter that can be changed to at least double the performance of every NetApp filer that has been shipped in about two decades: Just increase the TCP Window size from the original 1990s era default of 17KB to something sane like 256KB and then everything gets magically faster.
This is why it's so bizarre hearing you talking about performance and working at NetApp. It's like hearing that someone was on the Trabant F1 team.
For reference, 64KB was the default in Windows 2000-XP, and is dynamically raised up to 2MB in Vista and later, and up to 16MB on recent Windows Server versions.
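To put rough numbers on why the window size matters at all, here's a quick bandwidth-delay-product sketch (the 17KB/256KB figures are from the comment above; the 0.5 ms RTT is just an assumed example value):

```python
# Single-stream TCP throughput is bounded by window_size / round_trip_time,
# no matter how fast the link is.  The RTT here is an assumed example value.

def max_throughput_mbps(window_bytes, rtt_seconds):
    """Upper bound on single-stream TCP throughput in Mbit/s."""
    return window_bytes * 8 / rtt_seconds / 1e6

rtt = 0.0005  # assumed 0.5 ms datacenter round trip
for label, window in [("17 KB (old default)", 17 * 1024),
                      ("256 KB (suggested)", 256 * 1024)]:
    print(f"{label}: ~{max_throughput_mbps(window, rtt):.0f} Mbit/s ceiling")

# 17 KB / 0.5 ms  -> ~279 Mbit/s, far below a 10 GbE link
# 256 KB / 0.5 ms -> ~4194 Mbit/s
```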
I didn't say anything like what you're describing. Also not a fan of your unnecessary cheap shot.
thereby gaining 30% more speed?
Some article from 2013:
Some company from now:
Degrees, low employee numbers, and industry awards are all signals that "someone else" thought this person was "good". Not even super geniuses understand everything about everything, and they still have to make decisions.
Sad or not, in my experience working around gifted engineering teams it is not uncommon.
OK, so I had hoped that this was a problem unique to Google, but I guess it's not.
I just don't get this kind of thinking. I understand that a PhD is a good marker for intelligence... but not for ingenuity (maybe a tiny bit). I've worked with both PhDs and non-PhDs... while PhDs are amazing for talking through different ideas and what may or may not work, the people I was most productive with have universally been people who get excited about technology (not that the two are mutually exclusive). Mind, not excited about fads, but those who can't wait to code a prototype to see if it really works as expected. Those who will jump into the middle of an outage and are excited to learn about what might have gone wrong.
Those are the attributes that I now look for whenever I have to make a choice between teams, between companies, whatever. The people that get excited by new designs and architectures, who love to talk through them, are the best people to work with.
I guess I do understand why Google is that way... I just hope that this kind of thinking doesn't infect other workplaces.
The value of this is multi-faceted. It shows technical validation by trusted engineering talent. Those engineers are able to expend their own political capital on the project. Those engineers have demonstrated an ability to navigate corporate hierarchy and deliver large-scale projects. One of the key requirements for promotion to higher levels is that you are focusing on strategic goals for the company rather than strictly on complex/important technical products.
Having interesting & exciting conversations is nothing. Everyone is smart & happy to nerd out about complex and interesting abstract concepts/designs. Doesn't mean that they think those ideas are worth pursuing or that they're worth pursuing right now.
On the other side, engineers are good at finishing products, learning many half-assed tools, and tackling a different set of problems every day. This builds good intuition. But most of the time they lack the conceptual grounding, and they berate academics (at least early in their careers).
Now, this is not universally applicable and may be wrong. I have seen some good people who rock both worlds.
I feel like we do not yet have a calculus for analyzing the mix of "entanglement" of data across transactional expressions. It is a problem I've puzzled over since about 2001 when Steve Kleiman asked me to scale file systems without speeding up the processor.
I forget what the bandwidth of CPU cache is but I'm guessing it's not 10 terabit/second either.
What "server/network boundary" is in this case might not be the classical boundary though so maybe they also mean the same thing I'm saying just from a different perspective.
Simple: when the packet data does not need to be processed by the CPU. For example a router forwarding network packets at 10 Tbit/s. The data can stay in the NIC cache as it is being forwarded. No PCIe/CPU/RAM bottleneck here.
Also, EPYC Rome has 1.64 Tbps of RAM bandwidth today (eight DDR4-3200 channels). 10 Tbps is less than three doublings away. It's conceivable server CPUs can reach this bandwidth in 4-6 years.
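A quick back-of-the-envelope check on those figures (the channel count and transfer rate are from the comment; the 64-bit channel width is the standard DDR4 bus width):

```python
# Back-of-the-envelope DDR4 bandwidth, and doublings needed to pass 10 Tbit/s.

channels = 8                # EPYC Rome memory channels
transfer_rate = 3.2e9       # DDR4-3200: 3.2 GT/s per channel
bytes_per_transfer = 8      # 64-bit channel width

bandwidth_bps = channels * transfer_rate * bytes_per_transfer * 8
print(f"~{bandwidth_bps / 1e12:.2f} Tbit/s")         # ~1.64 Tbit/s

doublings = 0
b = bandwidth_bps
while b < 10e12:
    b *= 2
    doublings += 1
print(f"{doublings} doublings to exceed 10 Tbit/s")  # 3
```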
L2/L3 have bandwidths around 1-1.5 TB/s these days. Which pretty much is 10 TBit/s ;)
Also, if we're looking a bit forward, Intel recently demoed, with Ayar Labs, a 2.5D chip with a photonic chiplet that can do optical I/O at 1 Tbps/mm².
Perhaps some nice adaptive machine learning could drive this.
This would require a large ASIC or FPGA to be useful, loaded with high-clocked HBM2.
Also, dual-socket CPUs are very popular and more than that is still accessible; multiply the memory bandwidth by the number of sockets. Think about total throughput, not per-bus, per-socket, per-NIC, etc.
That's also what I thought; just put the NIC inside the processor and connect it to the internal fabric. (This still leaves plenty of software challenges.) But then DARPA says "The hardware solutions must attach to servers via one or more industry-standard interface points, such as I/O buses, multiprocessor interconnection networks, and memory slots, to support the rapid transition of FastNICs technology." Even if your "NIC" uses all 128 lanes of PCIe 5 that's only 4+4 Tbps. If you get rid of serdes and use something like IFOP that's ~600 Gbps per port then you'd still need something like 16 of those links.
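Rough sanity-check math on those numbers (PCIe 5.0 at 32 GT/s per lane with 128b/130b encoding; the ~600 Gbps IFOP-style port figure is taken from the comment above):

```python
# Rough per-direction PCIe 5.0 aggregate bandwidth, and how many ~600 Gbit/s
# links it would take to reach 10 Tbit/s.

lanes = 128
rate_per_lane = 32e9        # PCIe 5.0: 32 GT/s per lane
encoding = 128 / 130        # 128b/130b line coding overhead

pcie5_bps = lanes * rate_per_lane * encoding
print(f"PCIe 5.0 x128: ~{pcie5_bps / 1e12:.1f} Tbit/s per direction")  # ~4.0

ifop_bps = 600e9
print(f"~{10e12 / ifop_bps:.1f} links of 600 Gbit/s for 10 Tbit/s")    # ~16.7
```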
TBH, I'm kind of surprised x32 didn't happen in the time between PCIe 3.0 and 4.0 (or maybe it did and I just didn't hear about it), as there are now "plenty" of enterprise-class chips with enough lanes to make saturating such a pipe feasible. Although I'm guessing custom silicon already makes sense at that level of specialization, where you can do x32 if you want without waiting on a formalized interface.
Now getting this in consumer-level hardware...
Looks like it's up to 1.2 Tbps
By then you might as well redesign the OS to something less painful to use.
Let's start with TCP.
TCP is an abstraction layer over IP, creating the abstraction of reliable, ordered, guaranteed data delivery -- over an unreliable network, which does not guarantee that any given packet will arrive, much less which order it will arrive in, if it even does arrive.
That abstraction is called a "Connection".
But connections come with a high maintenance overhead.
The network stack must periodically check whether a connection is still active, send keep-alive packets on connections that are, allocate and deallocate memory for each connection's buffers, reorder incoming packets in each connection's buffer, do I/O to whatever subsystem(s) communicate with the network stack, etc.
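As a minimal sketch of that per-connection machinery, here's what configuring keep-alives and buffers looks like through standard sockets (the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT constants are Linux-specific; the host, timeouts, and buffer size are assumed example values):

```python
# Minimal sketch of the per-connection plumbing the TCP stack maintains:
# liveness probing, per-connection buffers, ordered delivery.
# TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific; the host,
# timeouts, and buffer size below are assumed example values.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # enable keep-alives
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle secs before probing
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before drop
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)  # per-connection buffer

sock.connect(("example.com", 80))    # three-way handshake: connection state created
sock.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
print(sock.recv(4096)[:64])          # bytes arrive reliable and in order
sock.close()                         # teardown: state and buffers released
```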
Speeding up that infrastructure would mean rethinking all of that... Here are some of the most fundamental questions to that thought process:
1) What new set of criteria will constitute a Connection, built on top of IP's packet-based, connectionless nature?
2) Who will be permitted to connect?
3) How will you authenticate #2?
4) Where (outside NIC, inside NIC, computer network stack, ?) will you perform the algorithmic tasks necessary for #2, #3?
5) What are you willing to compromise for faster speed? E.g., you could use raw datagrams, but not only are they not guaranteed to arrive, their source can be spoofed... how do you know that a datagram is from the IP address it claims to be from without further verification, without the Connection (and further verification at the Connection level, like SSL/certificates, etc.)? (See the sketch just below.)
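A minimal sketch of the spoofing concern in question 5: with plain UDP, the receiver only sees whatever source address the packet header claims (the port and payload here are assumed example values):

```python
# Question 5 in miniature: with plain UDP datagrams, the receiver only sees
# whatever source address the IP header claims; nothing proves the peer can
# actually receive at that address.  Port and payload are example values.

import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("0.0.0.0", 9999))

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello", ("127.0.0.1", 9999))

data, (claimed_ip, claimed_port) = receiver.recvfrom(2048)
# claimed_ip is just what the header said; a raw-socket sender could forge it.
# TCP's handshake (and TLS on top) is what lets you tie data to a peer.
print(data, claimed_ip, claimed_port)
```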
In other words, rethinking TCP/IP brings with it no shortage of potential problems and security concerns...
It might be faster to simply make the NIC cards faster, as the article talks about...
Or have the Clients or Server software be more selective about what data they send or receive... or what they accept as Connections, from whom, and why...
That is, maybe it's not a speed problem... maybe it's a selectivity problem...
Still, I'm all for faster hardware if DARPA can realize it. :-)
(unless you explicitly turn it on)
Any solution that doesn't perform this in software is doomed to failure. Imagine if upgrading the supported TLS version required a firmware upgrade, or purchasing new hardware.
Physical distance between the machines in a DC prevents RAM-style "shared-memory" architectures, at least ones that aim for 30-60 ns access times (10-20 meters). Unless there are new paradigms for computation in a distributed setting, I don't see the benefit of this...
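For a rough sense of the physics, the one-way propagation delay alone over those distances (assuming straight-line distance; fiber/copper signals travel at roughly two-thirds of c):

```python
# One-way propagation delay over datacenter-scale distances.

C = 3.0e8                   # speed of light in vacuum, m/s
FIBER = 2.0 / 3.0 * C       # rough signal velocity in fiber/copper

for meters in (10, 20):
    print(f"{meters} m: ~{meters / C * 1e9:.0f} ns at c, "
          f"~{meters / FIBER * 1e9:.0f} ns in fiber (one way)")
# 10 m -> ~33 ns at c (~50 ns in fiber); 20 m -> ~67 ns (~100 ns)
```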
Also, what are the fundamental limitations/research problems of today's hardware that prevent us from building a 400G NIC? I cannot think of anything outside of the PCIe bus getting saturated. We already have 400G ports on switches...
Well, for example, physical simulations using finite element or boundary value approaches. Pretty much anything you'd use with MPI or do on a supercomputer is going to run better on a machine with a nice network stack like this.
Even large scale storage (think backtesting on petabytes of options data) that uses a map-reduce paradigm and is properly sharded for the data access paths and aggregates would benefit from something like this.
There is a 2015 paper that argues that improving network performance isn't gonna help MapReduce/data analytics type of jobs much:

"... none of the workloads we studied could improve by a median of more than 2% as a result of optimizing network performance. We did not use especially high bandwidth machines in getting this result: the m2.4xlarge instances we used have a 1Gbps network link."
Granted things might have changed by now, but I am curious to see how and by how much?
Some map reduce loads, especially the kind that people running spark clusters want to do, end up moving a lot of data around. Either because the end user isn't thinking about what they're doing (95% of the time they're some DS dweeb who doesn't know how computers work), or because they need to solve a problem they didn't think of when they laid their data down.
I guess I cite myself, having done this sort of thing any number of times, and having helped write a shardable columnar database engine which deals with such problems. If you don't want to cite me, go ask Art Whitney, Stevan Apter, or Dennis Shasha, whose ideas I shamelessly steal. FWIW, around that timeframe I beat an 84-thread Spark cluster grinding on Parquet files with 1 thread in J (by a factor of approximately 10,000; the Spark job ran for days and never completed), basically because I understand that, no matter how many papers get written, data science problems are still IO bound.
I know a number of cloud providers are customers of theirs. Nice to see practical use cases for FPGAs.