
Datacenter RPCs can be general and fast - feross
https://blog.acolyer.org/2019/03/18/datacenter-rpcs-can-be-general-and-fast/
======
baybal2
Been working on something similar as a sub-sub-sub contractor for Alibaba's
DCs.

The physical latency is nowhere near as scary as the latency of a shitty NIC and
software... signals propagate at roughly 5 ns per metre, so a 20 metre round trip
is only about 200 nanoseconds, about the same as pulling something from DRAM.
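
To make the arithmetic explicit (the 5 ns/m figure is my assumption: ~2/3 of the speed of light in fiber or copper):

```cpp
#include <cstdio>

int main() {
    const double ns_per_metre = 5.0;   // assumed: ~2/3 of c in fiber/copper
    const double link_metres  = 20.0;  // one-way cable length
    const double rtt_ns = 2 * link_metres * ns_per_metre;
    std::printf("20 m link round trip: %.0f ns\n", rtt_ns);  // prints 200 ns
}
```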

~~~
CoolGuySteve
The Linux network stack needs to get its shit in order. Kernel bypass
libraries like Solarflare's ef_vi transmit a UDP packet in ~250ns whereas the
kernel typically takes several usec.

Transitioning into kernel space shouldn't take that long, so a lot of needless
instructions must be getting executed.
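
For anyone who wants to reproduce the "several usec" number, a crude check is to time a burst of sendto() calls on a plain kernel UDP socket (loopback and NIC choice skew this, so treat it as a ballpark; the destination port is an arbitrary assumed-unused one):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <chrono>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in dst{};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);              // assumed-unused local port
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    char payload[64] = {};
    const int iters = 100000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++)
        sendto(fd, payload, sizeof(payload), 0,
               reinterpret_cast<sockaddr *>(&dst), sizeof(dst));
    auto t1 = std::chrono::steady_clock::now();

    // Average cost of syscall + UDP/IP stack + driver handoff per packet.
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("avg sendto(): %.0f ns\n", ns / iters);
}
```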

~~~
ignoramous
For any meaningful elastic deployment at scale, Linux's (or really any OS's)
TCP/IP stack will have to be bypassed by user-space programs: bypassing has
nearly all of the upsides and few of the downsides.

Ref: [https://opennetworking.org](https://opennetworking.org)

Not just datacenters: even 5G deployments, which will be exclusively IPv6
based, will sport user-space processing of packets and flows.

Ref: [https://opencord.org](https://opencord.org)

~~~
CoolGuySteve
Exactly my point, the kernel shouldn’t have to be bypassed. It should just be
fast.

~~~
convolvatron
there are a couple pretty core assumptions that make that hard:

read() interfaces expect the kernel to put the data where you asked. it's hard
but not impossible to coordinate with the NIC to demultiplex the packet and
get it into the right place without a copy. almost all of the time you just
want to see the data; you don't care that it lands in this particular spot.
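
(the kernel already has a partial answer here: AF_PACKET's mmap'ed RX ring lets you read packets in place instead of copying into a caller-supplied buffer. rough sketch of the old TPACKET_V1 flavor, error handling omitted, needs CAP_NET_RAW:)

```cpp
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <poll.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <cstdint>
#include <cstdio>

int main() {
    // raw packet socket; needs root / CAP_NET_RAW
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    tpacket_req req{};
    req.tp_block_size = 1 << 16;  // 64 KiB per block
    req.tp_block_nr   = 64;
    req.tp_frame_size = 1 << 11;  // 2 KiB per frame
    req.tp_frame_nr   = req.tp_block_size / req.tp_frame_size * req.tp_block_nr;
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    // one shared ring; the kernel fills frames, userspace reads them in place
    auto *ring = static_cast<uint8_t *>(
        mmap(nullptr, size_t{req.tp_block_size} * req.tp_block_nr,
             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    for (unsigned frame = 0;; frame = (frame + 1) % req.tp_frame_nr) {
        auto *hdr = reinterpret_cast<tpacket_hdr *>(
            ring + size_t{frame} * req.tp_frame_size);
        while (!(hdr->tp_status & TP_STATUS_USER)) {
            pollfd pfd{fd, POLLIN, 0};
            poll(&pfd, 1, -1);  // sleep until the kernel fills this frame
        }
        // the packet is readable in place; no copy into a caller buffer
        const uint8_t *pkt = reinterpret_cast<uint8_t *>(hdr) + hdr->tp_mac;
        std::printf("got %u bytes (first byte 0x%02x)\n", hdr->tp_len, pkt[0]);
        hdr->tp_status = TP_STATUS_KERNEL;  // hand the frame back
    }
}
```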

getting control flow in and out of the kernel is expensive. target addresses
need to be checked. registers often need to be saved.

things like the kernel firewall take a certain amount of work to determine a
packet's status.

there are lots of mitigations, but at some point you run into a conflict
between general purpose multi-process functions and performance.

if you ran your application on a dedicated ARM core next to the NIC I bet
you'd see a difference. maybe that shouldn't be so hard.

~~~
CoolGuySteve
That still doesn't add up though. System call overhead is something like 500 ns
to 1 usec depending on the call. Copying a full 1500-byte UDP MTU from RAM is
pessimistically on the order of 2.5 usec (1500 bytes / 64-byte cacheline * 100
ns L3 access).

So a simple UDP recv() plus a copy from the network card buffer to userspace
should be about 3-4 usec. In the real world, it's more like 6 or 7 usec the
last time I timed it.
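
Spelled out (the 100 ns miss cost and fully serialized misses are pessimistic assumptions; hardware prefetchers overlap several misses in practice):

```cpp
#include <cstdio>

int main() {
    // Assumptions from above: 64 B cache lines, ~100 ns per miss, and
    // (very pessimistically) one miss at a time.
    const double copy_ns = 1500.0 / 64.0 * 100.0;  // ~2344 ns for one MTU
    const double syscall_ns = 1000.0;              // upper end of the range
    std::printf("recv() + copy estimate: %.1f us\n",
                (syscall_ns + copy_ns) / 1000.0);  // ~3.3 us vs ~6-7 measured
}
```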

Admittedly, the Unix API could use a refresh. A lot of Unix calls like
getaddrinfo are perfectly happy to allocate and give you a buffer. A
"fast_recv" could easily do the same and just hand you a page from a kernel
page pool directly that you then have to free.

By avoiding the buffer copying, you could get something in the same ballpark
as kernel bypass but without all the bespoke code and frameworks in every
application. And in the end, that's kind of the point of having an "Operating
System" in the first place.

~~~
ignoramous
> So a simple UDP recv() + copy from network card buffer to userspace should
> be about 3-4usec. In the real world, it's more like 6 or 7usec the last time
> I timed it.

Not sure where the Linux kernel lags (maybe at scale it kind of tails off?),
but Google's paper on their network load balancer, Maglev, is relevant here:
section 5.2.1 specifically calls out a 30% performance degradation without
kernel bypass on smaller packet sizes.

> By avoiding the buffer copying, you could get something in the same ballpark
> as kernel bypass but without all the bespoke code and frameworks in every
> application.

Maybe this article from Cloudflare helps paint a proper picture of why
certain kinds of applications may need to keep bypassing the kernel forever,
because their requirements are very specific and not because the kernel can't
be made to go faster (e.g. a virtual router/switch, or a load balancer):
[https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/](https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/)

------
zrb
But where is the RPC? This just seems like messaging.

EDIT: Found this... [https://github.com/erpc-io/eRPC](https://github.com/erpc-io/eRPC)

------
peterwwillis
Am I missing something, or is this post effectively saying that all you need is
a perfect network and suddenly Ethernet is fast?

~~~
shereadsthenews
The more interesting thing in this paper is the gulf between state of the art
and mainstream practice. Most of you people are happy getting ~100K QPS in a
32-core box and this guy is getting 10M QPS per core.

~~~
diffserv
That is hardly the state of the art: they are basically sacrificing all the
abstractions that are rightly required, in the name of speed. This is no
different from using vanilla DPDK with no congestion and flow control and
being able to process 20 million packets per second per core (better than the
numbers in the paper). Getting 75 Gbps per core using RDMA is neither hard nor
new.
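
For reference, "vanilla DPDK" concretely means a busy-poll receive loop with nothing checking for loss or congestion. A minimal sketch (device and mempool setup elided; see DPDK's examples/skeleton for the full version):

```cpp
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

int main(int argc, char **argv) {
    rte_eal_init(argc, argv);
    // ... mempool creation, rte_eth_dev_configure(), RX queue setup and
    // rte_eth_dev_start() elided; see DPDK's examples/skeleton ...

    constexpr uint16_t kPort = 0, kQueue = 0, kBurst = 32;
    rte_mbuf *pkts[kBurst];
    for (;;) {
        // busy-poll the NIC's descriptor ring; no congestion or flow control
        uint16_t n = rte_eth_rx_burst(kPort, kQueue, pkts, kBurst);
        for (uint16_t i = 0; i < n; i++) {
            // touch/process pkts[i] in place, then recycle the buffer
            rte_pktmbuf_free(pkts[i]);
        }
    }
}
```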

You really can't use this abstraction for anything unless you are on a
completely lossless fabric that has enough capacity to avoid congestion.

Yes, Google/MS/FB are going towards such fabrics, but the hard part is not
building the RPC abstraction suggested in this paper; the hard part is getting
to that fabric.

Just to put things into perspective: if you had a quantum computer you could
do all sorts of crazy stuff with it. "You could parallelize loops and be super
fast!" is what this paper is suggesting (I specifically chose parallelizing
loops because there is nothing new in it).

~~~
anujkaliaitd
I'm the main author of eRPC, and I wanted to clarify some things.

Your comment suggests (please correct me if you meant differently) that (a)
eRPC does not perform congestion control, and (b) eRPC requires a lossless
fabric. In fact, eRPC implements congestion control, and it works well in a
lossy network. Those are the two main contributions of the paper.

We get 75 Gbps with only UDP/Ethernet packet I/O, without RDMA support.

eRPC implements transport-layer functionality atop a fast packet I/O engine
like DPDK, so comparing eRPC to DPDK isn't apples-to-apples.
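
To show the difference concretely: this is roughly what the application-facing side of eRPC looks like, paraphrased from memory of the hello_world example in the repo; exact signatures, hostnames, and constants here are placeholders and may have drifted from the current API, so treat it as a sketch:

```cpp
#include "rpc.h"  // eRPC's top-level header

erpc::Rpc<erpc::CTransport> *rpc;
erpc::MsgBuffer req, resp;

// Continuation: runs once the response has been reassembled into `resp`.
void cont_func(void *, void *) { /* consume resp here */ }
// Session management callback (connect/disconnect events).
void sm_handler(int, erpc::SmEventType, erpc::SmErrType, void *) {}

int main() {
    // Hostnames and the UDP port are placeholders.
    erpc::Nexus nexus("client-host:31850");
    rpc = new erpc::Rpc<erpc::CTransport>(&nexus, nullptr, 0, sm_handler);

    int session = rpc->create_session("server-host:31850", 0);
    while (!rpc->is_connected(session)) rpc->run_event_loop_once();

    req = rpc->alloc_msg_buffer_or_die(16);
    resp = rpc->alloc_msg_buffer_or_die(16);

    // Request type 1 assumed registered at the server via the Nexus.
    rpc->enqueue_request(session, 1, &req, &resp, cont_func, nullptr);
    rpc->run_event_loop(100);  // drive progress for 100 ms
    delete rpc;
}
```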

~~~
diffserv
Hey, Anuj.

Your repo is really nice for an academic paper. Thank you for that. It's rare
to see a networked-systems repository that has readable code. I mainly checked
the large-tput example.

A few questions—

1) For your 75 Gbps, what percentage of the RPC payload do you touch? I.e.,
what portion of the message is actually used on that core?

More directly: say you have a service that can sustain 100K QPS; if it
switches to eRPC, what can it expect? Asked differently, what is the base
overhead of today's RPC libraries, especially ones that bypass the kernel?

2) The congestion and flow control story is debatable, and its efficacy is an
open question, especially in a DC setting. Can you claim that eRPC would work
for any type of workload in a DC setting? How would it play with other
connections? At the end of the day, if you are forced to play nice, you may
end up adding branches to your code; your fast path gets split depending on
the connection type, etc. Is that something that you think is preventable?

3) How do you distribute the load across different cores at 75 Gbps? How do
the CPU rings, contention, etc. come into play? I.e., can you do useful work
with that 75 Gbps, or should I just read it as a "wow" number? Asking a
different question: if I have a for loop that can do 10 billion iterations per
second, and merely adding a function call drops it to 10K per second, why
would I care about the 10 billion?

4) You claim that it works well in a lossy network, yet your goodput drops to
18 and 2.5 Gbps at 10^-4 and 10^-3 packet loss respectively (I am assuming the
library is still trying to flood the network at 75 Gbps). How does this play
out at scale?

All in all, I do appreciate your work. My issue is that academic people like
to make big claims, especially in an academic setting. People in the industry
are aware of fast paths; the kernel networking stack uses fast paths
rigorously. Sure, it is heavy and comes with a lot of bulk, but you can just
as easily cut it down.

~~~
anujkaliaitd
Thank you for the questions.

1. Our throughput benchmark is designed to measure data transfer bandwidth,
and the comparison point is RDMA writes. In the benchmark, eRPC at the server
internally re-assembles request UDP frames into a buffer that is handed to the
server in the request handler. The request handler does not re-touch this
buffer, similar to RDMA writes.

We haven't compared against RPC libraries that use fast userspace TCP.
Userspace TCP is known to be a fair bit slower than RDMA, whereas eRPC aims
for RDMA-like performance.

2. eRPC uses congestion control protocols (Timely or DCQCN) that have been
deployed at large scale; a sketch of Timely's rate update follows at the end
of this comment. The assumption is that other applications are also using some
congestion control to keep switch queueing low, but we haven't tested
co-existence with TCP yet.

3. 75 Gbps is achieved with one core, so there's no need to distribute load.
We could insert this data into an in-memory key-value store, or persist it to
NVM, and still get several tens of Gbps. The performance depends on the
computation-to-communication ratio, and we have tons of communication-intensive
applications.

4. Packet loss in real datacenters is rare, and we can make it rarer with BDP
flow control. Congestion control kicks in during packet loss, so we don't
flood the network. Our packet loss experiment uses one connection, which is
the worst case; an eRPC endpoint likely participates in many connections, most
of which are uncongested.
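
The Timely rate update, roughly, per the SIGCOMM '15 Timely paper (the constants below are illustrative, not eRPC's, and the HAI refinement is omitted). The point is that the sender reacts to the gradient of the RTT, so it backs off before queues overflow, which is why a lossless fabric isn't required:

```cpp
// Rate update in the style of Timely (SIGCOMM '15); constants illustrative.
struct Timely {
    double rate_gbps = 10.0;  // current sending rate
    double prev_rtt_ns = 0, rtt_diff_ns = 0;

    static constexpr double kAlpha = 0.5;      // EWMA weight (assumed)
    static constexpr double kBeta = 0.8;       // decrease factor (assumed)
    static constexpr double kTLow = 50'000;    // ns, low RTT threshold
    static constexpr double kTHigh = 500'000;  // ns, high RTT threshold
    static constexpr double kMinRtt = 10'000;  // ns, used for normalization
    static constexpr double kAddStep = 5.0;    // Gbps additive increase

    void update(double rtt_ns) {  // RTT sample from NIC hardware timestamps
        // EWMA of the RTT slope across consecutive samples.
        rtt_diff_ns = (1 - kAlpha) * rtt_diff_ns
                      + kAlpha * (rtt_ns - prev_rtt_ns);
        prev_rtt_ns = rtt_ns;
        const double gradient = rtt_diff_ns / kMinRtt;  // normalized slope

        if (rtt_ns < kTLow)        rate_gbps += kAddStep;  // no queueing
        else if (rtt_ns > kTHigh)  rate_gbps *= 1 - kBeta * (1 - kTHigh / rtt_ns);
        else if (gradient <= 0)    rate_gbps += kAddStep;  // queues draining
        else                       rate_gbps *= 1 - kBeta * gradient;
    }
};
```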

~~~
diffserv
Thanks for the answers:

1) Are you using or relying on DMA or SPDK to copy packet data? A single core,
to my understanding (assuming 10 concurrent cache lines in flight and 70~90 ns
of memory access time, that is roughly 10 * 64 B / 80 ns, or about 8 GB/s ~= 64
Gbps), doesn't have the bandwidth to copy that much data from the NIC to memory
(assuming the CPU is in the middle). If so, RDMA and the copy methodology are
not so different in how they operate.

I haven't looked at the part of the paper where you explain how you perform
the copying.

Also, IMHO, RDMA itself is a pet project of a particular someone, somewhere,
who is looking for promotions :) I don't really know if it's a good baseline.
It would be more reasonable to look at the benchmarks of other RPC libraries
and compare against the feature set they are providing.

2) As far as I remember from talking with random people from Microsoft,
Google, and Facebook, none of them use Timely or DCQCN in production.
Microsoft may be using RDMA for storage-like workloads and relying heavily on
isolating that traffic, but nothing outside that (?). I could be wrong.

3) There definitely is a need to distribute the load unless you are assuming
that a single core can "process" the data. That may work for key-value store
workloads, but what percentage of the workloads in a DC have that
characteristic? You say there are tons of communication-intensive
applications; care to name a few? I can think of KV stores. Maybe machine
learning workloads, but the computational model is very different there and
you rely on tailored ASICs (?). What else? Big data workloads aren't
bottlenecked by network bandwidth.

4) I won't dive into the CC/BDP discussion because it's very hard to judge
without actually deploying it. Sure, a lot of older work made a lot of claims
about how X and Y are better for Z and W, but once people tested them they
would fall flat for various reasons.

