
MegaPipe: A New Programming Interface for Scalable Network I/O - aespinoza
http://highscalability.com/blog/2013/6/19/paper-megapipe-a-new-programming-interface-for-scalable-netw.html
======
RyanZAG
This is definitely the correct solution to the problem with BSD sockets -
nobody actually treats network connections as files. All of that file overhead
and Unix integration is entirely wasted. All of the advances that have been
made in message passing systems/buses have passed over BSD sockets because
everyone is of the opinion that BSD sockets are already perfect. Most of that
view comes from the very simple fact that everything uses BSD sockets - even
Windows.

Hopefully this research spurs on others to create new implementations, because
I bet that there are many improvements possible in the basic socket idea.

~~~
wittrock
The real problem is the entrenched legacy software that uses BSD sockets. I
don't want to even imagine the cost of rewriting all of that to use a
different networking paradigm. POSIX certainly isn't the best way to do
things, especially with the move to the cloud, but much of today's high-
performance software today does fine with sockets. There are absolutely some
hacks that are used to get around some inadequacies, but BSD sockets work, by
and large.

~~~
RyanZAG
Agreed there - maybe one option would be to flag a specific port as
non-BSD? When the kernel sees traffic for that port, some short-circuit
logic could trigger and hand off to a different socket implementation. That
would allow legacy code to run fine, and let apps set a special flag when
binding a socket to get direct access. The routing part of TCP/IP
happens before BSD sockets are hit, so this should do an end-run around the
BSD socket overhead.

Then you could have your system running as normal, while allowing your HTTP
server special access to network I/O. Any kernel hackers around who can
comment?
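
A purely hypothetical sketch of what that opt-in might look like at bind time
(the SO_FASTPATH option below does not exist; it is made up purely to
illustrate the suggestion):

    /* Hypothetical: opt a listening socket out of the normal BSD-socket
     * path at bind time, so the kernel could short-circuit to a lighter
     * implementation for that port. SO_FASTPATH is NOT a real option. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #define SO_FASTPATH 0x8000   /* made-up option name and value */

    int listen_fastpath(unsigned short port)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);

        int one = 1;
        /* Hypothetical flag: "skip the regular socket bookkeeping for
         * this port and give the app more direct access". */
        setsockopt(lfd, SOL_SOCKET, SO_FASTPATH, &one, sizeof(one));

        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(port),
            .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
        };
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, SOMAXCONN);
        return lfd;
    }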

~~~
zhemao
I'm not certain, but from the article it seems they implemented MegaPipe
to run alongside the BSD API. You do have to keep the BSD sockets API
around alongside the new implementation - not only because there is a lot of
existing code that depends on it, but also because abstracting a socket as a
file is actually useful in many cases.
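
For what it's worth, a minimal sketch of why the file abstraction is handy: a
connected TCP socket is just a file descriptor, so generic fd-based code
(poll, read, write, dup) works on it unchanged (error handling trimmed):

    #include <poll.h>
    #include <unistd.h>

    /* Echo one chunk of data back to the peer using only generic fd
     * calls - nothing here knows it is talking to a socket. */
    ssize_t echo_once(int connfd)
    {
        char buf[4096];

        /* Wait for readability exactly as you would for a pipe or tty. */
        struct pollfd pfd = { .fd = connfd, .events = POLLIN };
        if (poll(&pfd, 1, -1) < 0)
            return -1;

        /* Plain read(2)/write(2); no socket-specific calls needed. */
        ssize_t n = read(connfd, buf, sizeof(buf));
        if (n > 0)
            n = write(connfd, buf, (size_t)n);
        return n;
    }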

------
piscisaureus
The way they do mp_read() is not right IMO.

If you have to pass a target buffer every time you call mp_read, that ends up
using a lot of memory in a scenario where there are a lot of connections open
but traffic is sparse. Since data might arrive on any open connection, you'll
have to keep at least one "pending" mp_read operation for every socket at all
times. This quickly adds up in terms of memory usage: if you were to read into
64 KiB buffers (that's somewhat arbitrary - but using smaller buffers tends
to be very bad for throughput) you'd need 625 MiB of memory for just these
buffers alone to handle 10,000 connections. In a C10M scenario
([http://c10m.robertgraham.com/p/blog-page.html](http://c10m.robertgraham.com/p/blog-page.html))
the same would need a staggering 625 GiB of memory. The good old select model
lets you allocate these read buffers "just in time", right before you read
data into them.

This was a major issue when implementing libuv (node.js) for Windows. Windows
overlapped I/O has exactly the same problem (and so does that new RIO thing).

A better model would be to have a "buffer pool" that is kept reasonably full
by the user-mode application. The kernel could then take a buffer from the
pool as soon as data actually comes in from the network.
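
For contrast, a minimal sketch of the "just in time" allocation that the
readiness model allows, here with epoll (a scratch buffer is grabbed only
once a connection is actually readable, instead of one pre-posted 64 KiB
buffer per connection; error handling trimmed):

    #include <stdlib.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    #define MAX_EVENTS 64
    #define BUF_SIZE   (64 * 1024)

    void event_loop(int epfd)
    {
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;

                /* Allocate (or take from a small pool) only now that
                 * data is known to be waiting on this connection, so
                 * idle connections hold no read buffers at all. */
                char *buf = malloc(BUF_SIZE);
                ssize_t r = read(fd, buf, BUF_SIZE);
                if (r > 0) {
                    /* ... process r bytes ... */
                }
                free(buf);
            }
        }
    }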

~~~
zzzcpan
You can probably keep buffers very small with MegaPipe, since syscalls are
batched together, so they can't impact throughput very much in the case of many
concurrent connections. You still get a lot of memory taken up in the kernel per
syscall, but spread across many connections, if I understood MegaPipe
correctly.

~~~
piscisaureus
Well sure, you could do that. But I don't believe that reading, say, 50 bytes at
a time is very efficient even when you factor in automatic syscall batching.

But this is all speculation. I didn't see any consideration for this problem in
the paper at all, so I don't have the impression it was on the researchers'
radar.

------
bcoates
It sounds like they're trying to solve the same problems that I/O Completion
Ports and Registered I/O do, in many of the same ways. It would be nice to
see a head-to-head comparison of CPU overhead and scalability between the two
models.

------
cpeterso
How does MegaPipe compare to Van Jacobson's "netchannels" for Linux?

[https://lwn.net/Articles/192767/](https://lwn.net/Articles/192767/)

------
contingencies
This matters in so few applications, and they already have alternatives
(mostly message queue systems). Bulk data transfer is rarely operationally
time-sensitive. To justify investing time in investigating this, maybe
we could see some benchmarks versus popular MQ solutions _and_ traditional
sockets for different messaging scenarios. Also factor in the various TCP
optimizations, etc. (Actually, something like this should exist anyway... it
would be a great resource.)

~~~
RyanZAG
As far as I can tell, the main use for this MegaPipe system would be
providing a better API for webservers like Apache and Nginx to plug into, to
bypass some of the overhead inside BSD sockets. Most popular MQ solutions also
use TCP/IP and hence BSD sockets, so they're not alternatives - they would
actually replace their BSD socket API calls with MegaPipe API calls and
receive performance benefits (in situations with very large numbers of
messages/recipients).

I think either you have misunderstood the article, or I have.

~~~
lcampbell
For static file serving, Nginx can bypass some of the BSD socket overhead with
the sendfile (or sendfile-like) syscall, though it is disabled by default. On
FreeBSD, the sendfile syscall accepts not only the fd to send but also optional
headers/trailers, so that you don't need to call write(2) at all; this is not
true on either Linux or Windows (AFAIK).
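
A rough sketch of the two call shapes (not from the paper; error handling
trimmed, see the sendfile(2) man pages for the full details):

    #if defined(__FreeBSD__)
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* FreeBSD: the response headers ride along in sf_hdtr, so the whole
     * response goes out without a separate write(2). */
    ssize_t send_response(int sock, int filefd, off_t len,
                          struct iovec *hdr, int hdr_cnt)
    {
        struct sf_hdtr hdtr = { .headers = hdr, .hdr_cnt = hdr_cnt };
        off_t sent = 0;
        if (sendfile(filefd, sock, 0, (size_t)len, &hdtr, &sent, 0) < 0)
            return -1;
        return (ssize_t)sent;
    }

    #else /* Linux */
    #include <sys/sendfile.h>
    #include <sys/uio.h>

    /* Linux: sendfile(2) only moves file bytes; headers still need their
     * own writev(2) (or TCP_CORK) first. */
    ssize_t send_response(int sock, int filefd, off_t len,
                          struct iovec *hdr, int hdr_cnt)
    {
        if (writev(sock, hdr, hdr_cnt) < 0)
            return -1;
        off_t off = 0;
        return sendfile(sock, filefd, &off, (size_t)len);
    }
    #endif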

The whitepaper's macrobenchmarks make no mention of sendfile; I presume their
Nginx tests run without sendfile enabled (since enabling it would defeat the
point of the benchmark by skipping all of the write(2) calls). The paper does
not appear to detail the configuration used in the benchmarks. I believe that
the performance benefit of Nginx+MegaPipe, when compared to an Nginx server
configured to use sendfile, is much, much less than +75% throughput.

That said, sendfile has very narrow uses (it's effectively only for sending
static files) -- I wish the paper benchmarked Nginx against Nginx+MegaPipe as
a reverse proxy rather than a static file server.

~~~
justincormack
Linux now has splice, which supports more generic uses (and tee). You can
feed one network stream into another, add headers, etc. These calls are not yet
widely used, but I believe there are some decent performance figures.
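
A minimal sketch of that kind of socket-to-socket forwarding with splice(2)
(assuming Linux; a real proxy would keep the pipe around instead of creating
one per chunk, and handle errors):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t forward_chunk(int src_sock, int dst_sock, size_t max)
    {
        int p[2];
        if (pipe(p) < 0)
            return -1;

        /* Pull bytes from the source socket into the pipe... */
        ssize_t n = splice(src_sock, NULL, p[1], NULL, max,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n > 0)
            /* ...and push them straight on to the destination socket,
             * all inside the kernel, with no user-space copy. */
            n = splice(p[0], NULL, dst_sock, NULL, (size_t)n,
                       SPLICE_F_MOVE | SPLICE_F_MORE);

        close(p[0]);
        close(p[1]);
        return n;
    }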

But on the other hand, user-space networking is also performing well.

------
riobard
How does it compare with ZeroMQ? It looks to me like they overlap on the
message-batching part.

------
cobrabyte
I hate to say it, but with 'Mega' in the title, I figured this was another Kim
Dotcom offering. Not that that would be a bad thing, but that's instantly where
my mind went.

------
escaped_hn
How does this compare to Windows overlapped I/O completion ports?

