
BSD socket API revamp - rumcajz
https://raw.githubusercontent.com/sustrik/dsock/master/rfc/sock-api-revamp-01.txt
======
cafxx
Am I missing something? It seems to propose ditching all non-blocking APIs
under the assumption that "runtimes" will provide lightweight concurrency. But
AFAIK most (all?) runtimes that provide lightweight concurrency rely
extensively on non-blocking network APIs under the hood - including Golang,
which is even cited as an example:

    
    
       During the decades since BSD sockets were first introduced the way
       they are used have changed significantly.  While in the beginning the
       user was supposed to fork a new process for each connection and do
       all the work using simple blocking calls nowadays they are expected
       to keep a pool of connections, check them via functions like poll()
       or kqueue() and dispatch any work to be done to one of the worker
       threads in a thread pool.  In other words, user is supposed to do
       both network and CPU scheduling. 
       [...]
       To address this problem, this memo assumes that there already exists
       an efficient concurrency implementation where forking a new
       lightweight process takes at most hundreds of nanoseconds and context
       switch takes tens of nanoseconds.  Note that there are already such
       concurrency systems deployed in the wild.  One well-known example are
       Golang's goroutines but there are others available as well.
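
For reference, the style the memo assumes looks roughly like the sketch below.
It is written against libdill's API (tcp_listen()/tcp_accept()/brecv()/bsend(),
go() for lightweight processes, -1 meaning "no deadline") purely as an
illustration of the model, not code from the proposal itself:

    #include <libdill.h>

    /* One lightweight process per connection; each call blocks only
       this coroutine, the runtime schedules everything else. */
    coroutine void handle(int s) {
        char buf[256];
        /* brecv() blocks until exactly sizeof(buf) bytes arrive. */
        if(brecv(s, buf, sizeof(buf), -1) == 0)
            bsend(s, buf, sizeof(buf), -1);      /* echo it back */
        tcp_close(s, -1);
    }

    int main(void) {
        struct ipaddr addr;
        ipaddr_local(&addr, NULL, 5555, 0);
        int ls = tcp_listen(&addr, 10);
        while(1) {
            int s = tcp_accept(ls, NULL, -1);
            go(handle(s));   /* coroutine handle dropped for brevity */
        }
    }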

~~~
tmyklebu
I'm not sure we have a sufficiently efficient concurrency mechanism that
doesn't simply give the user control over scheduling decisions.

Depending on how many tens we're talking about, those tens of nanoseconds for
a context switch might themselves blow your time budget even on ordinary
hardware. 10 gigabits per second means one byte is about 800ps long, so a
64-byte ethernet frame (plus the gunk before and after) is about 7 tens of
nanoseconds. To keep up with this traffic pattern, we need to be able to
process packets at least this quickly.

A goroutine context switch plus one unbuffered channel read and one unbuffered
channel write seems to take about 18 tens of nanoseconds on my machine when
Go's runtime is made to use the same thread for everything. (I'm not sure how
to better isolate context-switch overhead in Go, but a context switch, channel
read, and channel write per packet does seem to be the anticipated use case.)

~~~
rumcajz
The question to ask is: Can you do better?

Say you use a state machine instead of using coroutines. Will one state
transition of the machine be faster than one context switch?

180ns can definitely be improved on (with the C implementation in the project
it's more likely to be 60ns or so), but there's no fundamental reason why one
approach should be faster than the other, given that both are doing basically
the same thing, albeit at different layers of the stack.
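
For the curious, here is a rough sketch of how one might measure this with
libdill's channels (chmake()/chsend()/chrecv(); now() returns milliseconds,
hence the large iteration count). It is an analogue of the Go experiment
above, not a rigorous benchmark:

    #include <libdill.h>
    #include <stdio.h>

    #define ITERS 1000000

    /* Echo every message back, so each iteration forces two context
       switches: main -> echoer -> main. */
    coroutine void echoer(int ch) {
        int val;
        while(chrecv(ch, &val, sizeof(val), -1) == 0)
            chsend(ch, &val, sizeof(val), -1);
    }

    int main(void) {
        int ch[2];
        chmake(ch);
        int h = go(echoer(ch[1]));
        int val = 42;
        int64_t start = now();                /* milliseconds */
        for(int i = 0; i != ITERS; ++i) {
            chsend(ch[0], &val, sizeof(val), -1);
            chrecv(ch[0], &val, sizeof(val), -1);
        }
        /* One round trip = one send + one recv + two switches. */
        printf("%f ns per round trip\n",
            (now() - start) * 1000000.0 / ITERS);
        hclose(h);
        hclose(ch[0]);
        hclose(ch[1]);
        return 0;
    }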

~~~
tmyklebu
I'm not clear on why coroutines or an explicit state machine are necessary to
implement network protocols. We have been doing it for decades by writing
straightforward imperative C code and for years by writing C++ template
classes that are templated on the downstream type (CRTP). Both of these
approaches result in near-zero overhead (computers are good at fetching and
decoding consecutive instructions) or even negative apparent overhead (if the
compiler sees caller and callee, inlining and dead code elimination can
potentially eliminate lower-level protocol handling irrelevant to the higher-
level protocol).
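
As a contrived illustration (all names made up): with direct calls the
"layering" is just functions calling functions, and a compiler that sees both
sides can flatten it:

    /* Lower layer: framing. Upper layer: message handling.
       No scheduler, no saved state -- just a call. */
    static inline int handle_message(const char *msg, int len) {
        return len > 0 && msg[0] == 'H';     /* trivial "protocol" */
    }

    static inline int handle_frame(const char *buf, int len) {
        if(len < 2) return -1;               /* 2-byte length prefix */
        int body = (unsigned char)buf[0] << 8 | (unsigned char)buf[1];
        if(body > len - 2) return -1;        /* incomplete frame */
        /* The compiler can inline this call and drop whatever the
           upper layer never looks at. */
        return handle_message(buf + 2, body);
    }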

I would guess that transforming coroutine-based code into state machines
hampers performance over the long term by scaring the programmer away from
wanting to touch it. However, I would also guess that the direct cost of a
coroutine context switch (save and restore all callee-save registers and stack
pointer then indirect-jump somewhere unpredictable) outweighs the direct cost
of a state machine transition (update some variable).
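
That is, something like this made-up fragment, where "resuming" is a switch on
an enum and a transition really is just an assignment:

    enum state { READ_LEN, READ_BODY };

    struct conn {
        enum state st;   /* the "program counter", made explicit */
        int need;        /* bytes still owed to the current state */
    };

    /* Feed one byte. A transition is a variable update, not a
       register save/restore plus an unpredictable indirect jump. */
    static void on_byte(struct conn *c, unsigned char b) {
        switch(c->st) {
        case READ_LEN:
            c->need = b;
            c->st = READ_BODY;    /* the entire "context switch" */
            break;
        case READ_BODY:
            if(--c->need == 0)
                c->st = READ_LEN;
            break;
        }
    }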

I do not want to pass judgment on the approach beyond saying that its
attendant overhead is too high for some workloads.

~~~
rumcajz
Ok, but doesn't "straight imperative code" rely on a blocking API underneath
it? If it does not, the implementation has to store the current state whenever
it cannot proceed immediately, then restore it later -- which is a context
switch.

As for state machine vs. coroutine performance, agreed. But the cost is
somewhere around 10-15ns per context switch, which may be worth paying to get
maintainable code.

~~~
tmyklebu
You can use the usual BSD socket API (in which case the context switches
to/from the kernel to call read() put the contemplated workload out of reach)
or any number of vendor-specific userland network stacks. There is no need to
context switch. (Function calls and function returns are not context
switches.)

------
shabbyrobe
Didn't notice this on my first read, but this is by Martin Sustrik. Sustrik is
a former ZeroMQ contributor and, more recently, author of the abandoned zmq
alternative nanomsg [1], as well as the libmill [2] and libdill [3]
lightweight concurrency libraries.

    
    
      [1] http://nanomsg.org/
      [2] http://libmill.org/
      [3] http://libdill.org/

~~~
jey
Is nanomsg truly abandoned? I was hoping it was simply converging toward
stability.

~~~
shabbyrobe
Hmm, true, my language was a bit ambiguous. I meant it was abandoned by
Sustrik [1], even if some of the community are doing the best they can to
keep it alive.

    
    
      [1] http://www.freelists.org/post/nanomsg/New-fork-was-Re-Moving-forward

~~~
jey
That seems a bit outdated; it looks like the main nanomsg repo has regular
commits from gdamore now [1], so maybe the fork was resolved?

    
    
      [1] https://github.com/nanomsg/nanomsg/commits/master

~~~
shabbyrobe
Yep, it was eventually resolved midway through 2016: D'Amore stepped up, then
stepped down, then stepped up again. The link I cited is not a commentary on
the current status of nanomsg; it is more to illustrate Sustrik's role in the
situation. Since this discussion is ultimately about Sustrik's IETF proposal,
it is worth considering his relationship with previous similar (zmq/nanomsg)
or related (libmill/libdill) projects that have borne his name.

------
Animats
Why is this being proposed to the IETF? It's not an issue visible at the wire
protocol level. It doesn't affect interoperability. Sockets belong to the
POSIX spec, and to some extent to language specs.

~~~
luchs
There are already other RFCs regarding the socket API, such as RFC3493 (IPv6
socket API).

~~~
dfox
There are even RFCs (RFC 2783 comes to mind) that specify OS-level APIs that
are only tangentially related to networking.

------
comex
Very interesting idea overall. I've long thought the current segregation of
network protocols between userland and the kernel is rather arbitrary and
inflexible - why should reliability (TCP) be in the kernel but security (TLS)
be in userland?

I don't think the idea of requiring Go-style lightweight threading is viable.
Lightweight threading has many nice properties, but also inherently requires
more expensive operations to deal with split stacks, and tends to use more
memory than a manual approach. It's also rather poorly supported in general by
existing languages and OSes (other than Go), while any new OS-level socket API
should be as universally accessible as the existing one. In particular, _many_
scripting-ish languages either have no concept of threading at all or only
support isolated threads whose communication is limited to relatively slow
message passing. In theory the language implementations could write a C shim
to expose a non-blocking interface around blocking underlying operations -
this is what libuv does today for file I/O, for instance - but the result
would be a lot of unnecessary overhead.

~~~
tptacek
How would a design with TCP done entirely in userland work? A given socket
address might in one instant belong to uid 10, and in the next, after uid=10's
connection closes, to uid 20.

(I'm not challenging you, just asking what you think the semantics should look
like).

Worth adding for everyone else, though I'm sure you know: high speed network
implementations already pull a lot of this stuff into userland.

~~~
comex
There are a few different ways to answer that question.

First, I believe that for most applications, it's not that TCP should be in
userland; rather, TLS should be in the kernel, or otherwise system managed.
See the wall of text I just replied to someone else with.

Second, if you _do_ want to customize the transport layer, which is of course
fairly common today - multiplayer games often go for some custom "reliable
UDP" protocol that drops ordering guarantees to improve latency, and then
there's Chromium with QUIC, and uTorrent µTP, and WebRTC... well, the reality
of networking today is that that has to be done on top of UDP rather than IP
directly. I doubt there's much point writing a new API for TCP itself that
exposes all the bells and whistles, because the design of TCP is dated.
Rather, there should be more standardized APIs for applications to more easily
use replacements for TCP that run over UDP. Ideally I should be able to switch
from TCP to QUIC just by changing a few lines of code.
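
Something like the following, say, where tcp_connect() follows the proposal's
naming and quic_connect() is entirely hypothetical; if both return a handle
speaking the same bytestream interface, only the setup line changes:

    /* Hypothetical: both constructors yield a handle that speaks the
       same bytestream interface (bsend/brecv in the proposal). */
    int s = tcp_connect(&addr, -1);
    /* int s = quic_connect(&addr, -1);     <- the only changed line */
    bsend(s, "GET / HTTP/1.1\r\n\r\n", 18, -1);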

(Edit: By the way, if we could redo networking protocols from scratch, I think
ports ought to exist at the IP layer rather than being replicated in TCP and
UDP, and then UDP could be abolished entirely. But we can't.)

Actually, that ties into high performance networking too. As you know, those
alternate stacks require userland to talk directly or near-directly to
hardware, meaning you just have to give up on different ports belonging to
different applications. But beyond that, one thing holding those stacks back
is that they require a lot of custom code and don't work with existing
applications. With a standardized networking API that supported userspace
plugins, it could be possible to add "DPDK+rumptcpip" as an option next to
"kernel TCP" and "QUIC", and configure any random process to use it.

~~~
zrm
> I doubt there's much point writing a new API for TCP itself that exposes all
> the bells and whistles, because the design of TCP is dated. Rather, there
> should be more standardized APIs for applications to more easily use
> replacements for TCP that run over UDP.

The advantage TCP has over UDP is that middleboxes know what TCP FIN is and
so are willing to use much longer timeouts for TCP sessions. For example, the
default Linux connection tracking timeout for established TCP connections is
five days, but for UDP streams it's three minutes.

So if you need a long-lived session to receive event-based messages your
choices are to use UDP with a mapped port using NAT-PMP or PCP (ideal but not
always available), use UDP with frequent keepalives (expensive), or use TCP.

Being able to do TCP in userspace would be very useful for any VPN-like thing
because you could get the long timeouts but still deliver packets immediately
even if an earlier one was lost, and avoid the TCP-over-TCP performance
degradation by deferring congestion control to the inner packets.

------
HeadlessChild
OT: How are text files like this (or like any RFC) written? Is there a
standard format, and is it done in a specific way?

EDIT: Found it! There is actually an RFC for that:
https://tools.ietf.org/html/rfc7322

~~~
hueving
So RFCs are self-hosting? ;)

------
eschaton
I'd like to see some comparison with the STREAMS layered/stackable API from
SVR4 which let a developer do just the sort of layering this API proposes.
What makes it better than STREAMS, does it share the same pitfalls, etc.?

~~~
yuhong
Also Plan 9.

------
astrange
There doesn't seem to be much of anything interesting here. I do think there's
a lot to consider in a new system design, but the fact (brought up at the
start) that you can't send any new protocol over the internet, and can't send
anything but HTTP through half of it, really limits you.

If I were designing an API, the RX side would be asynchronous and batch
multiple connections, and the TX side would let you assemble your own TCP
packets. This rules out ideas like sendfile() which IMO became nonstarters
when we moved to HTTPS.

And obviously the accidental complexity, like fcntl vs. setsockopt,
bind/connect, shutdown/close, and multithreading+signals, has to go somehow.
It doesn't really matter though; I can't imagine any major world crises are
being caused by BSD sockets.

------
johnsmith21006
A very long time ago, coming from VMS, a big reason I preferred BSD over
Sys V was non-blocking sockets. That was over 20 years ago, but now they're
going away?

------
luchs
Something this API gets right is having a unified interface for both IPv4 and
IPv6. With the sockets API, you have to pick one of them, and changing isn't
easy as the constants and structures are all named differently.

While it's possible to use IPv6 sockets for IPv4 connections, this doesn't
cover all use cases. For example, you can't do IPv4 broadcast with an IPv6
socket. Additionally, as most examples are written for the classic IPv4 API,
that's what everyone uses by default. Later on, when people complain about
missing IPv6 support, they are turned down because it's a ton of work to
change.

~~~
dfox
For the majority of applications, supporting IPv6 boils down to using
getaddrinfo()/getnameinfo() instead of gethostbyname()/gethostbyaddr(), which
results in code that supports both IPv4 and IPv6 and is simpler than the
IPv4-only original.
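
For example, a protocol-agnostic client looks like this (standard POSIX;
error handling abbreviated, dial() is just an illustrative name):

    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Connect to host:port over whichever of IPv4/IPv6 works,
       without naming either family anywhere in the code. */
    int dial(const char *host, const char *port) {
        struct addrinfo hints, *res, *ai;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;        /* v4 or v6 */
        hints.ai_socktype = SOCK_STREAM;
        if(getaddrinfo(host, port, &hints, &res) != 0)
            return -1;
        int fd = -1;
        for(ai = res; ai; ai = ai->ai_next) {
            fd = socket(ai->ai_family, ai->ai_socktype,
                        ai->ai_protocol);
            if(fd < 0) continue;
            if(connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
                break;                      /* first one that works */
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;
    }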

~~~
rwmj
The catch is you also have to be prepared to listen on multiple sockets. There
are lots of servers out there which get this wrong, including qemu-nbd.

For reference here's my implementation of this which (I think) gets it right:
https://github.com/libguestfs/nbdkit/blob/master/src/sockets.c#L105
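
The shape of a correct version, roughly: one listening fd per getaddrinfo()
result, then poll() across all of them (a sketch, error handling omitted):

    #include <netdb.h>
    #include <netinet/in.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* A dual-stack server must be prepared to listen on several
       sockets, typically one for IPv4 and one for IPv6. */
    int listen_all(const char *port, struct pollfd *pfds, int max) {
        struct addrinfo hints, *res, *ai;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_flags = AI_PASSIVE;
        if(getaddrinfo(NULL, port, &hints, &res) != 0)
            return -1;
        int n = 0;
        for(ai = res; ai && n < max; ai = ai->ai_next) {
            int fd = socket(ai->ai_family, ai->ai_socktype,
                            ai->ai_protocol);
            if(fd < 0) continue;
            if(ai->ai_family == AF_INET6) {
                /* Keep the v6 socket v6-only so that the separate
                   v4 bind doesn't clash with it. */
                int on = 1;
                setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
                           &on, sizeof(on));
            }
            if(bind(fd, ai->ai_addr, ai->ai_addrlen) == 0 &&
               listen(fd, 16) == 0) {
                pfds[n].fd = fd;
                pfds[n].events = POLLIN;
                n++;
            } else close(fd);
        }
        freeaddrinfo(res);
        return n;   /* caller poll()s all n and accept()s on POLLIN */
    }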

------
bsder
The problem with proposals like this is that they paper over the fundamental
problem:

Time is an input variable and really should be part of the API.

The problem is that sometimes you want the end user to have control over time
and sometimes you might not want the end user to have control over time.

If I'm on Linux, I certainly should not be able to manipulate the "time" of
the communication stack in order to send faster/more aggressively than I
should be allowed.

If I'm on embedded, I _really_ want to be able to manipulate the "time" of the
communication stack depending upon what I am doing.

~~~
jstimpfle
Could you clarify? I don't get where you are heading with that.

At least the concept of a "stream" isn't really linked with time. It's just a
sequence of bytes.

~~~
bsder
The issue is that abstractions "leak".

Let's take your "stream", for instance.

It's a sequence of bytes.

But is it a "maximum bandwidth" stream of bytes (gigantic file transfer) or is
it a "minimum latency" (audio packet) stream of bytes?

If I know or can control the "time" the stack is working toward, the stack
doesn't have to know the difference between those two.

In addition, when on embedded, I quite often want a stack API like
"foo_init(), foo_deinit(), foo_queue_send(), foo_queue_action(), etc.", which
all update internal state.

And then "foo_make_incremental_progress(foo_state, foo_now, ...)", where
_ONLY_ that incremental-progress function _actually_ carries out actions like
reading hardware, writing hardware, timing out, etc. Now I can use whatever
concurrency construct I like without the stack getting in the way.
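
A minimal skeleton of that shape, keeping the foo_* names from above (the
details are invented for illustration):

    #include <stdint.h>

    struct foo_state {           /* all protocol state lives here */
        uint64_t retransmit_at;  /* in caller-supplied time units */
        int pending;             /* queued but unsent messages */
    };

    void foo_init(struct foo_state *st) {
        st->retransmit_at = 0;
        st->pending = 0;
    }

    void foo_queue_send(struct foo_state *st) {
        st->pending++;           /* updates state, touches nothing */
    }

    /* The ONLY function that touches hardware or observes time.
       Because time is an input, tests can replay any schedule. */
    void foo_make_incremental_progress(struct foo_state *st,
                                       uint64_t foo_now) {
        if(st->pending > 0) {
            /* write exactly one frame to the wire here */
            st->pending--;
            st->retransmit_at = foo_now + 1000;  /* arbitrary */
        } else if(st->retransmit_at && foo_now >= st->retransmit_at) {
            /* handle exactly one timeout here */
            st->retransmit_at = 0;
        }
    }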

Now, that isn't necessarily the fastest approach, as the stack needs to be
structured such that it only carries out one "action" at a time on each
incremental call. However, it's very flexible. It also has the wonderful
property of being _repeatable_ and _testable_ -- something which TCP stacks
are notoriously resistant to.

------
jheriko
> To make the API less tedious to use, short protocol name, e.g. "ws", SHOULD
> be preferred to the long name, e.g. "websockets".

Well that's a classic dumb mistake.

Programming isn't about typing.

Why not save everyone the effort of having to learn pointless things or look
things up? There is no ambiguity with "websockets", whereas "ws" describes
nothing without context.

~~~
rumcajz
Another example: use "tcp_connect" instead of
"transmission_control_protocol_connect".

~~~
jheriko
I'd make an exception for that, but it has nothing to do with the length of
the function name.

TCP is much better known as an abbreviation than in long form (most people
who use it don't actually know what it stands for).

Communication is about getting the intent across and not obscuring it, after
all...

~~~
rumcajz
Yet another example: "dccp_send" or "datagram_congestion_control_protocol_send"?

------
dbmikus
Specifying a deadline timestamp instead of a countdown could run into problems
if a given machine has an improperly configured clock.

I think the risk of a machine having an incorrect clock is greater than the
risk of a library implementer not recalculating the countdown correctly.
Whenever possible, I prefer to stay away from timestamps. They're a fickle
thing.

~~~
slrz
Why would it matter? The deadline would be specified in terms of a monotonic
clock that's unaffected by wall time adjustments. So even if some crude NTP
implementation steps your clock by seconds (or a leap second is inserted), it
shouldn't matter for the deadline calculation.
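
Concretely: compute the deadline once from CLOCK_MONOTONIC and rederive the
remaining countdown at every blocking point (the helper names here are mine):

    #include <stdint.h>
    #include <time.h>

    /* Absolute deadline = now (monotonic) + timeout, in ms.
       NTP stepping the wall clock cannot move this. */
    static int64_t deadline_ms(int64_t timeout_ms) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000
               + timeout_ms;
    }

    /* Countdown rederived from the same deadline before each call,
       so time spent in earlier calls is accounted for exactly. */
    static int remaining_ms(int64_t deadline) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        int64_t left = deadline - ((int64_t)ts.tv_sec * 1000
                                   + ts.tv_nsec / 1000000);
        return left > 0 ? (int)left : 0;
    }

    /* Usage: poll(fds, n, remaining_ms(d)) at every retry point. */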

~~~
Bino
I think most of my timeout/deadlines relative to the monotonic time. Hence
specifying the absolute monotonic time would just involve yet another syscall
(to query the time). Is there a really good argument to specify a absolute
monotonic time?

------
notaplumber
To clarify, none of the BSD projects have anything to do with this "revamp"
not-a-real-RFC.

~~~
rumcajz
"BSD sockets" is the name of the existing socket API.

------
Ono-Sendai
Sounds good. Please add a flushHint() call as well:
http://www.forwardscattering.org/post/3

~~~
rumcajz
The interesting question here is: Once we have flush() function does that
means that protocols are allowed to delay data indefinitely if flush() is not
called?

~~~
Ono-Sendai
Yeah, that is an interesting question. I think if the provider of the socket
interface could be sure that the client code using the interface knew about
flush(), then it wouldn't need to set timers or anything like that to flush
after X ms. However, there wouldn't be any point, I think, in buffering up
more than e.g. an MTU's worth of bytes, so it wouldn't be delaying
indefinitely.

------
dcow
There's a typo in section 3.4, paragraph 4: s/bystream/bytestream -- unless
it's an attempt to be clever.

~~~
rumcajz
Fixed.

