
Accidentally nonblocking - adamnemecek
http://www.tedunangst.com/flak/post/accidentally-nonblocking
======
jacquesm
I don't know how many times I've looked at the output of C preprocessors and
compilers to figure out what the heck was going on. One choice example of this
was a pretty complex system that managed to call a top level routine from
somewhere deep down in the stack if an error occurred (which promptly led to a
stack overflow that would only very rarely trigger).

The 'nonblocking' here is just a symptom of a much larger problem: abstraction
is a great way to build complex stuff out of simple parts but it is also a
great way to introduce all kinds of effects that you weren't aiming for in the
first place and this particular one is easier than most to catch. You can find
the same kind of problem at all levels of software systems, all the way up to
the top, where "dosomethingcomplex(), and if it fails, dosomethingcomplex()
again" is the cure.

Writing easy-to-understand code is a big key to solving this kind of problem.
I've always tried (though probably never fully succeeded) to write code in the
simplest way possible; as soon as I find myself reaching for something clever,
I feel it is a mistake. Circumstance (some idiot requirement, such as having
to use the wrong tools for the job) or genuine need may occasionally justify
transgressing that rule, but if you do it with any regularity at all (and
without documenting the particular exception and the motivation for going
outside the advised lines) you are almost certainly going to regret it. (Or
your successor may one day decide to pay you a house-call with a blunt
object...)

~~~
catnaroek
> abstraction is a great way to build complex stuff out of simple parts but it
> is also a great way to introduce all kinds of effects that you weren't
> aiming for in the first place and this particular one is easier than most to
> catch.

This isn't a problem when abstractions don't leak. Polishing abstractions
until they don't leak is super hard, though.

~~~
ArkyBeagle
Ignoring all that - in order to abstract something, you have to either 1) make
assumptions or 2) establish a method for the configuration of those
assumptions.

We all (naively) want it just to be handled for us. But sometimes that doesn't
work out. We are the ones who have to learn that; the Second Law of
Thermodynamics (which is the lynchpin of the Two Generals Problem) is unlikely
to change to accommodate our foolishness :)

As I understand you, "polishing abstractions until they don't leak" is
equivalent to "doing the whole job, not just part of it." Economically, this
is a pain point for the people we work for. It sounds expensive. The
accounting for it is very difficult. "Can't you just make it _work_?" is not
unreasonable.

Narrow is the way.

~~~
catnaroek
> in order to abstract something, you have to either 1) make assumptions or 2)
> establish a method for the configuration of those assumptions.

Most importantly, you need to _enforce_ the assumptions. The lack of
enforcement is where abstraction leaks come from.

> the Second Law of Thermodynamics (which is the lynchpin of the Two Generals
> Problem) is unlikely to change to accommodate our foolishness :)

The second law of thermodynamics is fundamental to understanding how the
physical world works, but software is a purely logical artifact.

> As I understand you, "polishing abstractions until they don't leak" is
> equivalent to "doing the whole job, not just part of it."

It means redesigning the abstraction until there are no cases uncovered by the
abstraction's enforcement mechanisms.

> Economically, this is a pain point for the people we work for. It sounds
> expensive.

Make no mistake, it _is_ expensive. But dealing with abstraction leaks is even
more expensive.

> The accounting for it is very difficult. "Can't you just make it _work_?" is
> not unreasonable.

It doesn't work if it breaks.

~~~
ArkyBeagle
Enforcement is the entire point. A failed return from a recv() may be an
application problem. It doesn't compress.

> ..software is a purely logical artifact

No. No, sir, it is not. There is no magical unicorn version of communications
in which you can simply assume it all always gets there instantly and in
order. We can get _close_ - severely underutilized Ethernet & 802.11 spoil us
- but nuh uh.

> Make no mistake, it is expensive.

And you wonder why they are like they are :) "you can't afford it, honey." :)

> It doesn't work if it breaks.

Indeed.

~~~
catnaroek
> No. No, sir, it is not. There is no magical unicorn version of
> communications in which you can simply assume it all always gets there
> instantly and in order. We can get close - severely underutilized Ethernet &
> 802.11 spoil us - but nuh uh.

That simply means you want an unimplementable abstraction. (Perfectly reliable
sequential communication over a computer network.) Of course it doesn't make
sense to want impossible things.

> And you wonder why they are like they are :) "you can't afford it, honey."
> :)

This brokenness can't be fixed at the level of business applications.
Languages and standard libraries need to be fixed first.

~~~
ArkyBeagle
I forget what the thing you just did is called, but you've managed to switch
sides. :) I'm the one who _said_ there is no unicorn version etc. ....

You can't fix that in a library. There is a sequence of escalation: failures
are formally checked for, counters are incremented, alarms are sent, actions
are taken...

You may not be interested in the Second Law, but the Second Law is interested
in you.

~~~
catnaroek
> I forget what the thing you just did is called, but you've managed to switch
> sides. :)

I didn't switch sides. I stand by my assertion that software is a purely
logical artifact. The laws of thermodynamics have no bearing on whether
redirecting the control flow to a far-away exception handler (or, even worse,
undefined behavior) is a reasonable way to deal with unforeseen circumstances.

> I'm the one who said there is no unicorn version etc. ....

I'm not talking about unicorns, only about abstractions that don't leak. That
being said, I'll admit that sometimes there are good reasons for using leaky
abstractions. My favorite example of this is garbage collection. The
abstraction is “you can always allocate memory and you don't need to bother
deallocating it”. The second part is tight, because precise collectors
guarantee objects will be reclaimed a bounded number of cycles after they
become unused. But the first part is leaky, because the case “you've exhausted
all memory” is uncovered. The reason why this isn't a problem in practice is
that most programs don't come anywhere near exhausting all available memory,
and, if it ever happens, ultimately the only possible fix is to add more RAM
to the computer.

FWIW, I don't consider TCP a leaky abstraction, because it doesn't promise
that actual communication will take place. It only promises that, _if messages
are received_, they will be received in order by the user of the abstraction.
That being said, most TCP _implementations_ are leaky, as is pretty much
anything written in C.

~~~
ArkyBeagle
Quoth Spolsky: "All nontrivial abstractions are leaky."

This means you still have to deal with it.

~~~
catnaroek
> Quoth Spolsky: "All nontrivial abstractions are leaky."

Well, no, that's wrong. Abstractions can be made tight, but that requires
discipline and hard work.

------
wahern
Lest somebody get the wrong idea from his post, note that he's not arguing to
use poll on sockets that aren't non-blocking (i.e. without the O_NONBLOCK flag
on the open file table entry[1]).

When a socket polls for readiness in Unix, it does not mean that a subsequent
read will succeed. The obvious case is when another thread reads from the
socket before you do. A less obvious case is that some kernels, such as Linux,
implement lazy checksum verification. Linux will wake up any waiting threads
when a packet comes in (including marking an open file table entry as
readable), but the checksum isn't verified until an actual read is attempted.
If the checksum fails, the packet is silently discarded. If the socket isn't
in non-blocking mode, your application will stall until the next packet is
received.
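
A minimal C sketch of the defensive pattern this implies (the function name is
mine; error handling is trimmed): make the socket non-blocking, treat
readiness as a hint, and loop back to poll on EAGAIN instead of stalling.

    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <unistd.h>

    void drain_socket(int fd)
    {
        /* Mark the open file table entry non-blocking so a spurious
         * readiness notification cannot stall the thread. */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        char buf[4096];

        while (poll(&pfd, 1, -1) > 0) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                continue; /* readiness was only a hint; poll again */
            /* ... handle data (n > 0), EOF (n == 0), other errors ... */
        }
    }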

The JRE had (and maybe still has) a bug like this, where it assumed poll meant
that a subsequent read was guaranteed to succeed or fail immediately.

This particular issue is less common today with checksum hardware offloading,
but the correctness and robustness of your software probably shouldn't depend
on particular network chipsets.

Another bug I've seen several times is assuming that a write to a UDP socket
won't block. You can usually get away with this on Linux because the default
buffers are so huge. As with the above issue, it really only shows when your
application (and thus the network) is under significant load.
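
A hedged sketch of guarding against that (names are illustrative; MSG_DONTWAIT
is the Linux/BSD per-call non-blocking flag, not strictly portable):

    #include <errno.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Returns 0 on success, -1 when the send buffer is full and the
     * caller should wait for POLLOUT before retrying. */
    int send_datagram(int fd, const void *buf, size_t len,
                      const struct sockaddr_in *dst)
    {
        ssize_t n = sendto(fd, buf, len, MSG_DONTWAIT,
                           (const struct sockaddr *)dst, sizeof(*dst));
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return -1; /* "UDP writes never block" fails right here */
        return 0;
    }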

One conclusion I draw from this is that while people go to great lengths to
implement a supposedly scalable architecture, most of the time developers
never see the kinds of heavy load that such architectures are designed for. If
they had, they would have discovered these sorts of issues. Fortunately or
unfortunately for me, I discovered both of the above issues the hard way.

[1] If you're wondering why I kept writing "open file table entry" instead of
descriptor, it's because they're not the same thing. And some day I expect a
few CVEs to be issued related to overlooking such distinctions. For example,
on BSDs opening /dev/fd/N duplicates the descriptor, pointing to the same file
table entry, just as dup(2) does. On Linux /dev/fd is a symlink to
/proc/self/fd, and opening a file under /proc/self/fd creates a new file table
entry. In the former case, software setting or unsetting O_NONBLOCK affects
all other references to that entry.

~~~
comex
> When a socket polls for readiness in Unix, it does not mean that a
> subsequent read will succeed.

Yikes, I didn't know Linux did that. That sounds like a serious spec violation
to me. POSIX says:

> POLLIN

> Data other than high-priority data may be read without blocking.

[http://pubs.opengroup.org/onlinepubs/009695399/functions/pol...](http://pubs.opengroup.org/onlinepubs/009695399/functions/poll.html)

It's hard to interpret that other than as a promise not to block. Oh, and the
Linux poll(2) man page doesn't even mention the caveat. The select man page
does (I assume the actual behavior applies to poll too), but here POSIX is
even more explicit:

> A descriptor shall be considered ready for reading when a call to an input
> function with O_NONBLOCK clear would not block, whether or not the function
> would transfer data successfully. (The function might return data, an end-
> of-file indication, or an error other than one indicating that it is
> blocked, and in each of these cases the descriptor shall be considered ready
> for reading.)

------
geofft
Cory Benfield's PyCon talk last week, "Building Protocol Libraries the Right
Way"
([https://www.youtube.com/watch?v=7cC3_jGwl_U](https://www.youtube.com/watch?v=7cC3_jGwl_U)),
makes the argument that a large number of problems can be traced to not
cleanly separating responsibilities of actually physically doing I/O and
making semantic sense of the bytes. His primary worry was about reimplementing
things like HTTP many times, once for each I/O framework (why do Twisted,
Tornado, and asyncio all have their own HTTP implementation?). But it seems
the same problem can be seen here: every single part of the code thinks it
knows how to actually retrieve data from the network, so it interacts with the
network on its own, causing nested polling and similar awkwardness. If every
part of the event-processing code thinks it knows how to do network I/O, you
have many more opportunities for getting network I/O wrong.

If xterm were designed so that e.g. xevents() had only the responsibility of
fetching bytes from the X socket and do_xevents() and everything else had only
the responsibility of handling bytes from a buffer, there would be no
temptation to poll in two different functions. Only one function would even
know that the byte source is a socket; the rest just know about the buffer.
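
A rough C sketch of that shape (parse_one_event is a hypothetical helper, not
xterm's actual API):

    #include <string.h>
    #include <unistd.h>

    static char buf[65536];
    static size_t buflen;

    /* Hypothetical: returns the size of one complete event at the front
     * of the buffer, or 0 if more bytes are needed. */
    size_t parse_one_event(const char *p, size_t n);

    /* The only function that knows the byte source is a socket. */
    void xevents(int xfd)
    {
        ssize_t n = read(xfd, buf + buflen, sizeof(buf) - buflen);
        if (n > 0)
            buflen += (size_t)n;
    }

    /* Pure event handling: consumes the buffer, never touches the socket. */
    void do_xevents(void)
    {
        size_t used;
        while ((used = parse_one_event(buf, buflen)) > 0) {
            /* ... dispatch the event ... */
            buflen -= used;
            memmove(buf, buf + used, buflen);
        }
    }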

~~~
yetihehe
The more I see such problems, the more I like Erlang. Most socket handling
libraries split handling into a protocol layer and an application layer. The
protocol layer ensures there is a full message available and the application
layer handles only full messages. Most of the time it's the simplest and most
natural way to do anything in Erlang.
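
In C terms, that split might look roughly like this (a sketch of 4-byte
length-prefix framing, similar in spirit to Erlang's {packet, 4} socket
option; the callback type is mine):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    typedef void (*msg_handler)(const char *msg, size_t len);

    /* Protocol layer: hands only complete length-prefixed messages to the
     * application callback; returns how many bytes it consumed. */
    size_t deframe(const char *buf, size_t avail, msg_handler handle)
    {
        size_t consumed = 0;
        while (avail - consumed >= 4) {
            uint32_t len;
            memcpy(&len, buf + consumed, 4);
            len = ntohl(len);
            if (avail - consumed - 4 < len)
                break; /* message incomplete: wait for more bytes */
            handle(buf + consumed + 4, len);
            consumed += 4 + len;
        }
        return consumed;
    }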

~~~
jacquesm
Erlang really gets this right. Abstract out all the generic server stuff and
have it coded up by experts, then have the application programmers concentrate
on the application. A bit like programming a plug-in for Apache but then
extrapolated to just about anything you could do with a server. Erlang is a
very interesting ecosystem; the more I play around with it, the more I like it
and the way it is put together. If it had a shallower learning curve it would
put a lot of other ecosystems out of business. But then again, the fact that
it doesn't makes it something of a secret weapon for those shops and
individuals that have managed to really master it.

~~~
raarts
> if it had a shallower learning curve

Elixir is meant to address that.

------
osivertsson
_Imagine for a moment how programs would be different if all polls had
timeouts and all sockets were blocking. For a little while, there’d be some
unpleasant stalls. But these would not be insurmountable problems. With a
little concentration, it’s possible to rearchitect the program with a much
more robust design that neither loses events nor requires speculative
guesses._

Yes please, I'd like that. The code that ends up on my desk would be easier to
understand and refactor.

 _Instead of a proper fix, the developer changes the socket to nonblocking_

Sometimes yes, and sometimes the developer decides to spawn a thread, and now
you have lots of problems...

~~~
curried_haskell
n.da nyouow hlotsof probave. . lems

------
ArkyBeagle
I've seen EAGAIN as well as EBADF errors as a "normal" part of operation
against TCP sockets. I say "normal" because I've only seen EBADF once, and it
was because the client side started talking too early. IOW, when
select()/poll() tells you socket 13 is ready and recv() tells you EBADF, the
socket is just not ready to go yet. Go around again.

The client side grew up against serial ports (yes, those are still a thing),
where you don't have this problem. The owner of the client side was more or
less in incredulous terror when I broached this subject. _Sigh_. So I just
ignored them. _Big sigh_. If it failed, there were retries, so the only cost
was a little delay now and again.

You cannot fragment UDP unless you're prepared to add some method of
sequencing as part of the application protocol. Each UDP PDU needs to be fully
atomic otherwise.
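
A sketch of what that application-level sequencing might carry (field names
are illustrative):

    #include <stdint.h>

    /* Prepended to every fragment so the receiver can reassemble the
     * message, or discard it wholesale if any fragment is lost. */
    struct app_frag_hdr {
        uint32_t msg_id;      /* which application message */
        uint16_t frag_index;  /* this fragment's position */
        uint16_t frag_count;  /* total fragments in the message */
    };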

For cases analogous to SNMP row creation (in which multiple varbinds[1]
determine the outcome), there is the "as if simultaneous" rule as a heuristic
- all PDUs related to creating a row must be cached and only applied when the
row state is set. And sometimes you can configure things to send all varbinds
in one PDU.

[1] A varbind is a triple of the set/get/next/multi operator, the object ID,
and, if applicable, a value as encoded by the SNMP Basic Encoding Rules.

So your little serialization protocol? It suffers all the heartache of a full-
on transactional database processing system.

These things are this way because communications are like that.

------
mark242
This is where I think Scala really, really shines, regardless of whether
you're using Akka or not. Once you've gotten into that mode of using futures,
turning your code from blocking to nonblocking is as simple as never writing
that Await statement, but having your methods return a Future[MyClass] instead
of MyClass.

The funny thing is that writing nonblocking code doesn't have to be as hard
as it looks; you just have to get into the mindset. It's easy to say "well, I
have to have the result coming back from my JDBC/REST/etc. call before I know
what to do with it", but that's not the case at all, especially when you're
working in a strongly-typed environment.

------
janvidar
The problem here is that most libraries need to integrate into some sort of
main loop somehow, and unfortunately there are lots of different ways of doing
the main loop of the application. Some libraries integrate other things which
are not directly poll()-able, but expose the same interface while doing so.

Now you have a problem when trying to combine multiple such libraries. For
example, try using GTK+ and Qt in an application at the same time.

One thing that has always bugged me is the lack of a standardized
(cross-platform) non-blocking DNS lookup facility. This adds to the main-loop
complexity, since you have to poll() certain types of resources, and have to
deal with threads or subprocesses in order to collect DNS results.
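
A common workaround, sketched in C (assumes the read end of a pipe() sits in
the main loop's poll set; error handling omitted; struct and field names are
mine):

    #include <netdb.h>
    #include <pthread.h>
    #include <unistd.h>

    struct dns_req {
        const char *host;
        struct addrinfo *result;
        int notify_fd;            /* write end of a pipe() */
    };

    /* Runs the blocking lookup off the main thread, then wakes the main
     * loop by making the pipe's read end poll() readable. Started with
     * pthread_create(&tid, NULL, dns_worker, &req). */
    static void *dns_worker(void *arg)
    {
        struct dns_req *req = arg;
        getaddrinfo(req->host, NULL, NULL, &req->result); /* blocks here */
        (void)write(req->notify_fd, "x", 1);
        return NULL;
    }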

Well written frameworks like Qt abstract away this complexity, but that may
not always play nice when mixing libraries.

~~~
ArkyBeagle
But the complexity of handling error code returns from UDP/TCP stacks is
fundamentally irreducible. IMO, and it's just that - I'd rather deal
tactically with fully nonblocking sockets than gamble on the writer of a
library's wrapping of it. If it turns out the library works then bonus - but
an error code at the socket layer may ripple all the way up to the ... UX
layer. The socket handler is the central artifact.

If you can, try a socket thing in Tcl. It (SFAIK) completely abstracts all
the ugly away. Stuff built properly (see the Brent Welch book for "properly")
in Tcl will be - again, SFAIK, after hundreds of these things - fully
reference grade. And they're very cheap to build.

I strongly recommend at least being able to build socket handlers in Tcl
because eventually, you'll get into a "he said/she said" over a comms link and
using Tcl to test your side is extremely convenient. I had a boss who liked
Wireshark and I told him - I don't need Wireshark; I have this test driver. I
said this because say you have 2GB of Wireshark spoor. Now what? Just print it
out and use a highlighter?

------
nathanappere
"So the net result was that this optimization really resulted in an extraneous
recvfrom call per request, which returned EAGAIN. What was I thinking?"

That only happens if somehow you have received data that is EXACTLY the size
of the buffer you pass to recvfrom. If you have read less, you know there is
no need to call it again. If your buffer is full, odds are there is more data
to read, so the next call won't be useless.

I suspect you actually had an extremely small proportion of "extraneous"
calls.
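
In code, the heuristic being described is roughly this (assuming stream-style
reads; the reply below notes UDP semantics differ):

    #include <sys/socket.h>

    void read_all_available(int fd)
    {
        char buf[4096];
        ssize_t n;
        do {
            n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
            /* ... process n bytes ... */
        } while (n == (ssize_t)sizeof(buf)); /* short read: done */
    }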

~~~
spc476
For UDP packets, requesting a read of less than a full packet will discard the
rest of the packet. This does not happen for TCP packets though.

Darn those leaky abstractions.

------
Matt3o12_
I was lost after the second code example (I'm not a C programmer, but I'm
quite comfortable in C-like languages (Go, Python, Java)).

    
    
        void
        xevents(void)
        {
            if (poll() || poll())
                while (poll()) {
                    /* ... */
                }
        }
    

What advantage does this code provide, and how is it related to the first
example? Why would calling poll three times help, if either call 1 or call 2
must be true, and call 3 and the following ones must be true as well?

~~~
svantana
That was the whole point, that convoluted source code can hide inefficiencies
like this.

------
Bino
Insightful. However, most people should probably look into libraries when
doing non-blocking I/O, which should remove these kinds of caveats.

~~~
ksherlock
Using a library doesn't remove the caveat of layers of abstractions. Quite the
opposite, in fact.

------
EGreg
Async structures aren't always available in every environment. Take JS for
instance.

Without ES6 and Babel, what's the way to write if statements with async?

    
    
      if (x) {
        _afterX(null, x);
      } else {
        getX(_afterX);
      }
      function _afterX(err, x) {
        if (err) return;
        // use x
      }
      // do stuff that doesn't depend on x

