
Something Rotten in the Core - jmah
http://www.codersnotes.com/notes/something-rotten-in-the-core/
======
jitl
It seems like commenters here on HN are caught on the point that GDB _does_
provide a reasonable machine interface. That's not the author's point;
responding specifically to that fact is missing the forest for one mis-labeled
non-tree.

Here are two paragraphs from later on that sum it up much better:

> We've seen it hundreds of times in all kinds of software. Functions that
> return bool instead of an error code. Where did the precise error vanish to?
> Poof, it's gone! What used to be a useful error message became false, and if
> you're lucky you'll get a generic "Unexpected error" appearing on screen.
> And that's if your program is using a library. If it's calling out to a
> command-line worker, the most likely case is it won't get checked at all and
> will just get printed out into a log file you'll never find, and then never
> seen again.

> (image) Homer's typing bird:
> http://www.codersnotes.com/notes/something-rotten-in-the-core/pecking.gif

> We're making systems that are fragile, because they're just glued on rather
> than bolted together. We're wrapping complex things up in wrappers that don't
> take the same responsibilities as the things they rely on. Like Homer's
> pecking bird in the Simpsons, they work just fine when everything is as
> expected, but when the slightest change in situation happens then everything
> breaks.

I think the article is worth reading in full.

~~~
alexandercrohde
I agree that GDB isn't substantive to the author's point; it's merely an
example, and the validity of the argument as a whole isn't contingent on that
one example.

However, I guess I'm not clear about the argument as a whole. The author is
essentially claiming, "We write wrappers that are shoddy in that they don't
consider failure cases."

That is true. What the author doesn't address is: what's the solution?

If the answer is "Code better," or "Think more," I feel like that's a
strawman.

I think the real question is how can we make a wrapper (e.g. a complex setup
script) provide visibility into its problems (permissions issue, hard drive
space, network issue, etc) without doubling the work involved?
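
One low-cost way to approach this, sketched in Python under assumptions of my own (the `CommandError` class and `run_checked` helper are hypothetical names, not something from the article): have the wrapper keep the exit code and stderr of the command it runs and attach them to the error it raises, instead of collapsing them into true/false.

```python
import subprocess
import sys


class CommandError(Exception):
    """Hypothetical error type that keeps what the child process told us."""

    def __init__(self, argv, returncode, stderr):
        self.argv = argv
        self.returncode = returncode
        self.stderr = stderr
        super().__init__(
            f"{argv[0]} exited with status {returncode}: {stderr.strip()}"
        )


def run_checked(argv):
    """Run a command; return its stdout, or raise CommandError with details."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        raise CommandError(argv, proc.returncode, proc.stderr)
    return proc.stdout


if __name__ == "__main__":
    # Stand-in for a failing worker: a child Python that complains and exits 3.
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('disk full\\n'); sys.exit(3)"]
    try:
        run_checked(failing)
    except CommandError as err:
        print(err.returncode, err.stderr.strip())
```

The extra work is one small helper used everywhere, which is probably as close to "without doubling the work" as a wrapper can get; deciding what to do with the captured details is still up to the caller.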

------
jepler
I couldn't tell whether the author was making a good point, because their
facts were badly askew.

Historically, there were multiple UNIX debuggers. It's true, gdb largely
exterminated them (and anyway, when people say UNIX they often mean
linux+freebsd+macos, where gdb was the de facto standard). However, there is
now a second common UNIX debugger from the llvm/clang alternate universe
(lldb).

In the very bad old days, debugger wrappers did drive gdb via the exact same
interface that a human would, with all the terribleness that entails. But gdb
has gone through several iterations of "GDB/MI", alternate APIs for "machine
interfaces"; the current iteration seems to be "mi2".
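
For a rough idea of what consuming MI looks like from the wrapper's side, here is a Python sketch that sorts MI output lines by their leading character, as I understand the output syntax from the GDB manual. The `classify` helper is my own name, and this is nowhere near a full MI parser (it does not decode the payload at all).

```python
import re

# Record-type prefixes, per my reading of the GDB/MI output syntax.
RECORD_KINDS = {
    "^": "result",
    "*": "exec-async",
    "+": "status-async",
    "=": "notify-async",
    "~": "console-stream",
    "@": "target-stream",
    "&": "log-stream",
}

# Optional numeric token, one prefix character, then the rest of the line.
_RECORD = re.compile(r"^(\d*)([\^*+=~@&])(.*)$")


def classify(line):
    """Return (token, kind, payload), or None for the prompt or unknown lines."""
    match = _RECORD.match(line.rstrip("\r\n"))
    if match is None:
        return None
    token, prefix, payload = match.groups()
    return (token or None, RECORD_KINDS[prefix], payload)


if __name__ == "__main__":
    # Illustrative lines only, not captured from a real session.
    for sample in ['^done', '*stopped,reason="breakpoint-hit"', '(gdb) ']:
        print(repr(sample), "->", classify(sample))
```

The point is only that the first character tells a program what kind of record it is looking at, which is quite different from scraping the human-oriented console output.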

Besides this, GDB has long since standardized the "debugger stub", which does
low level operations like reading memory from the debugged process,
starting/interrupting, etc.

As you can see, there are multiple interfaces to the debugger, several
of them oriented specifically to use by other programs, rather than being
repurposed human interfaces.

Besides this, gdb has also become internally extensible in Python, which is
pretty great.

In short, invoking gdb as a subprocess is A LOT more like using an API than it
is like trying to parse the output of "ls -l" as the only way to get a listing
of files and their properties. It just happens to involve a second address
space (a third, when you count the debugged program). It's just stream
oriented, and doesn't look like JSON (or name your preferred hotness) because
it was developed independently of, and quite possibly before, that preferred
hotness.

(I'm pretty sure lldb's interface is also designed to be driven by other
programs from the start, and it may also support a same-address-space
embedding mode, but I don't have any actual experience using lldb, I just know
it's out there)

~~~
jepler
"The LLDB debugger APIs are exposed as a C++ object oriented interface in a
shared library. The lldb command line tool links to, and uses this public API…
The entire API is also then exposed through Python script bindings which allow
the API to be used within the LLDB embedded script interpreter, and also in
any python script that loads the lldb.py module in standard python script
files. See the Python Reference page for more details on how and where Python
can be used with the LLDB API."

~~~
AstralStorm
And therefore there is no stable ABI because C++ lacks one. Maybe in the
future...

------
Animats
This is a classic problem with open-source GUIs - they're a wrapper around a
command-line program. Such programs typically have no clue what happened at
the command line level - they just present whatever the command line program
prints to the user.

A few days ago, there was a UI designer on here who was looking for an open
source program to work on. I suggested "git gui". Git's default GUI is a Tk
wrapper around the command line program. Lots of buttons, corresponding to
command line options. No understanding at the GUI level of what's safe, what's
useful right now, and what the state of the project is.

The original Macintosh deliberately lacked a command line, so programmers had
to figure out a usable GUI for everything. They had the right idea.

One of the original design misfeatures of UNIX is that programs take in
command line parameters and environment variables, but all they give back is a
numeric error code. If they gave back a list of strings and a set of
name/value pairs, and there was some convention about what should come back,
scripts and GUI front ends would be less dumb.
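
To make that concrete, here is a sketch of such a convention in Python. Everything in it is hypothetical (as far as I know no such convention exists): the child writes one JSON object of name/value pairs on stdout, and the front end reads fields instead of scraping text. The field names and the `run_structured` helper are made up for the example.

```python
import json
import subprocess
import sys

# Hypothetical child program: reports its outcome as name/value pairs.
CHILD = r"""
import json, sys
result = {"status": "error", "code": "ENOSPC", "path": "/tmp/out",
          "messages": ["no space left on device"]}
json.dump(result, sys.stdout)
sys.exit(1)
"""


def run_structured(argv):
    """Run a command and return its reported fields plus the exit status."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    try:
        fields = json.loads(proc.stdout)
    except json.JSONDecodeError:
        fields = {}
    if not isinstance(fields, dict):
        # The child printed valid JSON, but not a set of name/value pairs.
        fields = {}
    fields["exit_status"] = proc.returncode
    return fields


if __name__ == "__main__":
    report = run_structured([sys.executable, "-c", CHILD])
    print(report["exit_status"], report.get("code"), report.get("messages"))
```

A GUI could then grey out a button or show a specific dialog based on `code`, rather than dumping whatever the program printed.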

~~~
hossbeast
They also write a descriptive message (and nothing else) on stderr as a
convention.

~~~
Animats
Which is usually useless to a GUI or a script.

~~~
hossbeast
Other than for informing the user what went wrong (which is specifically one
of the things you called out as a problem)

------
klodolph
> We've seen it hundreds of times in all kinds of software. Functions that
> return bool instead of an error code. Where did the precise error vanish to?
> Poof, it's gone!

A thousand times yes.

Earlier this year, two of us spent a full day debugging a problem with some of
our automation. Our team has pretty good automation, for the most part, but
this particular problem was in kind of a dark corner. A shell script in the
automation would start up a process in the background, and then send commands
to that process. The background process could be slow to start up at times, so
to deal with this, the commands running in the foreground had long timeouts if
any. Guess what happens if the background process dies?

Well, the shell script doesn't care, that's for sure, it wasn't watching for
error codes in background processes. It was a bit of an adventure following
the path from the foreground commands to the missing background process, and
finding the log files for the background process.
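
A sketch of the check that was missing, in Python rather than shell and with made-up names (`wait_until_ready`, the `ready` callback): while waiting for the background process to come up, poll it and fail fast with its exit status if it has died, instead of sitting out the full timeout.

```python
import subprocess
import sys
import time


def wait_until_ready(proc, ready, timeout=30.0, interval=0.1):
    """Wait for ready() to return True, but give up early if proc has exited."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = proc.poll()  # None while the process is still running
        if status is not None:
            raise RuntimeError(
                f"background process exited with status {status}"
            )
        if ready():
            return
        time.sleep(interval)
    raise TimeoutError(f"background process not ready after {timeout} seconds")


if __name__ == "__main__":
    # Stand-in for a background worker that dies right away.
    worker = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(7)"])
    try:
        wait_until_ready(worker, ready=lambda: False, timeout=10.0)
    except RuntimeError as err:
        print(err)
```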

It's one of the things I like about writing this quick-and-dirty automation in
Go--the error handling is so _explicit_ that you'll usually end up with good
logs explaining what went wrong and what the program was trying to do at the
time. Much better than dealing with shell scripts. Shell scripts are quick to
write but you're often left in a bad position when they fail in unexpected
ways, or even in expected ways.

(The actual bug we hunted down was traced to one missing line in a
configuration file, but the problems with that piece of automation are far
larger.)

~~~
candiodari
So what you mean to say is that the problem is not so much that there isn't
any error reporting, but that, in C, it's being ignored ?

Never do

    
    
      printf("here's a number: %d", 11);
    

Always do:

    
    
      int attempts = 0;
      int ret = printf("here's a number: %d", 11);
      while (attempts++ < max_attempts && ret < 0) {
        switch (ret) {
          case EINTR:
          case EAGAIN:
            ret = printf("here's a number: %d", 11);
          default:
            // At this point you, as a programmer, should STOP AND THINK.
            // What would be a reasonable reaction here ? How will it affect
            // everything else the program does ? What is the correct way to
            // proceed ?
            //
            // P.S. Anyone doing "return -1;" at this point should be taken out and shot.
            // and yes, that's the C equivalent of what every Go programmer always does.
    
            panic("printf error", ret); // for example, crash the program.
        }
      }
    

Needless to say, you should do this on EVERY printf statement.

There. Isn't explicit erroring great ? NO IT ISN'T.

Needless to say, this has an almost direct translation to Go. Does anyone do
this ? Of course not. In Go, like in C, like in shell scripting, in the vast
majority of programs nearly all errors are ignored.

That's why exceptions are so very superior to explicit error handling : it
accomplishes many things :

1) it alerts the user that an error occurred. Code written with "explicit
error handling", as in C, Go, and most C++, will simply silently attempt to
proceed, likely turning a small error or a typo into a disaster or
catastrophe. Silent database corruption, here we come !

2) It provides information about where the error occurred. Stop me if this
sounds familiar: "when an error is printed, and the program crashes, I
download the source and grep it for what I think is a unique word in the error
message. When it turns out it isn't I get cranky. When it turns out there
isn't a unique word in the error I just sit down in a quiet corner and softly
cry".

3) It allows for "layered" error management strategies. I'm not saying it gets
it up to OCaml levels, but it is far superior to C or Go error management. In
the main function, you catch any Exception for the various parts of the
program you start, log it in a reasonable manner, alert if necessary, and
restart the relevant portion of the program. Inside the parts of the program
you catch finer grained exceptions with more explicit management.

4) it's far more concise.
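
Point 3 can be sketched like this in Python. The names (`parse_port`, `supervise`) are invented for the example: the inner code handles the narrow exception it knows how to deal with, and the top level catches anything else, logs it with a traceback, and restarts the part that failed.

```python
import logging

log = logging.getLogger("supervisor")


def parse_port(text, default=8080):
    """Inner layer: handle only the specific failure we understand."""
    try:
        return int(text)
    except ValueError:
        log.warning("bad port %r, falling back to %d", text, default)
        return default


def supervise(task, max_restarts=3):
    """Outer layer: catch anything, log it with a traceback, restart the task."""
    for attempt in range(1, max_restarts + 1):
        try:
            return task()
        except Exception:
            # log.exception records the traceback, so the failure site is kept.
            log.exception("task failed (attempt %d of %d)", attempt, max_restarts)
    raise RuntimeError(f"task still failing after {max_restarts} attempts")
```

Whether restarting is the right reaction depends on the program; the sketch only shows the two layers.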

So "explicit" error management ? Let's just be truthful here (just look at
Github examples of C and Go code): it's really just ignoring errors.

You can find coding errors involving ignored errors in the Go standard library
in minutes. Examples:

1)
[https://github.com/golang/go/blob/master/src/bufio/bufio.go#...](https://github.com/golang/go/blob/master/src/bufio/bufio.go#L239)

2)
[https://github.com/golang/go/blob/master/src/bufio/bufio.go#...](https://github.com/golang/go/blob/master/src/bufio/bufio.go#L270)

3)
[https://github.com/golang/go/blob/master/src/flag/flag.go#L5...](https://github.com/golang/go/blob/master/src/flag/flag.go#L528)

So even the Go core developers themselves can't be trusted to not ignore
errors.

~~~
klodolph
> So what you mean to say is that the problem is not so much that there isn't
> any error reporting, but that, in C, it's being ignored ?

...huh? No, I'm not saying that. I'm saying that if you write a shell script
there's a risk of not detecting errors that you care about. I also said that I
_like_ to rewrite these overgrown shell scripts in Go, which is apparently a
Wrong Opinion and some bad C code will somehow convince me of this.

First, the nitpicks: EAGAIN should not be handled here. EAGAIN shouldn't be
retried in a loop, that will just spin the CPU for no good reason. If printf()
returns EAGAIN it means that you made stdout non-blocking and hopefully you
would know if you did that, but that's unusual except in language runtimes.
There's also a missing break; in the switch.

Beyond that, I don't really care about error handling for printf() when I'm
logging output or running interactive programs.

Compare this with the behavior for C++:

    
    
        #include <fcntl.h>
        #include <iostream>
        #include <unistd.h>
        int main() {
            close(STDOUT_FILENO);
            std::cout << "Hello, world!\n";
            return 0;
        }
    

Try it yourself.

As an example of the errors we see in our logs, they often look like this:

    
    
        some_file.go:399: could not realign warp core coupling b502:
          plasmaManifold.PhaseInterplex(): host not found: m19d.eng.ncc1701d
    

The "ignored errors in the Go standard library" aren't really ignored errors.
Look at the bufio code a little bit more closely, you'll see that those errors
are properly returned.

~~~
candiodari
> First, the nitpicks: EAGAIN should not be handled here. EAGAIN shouldn't be
> retried in a loop, that will just spin the CPU for no good reason. If
> printf() returns EAGAIN

Manpage seems to imply that's not the only reason:

http://man7.org/linux/man-pages/man3/errno.3.html

And I'm pretty sure that the manpage is right : with creative redirects you
can make that happen for other reasons too. You can redirect stdout to a file
on NFS, or to a tcp socket that may have a full buffer, lots of evil ideas
come to mind.

I'll take another good look at the bufio error. Thing is, I'm also pretty sure
that I'd want bufio to correctly handle EINTR and EAGAIN and it seems to me
very unlikely that this golang runtime code is correct for those cases. But
I'll spend some time trying to make it fail. Maybe I'll learn something.

------
ajross
Lost interest right here:

> We're not talking about calling out to a library here. We're talking about
> actually launching an instance of GDB, passing it commands, and parsing the
> results it prints out. And this is where we get led down a dangerous path.

That's just wrong. GDB has a reasonably well specified control protocol, which
is what everything uses. Yes, it's ASCII and readable. No, it's not just
"parsing gdb output".

Come on.

------
citrin_ru
> So much user-facing network software is built on top of other programs, like
> ssh or rsync, and when those things fail they just don't know what to do.
> And so much of the problem is precisely because they're not using them as
> libraries, they're using them as command-line utilities.

Unix CLI utilities have a well-defined way to return an error: a non-zero exit
code. It is even possible to return different errors as different exit codes,
though it is rarely done.

------
clhodapp
I know it was just the specific example of a commonly wrapped program that the
author happened to use, but it does bear noting that gdb actually ships with a
perfectly usable raw mode interface built in. It even supports split panes.

------
gfody
reminiscent of Spolsky's law of leaky abstractions, definitely another side of
the same problem

