
Uh... but why?!?



Software is unreliable. Bugs happen. Always. There are bugs in avionics, in medical device firmware, in nuclear power plant monitoring software, in bank transfer backends, everywhere.

Once upon a time it was common to think that we could design software without bugs, or at least almost none. That didn't work at all! What did work were redundant systems, invariant testing, and fail-fast with restarts. This is how reliable systems are written these days.

Bugs are common; we have to learn to work around them.


> Bugs are common; we have to learn to work around them.

Or we could, you know, fix them.

I wasn't asking for a justification. I was just asking why this is occurring. If you don't know, that's cool. I mean, one of the reasons I ask is because I'd like to know if VMWare are going to fix this bug.

So thank you for explaining that software has bugs. I'm sure I'll remember that the next time I fix a regression in LibreOffice, as I did when EMF dashed lines weren't displaying correctly, or when I fixed JPEG exports not writing the DPI value correctly...


Just for future reference, something like "Do we know exactly what the bug in VMWare is, and whether they're going to fix it?" would be way more effective at getting the answer you're looking for here. "Uh... but why?!?" sounds like cursing at the sky, and gets a response appropriate for that.


Fair point.


/* fixed nullreferenceexception based on black box crash report */


    void segfault_sigaction(int signal, siginfo_t *si, void *arg)
    {
        //Pretend it never happened
        return;
    }


Of course, the only right way is to panic() and wait for a supervisor to restart the process / VM.

Idiomatic Erlang doesn't differentiate between "system" / "environment" errors and local bugs. If it has failed — restart it!
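In OTP terms that just means running your workers under a supervisor. A minimal sketch, using the map-style specs from OTP 18 and made-up module names:

  %% Hypothetical supervisor: whenever my_worker exits abnormally, for
  %% whatever reason, it is restarted with a fresh, known-good state.
  -module(my_sup).
  -behaviour(supervisor).
  -export([start_link/0, init/1]).

  start_link() ->
      supervisor:start_link({local, ?MODULE}, ?MODULE, []).

  init([]) ->
      {ok, {#{strategy => one_for_one},
            [#{id    => my_worker,
               start => {my_worker, start_link, []}}]}}.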


I'll bite. What about Erlang makes it so that a restarted process doesn't run into the same bug when it gets to the same point, and panic again in an infinite loop?

The only way I can imagine this working is if Erlang is so buggy and nondeterministic that it inserts crashes sometimes but not all of the time. But that's obviously absurd.


If it's some weird race condition crash, restarting (hopefully?) puts you in a known good state and you're unlikely to hit it again.

If it quickly repeats, you've isolated the failure to happening within a narrow scope.

This part isn't really Erlang magic; Apache in pre-fork mode has a lot of the same properties. There may be some magic in supervision strategies, but I think the real magic is the amount of code you get to leave out by accepting the possibility of crashes and having concise ways to bail out on error cases.

For example, to do an mnesia write and continue if successful and crash if not, you can write

  ok = mnesia:write(Record)
Similarly, when you're writing a case statement (like a switch/case in C), if you expect only certain cases, you can leave out a default case, and just crash if you get weird input.
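For instance (the function name and the input atoms are made up here):

  handle(Status) ->
      case Status of
          active   -> ok;
          inactive -> stopped
          %% deliberately no '_ ->' clause: any other value raises
          %% a {case_clause, Status} error and crashes the process
      end.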

I also find the catch Expression way of dealing with possible exceptions is often nicer than try/catch. It returns the exception so you can do something like

  case catch Expression of
    something_good -> ok;
    {'EXIT', {badarg, _Stack}} -> not_so_great
  end
and handle the errors you care about in the same place as where you handle the successes.

Edited to add, re: failwhale, your HTTP entrypoints can usually be something like

  try
    real_work_and_output()
  catch
    E:R ->
      log_and_or_page(E, R),
      output_failwhale()
  end.
As long as the failure in real_work_and_output is quick enough, you'll get your failwhale. Of course, if the problem is that processing is too slow, you might want to set a global failwhale flag somewhere, but your ops team can hotload a patch if they need to fix the performance of the failwhale ;)


"It returns the exception so you can do something like

  case catch Expression of"
Something to be aware of is the cost of a bare catch when an exception of type 'error' is thrown:

"[W]hen the exception type is 'error', the catch will build a result containing the symbolic stack trace, and this will then in the first case [1] be immediately discarded, or in the second case matched on and then possibly discarded later. Whereas if you use try/catch, you can ensure that no stack trace is constructed at all to begin with." [0]

Stack trace construction isn't free, so it makes sense to avoid it if you're not going to use it. I know that in either Erlang 17 or Erlang 18, parts of Mnesia were slightly refactored to move from bare catch to try/catch for this very reason.
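For comparison, the try-based version of the earlier example would look roughly like this (Expression again standing in for whatever you're evaluating):

  %% Matching error:badarg directly means the bare-catch result term
  %% (with its symbolic stack trace) never has to be built.
  try Expression of
      something_good -> ok
  catch
      error:badarg -> not_so_great
  end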

[0] http://erlang.org/pipermail/erlang-questions/2013-November/0...

[1] He's referring back to an example in the email


Thanks, I don't follow the mailing lists, so I probably wouldn't have known to think about that.


Wondered this too; naively, it only makes sense for the case where neutrinos screw your RAM over.


see section 3.4 here: http://erlang.org/documentation/doc-4.9.1/doc/design_princip...

"3.4 The Restart Frequency Limit Mechanism"


Well, okay, so your process crashes, you restart it, it crashes a few more times, then you kill it. What's the advantage there? How does this increase availability, beyond killing it the first time it crashes?

It seems actively worse to allow users to retry requests that are doomed to failure than to put up a fail-whale or similar while the ops team is being paged.


Because most production bugs are infrequent (otherwise they would be caught by testing). They have to be logged and fixed, but they can't be allowed to move the system into an inconsistent state. Restart first, fix later.


Are they? The bug discussed in this comment was extremely deterministic. There's a difference between infrequent in the sense that it happens rarely across lots of users and lots of requests, and infrequent in the sense that, for one particular use, it only triggers sometimes.

Also, the bug discussed in this article wasn't causing crashes. What would you propose be crashed and restarted in this case?


Yeah, that's the way. To fix a latency issue, issue a panic. rolls eyes


What's the point in low latency if results are incorrect?


What's the point in rebooting a system if the results remain incorrect after the reboot?



