
Redis crashes - a small rant about software reliability - hnbascht
http://antirez.com/news/43
======
jgrahamc
His point about logging registers and stack is interesting. Many years ago I
worked on some software that ran on Windows NT 4.0 and we had a weird crash
from a customer who sent in a screen shot of a GPF like this:
<http://pisoft.ru/verstak/insider/cwfgpf1.gif>

From it I was able to figure out what was wrong with the C++ program. Notice
that the GPF lists the instructions at CS:EIP (the instruction pointer of the
running program) and so it was possible by generating assembler output from
the C++ program to identify the function/method being executed. From the
registers it was possible to identify that one of the parameters was a null
pointer (something like ECX being 00000000) and from that information work
back up the code to figure out under what conditions that pointer could be
null.

Just from that screenshot the bug was identified and fixed.
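
A toy reconstruction of the pattern (not the original code; assuming MSVC's
32-bit __fastcall convention, where the first pointer argument travels in
ECX, so ECX=00000000 in the GPF is the tell-tale of a null argument):

    struct Conn { int fd; };

    static int __fastcall conn_fd(struct Conn *c)
    {
        return c->fd;    /* reads [ecx]; faults when c == NULL */
    }

    int main(void)
    {
        struct Conn *c = 0;   /* never assigned on some code path */
        return conn_fd(c);    /* CS:EIP lands in conn_fd, ECX = 0 */
    }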

~~~
malkia
Unfortunately address space randomization techniques make this much harder.

~~~
geal
Not necessarily. The screenshot shows the bytes pointed to by the IP, so it
would still be possible to find them in a binary you just built, and debug
it from there.

~~~
malkia
Only if the code hasn't been relocated, and even then there can be a lot of
code duplication, especially with C++ templates/inlines.

------
dap
Great post, showing admirable dedication to software reliability and a solid
understanding of memory issues.

One of the suggestions was that the kernel could do more. Solaris-based
systems (illumos, SmartOS, OmniOS, etc.) do detect both correctable and
uncorrectable memory issues. Errors may still cause a process to crash, but
they also raise faults to notify system administrators what's happened. You
don't have to guess whether you experienced a DIMM failure. After such errors,
the OS then removes faulty pages from service. Of course, none of this has any
performance impact until an error occurs, and then the impact is pretty
minimal.

There's a fuller explanation here:
[https://blogs.oracle.com/relling/entry/analysis_of_memory_pa...](https://blogs.oracle.com/relling/entry/analysis_of_memory_page_retirement)

~~~
ComputerGuru
I don't think enough people appreciate just how awesome an OS Solaris was.
I never had the opportunity to deploy it full-scale for any projects, but I
lamented the loss of great potential when it "died."

~~~
dap
It didn't die. It was forked by the community when Oracle close-sourced it.
The community fork (called illumos) is being actively developed by multiple
companies, which have done significant new feature work (e.g.,
<http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/>).

------
CrLf
I find this idea of a lack of ECC memory on servers disturbing... ECC is the
default on almost all rack-mountable servers from the likes of HP or IBM. Of
course, people use all kinds of sub-standard hardware for "servers" on the
cheap, and they get what they pay for.

I haven't seen a server without ECC memory for years. I don't even consider
running anything in production without ECC memory, let alone VM hypervisors. I
find it pretty hard to believe that EC2 instances run on non-ECC memory hosts,
risking serious data loss for their clients.

Memory errors can be catastrophic. Just imagine a single bit flip in some in-
memory filesystem data structure: the OS just happily goes on corrupting your
files, assuming everything's OK, until you notice it and half your data is
already lost.

Been there (on a development box, but nevertheless).

~~~
antirez
I hope there is a way to get some official statement from Amazon, Linode, and
other widely used VM providers about the kind of memory used in their servers.
This would help users understand the real risks.

~~~
CrLf
I think they don't mention it because they consider it obvious (I hope).
However, with all the specially built servers that big providers use to reduce
costs, there is some room for doubt.

I think Google may be able to get away with it. With enough checksums along
the way, memory (and other hardware) errors can be detected in software pretty
easily if you have independent machines checking the data and can afford the
processing penalty.

Now, for virtualization I seriously doubt it. Not unless their instances run
simultaneously on more than one machine to check for inconsistencies between
them (something that mainframes have done since the dawn of time, but that I
don't see as feasible in a distributed environment).

------
shin_lao
This is an interesting post, especially the part about memory testing.

We have a simple policy: ECC memory is required to run our software in
production. Failure to do so voids the warranty.

~~~
minimax
What if your customers want to run on EC2 instances?

~~~
antirez
It is covered in the blog post. (This is not a critique, just a hint; I
understand that reading a very long blog post is time consuming.)

~~~
minimax
I did read the whole post. It was very informative. I wasn't aware that EC2
did not have ECC RAM. My question was directed at shin_lao's policy about not
providing a "warranty" for his customers running on non-ECC hardware.

~~~
antirez
Oh sorry I get it now...

------
jimwhitson
At IBM, we were very keen on what we called 'FFDC' - 'first-failure data
capture'. This meant having enough layers of error-detection, ideally all the
way down to the metal, so that failures could be detected cleanly and logged
before (possibly) going down, allowing our devs to reproduce and fix customer
bugs. Naturally it wasn't perfect, and it depended on lots of very tedious
planning meetings, but on the stuff I worked with (storage devices mainly) it
was remarkably effective.

In my experience in more 'agile' firms - startups, web dev shops and so on -
it would be very hard to make a scheme like this work well, because of all the
grinding bureaucracy, fiddly spec-matching and endless manual testing
required, as well as the importance of controlling - and deeply understanding
- the whole stack. Nonetheless, for infrastructure projects like Redis, I can
see value in having engineering effort put explicitly into making 'prettier
crashes'.

~~~
ricardobeat
Spec-matching is a specialty of good agile companies, but web-dev shops don't
usually write their own db/server software.

------
apaprocki
Can't agree with this more... And he is just talking about logging crashes.
One of the best debugging tools you have at your disposal in a large system (a
lot of programmers contributing code -- bugs can be anywhere) is logging the
same stack information quickly under normal operation when strange
circumstances arise, so as not to slow down the production software. The
slowest part of printing that information is resolving the stack addresses in
the binary to symbol names. That part of the debugging output can be done
"offline" in a helper viewer binary and does not need to happen in the
critical path. We frequently append stack traces to log messages as strings of
hex addresses detectable by a regex. The log viewer transforms these back into
an actual symbolic stack trace at viewing time, avoiding the hit of resolving
all the symbols in the hot path.
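
A minimal sketch of the hot-path half of this, using glibc's backtrace() (the
function name and the "stack:" log format are made up for illustration):

    /* Emit raw return addresses cheaply; resolve them offline, e.g.
     * with: addr2line -f -e ./binary 0x... 0x...                    */
    #include <execinfo.h>
    #include <stdio.h>

    #define MAX_FRAMES 32

    void log_stack_as_hex(FILE *log)
    {
        void *frames[MAX_FRAMES];
        int n = backtrace(frames, MAX_FRAMES);   /* no symbol lookup */

        fprintf(log, "stack:");                  /* regex-friendly   */
        for (int i = 0; i < n; i++)
            fprintf(log, " %p", frames[i]);
        fprintf(log, "\n");
    }

The viewer matches the "stack: ..." lines and maps each address back to a
symbol at viewing time (adjusting for the load address if the binary is
ASLR'd).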

------
js2
It's crazy that an application should have to test memory. It should simply be
handled by the HW and OS. For example, some details about how Sun/Solaris
deals with memory errors:

<http://learningsolaris.com/docs/DRAM_errors.pdf>

Note the section on DRAM scrubbing, which I was reminded of from the original
article's suggestion on having the kernel scan for memory errors. (I remember
when Sun implemented scrubbing, I believe in response to a manufacturing issue
that compromised the reliability of some DIMMs.)

------
erichocean
Although we use ECC in our servers already, I've recently been experimenting
with hashing object contents _in memory_ using a CityHash variant. The hash is
checked when the object moves on chip (into cache), and re-computed before the
object is stored back into RAM when it's been updated.

Although our production code is written in C, I'm not particularly worried
about detecting wild writes, because we use pointer checking algorithms to
detect/prevent them in the compiler. (Of course, that could be buggy too...)

What I'm trying to catch are wild writes from _other devices_ that have access
to RAM. Anyway, this is far from production code so far, but hashing has
already been very successful at keeping data structures on disk consistent (a
la ZFS, git), so applying the same approach to memory seems like the next
step.
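
Roughly this shape, if anyone's curious (a sketch rather than our actual
code; FNV-1a stands in here for the CityHash variant, and the names are made
up):

    #include <stdint.h>
    #include <stddef.h>
    #include <assert.h>

    typedef struct {
        uint64_t checksum;    /* seal over payload while it sits in RAM */
        char payload[56];     /* object contents; 64 bytes total        */
    } guarded_obj;

    static uint64_t fnv1a(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint64_t h = 0xcbf29ce484222325ULL;      /* FNV-1a 64-bit */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 0x100000001b3ULL;
        }
        return h;
    }

    /* Verify when the object comes "on chip"; a mismatch means RAM
     * changed behind our back (bit flip, wild DMA write, ...). */
    void obj_load_check(const guarded_obj *o)
    {
        assert(o->checksum == fnv1a(o->payload, sizeof o->payload));
    }

    /* Re-seal after an update, before the object goes cold in RAM. */
    void obj_store_seal(guarded_obj *o)
    {
        o->checksum = fnv1a(o->payload, sizeof o->payload);
    }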

The speed hit is surprisingly low, 10-20%, and when you put it that way, it's
like running your software on a six-month-old computer. So much of the safety
stuff we refuse to do "for performance" would be like running on top-of-the-
line hardware three years ago, but safely. That seems like a worthwhile trade
to me...

P.S. Are people really not burning in their server hardware with memtest86? We
run it for 7 days on all new hardware, and I figured that was pretty
standard...

~~~
aidenn0
1) Yes, lots of people don't run memtest86 at all.

2) Even those that do typically run it for no more than 24 hours.

3) Many people don't build their own hardware these days; it's a VPS or EC2.

4) If you've selected ECC RAM, then you know way more about memory failures
than >99% of Redis users.

------
grundprinzip
I totally like this post, because main-memory based software systems will
become the future for all kinds of applications. Thus, handling errors on this
side will become more important as well.

Here are my additional two cents: at least on x86 systems, checking small
memory regions without affecting the CPU cache can be implemented using non-
temporal writes, which force the CPU to write the data directly back to
memory. The instruction required for this is called movntdq and is generated
by the SSE2 intrinsic _mm_stream_si128().
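
As a sketch (the nt_fill name is mine, and the buffer must be 16-byte
aligned):

    #include <emmintrin.h>   /* SSE2: _mm_stream_si128 -> movntdq */
    #include <stdint.h>
    #include <stddef.h>

    /* Write a test pattern with non-temporal stores so it bypasses
     * the cache and lands in DRAM. The read-back side still needs
     * care (e.g. flushing first), or you may just re-read cached
     * lines instead of the DIMM contents. */
    void nt_fill(void *buf, size_t bytes, uint64_t pattern)
    {
        __m128i v = _mm_set1_epi64x((long long)pattern);
        __m128i *p = (__m128i *)buf;
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(p + i, v);
        _mm_sfence();   /* make the streaming stores globally visible */
    }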

------
codeflo
In theory, there's nothing stopping the OS from remapping the pages of your
address space to different physical RAM locations at any point during your
test. So even if you have a reproducible bit error that caused the crash,
there's a chance that the defective memory region is not actually touched
during the memory test.

Now, this may not be such a huge problem in practice because the OS is
unlikely to move pages around unless it's forced to swap. But that depends on
details of the OS paging algorithm and your server load.
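
(On Linux you can even observe this through /proc/self/pagemap, which maps
each virtual page to its current physical frame number. A sketch, with
pfn_of being a name I just made up:)

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Return the physical frame number backing addr, or 0 if unknown.
     * If it changes between calls, the OS moved the page under you.
     * Pagemap entries are 8 bytes: bit 63 = present, bits 0-54 = PFN. */
    uint64_t pfn_of(void *addr)
    {
        uint64_t entry = 0;
        long pagesize = sysconf(_SC_PAGESIZE);
        FILE *f = fopen("/proc/self/pagemap", "rb");
        if (!f) return 0;
        fseek(f, (long)((uintptr_t)addr / pagesize) * 8, SEEK_SET);
        if (fread(&entry, sizeof entry, 1, f) != 1) entry = 0;
        fclose(f);
        return (entry >> 63) ? (entry & ((1ULL << 55) - 1)) : 0;
    }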

------
nicpottier
This kind of attention to detail is all too rare these days. I love Redis,
because I have never, not once, ever had to wonder whether it was doing its
job. It is like a constant, always running, always doing a good job and
getting out of the way.

It only does a few things, but it does them exceedingly well. Just like nginx,
I know it will be fast and reliable, and it is this kind of crazed attention
to detail that gets it there.

------
ComputerGuru
Page is down. Here is a formatted copy: <https://gist.github.com/4154289>

~~~
antirez
Sorry, the Sinatra based site is deployed with "ruby app.rb". Probably not
enough...

~~~
irahul
> Sorry, the Sinatra based site is deployed with "ruby app.rb". Probably not
> enough..

Since you are running it as "ruby app.rb", I take it you aren't interested in
doing an app server/web server/cache deployment. But if you aren't using thin,
that's only a "gem install thin" away.

------
chewxy
And people wonder why I recommend Redis. Having run Redis for over 1.5 years
on production systems as a heavy cache, a named queue and a memoization tool
(on the same machine), Redis has never once failed me. antirez's blog post
makes his attention to detail clear.

This post is fantastic.

------
tylerneylon
The memory check algorithm is a nice solution to the challenges he presents -
easy to understand and effective.

Here is a variation which, unless I'm missing something, would be a little
simpler still and require fewer full-memory loops:

1. Count the number of 1 bits in memory (possibly mod N to avoid overflow).

2. Invert memory.

3. Count the number of 0 bits in memory.

4. Invert memory.

I think this would catch the same errors (stuck-as-0 or stuck-as-1 bits).

One difficulty is that multiple errors could cancel each other out, at which
point you can do things like add checkpoints in the aggregation, or track more
signals such as number of 01's vs number of 10's. In the end, this is like an
inversion-friendly CRC.
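
As a sketch in C (using GCC's __builtin_popcountll; check_region is a
hypothetical name, and multiple stuck bits can still cancel, as noted):

    #include <stdint.h>
    #include <stddef.h>

    /* Returns nonzero when the counts agree. A stuck-at-0 or
     * stuck-at-1 bit survives the inversion and skews one count. */
    int check_region(uint64_t *mem, size_t words)
    {
        uint64_t ones = 0, zeros = 0;

        for (size_t i = 0; i < words; i++)   /* 1. count 1 bits */
            ones += (uint64_t)__builtin_popcountll(mem[i]);
        for (size_t i = 0; i < words; i++)   /* 2. invert       */
            mem[i] = ~mem[i];
        for (size_t i = 0; i < words; i++)   /* 3. count 0 bits */
            zeros += 64 - (uint64_t)__builtin_popcountll(mem[i]);
        for (size_t i = 0; i < words; i++)   /* 4. invert back  */
            mem[i] = ~mem[i];

        return ones == zeros;   /* healthy: every 1 became a 0 */
    }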

------
lucian1900
Perhaps using safer languages (and languages with better error reporting)
would be a solution to these kinds of problems.

~~~
wheaties
Um, no. C is a perfectly valid language, and some of the best, most robust
systems in the world are written in it (Linux, Git, etc.). Some languages are
even built to run atop C (Cython). Even the JVM deals with pointers, memory
allocation issues, and such so you don't have to, but it's still there!

So using a higher level or "safer" language isn't going to stop these kinds of
problems.

~~~
pilgrim689
Sorry if I'm bursting your bubble, but Linux and git are not "the best and
most robust systems in the world". If they were, the state of the art of safe
and reliable software systems would be quite pitiful.

edit: that doesn't detract from your point, however, that C is used nowadays
on "robust systems"... in terms of popular robust kernels, though, you'll want
to look at something like L4 or QNX Neutrino. There's a kernel that has
actually been formally verified (seL4) that was first written in Haskell, then
verified, then translated to C (for speed).

~~~
javert
Because it's super widely deployed and has a very mature development process,
I would expect the Linux kernel to be among the most robust software in the
world.

I am interested to hear what you think is more robust than Linux, setting
aside seL4. Do you think QNX Neutrino is more robust? If so, why? And what
else?

I would expect vxWorks and other RTOSs to generally be less robust than Linux,
despite typically going through various certifications.

~~~
erichocean
The major source of errors in a kernel is device drivers, and Linux typically
is running many more drivers than the embedded kernels you mentioned. Look at
any Linux point release: the majority of churn is in driver code, to fix bugs.

Thus, it stands to reason, Linux is likely less stable than an embedded kernel
without all that driver code.

I run a pre-emptive embedded kernel (QK) that's extremely tiny and, in fact,
was validated by myself with KLEE (a symbolic checker) to exhaustively verify
correctness. I'm certain it's more reliable than Linux, which carries no such
guarantee (and is orders of magnitude larger -- even excluding driver code).

Bug rates correlate _extremely_ closely with lines of code. All else being
equal, a large system has more bugs, simply because it has more opportunity
for them.

If you truly care about correctness, doing formal verification, model
checking, etc. is the way to go, _not_ "lots of people use it so it must be
stable".

To benefit from formal verification, you have to design your code around it,
and most systems today aren't. It's hard to retrofit verification onto a
legacy codebase like Linux.

~~~
javert
I don't really understand your driver argument, because when deploying Linux
on an embedded system, you would only use the drivers you need... so you'd
have the same amount of driver code as with any other kernel.

You say that you managed to "verify correctness" on QK, but clearly, that's
not an accurate statement, since AFAIK seL4 is the only kernel that has ever
been "proven correct" (and even for seL4, there are some gotchas there,
AFAIK).

 _If you truly care about correctness, doing formal verification, model
checking, etc._ ...

The common perception is that model checking and formal verification are still
just research areas that can't be used practically beyond toy problems. Again,
seL4 is the only kernel I know of that has been "proven correct," and that has
taken an insane amount of manpower that is not scalable to anything more
complex than seL4. That seems to be evidence that there is something to the
"common perception" I stated above.

 _lots of people use it so it must be stable_

That's not really my argument. With Linux, there is an insane amount of
testing going on all the time (I mean "informal" testing... just people using
it and reporting problems if they encounter any... though there are also farms
set up that test Linux, as well). Linux must be by far the most highly-tested
software in history. On the other hand, you could use formal methods on some
small kernel and maybe get some help (but again, far short of proving there
are no bugs, AFAIK), but the level of testing will be many orders of magnitude
less. Clearly, which one wins out depends on how good the actual state of the
art in formal methods is, but my impression is that many orders of magnitude
more testing will win out, unless the problem you are solving is very, very
small.

~~~
erichocean
The entire QK[1] kernel, including all library code, is less than two
thousand lines of C (and most of the library code isn't even used by the
kernel itself, which is just 234 LOC). QK is a single-processor kernel
(although still pre-emptive) and very well written and tested (in, literally,
hundreds of millions of devices).

It was not difficult at all to run QK and its supporting code through KLEE
and exhaustively verify the properties of each function, thanks to a super-
simple design and the many included assertions, preconditions, and
postconditions, which KLEE helpfully proves are satisfied automatically. If I
wanted a certified optimizing C compiler, I'd use CompCert[2] to compile QK,
which would give me a certified kernel all the way to machine code. (I
actually use Clang.)
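
For flavor, a toy KLEE harness (not QK's actual code): mark the inputs
symbolic, and KLEE explores every path, either proving the assertion or
handing back a concrete counterexample.

    #include <klee/klee.h>
    #include <assert.h>

    static int sat_add(unsigned char a, unsigned char b)
    {
        unsigned int s = (unsigned int)a + b;
        return s > 255 ? 255 : (int)s;   /* saturate at 255 */
    }

    int main(void)
    {
        unsigned char a, b;
        klee_make_symbolic(&a, sizeof a, "a");
        klee_make_symbolic(&b, sizeof b, "b");

        int r = sat_add(a, b);
        assert(r >= 0 && r <= 255);   /* checked on all paths */
        return 0;
    }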

I am familiar with the verified L4 kernel, and it is _far_ more complex than
the QK kernel I have verified. That people have already verified a kernel far
more complex should be sufficient proof that a much simpler kernel can also
have its correctness verified.

Stepping back, it's 2012: people should no longer be surprised when a small-
but-meaningful codebase is certified correct. It's common enough now that you
don't get published in a journal just because you did it.

[1] <http://www.state-machine.com/qp/index.php>

[2] <http://compcert.inria.fr/>

~~~
javert
By the way, I am going to look into actually applying KLEE to some of my own
code. I had heard of it before, but thanks for bringing it up, I hadn't really
given it much thought until now.

In my research, I'm trying to write a "fancy" task scheduler at the user
level, so that someone can get "fancy" scheduling policies without modifying
the RTOS kernel. By "fancy" I mean fully preemptive and allowing "interesting"
scheduling policies/synchronization protocols to be implemented. "Interesting"
generally means multicore, in my particular research community.

If you don't mind me asking, what kind of projects do you/have you used QK
for?

------
pnathan
There is an approach to hard real-time software in which antirez's idea of a
memory checker is actually applied.

------
BoredAstronaut
This post reminded me of my time as a consulting systems support specialist.
Lots of weird problems turned out to be bad hardware. Usually memory or disk,
sometimes bad logic boards. For end users, this would often lead to complete
freezing of the computer, so it was less likely to be blamed on broken
software, but there were still many times it was hard to be sure. Desktop OS
software can flake out in strange ways due to memory problems. I used to run a
lot of memory tests as a matter of course.

I think the title of the article could be more accurate, considering how much
is devoted not to issues about software reliability per se, but to
distinguishing between unreliable software and unreliable hardware. I think an
implicit assumption in most discussions about software reliability is that the
hardware has been verified.

I personally do not think that it is the responsibility of a database to
perform diagnostics on its host system, although I can sympathize with the
pragmatic requirement.

When I am determining the cause of a software failure or crash, the very first
thing I always want to know is: is the problem reproducible? If not, the bug
report is automatically classified as suspect. It's usually not feasible to
investigate a failure that only happened once and cannot be reproduced.
Ideally, the problem can be reproduced on two different machines.

What we're always looking for when investigating a bug are ways to increase
our confidence that we know the situation (or class of situation) in which the
bug arises. And one way to do this is to eliminate as many variables as
possible. As a support specialist trying to fix a faulty computer or
program, I followed the same course: isolate the cause by a process of
elimination. When everything else has been eliminated, whatever you are left
with is the cause.

I'm still all jonesed up for a good discussion about software reliability.
antirez raised interesting questions about how to define software that is
working properly or not. While I'm all for testing, there are ways to design
and architect software that make it more or less amenable to testing. Or more
specifically, that make it easier or harder to provide full coverage.

I've always been intrigued by the idea that the most reliable software
programs are usually compilers. I believe that is because computer languages
are amongst the most carefully specified kinds of program input, whereas so
many computer programs accept very poorly specified kinds of input, like user
interface actions mixed with text and network traffic, which are at higher
risk of having ambiguous elements. (For all their complexity, compilers have
it
easier in some regards: they have a very specific job to do, and they only run
briefly in batch operations, producing a single output from a single input.
Any data mutations originate from within the compiler itself, not from the
inputs they are processing.)

In any case, I believe that the key to reliable programs is a
complete and unambiguous definition of any and all data types used by those
programs, as well as complete and unambiguous definitions of the legitimate
mutations that can be made to those data types. If we can guarantee that only
valid data is provided to an operation, and guarantee that each such operation
produces only legitimate data, then we reduce the chances of corrupting our
data. (Transactional memory is such an awesome thing. I only wish it was
available in C family languages.)

One of my crazy ideas is that all programs should have a "pure" kernel with a
single interface, either a text or binary language interface, and this kernel
is the only part that can access user data. Any other tool has to be built on
top of this. So this would include any application built with a database back-
end.

I suppose that a lot of Hacker News readers, being web developers, already
work on products featuring such partitioning. But for desktop software
developers who work with their own in-memory data structures and their own
disk file formats, it's not so common or self-evident. Then again, even
programs that do rely on a dedicated external data store also keep a lot of
other kinds of data around, which may not be true user data, but can still be
corrupted and cause either crashes or program misbehaviour.

In any case, I suspect that this is going to be an inevitable side-effect of
various security initiatives for desktop software, like Apple's XPC. The same
techniques used to partition different parts of a program to restrict their
access to different resources often lead to also partitioning operations on
different kinds of data, including transient representations in the user
interface.

Can a program like Redis be further decomposed into layers to handle tasks
focussed on different kinds of data to achieve even better operational
isolation, and thereby make it easier to find and fix bugs?

~~~
barrkel
_I've always been intrigued by the idea that the most reliable software
programs are usually compilers._

I don't think this is necessarily true; I used to maintain the Delphi
compiler, and there were hundreds of bugs in the backlog that never really got
looked at owing to workarounds, low impact and high cost of fixing.

What compilers usually have going for them is that they are batch processes
rather than online processes, so they don't have time to build up crud in data
structures; they have highly reproducible inputs - code that causes a crash
normally causes a crash every run of the program, no weird mouse clicks or
timing needed, and this code can usually be sent back to the vendor; and all
customer code is effectively a unit test, so feedback from betas etc. is
immediate and loud.

