From it I was able to figure out what was wrong with the C++ program. Notice that the GPF lists the instructions at CS:EIP (the instruction pointer of the running program), so by generating assembler output from the C++ program it was possible to identify the function/method being executed. From the registers it was possible to tell that one of the parameters was a null pointer (something like ECX being 00000000), and from that information to work back up the code and figure out under what conditions that pointer could be null.
Just from that screenshot the bug was identified and fixed.
One of the suggestions was that the kernel could do more. Solaris-based systems (illumos, SmartOS, OmniOS, etc.) do detect both correctable and uncorrectable memory issues. Errors may still cause a process to crash, but they also raise faults to notify system administrators what's happened. You don't have to guess whether you experienced a DIMM failure. After such errors, the OS then removes faulty pages from service. Of course, none of this has any performance impact until an error occurs, and then the impact is pretty minimal.
There's a fuller explanation here:
Pro-tip: use ECC memory on servers. The end.
I haven't seen a server without ECC memory for years. I don't even consider running anything in production without ECC memory, let alone VM hypervisors. I find it pretty hard to believe that EC2 instances run on non-ECC memory hosts, risking serious data loss for their clients.
Memory errors can be catastrophic. Just imagine a single bit flip in some in-memory filesystem data structure: the OS just happily goes on corrupting your files, assuming everything's OK, until you notice it and half your data is already lost.
Been there (on a development box, but nevertheless).
I think Google may be able to get away with it. With enough checksums along the way, memory (and other hardware) errors can be detected in software pretty easily if you have independent machines checking the data and can afford the processing penalty.
Now, for virtualization I seriously doubt it. Not unless their instances run simultaneously on more than one machine to check for inconsistencies between them (something mainframes have done since the dawn of time, but which I don't see as feasible in a distributed environment).
Thus, a configuration that relies on the availability of a single machine is already risking serious outage or data loss by not being machine-redundant. Reliable systems require the coordination of many machines (at least two), and the replication of data across them if data's involved.
It is useful to have component-level redundancy (e.g., RAID or ECC memory), but in some environments it may be cheaper overall to have machine-level redundancy using inexpensive machines. It also only takes the failure of a single critical subsystem for a machine to suffer an outage. You might have ECC memory and RAID, but do you have only a single Ethernet card and power supply? Single machine availability is a "weakest link" phenomenon from its components.
I acknowledge that building software to run across a fleet of machines is more difficult than building software that runs on only a single machine, but (1) the software development cost is largely a fixed cost, not a variable cost in the number of machines, and (2) building a distributed system is sometimes needed for scaling reasons anyway.
If you scale a single machine vertically (i.e., get a bigger box), its cost rises faster than its capabilities; so an efficient high-scale system typically also means running a fleet of cheap machines (scale horizontally). I think these effects contribute to the rise of commodity-server computing, and cost is a reason not to consider it disturbing.
In other words, crunch the numbers and see when it makes sense :-)
The problem with memory errors is that they are silent. You won't notice them until something goes mysteriously wrong. And that can be anything, from an innocent invalid memory access to data corruption. This just can't be tolerated anywhere data is being processed, data you don't want to lose that is...
RAID does nothing if the OS thinks that its in-memory filesystem data structures are correct, and just goes ahead and updates the superblock with bad data, or writes over other files' pages. You just get a nice, redundant, corrupted filesystem. The same goes for multiple machines sharing data anywhere, filesystems and databases alike.
It's the error detection part that's important, not the correction part. And ECC main memory is just a part of the picture, you want to be notified of errors as soon as possible. And this is the important bit: "be notified". So, you want parity checks and CRCs on disk caches and data buses and everywhere else it's feasible. It's not an accident that server-class hardware costs more than your average PC.
The "correction" part is just a welcome by-product. I, for one, replace memory modules as soon as they trigger more than one ECC event. And this happens occasionally even with an universe of machines in the low dozens, with supposedly high-quality components. Now think what may be happening silently with all those borderline memory modules from anonymous manufacturers in China...
Besides, like I mentioned before, it isn't easy to find non-ECC memory servers from the usual vendors. Only their very low-end machines have it. Machines that aren't meant to do anything more than shoving packets around or other usage patterns where either silent data corruption can be tolerated (easy-to-replace appliances that don't process/store important data) or checksums are already a part of the job (network stuff like firewalls or routers).
I thought ECC events were triggered by environment, rather than hardware faults? Or you just figure some sticks are by chance more susceptible?
It isn't difficult to tell these two possibilities apart. Sometimes I get an ECC event on some server, and then it never happens again (or it happens in a different module), which doesn't warrant a replacement. Now, if the same module triggers another event, what's the chance of two "cosmic rays" hitting the same module twice and flipping a bit on it? It's better to just replace it (which is covered by warranty or maintenance contracts, so it costs us no additional charge).
So, yes, yield varies.
We have a simple policy: ECC memory is required to run our software in production. Failure to do so voids the warranty.
For desktop computers, Intel charges a premium on any ECC-capable gear (their Xeon line), so it's really only available in workstation class computers. Most AMD gear (AM2/3/3+ sockets, not A-series) can take ECC RAM, if there is BIOS support.
ECC RAM costs about 10-30% more per DIMM, but as memory is so incredibly cheap these days, it's probably the cheapest safety net you can buy.
The last two machines I built had bad modules that needed weeding out, and I follow anti-static precautions fairly carefully. I used to be a PC technician and I built probably over a hundred PCs in the 90s. Memory was never as fragile and fault-prone as it is these days.
Nevertheless, we would do our best to please a customer looking to host our software on an EC2 cluster, with the appropriate warnings. ;)
A bit of context: we sell a "real time" non-relational database (http://www.quasardb.net/). Our customers come to us for speed and reliability and therefore build dedicated farms to host our database.
How do you stack up against the most common open source NoSQL systems? Redis, Cassandra, Mongo, Couchbase? Is your db eventually consistent, or partitioned, or replicated, or what?
quasardb is a key/value store.
It is (a lot) faster in a multi-client context than the engines you listed and can handle entries of any size (provided you have enough space on the servers, of course!).
It's fully symmetric which means the load is equally distributed and replicated on all the nodes (no master node).
If you have more questions feel free to mail us (don't want to hijack this thread).
You mention EC2 in your blog post, but he asked the person requiring ECC memory or voiding the product warranty what _they'd_ do if the customer wants to run on EC2.
In fact, the GP probably used the EC2 part of your blog entry to come up with the question in the first place.
In my experience in more 'agile' firms - startups, web dev shops and so on - it would be very hard to make a scheme like this work well, because of all the grinding bureaucracy, fiddly spec-matching and endless manual testing required, as well as the importance of controlling - and deeply understanding - the whole stack. Nonetheless, for infrastructure projects like Redis, I can see value in having engineering effort put explicitly into making 'prettier crashes'.
Note the section on DRAM scrubbing, which I was reminded of from the original article's suggestion on having the kernel scan for memory errors. (I remember when Sun implemented scrubbing, I believe in response to a manufacturing issue that compromised the reliability of some DIMMs.)
Although our production code is written in C, I'm not particularly worried about detecting wild writes, because we use pointer checking algorithms to detect/prevent them in the compiler. (Of course, that could be buggy too...)
What I'm trying to catch are wild writes from other devices that have access to RAM. Anyway, this is far from production code so far, but hashing has already been very successful at keeping data structures on disk consistent (a la ZFS, git), so applying the same approach to memory seems like the next step.
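Roughly what I have in mind, as a minimal sketch (the node layout and FNV-1a hash are placeholders I made up; any decent hash would do):

    #include <stdint.h>
    #include <stddef.h>

    struct node {
        uint64_t checksum;       /* covers everything after this field */
        uint64_t key;
        uint64_t value;
    };

    static uint64_t fnv1a(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint64_t h = 0xcbf29ce484222325ULL;      /* FNV-1a offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 0x100000001b3ULL;               /* FNV-1a prime */
        }
        return h;
    }

    /* Seal a node after every legitimate update... */
    static void node_seal(struct node *n)
    {
        n->checksum = fnv1a(&n->key, sizeof(*n) - sizeof(n->checksum));
    }

    /* ...and verify it before trusting the contents (0 means corruption). */
    static int node_ok(const struct node *n)
    {
        return n->checksum == fnv1a(&n->key, sizeof(*n) - sizeof(n->checksum));
    }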
The speed hit is surprisingly low, 10-20%, and when you put it that way, it's like running your software on a 6 month old computer. So much of the safety stuff we refuse to do "for performance" would be like running on top-of-the-line hardware three years ago, but safely. That seems like a worthwhile trade to me...
P.s. Are people really not burning in their server hardware with memtest86? We run it for 7 days on all new hardware, and I figured that was pretty standard...
2) Even those that do run it typically run it for no more than 24 hours
3) Many people don't build their own hardware these days, it's a VPS or EC2
4) If you've selected ECC RAM then you know way more about memory failures than >99% of Redis users
Here are my additional two cents: at least on X86 systems, checking small memory regions without disturbing the CPU cache can be implemented using non-temporal writes, which force the CPU to write the data directly to memory instead of leaving it in the cache. The instruction required for this is called movntdq and is generated by the SSE2 intrinsic _mm_stream_si128().
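A minimal sketch of what this looks like (the helper name is mine, and alignment/size checks are omitted; the region must be 16-byte aligned):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* Fill a 16-byte-aligned region with a test pattern using non-temporal
       stores (movntdq), so the writes bypass the cache and go straight to
       DRAM. Hypothetical helper for illustration only. */
    static void stream_fill(void *region, size_t bytes, uint8_t pattern)
    {
        __m128i v = _mm_set1_epi8((char)pattern);
        __m128i *p = (__m128i *)region;
        for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
            _mm_stream_si128(&p[i], v);  /* cache-bypassing store */
        _mm_sfence();                    /* make the streaming stores visible */
    }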
Now, this may not be such a huge problem in practice because the OS is unlikely to move pages around unless it's forced to swap. But that depends on details of the OS paging algorithm and your server load.
It only does a few things, but it does them exceedingly well. Just like nginx, I know it will be fast and reliable, and it is this kind of crazed attention to detail that gets it there.
By the way, check the page first, give bloggers the traffic their content deserves!
Btw here the problem was mine, I was running the Sinatra app with "ruby app.rb", and Apache was mod_proxying to this running on port 4567.
By default mod_proxy will suspend the connection for 60 seconds, returning an error, if the proxied thing returns something wrong. Idiotic default that can be avoided just with:
ProxyPass / http://127.0.0.1:4567/ retry=0
Since you are running it as "ruby app.rb", I take it you aren't interested in doing an app server/web server/cache deployment. But if you aren't using thin, that's only a "gem install thin" away.
This post is fantastic.
Here is a variation which, unless I'm missing something, would be a little simpler still and require less full-memory loops:
1. Count #1's in memory (possibly mod N to avoid overflow).
2. Invert memory.
3. Count #0's in memory.
4. Invert memory.
I think this would catch the same errors (stuck-as-0 or stuck-as-1 bits).
One difficulty is that multiple errors could cancel each other out, at which point you can do things like add checkpoints in the aggregation, or track more signals such as number of 01's vs number of 10's. In the end, this is like an inversion-friendly CRC.
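To make the idea concrete, here's a rough toy of the count/invert/count loop (my own sketch, operating on 64-bit words and using the GCC/Clang popcount builtin):

    #include <stdint.h>
    #include <stddef.h>

    /* Steps 1-4 from the list above. On healthy memory the number of zeros
       counted after inversion equals the number of ones counted before it;
       a mismatch hints at a stuck-at-0 or stuck-at-1 bit. Multiple errors
       can still cancel out, as noted. */
    static int stuck_bit_check(uint64_t *mem, size_t words)
    {
        size_t ones = 0, zeros = 0;

        for (size_t i = 0; i < words; i++)            /* 1. count the 1s */
            ones += (size_t)__builtin_popcountll(mem[i]);

        for (size_t i = 0; i < words; i++)            /* 2. invert       */
            mem[i] = ~mem[i];

        for (size_t i = 0; i < words; i++)            /* 3. count the 0s */
            zeros += 64 - (size_t)__builtin_popcountll(mem[i]);

        for (size_t i = 0; i < words; i++)            /* 4. invert back  */
            mem[i] = ~mem[i];

        return ones == zeros;                         /* 0 means a mismatch */
    }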
A better solution to all the reliability problems is better quality hardware i.e. not X86. X86 has very few reliability features built in past ECC. If you look at UltraSparc based machines, they can predict failures and offline chunks of the hardware (CPUs, RAM regions, IO devices) so they can be replaced without disrupting the system.
Prevention is better than debugging :)
It's true that there are certain classes of errors that "safe" languages make less likely or impossible. I'm not convinced there are enough fewer of these to make up for the additional classes of errors introduced by such languages.
For example, you could write a secure Java app (let's assume this is possible), but if the JVM has a security flaw (and it did), it doesn't really matter, because the runtime is mandatory.
Over the years, I've become more and more distrustful of large, complicated systems designed "for my safety", and migrated towards tiny systems + formal verification/model checking/exhaustive symbolic testing.
100% certified and bug free (where a "bug" means "doesn't implement the spec") is where I think at least the safety- and security-critical part of the industry will be in 30 years. The tools available today are incredibly powerful, but systems designed over 5-10 years ago are unable to take advantage of them. There's no reason programs cannot be bug free with respect to their specifications today, given the tools we have at our disposal.
It would be great to formally verify the runtime and task infrastructure. In fact people have started to create (small, incomplete) Promela models of the message passing infrastructure. However, it's lower priority than figuring out what works from a pragmatic standpoint and implementing it at the moment.
Basically, making reliable software is hard, and changing the language doesn't change that. There are a lot of tools to make sure your C/C++ programs don't have obvious errors. The problem is the non-obvious errors, and those exist in all languages, in different forms.
Another way to put it: "You cannot reduce risk, you can only replace it with another".
I don't understand why you would say this. Is it not the case that using a language with automatic memory management (say, Python) is less risky than using one with manual memory management (say, C or C++)?
There are tradeoffs (e.g. performance, having less control over various things), but they are not tradeoffs between one risk and another risk, they are tradeoffs between risk and something else.
Sure, they don't solve the problem of magically and instantly conjuring up the program you want; nobody is claiming that. All these little things add up nevertheless.
"You cannot reduce risk, you can only replace it with another" -> I just ask you to think about it, try to have an open mind about what I may imply. I'm sorry for this mysterious answer but this is typically a topic over which we could talk past each other. The first time I was told you I had the same reaction until I had the epiphany about what it really means.
Even if that's true, it doesn't support your claim. The language still matters, even if it turns out that it's not the most important part.
Your second paragraph is on the border of being condescending without any actual argumentation.
See also: http://blogs.msdn.com/b/oldnewthing/archive/2005/04/12/40756...
So using a higher level or "safer" language isn't going to stop these kinds of problems.
edit: that doesn't detract from your point, however, that C is used nowadays on "robust systems"... in terms of popular robust kernels though you'll want to look at something like L4 or QNX Neutrino. There's a kernel that is actually formally verified (seL4) that was first written in Haskell, then verified, then translated to C (for speed).
I am interested to hear what you think is more robust than Linux, setting aside seL4. Do you think QNX Neutrino is more robust? If so, why? And what else?
I would expect vxWorks and other RTOSs to generally be less robust than Linux, despite typically going through various certifications.
Thus, it stands to reason, Linux is likely less stable than an embedded kernel without all that driver code.
I run a pre-emptive embedded kernel (QK) that's extremely tiny and, in fact, was validated by myself with KLEE (a symbolic checker) to exhaustively verify correctness. I'm certain it's more reliable than Linux, which carries no such guarantee (and is orders of magnitude larger -- even excluding driver code).
Bug rates correlate extremely closely with lines of code. All else being equal, a large system has more bugs, simply because it has more opportunity for them.
If you truly care about correctness, doing formal verification, model checking, etc. is the way to go, not "lots of people use it so it must be stable".
To benefit from formal verification, you have to design your code around that, and most systems today are not. It's hard to retrofit verification on top of a legacy codebase like Linux.
You say that you managed to "verify correctness" on QK, but clearly, that's not an accurate statement, since AFAIK seL4 is the only kernel that has ever been "proven correct" (and even for seL4, there are some gotchas there, AFAIK).
If you truly care about correctness, doing formal verification, model checking, etc. ...
The common perception is that model checking and formal verification are still just research areas that can't be used practically beyond toy problems. Again, seL4 is the only kernel I know of that has been "proven correct," and that has taken an insane amount of manpower that is not scalable to anything more complex than seL4. That seems to be evidence that there is something to the "common perception" I stated above.
lots of people use it so it must be stable
That's not really my argument. With Linux, there is an insane amount of testing going on all the time (I mean "informal" testing.. just people using it and reporting problems if they encounter any.. though there are also farms set up that test Linux, as well). Linux must be by far the most highly-tested software in history. On the other hand, you could use formal methods on some small kernel and maybe get some help (but again, far short of proving there are no bugs, AFAIK), but the level of testing will be many orders of magnitude less. Clearly, which one wins out depends on how good the actual state of the art in formal methods is, but my impression is that many orders of magnitude more testing will win out, unless the problem you are solving is very, very small.
It was not difficult at all to run QK and its supporting code through KLEE and exhaustively verify the properties of each function, thanks to a super-simple design and the many included assertions, preconditions, and postconditions, which KLEE helpfully proves are satisfied automatically. If I wanted a certified optimizing C compiler, I'd use CompCert to compile QK, which would give me a certified kernel all the way down to machine code. (I actually use Clang.)
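To give a flavour of what such a harness looks like (a made-up toy, not actual QK code; sched_pick_next is a placeholder I invented): mark the inputs symbolic, state the precondition with klee_assume, and KLEE either proves the assertions hold on every path or hands back a concrete counterexample.

    #include <klee/klee.h>
    #include <assert.h>

    /* Hypothetical kernel routine: return the highest-priority ready task. */
    static int sched_pick_next(unsigned ready_mask)
    {
        assert(ready_mask != 0);              /* precondition: someone is ready */
        int prio = 31 - __builtin_clz(ready_mask);
        assert(prio >= 0 && prio < 32);       /* postcondition: valid priority  */
        return prio;
    }

    int main(void)
    {
        unsigned ready;
        klee_make_symbolic(&ready, sizeof(ready), "ready");
        klee_assume(ready != 0);              /* constrain inputs to the precondition */
        sched_pick_next(ready);
        return 0;
    }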
I am familiar with the verified L4 kernel, and it is far more complex than the QK kernel I have verified. That people have already verified a kernel far more complex should be sufficient proof that a much simpler kernel can also have its correctness verified.
Stepping back, it's 2012: people should no longer be surprised when a small-but-meaningful codebase is certified correct. It's common enough now that you don't get published in a journal just because you did it.
In my research, I'm trying to write a "fancy" task scheduler at the user level, so that someone can get "fancy" scheduling policies without modifying the RTOS kernel. By "fancy" I mean fully preemptive and allowing "interesting" scheduling policies/synchronization protocols to be implemented. "Interesting" generally means multicore, in my particular research community.
If you don't mind me asking, what kind of projects do you/have you used QK for?
people should no longer be surprised when a small-but-meaningful codebase is certified correct
For the size of codebase you're talking about, I'm not really that surprised. But I still think that formal methods don't scale to multicore RTOSs, which is the area I study (I'm a CS grad student).
I have talked to people in the preempt_rt Linux community that believe Linux will come to dominate the RTOS market just like it has a lot of other stuff, and I think they have a compelling argument. Once preempt_rt is mature enough, it's hard for me to see any reason for going with something like QNX.
To correct myself: except that the Linux development process is not compatible with current certification standards. Despite that, it's possible that the Linux dev process is more mature and better "tracked", and just better, than what the certs actually require, so it may eventually be possible to certify it somehow. (Anyone with thoughts on this, please pipe up...) I get the impression that some preempt_rt people have looked into this.
So, I don't understand what you mean by Linux not being safe and reliable...
I even go further and say that Linux is more robust than many of these other OSes, since it is exposed to a wider range of environments and hardware combinations of varying quality and whatever bugs they trigger. If you reduce Linux to the kind of footprint that, for instance, VxWorks or QNX have, I don't believe it would be any less reliable.
But Linux isn't predictable. Many safety-critical systems only rely on the system not crashing or misbehaving, but many others rely on real-time characteristics. Those Linux doesn't have.
The point isn't about abstracting away the machine, but about reducing the amount of code that has no safety guarantees.
Having said that, I wouldn't downvote. The question is sparking a clarifying conversation, so it's arguably worthwhile even though the premise is wrong.
C has a powerful/unsafe feature in that it allows you to directly address memory and read and write data. From a high-level, there are potential problems that can happen when doing this: (1) You might overwrite data in a memory location that you were using for something else or (2) you can write data that, say, represents a string and then read it back and try to interpret it as data that represents a number.
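A toy illustration of both hazards (my own example, and deliberately broken code, this is exactly what you must avoid):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct { char name[8]; int balance; } account = { "alice", 100 };

        /* (1) Copying 12 bytes into an 8-byte field overruns name and
           silently clobbers balance, which sits right next to it in memory. */
        memcpy(account.name, "abcdefghijkl", 12);

        /* (2) Bytes that were written as characters are now read back and
           interpreted as a number. */
        printf("balance is now %d\n", account.balance);
        return 0;
    }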
You can easily avoid the above two problems by using a library, while still having the option to optimize the code if the need arises. Since a product like Redis needs to be very performant in both speed and memory usage, the option to optimize code can't be emphasized enough: it is a critical feature of Redis, and thus C is an excellent language to use for its development.
What is your actual question?
(edit: The edit link is no longer available on my other post. Here's the question I was responding to: "Perhaps using safer languages (and languages with better error reporting) would be a solution to [the kinds of problems mentioned in the article.]")
Edit: huh. 2 downvotes. Why not explain why you think I'm wrong, rather than just downvoting because you don't agree?
I think the title of the article could be more accurate, considering how much is devoted not to issues about software reliability per se, but to distinguishing between unreliable software and unreliable hardware. I think an implicit assumption in most discussions about software reliability is that the hardware has been verified.
I personally do not think that it is the responsibility of a database to perform diagnostics on its host system, although I can sympathize with the pragmatic requirement.
When I am determining the cause of a software failure or crash, the very first thing I always want to know is: is the problem reproducible? If not, the bug report is automatically classified as suspect. It's usually not feasible to investigate a failure that only happened once and cannot be reproduced. Ideally, the problem can be reproduced on two different machines.
What we're always looking for when investigating a bug are ways to increase our confidence that we know the situation (or class of situation) in which the bug arises. And one way to do this is to eliminate as many variables as possible. As a support specialist trying to fix a faulty computer or program, I followed the same course: isolate the cause by a process of elimination. When everything else has been eliminated, whatever you are left with is the cause.
I'm still all jonesed up for a good discussion about software reliability. antirez raised interesting questions about how to define software that is working properly or not. While I'm all for testing, there are ways to design and architect software that makes it more or less amenable to testing. Or more specifically, to make it easier or harder to provide full coverage.
I've always been intrigued by the idea that the most reliable software programs are usually compilers. I believe that is because computer languages are amongst the most carefully specified kind of program input. Whereas so many computer programs accept very poorly specified kinds of input, like user interface actions mixed with text and network traffic, which is at higher risk of having ambiguous elements. (For all their complexity, compilers have it easier in some regards: they have a very specific job to do, and they only run briefly in batch operations, producing a single output from a single input. Any data mutations originate from within the compiler itself, not from the inputs they are processing.)
In any case, I believe that the key to reliable programs depends upon a complete and unambiguous definition of any and all data types used by those programs, as well as complete and unambiguous definitions of the legitimate mutations that can be made to those data types. If we can guarantee that only valid data is provided to an operation, and guarantee that each such operation produces only legitimate data, then we reduce the chances of corrupting our data. (Transactional memory is such an awesome thing. I only wish it was available in C family languages.)
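A tiny sketch of what I mean by confining a data type behind operations that validate it (my own illustration, not a real design; the range type is made up):

    #include <assert.h>

    typedef struct { int lo, hi; } range_t;           /* invariant: lo <= hi */

    static int range_valid(const range_t *r) { return r->lo <= r->hi; }

    static range_t range_make(int lo, int hi)
    {
        assert(lo <= hi);                              /* reject invalid inputs    */
        return (range_t){ lo, hi };
    }

    static range_t range_widen(range_t r, int by)
    {
        assert(range_valid(&r) && by >= 0);            /* only valid data comes in */
        range_t out = { r.lo - by, r.hi + by };
        assert(range_valid(&out));                     /* only valid data goes out */
        return out;
    }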
One of my crazy ideas is that all programs should have a "pure" kernel with a single interface, either a text or binary language interface, and this kernel is the only part that can access user data. Any other tool has to be built on top of this. So this would include any application built with a database back-end.
I suppose that a lot of Hacker News readers, being web developers, already work on products featuring such partitioning. But for desktop software developers who work with their own in-memory data structures and their own disk file formats, it's not so common or self-evident. Then again, even programs that do rely on a dedicated external data store also keep a lot of other kinds of data around, which may not be true user data, but can still be corrupted and cause either crashes or program misbehaviour.
In any case, I suspect that this is going to be an inevitable side-effect of various security initiatives for desktop software, like Apple's XPC. The same techniques used to partition different parts of a program to restrict their access to different resources often lead to also partitioning operations on different kinds of data, including transient representations in the user interface.
Can a program like Redis be further decomposed into layers to handle tasks focussed on different kinds of data to achieve even better operational isolation, and thereby make it easier to find and fix bugs?
I don't think this is necessarily true; I used to maintain the Delphi compiler, and there were hundreds of bugs in the backlog that never really got looked at owing to workarounds, low impact and high cost of fixing.
What compilers usually have going for them is that they are batch processes rather than online processes, so they don't have time to build up crud in data structures; they have highly reproducible inputs - code that causes a crash normally causes a crash every run of the program, no weird mouse clicks or timing needed, and this code can usually be sent back to the vendor; and all customer code is effectively a unit test, so feedback from betas etc. is immediate and loud.
Very interesting idea... sounds like something someone would design an operating system around, it would require some sort of highly optimized external call thing.
But I am not sure I get it. What would stop an upper layer bug from simply passing bad instructions to the kernel?