Redis crashes - a small rant about software reliability (antirez.com)
332 points by hnbascht on Nov 27, 2012 | 107 comments

His point about logging registers and stack is interesting. Many years ago I worked on some software that ran on Windows NT 4.0 and we had a weird crash from a customer who sent in a screen shot of a GPF like this: http://pisoft.ru/verstak/insider/cwfgpf1.gif

From it I was able to figure out what was wrong with the C++ program. Notice that the GPF lists the instructions at CS:EIP (the instruction pointer of the running program) and so it was possible by generating assembler output from the C++ program to identify the function/method being executed. From the registers it was possible to identify that one of the parameters was a null pointer (something like ECX being 00000000) and from that information work back up the code to figure out under what conditions that pointer could be null.

Just from that screenshot the bug was identified and fixed.

I remember generating .map files as part of the build process; they were invaluable in figuring out where Windows desktop programs were crashing. It was about 30 minutes of work that turned 3-hour debugging sessions into 10-minute ones from then on.

Unfortunately address space randomization techniques make this much harder.

Not necessarily. The screenshot indicates the bytes pointed to by the IP, so it would still be possible to find them in a binary you just built, and debug it from there.

Only if the code hasn't been relocated, and even then there can be a lot of code duplication, especially with C++ templates/inlines.

Great post, showing admirable dedication to software reliability and a solid understanding of memory issues.

One of the suggestions was that the kernel could do more. Solaris-based systems (illumos, SmartOS, OmniOS, etc.) do detect both correctable and uncorrectable memory issues. Errors may still cause a process to crash, but they also raise faults to notify system administrators what's happened. You don't have to guess whether you experienced a DIMM failure. After such errors, the OS then removes faulty pages from service. Of course, none of this has any performance impact until an error occurs, and then the impact is pretty minimal.

There's a fuller explanation here: https://blogs.oracle.com/relling/entry/analysis_of_memory_pa...

I don't think enough people appreciate just how awesome an OS Solaris was. I never had the opportunity to deploy it full-scale for any projects, but I lamented the loss of great potential when it "died."

It didn't die. It was forked by the community when Oracle close-sourced it. The community fork (called illumos) is being actively developed by multiple companies, which have done significant new feature work (e.g., http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/).

The first and only time I used Solaris, I tried to run our application and got the error "System out of colors" or some such. Swore then and there never to use it again if I could help it.

Thank you for the interesting link dap.

I take it you know about /var/log/mcelog ?

Linux kernels also do that memory thing (and have done it for years), just FYI. Either through MCE or EDAC. Not really that special.

Pro-tip: use ECC memory on servers. The end.

I find this idea of a lack of ECC memory on servers disturbing... This is the default on almost all rack mountable servers from the likes of HP or IBM. Of course, people use all kinds of sub-standard hardware for "servers" on the cheap, and they get what they pay for.

I haven't seen a server without ECC memory for years. I don't even consider running anything in production without ECC memory, let alone VM hypervisors. I find it pretty hard to believe that EC2 instances run on non-ECC memory hosts, risking serious data loss for their clients.

Memory errors can be catastrophic. Just imagine a single bit flip in some in-memory filesystem data structure: the OS just happily goes on corrupting your files, assuming everything's OK, until you notice it and half your data is already lost.

Been there (on a development box, but nevertheless).

I hope there is a way to get some official statement from Amazon, Linode, and other widely used VM providers about the kind of memory used in their servers. This would help users understand the real risks.

I think they don't mention it because they consider it obvious (I hope). However, with all the specially built servers that big providers use to reduce costs, there is some room for doubt.

I think Google may be able to get away with it. With enough checksums along the way, memory (and other hardware) errors can be detected in software pretty easily if you have independent machines checking the data and can afford the processing penalty.

Now, for virtualization I seriously doubt it. Not unless their instances run simultaneously on more than one machine to check for inconsistencies between them (something that the mainframes do since the dawn of time, but that I don't see as feasible in a distributed environment).

A single machine is never going to be completely reliable. At any time it can halt for a variety of reasons: power loss, hardware failure, disaster in the data center like flooding, etc.

Thus, a configuration that relies on the availability of a single machine is already risking serious outage or data loss by not being machine-redundant. Reliable systems require the coordination of many machines (at least two), and the replication of data across them if data's involved.

It is useful to have component-level redundancy (e.g., RAID or ECC memory), but in some environments it may be cheaper overall to have machine-level redundancy using inexpensive machines. It also only takes the failure of a single critical subsystem for a machine to suffer an outage. You might have ECC memory and RAID, but do you have only a single Ethernet card and power supply? Single machine availability is a "weakest link" phenomenon from its components.

I acknowledge that building software to run across a fleet of machines is more difficult than software that runs on only a single machine, but (1) the software development cost is largely a fixed cost, not a variable cost in the number of machines (2) building a distributed system is sometimes needed for scaling reasons anyway.

If you scale a single machine vertically (i.e., get a bigger box), its cost rises faster than its capabilities; so an efficient high-scale system typically also means running a fleet of cheap machines (scale horizontally). I think these effects contribute to the rise of commodity-server computing, and cost is a reason not to consider it disturbing.

In other words, crunch the numbers and see when it makes sense :-)

You are factually correct. However, availability isn't the problem ECC memory intends to solve.

The problem with memory errors is that they are silent. You won't notice them until something goes mysteriously wrong. And that can be anything, from an innocent invalid memory access to data corruption. This just can't be tolerated anywhere data is being processed, data you don't want to lose, that is...

RAID does nothing if the OS thinks that its in-memory filesystem datastructures are correct, and just goes ahead and updates the superblock with bad data, or writes over other files' pages. You just get a nice, redundant, corrupted filesystem. The same goes for multiple machines sharing data anywhere, filesystems or databases alike.

It's the error detection part that's important, not the correction part. And ECC main memory is just a part of the picture, you want to be notified of errors as soon as possible. And this is the important bit: "be notified". So, you want parity checks and CRCs on disk caches and data buses and everywhere else it's feasible. It's not an accident that server-class hardware costs more than your average PC.

The "correction" part is just a welcome by-product. I, for one, replace memory modules as soon as they trigger more than one ECC event. And this happens occasionally even with a universe of machines in the low dozens, with supposedly high-quality components. Now think what may be happening silently with all those borderline memory modules from anonymous manufacturers in China...

Besides, like I mentioned before, it isn't easy to find non-ECC memory servers from the usual vendors. Only their very low-end machines lack ECC. Machines that aren't meant to do anything more than shoving packets around, or other usage patterns where either silent data corruption can be tolerated (easy-to-replace appliances that don't process/store important data) or checksums are already a part of the job (network stuff like firewalls or routers).

I, for one, replace memory modules as soon as they trigger more than one ECC event.

I thought ECC events were triggered by environment, rather than hardware faults? Or you just figure some sticks are by chance more susceptible?

ECC events are triggered by any memory error, be it the occasional cosmic ray or a not so good memory module.

It isn't difficult to tell these two possibilities apart. Sometimes I get an ECC event on some server, and then it never happens again (or it happens in a different module), which doesn't warrant a replacement. Now, if the same module triggers another event, what's the chance of two "cosmic rays" hitting the same module twice and flipping a bit on it? It's better to just replace it (which is covered by warranty or maintenance contracts, so it costs us no additional charge).

Manufacturing memory from silicon wafers is similar to baking cookies. Some cookies are great, some turn out OK, and some are burnt depending on the characteristics of the ingredients, the oven, and the chaotic thermodynamic properties of the system.

So, yes, yield varies.

This is an interesting post, especially the part about memory testing.

We have a simple policy: ECC memory is required to run our software in production. Failure to do so voids the warranty.


For desktop computers, Intel charges a premium on any ECC-capable gear (their Xeon line), so it's really only available in workstation class computers. Most AMD gear (AM2/3/3+ sockets, not A-series) can take ECC RAM, if there is BIOS support.

ECC RAM costs about 10-30% more per DIMM, but as memory is so incredibly cheap these days, it's probably the cheapest safety net you can buy.

It's a pain in the neck. By not supporting ECC RAM, Intel is IMO indirectly responsible for millions of dollars worth of lost work from crashes on consumer hardware in workplaces worldwide. ECC RAM should be standard, given modern memory capacities.

The last two machines I built had bad modules that needed weeding out, and I follow anti-static precautions fairly carefully. I used to be a PC technician and I built probably over a hundred PCs in the 90s. Memory was never as fragile and fault-prone as it is these days.

What if your customers want to run on EC2 instances?

Running our software in virtual machines on production servers is not advised.

Nevertheless, we would do our best to please a customer looking to host our software on an EC2 cluster, with the appropriate warnings. ;)

A bit of context: we sell a "real time" non-relational database (http://www.quasardb.net/). Our customers come to us for speed and reliability and therefore build dedicated farms to host our database.

Wow, that product page is completely lacking any meaningful technical information about your product. :-D

How do you stack up against the most common open source NoSQL systems? Redis, Cassandra, Mongo, Couchbase? Is your db eventually consistent, or partitioned, or replicated, or what?

Thanks for the feedback, this is currently a landing page we give to our customers we meet face to face. We're working on something more consistent to answer questions like yours.

quasardb is a key/value store.

It is (a lot) faster in a multi-client context than the engines you listed and can handle entries of any size (provided you have enough space on the servers, of course!).

It's fully symmetric which means the load is equally distributed and replicated on all the nodes (no master node).

If you have more questions feel free to mail us (we don't want to hijack this thread).

Do you have a blog? Maybe you could do a write-up.. This is the kind of geek catnip that HN likes.

We do have a blog - the subject is vast. Do you have anything in particular you would like to read about?

Benchmarks against popular alternatives, explanations of how yours avoids the mistakes of MongoDB, your principles vs the other guys, your approach re: CAP, do you use MVCC and if not why is your solution better, the workloads that your solution excels at.

It is covered in the blog post. (This is not a critique, just a hint; I understand that reading a very long blog post is time consuming.)

I did read the whole post. It was very informative. I wasn't aware that EC2 did not have ECC RAM. My question was directed at shin_lao's policy about not providing a "warranty" for his customers running on non-ECC hardware.

Oh sorry I get it now...

Nope? You seem to have misread (well, or it's me of course).

You mention EC2 in your blog post, but he asked the person requiring ECC memory or voiding the product warranty what _they'd_ do if the customer wants to run on EC2.

In fact, the GP probably used the EC2 part of your blog entry to come up with the question in the first place.

Does Amazon or any other provider offer this warranty, I mean, ECC servers (or something similar)?

At IBM, we were very keen on what we called 'FFDC' - 'first-failure data capture'. This meant having enough layers of error-detection, ideally all the way down to the metal, so that failures could be detected cleanly and logged before (possibly) going down, allowing our devs to reproduce and fix customer bugs. Naturally it wasn't perfect, and it depended on lots of very tedious planning meetings, but on the stuff I worked with (storage devices mainly) it was remarkably effective.

In my experience in more 'agile' firms - startups, web dev shops and so on - it would be very hard to make a scheme like this work well, because of all the grinding bureaucracy, fiddly spec-matching and endless manual testing required, as well as the importance of controlling - and deeply understanding - the whole stack. Nonetheless, for infrastructure projects like Redis, I can see value in having engineering effort put explicitly into making 'prettier crashes'.

spec-matching is a specialty of good agile companies, but web-dev shops don't usually write their own db/server software.

Can't agree with this more. And he is just talking about logging crashes. One of the best debugging tools you have at your disposal in a large system (a lot of programmers contributing code -- bugs can be anywhere) is logging the same stack information under normal operation when strange circumstances arise, done quickly enough that it doesn't slow down the production software.

The slowest part of printing that information is resolving the stack addresses to symbol names in the binary. That part of the debugging output can be done "offline" in a helper viewer binary and does not need to happen in the critical path. We frequently output stack traces as strings of hex addresses, detectable by a regex, appended to a log message. The log viewer transforms this back into an actual symbolic stack trace at viewing time to avoid the hit of resolving all the symbols in the hot path.

It's crazy that an application should have to test memory. It should simply be handled by the HW and OS. E.g., some details about how Sun/Solaris deal with memory errors:


Note the section on DRAM scrubbing, which I was reminded of from the original article's suggestion on having the kernel scan for memory errors. (I remember when Sun implemented scrubbing, I believe in response to a manufacturing issue that compromised the reliability of some DIMMs.)

Although we use ECC in our servers already, I've recently been experimenting with hashing object contents in memory using a CityHash variant. The hash is checked when the object moves on chip (into cache), and re-computed before the object is stored back into RAM when it's been updated.

Although our production code is written in C, I'm not particularly worried about detecting wild writes, because we use pointer checking algorithms to detect/prevent them in the compiler. (Of course, that could be buggy too...)

What I'm trying to catch are wild writes from other devices that have access to RAM. Anyway, this is far from production code so far, but hashing has already been very successful at keeping data structures on disk consistent (a la ZFS, git), so applying the same approach to memory seems like the next step.

The speed hit is surprisingly low, 10-20%, and when you put it that way, it's like running your software on a 6 month old computer. So much of the safety stuff we refuse to do "for performance" would be like running on top-of-the-line hardware three years ago, but safely. That seems like a worthwhile trade to me...

P.s. Are people really not burning in their server hardware with memtest86? We run it for 7 days on all new hardware, and I figured that was pretty standard...

1) Yes, lots of people don't run memtest86 at all.

2) Even those that do run it typically run it for no more than 24 hours

3) Many people don't build their own hardware these days, its a VPS or EC2

4) If you've selected ECC RAM then you know way more about memory failures than >99% of Redis users

I totally like this post, because main-memory based software systems will become the future for all kinds of applications. Thus, handling errors on this side will become more important as well.

Here are my additional two cents: at least on x86 systems, checking small memory regions without disturbing the CPU cache can be implemented using non-temporal writes, which force the CPU to send the data directly to memory. The instruction required for this is called movntdq and is generated by the SSE2 intrinsic _mm_stream_si128().

In theory, there's nothing stopping the OS from remapping the pages of your address space to different physical RAM locations at any point during your test. So even if you have a reproducible bit error that caused the crash, there's a chance that the defect memory region is not actually touched during the memory test.

Now, this may not be such a huge problem in practice because the OS is unlikely to move pages around unless it's forced to swap. But that depends on details of the OS paging algorithm and your server load.

This kind of attention to detail is all too rare these days. I love Redis, because I have never, not once, ever had to wonder whether it was doing its job. It is like a constant, always running, always doing a good job and getting out of the way.

It only does a few things, but it does them exceedingly well. Just like nginx, I know it will be fast and reliable, and it is this kind of crazed attention to detail that gets it there.

Page is down. Here is a formatted copy: https://gist.github.com/4154289

Formatted more nicely: http://gist.io/4154289

By the way, check the page first, give bloggers the traffic their content deserves!

Is he not caching his blog with Redis?

The blog uses Redis as primary data store, unfortunately there is Ruby between the user and the DB ;-)

Btw here the problem was mine, I was running the Sinatra app with "ruby app.rb", and Apache was mod_proxying to this running on port 4567.

By default mod_proxy will suspend the connection for 60 seconds with an error if the proxied thing returns something wrong. Idiotic default that can be avoided just with:

    ProxyPass / http://localhost:4567/ retry=0
See "retry=0".

Sorry, the Sinatra based site is deployed with "ruby app.rb". Probably not enough...

> Sorry, the Sinatra based site is deployed with "ruby app.rb". Probably not enough..

Since you are running it as "ruby app.rb", I take it you aren't interested in doing an app server/web server/cache deployment. But if you aren't using thin, that's only a "gem install thin" away.

Use Redis, man; I heard it's really good.

The irony is too thick to read through.

And people wonder why I recommend redis. Having run redis for over 1.5 years on production systems as a heavy cache, a named queue and memoization tool (on the same machine), redis has never once failed me. It's clear with antirez's blog post, his attention to detail.

This post is fantastic.

The memory check algorithm is a nice solution to the challenges he presents - easy to understand and effective.

Here is a variation which, unless I'm missing something, would be a little simpler still and require less full-memory loops:

1. Count #1's in memory (possibly mod N to avoid overflow).
2. Invert memory.
3. Count #0's in memory.
4. Invert memory.

I think this would catch the same errors (stuck-as-0 or stuck-as-1 bits).

One difficulty is that multiple errors could cancel each other out, at which point you can do things like add checkpoints in the aggregation, or track more signals such as number of 01's vs number of 10's. In the end, this is like an inversion-friendly CRC.

Perhaps using safer languages (and languages with better error reporting) would be a solution to these kinds of problems.

Much of the article is dedicated to hardware-level memory problems. Most of the 'safer' languages I know about would not really help you in those situations. Maybe there are some odd mainframe languages that read and write everything twice or something like that, but most 'safer' languages I know about are oriented towards preventing programmer errors. What antirez is saying is that, yes, they get some of those, but because the system is so widely used they also get many others that are genuine memory failures, and those cost a lot of time to track down.

Not really. We get all the same sorts of errors in our very high level C#/Asp.Net/VMware deployments and it's a shit load harder to debug with all the extra baggage that a VM and hypervisor throw on top as well...

A better solution to all the reliability problems is better quality hardware i.e. not X86. X86 has very few reliability features built in past ECC. If you look at UltraSparc based machines, they can predict failures and offline chunks of the hardware (CPUs, RAM regions, IO devices) so they can be replaced without disrupting the system.

Prevention is better than debugging :)

I was thinking more of something like Rust, where only a small subset of your code would be poking at memory manually. Then when you do get a mysterious crash, you only need to look at the unsafe portions of your application (or perhaps at the compiler).

Maybe this was just the bias of the post, but it sounded from the post that most time is spent ruling out actual memory errors. Programming in a higher-level language doesn't make such errors less likely or easier to identify.

It's true that there are certain classes of errors that "safe" languages make less likely or impossible. I'm not convinced there are enough fewer of these to make up for the additional classes of errors introduced by such languages.

What additional classes of errors does e.g. Rust introduce?

I assume he's talking about errors in the runtime itself.

For example, you could write a secure Java app (let's assume this is possible), but if the JVM has a security flaw (and it did), it doesn't really matter, because the runtime is mandatory.

Over the years, I've become more and more distrustful of large, complicated systems designed "for my safety", and migrated towards tiny systems + formal verification/model checking/exhaustive symbolic testing.

100% certified and bug free (where a bug means "doesn't implement the spec") is where I think at least the safety- and security-critical part of the industry will be in 30 years. The tools available today are incredibly powerful, but systems designed over 5-10 years ago are unable to take advantage of them. There's no reason programs cannot be bug free with respect to their specifications today, given the tools we have at our disposal.

Your point is definitely valid. But Rust in particular does not have a large, complicated runtime. It's pretty small (it's just thread support, basically), and it's going to get smaller over time.

It would be great to formally verify the runtime and task infrastructure. In fact people have started to create (small, incomplete) Promela models of the message passing infrastructure. However, it's lower priority than figuring out what works from a pragmatic standpoint and implementing it at the moment.

Rust isn't production ready.

I never claimed it was. I'm just thinking that perhaps the approach it (and Haskell, for that matter) takes is better.

I think you're falling in to the "silver bullet" trap.


Basically, making reliable software is hard. Changing the language doesn't bring anything. There are a lot of tools to make sure your C/C++ programs don't have obvious errors. The problem is non-obvious errors, and those exist in all languages, in different forms.

Another way to put it: "You cannot reduce risk, you can only replace it with another".

I agree that nothing solves all possible problems. But it's a huge jump from "there is no silver bullet" to "changing the language doesn't bring ANYTHING". Likely, it is quite a bit more subtle in practice. Preventing "obvious" errors (which really aren't always that obvious) that safe languages could minimize is already a great start, as many systems are rife with them.

> You cannot reduce risk, you can only replace it with another".

I don't understand why you would say this. Is it not the case that using a language with automatic memory management (say, Python) is less risky than using one with manual memory manamgement (say, C or C++)?

There are tradeoffs (e.g. performance, having less control over various things), but they are not tradeoffs between one risk and another risk, they are tradeoffs between risk and something else.

This is obviously not true. Language matters. Are you as productive in BrainFuck as in C++? I don't think so. C/C++ isn't the end of all progress in language design. Modern languages eliminate or vastly reduce whole classes of concerns like memory management, wild pointers, null pointers, incomprehensible template errors, etc. Post hoc static analysis can patch up perhaps 30% of these problems, but it is much less effective than a language designed to make these errors impossible by construction. Modern languages also make lots of stuff easier through a good standard library, language features like closures & type inference, a REPL, a good IDE (the feasibility of which is also a function of language design), etc.

Sure, they don't solve the problem of magically and instantly conjuring up the program you want; nobody is claiming that. All these little things add up nevertheless.

Please read the wikipedia entry which addresses your specific concern regarding programming languages.

I was responding to this: "Changing the language doesn't bring anything." and this: "You cannot reduce risk, you can only replace it with another", not to the Wikipedia entry. The Wikipedia entry is very reasonable, and does not make the same outrageous claims.

"Changing the language doesn't bring anything." -> This is because the largest contributor to code quality is the coder. C is already a "safe" language as opposed to assembly or even writing the opcodes yourself.

"You cannot reduce risk, you can only replace it with another" -> I just ask you to think about it, and try to have an open mind about what I may imply. I'm sorry for this mysterious answer, but this is typically a topic over which we could talk past each other. The first time I was told this I had the same reaction, until I had the epiphany about what it really means.

"This is because the largest contributor to code quality is the coder"

Even if that's true, it doesn't support your claim. The language still matters, even if it turns out that it's not the most important part.

Then we can only agree that we disagree. I think the language matters very little, you think the opposite.

Your first paragraph is repeating what you already said, which is exactly what I responded to. To summarize it again: that C is high level or safe compared to assembler does not mean that other languages aren't significantly higher level or safer than C.

Your second paragraph is on the border of being condescending without any actual argumentation.

I think what you're missing is that antirez has gotten his C codebase perfect enough that hardware failures are the cause of most of his crash reports.

See also: http://blogs.msdn.com/b/oldnewthing/archive/2005/04/12/40756...

Um, no. C is a perfectly valid language and some of the best, most robust systems in the world are written in it (Linux, Git, etc.) Some languages are even built to run atop C (Cython.) Even the JVM deals with pointers, memory allocation issues, and such so you don't have to but it's still there!

So using a higher level or "safer" language isn't going to stop these kinds of problems.

Sorry if I'm breaking your bubble, but Linux and git are not "the best and most robust systems in the world". If they were, the state of the art of safe and reliable software systems would be quite pitiful.

edit: that doesn't detract from your point, however, that C is used nowadays on "robust systems"... in terms of popular robust kernels, though, you'll want to look at something like L4 or QNX Neutrino. There's a kernel that is actually formally verified (seL4) that was first written in Haskell, then verified, then translated to C (for speed).

Because it's super widely deployed and has a very mature development process, I would expect the Linux kernel to be among the most robust software in the world.

I am interested to hear what you think is more robust than Linux, setting aside seL4. Do you think QNX Neutrino is more robust? If so, why? And what else?

I would expect vxWorks and other RTOSs to generally be less robust than Linux, despite typically going through various certifications.

The major source of errors in a kernel is device drivers, and Linux typically is running many more drivers than the embedded kernels you mentioned. Look at any Linux point release: the majority of churn is in driver code, to fix bugs.

Thus, it stands to reason, Linux is likely less stable than an embedded kernel without all that driver code.

I run a pre-emptive embedded kernel (QK) that's extremely tiny and, in fact, was validated by myself with KLEE (a symbolic checker) to exhaustively verify correctness. I'm certain it's more reliable than Linux, which carries no such guarantee (and is orders of magnitude larger -- even excluding driver code).

Bug rates correlate extremely close with lines of code. All else being equal, a large system has more bugs, simply because it has more opportunity for them.

If you truly care about correctness, doing formal verification, model checking, etc. is the way to go, not "lots of people use it so it must be stable".

To benefit from formal verification, you have to design your code around that, and most systems today are not. It's hard to retrofit verification on top of a legacy codebase like Linux.

I don't really understand your driver argument, because when deploying Linux on an embedded system, you would only use the drivers you need... so you'd have the same amount of driver code as with any other kernel.

You say that you managed to "verify correctness" on QK, but clearly, that's not an accurate statement, since AFAIK seL4 is the only kernel that has ever been "proven correct" (and even for seL4, there are some gotchas there, AFAIK).

If you truly care about correctness, doing formal verification, model checking, etc. ...

The common perception is that model checking and formal verification are still just research areas that can't be used practically beyond toy problems. Again, seL4 is the only kernel I know of that has been "proven correct," and that has taken an insane amount of manpower that is not scalable to anything more complex than seL4. That seems to be evidence that there is something to the "common perception" I stated above.

lots of people use it so it must be stable

That's not really my argument. With Linux, there is an insane amount of testing going on all the time (I mean "informal" testing.. just people using it and reporting problems if they encounter any.. though there are also farms set up that test Linux, as well). Linux must be by far the most highly-tested software in history. On the other hand, you could use formal methods on some small kernel and maybe get some help (but again, far short of proving there are no bugs, AFAIK), but the level of testing will be many orders of magnitude less. Clearly, which one wins out depends on how good the actual state of the art in formal methods is, but my impression is that many orders of magnitude more testing will win out, unless the problem you are solving is very, very small.

The entire QK[1] kernel, including all library code, is less than two thousand lines of C (and most of the library code isn't even used by the kernel itself, which is just 234 LOC). QK is a single-processor kernel (although still pre-emptive) and very well written and tested (in, literally, hundreds of millions of devices).

It was not difficult at all to run QK and its supporting code through KLEE and exhaustively verify the properties of each function, thanks to a super-simple design and the many included assertions, preconditions, and postconditions, which KLEE helpfully proves are satisfied automatically. If I wanted a certified optimizing C compiler, I'd use CompCert[2] to compile QK, which would give me a certified kernel all the way to machine code. (I actually use Clang.)

I am familiar with the verified L4 kernel, and it is far more complex than the QK kernel I have verified. That people have already verified a kernel far more complex should be sufficient proof that a much simpler kernel can also have its correctness verified.

Stepping back, it's 2012: people should no longer be surprised when a small-but-meaningful codebase is certified correct. It's common enough now that you don't get published in a journal just because you did it.

[1] http://www.state-machine.com/qp/index.php

[2] http://compcert.inria.fr/

By the way, I am going to look into actually applying KLEE to some of my own code. I had heard of it before, but thanks for bringing it up, I hadn't really given it much thought until now.

In my research, I'm trying to write a "fancy" task scheduler at the user level, so that someone can get "fancy" scheduling policies without modifying the RTOS kernel. By "fancy" I mean fully preemptive and allowing "interesting" scheduling policies/synchronization protocols to be implemented. "Interesting" generally means multicore, in my particular research community.

If you don't mind me asking, what kind of projects do you/have you used QK for?

Interesting. I had never heard of QK, and I'm slightly surprised about that.

people should no longer be surprised when a small-but-meaningful codebase is certified correct

For the size of codebase you're talking about, I'm not really that surprised. But I still think that formal methods don't scale to multicore RTOSs, which is the area I study (I'm a CS grad student).

So I think it'd be fair to say that Linux does OK in terms of robustness per line of code, or robustness per device and real-world interface; however, in terms of absolute numbers, a smaller system is almost always going to do better.

I was going to mention safety certification[1], but if you're saying we should disregard those, then I'm not sure what you're looking for. QNX was built with real-time and safety-critical constraints in mind. Linux was not. I'm not saying QNX is generally better than Linux... just that if I want to browse the web at home I'd use Linux, and if I want a kernel to control the car I drive I'd choose QNX.

[1] http://www.qnx.com/products/neutrino-rtos/safe-kernel.html

AFAIK, the kind of safety certifications applied to RTOSs are all about development process, and have nothing to do with actually proving correctness in a "mathematical" sense. I know very little about IEC 61508, but for DO-178B/C, it is all about development process. No certification agency wants to claim that software is actually correct, because they can't actually show that, and then they'd be blamed if something bad happened.

I have talked to people in the preempt_rt Linux community that believe Linux will come to dominate the RTOS market just like it has a lot of other stuff, and I think they have a compelling argument. Once preempt_rt is mature enough, it's hard for me to see any reason for going with something like QNX.

> Once preempt_rt is mature enough, it's hard for me to see any reason for going with something like QNX

To correct myself: except that the Linux development process is not compatible with current certification standards. Despite that, it's possible that the Linux dev process is more mature and better "tracked", and just better, than what the certs actually require, so it may eventually be possible to certify it somehow. (Anyone with thoughts on this, please pipe up...) I get the impression that some preempt_rt people have looked into this.

I don't even remember the last time I saw a Linux server crash... at least I can't think of any occasion in the last 10 years or so that wasn't directly related to some hardware failure (firmware bug or actual dead hardware).

So, I don't understand what you mean by Linux not being safe and reliable...

I'm not saying it is not reliable, just that it is not robust relative to safety-critical software. It can be reliably used as a server without a problem, but whether or not it can be used in a manned spacecraft, for example, is another question of robustness entirely.

That has little to do with robustness. Linux may very well be as robust or even more robust than any OS considered for those purposes, but the question isn't robustness, it's predictability.

I'd even go further and say that Linux is more robust than many of these other OSes, since it is exposed to a wider range of environments and hardware combinations of varying quality and whatever bugs they trigger. If you reduced Linux to the kind of footprint that, for instance, VxWorks or QNX have, I don't believe it would be any less reliable.

But Linux isn't predictable. Many safety-critical systems only rely on the system not crashing or misbehaving, but many others rely on real-time characteristics. Those Linux doesn't have.

Putting quotes around "safer" doesn't make it any harder to make mistakes in C.

The point isn't about abstracting away the machine, but about reducing the amount of code that has no safety guarantees.

I'm not sure why you are being downvoted, as it is still an open question whether or not safer languages make it easier to write safer programs. So your comment may be a valid one.

They're being downvoted because they're wrong - the stuff talked about here will affect any kind of program at the machine level.

Having said that, I wouldn't downvote. The question is sparking a clarifying conversation, so it's arguably worthwhile even though the premise is wrong.

(Forgive me if I've got some technical details wrong and I'm not a C programmer, but I do feel like this is a valid question and isn't trying to tweak the noses of C programmers. C programmers please feel free to add/correct anything I'm missing.)

C has a powerful/unsafe feature in that it allows you to directly address memory and read and write data. From a high-level, there are potential problems that can happen when doing this: (1) You might overwrite data in a memory location that you were using for something else or (2) you can write data that, say, represents a string and then read it back and try to interpret it as data that represents a number.

You can easily avoid the above two problems by using a library, while still having the option to optimize the code if the need arises. Since a product like Redis needs to perform well in both speed and memory usage, the option to optimize code can't be emphasized enough: it is a critical feature of Redis, and thus C is an excellent language to use for its development.

*I do feel like this is a valid question*

What is your actual question?

The "nested" formatting of this page has vertically separated my post from the post I was responding to. I'll edit my post and quote the question I'm answering for clarity.

(edit: The edit link is no longer available on my other post. Here's the question I was responding to: "Perhaps using safer languages (and languages with better error reporting) would be a solution to [the kinds of problems mentioned in the article.]")

Well, not to the memory problems. But I do think modern systems languages should be better about printing out stuff in the case of segfaults, for example. Not sure if Go and Rust do this stuff.

Such languages also allow you to write code much faster which can lead to more mistakes.

Edit: huh. 2 downvotes. Why not explain why you think I'm wrong, rather than just downvoting because you don't agree?

You get downvotes because your reply contradicts itself. A safer language results in more mistakes being found at compilation time, so it would result in fewer mistakes in the product, not more. Also, I would expect that having to specify annotations for your code to be verified (i.e. correct type specifications, pre/postconditions, contracts, carefully specifying program input/output through parsers, etc.) results in a slightly slower pace of development, not faster. But I suppose if you make up more than that time in debugging and troubleshooting, or in explaining awkward bugs to annoyed customers, it's still a win.

Compilers can't detect logic errors which tend to be more common with faster development.

Unless your language is dependently typed.

Even if true, that's beside the point. How did 'faster development' get into your reasoning at all?

There is an approach to hard real-time software where antirez's idea of a memory checker is actually implemented.

This post reminded me of my time as a consulting systems support specialist. Lots of weird problems turned out to be bad hardware. Usually memory or disk, sometimes bad logic boards. For end users, this would often lead to complete freezing of the computer, so it was less likely to be blamed on broken software, but there were still many times it was hard to be sure. Desktop OS software can flake out in strange ways due to memory problems. I used to run a lot of memory tests as a matter of course.

I think the title of the article could be more accurate, considering how much is devoted not to issues about software reliability per se, but to distinguishing between unreliable software and unreliable hardware. I think an implicit assumption in most discussions about software reliability is that the hardware has been verified.

I personally do not think that it is the responsibility of a database to perform diagnostics on its host system, although I can sympathize with the pragmatic requirement.

When I am determining the cause of a software failure or crash, the very first thing I always want to know is: is the problem reproducible? If not, the bug report is automatically classified as suspect. It's usually not feasible to investigate a failure that only happened once and cannot be reproduced. Ideally, the problem can be reproduced on two different machines.

What we're always looking for when investigating a bug are ways to increase our confidence that we know the situation (or class of situation) in which the bug arises. And one way to do this is to eliminate as many variables as possible. As a support specialist trying to fix a faulty computer or program, I followed the same course: isolate the cause by a process of elimination. When everything else has been eliminated, whatever you are left with is the cause.

I'm still all jonesed up for a good discussion about software reliability. antirez raised interesting questions about how to define software that is working properly or not. While I'm all for testing, there are ways to design and architect software that makes it more or less amenable to testing. Or more specifically, to make it easier or harder to provide full coverage.

I've always been intrigued by the idea that the most reliable software programs are usually compilers. I believe that is because computer languages are amongst the most carefully specified kind of program input. Whereas so many computer programs accept very poorly specified kinds of input, like user interface actions mixed with text and network traffic, which is at higher risk of having ambiguous elements. (For all their complexity, compilers have it easier in some regards: they have a very specific job to do, and they only run briefly in batch operations, producing a single output from a single input. Any data mutations originate from within the compiler itself, not from the inputs they are processing.)

In any case, I believe that the key to reliable programs depends upon a complete and unambiguous definition of any and all data types used by those programs, as well as complete and unambiguous definitions of the legitimate mutations that can be made to those data types. If we can guarantee that only valid data is provided to an operation, and guarantee that each such operation produces only legitimate data, then we reduce the chances of corrupting our data. (Transactional memory is such an awesome thing. I only wish it was available in C family languages.)

One of my crazy ideas is that all programs should have a "pure" kernel with a single interface, either a text or binary language interface, and this kernel is the only part that can access user data. Any other tool has to be built on top of this. So this would include any application built with a database back-end.

I suppose that a lot of Hacker News readers, being web developers, already work on products featuring such partitioning. But for desktop software developers who work with their own in-memory data structures and their own disk file formats, it's not so common or self-evident. Then again, even programs that do rely on a dedicated external data store also keep a lot of other kinds of data around, which may not be true user data, but can still be corrupted and cause either crashes or program misbehaviour.

In any case, I suspect that this is going to be an inevitable side-effect of various security initiatives for desktop software, like Apple's XPC. The same techniques used to partition different parts of a program to restrict their access to different resources often lead to also partitioning operations on different kinds of data, including transient representations in the user interface.

Can a program like Redis be further decomposed into layers to handle tasks focussed on different kinds of data to achieve even better operational isolation, and thereby make it easier to find and fix bugs?

I've always been intrigued by the idea that the most reliable software programs are usually compilers.

I don't think this is necessarily true; I used to maintain the Delphi compiler, and there were hundreds of bugs in the backlog that never really got looked at owing to workarounds, low impact and high cost of fixing.

What compilers usually have going for them is that they are batch processes rather than online processes, so they don't have time to build up crud in data structures; they have highly reproducible inputs - code that causes a crash normally causes a crash every run of the program, no weird mouse clicks or timing needed, and this code can usually be sent back to the vendor; and all customer code is effectively a unit test, so feedback from betas etc. is immediate and loud.

>All programs should have a "pure" kernel with a single interface

Very interesting idea... sounds like something someone would design an operating system around, it would require some sort of highly optimized external call thing.

But I am not sure I get it. What would stop an upper layer bug from simply passing bad instructions to the kernel?
