
While I was at Google, someone asked one of the very early Googlers (I think it was Craig Silverstein, but it may've been Jeff Dean) what was the biggest mistake in their Google career, and they said "Not using ECC memory on early servers." If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.

It saved a few bucks during a period when Google's hardware costs were rising rapidly, but the knock-on effects on system design cost much more than that in lost engineering time. Data integrity is one engineering constraint that should be pushed as low down in the stack as is reasonably possible, because the higher up the stack you go, the more the potential causes of corrupted data multiply.




Google had done extensive studies[1]. There is roughly a 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about. However, if you are in a data center with 100K machines, each with 8 DIMMs, you are looking at about 6K machines experiencing RAM errors each day. Now, if data is being replicated, these errors can propagate corrupted data in unpredictable, unexplainable ways even when there are no bugs in your code! For example, you might find your logs containing bad line items which get aggregated into a report showing bizarre numbers, because 0x1 turned into 0x10000001. You can imagine that debugging this every day would be a huge nightmare, and developers would eventually end up inserting lots of asserts for data consistency all over the place. So ECC becomes important if you have a large-scale distributed system.

1: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf


That data set covers 2006-2009, and the RAM consisted of 1-4GB DDR2 DIMMs running at 400-800 MT/s. Back when 4GB was considered a beefy desktop, consumers could get away with a few bit-flips over the lifetime of the machine. Now my phone has that much RAM, and a beefy desktop has 16-32 GB of RAM running at 3 GT/s.

It's time we started trading some of the generous speed and capacity gains for error correction.


Note that the error rate is not proportional to the amount of RAM; it is proportional to the physical volume of the RAM chips. (The primary mechanism that causes errors is highly energetic particles hitting the chips, and the chance of that happening is proportional to the volume of the chips.) This means that the error rate per bit goes down as density goes up.


Cosmic rays causing the errors has got me thinking about whether the error rates vary with time.

Do you get more/less errors when it's day time (due to the Sun)? Does the season affect it (axial tilt means you're more/less "in view" of the galactic core)?


Wouldn't it go up if the density increases? If a particle hits the chip, there are more bits at the spot where it hits.

So while the chance of a hit is lower (per GB), if it does hit, its effect will be larger (more bits flipped).


It is an interesting question, but I think the parent poster did not mean density in the purely physical sense.

That is, more memory but less mass, which is not physical density. Also, I am not sure gamma rays need to hit the physical bits directly to mess things up. If other things can be hit too, then surface area might have a high correlation, but probably not.

I don't know what the answer is, but I would imagine the error rate would stay the same percentage, assuming the orientation is kept the same.

Of course, at the extreme macro scale (think the computers in Asimov's "The Last Question" [1]), density probably does play a role, as gravity starts to cause an enormous number of collisions. This actually happens in stars, and is why photons take a long time to escape a star; the same goes for the edges of black holes, where collisions happen extremely frequently.

[1]: http://multivax.com/last_question.html


An alpha particle, for instance, is at least an order of magnitude smaller than the smallest transistor. The maximum damage it can do is effectively 1 bit.


An alpha particle won't penetrate that far; it will be stopped at the building or at the enclosure. A piece of paper blocks it.

Beta and gamma are the ones that can do damage (not sure about beta), and gamma can pass through the entire chip, so it can hit multiple transistors, depending on the angle and the way they are laid out.


Actually, these high-energy particles tend to be on the order of the size of a proton or less, so make that 6 orders of magnitude smaller than the smallest transistor.


That's a 3% per DIMM per year chance of at least one error. Most memory faults are persistent and cause errors until the DIMM is replaced. Also, the error rate was only that low for the smallest DDR2 DIMMs.


I have hit soft errors on every desktop machine I've had that used ECC. Either I have bad luck, ECC causes the errors, or it's some third thing. I think ECC should be mandated for anything except toys and video players.


> I have hit soft errors in every desktop machine that used ECC.

Not sure if I should start getting nervous, or if your RAM just sucks ;) I get ECC errors only if I overclock too much, and I run the RAM overclocked all the time. It's actually one of the reasons I wanted ECC.


Different RAM, and more soft errors the older a system gets. Heh, the system should auto-overclock until it starts to get correctable soft errors, then back off. Or reduce the refresh rate until soft errors appear, then bump it up. Max speed at the lowest power.


How much more expensive is ECC RAM? I don't have it and I've never experienced obvious issues; if it's a lot more expensive, it's not really worth it for the once or twice a desktop will likely experience an actual issue.


It should be about 1/8th more, since it's just a 72-bit bus carrying 64 bits of data and 8 check bits. Or rather, your DIMM will have 9 chips instead of 8.
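Those 8 check bits per 64 data bits give you SECDED (single-error correction, double-error detection), typically via an extended Hamming(72,64) code. Here's a minimal, illustrative Python sketch of the idea; real memory controllers do this in hardware and the exact code may differ:

```python
# Illustrative SECDED over a 64-bit word: 64 data bits + 7 Hamming
# parity bits + 1 overall parity bit = 72 bits, matching the 72-bit bus.

def _is_pow2(n):
    return n & (n - 1) == 0

PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)

def encode(word):
    """Encode a 64-bit int into a 72-entry bit list (indices 0..71)."""
    bits = [0] * 72
    j = 0
    for pos in range(1, 72):            # data bits fill non-power-of-2 slots
        if not _is_pow2(pos):
            bits[pos] = (word >> j) & 1
            j += 1
    for p in PARITY_POSITIONS:          # each parity bit covers positions
        par = 0                         # whose index has bit p set
        for pos in range(1, 72):
            if pos & p and not _is_pow2(pos):
                par ^= bits[pos]
        bits[p] = par
    bits[0] = sum(bits[1:]) % 2         # overall parity makes total even
    return bits

def decode(bits):
    """Return the 64-bit word, correcting one flipped bit if present."""
    syndrome = 0
    for pos in range(1, 72):            # XOR of set-bit positions; zero
        if bits[pos]:                   # for a valid codeword
            syndrome ^= pos
    overall = sum(bits) % 2
    if overall == 1:
        bits[syndrome] ^= 1             # single-bit error at pos = syndrome
    elif syndrome != 0:
        raise ValueError("double-bit error detected, not correctable")
    word = 0
    j = 0
    for pos in range(1, 72):
        if not _is_pow2(pos):
            word |= bits[pos] << j
            j += 1
    return word
```

Flip any one of the 72 bits and `decode` silently repairs it; flip two and it raises instead of returning garbage, which is exactly the behavior ECC machines report as corrected vs. uncorrectable errors.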

The way they get you is that Intel will sell you a Xeon, which is the exact same die as an i5 in a different package, for more money.


Depends what you need - you can pick up older gen Xeon chips for cheap and the performance often isn't that much worse than modern consumer grade stuff. If you're looking to build a consumer-level NAS or home server, Avoton is pretty cheap and takes ECC RAM.


Unfortunately, Avoton might just suddenly stop working on you.

https://www.servethehome.com/intel-atom-c2000-series-bug-qui...



It should be 1/8th more, plus a bit for the scrubber. But in practice ECC memory is "enterprise priced", so it's more like double.


Should we do a Kickstarter to manufacture our own DIMMs? It's an easy design, and I hate donating to some corporation's gross margins. Maybe enough people feel the same.


It's significantly more expensive, usually around 30-100% more depending on capacity. IMO not worth it on a desktop, possibly worth it on a home server or a serious workstation. Plus your CPU and motherboard have to support it, which is a pain with Intel's consumer lineup.


Good thing Ryzen supports ECC out of the box. Just waiting on motherboard support for it.



I think I may go AMD (again) for this very reason.

(Generally, I don't think ECC actually does matter that much for us casual/home users, but I like to reward the people who actually do make it easy to "do the right thing". Same deal as only purchasing AMD graphics cards since 2005-ish(?).)


If you're not worried about certain chip features and power draw, last gen server equipment is very cheap.


Usually it's cheaper because of the server market's forced upgrade-cycle surplus. The problem is it's mostly buffered/registered ECC, which can't be used in desktop motherboards.


> There is roughly 3% chance of error in RAM per DIMM per year. […] with 100K machines each with 8 DIMM, you are looking at about 6K machines experiencing RAM errors each day.

Can you work out the math? I don't follow it. 3%×100K×8÷365=66 per day by my reasoning…


They've multiplied by 3 instead of 0.03.
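Spelled out, using the grandparent's own figures (and assuming errors are independent and spread evenly over the year):

```python
machines = 100_000
dimms_per_machine = 8
error_rate_per_dimm_per_year = 0.03  # "3% per DIMM per year" from the study

per_day = error_rate_per_dimm_per_year * machines * dimms_per_machine / 365
print(round(per_day))       # 66

# The "6K" figure falls out if 3 is used instead of 0.03:
mistaken = 3 * machines * dimms_per_machine / 365
print(round(mistaken))      # 6575
```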


> There is roughly 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about.

How do you make that leap?


It's an inappropriate leap. Consumers should have ECC memory too.

However, the consumer market long ago settled for ECC nowhere and cheap everywhere.

ECC hardware comes at a premium that can easily be +100%. You need support in the memory, the motherboard, and the CPU.

Given the price difference, personal computers will have to live with memory errors. People will not pay double for their computers, and manufacturers will not sacrifice their margins while they can segment the market and make a ton of money off ECC.


AMD has modestly priced hardware that supports ECC.


Was that the case before Ryzen? I know their new CPUs support ECC, but I'm not sure about earlier generations.


I think it was common for AM3 for example too.


ECC is officially supported by all AM2/3(+) CPUs and AFAIK all corresponding motherboards from ASUS. As in, you have it guaranteed on the spec sheet.

There are also reports of BIOS support in some boards which don't have ECC advertised. And you can try to enable it in the OS even without BIOS support, though some level of hardware support is still necessary. As Linux documentation puts it: "may cause unknown side effects" :)


It was technically supported by the hardware, but not by many motherboards and BIOSes.


Yep.


Bristol Ridge does support ECC, BTW, but one problem is that you can't use ECC with x16 chips (because ECC is 72 bits wide), so with 8GB of RAM and 8Gbit chips you have to choose between non-ECC/ECC single channel with x8 chips and non-ECC dual channel with x16 chips. 4Gbit chips don't have this problem, but they will become obsolete, especially once 18nm ramps up, though DRAM prices should decline when that happens...


What's the matter with x8/x16 chips and dual channel? I don't think it should matter.

Or do you mean that if you want exactly 8GB then it's hard to find a pair of 4GB DDR4 ECC modules? Well, just get 2x8GB if you are a performance nut.


Yes, what I am saying is that it is impossible with 8Gbit chips, but possible with 4Gbit.


I'd like to know this, too.

I am guessing it's because, if RAM errors increase linearly with the number of computers, then RAM errors become a greater and greater proportion of total errors. This assumes other kinds of errors don't scale linearly. Someone looking through logs is looking for errors; they'd like to find fixable logic errors, not inevitable RAM errors.


A cost/benefit analysis for a system performing non-critical operations would seem to favor non-ECC memory. I suspect this is the case for the majority of people who have computers for personal use, even without taking into account that they might not be aware such a thing exists. Although, I haven't compared ECC prices lately.


Your game machine can live without ECC.

Your NAS had better have it, though.


Probably assumptions about the uses of a PC. I'd imagine most of the bits are media-related.


Because the market.


This makes me wonder how banks deal with this issue.


> If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.

Details of this would be very interesting, but obviously I understand if you cannot provide such details due to NDAs, etc.

I mean, I can imagine a few mitigations (pervasive checksumming, etc), but ultimately there's very little you can actually do reliably if your memory is lying to you[1]. I can imagine that probabilistic programming would be an option, but it's hardly "mainstream" nor particularly performant :)

I'm also somewhat dismayed at the price premium that Intel are charging for basic ECC support. This is a case where AMD really is a no-brainer for commodity servers unless you're looking for single-CPU performance.

[1] Incidentally also true of humans.


You need ECC /and/ pervasive checksumming. There are too many stages of processing where errors can occur, for example disk controllers or networks. The TCP checksum is a bit of a joke at 16 bits (it will fail to detect 1 in 65536 random errors), and even the Ethernet CRC can fail; you need end-to-end checksums.

http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
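The end-to-end idea is cheap to apply at the application layer. A minimal sketch (the `frame`/`unframe` names are mine, not from any particular library), using Python's standard `zlib.crc32`:

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC-32 so corruption introduced at any hop is detectable
    end to end, regardless of whether TCP/Ethernet checksums caught it."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unframe(framed: bytes) -> bytes:
    """Verify and strip the trailing CRC-32, raising on a mismatch."""
    payload, crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("end-to-end CRC mismatch: payload corrupted")
    return payload
```

CRC-32 guards against random corruption only; for adversarial settings you would want a keyed hash (e.g. HMAC) instead.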


I did a bunch of protocol-level design in the '90s, and one of the handful of things that taught me was: _ALWAYS_ use at least a CRC with a standard polynomial. It's just not worth skipping. In the 2000s I relearned the lesson for data at rest (on disk, etc.). If nothing else, both of those will catch "bugs" rather than silently corrupting things and leading to mysteries long after the initial data was corrupted.

I just had this discussion (about why TCP's checksum was a huge mistake) a couple of days ago. That link is going to be useful next time it comes up.


Too many stages... for what? You haven't stated what the criteria for "recovery" (for lack of a better word) are. What is the (intrinsic) value of the data?

Personally, I'm a bit of a data hoarder, but honestly, if some proportion of that data were lost, it probably wouldn't affect my life substantially, even though it feels like it would be devastating.


CRC checksums can miss multi-bit errors, e.g. in runs of zeros (these can reset the polynomial computation): http://noahdavids.org/self_published/CRC_and_checksum.html

But a CRC is good for catching single-bit errors.


> ultimately there's very little you can actually do reliably if your memory is lying to you

1. Implement everything in terms of retry-able jobs; ensure that jobs fail when they hit checksum errors.

2. if you've got a bytecode-executing VM, extend it to compare its modules to stored checksums, just before it returns from them; and to throw an exception instead of returning if it finds a problem. (This is a lot like Microsoft's stack-integrity protection, but for notionally "read-only" sections rather than read-write sections.)

3. Treat all such checksum failures as a reason to immediately halt the hardware and schedule it for RAM replacement. Ensure that your job-system handles crashed nodes by rescheduling their jobs to other nodes. If possible, also undo the completion of any recently-completed jobs that ran on that node.

4. Run regular "memtest monkey" jobs on all nodes that attempt to trigger checksum failures. To get this to work well, either:

4a. ensure that jobs die often enough, and are scheduled onto nodes in random-enough orders, that no job ever "pins" a section of physical memory indefinitely;

4b. or, alternately, write your own kernel memory-page allocation strategy, to map physical memory pages at random instead of linearly. (Your TLBs will be very full!)

Mind you, steps 3 and 4 only matter to catch persistent bit-errors (i.e. failing RAM); one-time cosmic-ray errors can only really be caught by steps 1 and 2, and even then, only if they happen to affect memory that ends up checksummed.
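Steps 1 and 3 above can be sketched schematically. This is a toy model (the names and the node-as-callable structure are mine, purely for illustration): each "node" supplies its copy of the job input, the job fails fast on a checksum mismatch, and the scheduler reschedules elsewhere:

```python
import hashlib

class ChecksumError(Exception):
    pass

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def run_job(data: bytes, expected_checksum: str) -> int:
    # Step 1: the job verifies its input and fails when it hits a
    # checksum error, instead of computing on corrupted bytes.
    if sha256(data) != expected_checksum:
        raise ChecksumError("input corrupted on this node")
    return len(data)  # stand-in for the real computation

def schedule(job_input: bytes, expected_checksum: str, nodes):
    # Step 3: a checksum failure is treated as a bad node; the job is
    # rescheduled on the next node rather than retried in place.
    for node_read in nodes:  # each node returns its copy of the input
        try:
            return run_job(node_read(job_input), expected_checksum)
        except ChecksumError:
            continue  # flag this node for RAM replacement, move on
    raise RuntimeError("job failed on every node")
```

In a real system the "nodes" would be separate machines reading replicated data, and a failing node would be drained and scheduled for hardware replacement rather than just skipped.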


How do you calculate those checksums without relying on the memory?


The chances of the memory erroring in such a way that the checksum still matches become quite small.


You can't really, but you are now requiring the error to occur specifically in the memory containing your checksum, rather than anywhere in your data.


It's deeper than that. What are you calculating the checksum of? Is it corrupted already?

If you can't trust your RAM, you have no hard truth to rely on. It's only probabilistic programming or living with the errors.

(Although, rereading the GP, they seem to be talking about corrupted binaries. Yes, you can catch corrupted binaries, but only after they've corrupted some data.)


It's even worse than that: where's the code that's doing all the checksumming and checking of checksums? Presumably it came from memory at some point...

Maybe it was read fine from the binary the first time, but the second time...

At some point you just have to hope.


Pervasive checksumming is going to cost a lot of CPU and touch a lot of memory. The data could be right and the checksum wrong, too. ECC double-bit errors are recognized, and you can handle them however you'd like, including killing the affected process.


I agree, which is why I used the word "mitigation", as in: not a solution.

Probabilistic programming is a theoretical possibility, but not really practical.


it was indeed Craig


Given that cosmic radiation is one source of memory errors, shouldn't better computer cases reduce memory errors?

Basically a tin-foil (or lead-foil) hat over my computer?



