Why use ECC? (danluu.com)
188 points by benkuhn on Nov 27, 2015 | 95 comments



I've had ECC on my workstations since 2006, when Intel forced it on DP systems via FB-DIMMs. Given that I've had a good number of correctable memory errors over the past decade, I don't feel I can go back to not having it; it wouldn't make any sense when the cost is so low. It was also the case that I couldn't get non-ECC 8GB and 16GB DIMMs when I built some of my systems, so I had to use it anyway.

Do people need it? Nah, probably not for systems that can handle crashing, but you'd be nuts not to use it in servers or systems running long jobs - it's a single insurance payment on your system that protects against a small but real risk.

Sadly there are very few studies into it showing that modern DIMMs still get correctable errors. Manufacturing processes are much better, but they haven't eliminated the need for it.

I do enjoy the "why would I need it, I've never had memory errors" attitude, though, when people likely have no idea why their application or OS crashed. And the accounts of people who eventually diagnose memory errors after weeks of random crashes, which would have been reported immediately if they had ECC.


This is a big dilemma I have. I'm trying to build a workstation similar to Nvidia's reference design for deep learning:

https://developer.nvidia.com/devbox

I will be doing deep learning and other ML GPU-powered tasks. Plus some long running high-memory I/O intensive tasks.

Note that Nvidia's build does not employ ECC RAM. And it's quite an expensive machine. Mine will be only a fraction of the cost ($4.5k), with just one Titan X. I could afford a Xeon, but that comes at the cost of buying slower hardware. What should I do?

Intel's segmentation of the market, limiting the amount of RAM you can use in regular CPUs and removing support for ECC, sucks.


They likely don't use ECC RAM because it's a dev box targeted at development and testing... i.e. it's not going to need to be up and running for long stretches of time, and/or data corruption is acceptable.

Otherwise, their reference system is no different from a "high end" gaming rig (ie. not a "server").


But some deep learning tasks can easily take 4 or 5 days in a humble setup with a GPU or two. So wouldn't it be advisable to run ECC?


Yes, you are correct. I would use ECC in that situation if it were my system.

The point I was making about the reference Nvidia setup was that it's targeted at development, which means starting and stopping the system often, etc. So ECC wouldn't be of much use in that sense, and it would only serve to make Nvidia's reference system more expensive. I don't think it's a sign they don't believe it should be used.


Your GPU doesn't have ECC, so the point is moot.

Yes, there is a risk, but maybe not enough to jeopardize the test.


NNs are adaptive so they should be able to tolerate the odd flip in the weight matrices during training.


Do Titans have ECC? If the bulk of the calculation will be done on the GPU, adding ECC to the main board may not matter.


That's a good point. AFAIK they don't. But Teslas and Quadros do.


AMD has much cheaper points of entry to ECC.


Yeah, I was noticing this. AMD is much more ECC-friendly, whereas Intel segments its markets hard by arbitrarily disallowing ECC platform use on common consumer (read: not-Xeon) CPUs. Unfortunately AMD is still quite a bit slower than Intel these days, but if you are doing work that is GPU- rather than CPU-intensive it might not matter.


Even then, not all MBs support it... Most of the higher end Asus boards do, with AMD CPUs. The difference in power usage is pretty significant... Going from an FX-8350 to an i7-4790K was a significant power/heat improvement, for a faster CPU. The i3-5010U in my htpc is pretty awesome for what it does as well.

It really depends on the tasks whether you NEED it... I wouldn't do a build-your-own NAS without it ever again... but I would suggest that a lot of workloads don't need it.

That said, the additional cost should be nominal at this point, and I don't quite get why it isn't standard. It's like the coprocessors of the early '90s, which eventually became a critical integrated piece.


> disallowing ECC platform use on common consumer (read: not-Xeon) CPUs

this hasn't really been true since Haswell. Now even low-end Celerons and Pentiums support it.


One may suspect that some Celeron/Pentium/i3 SKUs support ECC just so that Intel doesn't have to manage low-volume very-low-end E3 SKUs. And then you run into the fact that the cheapest server motherboard is double the price of a cheap desktop one.


From what I can tell, some Celeron/Pentium/i3 SKUs support it because AMD was killing Intel in certain NAS applications thanks to the fact that their low-end chips had ECC.


Core i3's have ECC support.


Yes, but many of the consumer-grade motherboards do not. (You have to step up to a Xeon in most cases.)


Low end server/workstation motherboards are quite cheap and plentiful nowadays. Even some of the low-end Pentiums have ECC support so there is little reason not to.


Any examples of a cheap Intel motherboard that supports ECC?


http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DE...

There seems to be plenty of options in the sub-$200 range.


I don't believe the Core i3's have ECC support. See [1]. My understanding is that the i3's (and i5's and i7's) differ from Xeons in that they do not have ECC support.

[1] https://communities.intel.com/message/282073


There are a lot of i3's with ECC support; even some Pentiums, Celerons and Atoms have it. I see them used often in cheap NAS builds. It seems that just the specific one in your link didn't have it, not i3's in general.

http://ark.intel.com/search/advanced?s=t&ECCMemory=true


According to the link which Orf pasted below: https://aws.amazon.com/ec2/faqs/

Amazon use ECC in all server instances including their GPU compute machines. If you're part of a company or university ask Intel for a Xeon sample. They sometimes hand them out for free.


Why would Nvidia stick four graphics cards in a system when each card uses 16 lanes of pcie and the processor maxes out at 40 lanes?


It depends on the algorithm/implementation, but some don't require much communication with the CPU once you put the data on the GPU.


I'm seeing the E5-1650v3 at around the same price as the i7-5930K, and the Supermicro X10SRA is significantly cheaper than the Asus X99-E WS, but it's not SLI certified.


But with slight overclocking the i7 will offer significantly better performance at the same price point. (edited)

Also try moving to newer processes / more cores. The price gap becomes big.


Actually, currently only Skylake Xeons support AVX-512, the newest and shiniest incarnation of SIMD. No desktop i3, i5 or i7 does.

No one is using MMX anymore, but even it is supported by all the Xeons made in the last 15 years. Xeons have had SSE just as long as other Intel CPUs.

So what you said is not true at all.


True, I stand corrected. Thanks.


I just switched out a tired old Dell Precision T5400 (~7 years old, ran 24 hours a day but began randomly crashing), which had ECC memory, for a new box built from parts with no ECC memory. It'll be interesting to see what difference this makes regarding error rates and hardware failures.


I just want to be clear that in the original referenced article, I am not anti-ECC per se, I just found myself caught in the massive cognitive dissonance between "you must have ECC in all your computers otherwise they will constantly and silently corrupt your data + crash" and "statistically speaking, most computers in the world do not use ECC". How can both of these things be true?

The argument for ECC is credible (I personally think rowhammer is the best example of this actually mattering, but ironically a) you can rowhammer ECC memory just fine and b) DDR4 has hardware features to mitigate rowhammer -- which shows how quickly things are changing), but it also seems to hinge on whether you have hundreds to thousands of computers all working together, e.g. the positive effects of ECC only seem to matter enough statistically at a _very_ large scale.


> "you must have ECC in all your computers otherwise they will constantly and silently corrupt your data + crash" and "statistically speaking, most computers in the world do not use ECC". How can both of these things be true?

What makes you say that? Both those things being true simply means that most computers silently corrupt your data and crash. That matches my experience. My programs occasionally crash. My pictures and videos are occasionally corrupted.

Do I know that those events are caused by memory failures? No. Most of them are probably other sorts of software or hardware failures, but some could be memory errors.


I've never had a photo or video occasionally corrupted on any computer I've ever owned going back to 1985. Crashes? Sure, who hasn't.

"Some could be.." is computing by coincidence, and I'm not a fan of that logic. You can refer to the 2007, 2009, and 2012 studies for measured data on server farms.


> I've never had a photo or video occasionally corrupted on any computer I've ever owned

How do you know? Just the other day I pushed a TS video stream through a script that accidentally modified some bytes (I didn't even bother to analyze the exact changes). Occasional artifacts were observable, but the stream played nevertheless. That is how players are made: to be able to resynchronize even after bigger errors. Somebody who programs these things can surely tell you much more about that.

I was able to notice it because I played the stream in mplayer while watching the console output. So I saw the "debug messages" detecting "something" wrong, and stopped seeing them once I'd corrected the script. It was the messages, not the "watching", that made me sure.


Why worry about it if you don't even notice the corruption?


For your own videos that nobody ever even wants to watch again, you surely don't have to. For the servers for your business...

Well... I can imagine there are enough people who won't care either.

Note that the chance of corruption increases the longer the data remains unsaved in RAM before finally being written out. We are lucky that, at least in private use, we typically copy the data to the medium without keeping it in RAM for long (on the order of seconds or less). So that also reduces the chances of our pictures being saved with the wrong bits.


An important tenet of the digital world is avoiding error propagation. You pass the video along a few times and the errors accumulate.


I presume most of the corruption I've experienced is because I stored my data on a single hard drive that I used until it completely failed. Of course, that's how most consumers use their systems, so it's probably a common experience.

Though, I usually only notice it years after the fact. I have a lot of photos and a lot of videos and only view a few old ones on occasion.

> "Some could be.." is computing by coincidence

That second paragraph of mine is mostly superfluous, and you seem to have read more into it than I intended. Whoops. I should have been more clear. I was not trying to make any sort of claim as to how to do computing. I was only trying to show that there are other possibilities that you would need to investigate and reject before going from your premises to that particular conclusion.


A few bit flips in an image are unlikely to be noticeable (it depends on the format though; uncompressed is obviously the most resilient, and JPEG should be reasonably resistant too).

Also, I don't think images stay in RAM for very long. Your HDD definitely has ECC.


I've had plenty of corrupt JPGs - they're certainly noticeable, since a single error tends to mess up an entire macroblock, often even the entire rest of the image.

Was fun finding a bunch of my own photos like that. Luckily that was on a disposable copy and not primary storage, but it was enough to put me completely off non-checksumming filesystems.


Actually JPEG is not very resilient; there are art forms based on making single-bit changes to JPEG files to end up with very strange images.


That's an adversarial error, though; I think it's pretty good against random errors. I'll do a test sometime.
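A quick-and-dirty version of that test might look something like the sketch below, assuming Pillow is available; "photo.jpg" is just a placeholder path. It flips one random bit in the compressed bytes and checks whether the file still decodes.

    import random
    from io import BytesIO
    from PIL import Image

    original = open("photo.jpg", "rb").read()           # placeholder path
    corrupted = bytearray(original)
    pos = random.randrange(2, len(corrupted) - 2)        # leave the SOI/EOI markers alone
    corrupted[pos] ^= 1 << random.randrange(8)           # flip one random bit

    try:
        img = Image.open(BytesIO(bytes(corrupted)))
        img.load()                                       # force a full decode
        print("still decodes; check visually for smeared macroblocks")
    except OSError as err:
        print("decoder gave up:", err)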


Most normal people also don't keep large amounts of data or long running programs in memory for long periods of time, which is the case that is most vulnerable.


is this even true of a typical server workload? I guess it depends on the server and the work, but certainly for a webserver, this doesn't seem true.


It's true of databases or storage servers that rely heavily on caching in RAM (like with ZFS). Also many HPC workloads, which might be storing e.g. huge matrices distributed in memory over many nodes.


The answer is unfortunately: because we don't act like professionals.

I hope that, say, civil engineers wouldn't build bridges the way we throw together computer software and hardware.


Well they do... just look at the fiasco with the replacement for the Bay Bridge in California.


That decision was not made by bridge engineers. It was made by architects, artists, and civilians - basically, everyone with zero engineering experience. The whole thing was a political rig job, set up to create something iconic rather than something that worked.

San Franciscans insist on always doing things differently, not necessarily doing them well.


Well lots of people come into the picture when you build any engineering system. This includes software and hardware. There are many cost tradeoffs and compromises.


In this case, it was purely non technical reasons and not many tradeoffs. It was all political: http://www.sfchronicle.com/bayarea/article/Bay-Bridge-s-trou...


> "statistically speaking, most computers in the world do not use ECC". How can both of these things be true?

... most computers aren't particularly reliable.

I'm curious - are you not monitoring or at least keeping a vague eye on ECC correction events with your existing hardware? If so, are you just not seeing any?

I've never really operated at any sort of "large" scale - a handful of racks at most - but I've always found correction events to be about as routine as IO errors, and certainly way more common than outright disk failures.


Neither of our (very good, IMO) sysadmins bothered to check ECC counters on our existing server memory, as far as I know. We did check SMART counters and SSD wear levels, though. It is certainly a very good point that an unusual number of corrected ECC errors is a bad sign and that such a machine should probably have its memory replaced at a minimum.


I.. would probably have done that before concluding I didn't have any need for it. Even just do a quick one-off mcelog run to see what events have happened recently - you don't have to do some big statistical analysis.

I'd estimate about half the machines I've used with ECC have had at least one correction a year, maybe a quarter had one every few months. I've seen one-off weird bursts that never happen again, and I've had quite a few cases where ongoing corrections have indicated a DIMM needed reseating. An actual blatantly faulty DIMM's been quite rare.


You aren't concerned about cosmic rays?

https://en.wikipedia.org/wiki/Cosmic_ray#Effect_on_electroni...

>> Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month.[76] To alleviate this problem, the Intel Corporation has proposed a cosmic ray detector that could be integrated into future high-density microprocessors, allowing the processor to repeat the last command following a cosmic-ray event.[77]
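To put that figure into desktop terms, here's a rough back-of-the-envelope; the 1990s IBM rate from the quote is the assumption here, and modern DRAM is generally thought to do better per bit:

    # One error per 256 MB per month (the IBM figure above), scaled to a 16 GB desktop.
    # Both numbers are assumptions for illustration only.
    errors_per_256mb_month = 1
    ram_mb = 16 * 1024
    errors_per_month = errors_per_256mb_month * ram_mb / 256
    print(errors_per_month)      # 64.0 -- roughly two expected errors per day at that rate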


Is there a real-world practical rowhammer attack against ECC memory? I haven't seen one and the test tools I ran never triggered anything on servers with ECC.


If you read the comments on the referenced article, someone had a rowhammer problem even with ECC memory. ECC does indeed reduce the chance of a memory error a lot, but the chance is far from zero (at the scale of hundreds or thousands of servers), even with ECC. Another paradox about ECC: it's not a guarantee, so you still need to build systems that can tolerate or mitigate statistically rare memory error states.

This commenter seemed quite credible to me, here's where it starts:

http://discourse.codinghorror.com/t/to-ecc-or-not-to-ecc/377...

> Yep, these are what we had -- uncorrectable errors with ECC memory caused by row hammer. Luckily there are mitigations. Sandy Bridge allows you to double...


Uncorrectable errors with ECC memory caused by row hammer is just ECC working as designed.

The scary thing about rowhammer isn't just the potential for a fault, it's the potential for a security vulnerability. Yeah, a DoS attack isn't nice either, but it's a million times less bad than a privilege escalation caused by memory corruption.

For rowhammer to be even the same magnitude of problem with ECC memory would require not an uncorrectable error, but an error that gets past the error detection mechanism entirely and produces data either seen as correct or correctable, but which is not the original data.
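To make the correct/detect distinction concrete, here's a toy sketch of SECDED (single-error-correct, double-error-detect) using a tiny extended Hamming(8,4) code. Real DIMM ECC uses a Hamming-style code over 64-bit words (72 bits stored), but the behaviour is analogous: one flipped bit is fixed silently, two flipped bits are detected and reported as uncorrectable, and it takes a larger multi-bit pattern to be miscorrected or slip through.

    # Toy SECDED: extended Hamming(8,4). Illustrative only.

    def encode(nibble):
        """Encode 4 data bits (0-15) into an 8-bit codeword."""
        d = [(nibble >> i) & 1 for i in range(4)]
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # Hamming(7,4) layout
        p0 = 0
        for b in bits:
            p0 ^= b                                   # overall parity bit
        return bits + [p0]

    def decode(code):
        """Return (status, data): 'ok', 'corrected' or 'uncorrectable'."""
        p1, p2, d0, p3, d1, d2, d3, p0 = code
        s1 = p1 ^ d0 ^ d1 ^ d3
        s2 = p2 ^ d0 ^ d2 ^ d3
        s3 = p3 ^ d1 ^ d2 ^ d3
        syndrome = s1 | (s2 << 1) | (s3 << 2)         # position of a single flipped bit
        overall = 0
        for b in code:
            overall ^= b                              # 0 if no error or an even number of flips
        if syndrome == 0 and overall == 0:
            status = "ok"
        elif overall == 1:                            # single-bit error: fix it silently
            status = "corrected"
            code = list(code)
            if syndrome:
                code[syndrome - 1] ^= 1
        else:                                         # two flips: detected, reported, not fixed
            status = "uncorrectable"
        _, _, d0, _, d1, d2, d3, _ = code
        return status, d0 | (d1 << 1) | (d2 << 2) | (d3 << 3)

    word = encode(0b1011)
    word[4] ^= 1                   # one flipped bit: corrected transparently
    print(decode(word))            # ('corrected', 11)
    word[6] ^= 1                   # a second flipped bit: machine-check territory
    print(decode(word))            # ('uncorrectable', 1) -- data not trustworthy here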


Rowhammer might get through eventually, but it will blow the corrected error count through the roof on the way there.

The job of ECC is to prevent transient failures and to warn you about less-transient failures. It's not a paradox that it can't paper over things that are actually broken.


> Alternately, it might be a plan to create literal cloud computing.

Thanks, I just snorted tea onto my keyboard after reading that. Seems like if you have the money and are maintaining a pretty critical system it would be silly not to get ECC RAM (and if you're building your own iron the price difference isn't that much as far as I can tell).

On the EC2 site they say "In our experience, ECC memory is necessary for server infrastructure, and all the hardware underlying Amazon EC2 uses ECC memory"[1]. Amazon maintain a lot of servers and if they think it's necessary I'm inclined to believe them.

1. https://aws.amazon.com/ec2/faqs/


A big part of the value proposition of ECC does seem to hinge on having thousands, and perhaps _many_ thousands of machines working together. At a large enough scale even small statistics start to matter.


It's more a value proposition of how long the machine will be running. Given time, it will get corrupt bits. You don't need thousands of machines to experience memory corruption...


I think ECC is inevitable. With 128GB DIMMs being produced now and NV-DIMMs (DDR4 flash-on-dimm) just around the corner, some kind of hardware error detection is necessary.

This is similar to high capacity spinning drives. With smaller ones you could just go with RAID5 and not worry about anything, but when drives are 3-4TB and up, you have to use RAID6, because the spec error rate becomes too high to rely on a single parity drive.
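For a rough sense of the numbers behind that argument, here's a back-of-the-envelope sketch; the one-URE-per-1e14-bits spec and the 6x4TB array are illustrative assumptions, not figures from this thread:

    # Chance that a degraded 6x4TB RAID5 rebuild completes without hitting an
    # unrecoverable read error (URE), assuming the commonly quoted consumer-drive
    # spec of one URE per 1e14 bits read. Both numbers are assumptions.
    ure_per_bit = 1e-14
    drive_tb = 4
    surviving_drives = 5                                   # 6-drive RAID5 after one failure
    bits_to_read = surviving_drives * drive_tb * 1e12 * 8
    p_clean_rebuild = (1 - ure_per_bit) ** bits_to_read
    print(f"{p_clean_rebuild:.1%}")                        # ~20% -- RAID6's second parity covers that URE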

Here, too, when your machine has 1-2TB of hybrid RAM/NVM, you HAVE TO have some way to detect failures, even if it's not particularly good. The performance characteristics of RAM preclude the more robust algorithms, such as a wide (32-bit) CRC, from being used (a narrower CRC could still be doable in hardware, though), but parity is a complete no-brainer as the first step.


There were rumblings from a fruit-related technology company that ECC may in fact be required in future memory hardware, because the risk of error becomes too high without it. This makes total sense to me; I am frankly surprised that ECC is still considered a weird premium upsell worthy only of servers and server farms. Shouldn't every computer be reliable?


It's a "premium upsell" for the same reason why we still have 1366x768 screens on laptops: PC manufacturers don't give a shit and they'd rather save $1 per unit. As to Apple, I'm not sure why they don't support ECC in anything other than MacPro. IMO they should.


RAID 6 is not an option with big enough drives, because of rebuild time.

https://storagemojo.com/2010/02/27/does-raid-6-stops-working...


Nitpick: neither RAID 6, nor RAID 5 are an option. Leventhal's point is that triple-parity (eg. ZFS raidz3) becomes necessary.


I think the main point is that there is no "RAID is the solution for all our problems" anymore.


Rebuild time is a problem, I agree. But not as huge of a problem as outright data loss.


RAID is not the solution to the data loss problem - never was, never will be. That's what backups are for.


Backup is never current, and it takes time to restore. RAID has its place. So much so, in fact, that there are now erasure-coded distributed filesystems at Google and elsewhere, reliably storing exabytes of data on super shitty disks.


"If ECC were actually important, it would be used everywhere and not just servers."

Ha. I wish I could get laptops with ECC RAM.


Fortunately, I think Intel is finally coming with mobile Xeons that support ECC SO-DIMMs. I think the ThinkPad P50/P70 is using them.


Thinkpad P50 (Xeon E3-1535M v5, ECC DDR4, up to 3 drives including PCIe NVMe)

Thinkpad P70 (Xeon E3-1505M v5, ECC DDR4, up to 4 drives including PCIe NVMe)

http://www.lenovo.com/psref/pdf/ThinkPad.pdf


"... DDR4, ECC or non-ECC (ECC function supported only on Xeon processor), dual-channel capable, four DDR4 SO-DIMM sockets ..."


IIRC, ECC has higher power consumption and generally costs more than non-ECC parts. Not exactly mobile-friendly.


Atwood's original article was puzzling to me, and the conclusions just didn't compute.

I can recall at least a half dozen times when I was a DBA in olden times that ECC either corrected errors or was essential in isolating faults on my Informix and later Oracle boxes, running mostly on Sun and RS/6000 at the time.

Sun had a nice habit of shipping defective CPUs and memory in the late 90s. The details are foggy, but I remember correlating ECC faults to long transactions that would fail, and getting a bunch of stuff out of Sun.

Then again, that was 15+ years ago, so maybe the newfangled memory we have these days is more reliable.


Here's James Gosling's account of radioactive RAM chips used in the UltraSparc II...

http://nighthacks.com/roller/jag/entry/at_the_mercy_of_suppl...

When Sun folks get together and bullshit about their theories of why Sun died, the one that comes up most often is another one of these supplier disasters. Towards the end of the DotCom bubble, we introduced the UltraSPARC-II. Total killer product for large datacenters. We sold lots. But then reports started coming in of odd failures. Systems would crash strangely. We'd get crashes in applications. All applications. Crashes in the kernel. Not very often, but often enough to be problems for customers. Sun customers were used to uptimes of years. The US-II was giving uptimes of weeks. We couldn't even figure out if it was a hardware problem or a software problem - Solaris had to be updated for the new machine, so it could have been a kernel problem. But nothing was reproducible. We'd get core dumps and spend hours poring over them. Some were just crazy, showing values in registers that were simply impossible given the preceding instructions. We tried everything. Replacing processor boards. Replacing backplanes. It was deeply random. Its very randomness suggested that maybe it was a physics problem: maybe it was alpha particles or cosmic rays. Maybe it was machines close to nuclear power plants. One site experiencing problems was near Fermilab. We actually mapped out failures geographically to see if they correlated to such particle sources. Nope. In desperation, a bright hardware engineer decided to measure the radioactivity of the systems themselves. Bingo! Particles! But from where? Much detailed scanning and it turned out that the packaging of the cache RAM chips we were using was noticeably radioactive. We switched suppliers and the problem totally went away. After two years of tearing our hair out, we had a solution.

But it was too late. We had spent billions of dollars keeping our customers running. Swapping out all of that hardware was cripplingly expensive. But even worse, it severely damaged our customers' trust in our products. Our biggest customers had been burned and were reluctant to buy again. It took quite a few years to rebuild that trust. At about the time that it felt like we had rebuilt trust and put the debacle behind us, the Financial Crisis hit...


We use high-spec systems for storing data coming off radiation detectors in experiments (which can be in the multiple GB/s of data with high end digitizers). You can bet we use ECC for that; we made sure to after one experiment got ruined by memory corruption...


Interesting. I had not heard that story before, although I believe one or two of my old friends were working at Sun on SPARC development at the time. The curious thing about this story is that when I worked in the memory business for a while in the late '80s, I was told that the major source of alpha particles that could cause soft errors was the device packaging material. As it was explained to me, earlier in DRAM history the chips were usually packaged in ceramic (also, military applications always used ceramic). Later on, plastic packaging materials were used because they were less expensive (and presumably the associated reliability issues had been resolved sufficiently to allow the use of plastic in more applications). Anyway, the plastic didn't emit alpha radiation like the ceramic did. When it was explained to me I got the impression everyone in the business knew this.


Don't hardware manufacturers test their systems for many weeks for MTBF estimates and the like? For something that is supposed to be running for years at a time, how did this escape their QA process before shipping?


Are you referring to the Ecache Data Parity (EDP) errors? I think those things hastened Sun's downfall. They replaced every single CPU and memory board in every Sun server when I was at Hotmail. That was a lot of modules.


I looked at the Atwood article, which really didn't give useful numbers, except for the one citation. I opened that, found the FIT number, converted it to ~0.5 errors per year, and thought, eh, skipping ECC is fine.

I didn't notice this: "From the graph above, we can see that a fault can easily cause hundreds or thousands of errors per month." Now I want ECC again.


Drill into the 3 referenced studies, they are all reasonably recent (2007, 2009, 2012) and contain specifics/data.


Speaking of Google's shipping containers: Sun had those too - https://en.wikipedia.org/wiki/Sun_Modular_Datacenter

"A data center of up to 280 servers can be rapidly deployed by shipping the container"


The last time I built a system with ECC it was impossible to tell which cpu/motherboard combos supported ECC; the time before that was when AMD had super-affordable systems with ECC support, but I gave up trying to figure out if AMD even supported ECC on their workstation parts.


The problem is, there are also additional hidden costs to using ECC, mostly due to what is available in the ecosystem. Namely: I want ECC. Great, then I need to get an X99 board with Xeon chips running at higher TDPs. Depending on the case I use, I might then need to upgrade the PSU and fans/coolers.

Or (especially in the mini-ITX world) I can choose one of the server boards or "workstation" C236 boards. Then I would lose official desktop Windows support, or lose an M.2 SSD slot, or onboard sound, or other "desktop workstation" features.

It is still not easy to do ECC in this day and age...


From my point of view it boils down to how critical stability and uptime are within any given environment. I'm not going to wring my hands over whether or not the hardware powering my todo-list startup includes ECC RAM. I would wring my hands over it if I were building a system for military, healthcare, or critical infrastructure applications, to use a few examples.


Saying that only “military, healthcare, or critical infrastructure applications” should bother using ECC RAM is akin to claiming that only banks should use HTTPS.


Well, I didn't say that. Additionally, I don't consider error correction to be analogous to SSL.


Is it possible to get some of the benefits of ECC without ECC by serializing out the entire state of a program at some set rate? For example, one could read in the nth serialized checkpoint, run the program to the (n+1)th checkpoint, and compare the original (n+1)th checkpoint to the new one. If they differ, cosmic rays flipped a bit in the interim. This would, of course, break if the code itself doesn't guarantee bit-compatible results over multiple runs (due to the use of certain parallel algorithms, etc.). I suppose this would double the runtime, however...
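A minimal sketch of that checkpoint-replay-compare idea, assuming the computation really is deterministic; the step function and the pickle-based checkpointing are hypothetical stand-ins:

    import hashlib
    import pickle

    def digest(state):
        # Hash a serialized snapshot so comparisons are cheap to store.
        return hashlib.sha256(pickle.dumps(state)).hexdigest()

    def run_verified(initial_state, step, n_checkpoints):
        # Primary run: advance the state, keeping one checkpoint per interval.
        states = [initial_state]
        for _ in range(n_checkpoints):
            states.append(step(states[-1]))

        # Verification pass: recompute checkpoint n+1 from checkpoint n and compare.
        # A mismatch means the two runs disagree -- possibly a flipped bit in one of
        # them (a third run would be needed to tell which) -- at roughly 2x runtime.
        for n in range(n_checkpoints):
            if digest(step(states[n])) != digest(states[n + 1]):
                print(f"divergence between checkpoints {n} and {n + 1}")
        return states[-1]

    print(run_verified(0, lambda s: s + 1, 5))   # trivially deterministic example -> 5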


2x time overhead vs. 10% cost overhead?


the cost is not everything... sometimes it is impossible to get the features that you need in the form factor that you are seeking


ECC is a requirement for servers when we validate installations of our database product.

Basically the takeaway is that without ECC you can expect a memory error every two days on machines with a lot of RAM.


Is there someone else who consistently reads 'Elliptic Curve Cryptography' and is disappointed by the subject of the article?


I indeed do and was. But then again, not being able to have ECC RAM on my laptop has been a pet peeve of mine for quite a long time, so after my initial disappointment I could also enjoy the surprise article :)



