Do people need it? Nah, probably not for systems that can handle crashing, but you'd be nuts not to use it in servers or systems running long jobs - it's a one-time insurance payment on your system against a remote but real risk.
Sadly there are very few studies into it, but those that exist show that modern DIMMs still get correctable errors. Manufacturing processes are much better, but they haven't eliminated the need for it.
I do enjoy the "why would I need it, I've never had memory errors" attitude, though, when people likely have no idea why their application or OS crashed. And the accounts of people who eventually diagnose memory errors after weeks of random crashes - errors that ECC would have reported immediately.
I will be doing deep learning and other ML GPU-powered tasks. Plus some long running high-memory I/O intensive tasks.
Note that Nvidia's build does not use ECC RAM, and it's quite an expensive machine. Mine will be only a fraction of the cost ($4.5k), with just one Titan X. I could afford a Xeon, but only at the cost of buying otherwise slower hardware. What should I do?
Intel's segmentation of the market, limiting the amount of RAM you can use in regular CPUs and removing support for ECC sucks.
Otherwise, their reference system is no different from a "high end" gaming rig (i.e. not a "server").
The point I was making about the reference Nvidia setup was that it's targeted at development, which means starting and stopping the system often, etc. So ECC wouldn't be of much use in that sense, and it would only serve to make Nvidia's reference system more expensive. I don't think it's a sign they don't believe it should be used.
Yes, there is a risk, but maybe not enough to jeopardize the test.
It really depends on the task whether you NEED it... I wouldn't do a BYO NAS without it ever again... but I would suggest that a lot of workloads don't need it.
That said, the additional cost should be nominal at this point, and I don't quite get why it isn't standard. It's like math coprocessors in the early 90s: they eventually became a critical integrated piece.
This hasn't really been true since Haswell. Now even low-end Celerons and Pentiums support it.
There seem to be plenty of options in the sub-$200 range.
Amazon use ECC in all server instances including their GPU compute machines. If you're part of a company or university ask Intel for a Xeon sample. They sometimes hand them out for free.
Also try moving to newer processes / more cores. The price gap becomes big.
No one is using MMX anymore, but even it is supported by all the Xeons made in the last 15 years. Xeons have had SSE just as long as other Intel CPUs.
So what you said is not true at all.
The argument for ECC is credible (I personally think rowhammer is the best example of this actually mattering, but ironically a) you can rowhammer ECC memory just fine and b) DDR4 has hardware features to mitigate rowhammer -- which shows how quickly things are changing), but it also seems to hinge on whether you have hundreds to thousands of computers all working together, i.e. the positive effects of ECC only seem to matter enough statistically at a _very_ large scale.
What makes you say that? Both those things being true simply means that most computers silently corrupt your data and crash. That matches my experience. My programs occasionally crash. My pictures and videos are occasionally corrupted.
Do I know that those events are caused by memory failures? No. Most of them are probably other sorts of software or hardware failures, but some could be memory errors.
"Some could be.." is computing by coincidence, and I'm not a fan of that logic. You can refer to the 2007, 2009, and 2012 studies for measured data on server farms.
How do you know? Just the other day I pushed a TS video stream through a script that accidentally modified some bytes (I didn't even bother to analyze the exact changes). There were occasional visible artifacts, but the stream played nevertheless. That's how the players are built: to resynchronize even after bigger errors. Somebody who programs these things can surely tell you much more about it.
I was able to notice that because I played the stream in mplayer while watching the console output. So I saw the "debug messages" detecting "something" wrong, and stopped seeing them once I'd corrected the script. It was the messages, not the watching, that made me sure.
Well... I can imagine there are enough people who won't care, either.
Note that the chance of corruption increases the longer the data sits unsaved in RAM. We are lucky that, at least in private use, we typically copy data to the storage medium without keeping it in RAM for long (on the order of seconds or less). So that also reduces the chances of our pictures being saved with the wrong bits.
Though, I usually only notice it years after the fact. I have a lot of photos and a lot of videos and only view a few old ones on occasion.
> "Some could be.." is computing by coincidence
That second paragraph of mine is mostly superfluous, and you seem to have read more into it than I intended. Whoops. I should have been more clear. I was not trying to make any sort of claim as to how to do computing. I was only trying to show that there are other possibilities that you would need to investigate and reject before going from your premises to that particular conclusion.
Also, I don't think images stay in RAM for very long. And your HDD definitely has ECC.
Was fun finding a bunch of my own photos like that. Luckily that was on a disposable copy and not primary storage, but it was enough to put me completely off non-checksumming filesystems.
I hope that, say, civil engineers wouldn't build bridges the way we throw together computer software and hardware.
San Franciscans insist on always doing things differently, not necessarily doing them well.
... most computers aren't particularly reliable.
I'm curious - are you not monitoring or at least keeping a vague eye on ECC correction events with your existing hardware? If so, are you just not seeing any?
I've never really operated at any sort of "large" scale - a handful of racks at most - but I've always found correction events to be about as routine as IO errors, and certainly way more common than outright disk failures.
I'd estimate about half the machines I've used with ECC have had at least one correction a year, maybe a quarter had one every few months. I've seen one-off weird bursts that never happen again, and I've had quite a few cases where ongoing corrections have indicated a DIMM needed reseating. An actual blatantly faulty DIMM's been quite rare.
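For anyone who hasn't looked: on a Linux box with an EDAC driver loaded, the corrected-error counters are just sitting in sysfs. A minimal sketch of reading them (the mc*/ce_count layout is the common one, but it varies a bit by kernel and driver):

```python
# A minimal sketch, assuming a Linux machine with an EDAC driver loaded;
# this sysfs layout is the common one, but it varies by kernel/driver.
import glob
import os

for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"):
    mc = os.path.basename(os.path.dirname(path))    # e.g. "mc0"
    with open(path) as f:
        corrected = f.read().strip()
    print(f"{mc}: {corrected} corrected errors since counters were reset")
```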
>> Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month. To alleviate this problem, the Intel Corporation has proposed a cosmic ray detector that could be integrated into future high-density microprocessors, allowing the processor to repeat the last command following a cosmic-ray event.
This commenter seemed quite credible to me; here's where it starts:
> Yep, these are what we had -- uncorrectable errors with ECC memory caused by row hammer. Luckily there are mitigations. Sandy Bridge allows you to double...
The scary thing about rowhammer isn't just the potential for a fault, it's the potential for a security vulnerability. Yeah, a DoS attack isn't nice either, but it's a million times less bad than a privilege escalation caused by memory corruption.
For rowhammer to be even the same magnitude of problem with ECC memory would require not an uncorrectable error, but an error that gets past the error detection mechanism entirely and produces data either seen as correct or correctable, but which is not the original data.
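To make that concrete, here's a toy sketch of those failure modes at miniature scale - an extended Hamming(8,4) SECDED code, purely illustrative (real ECC DIMMs use 72/64-bit codes, but the behavior is analogous): a single flip gets corrected, a double flip gets flagged as uncorrectable, but a suitably placed triple flip comes back as a "successful correction" with wrong data.

```python
# Toy SECDED (single-error-correct, double-error-detect) demo using an
# extended Hamming(8,4) code. Illustrative only; real DIMMs use 72/64 codes.

def encode(d1, d2, d3, d4):
    c = [0] * 8                  # c[1..7] = Hamming(7,4), c[0] = overall parity
    c[3], c[5], c[6], c[7] = d1, d2, d3, d4
    c[1] = c[3] ^ c[5] ^ c[7]    # parity over positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]    # parity over positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]    # parity over positions 4,5,6,7
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c

def decode(c):
    c = c[:]
    s = 0
    for i in range(1, 8):        # syndrome = XOR of the set bit positions
        if c[i]:
            s ^= i
    p = 0
    for bit in c:                # overall parity check, 0 if consistent
        p ^= bit
    if s == 0 and p == 0:
        status = "ok"
    elif s != 0 and p == 1:
        c[s] ^= 1                # looks like a single flip: "correct" it
        status = "corrected"
    elif s != 0 and p == 0:
        status = "uncorrectable" # classic double-bit-error detection
    else:                        # s == 0, p == 1: looks like a flipped parity bit
        c[0] ^= 1
        status = "corrected"
    return status, (c[3], c[5], c[6], c[7])

cw = encode(1, 0, 1, 1)
one = cw[:]; one[5] ^= 1                             # 1 flip
two = cw[:]; two[5] ^= 1; two[6] ^= 1                # 2 flips
tri = cw[:]; tri[3] ^= 1; tri[5] ^= 1; tri[6] ^= 1   # 3 flips

print(decode(one))   # ('corrected', (1, 0, 1, 1))  -> fixed
print(decode(two))   # ('uncorrectable', ...)       -> detected and flagged
print(decode(tri))   # ('corrected', (0, 1, 0, 1))  -> WRONG data, reported as fine
```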
The job of ECC is to prevent transient failures and to warn you about less-transient failures. It's not a paradox that it can't paper over things that are actually broken.
Thanks, I just snorted tea onto my keyboard after reading that. Seems like if you have the money and are maintaining a pretty critical system it would be silly not to get ECC RAM (and if you're building your own iron the price difference isn't that much as far as I can tell).
On the EC2 site they say "In our experience, ECC memory is necessary for server infrastructure, and all the hardware underlying Amazon EC2 uses ECC memory". Amazon maintain a lot of servers and if they think it's necessary I'm inclined to believe them.
This is similar to high-capacity spinning drives. With smaller ones you could just go with RAID5 and not worry about anything, but when drives are 3-4TB and up, you have to use RAID6, because the spec error rate becomes too high to rely on a single parity drive (rough numbers in the sketch below).
Here, too, when your machine has 1-2TB of hybrid RAM/NVM you HAVE TO have some way to detect failures, even if it's not particularly good. The performance characteristics of RAM preclude the more robust algorithms such as a wide (32-bit) CRC (a narrower CRC could still be doable in hardware, though), but parity is a complete no-brainer as the first step.
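To put rough numbers on the RAID5 point above - a back-of-the-envelope sketch, assuming the usual consumer-drive spec of one unrecoverable read error (URE) per 1e14 bits read and a hypothetical six-drive array of 4TB disks:

```python
import math

# Assumed figures, not measurements: 1 URE per 1e14 bits (common spec-sheet
# number for consumer drives), six-drive RAID5 of 4 TB disks, so five
# surviving drives must be read in full during a rebuild.
ure_per_bit = 1e-14
drive_bytes = 4e12
surviving_drives = 5

bits_read = surviving_drives * drive_bytes * 8
expected_ures = bits_read * ure_per_bit
p_at_least_one = 1 - math.exp(-expected_ures)   # Poisson approximation

print(f"expected UREs during rebuild: {expected_ures:.1f}")   # ~1.6
print(f"P(rebuild hits >= 1 URE): {p_at_least_one:.0%}")      # ~80%
```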
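And the parity-as-first-step idea in miniature - a sketch of even parity over a 64-bit word, which catches any odd number of flips, corrects nothing, and misses even-count flips entirely:

```python
# Even parity over a 64-bit word: the cheapest possible first step.
# Detects any odd number of bit flips; corrects nothing; an even number
# of flips slips through undetected.
def parity64(word: int) -> int:
    word ^= word >> 32
    word ^= word >> 16
    word ^= word >> 8
    word ^= word >> 4
    word ^= word >> 2
    word ^= word >> 1
    return word & 1

word = 0xDEADBEEFCAFEF00D
stored_parity = parity64(word)

single = word ^ (1 << 17)             # one flipped bit
double = word ^ (0b11 << 17)          # two flipped bits
print(parity64(single) != stored_parity)   # True  -> detected
print(parity64(double) != stored_parity)   # False -> missed
```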
Ha. I wish I could get laptops with ECC RAM.
Thinkpad P70 (Xeon E3-1505M v5, ECC DDR4, up to 4 drives including PCIe NVMe)
I can recall at least a half dozen times when I was a DBA in olden times that ECC either corrected errors outright or was essential in isolating faults on my Informix and later Oracle boxes, running mostly on Sun and RS/6000 at the time.
Sun had a nice habit of shipping defective CPUs and memory in the late 90s. The details are foggy, but I remember correlating ECC faults to long transactions that would fail, and getting a bunch of stuff out of Sun.
Then again, that was 15+ years ago, so maybe the newfangled memory we have these days is more reliable.
When Sun folks get together and bullshit about their theories of why Sun died, the one that comes up most often is another one of these supplier disasters. Towards the end of the DotCom bubble, we introduced the UltraSPARC-II. Total killer product for large datacenters. We sold lots.

But then reports started coming in of odd failures. Systems would crash strangely. We'd get crashes in applications. All applications. Crashes in the kernel. Not very often, but often enough to be problems for customers. Sun customers were used to uptimes of years. The US-II was giving uptimes of weeks. We couldn't even figure out if it was a hardware problem or a software problem - Solaris had to be updated for the new machine, so it could have been a kernel problem. But nothing was reproducible.

We'd get core dumps and spend hours poring over them. Some were just crazy, showing values in registers that were simply impossible given the preceding instructions. We tried everything. Replacing processor boards. Replacing backplanes. It was deeply random. Its very randomness suggested that maybe it was a physics problem: maybe it was alpha particles or cosmic rays. Maybe it was machines close to nuclear power plants. One site experiencing problems was near Fermilab. We actually mapped out failures geographically to see if they correlated to such particle sources. Nope.

In desperation, a bright hardware engineer decided to measure the radioactivity of the systems themselves. Bingo! Particles! But from where? Much detailed scanning and it turned out that the packaging of the cache RAM chips we were using was noticeably radioactive. We switched suppliers and the problem totally went away. After two years of tearing our hair out, we had a solution.
But it was too late. We had spent billions of dollars keeping our customers running. Swapping out all of that hardware was cripplingly expensive. But even worse, it severely damaged our customers trust in our products. Our biggest customers had been burned and were reluctant to buy again. It took quite a few years to rebuild that trust. At about the time that it felt like we had rebuilt trust and put the debacle behind us, the Financial Crisis hit...
I hadn't noticed this: "From the graph above, we can see that a fault can easily cause hundreds or thousands of [errors] per month." Now I want ECC again.
"A data center of up to 280 servers can be rapidly deployed by shipping the container"
Or (especially in the mini-ITX world), I can choose one of the server boards or "workstation" C236 boards. Then I would lose official desktop Windows support, or lose an M.2 SSD slot, or onboard sound, or other "desktop workstation" features.
It is still not easy to do ECC in this day and age...
Basically the takeaway is that without ECC you can expect a memory error every two days on machines with a lot of RAM.
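For scale, plugging a hypothetical machine into the 1990s IBM figure quoted further up the thread (one error per 256 MB per month) lands in the same ballpark:

```python
# Scaling the IBM 1990s figure quoted earlier in the thread:
# one cosmic-ray-induced error per 256 MB of RAM per month.
ram_gb = 4                                   # assumed "a lot of RAM" for the era
errors_per_month = ram_gb * 1024 / 256
days_between_errors = 30 / errors_per_month
print(f"{errors_per_month:.0f} errors/month, "
      f"one every {days_between_errors:.1f} days")
# -> 16 errors/month, one every ~1.9 days
```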