Hacker News
Why Use ECC? (2015) (danluu.com)
79 points by pmoriarty 3 days ago | 97 comments





Back when I worked for BitKeeper, support problems were commonly traced back to bad memory (or disk, or NFS, ...). BK checksums everything it writes to disk and verifies those checksums when it reads the data back. And since it uses 'simple' arithmetic checksums rather than 'secure' ones, it was possible to precompute the checksum we would get when writing new files, so even memory errors would be noticed if they changed what we were about to write to disk.
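
To illustrate why a simple arithmetic checksum can be precomputed, here is a minimal sketch. It assumes a 16-bit sum of bytes, which is a stand-in and not necessarily BK's actual algorithm; the function names are hypothetical. Because the sum is additive, the checksum of the data about to be written can be derived from the checksums of its pieces, and a mismatch once the buffer is assembled points at corruption in memory.

```python
def additive_checksum(data: bytes) -> int:
    """16-bit sum of all bytes (assumed stand-in, not BK's real checksum)."""
    return sum(data) & 0xFFFF


def combine(sum_a: int, sum_b: int) -> int:
    """Checksum of a + b, computed from the two partial checksums alone."""
    return (sum_a + sum_b) & 0xFFFF


header = b"file header\n"
body = b"line one\nline two\n"

# Predict the checksum before the output buffer is assembled.
expected = combine(additive_checksum(header), additive_checksum(body))

# Assemble the buffer that would go to disk and verify it.
buffer = bytearray(header + body)
assert additive_checksum(bytes(buffer)) == expected

# Simulate a single bit flip in memory before the write.
buffer[5] ^= 0x04
corrupted = additive_checksum(bytes(buffer)) != expected
print("corruption detected:", corrupted)
```

A cryptographic hash has no such combining property, which is the trade-off the comment is pointing at.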

The hard part when doing support is convincing the customer that the errors in your software are from their systems. I had to learn the failure patterns of a couple large filers so I could look for them.

Now we use git, which is fine, and git uses much stronger checksums. My main concern is that since those SHA{1,256} checksums are expensive, they are typically generated only once and then only used for lookups. If you intentionally corrupt git files, it is hard to have git notice without explicitly running a check. These checks should be done in cron nightly as part of backups.



Interesting!

Anyone know the git invocation to put in cron?

edit: probably this? https://git-scm.com/docs/git-fsck
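
A minimal sketch of a wrapper a nightly job could call, assuming Python is available on the box. `git fsck --full` is a real invocation; the function names and any path you pass in are hypothetical.

```python
import subprocess
import sys


def fsck_command(repo_path: str) -> list:
    """Build the git invocation that verifies the objects in the repo."""
    return ["git", "-C", repo_path, "fsck", "--full"]


def check_repo(repo_path: str) -> bool:
    """Run git fsck and return True if it exits cleanly."""
    result = subprocess.run(fsck_command(repo_path), capture_output=True, text=True)
    if result.returncode != 0:
        # Surface git's own report so the job's log or mail shows what failed.
        print(result.stdout + result.stderr, file=sys.stderr)
    return result.returncode == 0
```

Call `check_repo(...)` from whatever script cron runs and exit non-zero on failure, so the backup step can refuse to overwrite a good backup with a corrupt repository.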


Check, check, check. I only use ECC DRAM in my desktop these days and I monitor bit flips which on my 64GB desktop happen about once a month[1]. A similar rate on my FreeNAS device which also has ECC memory.

Like others, I was super happy when I could buy a Ryzen chip at retail prices and get ECC rather than buying a Xeon SKU, which was an extra $300.

[1] When I first built the system I was getting about 2 a week which I tracked down to a 'weak' ram stick. Replaced it (Corsair, and they replaced it for free) and now it's about one a month.


In the early 2000s for a year or two I had a process running on my Linux desktop without ECC that just filled a 128 MB buffer with a pattern, and then checked every minute to see if the buffer had changed.

I ran this for a couple of years and never saw a change.
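
For reference, a buffer-watching process like the one described above can be sketched as follows. The pattern value, the buffer size, and the function names are assumptions for illustration; the once-a-minute loop is left out.

```python
PATTERN = 0xAA  # alternating bits 10101010


def make_buffer(size_bytes: int) -> bytearray:
    """Allocate a buffer filled with the test pattern."""
    return bytearray([PATTERN]) * size_bytes


def find_flips(buf: bytearray) -> list:
    """Return (offset, observed_byte) for every byte that no longer matches."""
    expected = bytes([PATTERN]) * len(buf)
    if buf == expected:  # fast path: one comparison over the whole buffer
        return []
    return [(i, b) for i, b in enumerate(buf) if b != PATTERN]


buf = make_buffer(1024 * 1024)  # 1 MiB here; the original used 128 MB
assert find_flips(buf) == []

buf[4242] ^= 0x10  # simulate one flipped bit
print(find_flips(buf))
```

One caveat with any user-space check like this: the process cannot tell whether a read was served from DRAM or from a CPU cache, or whether the pages stayed in the same physical memory the whole time.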

I believe that based on published memory error rates for memory at that time, I should have seen a few errors. I am not sure why I never did. Maybe consumer desktop memory was less dense than server or workstation memory, and the published error rates were for the latter?

From 2008 to 2017 I had a 2008 Mac Pro at work with ECC memory, and from 2009 to 2017 I had a 2009 Mac Pro at home with ECC memory. I'd occasionally look in the memory section of the System Information report and never saw any ECC corrected errors reported.

With the Mac Pros, it is possible that I just never happened to check the report between the time of a corrected error and the time the counter was next reset (assuming it resets every boot...if it is cumulative then I have no idea how I never got an error).


FWIW these are 16GB DIMMs. In 64GB there are roughly 512 times as many bits as in 128MB. Assuming the bit flip rate[1] scales linearly with the number of bits (it is a statistical event after all), 1 bit flip a month in 64GB would be the equivalent of 1 bit flip every 42 years or so in 128MB.

Or put differently, your survey size was quite a bit different than mine has been. Also there is always a chance that the bit flip is in the ECC bits not the main memory bits. ECC DIMMS typically have 72 bits per 64 bit long word, vs 66 bits per 64 data bits for parity memory, and 64 bits for 64 bits on memory with no protection at all. So it's hard to make a 1:1 comparison without knowing the exact layout of the memory and whether or not error detection bits are present.

[1] Soft (and Hard) bit errors are generally spec'd as a probability per unit time per bit, so more bits means higher probability of a bit flip per unit time.
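
The scaling argument above, worked through as a short sketch. It assumes binary units (GiB and MiB) and a flip rate strictly proportional to the number of bits; the variable names are mine.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

big_bytes = 64 * GIB     # the 64GB desktop
small_bytes = 128 * MIB  # the 128MB test buffer

ratio = big_bytes // small_bytes

flips_per_month_big = 1.0
months_per_flip_small = ratio / flips_per_month_big
years_per_flip_small = months_per_flip_small / 12

print(ratio, round(years_per_flip_small, 1))
```

So a two-year run on 128MB would be expected to see well under one flip at that rate, which is consistent with never observing one.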


Your sample size is too small. In my experience with thousands of servers with 64-512GB of ECC ram, not very many of them will have any ECC reports during use. Usually, everything works fine, but sometimes something will go wrong, and it's nice to know.

Of course, that's when I'm not paying the bills. I don't have ECC on my home systems, because of added expense, lower performance (no XMP ECC ram when I was shopping), and extra research needed. If a consumer oriented motherboard says it supports ECC, that may mean it lets you use ECC ram, but doesn't enable any of the reporting, which would be mostly useless; not completely useless, because used ECC ram from server retirements is sometimes very inexpensive, but it's often fully buffered, which is less likely to work in consumer platforms.


> If a consumer oriented motherboard says it supports ECC, that may mean it lets you use ECC ram, but doesn't enable any of the reporting, which would be mostly useless

That is largely a feature of OS software, not the motherboard. Linux on AMD motherboards which support ECC has reporting. There is no reason why it wouldn't. If the ECC is there and active, Linux can get the information (EDAC).

On Intel though, they flat out fuse ECC off on consumer CPUs for market segmentation reasons. If you want a consumer motherboard with ECC, you go AMD.


> it lets you use ECC ram, but doesn't enable any of the reporting

Is there a specific term that describes this capability? How would I find out if that reporting exists? Only by reading the manual and looking through the BIOS pages?


Yeah, unfortunately I think you have to read through the manual, contact support to confirm, or find a review that specifically addresses it.

In my own experience almost every motherboard will happily accept unbuffered ECC ram and ignore the parity bit. Registered DIMMs, however, will only work with both MCH and BIOS support so I don't expect them to work unless explicitly stated in the specs.

I ran memtest+ for a weekend every 3 months or so on 64GB DDR4 Ryzen system for 2+ years and haven't found anything. Rock solid system, which has been on almost 24/7. No data loss that I can verify with backups.

I think the point about chipkill is salient, why isn't it more widely used if the benefits there are obvious? I think ECC is completely not worth the price for home users if you can do regular backups, run memtest, and compare checksums for any data corruption. Even with errors near the hard/soft boundary, you'll likely catch those by just running memtest for a longer period of time. DRAM errors get progressively worse, so you'll catch any errors that way as well using memtest again.

Datacenters have more surface area which cosmic rays can attack and are also more likely to see weird hard errors which might not have been caught in QC/QA, which is very different from isolated home computers or phones which aren't on 24/7 or can restart easily. If you have a datacenter and have truly sensitive unreplicated data, get chipkill. If you are home user, do regular backups, and run memtest. It's that simple. Hard drives have moving parts, and more failure modes, so pick a file system with checksums. CPUs/GPUs and SSDs might or might not have ECC caches because they have other ways of reducing hard/soft errors in SRAM.

People may disagree, but the study I would like to see is one that takes into account the denser environment of a datacenter vs the isolated home user and all modes of data loss/corruption in each case. We know Google replicates data in GFS, and a home user can do the same with backups.


> I ran memtest+ for a weekend every 3 months or so on 64GB DDR4 Ryzen system for 2+ years and haven't found anything.

Don't assume that that means there are never any errors. DRAM errors often depend on access patterns: while dialing in the speed and timings for my RAM, I initially settled on a configuration that did not report any errors in memtest but later, during heavy usage (e.g. compiling LLVM with make -j64), reported ECC errors.


Testing different access patterns is exactly what memtest does, including attacks like Rowhammer. I also use Gentoo with background tasks like mprime/foldingathome (they can catch memory errors too), and any gcc/llvm errors would've been obvious by now (it's easy to repeatably and reliably test checksums on every compiled package). DRAM errors get progressively worse; replace the module when you find any errors. Simple as that. Focusing on finding errors early and doing backups protects against almost all classes of failures for home PCs.

With a median of 10,000 FIT or lower, it's zero errors every 5-10 years, and DDR4/5 might actually be below that if the 2x-10x rate of improvement seen between DDR1 and DDR2 has continued. You'd have to overlook this important trend to see any benefit from ECC. Yes, there are DRAM errors, but if they can be found by memtest, background scrubbing, and self-tests and don't affect low-use scenarios, then it's meaningless to be talking about ECC for home use.
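
For anyone wanting to check the FIT arithmetic, a minimal sketch. FIT means failures per 10^9 device-hours; the code assumes the 10,000 figure applies to the whole module and that errors arrive as a Poisson process. If the figure were quoted per Mbit instead, the result would scale with capacity and look very different.

```python
import math

HOURS_PER_YEAR = 24 * 365


def expected_errors(fit: float, years: float) -> float:
    """FIT is failures per 10**9 device-hours; return the expected error count."""
    return fit * years * HOURS_PER_YEAR / 1e9


def prob_no_errors(fit: float, years: float) -> float:
    """Probability of zero errors, assuming errors arrive as a Poisson process."""
    return math.exp(-expected_errors(fit, years))


for years in (1, 5, 10):
    print(years,
          round(expected_errors(10_000, years), 3),
          round(prob_no_errors(10_000, years), 3))
```

Under those assumptions the expected count over five years stays below one, but the chance of seeing at least one error in that time is far from negligible.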

Setting speed and timings on ECC modules seems super flaky to me, do you know what speed and timings the separate ECC logic can handle? Can you turn off ECC and test DRAM independently? Maybe what you're seeing is the ECC logic throwing errors and not the DRAM.


How did you track these bit flips and how did you ultimately narrow it down to "weak" RAM?

The memory log has both the bit number and the DIMM "row". They were all coming from the same row, so that was the DIMM I replaced.

I also have ECC memory for my Ryzen system, I’m curious what sticks you use as I really struggled to find mine as they have to be ECC UDIMMs?

One of the things I really like about AMD Ryzen is that they don't reserve ECC just for server chips. It was a major factor in my desktop PC builds.

Yes, the Intel policy to intentionally cripple their cheaper SKUs, hoping to extort money from customers by pushing them to buy more expensive SKUs, is extremely annoying.

For example, right now all the new Tiger Lake CPUs contain a so-called "In-Band ECC" device, which allows error detection and correction even when using LPDDR4x memories, which, unlike DIMMs or SODIMMs, do not have an ECC variant.

This works by storing the ECC codes in a reserved part of the memory, so the increased reliability is paid for with a slight reduction in memory capacity and memory bandwidth.

Nonetheless, those who are risk-averse, like me, would prefer this trade-off and would enable the "In-Band ECC".

However, you cannot do that because Intel disables the "In-Band ECC" feature on all Tiger Lake SKUs, except on 3 SKUs intended for Embedded and Industrial Temperature Range, which are presumably more expensive.

"In-Band ECC" is even disabled on the other 4 Tiger Lake SKUs for the Embedded market, which leaves them without any visible advantage over the normal Tiger Lake SKUs.

When Intel competed only with Intel, these kinds of business decisions probably made money for Intel, but now knowledgeable customers would do better to buy from competitors, e.g. the new Ryzen V2000 for embedded applications, which supports standard ECC memories without problems.


> Yes, the Intel policy to intentionally cripple their cheaper SKUs, hoping to extort money from customers by pushing them to buy more expensive SKUs, is extremely annoying.

I agree that this practice sucks, but to play devil's advocate - is it possible that this is due to binning / yield maximization?


I would think they might salvage a very small proportion of chips with non-working ECC, but the large majority would be completely arbitrary market segmentation: ECC handling is limited to a relatively small area of the die.

That said, Intel's 10nm still seems to be bad enough that binning on it might be worthwhile, and maybe that's why Tiger Lake is even more limited in ECC capability across its SKUs than before. Although we already saw a restrictive move on Comet Lake.

Instead of continuing to multiply their SKUs (I think they had enough even 5 or 10 years ago), Intel should go back to the drawing board and ship interesting microarchitectures on better nodes...


> I agree that this practice sucks, but to play devil's advocate - is it possible that this is due to binning / yield maximization?

Oh definitely not. They disable all kinds of arbitrary features that are totally unrelated to yield.


Yup. Number of cores, cache size, and clock speed are things which make sense for binning to improve yield. Most other features do not (ECC memory support does not require much silicon, the odds of it being damaged while the rest of the chip is OK are too slim for binning to make sense).

I am surprised that ECC is not more popular. RAM errors (SEUs) are common in every platform. One hacker that I worked with almost a decade ago did this BlackHat presentation with some analysis that correlates errors with increased temperature (by geo-locating clients of "bit squatting").

I put together a new SuperMicro server a few months ago and went with 256GB of ECC. Yeah it's very expensive, but it's absolutely worth it if you care about the integrity of your data.

See: https://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabu...

Where are you now Artem?


It's not even that expensive. A 16GB "server" DIMM costs what, $65 at retail? Apple will charge you $200 for 8GB of non-ECC. The real cost on ECC memory is getting a CPU and platform that support it properly.

Four of these weren't cheap. ($1,441.36)

https://www.amazon.com/gp/product/B07YF249FX


> A 16GB "server" DIMM costs what, $65 at retail?

Clocked at 2400MHz, sure. I can't find a single non-sketchy listing for even 3200MHz unbuffered ECC memory, let alone the 3600MHz you want for current Ryzen chips. And those are just the speeds at reasonable prices, ignoring the 4000-5000 range.

At least DDR5 should improve things.


> I can't find a single non-sketchy listing for even 3200MHz unbuffered ECC memory

https://www.crucial.com/memory/server-ddr4/mta18adf2g72az-3g...

> let alone the 3600MHz you want for current Ryzen chips

Current DDR4 JEDEC-approved speeds top out at DDR4-3200. Anything beyond that is technically overclocking, even if it's overclocked from the factory and marketed to run at those speeds. Assuming that your platform supports overclocking DRAM, you can also run DDR4-3200 ECC memory at 3600 MT/s, if you feel that the performance uplift is worth the risk of stability issues.


Good, glad it exists, though it looks like it's only been available for about a year.

I don't care about JEDEC. I care about the XMP spec that's promised to me by the manufacturer, so I can rely on it. I'm not against overclocking myself but if I'm trying to boost it by a significant amount then I have no idea what to set the timings to and there's a good chance it just won't work, with no recourse on my end.

Also that module costs way more than $65.


And the good thing is that you’ll know immediately about stability issues with ECC.

Patrol scrubbing is a RAS feature of only more expensive platforms. On most hosts you are likely to learn about the issue only at the time some program tries to access the marginal page or row.

> Patrol scrubbing is a RAS feature of only more expensive platforms.

All of the ECC-capable platforms I've used (including my AMD-based desktop PCs) have supported patrol scrubbing. What have you used that doesn't?


My AM4 motherboard doesn't offer any options for scrubbing and its manual is vague about whether ECC errors will even be reported to the operating system. By contrast my HPE ProLiant DL360 has firmware settings for scrubbing and I know the event reporting works because I see it in the logs all the time.

The operating system will explicitly tell you at runtime if ECC support is available. If this returns an affirmative value, then ECC is working.

Support for patrol scrubbing is a bit more variable. It's supported in my ASRock motherboard (B450M Pro4). It was also supported in two previous ASUS motherboards (AMD AM2 and AM3 platforms, respectively), although they called it something else that I can't remember. As well, a lack of a setting for patrol scrubbing doesn't necessarily mean that it's unavailable; it could just be enabled but not user-configurable.


This is OS-configurable. Look for /sys/devices/system/edac/mc/mc*/sdram_scrub_rate on Linux.

The motherboard doesn't really get a say on whether the OS can correctly use ECC support on AMD as far as I know. As long as the memory is correctly initialized in ECC mode, the OS should always be able to use its EDAC driver to interact with it.
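
A minimal sketch of reading those EDAC files directly, assuming the usual sysfs layout with `ce_count`, `ue_count` and `sdram_scrub_rate` under each `mcN` directory. Not every driver exposes all of them, so the code treats unreadable files as missing; the function names are mine.

```python
from pathlib import Path

EDAC_MC = Path("/sys/devices/system/edac/mc")


def _read_int(path: Path):
    """Return the integer stored in a sysfs file, or None if it cannot be read."""
    try:
        return int(path.read_text())
    except (OSError, ValueError):
        return None


def read_edac(base: Path = EDAC_MC) -> dict:
    """Return {controller: {"ce": ..., "ue": ..., "scrub_rate": ...}}.

    An empty dict means no EDAC memory controller is registered, which is
    different from a controller that reports zero errors.
    """
    report = {}
    if not base.is_dir():
        return report
    for mc in sorted(base.glob("mc[0-9]*")):
        report[mc.name] = {
            "ce": _read_int(mc / "ce_count"),  # corrected errors
            "ue": _read_int(mc / "ue_count"),  # uncorrected errors
            "scrub_rate": _read_int(mc / "sdram_scrub_rate"),
        }
    return report


print(read_edac() or "no EDAC memory controller found")
```

On a machine without working ECC reporting this should print the fallback message rather than a table of zeros.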


> The motherboard doesn't really get a say on whether the OS can correctly use ECC support on AMD as far as I know.

This isn't entirely true. The motherboard needs to physically support ECC (i.e., traces for a 72-bit memory bus need to be present) for it to work. ECC support can also be disabled within the BIOS firmware, either explicitly via a user-adjustable toggle, or by the motherboard vendor hard-coding the setting to disabled.

In any case, if the OS reports ECC as working, then ECC is fully functional.


Some people claim Gigabyte X570 motherboards in theory have the traces needed for UDIMM ECC, but the (hardcoded) BIOS settings cause the memory to not initialize in ECC mode.

I thought we were talking about correctness, not going shopping for more-error-prone DIMMs.

Well I thought we were talking about getting ECC without major tradeoffs, to understand its [lack of] popularity.

And I wouldn't say faster memory is necessarily more error prone. But it doesn't really matter, because non-ECC products have to have extremely low error rates to be viable. Even if a speed is at the high end of "extremely low", once you add ECC you'll have an exceptionally solid component.


> Well I thought we were talking about getting ECC without major tradeoffs, to understand its [lack of] popularity.

ECC isn't popular on desktops because Intel, which supplies the overwhelming majority of the processors for platforms that use DIMMs, doesn't support ECC at all. As such, the market for people who can actually use ECC in a desktop (much less would want to) is very small. The only reason unbuffered ECC DIMMs are even on the market at all is because low-end servers use them.


I'm not sure where everyone gets the idea that Intel charges a big premium for ECC platforms. If you buy the low-end workstation Xeons the processor costs about the same. For example, Xeon W-1290 (10C, 5.2GHz turbo) has a $494 MSRP. The i9-10850K (also 10C, 5.2GHz turbo) has a $453 MSRP. You also have to buy a motherboard with a workstation chipset to use the Xeon, but W480 chipset motherboards are also priced about the same (~$200) as their consumer Z490 counterparts.

The real reason ECC is not common is because it is perceived as unnecessary for desktop workloads. Software crashes and corrupts data all the time because of bugs, not bitflips. Most consumers care so little about their data that a single hard drive crash will wipe it all out anyways. Why spend 12% extra on memory if consumers hardly see any value from it?


>I'm pro-ECC, but if you don't have regular backups set up, doing backups probably has a better ROI than ECC.

But since you may be backing up a corrupted file for years without knowing it, I'm pro ECC and pro filesystem checksumming.

I learned my lesson with some corrupted sound files. The bad thing: I had to check every single file near that dead block cluster. The good thing: all of them were ripped from my CDs.


So I have 8 sticks of ECC RAM in my desktop. I overclock the RAM, and ECC is nice there since it gives more confidence that the overclock is stable. But more importantly, ECC was really useful once one of the sticks started failing. Not only did the ECC errors immediately point the blame at the RAM (no need to triage random crashes or corruption), it made it clear that all errors were coming from only one of the sticks. Figuring out which stick it was involved some guesswork of the physical -> logical -> Linux utilities' layout mapping, but just replacing that one stick fixed things and the system has been stable since. Definitely glad that I went with ECC for this system.

Plugging in a past question I had related to this [0]: Given that ECC RAM is not abundant, why don't operating systems have self-checks for RAM? Instead we only tend to have near-post mortem checks with memtest86 and the like.

[0] https://news.ycombinator.com/item?id=24642062


Without hardware-level ECC support, how would a memory self-check (that isn't along the lines of a diagnostic like Memtest86) work?

You would have to have some way of detecting that memory is corrupted, which non-ECC platforms do by writing a known pattern to memory and accessing it to verify that the pattern matches. You obviously can't do that during runtime because that memory is being used by the kernel and applications.


The idea in ECCs [1] is that you store some ancillary data that can be used to rectify a bounded amount of error introduced in the stored data. A straightforward virtual memory system implementation would be to have this information stored alongside page tables, and the OS would go through the pages and update their ECCs periodically (probably at a limited rate so as not to cause load spikes). The devil in the details would be the bookkeeping about when a page's contents have been written to and the ECC is invalid (soft page faults?).

[1] https://en.wikipedia.org/wiki/Error_correction_code
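
A toy model of that bookkeeping, to make the idea concrete. It uses a CRC32 per page, so it only detects corruption and corrects nothing; a real design would need an actual error-correcting code and kernel support. The class and method names are hypothetical.

```python
import zlib

PAGE_SIZE = 4096


class CheckedMemory:
    """Toy model: a byte array split into pages, each with a stored CRC32."""

    def __init__(self, num_pages: int):
        self.data = bytearray(num_pages * PAGE_SIZE)
        self.crcs = [self._crc(p) for p in range(num_pages)]

    def _crc(self, p: int) -> int:
        return zlib.crc32(bytes(self.data[p * PAGE_SIZE:(p + 1) * PAGE_SIZE]))

    def write(self, offset: int, payload: bytes) -> None:
        """Legitimate write: update the data, then refresh the touched pages' CRCs."""
        self.data[offset:offset + len(payload)] = payload
        first = offset // PAGE_SIZE
        last = (offset + len(payload) - 1) // PAGE_SIZE
        for p in range(first, last + 1):
            self.crcs[p] = self._crc(p)

    def scrub(self) -> list:
        """Return the pages whose contents no longer match their stored CRC."""
        return [p for p in range(len(self.crcs)) if self._crc(p) != self.crcs[p]]


mem = CheckedMemory(num_pages=4)
mem.write(5000, b"hello")
assert mem.scrub() == []

mem.data[9000] ^= 0x01  # bit flip that bypasses write()
print(mem.scrub())
```

Here `write()` is the only sanctioned path, which sidesteps the hard part: a real OS would have to learn about every store through dirty bits or write-protection faults, and anything that flips between a store and the next checksum update goes unnoticed.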


(forgive errors in my memory here, details may be wrong)

The first generation of UltraSPARC CPUs with 8 MB cache didn't use ECC on that cache. This caused some issues, so the "fix" cut that to 4 MB mirrored and checked. These were the 250 MHz parts and everybody went to 400 MHz new process stuff instead anyway, but it was a bit of a thing at the time.

However, the fuss meant that generation of CPU hit ebay cheap and in volume, which I was quite grateful for.


I have an Intel C226 chipset with an i3 CPU, ECC RAM, and Linux, but as far as I can tell, there is no way to determine whether ECC is actually working.

I've heard of people collecting known-bad DIMMs for this purpose, or trying to blast their RAM with heat or radiation.

Why does this have to be so difficult?


If this is the datasheet for the C226 (https://www.intel.com/content/www/us/en/products/docs/chipse...) and you have an i3-8100 or equivalent that does support ECC, then the BIOS will configure how it wants ECC errors reported. On my Ryzen board they are reported via SMI events, and there is a chip that you can talk to on the SMBus that will tell you how many events it has seen since the log was reset.

In my experience getting the information isn't "difficult" so much as it isn't "common". On my FreeNAS device (which has the Xeon) it also has an IPMI controller which lets me ask it over the network how many ecc errors have been seen. (and it can act like a remote console as well but we'll leave that for another time :-))

So find out how the motherboard is tracking ECC, then find the tool that talks to that system to ask it for the logs of ECC errors.


Well, non-ECC RAM can report "0 errors" for less money.

The problem is how to determine whether your system is actually capable of detecting/logging errors end-to-end. There needs to be a standard way to inject an ECC error and see what happens.


There are also BIOS setups that "mask" soft or correctable errors since nothing bad happens when they correct the problem. I've always preferred to not mask those errors as it was always a good thing to note on server hardware on clusters that if their soft error correction rate started rising then a DIMM was going to need to be replaced.

> Well, non-ECC RAM can report "0 errors" for less money.

"0 errors" is very different from "ECC not available," and that's a distinction I'd expect the operating system to be able to make.

For example, here's how Linux shows ECC and non-ECC memory reports:

    # ECC
    $ sudo edac-util -v
    mc0: 0 Uncorrected Errors with no DIMM info
    mc0: 0 Corrected Errors with no DIMM info
    edac-util: No errors to report.

    # Non-ECC
    $ sudo edac-util -v
    edac-util: Error: No memory controller data found.

Overclock it one step at a time until it gets mad.

It doesn't have to be that difficult. A crude yet effective method is to overclock the RAM beyond a reasonable frequency and a handful of correctable errors should pop up in no time.

In case the OS/BIOS does not support overclocking or fails to report memory errors properly, certain data pins of the DIMM slot could be physically shorted with metal wire to add a persistent stuck bit. Without ECC the system will not boot. I've seen it done on multiple occasions but web search is drawing a blank for the moment. IIRC it was being discussed a lot when first-gen Ryzen CPUs came out and people were arguing left and right whether ECC support is present on motherboard X.

Finally some memtest software claims to be able to inject errors on a software level. I can't vouch for them as I have very little experience but it would be very simple to find out.


You can also play with voltage settings if those are more accessible than speed and timing.

Or play with rowhammer tests.


As far as I know Intel doesn't support ecc on their consumer platforms. You need a xeon for that.

Intel supports ECC on i3, but not i5-i7. At least that was the case for the Haswell generation; I haven't checked recently:

https://ark.intel.com/content/www/us/en/ark/products/77495/i...


i3-8100 cpus support ECC

See the comment from viraptor with the link to edac-utils.

Apple should spec an ECC configuration for their M1 MacBook Pros. They currently want $200 for an extra 8GB on their M1 Air and 13" Pro. I'd gladly upgrade to the Pro and pay that $200 for an extra 8GB if the 16GB was ECC. That'd be $1499 for a 13" Pro. ECC would be the tipping point. Otherwise, $999 for an Air will do.

Been using a home-built workstation for over a decade now with 64GB ECC RAM (high performance sticks for the time). It's one of three such PCs in service here. Over the years a couple modules developed problems and were replaced. Glad for the ECC. Those errors might have manifested as silent corruption issues instead.

Yes, I also use only ECC memories in all my computers for the same reason.

When the modules are new, memory errors are very rare, if they occur at all.

Nevertheless, after several years of use, I had many cases where some DIMMs suddenly began to have frequent errors. Being notified by ECC allowed me to replace the modules (or the motherboards if they were very old) before they caused unrecoverable data corruption.


It’s not like there’s an outrageous price difference between the two anymore. If your server can take ECC, might as well use it.

ECC implies error correcting. That implies more extra bits to store Hamming codes.

But you can have far fewer redundant bits if you only want to detect errors (parity). In the worst case the computer reboots. But you must plan for this anyway, because hardware can die for many reasons (power failures, component failures, etc.).
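
To make the detect-versus-correct distinction concrete, here is a small sketch using the textbook Hamming(7,4) code next to a single parity bit. Real memory uses wider codes; this is only an illustration, and the function names are mine.

```python
def hamming74_encode(d: list) -> list:
    """Encode 4 data bits as 7 bits; positions 1, 2 and 4 hold parity."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(c: list) -> tuple:
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome = 1-based index of the flipped bit
    if pos:
        c[pos - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], pos


def parity_ok(bits: list) -> bool:
    """Single even-parity check: detects an odd number of flips, corrects none."""
    return sum(bits) % 2 == 0


data = [1, 0, 1, 1]

word = hamming74_encode(data)
word[4] ^= 1  # flip one bit "in memory"
print(hamming74_decode(word))  # data recovered, plus where the flip was

with_parity = data + [sum(data) % 2]
with_parity[2] ^= 1
print(parity_ok(with_parity))  # the flip is noticed, but not located
```

Parity costs one extra bit and can only say "something is wrong"; the Hamming code costs three extra bits for four data bits and can also say which bit to fix.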


In my field, what's interesting is how ECC memory corrections are going from being considered abnormal - an indicator you have faulty hardware and should replace it - to totally normal and an absolute must-have part of your implementation.

A sibling commenter says he gets one bit flip per month on his computer. I don't know about you, but I don't like the prospect of my computer randomly rebooting once a month.

There's a long way from a single bit flip to a reboot. You're not likely to even notice a single event. For example it could be any bit of a single pixel on the screen. Or one letter in all the text displayed over a month. Even if it happened in the code, you'd have to hit some function in the kernel which has any impact at all, instead of for example sending a bad packet.

>For example it could be any bit of a single pixel on the screen.

Yeah or one bit in your checksum...imagine the fun :)


Does the RAM publish the fault/correction count on any accessible register/counter? It would be great if we could just query that on existing systems and have the simple answer - does ECC do anything for us.

Quick search later - yes, you can: https://linux.die.net/man/1/edac-util

For all the discussion I've seen about this topic, I wish more people actually published their numbers.


The RAM doesn't publish anything because it isn't doing any ECC. ECC memory is just wider memory (72 bits instead of 64 bits per row). The ECC itself is performed by the memory controller, which makes sense, because doing ECC in the memory would leave the memory bus itself unprotected.
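
As a side note on where 72 comes from: assuming the extra bits hold a Hamming code extended with one overall parity bit (SECDED, which is common but not something this thread confirms for any particular controller), the count of check bits can be computed like this.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Add one overall parity bit for double-error detection."""
    return hamming_check_bits(data_bits) + 1


for width in (8, 16, 32, 64):
    extra = secded_check_bits(width)
    print(width, extra, f"{100 * extra / width:.1f}%")
```

For 64 data bits this gives 8 check bits, i.e. 72 bits in total and a 12.5% overhead; the relative overhead shrinks as the protected word gets wider.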

I'm actually hesitating, my freenas has ECC ram (it's a nas and it'd be stupid not to) but I'm building a desktop around a ryzen chip that will run pfsense as well as multiple vms.

I wanted to buy ECC but I have a hard time finding 2x32GB of DDR 3600, no XMP and I don't care about the data on that desktop as much as I care about the data on my nas.


JEDEC specifications for DDR4 top out at DDR4-3200, and anything beyond that is vendor overclocking with whatever settings that particular vendor deems is suitable.

While there's nothing technically preventing vendors from factory-overclocking ECC memory the same way, ECC's (perceived) main audience is reliability-critical systems, and so it doesn't make sense for vendors to market a feature that their target audience wouldn't appreciate.

What you can do is get two 32 GB DDR4-3200 memory modules (example: https://www.crucial.com/memory/server-ddr4/mta18asf4g72az-3g...), and then overclock them yourself to 3600 MT/s. As with any overclock, your results may vary, but that's the case with factory overclocking as well.


RAM definitely does flip bits more often than might be intuitively obvious. Most people who have worked at places with a large physical server deployment have stories, and there's Rowhammer, but I think my favorite is the paper on DNS bitsquatting. You can see the statistics on flipped bits on normal PCs.
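
For anyone unfamiliar with bitsquatting: the idea is to register domain names that are one flipped bit away from a popular name and see who shows up. A minimal sketch of generating the candidates, assuming a lowercase ASCII name; the function name is mine.

```python
import string

# Characters allowed in a hostname label. Uppercase is left out on purpose:
# DNS is case-insensitive, so a case flip still reaches the real domain.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(domain: str) -> list:
    """All names that differ from `domain` by one flipped bit in one character."""
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # leave label separators alone
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in ALLOWED:
                variants.add(domain[:i] + flipped + domain[i + 1:])
    return sorted(variants)


print(bitsquat_variants("cnn.com"))
```

The sketch does not filter out labels that start or end with a hyphen, which would not be registrable.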


ECC RAM is also important if you want to run OpenZFS, as a lot of its speed comes from caching previously read data in RAM. If there is a RAM error, that error will be propagated back to disk, and OpenZFS currently has no way to prevent this from completely destroying all of your data.

It's a damn shame Intel has kept ECC out of the hands of regular folks. AMD lagged for a long time as well, but apparently newer Ryzens support it. Is there one for laptops?

What about Apple? Will they support it now they're unshackled from Intel?


AMD has supported ECC for a pretty long time. My 11-year-old AMD consumer desktop sure supports ECC.

It's just Intel that doesn't.


I've read that before, but I remember folks in similar threads say they were never able to confirm it actually working.

I can confirm that ECC DRAM worked as expected on my current AMD Ryzen 9 3900X, an AMD Phenom II x6 1090T that I had in a previous build, and an AMD Opteron 185 that I had in a build before that. The operating system detects the error-correction capability in the memory controller, and errors are properly detected (tested by reducing the memory timings slightly below stable values).

I don't think there's a mass-market laptop with ECC support, but AMD's desktop CPUs have supported ECC for as long as I can remember. That being said, it's been an optional feature that is up to the motherboard maker to implement, and not all do.


Most so-called "mobile workstation" laptops with 45-W H CPUs, from Dell, Lenovo or HP, support ECC (if you choose an H Xeon CPU instead of the equivalent Intel Core CPU).

I have a Dell Precision laptop from 2016, with a Skylake Xeon and ECC memory.


Yep, and how was the price difference and heat generation? It’s why I’ve shied away from xeon laptops in the past.

Isn't ECC also slower than non-ECC DRAM? Also, I'm not aware of any overclockable Intel CPU/motherboard combos that even support ECC RAM.

I think that's a side effect of product segmentation. ECC RAM tends to be marketed for server applications, so they aim to be JEDEC compliant and call it a day. JEDEC speeds tend to be slow by today's standards. Meanwhile non-ECC RAM is marketed to consumers, specifically gamers who want the highest performance part.

This is a limitation of Intel's platform, not a limitation inherent in ECC DRAM. You can overclock them just fine on AMD platforms.

That being said, unbuffered ECC DRAM, all other things being equal, probably won't overclock as high since the extra memory chip places a slightly higher load on the memory controller than equivalent non-ECC memory. In that sense, you could consider it "slower."


Registered ECC memory is slower, but unbuffered ECC memory is not.

I don’t use ECC in my home server because ECC-supported motherboards are still more expensive than their regular counterparts, and ECC RAM is also more expensive. And of course this is only if your CPU supports it, which is basically not any prosumer Intel CPU.

Now that AMD CPUs are taking a big market share, hopefully we see a better market for ECC hardware.


Unbuffered ECC memory is definitely more expensive, but ECC support on AMD motherboards is common. I'm currently running ECC on an ASRock B450M Pro4, which was one of the least expensive AMD Ryzen motherboards available at the time I purchased it (without going to the very down-market A320 series).

If you’re building new, then, yeah, getting a board and CPU with ECC support will cost you. This is why most homelabbers buy server gear that’s been decommissioned; my R3/4/520’s with Ivy Bridge-EN chips may not be top of the line, but they’re affordable.

(2015)

If every system had ECC then we would lower the probability of a virus spontaneously appearing in memory. Is our data security worth lowering the probability of computer genesis?

I guess it depends what you use the server for. If you are a bank dealing with bank account balances you don't really want a bit to be flipped even once a year. If one character in a web page index is flipped in google's indexation service, who cares.

>If one character in a web page index is flipped in google's indexation service, who cares.

>>When Google used servers without ECC back in 1999, they found a number of symptoms that were ultimately due to memory corruption, including a search index that returned effectively random results to queries.


Yeah. Who cares. Out of the billions of search queries, a handful will return irrelevant results.

Google obviously cared. Since Google has everything on one platform, who cares if your AdSense shows wrong numbers? Your Gmail, Photos, Drive queries... who cares, right?

Oh, and not to mention debugging: can you trust your program or your hardware? Which one made the wrong query? How do you find out which node needs to be restarted?


If it's a very rare event you might not care. Look at how Google applies security rules. They don't mind closing a few accounts unfairly if it means they can automate the whole process. I can't imagine they would care about flipping a few pixels in a handful of the hundreds of millions of photos they host.

Yeah, I'll stop here; you obviously have no clue since you compare apples to...nothing at all.

This is one of my favorite talks I’ve ever watched. https://youtu.be/yQqWzHKDnTI

It outlined how a bit flip at Google created a huge security risk that an outside attacker could exploit.



