ECC matters (realworldtech.com)
1053 points by rajesh-s on Jan 3, 2021 | 550 comments



I still remember Craig Silverstein being asked what his biggest mistake at Google was and him answering "Not pushing for ECC memory."

Google's initial strategy (c. 2000) around this was to save a few bucks on hardware, get non-ECC memory, and then compensate for it in software. It turns out this is a terrible idea, because if you can't count on memory being robust against cosmic rays, you also can't count on the software being stored in that memory being robust against cosmic rays. And when you have thousands of machines with petabytes of RAM, those bitflips do happen. Google wasted many man-years tracking down corrupted GFS files and index shards before they finally bit the bullet and just paid for ECC.


>I still remember Craig Silverstein being asked what his biggest mistake at Google was and him answering "Not pushing for ECC memory."

Did they (Google) or he (Craig Silverstein) ever officially admit it on record? I did a Google search and the results that came up were all on HN. Did they at least put out a few PR pieces saying that they are using ECC memory now? I don't see any when searching. Admitting they made a mistake without officially saying it?

I mean, the whole "servers or computers might not need ECC" insanity was started entirely because of Google [1] [2], with news and articles published even in the early 00s [3]. After that it spread like wildfire and became a commonly accepted fact that even Google doesn't need ECC. Just like "Apple uses custom ARM instructions to achieve their fast JS VM performance" became a "fact". (For the last time, no they didn't.) Proponents of ECC memory have been fighting this misinformation like mad for decades, to the point of giving up and only ranting about it every now and then. [3]

[1] https://blog.codinghorror.com/building-a-computer-the-google...

[2] https://blog.codinghorror.com/to-ecc-or-not-to-ecc/

[3] https://danluu.com/why-ecc/


Your [3] has a footnote quoting a Google book that reads "Modern DRAM systems are a good example of a case in which powerful error correction can be provided at a very low additional cost... The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM"


Well, they made a big deal and lots of PR out of not using ECC, and only quietly added ECC back, listing it in a book that barely anyone reads.


The fact that ECC isn't the default across everything is a failure of human cognition and Capitalism.


I have always been angry that ECC was treated as an "enterprise" feature and increased the price way more than it should have.


It's treated as enterprise because most regular users had no realistic reason to care about it. The practical consequences of not having this on your phone or home PC are completely invisible to the vast majority of people. Because of this, and the fact that in the consumer space price matters most, the situation is quite expected.


Users most definitely want it. They just aren't technically literate enough to know about it.

Hello user, for an extra 10 dollars, would you like to guarantee that cosmic radiation never affects your computing experience, including having to reformat, reinstall, or otherwise reboot your phone/computer randomly when one day something just doesn't work anymore?

Was it an update? Was it cosmic radiation? Was it a bad capacitor? Who knows!

Why do you think so many problems are solved by rebooting? Sure 99 out of 100 might be software bugs, but the other 1 out of 100 is straight up cosmic radiation, and that 1 percent is growing more and more every year as software becomes more robust and bug free with better tooling.


Indeed, users don't know enough to care but more than that, for most users this is simply not a problem because they haven't been practically affected. Most may never experience a bit flip, and most of the flips won't be obvious anyway (e.g. no visible effect).

So your hypothetical dialogue will sound more like a scam to a regular user. You're charging money to tackle a problem they don't have or see, only addressing a single one of the root causes that trigger that same result, and in the end you're not even completely fixing it, just reducing the already infinitesimal odds it happens.

It will become mainstream when manufacturers just decide to include it everywhere and not really give the user a choice. Apple is a prime candidate for a company with enough clout to afford this.


My point is, they have been affected. I think most people have seen bitflips.

>2009 Google's paper "DRAM Errors in the Wild: A Large-Scale Field Study" says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8GB of RAM after my calculations. Paper says the same: "mean correctable error rates of 2000–6000 per GB per year".

https://stackoverflow.com/questions/2580933/cosmic-rays-what...

I believe bitflips are far more common than people realize. I see weird shit all the time from my users that is only explainable by bizarre software bugs or bitflips.
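For anyone who wants to sanity-check that conversion, here is the back-of-the-envelope arithmetic (a rough sketch, assuming FIT means failures per 10^9 device-hours and that the paper's per-Mbit rate scales linearly with capacity):

```python
# Rough sanity check of the quoted figures (assumption: FIT = failures per 1e9
# device-hours, and the per-Mbit rate scales linearly with memory size).
fit_per_mbit_range = (25_000, 75_000)   # one-bit FIT per Mbit, from the 2009 paper
mbits = 8 * 8 * 1024                    # 8 GB of RAM expressed in megabits (65,536 Mbit)

for fit in fit_per_mbit_range:
    errors_per_hour = fit * mbits / 1e9
    print(f"{fit} FIT/Mbit -> {errors_per_hour:.1f} bit errors/hour in 8 GB")
# 25000 FIT/Mbit -> 1.6 bit errors/hour in 8 GB
# 75000 FIT/Mbit -> 4.9 bit errors/hour in 8 GB
```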


Bit flips are reasonably common, but mean is of very little use when one stick can throw no errors and another stick can throw thousands. A more useful quote is "About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year."


But behringer, regular users don't see bitflips, they see "a random error", a random reboot, a random green pixel in a photo. That's if they even consciously observed anything at all. You yourself said those could have a dozen other root causes. I for one have not actually observed a bitflip (so something that I can actually consciously detect as an issue) in a decade. No unexpected reboots, no unexpected file corruption, etc., nothing that made me say "weird, must be some memory corruption".

Don't get me wrong, I'm all for ECC memory; I practically demonstrated the value of ECC by intentionally running the same multi-day Matlab job several times on a non-ECC machine and getting different results, just to prove this very point. But you use your deeper knowledge to assume what someone with shallow knowledge and interest wants or needs, and that will almost always be off the mark. Your car does not have a roll cage. Normal computers do not have ECC memory.


"Your car does not have a roll cage" No but the crash structures are one of the areas where we are absolutely trying to recreate the effect of a rollcage without the weight. The weight putting constraints on fuel efficiency is the big issue there.

For ECC, there's no real reason to expect that adding it to everything would make things too expensive or slow long term. For a long time the reason desktops did not have ECC was pretty much that Intel wanted people who really need ECC to buy Xeons.


> No but the crash structures are one of the areas where we are absolutely trying to recreate the effect of a rollcage without the weight

Exactly like how we try to recreate the effect of ECC hardware but without its "weight" (cost, complexity). For a problem that's not nearly as visible or life and death.

> too expensive or slow long term

Not "too" expensive long term still means more expensive especially short term, like when the person buys it. For an issue no average consumer is actually frustrated about. Why are we still debating why those consumers don't care about ECC? How is it different from hot swappable RAM for the average Joe? They'll upgrade or expand their RAM more often than they'll get frustrated about the effect of bitflips. So why not have hot swappable memory? Because as much as I'd love hot swappable anything, regular consumers neither care about it nor want to pay for it.


>So your hypothetical dialogue will sound more like a scam to a regular user. You're charging money to tackle a problem they don't have or see,

I don't know, the same argument could be made with ECC swapped for surge protectors, and people still purchase those.


I feel this way about all software.


[flagged]


Just because capitalism fails to provide something does not mean that communism would solve the problem, or that criticism of capitalism as a system is uncalled for. This is not some kind of binary capitalism/communism world we live in.


I'm just struggling to find the link between capitalism and ECC memory.


It's the act of maximizing profits by segmenting the market that creates artificially bad products, which end up being the mainstream because of the price difference. If the companies responsible for making these decisions were optimizing just a tad bit more for practicality and usability rather than just profit, ECC would be the standard and that would be the end of that. Similar to how a lot of luxury and non-luxury cars will have seemingly essential and cheap equipment only available as part of an expensive options package - the cost of producing and installing the extra equipment is a mere fraction of the total package cost, and a large majority will end up buying the package anyway (and 100% would if money was no object). One might argue that this practice is part and parcel of modern industrial markets, and is almost fair - prices are arbitrary and consumers get to pay for the value they get out of something, there is no optimal, moral way to determine what someone's margins should be, yadda yadda. In my opinion, the market will usually settle on acceptable margins for a given product and/or service, but with two-thousand-dollar parking sensors and double-the-price ECC memory, under artificial segmentation, I'm not paying for the manufacturing and delivery of a product with a feature I want; I'm paying for the opportunity to have the product with the features I want, plus the manufacturing and delivery.


> double-the-price ECC memory

I was buying DDR4 RAM last week, and a 4x8GB ECC+Registered DDR4-3200 kit for a Xeon W build I'll be doing soon was about 5% more expensive than an otherwise identical non-ECC kit for a Core i7 rig - which also came with a tacky RGB LED heatspreader - about $230 in total for each of them.

The fact the pricing was so similar does make me wonder about the claims that non-ECC RAM is really ECC RAM just without the CPU/MMU being made aware of it - I think that’s possible if the RAM is already Buffered/Registered.


Oh wow, the prices have normalized a bit - about 18 months ago I had to pay about twice the money to get 64 gigabytes of ECC goodness.


> The fact the pricing was so similar does make me wonder about the claims that non-ECC RAM is really ECC RAM just without the CPU/MMU being made aware of it - I think that’s possible if the RAM is already Buffered/Registered.

Who claims that? And can't you just count the chips?


Market segmentation.


There is no link. Segmentation of ECC support was an Intel decision which other chip manufacturers have copied to a greater or lesser extent. For what it's worth, ECC support is spotty in Chinese-designed chips as well.

Intel's decision could just as well have been made by a socialist government's regulatory body (in the form of "minimum requirements for consumer chip" legislation), or by a committee of decision makers in an employee-owned company. We may never know, because we only have 2-4 data points and they're all basically capitalist.

Honestly, I'm not sure it was intended to be a serious comment. It could just as easily have been "...failure of human cognition and Obama."


Recent advances have blurred the lines a bit. The ECC memory that we all know and love is mainly side-band ECC, with the memory bus widened to accommodate the ECC bits driven by the memory controller. However, as process sizes shrink, bit flips become more likely, to the point that now many types of memory have on-die ECC, where the error correction is handled internally on the DRAM modules themselves. This is present on some DDR4 and DDR5 modules, but information on this is kept internal by the DRAM makers and not usually public.

https://semiengineering.com/what-designers-need-to-know-abou...

There has been a lot of debate regarding this that was summarised in this post -

https://blog.codinghorror.com/to-ecc-or-not-to-ecc/


> This is present on some DDR4 and DDR5 modules, but information on this is kept internal by the DRAM makers and not usually public.

On-die ECC is going to be a standard feature for DDR5. I'm not aware of any indication that anyone has implemented on-die ECC for DDR4 DRAM, and Hynix at least has made clear statements that on-die ECC is new for their DDR5 and was not present in their DDR4.


https://scholar.google.com/scholar?cluster=47120328903164805...

On-die ECC DRAM is already here with us.


Figure this is as good of a time as any to ask this:

There are various other DRAMs in a server (say, for disk cache). Has Google or anyone who operates at a similar scale seen single-bit errors in these components?


This is as old as computing and predates Google.

When America Online was buying EV6 servers as fast as DEC could produce them, they used to see about 1 double-bit error per day across their server farm that would reboot the whole machine.

DRAM has only gotten worse--not better.


The supercomputing community has looked at some of the effect on different parts of the GPU.

https://ieeexplore.ieee.org/abstract/document/7056044


Yes.

Bit flips (for all reasons) occur in buses, registers, caches, etc. Anything that has state can have state changed incorrectly.

This is why filesystems like ZFS exist and storage formats have pervasive checksums.


New Yorker article that credits Jeff Dean and Sanjay Ghemawat with discovering the company’s bitflip issue:

https://www.newyorker.com/magazine/2018/12/10/the-friendship...


I remember reading how someone registered some google domains with a single bit flipped, and saw actual requests coming to them.


If you or anybody can remember the source article, that sounds like an interesting read!

Edit: found one with a quick search. https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...

And https://www.researchgate.net/publication/262273269_Bitsquatt...


Here's the original paper: http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabur...

But I'm not entirely convinced that most of these requests weren't just typos. Though the requests with a mismatched Host header surely were true bitflips.


There was some follow up work as well: http://blog.dinaburg.org/2013/09/bitsquatting-at-defcon21-an...

And I am beyond certain they were not typos; the requests were for long URLs that no one would be typing by hand. A few were for services that humans never enter urls into (like the Windows crash reporter).


Also here: https://www.youtube.com/watch?v=gXY3jm34RFU

I can't find it right now, but I also saw someone doing this with some version of Google Now (or whatever it was called back then) and serving different assets into it.


A long time ago I did that with "CDN" domains. I bought ~10 of them (variations of fbcdn, akamai, and ytimg). I _did_ see some traffic (some hits per hour if I remember well), and many of them were from cheap handheld phones (from the user-agent).


That works for any domain that's busy enough.

Random bit flips happen on client machines and on routers.

If there are enough requests for a domain name, some of those requests will be subject to one of those bit-flips.


That might just be typos in some cases?


I mean, early on, sure, at a startup where you're not printing money I can see how saving on hardware makes sense. But surely you don't need an MBA to know that hardware will continue to get cheaper whereas developers and their time will only get more expensive: better to let the hardware deal with it than to burden developers with it ... I'd have made the case for ECC but hindsight being what it is ...


But if you can save $1M+ now, then throw the cost of fixing it onto the person who replaces you, why do you care? You already got your bonus and jumped ship.


Such a short sighted incentivisation structure harms the business in the long run I think.


One of the best quotes in the Google quotes file an early Googler maintained (I am sure I am screwing it up):

“I’ve heard of defensive programming, but never adversarial memory.” — Ben Gomes


Close!

> I've never thought of defensive programming in terms of adversarial memory.


ECC memory can't eliminate the chances of these failures entirely; they can still happen. Making software resilient against bitflips in memory seems very difficult though, since they affect not only data but also code. So in theory the behavior of software under random bit flips is, well... random. You would probably have to use multiple computers doing the same calculation and then take the answer from the quorum. I could imagine that doing so would still have been cheaper than using ECC RAM, at least around 2000.

Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC RAM is the opposite: you make it so unlikely to happen that you will generally not encounter these errors at scale anymore, but nonetheless they can still happen, and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and sweep it under the rug.

Another interesting side effect of quorum is that it also makes certain attacks more difficult to pull off, since now you have to make sure that a quorum of machines gives the same "wrong" answer for an attack to work.
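To make the quorum idea concrete, here is a toy sketch of just the voting step (the repeated calls stand in for independent machines; the names and structure are illustrative, not any real system):

```python
# Toy sketch of quorum voting: run the same computation on several "machines"
# (here just repeated function calls) and accept the answer with a majority.
from collections import Counter

def run_with_quorum(compute, args, replicas=3):
    """Run `compute` on several independent replicas and return the majority answer."""
    results = [compute(*args) for _ in range(replicas)]   # in reality: separate hosts
    answer, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError("no quorum: replicas disagree")  # several replicas corrupted
    return answer

# A bit flip that corrupts one replica's result gets outvoted by the other two.
print(run_with_quorum(lambda xs: sum(xs), ((1, 2, 3),)))   # 6
```

Of course, whatever runs the comparison step is itself sitting in memory somewhere, which is exactly the objection raised further down the thread.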


I don't think ECC is going to give anyone a false sense of security. The issue at Google's scale is they had to spend thousands of person-hours implementing in software what they would have gotten for "free" with ECC RAM. Lacking ECC (and generally using consumer-level hardware) compounded scale and reliability problems or at least made them more expensive than they might otherwise have been.

Using consumer hardware and making up reliability with redundancy and software was not a bad idea for early Google but it did end up with an unforeseen cost. Even a thousand machines in a cosmic-ray-proof bunker will end up with memory errors that ECC would correct for free. It's just reducing the surface area of "potential problems".


consumer hardware...

That's Intel's PR. Only "enterprise hardware", with a bigger markup, supports ECC memory. Adding ECC today should add only 12% to memory cost.

AMD decided to break Intel's pricing model. Good for them. Now if we can get ECC at the retail level...

The original IBM PC AT had parity in memory.


When I said consumer hardware I was meaning early Google literally using consumer/desktop components mounted on custom metal racks. While Intel does artificially separate "enterprise" and "consumer" parts, there's still a bit of difference between SuperMicro boards with ECC, LOM features, and data center quality PSUs and the off the shelf consumer stuff Google was using for a while.

I don't know if AMD really intended to break Intel's pricing model. Their higher end Ryzen chips you'd use in servers and capital W Workstations don't seem to have a huge price difference from equivalent Xeons. Even if they're a bit cheaper you still need a motherboard that supports ECC so it seems at first glance to be a wash as far as price.

That being said if I was putting together a machine today it would be Ryzen-based with ECC.


> Now if we can get ECC at the retail level

You can actually, most AMD consumer chips (except the ones with integrated graphics) have ECC support, even though it's not officially supported. See this Reddit thread for more details: https://www.reddit.com/r/Amd/comments/ggmyyg/an_overview_of_...


Small nit: the PRO versions of Ryzen APUs do support ECC [0]. Also, ASRock has been quoted saying that all of their AM4 motherboards support ECC, even the low-end offerings with the A320 chipset.

[0] https://www.asrock.com/mb/amd/a320m-hdv%20r4.0/index.asp#Spe...


The CPU chip can do it. Some motherboards bring out the pins to do it, but they're often called "workstation" boards and cost 2x the price of a standard desktop motherboard. ECC memory itself is overpriced. $60 for 16 GB DDR4 without ECC, $130 for 16 GB DDR4 with ECC.

This is the legacy of Intel's policies.


Because if you give consumers a choice between having ECC or LEDs on otherwise identical boards with identical price, most will go for the LEDs. In reality the price isn't even the same because ECC realistically adds to the BOM (board, modules) more than LEDs do. So the price goes up with seemingly no benefit for the user.

As such features that are unattractive to the regular consumer go into workstation/enterprise offerings where the buyer understands what they're buying and why.


> Because if you give consumers a choice between having ECC or LEDs on otherwise identical boards with identical price, most will go for the LEDs.

Citation needed.

I would bet that your typical ram-purchasing consumer is not seeing or even considering the existence of the ECC model.

> ECC realistically adds to the BOM (board, modules) more than LEDs do. So the price goes up with seemingly no benefit for the user.

LEDs are a great opportunity to increase profit margin, so I'm not sure about your price conclusions.


That's pretty much exactly my conclusion

> Citation needed.

It really isn't. It was a hypothetical choice between 2 models, with ECC or LEDs, at the same price. Hypothetical because most boards don't offer the ECC support at all, and certainly not at the same price.

> LEDs are a great opportunity to increase profit margin, so I'm not sure about your price conclusions

You confused manufacturing costs, price of the product, and profit margins. LEDs cost far less to integrate than ECC but command a higher price premium (thus better profit margins) from the regular consumer. Again supporting my statement that even if presented with 2 absolutely identical parts save for ECC vs. LEDs the vast majority of consumers will go for LEDs because they don't care or know about ECC.


> It really isn't. It was a hypothetical choice between 2 models, with ECC or LEDs, at the same price. Hypothetical because most boards don't offer the ECC support at all, and certainly not at the same price.

You're making a claim about what people would choose. If you have no related data, and logic could support multiple outcomes, then a claim like that is basically useless.

> You confused manufacturing costs, price of the product, and profit margins.

I'm not sure why you think this.

> Again supporting my statement that even if presented with 2 absolutely identical parts save for ECC vs. LEDs the vast majority of consumers will go for LEDs because they don't care or know about ECC.

Sure, if you don't tell them that it's ECC they won't pick the ECC part.

If you actually do a fair test, and put them side by side while explaining that one protects them from memory errors and the other looks cooler, you can't assume they'll all pick the LED.

When people never even think of ECC, that is not evidence that they wouldn't care or know about it in a head-to-head competition.


> You're making a claim

My claims are common sense and supported by the real life: regular people don't know what ECC is, and those who do find the problem's impact is too minor to get palpable benefits from fixing it. Why are you being pedantic if you aren't actually going to bring arguments at the same level you expect from me?

> If you actually do a fair test, and put them side by side while explaining that one protects them from memory errors and the other looks cooler, you can't assume they'll all pick the LED.

Isn't this exactly the kind of claim you yourself characterize one paragraph above as "useless" because "you have no related data, and logic could support multiple outcomes"? Sure, if people were more tech-educated then my assumption might be wrong. But people aren't more educated so...

The benefits of LEDs are hard to miss (light) all the time. The benefits of ECC are hard to observe even in that fraction of a percent of the time. Human cellular "bitflips" happen every hour but they don't visibly affect you so you also consider it's not an issue that demands more attention, like constant solar protection. People aren't keen on paying to solve problems they never suffered from, or even noticed, especially when you tell them they happen so often still with no obvious impact. Unless they have no choice, like OEMs selling ECC RAM only devices.

Sell me ECC memory when my (actual real life) 10 year old desktop or 5 year old phone never glitched. Sell me ECC RAM when my Matlab calculations come back different every time. See the difference?

> When people never even think of ECC, that is not evidence that they wouldn't care or know about it in a head-to-head competition.

Well then, I guess none of us has any evidence except today people buy LEDs not ECC RAM. Educate people or wait until manufacturing process and design are so susceptible to bitflips that people notice and it will be a different conversation.


> My claims are common sense and supported by the real life: regular people don't know what ECC is, and those who do find the problem's impact is too minor to get palpable benefits from fixing it. Why are you being pedantic if you aren't actually going to bring arguments at the same level you expect from me?

Regular people aren't given the choice! The things you're quoting about the real world to support your argument are incompatible with a scenario where someone is actually looking at ECC and LED next to each other. And I'm not being "pedantic" to say that, it's a really core point.

> Isn't this exactly the kind of claim you yourself characterize one paragraph above as "useless" because "you have no related data, and logic could support multiple outcomes"?

A claim of a specific outcome is useless. "you can't assume" is another way of phrasing the lack of knowledge of specific outcomes.

> Sure, if people were more tech-educated then my assumption might be wrong. But people aren't more educated so...

It's the kind of thing that can go on a product page. But first someone has to actually make a consumer-focused sales page for ECC memory, and the ECC has to be plug-and-play without strong compatibility worries.

And just like when LEDs spread over everything, it's something that you can teach people about and create demand for with a bit of advertising.

> Sell me ECC memory when my (actual real life) 10 year old desktop or 5 year old phone never glitched. Sell me ECC RAM when my Matlab calculations come back different every time. See the difference?

That's a clear picture of one person. But "never glitched" is a very dubious claim, and you can't blindly extrapolate that to how everyone would act.


I think they may have been referring to the actual mainstream retail availability of ECC RAM. I can buy non-ECC RAM at almost any retailer that sells computers. If I need non-ECC RAM right now I can have it in my hands in 30 minutes. ECC on the other hand I pretty much have to buy online. Microcenter stocks a single 4GB stick of PC4-21300, and I can't think of a single use case where I'd want ECC but not more than 4GB.


You're right, rereading the parent post with that angle makes it clearer that they were complaining about the unavailability of memory and other hardware.

It would definitely be great to have more reliable hardware generally available and at less of a price premium.


Yes. I'd pay an extra 25%, but not an extra 110%.


It can't eliminate it but:

1. Single-bitflip correction, along with Google's metrics, could help them identify algorithms of their own or customers' VMs that are causing bitflips via rowhammer, and machines which have errors regardless of workload.

2. Double-bitflip detection lets Google decide if they, say, want to panic at that point and take the machine out of service, and they can report on what software was running or why. Their SREs are world-class and may be able to deduce if this was a fluke (orders of magnitude less likely than a single bit flip), if a workload caused it, or if hardware caused it.

The advantage the 3 major cloud providers have is scale. If a Fortune 500 were running their own datacenters, how likely would it be that they have the same level of visibility into their workloads, the quality of SREs to diagnose, and the sheer statistical power of scale?

I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.


ECC seems like a trivial thing to log and keep track of. Surely any Fortune 500 could do it and would have enough scale to get meaningful data out of it?
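As a rough illustration of how little code the logging side takes on Linux (assuming a host with the EDAC driver loaded and the standard sysfs counters exposed; shipping the numbers into an existing metrics pipeline is the easy part):

```python
# Minimal sketch: read ECC error counters from the Linux EDAC sysfs interface.
# Assumes an EDAC driver is loaded; the hard part at scale is correlating these
# counters with workloads, not this collection step.
import glob
import os

for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc[0-9]*")):
    with open(os.path.join(mc, "ce_count")) as f:
        ce = f.read().strip()   # corrected (single-bit) errors
    with open(os.path.join(mc, "ue_count")) as f:
        ue = f.read().strip()   # uncorrected errors
    print(f"{os.path.basename(mc)}: corrected={ce} uncorrected={ue}")
```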


It's not just tracking ECC errors, which as you point out is not hard, but correlating them with the other metrics needed to determine the cause, and having the scale to reliably root-cause bitflips to software (workloads that inadvertently rowhammer), hardware, or even malicious users (GCP customers that may intentionally run rowhammer).


IBM does. They will probably sell you the information if you rent the machines from them.


There was an interesting challenge at DEF CON CTF a while back that tested this, actually. It turns out that it is possible to write x86 code that is 1-bit-flip tolerant–that is, a bit flip anywhere in its code can be detected and recovered from with the same output. Of course, finding the sequence took (or so I hear) something like 3600 cores running for a day to discover it ;)


Nit: not for a day, more like 8 hours, and that's because we were lazy and somebody said he "just happened" to have a cluster with unbalanced resources (mainly used for deep learning, but all GPUs occupied with quite a lot of CPUs / RAM left), so we decided to brute force the last 16 bits :)

Also, the challenge host left useful state (which bit was flipped) in registers before running teams' code, without this I'm not sure if it is even possible.


Sure, all's fair in a CTF. That story came to me through the mouths of at least a handful of people, who might have a bit of an incentive to exaggerate given that they hadn't quite been able to get to zero and might be just a little sour :P

The state was quite helpful, yes–for x86 it seems like a "clean slate" shellcode would be quite difficult, if not impossible, to achieve, as we saw. However, I am left wondering how other ISAs would fare…perhaps worse, since x86 is notoriously dense. But maybe not? The fixed-width ones would probably be easy to try out, at least.


Maybe being notoriously dense is not a bad thing? While those ModRM bytes popping up everywhere are annoying as f* (too easy to flip an instruction into a form with an almost-guaranteed-to-be-invalid memory access), at least due to the density there won't be reserved bits. For example, in AArch64 if bit 28 and bit 27 are both zero the instruction will almost certainly be an invalid one (hitting an unallocated area), and with a single bit flip all branch instructions will have [28:27] = b'00...

[1] https://developer.arm.com/docs/ddi0596/h/top-level-encodings...


Right, I was saying that the other ISAs would do worse because they aren't as dense and will hit something undefined much more readily. But the RISCs in general are much less likely to touch memory (only if you do a load/store from a register that isn't clean, maybe). From a glance, MIPS looks like it might work, since the opcode field seems to use all the bits and the remaining bits just encode reg/func/imm in various ways. The one caveat I see is that I think the top bit of the opcode seems to encode memory accesses, so you may be forced to deal with at least one.


This sounds really cool and interesting.

Was any code dumped anywhere?

I found this which corroborates everything you're saying but provides no further details: https://www.cspensky.info/slides/defcon_27_shortman.pdf


Oh, hey, it's Chad's slides!

Coverage of the finals is usually much less detailed, unfortunately, since the number of teams is much smaller and the challenges don't necessarily go up. However, https://oooverflow.io/dc-ctf-2020-quals/ has a couple more writeups linked from it; https://dttw.tech/posts/SJ40_7MNS#proof-by-exhaustion from PPP and http://www.secmem.org/blog/2019/08/19/Shellcoding-and-Bitfli... from SeoulPlusBadass.


I see. Thanks very much for this info.

Binary bitflip resilience is really cool. The radiation-hardened-quine idea (https://codegolf.stackexchange.com/questions/57257/radiation..., https://github.com/mame/radiation-hardened-quine) is cool, but these source-based approaches depend on a perfectly functioning and rather large (Ruby, V8, whole browser) binary stack. A bitflip-protected hex monitor or kernel, on the other hand...


> Making software resilient against bitflips in memory seems very difficult though, since it not only affects data, but also code.

There is an OS that pretty much fits the bill here. There was a show where Andrew Tanenbaum had a laptop running Minix 3 hooked up to a button that injected random changes into module code while it was running, to demonstrate its resilience to random bugs. Quite fitting that this discussion was initiated by Linus!

Although it was intended to protect against bad software, I don't see why it wouldn't also go a long way in protecting the OS against bitflips. Minix 3 uses a microkernel with a "reincarnation server", which means it can automatically reload any misbehaving code not part of the core kernel on the fly (which for Minix is almost everything). This even includes disk drivers. In the case of misbehaving code there is some kind of triple redundancy mechanism much like the "quorum" you suggest, but that is where my crude understanding ends. AFAIR userland software could in theory also benefit, provided it was written in such a way as to be able to continue gracefully after reloading.


At some point, whatever's watching the watchers is going to be vulnerable to bitflip and similar problems.

Even with a triple-redundant quorum mechanism, slightly further up that stack you're going to have some bit of code running that processes the three returned results - if the memory that's sitting on gets corrupted, you're back where you started.


> At some point, whatever's watching the watchers is going to be vulnerable to bitflip

One advantage of microkernels is that the "watcher" is so small that it could be run directly from ROM, instead of loaded into RAM. QNX has advocated that route for robotics and such in the past.

Minix may not be the best example of the type. While it is a microkernel, its real-world reliability has been poor in the past. More mature microkernel operating systems like QNX and OpenVMS are better examples.


> While it is a microkernel, it's real world reliability has been poor in the past.

Nitpick/clarification: it currently supervises the security posture, attestation state and overall health of several billion(?) Intel CPUs as the kernel used by the latest version of the Management Engine.

If ME is shut down completely apparently the CPU switches off within 20 minutes. Presumably this applies across the full uptime of the processor, and not just immediately after boot, and iff this is the case... percentage of Intel CPUs that randomly switch off === instability/unreliability of Minix in a tightly controlled industrial setting.


Anyone have any idea why there haven't been any open-source QNX clones, at least not any widely known ones? Even before their Photon MicroGUI patents expired, the clones could have used X11.

I used to occasionally boot into QNX on my desktop in college. It was a very responsive and stable system.

Hypervisors are, to a first approximation, microkernels with a hardware-like interface. All of this kernel bypass work being done by RDBMSes, ScyllaDB, HFTs, etc. is, to a first approximation, making a monolithic kernel act a bit like a microkernel.


There are well known open source microkernels, like Minix 3 and L4. Probably not that attractive.

Why something hasn't been done is always a hard question to answer, since to succeed a lot of things have to go right, and by default none of them do. But one thing is that microkernels were more trendy in the 90s - r&d people are mostly doing things like "the cluster is the computer", unikernel, exokernel, rump kernel, embedded (eg tock), remote attestation since then (I'm not up to date on the latest).


Thinking about it a bit more, QNX clones might suffer from something akin to second system syndrome. There's a simple working design, and it likely strongly invites people to jump right to their own twist on the design before they get very far into a clone.


> Minix may not be the best example of the type. While it is a microkernel, it's real world reliability has been poor in the past. More mature microkernel operating systems like QNX and OpenVMS are better examples.

You might be referring to the previous versions. Minix 3 is basically a different OS, it's more than an educational tool - in fact it's probably running inside your computer right now if you have an Intel CPU (it runs Intel's ME chip - for better or worse).


Yes, but this is the entire principle around which microkernels are designed: making the last critical piece of code as small and reliable as possible. Minix3's kernel is <4000 lines of C.

As far as bitflips are concerned, having the critical kernel code occupy fewer bits reduces the probability of a bitflip causing an irrecoverable error.


Yes, I understand this -- basic risk mitigation by reducing the size of your vulnerability.

(I'll archaic brag a bit by mentioning I used to be a heavy user of Minix - my floppy images came in over an X25 network - and saw Andy Tanenbaum give his Minix 3 keynote at FOSDEM about a decade ago. I'm a big fan.)

Anyway, while reducing risk this way is laudable, and will improve your fleet's health, as per TFA it's a poor substitute, with bad economics and worse politics behind it, than simply stumping up for ECC.

I'll also note that, for example, Google's sitting on ~3 million servers so that ~4k LoC just blew out to 12,000,000,000 LoC -- and that's for the hypervisors only.

Multiply that out by ~50 to include VM's microkernels, and the amount of memory you've now got that is highly susceptible to undetected bit-flips is well into the mind-blowing range.


Oh, I'm not saying it's the single best solution, I guess I got carried away in the argument - it's simply a scenario where the concept shines, yet it's an entirely artificial scenario and I agree ECC is the correct way.


You can have two watchers watching each other's integrity.



Error-correcting code (the "ECC" in ECC) is just a quorum at the bit level.


I'm surprised that the other replies don't grasp this. This is the proper level to do the quorum.

Doing quorum at the computer level would require synchronizing parallel computers, and unless that synchronization were to happen for each low level instruction, then it would have to be written into the software to take a vote at critical points. This is going to be greatly detrimental both to throughput and software complexity.

I guess you could implement the quorum at the CPU level... e.g. have redundant cores each with their own memory. But unless there was a need to protect against CPU cores themselves being unreliable, I don't see this making sense either.

At the end of the day, at some level, it will always come down to probabilities. "Software engineering principles" will never eliminate that.
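To make "quorum at the bit level" concrete, here is a toy sketch of the SECDED idea (single-error-correct, double-error-detect) that ECC DIMMs apply per 64-bit word, shrunk down to Hamming(7,4) plus an overall parity bit so it fits in a comment; it is illustrative only, not the exact code a real memory controller runs:

```python
# Toy SECDED ("single error correct, double error detect") over 4 data bits:
# Hamming(7,4) plus one overall parity bit. Real ECC DIMMs do the same trick
# over 64 data bits with 8 check bits; this is only a shrunken illustration.

def encode(d1, d2, d3, d4):
    """Encode 4 data bits into an 8-bit codeword (positions 1..7 + overall parity)."""
    p1 = d1 ^ d2 ^ d4                      # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                      # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                      # covers positions 4,5,6,7
    code = [p1, p2, d1, p3, d2, d3, d4]
    return code + [sum(code) % 2]          # last bit: parity over the whole codeword

def decode(word):
    """Correct any single bit flip, detect (but not correct) double flips."""
    code, overall_ok = word[:7], (sum(word) % 2 == 0)
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    syndrome = s1 + 2 * s2 + 4 * s3        # position of the flipped bit, 0 = none
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:                   # odd number of flips -> assume one, fix it
        status = "corrected"
        if syndrome:
            code[syndrome - 1] ^= 1        # syndrome 0 here means the parity bit flipped
    else:                                  # syndrome set but parity even -> two flips
        status = "uncorrectable (double error detected)"
    return status, [code[2], code[4], code[5], code[6]]

stored = encode(1, 0, 1, 1)
stored[5] ^= 1                             # simulate a single cosmic-ray bit flip
print(decode(stored))                      # ('corrected', [1, 0, 1, 1])
```

The real thing works the same way over 64 data bits plus 8 check bits, which is where the extra DRAM chip on an ECC DIMM goes.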


I would highly recommend a graduate-level course in computer architecture for anyone who thinks ECC is a 1980s solution to a modern problem.

There are a lot of seemingly high-level problems that are solved (ingeniously) in hardware with very simple, very low-level solutions.


Could you please link me to such a course that displays the hardware level solutions? I'm super interested!



https://en.wikipedia.org/wiki/NonStop_(server_computers)

My first employer out of Uni had an option for their primary product to use a NonStop for storage -- I think HP funded development, and I'm not sure we ever sold any licenses for it.


Modern error correction codes can do much better than that.


You need two alpha particles hitting the same rank of memory for failure to happen. Although super rare, even then it is still correctable. You need three before it is silent data corruption. Silent corruption is what you get with non ECC with even a single flip.


Where are you getting this from? My understanding is that these errors are predominantly caused by secondary particles from cosmic rays hitting individual memory cells, and I've never heard something so precise as "you need two alpha particles". Aren't the capacitances in modern DRAM chips extremely small?


The structure of the ECC is at the rank level. This allows for correcting single bit flips in a rank and detecting double bit flips in a rank. So when you grab a cache line, each 64 bits is corrected and verified.


Bit flips can happen, but regardless of whether they can be repaired by the ECC code or not, the OS is notified, iirc. It will signal a corruption to the process that is mapped to the faulty address. I suppose that if the memory contains code, the process is killed (if ECC correction failed).


> I suppose that if the memory contains code, the process is killed (if ECC correction failed).

Generally, it would make the most sense to kill the process if the corrupted page is data, but if it's code, then maybe re-load that page from the executable file on non-volatile storage. (You might also be able to rescue some data pages from swap space this way.)


If you go that route, you should be able to avoid the code/data distinction entirely, as data pages can also be completely backed by files. I believe the kernel already keeps track of which pages are a clean copy of data from the filesystem, so I would think it would be a simple matter of essentially paging out the corrupted data.

What would be interesting is if userspace could mark a region of memory as recomputable. If the kernel is notified of memory corruption there, it triggers a handler in the userspace process to rebuild the data. Granted, given the current state of hardware; I can't imagine that is anywhere near worth the effort to implement.


> What would be interesting is if userspace could mark a region of memory as recomputable.

I believe there's already some support for things like this, but intended as a mechanism to gracefully handle memory pressure rather than corruption. Apple has a Purgeable Memory mechanism, but handled through higher-level interfaces rather than something like madvise().


> You probably would have to use multiple computers doing the same calculation and then take the answer from the quorum.

The Apollo missions (or was it the Space Shuttle?) did this. They had redundant computers that would work with each other to determine the “true” answer.


The Space Shuttle had redundant computers. The Apollo Guidance Computer was not redundant (though there were two AGCs onboard-- one in the CM and one in the LEM). The aerospace industry has a history of using redundant dissimilar computers (different CPU architectures, multiple implementations of the control software developed by separate teams in different languages, etc) in voting-based architectures to hedge against various failure modes.


This remains common in aerospace, each voting computer is referred to as a "string". https://space.stackexchange.com/questions/45076/what-is-a-fl...


In aerospace where this is common, you often had multiple implementations, as you wanted to avoid software bugs made by humans. Problem was, different teams often created the same error at the same place, so it wasn’t as effective as it would have seemed.


Forgive my ignorance, but wouldn't the computer actually reacting to the calculation (and sending a command or displaying the data) still be very vulnerable to bit-flips? Or were they displaying the results from multiple machines to humans?


Sounds similar to smart contracts running on a blockchain :)


If you use multiple computers doing the same calculation and then take the answer from the quorum, how do you ensure the computer that does the comparison is not affected by memory failures? Remember that all queries have to go through it, so it has to be comparable in scale and power.


> how do you ensure the computer that does the comparison is not affected by memory failures?

You do the comparison on multiple nodes too. Get the calculations. Pass them to multiple nodes, validate again and if it all matches, you use it.


> validate again

Recursion, see recursion.



I mean raft and similar algorithms run multiple verification machines because a single point of failure is a single point of failure.


Raft, Paxos, and other consensus algorithms add even more overhead. Imagine running every Google query through Raft and think how long it will take and how much extra hardware would be needed.

ECC memory is just as fast as non-ECC memory, and only costs a little more.


Your comment sounded like "your recursive definition is impossible".

I am totally for ECC and was flabbergasted when it went away. But the article makes sense since I remember Intel pushing hard to keep it out of the consumer space. The freaking QX6800 didn't support ECC and it retailed for over a grand.


I get this every time this conversation comes up: it's the same answer, "I don't see a problem".

It’s so easy to chalk these kind of errors to other issues, a little corruption here, a running program goes bezerk there- could be a buggy program or a little accidental memory overwrite. Reboot will fix it.

But I ran many thousands of physical machines, with petabytes of RAM. I tracked memory flip errors and they were _common_ - common even with less dense memory, in thick metal enclosures surrounded by mesh, where density and shielding impact bitflips a lot.

My own experience tracking bitflips across my fleet led me to buy a Xeon laptop with ECC memory (precision 5520) and it has (anecdotally) been significantly more reliable than my desktop.


Yeah, it's real obnoxious of Intel to silo ECC support off into the Xeon line, isn't it? I switched to ECC memory in 2013 or 2014 with a Xeon E3 (fundamentally a Core i7 without the ECC support fused off) and of course a Xeon-supporting motherboard (with weird "server board" quirks: e.g., no on-board sound device).

I love that AMD doesn't intentionally break ECC on its consumer desktop platforms, and I upgraded to a Threadripper in 2017.


I've considered using an AMD CPU instead of Intel's Xeon on the primary desktop computer, but even low-end Ryzen Threadripper CPUs have TDP of 180W, which is a bit higher than I'd like. And though ECC is not disabled in Ryzen CPUs, AFAIK it's not tested in (or advertised for) those, so one won't be able to return/replace a CPU if it doesn't work with ECC memory, AIUI, making it risky. Though I don't know how common it is for ECC to not be handled properly in an otherwise functioning CPU; are there any statistics or estimates around?


> but even low-end Ryzen Threadripper CPUs have TDP of 180W, which is a bit higher than I'd like.

Why does it matter? It doesn't idle that high; it only goes that high if you're using it flat out, in which case the extra power usage is justified because it's giving that much more performance over a 100 W TDP CPU. Now I totally get it if you don't want to go Threadripper just for ECC because it's more expensive, but max power draw, which you don't even have to use? I've never seen anyone shop a desktop CPU by TDP, rather than by performance and price.


>> I've never seen anyone shop a desktop CPU by TDP, rather than by performance and price.

Oh oh, me! Back in the day I bought a 65W CPU for a system that could handle 90W. I wanted quiet and figured that would keep fan noise down at a modest performance penalty. It should also last longer, being the same design but running cooler. I ran that from 2005 until a few years ago (it still runs fine but is in storage).

Planning to continue this strategy. I suspect it's common among SFF enthusiasts.


On AMD, with Ryzen Master, you can set the TDP-envelope of the processor to what you want. Then the boost/frequency/voltage envelope it chooses to operate in under sustained load is different.

IMO, shopping by performance/watt makes sense. Shopping by TDP doesn't. (Especially since there is no comparing the AMD and Intel TDP numbers as they're defined differently; neither is the maximum the processor can draw, and Intel significantly exceeds the specified TDP on normal workloads).


TDP matters a fair bit in SFF(Small Form Factor) PCs. For instance the 3700x is a fantastic little CPU since it has a 65W TDP but pretty solid performance.

In a sandwich style case you're usually limited to low profile coolers like Noctua L9i/L9a since vertical height is pretty limited.


Performance/watt matters. You can just set TDP to what you want with throttling choices.

If you want a 45W TDP from the 3700X, you can just pop into Ryzen Master and ask for a 45W TDP. Boom, you're running in that envelope.

I think shopping based on TDP is not the best, because it's not comparable between manufacturers and because it's something you can effectively "choose".


How do you do that? Is it a setting in the bios? Or can it be done runtime? If so, how? It sounds interesting if I can run a beefy rig as a power efficient device, for always-on scenarios, and then boost it when I need.


> How do you do that? Is it a setting in the bios? Or can it be done runtime?

On AMD, it's a utility you run. I believe you may require a reboot to apply it. On some Intel platforms, it's been settings in the BIOS.

> It sounds interesting if I can run a beefy rig as a power efficient device, for always-on scenarios, and then boost it when I need.

This is what the processor is doing internally anyways. It throttles voltage and frequency and gates cores based on demanded usage. Changing the TDP doesn't change the performance under a light-to-moderate workload scenario at all.

Ryzen Master lets you change some of the tuning for the choices it makes about when and how aggressively to boost, though, too.


Ryzen Master doesn't seem to be available for Linux, so you end up with a bunch of unofficial hacks that may or may not work. I run an SFF setup myself; I originally wanted to get a 3600 but it was out of stock, and the next TDP-friendly processor was the 3700X.


That's an annoyance, but on Linux you have infinitely more control over thermal throttling and you can get whatever thermal behavior you want. Thermald has been really good on Intel, and now that Google contributed RAPL support you can get the same benefits on AMD -- pick exactly your power envelope and thermal limits.


Yeah but can I get a metric ton of benchmarks at that 45w setpoint?

I don't really see the reason in paying for a 100w TDP premium if I'm just going to scale it down to 65w.


> Yeah but can I get a metric ton of benchmarks at that 45w setpoint?

Yup, they're out there.

> I don't really see the reason in paying for a 100w TDP premium if I'm just going to scale it down to 65w.

You might want the core count or peak performance for the very short term. When I was looking, running 65W parts in the 45W envelope was only about a 7% penalty, so you get a bunch more performance/watt.


I'm running a 2400G in a Mellori-ITX. Another issue is sizing the power supply.


Back when my daily driver was a Core 2 laptop, someone told me that capping the clock frequency would make it unusable.

As a petty "Take that", I dropped the max frequency from 2.0 GHz to 1.0 GHz. I ran a couple benchmarks to prove the cap was working, and then just kept it at 1.0 for a few months, to prove my point.

It made a bigger difference on my ARM SBC, where I tried capping the 1,000 MHz chip to 200 or 400 MHz. That chip was already CPU-bound for many tasks and could barely even run Firefox. Amdahl's Law kicked in - Halving the frequency made _everything_ twice as slow, because almost everything was waiting on the CPU.


The funny thing is, on modern processors-- throttling TDP only affects when running flat out all-core workloads. A subset of cores can still boost aggressively, and you can run all-core max-boost for short intervals.

And the relationship between power and performance isn't linear as processor voltages climb trying to squeeze out the last bit of performance.

So if you want to take a 105W CPU and ask it to operate in a 65W envelope, you're not giving up even 1/3rd of peak performance, and much less than that of typical performance.


You’re giving up 0 of peak single thread performance. A single core in turbo across Intel and AMD, mobile and desktop uses max 50W.


Here are some numbers on single-core power consumption, per https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...:

AMD Ryzen 9 5950X: 20.6W for a single core at 5050MHz, 49W for the whole package. (And it’s generally the package figure that you care about.)

AMD Ryzen 9 5900X: 17.9W/54W at 4875MHz.

AMD Ryzen 7 5800X: 17.3W/37W at 4825MHz.

AMD Ryzen 5 5600X: 11.8W/28W at 4650MHz (though the highest core reading is 13W, at three cores loaded).

You’re both correct: by simply restricting that power envelope by 40%, you shed a lot less multi-threaded performance than people realise, and no single-threaded performance.

Look at the 5950X figures, and you observe that at about 120W, it can run 6 cores at 4,650MHz (27,900 core–MHz), or 16 cores at 3,775MHz (60,400 core–MHz).

Expressed one way: by dropping the frequency by about 20%, power per core dropped by around 2.7×.

Expressed another way: let’s skip a 65W envelope—put this particular 105W chip in a 40W envelope and you lose only 20% of your six-cores performance. Seriously. But I’m not sure what the curve would look like if you load all 16 cores at a 40W envelope, what speed they’d be going at.

(But do remember that “TDP” is a bit of a mess as a concept, and that we’re depending on non-core power consumption being generally fairly consistent regardless of load.)
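For the curious, re-deriving those two data points (package power held near ~120 W in both cases, figures as quoted above):

```python
# Re-deriving the quoted 5950X figures: the same ~120 W package budget,
# spent on either 6 fast cores or 16 slower ones.
package_w = 120
cases = [("6 cores @ 4650 MHz", 6, 4650), ("16 cores @ 3775 MHz", 16, 3775)]
for label, cores, mhz in cases:
    print(f"{label}: {cores * mhz} core-MHz, {package_w / cores:.1f} W/core")
# 6 cores @ 4650 MHz: 27900 core-MHz, 20.0 W/core
# 16 cores @ 3775 MHz: 60400 core-MHz, 7.5 W/core  (about 2.7x less power per core)
```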


Hm... My 2013 NUC in a fanless Akasa enclosure runs 24/7 on a 6W CPU. I recently looked at the options, and the 2019 6W offering changes little in performance. Yes, memory got faster, but that's it.

My passively cooled desktop is also running a slightly throttled-down 65W CPU.

So yes, there are people who choose their hardware by TDP.


When looking for a CPU for a server that sits in my living room, I went down the thought process of getting a low tdp. I don't have a quote, but I seem to remember coming to the conclusion that tdp is the max temp threshold, not the consistent power draw. If you have a computer idling I believe you won't see a difference in temp between cpus, but you will have the performance when you need it.

These days, a quiet, pwm fan with good thermal paste (and maybe some linux CPU throttling) more than achieves my needs for a "silent" pc 99% of the time.

I would love to be told my above assumptions are wrong if they are.


Yah-- one should look at performance within a given power envelope. Being able to dissipate more and then either end up with the fan running or the processor throttling back somewhat is good, IMO.

The worst bit is, AMD and Intel define TDP differently-- neither is the maximum power the processor can draw-- though Intel is far more optimistic.


SFF?


"small form factor" as far as I can tell


The Intel NUC and Mac mini are good examples of this - however, the NUC doesn't have its PSU inside; it's a brick. Great for fixing failures, horrible in general, as a built-in PSU is so much tidier.


Internal PSU adds heat that is noisy or expensive to dissipate or is avoided by throttling performance.


If the concern is about heat and noise, the cooling system is the most important factor; oddly, I don't see it even mentioned.

Get a huge cooler like a Noctua D14, and your PC becomes silent. It lasts forever, requires no maintenance, a good investment.

If you are adventurous, watercooling is even better, but it's a can of worms I decided I'd rather live without - the possibility of leaks and the cost make it harder to justify.


I prefer to pick a PSU and fans (for both CPU and chassis) that can handle the maximum TDP comfortably (preferably while staying silent and with some reserve), and given that I don't need that many cores or a high clock speed either, a powerful CPU with a high TDP is undesirable because it just makes picking the other parts harder. I mentioned TDP explicitly because I wouldn't mind if it were a (possibly even high-end) Threadripper that somehow didn't produce as much heat. Although price also matters, indeed.


> I've never seen anyone shop a desktop CPU by TDP, rather than by performance and price.

That's me. When I start to plan a new system, I select the processor first, read its thermal design guidelines (Intel used to have nice load vs. max temp graphs in their docs), and select every other component around it for sustained max load.

This results in a system that is quieter at idle and gives peace of mind when loading it for extended durations.


That's not necessarily correct.

You can passively cool Threadrippers if you underclock them enough and have good ventilation in the case.


If my only interest were ECC, I might do that, but I develop scientific software for research purposes. I need every bit of performance from my system.

In my case, loading means maxing out all cores, and an extended period of time can be anything from five minutes to hours.


The problem is-- you can't compare the TDP nor even the system cooling design guidelines between AMD and Intel.

Both are optimistic lies, but-- if you look at the documents it looks like currently AMD needs more cooling, but actually dissipates less power in most cases and definitely has higher performance/watt.


> The problem is-- you can't compare the TDP nor even the system cooling design guidelines between AMD and Intel.

Doesn't matter for me since I'm not interested in comparing them.

> Both are optimistic lies, but-- if you look at the documents it looks like currently AMD needs more cooling, but actually dissipates less power in most cases and definitely has higher performance/watt.

I'm aware of the situation, and I always inflate the numbers 10-15% to add headroom to my systems. The code I'm running is not a typical case: it's FPU-heavy, "I will abuse all your cores and memory bandwidth" type, heavily optimized scientific software. I can sometimes hear my system swearing at me for the repeated test runs.

I don't like to add this paragraph, but I'm one of the administrators of one of the biggest HPC clusters in my country. I know how a system can surpass its TDP and how CPU manufacturers can skew these TDP numbers to fit into envelopes. We make these servers blow flames from their exhausts.


Built a NAS. My #1 concern when choosing the CPU was TDP. This machine is on 24/7, and power use is a primary concern where I live because electricity is NOT cheap.


This is a poor way to make the choice. TDP is supposed to specify the highest power you can get the processor to dissipate, not typical or idle use. And since different manufacturers specify TDP differently, you can't even compare the number.

Performance/watt metrics and idle consumption would have been a far better way to make this choice.

If you have a choice between A) something that can dissipate 65W peak for 100 units of performance, but would dissipate 4W average under your workload, and B) something that can dissipate 45W peak for 60 units of performance, but would dissipate 4.5W under your workload... I'm not sure why you'd ever pick B.


Is there a metric to look for to understand what power consumption is at "idle" or something close to that? That is what confuses me. I don't want to spend a lot of money on something that will be always on, and usually idling, and finding that its power usage is way higher than I thought. But perhaps there is a metric that tells that. I have not looked closely at it.

Also, even though the CPU may draw less, can the power supply still waste more, just because it is beefy? Comparing with a sports car: they have great performance, but also use more gas in ordinary traffic. Can a computer be compared with that?


> Is there a metric to look for to understand what power consumption is at "idle" or something close to that? That is what confuses me. I don't want to spend a lot of money on something that will be always on, and usually idling, and finding that its power usage is way higher than I thought.

Community benchmarks, from Tom's Hardware, etc.

The vendor numbers are make-believe-- you can't use them for power supply sizing or for thermal path sizing. If you look at the cited TDP numbers today, they can be misleading-- e.g. Intel 45W TDP parts often use more power at peak than AMD 65W parts.

On modern systems, almost none of the idle consumption is the processor. The power supply's idle use and motherboard functions dominate.

> Also, even though the CPU may draw less, can the power supply still waste more, just because it is beefy?

Yes, having to select a larger power supply can result in more idle consumption, though this is more of a problem on the very low end.


> And though ECC is not disabled in Ryzen CPUs, AFAIK it's not tested in (or advertised for) those

ECC isn't validated by AMD for AM4 Ryzen models, but it's present and supported if the motherboard also supports it. Many motherboards have ECC support (the manual will say for sure), and a handful of models even explicitly advertise it as a feature.

I have a Ryzen 9 3900X on an ASRock B450M Pro4 and 64 GB of ECC DRAM, and ECC functionality is active and working.


What do you mean by “validated”? There’s the silicon, but they don’t test it?


IMO, "validated" is intentionally wishy-washy and mostly means that AMD would prefer it if enterprises paid them more money by buying EPYC (or Ryzen Pro) parts instead of consumer Ryzen parts. Much like how Intel prefers selling higher-margin Xeons over Core i5. It's market segmentation, but friendlier to consumers than Intel's approach.


More like "The feature is present in silicon but motherboard makers are not required to turn it on". At the end of the day, ECC support does require extra copper traces in the PCB and some low end models may deliberately choose to skip them, thus the expectation has to be managed.


My (largely unfounded) understanding is that this means they don't run the consumer chip configurations through a battery of compatibility tests with various memory modules on a reference motherboard. My understanding is that they run those steps, for each stepping or significant process change, only for their validated parts.

Secondly, it probably also means that they do not include tests for this functionality when they perform the final tests on each fully assembled chip. I'd expect that a JTAG boundary scan does verify that the bond wires are in place and work, but no functional tests of ECC are run on each processor in the consumer configuration.

The net result is that with a compatible motherboard and memory, ECC almost certainly works (since the memory controller is the same as in the supported model) but AMD does not officially guarantee it. It is much like overclocking. The functionality is present, and it should work, and most likely does, but AMD accepts no responsibility if it does not, since they don't formally test for it.


I went through this about a year ago, to build a low-TDP ECC workstation. I do not have stats on failure rates, just this anecdotal experience: ASRock and ASUS seem to be the boards to get. For RAM, I got two sticks of Samsung M391A4G43MB1 and verified that ECC works. The advice I remember from the forums was to stick to unbuffered RAM (UDIMMs).


Did you consider any off-the-shelf ECC boxes?

Found some here -- bottom of the EPYC product line starts at $2849 ...!

https://www.velocitymicro.com/wizard.php?iid=337


The TDP on EPYC chips is a lot higher. I think of Threadripper as mid-tier, and EPYC as the high-end. Ryzen is remarkable because you can buy new equipment with ECC at consumer prices. I am hazy, but I don't think that has been possible since the 386 era ('parity RAM').


> The TDP on EPYC chips is a lot higher.

EPYC TDP ranges from "a lot lower" (35W embedded, 120W regular) up to "comparable with" (180-240W, a single 280W model) relative to Threadripper (180-250W last gen; current gen is all 280W). It's definitely not a lot higher on the Epyc side.

> I think of Threadripper as mid-tier, and EPYC as the high-end.

This oversimplifies to the point of not being a useful intuition (or is arguably even incorrect). Threadripper is a SKU with a moderate number of cores at relatively high clocks; (high-power) EPYC SKUs have a lot of very efficient cores running at lower clocks. They both have a niche, but Threadripper has unambiguously better single-core performance due to the ~20% higher clocks. And single-core performance still matters in many applications (to oversimplify: Amdahl's law; but also, latency-sensitive applications).


AMD is more accurate in their TDP specs than Intel is, they can't really be compared.


This is not responsive to the comment you're replying to, which is comparing AMD to AMD.


My comment was corrupted by rowhammer. Not only was it misdirected, the content is also probably wrong.


Yes, the consumer parts only support UDIMMs. If you want RDIMMs, you have to pay for EPYC.


I don't think Threadripper is a hard requirement for ECC. There's some pretty reasonable TDP processors if you step down from Threadripper.


I haven't seen definite details and test results on these (but haven't looked recently).

What specific configurations (CPU, MB, RAM) are known to work?

Let's say I have a Ryzen system, how can I check if ECC really works? Like, can I see how many bit flips got corrected in, say, last 24h?


All desktop Ryzen CPUs without integrated GPU, i.e. with the exception of APUs, support ECC.

You must check the specifications of the motherboard to see if ECC memory is supported.

As a rule, all ASRock MBs support ECC and also some ASUS MBs support ECC, e.g. all ASUS workstation motherboards.

I have no experience with Windows and Ryzen, but I assume that ECC should work also there.

With Linux, you must use a kernel with all the relevant EDAC options enabled, including CONFIG_EDAC_AMD64.

For the new Zen 3 CPUs, i.e. Ryzen 5xxx, you must use a kernel 5.10 or later, for ECC support.

On Linux, there are various programs, e.g. edac-utils, to monitor the ECC errors.

To be more certain that the ECC error reporting really works, the easiest way is to change the BIOS settings to overclock the memory, until memory errors appear.


Regarding verification: there is a Debian package called edac-utils. As I recall you overclock your RAM and run your system at load in order to generate failures.

Looking back at my notes, the output of journalctl -b should say something like, "Node 0: DRAM ECC enabled."

Then 'edac-ctl --status' should tell you that drivers are loaded.

Then you run 'edac-util -v' to report on what it has seen:

    mc0: 0 Uncorrected Errors with no DIMM info
    mc0: 0 Corrected Errors with no DIMM info
    mc0: csrow2: 0 Uncorrected Errors
    mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
    mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
    mc0: csrow3: 0 Uncorrected Errors
    mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
    mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
    edac-util: No errors to report.


> As I recall you overclock your RAM and run your system at load in order to generate failures.

You can also use memtest86+ for this, although I don't recall if it requires specific configuration for ECC testing.


All AMD CPUs with integrated memory controllers support ECC. The CPU also exposes an interface usable by the operating system to verify ECC works - the same interface is used to provide monitoring of memory fault data provided by ECC.

They aren't tested for it, so it's possible to get a dud, but the chance is minuscule and isn't worth worrying about.

Now, on to the actual issues you can encounter: motherboards.

The problem is that ECC means you need, IIRC, 8 more data lines between the CPU and the memory module, which of course means more physical connections (I don't remember how many right now). Those also need to be properly routed and tested, and you might encounter a motherboard where that wasn't done. Not sure how common that is, unfortunately.

Another issue is motherboard firmware. Even though AMD supplies the memory init code, the configuration can be tweaked by the motherboard vendor, and they might simply break ECC support accidentally (even by something as simple as making a toggle default to false and then forgetting to expose it in the configuration menu).

Those are the two issues you can encounter.

The difference with Threadripper PRO and EPYC is that, AFAIK, AMD includes ECC in its test and certification programs for them, which kind of enforces support.


> Another issue is motherboard firmware. Even though AMD supplies the memory init code, the configuration can be tweaked by the motherboard vendor, and they might simply break ECC support accidentally (even by something as simple as making a toggle default to false and then forgetting to expose it in the configuration menu).

I think some Gigabyte boards are infamous for this in certain circles.

OTOH, Gigabyte might have a Threadripper PRO motherboard (WRX80 chipset) coming out in the future.


Gigabyte is also infamous for trying to claim that they had implemented UEFI by dropping a build of DUET (UEFI that boots on top of BIOS, used for early development) into the BIOS image...


On Windows, to check if ECC is working, run the command 'wmic memphysical get memoryerrorcorrection':

    PC C:\> wmic memphysical get memoryerrorcorrection
    MemoryErrorCorrection
    6

SuperUser has a convenient decoder[1]; a value of 6 here means "Multi-bit ECC", i.e. ECC is working.

When Windows detects a memory error, it will record it in the system event log, under the WHEA source. As a side note, this is also how memory errors within the CPU's caches are reported under Windows.

[1] https://superuser.com/questions/893560/how-do-i-tell-if-my-m...


Every Ryzen (non-APU) supports it.* Check the motherboard of your choice; they declare it in big bold letters, e.g. [0]

*not officially, and the memory controller provides no report for 'fixed' errors.

0: http://www.asrock.com/mb/AMD/X570%20Taichi/


It’s not. I have a low end Epyc machine with ECC. It has a TDP of something like 30 watts.


I didn't consider embedded CPUs (I guess that's about an embedded EPYC, not a server one), those look neat. But there's no official ECC support (i.e., it's similar to Ryzen CPUs), is there?

Edit: as detaro mentioned in the reply, there is, and here's the source [0] -- that's what they mean by "RAS" on promotional pages [1]. That indeed looks like a nice option.

[0] https://www.amd.com/system/files/documents/updated-3000-fami...

[1] https://www.amd.com/en/products/embedded-epyc-3000-series


For embedded applications, there is official ECC support for all CPUs named Epyc or Ryzen Vxxxx or Ryzen Rxxxx.

There are computers in the Intel NUC form factor, with ECC support (e.g. with Ryzen V2718), e.g from ASRock Industrial.


All EPYC, including the embedded ones, do officially have ECC support


RAS covers more than just DRAM, but yes. Historically, the reporting interface is called MCA (Machine Check Architecture) / MCE. I think both AMD and Intel have extensions with other names, but MCA/MCE points you in the right direction.


What kind of machine is that? I've been vaguely looking for one for a while, and everything seemed difficult to get (since the main target is large-volume customers, I guess).


Sorry, I didn't see this earlier. It's a Supermicro board with an Epyc 3201 (8c/8t); the base frequency is ~1.5 GHz. There's no CPU fan, so it can get quite hot. It throttles at 94 degrees Celsius, and it will hit that if more than around 4-6 cores are fully used. However, there are options in the BIOS to disable some of the cores. In my case I've disabled 4 of them, and the temperature is much more reasonable now.


I don't understand. Whatever the TDP of Intel processors, you are straight up getting less bang per watt given their ancient process. Same reason smartphones burst to high clocks and power: getting the task done faster is on average much more efficient.


> one won't be able to return/replace a CPU if it doesn't work with ECC memory

I don't know where you live, but around here, (if you buy new?), the vendor MUST take back items up to 15 days after they were delivered, for ANY reason.

So, as long as you synchronize your buying of CPU, RAM, (motherboard), you should be fine.


You can use BIOS settings to change the TDP to whatever you like, with substantially higher efficiency if you are under-powering and substantially lower efficiency if you are over-powering.


> I've considered using an AMD CPU instead of Intel's Xeon on the primary desktop computer, but even low-end Ryzen Threadripper CPUs have TDP of 180W, which is a bit higher than I'd like.

Any apples-to-apples comparable Intel CPU will have comparable power use. The difficulty is that Intel didn't really have anything like Threadripper — their i9 series was the most comparable (high clocks and moderate core counts), but i9 explicitly did not support ECC memory, nullifying the comparison.

You're looking at 2950X, probably? That's a Zen+ (previous gen) model. 16 core / 32 thread, 3.5 GHz base clock, launched August 2018.

The comparable Intel Xeon timeline is Coffee Lake at the latest, Kaby Lake before that. As far as I can tell, no Kaby Lake or Coffee Lake Xeons even have 16 cores.

The closest Skylake I've found is an (OEM) Xeon Gold 6149: 16/32 core/thread, 3.1 GHz base clock, 205W nominal TDP (and it's a special OEM part, not available for you). The closest buyable part is probably Xeon Gold 6154 with 18/36 core/threads, 3GHz clock, and 200W nominal TDP.

Looking at i9 from around that time, you had Skylake-X and a single Coffee Lake-S part (i9-9900K). The 9900K only has 8 cores. The Skylake i9-9960X part has 16/32 cores/threads, a base clock of 3.1GHz, and a nominal TDP of 165W. That's somewhat comparable to the AMD 2950X, ignoring ECC support.

Another note that might interest you: you could run the Threadripper part at substantially lower power by sacrificing a small amount of performance, if thermals are the most important factor and you are unwilling to trust Ryzen ECC: http://apollo.backplane.com/DFlyMisc/threadripper.txt

Or just buy an Epyc, if you want a low-TDP ECC-definitely-supported part: EPYC 7302P has 16/32 cores, 3GHz base clock, and 155W nominal TDP. EPYC 7282 has 16/32 cores, 2.8 GHz base, and 120W nominal TDP. These are all zen2 (vs 2950X's zen+) and will outperform zen+ on a clock-for-clock basis.

> And though ECC is not disabled in Ryzen CPUs, AFAIK it's not tested in (or advertised for) those, so one won't be able to return/replace a CPU if it doesn't work with ECC memory, AIUI, making it risky.

If your vendor won't accept defective CPU returns, buy somewhere else.

> Though I don't know how common it is for ECC to not be handled properly in an otherwise functioning CPU; are there any statistics or estimates around?

ECC support requires motherboard support; that's the main thing to be aware of shopping for Ryzen ECC setups. If the board doesn't have the traces, there's nothing the CPU can do.


Keep in mind that Intel lies about its TDP.


There's been a lot of misinformation spread about what TDP means for modern CPUs. In Intel's case TDP is the steady state power consumption of the CPU in its default configuration while executing a long running workload. Long meaning more than a minute or two. The CPU implements this by keeping an exponentially weighted moving average (EWMA) of the CPU's power consumption. The CPU will modulate its frequency to keep this moving average at-or-below the TDP.

One consequence of using a moving average is that if the CPU has been idle for a long time and then starts running a high-power workload, instantaneous power consumption can momentarily exceed the TDP while the average catches up. This is often misleadingly referred to as "turbo mode" by hardware review sites. It's not a mode, and there's no state machine at work here; it's just a natural result of using a moving average. The use of EWMA is meant to model the heat capacity of the cooling solution. When the CPU has been idle for a while and the heatsink is cool, the CPU can afford to use more power while the heatsink warms up.
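
To make the moving-average behaviour concrete, here's a toy sketch of the idea (an illustration only, not any vendor's actual algorithm; all the numbers are made up):

    # Toy EWMA power limit: the chip may draw burst power while the moving
    # average of its recent draw is still below the steady-state limit.
    tdp = 65.0           # W, steady-state limit (made-up number)
    burst_power = 150.0  # W, unconstrained draw under load (made-up number)
    alpha = 0.02         # smoothing factor per time step (assumed)

    avg = 0.0            # the chip has been idle, so the average starts low
    for step in range(301):
        power = burst_power if avg < tdp else tdp  # crude bang-bang throttle
        avg = alpha * power + (1 - alpha) * avg
        if step % 50 == 0:
            print(f"t={step:3d}  draw={power:5.1f} W  EWMA={avg:5.1f} W")
    # A real CPU modulates frequency smoothly instead of snapping between two
    # power levels, but the burst-then-settle shape is the same: the draw can
    # exceed the limit from a cold start until the average catches up.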

Another factor which confuses things is motherboard firmware disabling power limits without the user's knowledge. Motherboards marketed to enthusiasts often do this to make the boards look better in review benchmarks. This is where a lot of the "Intel is lying" comes from, but it's really the motherboard manufacturers being underhanded.

The situation on the AMD side is of course a bit different. AMD's power and frequency scaling is both more complex and much less documented than Intel's so it's hard to say exactly what the CPU is doing. What is known is that none of the actual power limits programmed into the CPU align with the TDP listed in the spec. In practice the steady state power consumption of AMD CPUs under load is typically about 1.35x the TDP.

Unlike Intel, firmware for AMD motherboards does not mess with the CPU's power limit settings unless the user does so explicitly. Presumably this is because AMD's CPU warranty is voided by changing those settings, while Intel's is not.


Intel and AMD absolutely have "turbo mode" and market such features using the term "turbo". It might just be a result of the weighted moving average, but the term "turbo" isn't something reviewers just made up out of nowhere.

https://www.intel.com/content/www/us/en/architecture-and-tec...

https://www.amd.com/en/technologies/turbo-core


Intel measures TDP at base frequency... that's disingenuous.


They don’t. They just measure it differently than AMD. Intel measures at base clock, but AMD measures at sustained max clock IIRC. It’s definitely deceptive, but it’s not a lie as long as Intel tells you (which they do).


Intel's TDP numbers are at best an indicator of which product segment a chip falls into. They are wildly inaccurate and unreliable indicators of power draw under any circumstance. For example, here's a "58W" TDP Celeron that can't seem to get above 20W: https://twitter.com/IanCutress/status/1345656830907789312

And on the flip side, if you're building a desktop PC with a more high-end Intel processor, you will usually have to change a lot of motherboard firmware settings to get the behavior to resemble Intel's own recommendations that their TDP numbers are supposedly based on. Without those changes, lots of consumer retail motherboards default to having most or all of the power limits effectively disabled. So out of the box, a "65W" i7-10700 and a "125W" i7-10700K will both hit 190-200W when all 8 cores/16 threads are loaded.

If a metric can in practice be off by a factor of three in either direction, it's really quite useless and should not be quantified with a scientific unit like Watts.


It is a lie when they change the definition of TDP without telling you first, and later redefine the word to mean something different once they get caught.

Maybe we should use a new term for it, something like iTDP.


Well, it's a power measurement that isn't total and can't be used for design... So, it's a lie.

If they gave it some other name, it would be only misleading. Calling it TDP is a lie.


They both lie, but Intel lies worse :D


Nah. Both brands pull more than TDP when boosting at max, AMD desktop processors will pull up to 30% above the specified TDP for an indefinite period of time (they call this number the "PPT" instead, but they need to go higher than TDP to hit full boost, and PPT is the number that governs that).

Intel mobile processors actually obey TDP better than AMD processors do - Tiger Lake has a hard limit, when you configure a 15W TDP then it really is 15W steady-state once boost expires, while AMD mobile products will pull up to 50% more than configured in steady-state operation. (the gap is larger than desktop)

https://images.anandtech.com/doci/16084/Power%20-%2015W%20Co...

"the brands measure it differently" is sort of theoretically true but not in the sense people think, and in practice it is not true.

On AMD it is literally just a number they pick that goes into the boost algorithm. Robert Hallock did some dumb handwavy shit about how it's measured with some delta-T above ambient with a reference cooler, but the fact is that the chip itself basically determines how high it'll boost based on the number they configure, so it is a self-fulfilling prophecy: the delta-T above ambient is dependent on the number they configure the chip to run at.

In practice: what's the difference between a 3600 and a 3600X? One is configured with a TDP of 65W and one is configured with a TDP of 95W, the latter lets you boost higher and therefore it clocks higher. Configure them both to a 65W PPT limit and they will boost to pretty much the same place.

Intel nominally states that it's measured as a worst-case load at base clocks, something like Prime95 that absolutely nukes the processor (and even then many processors do not actually hit it). But really it is also just a number that they pick. The number has shifted over time, previously they used to undershoot a lot, now they tend to match the official TDP. It's not an actual measurement, it's just a "power category" that they classify the processors as, it's informed by real numbers but it's ultimately a human decision which tier they put them in.

So in practice, for both brands, it is just a number they pick. They have different theoretical methods for getting there but ultimately the marketing department looks at where the clocks would put them and pick a power number that they think represents that. It is not, in practice, a pure measurement from either brand, it is just a "category" they use.

Real-world, you will always boost above base clocks on both brands at stock TDP, at least on real-world loads. You won't hit full boost on either brand without exceeding TDP; the "AMD measures at full boost" claim is categorically false despite being commonly repeated. AMD's PPT lets them boost above the official TDP for an unlimited period of time; they cannot run full boost when limited to the official TDP.
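
For reference, stock AM4 PPT is commonly reported as about 1.35× the rated TDP, which is where the usual 88W and 142W limits come from (quick arithmetic, not an official spec I can point to):

    # Stock AM4 package power tracking (PPT) is commonly reported as ~1.35x TDP.
    for tdp_w in (65, 105):
        print(f"TDP {tdp_w} W -> PPT ~{round(tdp_w * 1.35)} W")
    # -> TDP 65 W -> PPT ~88 W; TDP 105 W -> PPT ~142 W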


Can you cite something? Sounds interesting.


It’s not true. Sort of. Intel measures at base clock while AMD does so at sustained peak clock. Deceptive? Yes. A lie? No.


Yeah, the iMac Pro has the Xeon W and ECC. 'Twould be nice if the Apple Silicon MacBook Pro had it. There's not much of a reason to pay for the Pro over the Air. But like Linus, I'm going to blame Intel for this situation in the market. Maybe Apple will strike out on its own with Apple Silicon, but since their dominant use case is phones, I'll not hold my breath.


Unless something weird happens, the next generation of the Apple M-line will use LPDDR5 memory instead of the LPDDR4X used in the Apple M1. While it probably won't support error correction monitoring, LPDDR5 has built in error correction that silently corrects single bit flips. That alone should be a huge reliability improvement.

LPDDR5 will enable some much needed level of error correction in a metric ton of other future SoC designs too. I look forward to the future Raspberry Pi with built in error correction capabilities.


Will this also exist for consumer DDR5?


Yes, this applies to both DDR5 and LPDDR5. Leaks indicate that DDR5 CPUs and motherboards by Intel and AMD are not going to be out this year though, at least not on the desktop.


Well then, maybe I'll wait for the rumored 14.1 MacBook Pro. MacRumors says second half of 2021.


While it's true that Intel only has ECC support on Xeon (and several other chips targeted at the embedded market) it's not true that ECC is supported well on AMD.

We only use Xeons on developer desktops and production machines here precisely because of ECC. It's about 1 bit flip/month/gigabyte. That's too much risk when doing something critical for a client.


> it's not true that ECC is supported well on AMD

ECC is supported on most Ryzen models[1], as long as the motherboard supports it. In fact, ASUS and ASRock (possibly others) have Ryzen motherboards designed for workstation/server use where ECC support is specifically advertised.

[1] The only exception is the Ryzen CPUs with integrated graphics.


Depends what you mean by supported. Semi-officially:

ECC is not disabled. It works, but not validated for our consumer client platform.

Validated means run it through server/workstation grade testing. For the first Ryzen processors, focused on the prosumer / gaming market, this feature is enabled and working but not validated by AMD. You should not have issues creating a whitebox homelab or NAS with ECC memory enabled.

https://old.reddit.com/r/Amd/comments/5x4hxu/we_are_amd_crea...


AMD may claim not to validate ECC on Ryzen, but it's working well enough for major motherboard vendors to market Ryzen motherboards with ECC advertised as a feature.

ECC support not being "validated," for all practical purposes, simply means that board vendors can advertise a board lacking ECC support as compatible with AMD's AM4 platform, without getting a nasty letter from AMD's lawyers.


Yes, there is a risk of buying a Ryzen CPU with non-functional ECC.

However, I use only computers with ECC; previously only Xeons, but in recent years I have replaced many of them with Ryzens, all of which work fine with ECC memory.

When having to choose between a very small risk of losing the price of a CPU, and being stuck for many years with an Intel CPU running at half the speed of the AMD, the choice was very obvious for me.


Your quote is for consumer platforms (Ryzen) only; GP's statement was that ECC is not well-supported on AMD at all, which is obviously false (EPYC, Threadripper).


It's an unsupported configuration and it's not tested.

The latter is a big problem. One of the extreme-OC guys (Buildzoid), who interacts frequently with the OEMs (as he is pushing their stuff to the limit and frequently needs their help), has commented that AMD has a really bad problem with their BIOS teams. The AGESA firmware (the low-level code that the processor actually runs) is buggy as all hell at a firmware level, and the OEMs are forced to patch around it in the BIOS; but the AGESA firmware also has a massive problem with code churn, so these BIOS fixups basically stop working all the time. And the driver teams at a lot of OEMs are literally one person, so there isn't enough staffing to test everything all the time. The long and short of it is: stuff breaks in AMD BIOSes, constantly, and they don't notice it.

This is obviously a huge problem when ECC is not an officially supported feature, because it means nobody is testing it! You might update your BIOS (as you frequently have to do with AMD machines) and suddenly ECC stops working: it might be running in non-ECC mode and no longer correcting errors, or it might have screwed up reporting them to the OS.

The server/workstation boards are the only ones you should be trusting Ryzen with ECC usage on.


> While it's true that Intel only has ECC support on Xeon

That's not true. There are Core i3, Atom, Celeron, and Pentium SKUs with ECC. E.g. the Core i3-9300

https://en.wikichip.org/wiki/intel/core_i3/i3-9300


> it's not true that ECC is supported well on AMD.

That's an extreme claim. Why do you say so?


Doesn't Intel make ECC available on the i3 line of CPUs?


I was going to say no, but I just checked, and at least ONE latest-generation i3 processor supports ECC.

https://ark.intel.com/content/www/us/en/ark/compare.html?pro...

https://ark.intel.com/content/www/us/en/ark/products/208074/...

The problem is that this processor is an embedded part, so it's probably not for us:

> Industrial Extended Temp, Embedded Broad Market Extended Temp

My understanding is Intel does not support ECC on the desktop unless you pay extra.


Yeah, that appears to be a BGA-packaged processor designed to be permanently soldered to the board of some embedded device, not something that you can install in your desktop at all. I'm not sure why Intel decided to brand their embedded processors with ECC as i3, though I suspect the reason this range exists at all is because companies were going with competitors like AMD instead due to their across-the-board ECC support.


That i3 is for file servers.


They used to support ECC in the desktop i3 lineup, current gen does not have ECC except in some embedded SKUs.

https://ark.intel.com/content/www/us/en/ark/products/199280/...



You can find non-Xeons with ECC support. But they are rare and usually suitable for some kinds of micro servers.


Were you around for enough DRAM generations to notice an effect of DRAM density / cell-size on reported ECC error rate?

I’ve always believed that, ECC aside, DRAM made intentionally with big cells would be less prone to spurious bit-flips (and that this is one of the things NASA means when they talk about “radiation hardening” a computer: sourcing memory with ungodly-large DRAM cells, willingly trading off lower memory capacity for higher per-cell level-shift activation-energy.)

If that’s true, then that would mean that the per-cell error rate would have actually been increasing over the years, as DRAM cell-size decreased, in the same way cell-size decrease and voltage-level tightening have increased error rate for flash memory. Combined with the fact that we just have N times more memory now, you’d think we’d be seeing a quadratic increase in faults compared to 40 years ago. But do we? It doesn’t seem like it.

I’ve also heard a counter-effect proposed, though: maybe there really are far more “raw” bit-flips going on — but far less of main memory is now in the causal chain for corrupting a workload than it used to be. In the 80s, on an 8-bit micro, POKEing any random address might wreck a program, since there’s only 64k addresses to POKE and most of the writable ones are in use for something critical. Today, most RAM is some sort of cache or buffer that’s going to be used once to produce some ephemeral IO effect (e.g. the compressed data for a video frame, that might decompress incorrectly, but only cause 16ms of glitchiness before the next frame comes along to paper over it); or, if it’s functional data, it’s part of a fault-tolerant component (e.g. a TCP packet, that’s going to checksum-fail when passed to the Ethernet controller and so not even be sent, causing the client to need to retry the request; or, even if accidentally checksums correctly, the server will choke on the malformed request, send an error... and the client will need to retry the request. One generic retry-on-exception handler around your net request, and you get memory fault-tolerance for free!)
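
(As a minimal sketch of that last point -- hypothetical names, not any particular client library:)

    # A transient fault (bit-flipped buffer, mangled packet, server-side parse
    # error) just becomes one extra attempt behind a generic retry wrapper.
    import time

    def with_retries(request_fn, attempts=3, delay_s=0.5):
        for attempt in range(attempts):
            try:
                return request_fn()   # e.g. lambda: client.get("/resource")
            except Exception:
                if attempt == attempts - 1:
                    raise             # give up after the last attempt
                time.sleep(delay_s)   # back off, rebuild the request, resend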

If both effects are real, this would imply that regular PCs without ECC should still seem quite stable — but that it would be a far worse idea to run a non-ECC machine as a densely-packed multitenant VM hypervisor today (i.e. to tile main memory with OS kernels), than it would have been ~20 years ago when memory densities were lower. Can anyone attest to this?

(I’d just ask for actual numbers on whether per-cell per-second errors have increased over the years, but I don’t expect anyone has them.)


I think it's been quadratic with a pretty low contribution from the order 2 term.

Think of the number of events that can flip a bit. If you make bits smaller, you get a modestly larger number of events in a given area capable of flipping a bit, spread across a larger number of bits in that area.

That is, it's flip event rate * memory die area, not flip event rate * number of memory bits.

In recent generations, I understand it's even been a bit paradoxical-- smaller geometries mean less of the die is actual memory bits, so you can actually end up with fewer flips from shrinking geometries.

And sure, your other effect is true: there's a whole lot fewer bitflips that "matter". Flip a bit in some framebuffer used in compositing somewhere-- and that's a lot of my memory-- and I don't care.


Sorry, I don't have the numbers you asked for. But AFAIK one other effect is that "modern" semiconductor processes like FinFET and fully depleted silicon-on-insulator are less prone to single-event upsets, and in particular tend to result in only a single bit flipping rather than a whole region of transistors being drained by a single alpha particle.


When you say bitflips were "common" on thousands of physical machines, does that mean you observed thousands of bitflips?

Otherwise, I would think that an unlikely event becoming 1000x more likely by sheer numbers would have warped your perception.

I believe that hardware reliability is mostly irrelevant, because software reliability is already far worse. It doesn't matter whether a bitflip (unlikely) or some bug (likely) causes a node to spuriously fail, what matters is that this failure is handled gracefully.


Yes. I work at Facebook on data compression.

The libraries we maintain (1) are responsible for a non-trivial part of Facebook's overall compute footprint, (2) should basically never fail of their own accord, and (3) have pretty good error monitoring. So my team is operating what is effectively (among other things) a very sensitive detector for hardware failure.

And indeed we see examples all the time of blobs that fail to decompress, and usually when we dig in we find that the blob is only a single bit-flip away from a blob that decompresses successfully into a syntactically correct message. I can't share numbers, but, off the top of my head, I think it's the largest source of failures we see. It happens frequently enough that I wrote a tool to automate checking [0].

So yes. It happens. Pretty frequently, in the sense that if you're doing xillions of operations a day, a one-in-a-xillion failure happens all the time.

[0] https://github.com/facebook/zstd/tree/dev/contrib/diagnose_c...
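
The core of such a check is simple enough to sketch. This is a slow brute-force illustration using the zstandard Python package, not the linked tool:

    # Is this corrupted blob a single bit-flip away from a frame that decompresses?
    # Brute force: try every single-bit flip and see if decompression succeeds.
    import zstandard

    def single_flip_candidates(blob: bytes):
        dctx = zstandard.ZstdDecompressor()
        for byte_idx in range(len(blob)):
            for bit in range(8):
                candidate = bytearray(blob)
                candidate[byte_idx] ^= 1 << bit
                try:
                    dctx.decompress(bytes(candidate), max_output_size=1 << 30)
                except zstandard.ZstdError:
                    continue
                yield byte_idx, bit  # this flip yields a decodable frame

Any hit is strong evidence of a hardware bit flip rather than a compression bug (the real tool linked above is the thing to actually use; treat this as the general idea only).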


On a single computer with a large memory, e.g. 32 GB or more, the time between errors can be a few months, if you are lucky enough to have good modules. Moreover, some of the errors will have no effect, if they happen to affect free memory.

Nevertheless, anyone who uses the computer for anything else besides games or movie watching, will greatly benefit from having ECC memory, because that is the only way to learn when the memory modules become defective.

Modern memories have a shorter lifetime than old memories and very frequently they begin to have bit errors from time to time long before breaking down completely.

Without ECC, you will become aware that a memory module is defective only when the computer crashes or no longer boots and severe data corruption in your files could have happened some months before that.

For me, this was the most obvious reason why ECC is useful: in several cases I was able to replace memory modules that began to have frequent correctable errors, after many years with little or no errors, without losing any precious data and without downtime.


The good modules bit is important. I'm told by some colleagues that, surprisingly enough, most of the bit flips are from alpha particles from the RAM packaging itself.


Plus, ECC RAM is so accessible these days thanks to Ryzen. All the ASRock mobos, from the lowest end to the highest end, advertise official support for it.


It depends where the failure happens. Sometimes you really lose the “failure in the wrong place” lottery. For example, in a domain name: http://dinaburg.org/bitsquatting.html


Another comment[1] mentioned 1 bitflip per gigabyte per month. If you have a lot of RAM, that's rather a lot.

> It doesn't matter whether a bitflip (unlikely) or some bug (likely) causes a node to spuriously fail

Except that a bitflip can go undetected. It may crash your software or system, but it also may simply leak errors into your data, which can be far more catastrophic.

[1] https://news.ycombinator.com/item?id=25623206


I have a server with 384GB of RAM sitting next to me, and in two months of uptime with the only shielding being a pile of IBM iron on top of it, it has detected a grand total of 0 errors.

This is, of course, an anecdote rather than data, but 0 is different enough from the expected 768 that it makes me doubt that statistic.


So can a bug.


Yes. And? That doesn't suddenly make bitflips benign.


The point is that you can't prevent failure by just buying something. You have to deal with the fact that failure can not be prevented.

In other words, if a single defective DIMM somewhere in your deployment is causing catastrophic failure, your mistake was not buying the wrong RAM modules. Your mistake was relying on a single point of failure for mission-critical data.


People might not be preventing failure by using ECC, but they significantly decrease the likelihood of having to deal with hard-to-debug problems caused by bitflips.


It's enough that graphs can show you solar weather.

I can't give my source, but it's far higher than most people think. Just pay the money.


Ya, I'm not buying that bitflips are a problem. Or maybe modern software can correct for this better? Because I use my desktop all day, every day, running tons of software on 64 GB of RAM, and I don't get errors or crashes often enough to remember ever having one.


> I'm not buying that bitflips are a problem.

Google and read up - it is a problem, has killed people, has thrown election results, and much more.

It's such a common problem that bitsquatting is a real thing :)

Want to do an experiment? Pick a bitsquatted domain for a common site, and see how often you get hits.

https://en.wikipedia.org/wiki/Bitsquatting
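
Generating candidate bitsquat domains is easy, if you want a feel for how many plausible ones a name has (rough sketch; it only keeps flips that land on valid hostname characters):

    # Single-bit-flip variants of a domain label that are still valid hostname
    # characters -- i.e. candidate bitsquat domains.
    import string

    ALLOWED = set(string.ascii_lowercase + string.digits + "-")

    def bitsquats(label: str):
        out = set()
        for i, ch in enumerate(label):
            for bit in range(8):
                flipped = chr(ord(ch) ^ (1 << bit)).lower()
                if flipped in ALLOWED and flipped != ch:
                    out.add(label[:i] + flipped + label[i + 1:])
        return sorted(out)

    print(bitsquats("example"))  # includes 'axample', 'dxample', 'uxample', ...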


Nobody denies that bitflips happen. On the whole, you fail to make a case that preventing bitflips is the solution to a problem. Bitsquatting is not a real problem, it's a curiosity.

As for the case of bitflips killing someone: Bitflips are not the root cause here. The root cause is that somebody engineered something life-critical that mistakenly assumed hardware can not fail. Bitflips are just one of many reasons for hardware failure.


>Bitflips are not the root cause here.

So those systems didn't fail when a bitflip happened?

> The root cause is that somebody engineered something life-critical that mistakenly assumed hardware can not fail.

The systems I am aware of were designed with bitflips in mind. NO software can handle arbitrary amounts of bitflips. ALL software designed to mitigate bitflips only lowers the odds via various forms of redundancy. (For context, I've written code for NASA, written a few proposals on making things more radiation-hardened, and my PhD thesis was on a new class of error-correcting codes - so I do know a little about making redundant software and hardware specifically designed to mitigate bitflips.)

By claiming a bitflip didn't kick off the problems, and trying to push the cause elsewhere, you may as well blame all of engineering for making a device that can kill on failure.

So your argument is a red herring

>On the whole, you fail to make a case that preventing bitflips is the solution to a problem

Yes, had those bitflips been prevented, or not happened, those fatalities would not have happened.

>Ya, I'm not buying that bitflips are a problem.

If bitflips are not a problem then we don't need ECC ram (or ECC almost anything!) which is clearly used a lot. So bitflips are enough of a problem that a massively widespread technology is in place to handle precisely that problem.

I guess you've never written a program and watched bits flip on computers you control? You should try it - it's a good exercise to see how often it does happen.

I guess you define something being a problem differently than I or the ECC ram industry do.


> So those systems didn't fail when a bitflip happened?

I didn't say that. I'm saying that the root cause (as in "root cause analysis") is not the bitflip. Designating the bitflip as the root cause is like analyzing your drunk driving accident and concluding that the root cause must be ethanol, rather than your drinking habits.

> The systems I am aware of were designed with bitflips in mind. NO software can handle arbitrary amounts of bitflips. ALL software designed to mitigate bitflips only lowers the odds via various forms of redundancy.

Of course, and I'm not actually arguing that adding in ECC is completely worthless to that effect, though it is close to worthless. Luckily, ECC is quite cheap, if not free, so throwing it in there makes sense.

However, suppose ECC would increase the cost by several magnitudes, would it still be worth it? Obviously not. Redundancy alone reduces the probability of spurious failure by several magnitudes, and simply increasing redundancy would be far cheaper than adding in ECC.

> If bitflips are not a problem then we don't need ECC ram (or ECC almost anything!) which is clearly used a lot. So bitflips are enough of a problem that a massively widespread technology is in place to handle precisely that problem.

My point is that bitflips either don't really matter, in case data integrity is not mission critical, or they don't actually solve the problem, in case data integrity is mission critical.

If you have solved the problem of data integrity through redundancy, then ECC doesn't make much of a difference anymore. If you haven't solved the problem, then ECC will only prevent a vanishingly small subset of disasters that are awaiting you.

> I guess you've never written a program and watched bits flip on computers you control? You should try it - it's a good exercise to see how often it does happen.

I don't care how often it happens. I care about the odds of a bitflip causing an actual problem. If a computer crashes, that's okay, it'll reboot. If any data were to be corrupted, it would most likely happen at the disk level and not the DRAM level.

> I guess you define something being a problem differently than I or the ECC ram industry do.

Of course, somebody who sells ECC RAM will want to convince you that ECC actually solves a real problem. The same can be said about the nutritional supplement industry, or many other industries that rely on make-believe.


>I don't care how often it happens.

Yes, that is clear.

> If you have solved the problem of data integrity...

As above, this is not a binary, black and white thing, but you keep presenting it as such. It's probabilistic, and higher protection is not free - the tradeoff is engineering.

> Redundancy alone reduces the probability of spurious failure by several magnitudes

ECC "alone reduces the probability of spurious failure by several magnitudes". That's why it is used.

Naive redundancy ignores almost a century of better methods from forward error-correcting codes. I have a feeling your idea of redundancy is having multiple exact copies of a system or data and having them vote, which is a terribly expensive way to do data protection when there are vastly better methods.

>Of course, somebody who sells ECC RAM will want to convince you that ECC actually solves a real problem. The same can be said about the nutritional supplement industry, or many other industries that rely on make-believe.

And we're done. If you don't think ECC helps a real problem then I see why you don't understand bitflip causing problems. Good luck.


> As above, this is not a binary, black and white thing, but you keep presenting it as such. It's probabilistic, and higher protection is not free - the tradeoff is engineering.

The actual problem is binary. You either solved it, or you didn't. ECC is "free", but it doesn't actually solve the problem. Actually solving the problem requires engineering.

Of course there's a probabilistic element to it, but the problem is to drive the probability of failure to "vanishingly small". The utility of adding or removing a vanishingly small constant to another vanishingly small constant is vanishingly small. This is what ECC does for you.

> ECC "alone reduces the probability of spurious failure by several magnitudes". That's why it is used.

ECC reduces the probability of spurious failure due to bitflips in DRAM by several magnitudes. However, spurious failure can occur for so many more reasons that the bitflip issue becomes a vanishingly small part.

> I have a feeling your idea of redundancy is having multiple exact copies of a system or data and having them vote, which is a terribly expensive way to do data protection when there are vastly better methods.

As you know, having worked for NASA, this is the right choice under certain circumstances. If there are lives on the line and you have a choice between "not solving a problem" and "a terribly expensive solution", you should go with the latter.

> If you don't think ECC helps a real problem then I see why you don't understand bitflip causing problems.

ECC does not solve the problem of data integrity. If you actually solve the problem of data integrity, you will find that ECC becomes effectively redundant. Do we not fundamentally agree on this? If so, why not?

That's not to say ECC is entirely useless from an administrative standpoint. It makes DRAM bitflips one less thing to worry about. One less thing out of thousands of things. Commensurately, the cost of ECC in a given deployment, like its utility, is vanishingly small.


Crashes aren't such a big problem. You can detect them and reboot or whatever. Silent data corruption is the real issue IMHO.

See also this comment above: https://news.ycombinator.com/item?id=25623764


There is no guarantee of state at the quantum level ... just a high-degree of assurance in a state. After 40 years in the electronics, optics, software business, I've learned that there is absolutely the possibility for unexplained "blips".


How did you track memory errors across thousands of physical machines?


https://github.com/netdata/netdata/issues/1508

Looks like `mcelog --client` might be a starting place? Feed that into your metrics pipeline and alert on it like anything else...


Newer Linux systems have replaced mcelog with edac-util. I think most shops operating systems at that scale are getting their ECC errors out of band via the IPMI SEL, though.
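
If you just want the in-band counters that edac-util reads, they are plain sysfs files, so scraping them into whatever metrics pipeline you already run is straightforward. A minimal sketch, assuming the standard /sys/devices/system/edac layout:

    # Scrape the kernel's EDAC corrected/uncorrected error counters so they can
    # be exported as time-series metrics (the same files edac-util reads).
    from pathlib import Path

    def ecc_error_counts():
        counts = {}
        for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc*")):
            counts[mc.name] = {
                "corrected": int((mc / "ce_count").read_text()),
                "uncorrected": int((mc / "ue_count").read_text()),
            }
        return counts

    if __name__ == "__main__":
        print(ecc_error_counts())  # e.g. {'mc0': {'corrected': 0, 'uncorrected': 0}}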



The same way you do it with everything else, export the telemetry and store it in time series...


Can you get decent battery life with this ECC memory in a laptop?


Yes. ECC memory uses only marginally more power than non-ECC memory. And memory isn’t the largest consumer of battery life by a country mile.

Screen, Wi-Fi, and to a much lesser extent (unless under load) the CPU are the most major culprits of low battery life.


It can actually reduce power consumption, because refresh rates don't need to be so high:

https://media-www.micron.com/-/media/client/global/documents...

