Using time travel to remotely debug faulty DRAM (julialang.org)
161 points by KenoFischer 8 months ago | hide | past | favorite | 62 comments

Well that's remarkable; this is one of those demos that makes me think maybe I should go learn this tech stack just so I could do this. (Another is org mode in emacs.) Looks like rr is flexible enough that I can use it without having to pick up Julia, which is an easier on-ramp even if this post makes me think that maybe I should learn Julia...

I was just thinking, it's odd that there are still processors being made that don't support ECC memory, and that this isn't a thing that is expected given the (relatively) large amounts of RAM that are included in modern machines.

ECC adds significant cost, and the benefits are statistically meager. I ran a fleet of a few thousand servers with all having 64GB and some having up to 768GB ECC RAM. I'd estimate we had no more than one swap a month. Most systems never had any (reported) ECC corrections.

I still think it's important, but it makes sense that people don't want to pay more for something to detect errors that are quite unlikely.

Well, it uses 12.5% of the bytes for error correction instead of data storage, which is not very much overhead. Your 128GB becomes 112GB, which is still pretty good.

I also disagree that the benefits are meager. I have suffered from several computers with flaky memory (fine when you build the computer, flaky several years later; then you do an overnight memory test and find that indeed the memory has gone bad), and a strong software signal that "hey your memory is bad" is very actionable. You also have to think about it from a programming standpoint -- what happens if this variable isn't what I set it to? What if "for 1..10" is actually "for 1..2147483658"? Do you have time to debug that? How much data do you lose when you persist that to disk? To me it is insane to not get this nearly-free consistency check if you ever plan to persist any bytes in RAM to long-term storage. Even consumer GPUs have ECC memory these days. It's a no-brainer.
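
To make the loop-bound example concrete, here is a minimal sketch of a single bit flip. It assumes the bound is held in an integer at least 32 bits wide and that the flipped position is bit 31; `flip_bit` is a hypothetical helper for illustration only.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the bit at the given position inverted."""
    return value ^ (1 << bit)


loop_bound = 10
# One flipped bit (bit 31, chosen to match the number in the comment above)
corrupted_bound = flip_bit(loop_bound, 31)
print(loop_bound, "->", corrupted_bound)
```

A second flip of the same bit restores the original value, which is why a transient error can leave no trace once the damage is done.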

What surprises me is that the industry segments RAM sticks by ECC/not-ECC, when it's just a software function performed by the memory controller. I think everyone with a HEDT setup would be happy to enable ECC and get better reliability at the cost of 12% less RAM. I know I would be. (I built a Threadripper workstation recently and just couldn't get any reasonable prices on QVL'd ECC memory. So I skipped it and paid like $600 for 128GB of 3600MT/16-CAS memory... would be happy to flip a switch and have that be only 112GB.)

The reason ECC isn't widespread is that it's used to segment the market. Home users are cost sensitive, so hardware vendors have to think of ways to get server users to not buy home user equipment. ECC is one of these levers. That's all there is to it. Everyone would have ECC if it were a mere 12.5% more expensive.

This is not how ECC ram works. 128GB of ram does not change to 112GB with ECC. 128GB of ram with ECC will invariably be 144GB.

Standard memory modules become (a multiple of) 9 bits wide, with the additional bit stored near the other 8 bits.

Which is still 12.5%. Which for most people wouldn't be a problem since ram is a fraction of the system price (so it might only raise the price of the machine by roughly 4% if the ram is 1/3 of the total BOM).
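
The capacity numbers in this subthread can be checked with a few lines of arithmetic. This is only a sketch, assuming the common layout of 8 check bits per 64 data bits; the function names are made up for illustration.

```python
DATA_BITS = 64   # data bits per transfer
CHECK_BITS = 8   # ECC bits stored alongside them (assumed 72-bit layout)


def raw_capacity_gb(usable_gb: float) -> float:
    """Raw DRAM needed to expose usable_gb of ECC-protected memory."""
    return usable_gb * (DATA_BITS + CHECK_BITS) / DATA_BITS


def usable_capacity_gb(raw_gb: float) -> float:
    """Usable memory left if the check bits are carved out of raw_gb."""
    return raw_gb * DATA_BITS / (DATA_BITS + CHECK_BITS)


print(raw_capacity_gb(128))               # extra chips added to the module
print(round(usable_capacity_gb(128), 1))  # check bits taken from existing chips
```

Under this assumption the extra hardware is 12.5% of the data capacity, while carving check bits out of an existing module costs 1/9 of the raw capacity, so the result is slightly above 112GB.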

The problem is _INTEL_ deciding it's a premium feature, and the memory manufacturers charging 50%+ more for 12.5% more hardware.

So instead of being a couple percent of the system cost, it ends up being a noticeable double digit percentage.

Of course none of this explains why apple/etc haven't done it in their phones.

You seem to be commenting on cost figures, but have not actually identified the cost / sales volume relationship.

ECC memory physically requires additional wires, necessitating different parts that don't sell at the same volumes.

Not sure what volume has to do with it, if the whole market is ECC.

And as far as "wires", traces are basically free as long as they don't force additional pcb layers, which shouldn't be the case given the careful pin/ddr chip/processor designs focused on exactly that. The extra pins are there regardless, and in the past so were the "wires" given that CBx/DMx pins were muxed. Further, packetized ram interfaces have been known to bury the error correction in the protocol same as PCIe/etc. Meaning it's a slight efficiency loss.

Look at it this way: every part of the system _EXCEPT_ the ram has some kind of error correction on it at this point. It's a cheap way to not only increase robustness but it's also a security mechanism.

If I were guessing, I would say that DDR5 is the last version that isn't ECC end to end; the idea that the manufacturers can shrink the dies at the cost of some BER and make it up with a bit of embedded ECC will just be too tempting. Particularly with AMD on the rise, Intel will have a much harder time playing product segmentation games if AMD doesn't do the same.

It's not just about the cost of a pcb trace. A wider parallel bus means more potential skew which means either relaxing timing or tighter tolerances. The notional 12% cost for the extra ram bank and related hardware may not seem like much, but the margins on commodity computing equipment are razor thin. Consumers don't and won't buy on the basis of ECC being on a sticker on the box, so manufacturers correctly conclude that they should cut the feature or be undercut on price.

Dunno, I've noticed it being a pretty minimal step up in price. Over a few generations the E3-1230 (quad core/8 threads) was cheaper than the similar i7. The clock speed is slightly lower (100-200 MHz), but I figure that's reasonable if you are going for a nice, cool, and reliable CPU.

So I save $50 on the CPU and that just about covers the ram/motherboard premium.

I know that's not how it's sold (your "128GB" stick actually contains 144GB of memory if it's ECC), but that's dumb and drives up the cost. Needing two SKUs is kind of the problem, and it's not a technical problem.

> and it's not a technical problem

For a memory controller to add extra bits, they would have to come from somewhere. For every 64 bits (8 bytes read), you now need to read 72 bits (9 bytes).

However “DIMMs are printed circuit boards that carry multiple packaged DRAMs and support 64bit or 72bit databus widths, the latter to enable eight error-checking and correction (ECC) bits to protect against single-bit errors.”.

So now every read from 64 bit (non-ECC) DRAM now needs two reads, one for the 64 bits you want, and another to read the 8 bits ECC.

If your access pattern is random, then you will slow down memory access by 100%. For long run sequential reads/writes the slow down will be 12.5% assuming you can lock the memory accesses for 9 sequential reads/writes (avoiding invalid memory states is essential!).

The “cost” of ECC implemented by the memory controller is:

* you lose 1/9th of your memory (as you point out)

* the speed of your computer drops by up to 100% (speed of many operations is limited by random memory access speed, not CPU)

* CPU atomic instructions for memory access are complicated https://en.wikipedia.org/wiki/Compare-and-swap and would also incur a further speed penalty in critical sections - which can be very significant.

The idea seems useful (clearly it would be a fantastic feature to be able to switch on if you suspect faulty memory) but there are likely technical reasons for why your idea is not implemented (not just price discrimination).

> dumb

That appears condescending to me. Generally you should assume people are smart. ECC RAM is designed by smart people to be the way it is; start with the assumption that if an idea isn’t implemented, there are perhaps good reasons why not.

(Edited to make reply flow better. Disclaimer: I only have a very shallow knowledge of the design constraints, and I expect there are other more serious problems with the idea).

Needing an additional bit to store ECC data is not a technical problem?

Speed does not decrease with ECC memory -- you might be thinking of the extra cycle of latency associated with registered memory, and that's hardly a 100% penalty. Unbuffered ECC DIMMs have no penalty compared to non-ECC DIMMs. Shipping 72 bits to the CPU instead of 64 bits costs a bit more power, nothing more. Your desktop CPU already has all the transistors needed to handle ECC memory, they're just disabled in consumer SKUs.

The point he is trying to make is you don't need additional bits, you just need bits, they could be additional, or they could be subtracted from what is available.

How would this actually work? Memory access on your non-ecc processor is (for example) 64 bits wide. Your data is using all 64 bits. Perhaps you would only use 56 bits for user data and 8 bits of ecc data on every read? Suddenly you need two memory accesses to read your 64bit value?

I suppose you could have an extra dimm slot just for ecc storage, and only require it to be 1/8 the size of total ram on the normal slots.

(I think there are probably issues with this in terms of how ddr itself actually works with its interleaving and adding a third channel to access from but I don't know enough about how it works to know for sure if that's the case)

Separate channels are not timing locked. This would involve a lot more complexity and cost than the current practice.

At the end of the day, electrical/computer engineers do actually generally know what they're doing.

Unbuffered ECC RAM is expensive because of lower demand. Registered ECC RAM, especially used, is dirt cheap thanks to the huge number of servers.

> Do you have time to debug that?

By the way, for detection you don't need the full error correcting code, but you can use an error detecting code which can use fewer bits.
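
As a rough illustration of the bit budget: a single parity bit per word is enough to detect any one-bit error, while single-error correction needs enough check bits to name the failing position. The sketch below uses the standard Hamming bound (2^r >= m + r + 1) plus one extra overall parity bit for double-error detection; the function names are hypothetical.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Hamming check bits plus one overall parity bit (SECDED)."""
    return hamming_check_bits(data_bits) + 1


PARITY_ONLY_BITS = 1  # detects any single-bit error, corrects nothing

for width in (8, 32, 64):
    print(width, PARITY_ONLY_BITS, secded_check_bits(width))
```

For a 64-bit word this gives 8 check bits, which matches the 72-bit ECC DIMM width mentioned upthread.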

ECC is only more expensive until the first time you have to spend multiple hours debugging an issue or recovering data due to memory corruption. My time and data are worth more than the $40 premium paid for the ECC unbuffered DIMMs in my Ryzen system. And that price premium would pretty much go away if every system shipped with ECC memory as the DRAM vendors would produce x9 chips instead of x8. Shame on Intel for segmenting ECC memory out of the consumer market.

What do you mean by ECC is more expensive?

Are you talking about ECC ram being more expensive? That's probably true but why not give people the option whether to use ECC? ECC being optional is a good idea anyway because ECC memory is generally slower.

I don't think the actual ECC implementation in the processor is very expensive to implement.

Consumer CPUs not having ECC is pure product segmentation, without technical justification, IMO.

ECC ram is more expensive, yes.

But supporting ECC ram also adds more expense. To properly support it, BIOS engineers have to support and test it; motherboard makers have to support and test it.

I appreciate that AMD offers it for their consumer oriented processors, but because it's a best effort feature, and it's hard for an end user (or reviewer) to test, you never really know if you're going to get full support, or if the support is really just that you can use ECC ram, if you want to spend a little more for your ram, and get none of the benefits of ECC.

It certainly adds something to the cost (die space) of the memory controller, but I agree it's probably not much.

Well those costs are mainly related to adding support for another memory standard, but i'm suggesting ECC become the standard with non-ECC being dropped. The BIOS/testing cost would be unaffected, they would just be testing only ECC memory.

Realistically the mainstream CPU manufacturers have ECC solutions, so including them in the mainstream processors shouldn't be a huge issue, it's just a market segmentation ploy to exclude the feature from the consumer processor designs.

> i'm suggesting ECC become the standard with non-ECC being dropped.

I think the cost pressure is too high. Early IBM PCs required parity ram (a 9th chip to store if the sum of bits was even or odd), and would fault if the value was incorrect on reads. Ram module manufacturers made innovative fake parity modules that calculated the parity value on access, replacing the 9th ram chip with a very simple circuit and saving money.

It would be hard to convince the whole industry not to make fake ECC ram, if ECC was mandatory.
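
A toy model of why fake parity defeats the check, for anyone who hasn't seen it. This is a sketch, not a description of any real module: the class names are invented, and even parity is an arbitrary choice for the illustration.

```python
def even_parity(byte: int) -> int:
    """Parity bit that makes the count of 1s across data + parity even."""
    return bin(byte & 0xFF).count("1") & 1


class RealParityModule:
    """Nine chips: eight for data, a ninth that stores the parity bit."""

    def __init__(self) -> None:
        self.data = {}
        self.parity = {}

    def write(self, addr: int, byte: int) -> None:
        self.data[addr] = byte & 0xFF
        self.parity[addr] = even_parity(byte)  # stored at write time

    def read(self, addr: int):
        return self.data[addr], self.parity[addr]


class FakeParityModule:
    """Eight chips plus a circuit that recomputes parity on every read."""

    def __init__(self) -> None:
        self.data = {}

    def write(self, addr: int, byte: int) -> None:
        self.data[addr] = byte & 0xFF

    def read(self, addr: int):
        byte = self.data[addr]
        return byte, even_parity(byte)  # always consistent with the data


def read_passes_check(module, addr: int) -> bool:
    """The check the memory controller performs on each read."""
    byte, parity_bit = module.read(addr)
    return even_parity(byte) == parity_bit


def flip_stored_bit(module, addr: int, bit: int) -> None:
    """Simulate a single-bit fault in a data chip."""
    module.data[addr] ^= 1 << bit
```

Writing a byte to both modules and then calling `flip_stored_bit` on each shows the difference: the real module fails `read_passes_check`, while the fake one passes because its parity is derived from the already corrupted data.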

Fake ECC as you put it serves a purpose, it protects the bus interface. Which is one of the failure points on modern machines, and is why sometimes to fix ram/qpi/etc errors you end up replacing the motherboard.

We will have to see with DDR5 (because it supports "internal" ECC) if it's worth it to the memory industry to build RAM that is internally denser, but more error prone (as is the case with modern flash) or continue to attempt to build 100% reliable ram (and failing).

I'm betting some clever person figures that out. Which leaves only the memory bus itself unprotected. Which IMHO, is foolish and serves only to create product segmentation. So, for a DDR5 dimm with internal ECC, generating bus ECC should be a trivial addition.

> that calculated the parity value on access, replacing the 9th ram chip with a very simple circuit and saving money.

> It would be hard to convince the whole industry not to make fake ECC ram, if ECC was mandatory.

Assuming you don't use memory-mapped IO, that's easy to fix. On startup, generate 4 random bits a,b,c,d. Parity bit is data line 4a+2b+c, with d?even:odd parity, data bit 4a+2b+c is on data line D8. ECC on 64/72 uses more random bits, but is otherwise similar, although for modern chipsets it would probably have to be scrambled in the northbridge or southbridge (or equivalent) rather than the CPU, to allow for DMA and such. Note that there's no gate delay involved here; the multiplexing can be done with pass transistors.

> ECC adds significant cost

Does it, though? I don't remember all the prices from when I was researching memory close to a year ago for a new PC, but one big notable difference between ECC and non-ECC I've seen is that almost all non-ECC sticks are overclocked. You buy something rated at 2667MHz, and what you actually get is e.g. 1833MHz chips overclocked to that speed. Which makes them cheaper. But I'd expect non-overclocked non-ECC sticks not to be that far in pricing from ECC sticks (well, apart from the obvious difference that ECC sticks have more chips). And because the market is focused on cheap non-ECC sticks, they don't do cheaper ECC overclocked sticks, but there is no reason they couldn't make them. There aren't that many ECC unbuffered sticks already, it's easier to find registered ones. Essentially, the market is skewed by Intel not supporting ECC on non-Xeons.

Ironically, ECC is great to have for manual overclocking as you will know when you pushed your system too far - even if most things work fine and problems only happen with some demanding workloads. Without ECC you'd likely falsely assume that the software / OS / drivers are buggy since everything else works.

Benefits are statistically meager until a bit silently flips somewhere important and you only discover it days later when it has propagated into massive data corruption.

Been there. If your servers don’t have ECC memory, you’re eventually going to get bit.

I agree. The only processors that seem to support it are designed for servers (eg Xeon). Other processors say they're compatible but don't actually use the ECC part of the memory.

I built a ML workstation for my work team and was disappointed that my options were: expensive, low clock speed server CPU + ECC, or inexpensive, very fast, desktop CPU without ECC. Even if I were willing to pay more money, it was really hard to get the same performance if I needed ECC.

It's intentional. Intel has a tradition to cripple features on consumer and enthusiast-grade CPUs in order to sell more Xeon CPUs. The way Intel crippled virtualization/IOMMU in the past was also particularly annoying - on mainstream CPUs it was allowed, but on overclockable enthusiast CPUs, it was disabled.

On the other hand, AMD doesn't do that.

This bugs me so much. DW's i7-6700HQ laptop only supports 16gb, but my old 1007U supports 32gb. :/

That's just Intel though. All AMD Ryzen CPUs without integrated graphics do support ECC although YMMV depending on the motherboard you use. [1]

[1]: https://hardwarecanucks.com/cpu-motherboard/ecc-memory-amds-...

Supposedly AMD Ryzen and Threadripper support unbuffered ECC with the right mainboard (Asrock was mentioned several times).

I can confirm that Threadripper does check for errors with ECC ram. (MSI motherboard in my case but it should work with all of them)

When you collect lots of coredumps you'll get a long tail of bizarre, one-off crashes. Some of these are subtle memory corruption bugs that cause random crashes, but others are caused by bitflips-- either faulty DRAM or cosmic rays. Usually you throw your hands up and only look at crashes that happen multiple times because they're more likely to be reproducible.

The unique factor here is that rr provides _reproducible_ crash recordings, so when it fails to reproduce a crash, you've found some nondeterminism-- either a bug in rr where it didn't replay syscalls correctly or nail down thread behavior accurately, or a hardware issue like this.

In general, in the Julia community, we are all used to see Keno do impressive stuff routinely, but this one tops everything so far.

What is particularly impressive is that this lowest of the low-level debugging work is done completely in Julia - all the way.

My Dell M6800 RAM has 2 bad bits out of 32GiB (easily found with the wonderful 'stressapptest' tool). I'm not exchanging the DIMMs because of that (besides, it's of course in the DIMMs which are more difficult to access). When running Linux I mark a MiB around those memory locations to not be used (memmap option). I haven't bothered to do the same when running MS Windows (it seems a bit more complex). Every once in a while that memory is accessed in a way which causes some corruption leading to a fault. I feel a bit guilty about all those senseless bug reports automatically generated and sent ...
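
For the Linux side, the kernel's `memmap=nn[KMG]$ss[KMG]` boot parameter marks a region as reserved. Below is a small sketch that builds such an argument for 1 MiB centered on a bad address. The helper name and the example address are hypothetical; use the physical addresses your memory tester actually reports.

```python
MIB = 1 << 20
PAGE = 4096


def memmap_reserve_arg(bad_addr: int, span: int = MIB) -> str:
    """Build a memmap= argument reserving `span` bytes around bad_addr."""
    start = max(0, bad_addr - span // 2)  # center the region on the bad cell
    start -= start % PAGE                 # align the start down to a page
    return f"memmap={span // 1024}K${start:#x}"


# Hypothetical failing physical address, for illustration only
print(memmap_reserve_arg(0x12345678))
```

One argument is needed per bad region. Depending on the bootloader, the `$` may need escaping in its config file (GRUB is known for this), so it is worth checking /proc/iomem after boot to confirm the region shows up as reserved.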

Is that tool the same as Windows memtest? My brand new PC randomly blue screens twice a month with 'corrupted structure' or something like that. I suspect the RAM, but the Windows tool said no errors found after a 3 hour test.

Can't even claim it under warranty because it can't be reproduced easily. Can't afford to just start replacing components. 5000€ lemon.

There are of course different kinds of hardware errors, and what appears to be bad memory might be due to some cells being defective or some less-than-perfect bus signals (which then should affect many addresses). In my (limited, perhaps dated) experience (I used it on maybe a few dozen computers), stressapptest (even using the default plain memory test) is much more efficient at finding hardware issues than memtest86+. If stressapptest doesn't find a problem in 30s, memtest86+ won't find any in 24h. I haven't bothered with memtest86+ in years.

EDIT: only ever used the free memtest86+ tool, I have no experience with PassMark's MemTest86.

> some less-than-perfect bus signals

They are also responsible for some mysterious memory compatibility problems in PCs. All PC builders have this experience - on a certain motherboard model, some DIMMs work, some don't, but they are all off-the-shelf parts that follow JEDEC standards.

Often there are minor variations in electrical characteristics from DIMMs to DIMMs. If the signal integrity on the motherboard is marginal, there will be mysterious problems.

Try Linpack. It's probably the most stressful of anything I've ever seen, and also a very realistic workload (solving linear equations) --- machines used for scientific computing are doing that all the time. It will stress both the CPU and RAM.

(Many overclockers hate it and think it's "too extreme" because it causes their otherwise "stable" overclocks to instantly fail. I love it because it shows how an unstable system will sometimes effectively calculate 1+1=3. I don't consider a system stable unless it can pass a full day of Linpack without a single error.)

> Many overclockers hate it and think it's "too extreme" because it causes their otherwise "stable" overclocks to instantly fail.

I guess it's okay to game on a system that's right on the verge of failure; what's the worst that happens, your game crashes, maybe you corrupt a drive? Hopefully there's nothing important on it. But I like your regimen for more serious computing.

Memtest86+ (http://www.memtest.org) is what hardware enthusiasts use to troubleshoot memory issues despite the built-in Windows memory test being preinstalled and easier to access (no need to boot from a USB, etc) so I'm assuming it's much better than the Windows one.

FYI Last I tried memtest86+ (in June?), it had been outdated for quite some time and downright crashed on newer hardware. The non-free alternative called Memtest86 (by Passmark) works just as well if needed.

Nowadays I use memtest86 that supports UEFI. On Windows, "RAM Test" paidware works well.

> I'm not exchanging the DIMMs because of that

Why? 1 bad bit, even an intermittent one, is enough for me to condemn RAM. Memory that doesn't remember what was last written is simply not fit for purpose. Even without faulty hardware, most software is already buggy enough as-is.

> Memory that doesn't remember what was last written is simply not fit for purpose.

IMHO, this means that every memory that's vulnerable to rowhammer and related techniques is defective, and/or has been specified to run at irresponsible refresh timings.

No access pattern should ever be able to change the value of bits not being accessed; that is the definition of faulty memory. And I'm astonished that there isn't a class-action or something.

"Normalization of deviance" comes up a lot lately. It's like the frog being boiled, we just came to accept that 100% of RAM is defective by design.

I wrote pretty much the same thing when rowhammer first came out:


Instead of mass recalls and class-action, the authors of memory testing utilities were persuaded to make rowhammer tests optional and off-by-default[1], and there is one with this massive bunch of BS that basically says "it works most of the time so you may choose to ignore it":


I recall seeing a discussion where someone basically said "100% of RAM would fail this test, so we shouldn't enable it by default" --- conveniently neglecting to mention that older DDR3 and before wouldn't.

Relatedly, I've noticed prices of used RAM, particularly DDR/DDR2, appear to have gone up recently. Other used computer parts are also selling at surprisingly high prices --- 10+-year-old motherboards and CPUs, pre-DDR3 era stuff. I wonder if that's due to decreasing production, increasing demand from retrocomputing enthusiasts, or increasing demand from those who know about this and don't want new RAM anymore.

> we just came to accept that 100% of RAM is defective by design.

Computing itself is broken. Every product from physical to application state is broken in some subtle way which is then countered by a higher layer having some extra code (or component) to adjust it.

But it happens, or it becomes "accepted", at a certain level of complexity. When you're down at the Ben Eater or Gigatron scale, you can prove with simple logic and timing diagrams that every possible machine state is correct.

Somewhere in the quest for higher speeds, RAM runs with timings and refresh intervals that allow rowhammer. And since nobody wants to take the speed hit of having memory that's actually correct, it just.... it's okay now?

Well, I disagree. I'd like to find out how to configure my memory controller to be rowhammer-proof, even if that means a refresh cycle after every single access cycle. And then we can build performance starting from that assumption that correctness is required.

As an old mentor once said, if you start out medium-speed but wrong, you'll get faster at doing it wrong. Start slow but right, and then you get faster at doing it right.

> And since nobody wants to take the speed hit of having memory that's actually correct, it just.... it's okay now?

Yeah, same thing with Spectre... part of the ToS actually states you can't benchmark Intel products with Spectre mitigations applied now (apparently that's a thing)

Because, as stated, the problem can easily be circumvented. Much easier in fact than getting to the DIMMs on the MB (in many other laptops which prioritize low weight and thinness, there wouldn't even be that option unless you're comfortable with a SMD rework station).

Agreed, if the memory were deteriorating (who knows by which process), the device ought to be replaced. It hasn't, however, in the last three years, so that seems to have been a singular event (not sure if cosmic rays can permanently damage a RAM cell, but those things are tiny now).

It's much more cost effective to use software to work around hardware failures than to rely on (never perfectly) dependable hardware -- compare Google FS vs IBM Mainframe.

It's fine if you can recognize all corruption, but I doubt that's possible in this situation.

There are some things in software development that are obviously wrong to most people.

There are some things in software development that are obviously wrong to a few people.

And there are some things people have a hunch we are doing wrong but nobody can crystallize it.

Removing redundancy in code is great most of the time, but it's not a panacea. NASA had to contend with physical failures of memory, and catastrophic costs of failures in 'production'. They solved this problem by consensus pools of three, on physically separate hardware and in some cases using multiple manufacturers. Inability to reach consensus would invoke failsafes.
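
The consensus-of-three idea can be sketched in a few lines. This is only an illustration of majority voting with a failsafe, not how any NASA system is actually implemented; `vote` and `NoConsensus` are made-up names.

```python
from collections import Counter


class NoConsensus(Exception):
    """Raised when no value is reported by a strict majority of units."""


def vote(outputs):
    """Return the value reported by a strict majority of redundant units."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    raise NoConsensus(f"no majority among {list(outputs)!r}")


def command_or_failsafe(outputs, failsafe):
    """Use the majority command, falling back to a failsafe value."""
    try:
        return vote(outputs)
    except NoConsensus:
        return failsafe
```

With three units, one faulty output is out-voted by the other two; if all three disagree, the caller gets the failsafe instead of an arbitrary pick.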

I have a vague suspicion about how we condense the very most critical bits of our software down to the fewest bits of data and instructions. This may ultimately be a policy we reject. One bad bit and you can end up taking the opposite action of the one you should have performed.

One thing I've often wondered about NASA's triple redundancy: what system calculates or determines the consensus? Is it also a programmable computer, just smaller?

Usually the driven element itself. From [0]:

"One reason why the redundancy management software was able to be kept to a minimum is that NASA decided to move voting to the actuators, rather than to do it before commands are sent on buses. Each actuator is quadruple redundant. If a single computer fails, it continues to send commands to an actuator until the crew takes it out of the redundant set. Since the Shuttle's other three computers are sending apparently correct commands to their actuators, the failed computer's commands are physically out-voted79. Theoretically, the only serious possibility is that three computers would fail simultaneously, thus negating the effects of the voting. If that occurs, and if the proper warnings are given, the crew can then engage the backup system simply by pressing a button located on each of the forward rotational hand controllers."

[0]: https://history.nasa.gov/computers/Ch4-4.html#:~:text=Its%20....

Just as a follow up, SpaceX has written a bit about their systems, which follow the same "actuator is the judge" approach: https://space.stackexchange.com/a/9446

> One bad bit and you can end up taking the opposite action of the one you should have performed.

And bit flips do happen



Even more so in space where there's no shielding from cosmic rays whatsoever.

hugely impressed at the skills displayed.
