
That’s assuming the basic assumptions hold. Google, for example, doesn’t really care much if individual searches are slightly less accurate, so low data center temperatures should be less important for them than for, say, a financial institution. https://www.cnet.com/news/google-computer-memory-flakier-tha...

Facebook’s hardware designs might generalize well, but perhaps they don’t.




I...I don't think the article you're quoting supports your conclusions. It concludes that running (slightly) too hot doesn't meaningfully impact error rates, and that even if you run at the recommended temperature, memory errors will happen anyway.

So the conclusion is that, whether you're a financial institution or Google, you need ECC, at which point running too hot doesn't have any appreciable effect anyway.


“Previous research, such as some data from a 300-computer cluster, showed that memory modules had correctable error rates of 200 to 5,000 failures per billion hours of operation. Google, though, found the rate much higher: 25,000 to 75,000 failures per billion hours.”

First, it’s old data, but Google having observed higher error rates suggests different choices. Location, power supplies, motherboard design, etc. can all play a role. Further, they should be optimizing for slightly different things. Finally, they got the temperature data by running their actual production system at those temperatures, while assuming it would cause even more problems.
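
For a rough sense of what those quoted rates mean, here’s a back-of-the-envelope conversion to errors per year of continuous operation. This is only a sketch: the CNET quote doesn’t say what unit the rates are normalized to (per DIMM, per module, or per Mbit), so treating them as per-device below is an illustrative assumption.

    # Convert "failures per billion hours" (FIT) into expected correctable
    # errors per device per year of continuous operation.
    # Assumption: the quoted rates are per device; the quote doesn't say.
    HOURS_PER_YEAR = 24 * 365  # 8,760 hours

    def errors_per_year(fit):
        return fit * HOURS_PER_YEAR / 1e9

    for label, fit in [("prior studies, low", 200), ("prior studies, high", 5000),
                       ("Google, low", 25000), ("Google, high", 75000)]:
        print(label, round(errors_per_year(fit), 3))
    # prior studies: ~0.002 to ~0.044 errors/year
    # Google:        ~0.219 to ~0.657 errors/year (roughly 15x the prior high end)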


> but Google having observed higher error rates suggests different choices.

Alternatively, it suggests better data from a larger, real-er world study.

>Finally, they got the temperature data by running their actual production system at those temperatures, while assuming it would cause even more problems.

I'm not sure what you're saying here. To be able to detect the effect of temp, they ran at both higher and lower temps and compared.


Better data is unlikely to be the case, as these are simply reports of ECC faults. Every large data center can collect this information equally easily. The ranges are simply for location- or time-specific variations, not some uncertainty metric. You can read other articles about their hardware choices and why that may be an issue.


> Every large data center can collect this information equally easily.

Right, but they aren't published in controlled studies, usually. The article mentions that prior to the Google study in question, the next best example used 300 machines.

So yes, compared to contemporary (and since then!) public studies of ECC faults, I'd say that the Google study is pretty darn authoritative. You're welcome to cite other recent examples to the contrary, though (with DDR1 and early DDR2 RAM, of course; modern sticks fault less).

> The ranges are simply for location- or time-specific variations, not some uncertainty metric.

I'm not sure what you're talking about. Sections 5.2 and 5.3 of the paper are about the effect of temperature and utilization. They show that temperature has a negligible effect when controlling for utilization, while the reverse is untrue.


I disagree that 300 computers was simply not enough data to be a useful benchmark. But you can also find studies using more RAM if you go looking.

Supercomputers of that era had far more than 300 nodes, and you can find studies of their memory error rates. The Roadrunner supercomputer, for example, had 19,440 compute nodes with 4 GB of RAM each.

PS: The architecture was odd, with 6,480 Opteron processors and 12,960 Cell processors plus 216 System x3755 I/O nodes, but it was still using commodity RAM.
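
Taking those figures at face value, that’s a lot of memory under observation (a quick sketch, using only the numbers above):

    # Back-of-the-envelope: total RAM in the Roadrunner example above.
    nodes = 19440
    gb_per_node = 4
    total_gb = nodes * gb_per_node
    print(total_gb, "GB, roughly", round(total_gb / 1024), "TiB")
    # 77,760 GB, i.e. ~76 TiB -- far more DRAM than a 300-machine cluster.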


But, as was mentioned, it didn't have long-term RAM reliability numbers published.


Just one example among many, for the Jaguar supercomputer with 18,866 nodes using DDR2: https://arch.cs.utah.edu/arch-rd-club/dram-errors.pdf. That’s 250,000 correctable memory errors per month, so quite a bit of data to work with.
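
As a rough per-node figure, here’s a sketch using just those two numbers; in practice memory errors tend to cluster on a minority of modules, so a flat average is only illustrative:

    # Rough average error rate per node from the Jaguar figures above.
    errors_per_month = 250000
    nodes = 18866
    per_node = errors_per_month / nodes
    print(round(per_node, 1), "correctable errors/node/month")   # ~13.3
    print(round(per_node * 12), "per node per year")             # ~159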

This is simply a low-effort type of paper on an important issue, so there are plenty of them out there.

PS: If you want to compare, here is one for DDR3: https://www.cs.virginia.edu/~gurumurthi/papers/asplos15.pdf



