There was a great TR from IBM describing memory system design for one of their PowerPC chips. Summary: Do your board layout, follow all these rules [a big list], then plan to spend six months in a lab twiddling transmission-line parameters and re-doing layout until you're sure it works...
Analog is a black art.
Long story short: Spansion changed from gold to copper feeder wires inside their memory chips, and in every so many chips this was causing bit-flip problems.
Our end product is in the automotive space, averaging 300,000 units per year. It's a non-safety-related component.
Hiding complexity behind abstractions is what allows us to build complex things.
For a small example, look at the simplest SR latch circuit: it can go metastable because it feeds back into itself.
On Ubuntu 14.04, run this to bring in all the dependencies for building: sudo apt-get build-dep memtest86+
Update: just finished running the test on my cheap Lenovo laptop. Not affected. phew! :)
Previous discussion https://news.ycombinator.com/item?id=8713411
It's interesting to see this issue addressed in 2015. In 1980 I worked at Alpha Microsystems and designed a memory chip test program which used translation tables based upon information we required chip manufacturers to give us in order for their chips to be used in the systems we sold.
That approach required us to put only one type of memory chip on a memory board. But back in the day microsystems were expensive, and customers expected them to be well-tested.
Back in mid-1979, just before I went to college, I looked at the alternatives available at the time and picked Alpha Micro as the company to go with for a system to computerize a group of doctors' offices. It worked very well, as did another one a few years later, used to help systematize a company providing satellite TV gear.
1. Note that chip-kill/Extended ECC/Advanced ECC/Chipspare, which are all similar server-vendor methods for 4-bit correction, will prevent this problem. These methods are enabled on the higher-reliability server systems.
2. This failure mode has been known to the DRAM industry for a couple of years now, and the newest DRAM parts being produced have this problem solved. The exact solution varies by DRAM vendor. I wish I could go into specifics, but I am unaware of any vendor that has stated its fix publicly.
How new exactly? The newest tested in the paper is from July 2014 and that still has the problem.
It doesn't fix systems deployed right now, and could be used for attacking hypervisors and other multi-tenant systems. It might also make for an interesting class of local privilege-escalation attacks to try probabilistically on otherwise correctly secured systems.
I know there are other parameters of the memory controller that could be changed to cause corruption, e.g. reducing the refresh rate or tweaking the timings, but that is likely to yield random corruptions in normal use instead of this precise one.
To me, the real solution seems to be to stop making DRAM with such high-density processes until design changes can be made that restore the old reliability. At some point it just stops behaving like real memory and turns into a crude approximation of it; memory should be reliable and store the data it holds, without any corruption, regardless of access pattern.
A good first check for security companies: examine all known attacks for fence instructions, which are rare. (Without a fence instruction, hammering on the same addresses will just cycle the caches and never go out to DRAM.) Look at the code near them for a hammering loop.
This is a promising attack, because it might be able to break through a virtual machine boundary.
A test for this should be shipped with major Linux distros, and run during install. When someone like Amazon, Rackspace, or Google sends back a few thousand machines as rejects, this will get fixed.
(Fences guarantee only memory ordering, and are typically implemented by flushing to cache, not to RAM.)
Note that for the test to do row hammering effectively, it must pick two addresses that are in different rows but in the same bank. A good way of doing that is just to pick random pairs of addresses. If your machine has 16 banks of DRAM, for example (as various machines I've tested do), there should be a 1/16 chance that the two addresses are in the same bank. This is what the test above does. (Actually, it picks >2 addresses to hammer per iteration.)
Be careful about running the test, because on machines that are susceptible to rowhammer, it could cause bit flips that crash the machine (or worse, bit flips in data that gets written back to disc).
High-speed DRAM reads influence nearby cells. Reproduced on 110 of 129 memory modules after 139K reads, on both Intel and AMD systems; about 1 in 1.7K cells affected.
Second: While the problem looks real enough, the tests to demonstrate it are not realistic. Hammering the same rows with consecutive reads does not happen in the real world because of caches, which they get around via flushes. I'd like to see more data on how bad the abuse needs to be to cause the problem. Will 2 reads in a row cause errors? 5? 10? 100? They never address how likely this is to be a real-world problem. I don't doubt that it is, but how often?
Secondly, the DRAM makers don't currently provide enough information to reliably know what neighbors to refresh. I suppose they could have used their guesses to test on the FPGA rig but given the rest of the paper I'm reasonably satisfied that they have correctly identified the problem and that their solution would work.
I can see "exploit resistant memory" being a selling point, maybe soon.
Want to wrap your head around another challenging aspect of high-speed design? Power distribution system (PDS) design. You can design perfect boards based on solid transmission-line and RF theory and still have them fail to work, due to issues such as frequency-dependent impedances and resonance in the PDS.
You're essentially saying that producing DRAM that doesn't work like memory should, regardless of access pattern, is excusable because it's "commodity"?
If the attack can only be reproduced by custom hardware, why should anyone care?
Also, precise patterns of access to DRAM would require disabling the L1 and L2 caches. Doesn't that sort of thing require privileged instructions?
With caching in place, memory accesses are indirect. You have to be able to reproduce the attack using only patterns of cache line loads and spills.
They evict cache lines using the CLFLUSH x86 instruction, which I believe is unprivileged.
There aren't that many DRAM manufacturers in the world. Micron, Samsung, Elpida, and Hynix are all excellent bets.
Edit: just checking; it was at 120, and it looks like it got artificially promoted to 25, then sank quickly.
How does that happen?
I think it's a good thing, even if it's somewhat arbitrary and has multiple false positives and false negatives. I've always been a fan of stochastic processes; they exhibit lots of good behavior and comparatively few pathological flaws.
And as regards finding and promoting the sort of submissions they think deserve attention but have been missed, I think that's a good thing.
But I've seen other "communities" where the owners have taken a policy of "... just letting the chaos of the community sort itself out," and in my experience they have quickly descended into cesspits. HN is doing significantly better, and I believe that is due, at least in part, to the "aggressive moderation."
So with respect, overall, I disagree.
I guess my feeling is that, with a site structured around karma and upvotes, it feels messy and complicated to add the footnote "*but sometimes completely changed by moderators."
I use sites like this to get something closer to the firehose, not something hyper curated. Give it to me warts and all, at least I'll know what I'm getting then.
I don't spend that much time here and I don't have much idea what the staff is doing or why. But I don't think your sentiment makes any sense. That's like saying "Let a city of 100k+ people just sort it out, without governance." It wouldn't be a city if that happened. You can argue about what kind of governance is best, but there has to be some kind of governance. Even "Burning Man" began instituting a kind of city planning once it got big enough.
Our plan is to turn as much moderation over to the community as possible; getting there is going to require additional mechanisms.
HN has one axiom: it's a site for stories that gratify intellectual curiosity. All the rest is details. The existing mechanisms—upvoting and /newest—are insufficient for the community to sort out which stories are best by that standard. The trouble with upvoting is that stories that gratify intellectual curiosity regularly attract fewer upvotes than stories that are interesting for other reasons. The trouble with /newest is that not enough users will sift through it to find the best submissions. That's easy to understand: it's tedious. Wading through hundreds of posts to find perhaps 3% that gratify intellectual curiosity, does not gratify intellectual curiosity! The very reason people come to HN is a reason not to want to do that, and so we have a tragedy of the commons.
So if upvoting and /newest are suboptimal for HN, what should we do? In the long run, the answer hopefully is to have a new system that lets the community take care of it. But in the short run, before we know what the new system should be, our answer is (a) to experiment, and (b) to do things manually before trying to systematize anything.
Over the last several months, we've tried various things. For example, we tried randomly placing a story from /newest on the front page. We didn't roll that out to everybody, but we did for enough users to make it clear that it wasn't going anywhere. The median randomly selected story is too poor in quality for this to work.
Of our experiments, the one that has produced by far the best results (to judge by receipt of upvotes, non-receipt of flags, and comments about the article being good) is reviewing the story stream for high-quality submissions and occasionally lobbing one or two of them onto the bottom of the front page. That's the experiment ColinWright was referring to. The idea is, from the pool of stories that would otherwise fall through the cracks, for humans to pick candidates for second exposure. These get a randomized shot at the front page long enough for the community to decide their fate. Most fall away quickly, but some get taken up, and so the HN front page sees more high-quality stories. It's important to realize that this is a supplement to the ordinary mechanism of users upvoting stories from /newest. That works the same as always.
This is not a permanent system—just an experiment to gain information—and one lesson we've drawn from it is that moderators should not be doing all the reviewing. We'd prefer it to be community-driven, and anyway there are too many stories and too few of us to look at them all. At the same time, the results have been so salutary that I feel obliged to continue doing it until we can replace it with something better.
What system might we build so the community can do this work and we don't have to? If upvoting and /newest can't do it, what could?
HN already has a mechanism for mitigating the problems with upvoting: flagging. Where upvoting works worse than one would expect, flagging works better. (We do have to compensate for bad flagging, but surprisingly little.) But flagging only helps weed out inappropriate stories; it does nothing to help the best stories surface. So one thought is that we need a mechanism similar to flagging, but positive rather than negative.
As for /newest, if the problem is that there's no incentive to sift through it, we need to make doing so rewarding. HN already has a reward mechanism: karma. Perhaps users who put in the work of sifting through new stories could be rewarded in karma.
Put these two thoughts together and the idea that emerges is of a story-reviewing mechanism, similar to flagging but focused on identifying good stories, where any user who puts in the effort and does a good job is rewarded in karma for service to the community.
The challenge is in defining a good job. It can't be something you could write a computer program to do—short of writing a program to identify all the best stories, of course, in which case you deserve all the karma. If a story eventually gets lots of upvotes (and few flags), that would be one way of scoring a good review. But there need to be other ways, because there are many more good stories than slots on the front page.
And with that you pretty much have a core dump of our current thinking on story curation—subject to change as new ideas emerge.
1. HN could use a new way of earning karma anyhow. A common criticism, which I think has merit, is that the current system is a rich-get-richer affair where most gains accrue to a clique of old-timers and there's little chance for anyone new to catch up.
I'm on this site a fair amount and don't feel super informed about the ongoing plans (was surprised to hear about the artificial movement of stories, for instance). Any details you could provide there would be really helpful, and I'm sure the community would appreciate knowing what's in store. Maybe even a dedicated page for news about the hn platform/algo and experiments.
> The existing mechanisms—upvoting and /newest—are insufficient for the community to sort out which stories are best by that standard. The trouble with upvoting is that stories that gratify intellectual curiosity regularly attract fewer upvotes than stories that are interesting for other reasons.
I think this fact is indisputable, but the question is which scales better. Large communities with heavy moderation have frequently descended into corruption quickly. Again, I think communicating your plans/experiments would be really helpful (and also gratifying to intellectual curiosity!)
Also, you'd be surprised at how much the information in discussions like this makes its way into circulation. People do find this stuff, and I think it's more fun to run across it this way.
This is undoubtedly true, but perhaps mitigated by the fact that the marginal value of a single karma point is far greater for a user with a current score of 30 as compared to a user with a current score of 30,000.