Hacker News
Every 7.8μs your computer’s memory has a hiccup (cloudflare.com)
570 points by jgrahamc on Nov 23, 2018 | 94 comments

> and why each ring has three wires woven through it (and I still don’t understand exactly how these work).

The rings are made from hard-magnetic ferrite. Basically permanent magnets. To write these, a positive or negative current pulse is applied to a write line. This will induce a voltage in the read line. The magnitude of the induced voltage depends on whether the write pulse coincides with the remanent field (the polarity of the ring). This allows you to store one bit. Note how reading and writing are the same thing, except that when reading you always use e.g. a positive current and then re-write the bit immediately after. So writing = single cycle operation. Reading = destructive and hence two cycles.

This is inefficient though, because for each bit you need two wires. Genius idea: Let's split the write line up into two write lines, each carrying half the write current. Lay out cores in a matrix and you can address individual cores, because only the core where both wires cross will reach the field strength necessary to be re/de-magnetized. The third wire is the read line, which is looped through all cores. This works because only a single bit is read at a time.

Since I've been maintaining core memory recently, let me add a few comments. Core memory is built from ferrite rings called cores, which can be magnetized in two different directions. By passing a current through a core one way, it can be magnetized the matching direction, and passing a current the other way magnetizes the core the other direction.

The important property for reading a core is that when a core flips from one state to the other, the changing magnetic field will induce a voltage on the sense line threaded through all the cores. But if the core stays the same, you don't get a voltage. Thus, by flipping a core to 0, you can detect if the core was in the 1 state previously. This allows you to read a core, but the read is destructive since the core is now 0.

Another key property of the cores is hysteresis. A small current through a core has no effect on it, but above a threshold the current can flip the core. This is very important because it means that you can put cores in a grid, and energize one X line and one Y line. Only where the two lines pass through the same core will the core receive enough current to potentially flip. This coincident-core technique is what made core memory practical. This grid of cores is called a core plane. For a 16-bit computer, you stack up 16 core planes and read or write them in parallel.

Finally, to write a core, you first access it to ensure it is flipped to 0. Next you access it with currents in the opposite direction, flipping it back to 1. But what about the bits in the word that you don't want flipped to 1? The trick is the inhibit line, which passes through all the cores in a plane, in the opposite direction. If you put a current through the inhibit line in a plane at the same time as the write, the currents partially cancel out, and the core doesn't flip, so it stays as 0.

Thus, each core typically has 4 wires: X and Y wires in a grid to select the core, a sense line that passes through all the cores in a plane to read the value, and an inhibit line that passes through all the cores in a plane to inhibit the writing of a 1. Some systems combine the sense and inhibit lines into one line, but that increases noise. (You want the sense line to switch directions so noise from the X and Y lines cancels out (that's why it's usually diagonal). But the inhibit line is forced to run a particular direction so the current cancels out.)
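To make the scheme above concrete, here is a minimal Python sketch of coincident-current selection, destructive reads, and the inhibit line. The names (`CorePlane`, `CoreMemory`) and the current and threshold values are made up for illustration; real cores are analog devices and this only models the logic.

```python
HALF = 0.5        # current on one selected X or Y line (arbitrary units)
THRESHOLD = 0.75  # assumed switching threshold of a core


class CorePlane:
    """One bit-plane: a grid of cores threaded by X, Y, sense and inhibit wires."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.cores = [[0] * cols for _ in range(rows)]

    def pulse(self, x, y, direction, inhibit=False):
        """Drive lines x and y; return True if any core flipped (a sense pulse)."""
        flipped = False
        for r in range(self.rows):
            for c in range(self.cols):
                current = direction * HALF * ((r == x) + (c == y))
                if inhibit:
                    current -= direction * HALF  # inhibit opposes the drive current
                if abs(current) <= THRESHOLD:
                    continue                     # hysteresis: too weak to flip
                target = 1 if current > 0 else 0
                if self.cores[r][c] != target:
                    self.cores[r][c] = target
                    flipped = True
        return flipped


class CoreMemory:
    """A stack of planes, one per bit of the word, accessed in parallel."""

    def __init__(self, rows, cols, bits):
        self.planes = [CorePlane(rows, cols) for _ in range(bits)]

    def _set_ones(self, x, y, word):
        # Write pulse on every plane; inhibit the planes whose bit must stay 0.
        for plane, bit in zip(self.planes, word):
            plane.pulse(x, y, +1, inhibit=(bit == 0))

    def write(self, x, y, word):
        for plane in self.planes:
            plane.pulse(x, y, -1)                # clear the word to 0 first
        self._set_ones(x, y, word)

    def read(self, x, y):
        # Destructive read: force each core to 0 and watch the sense line.
        word = [int(plane.pulse(x, y, -1)) for plane in self.planes]
        self._set_ones(x, y, word)               # restore what was just erased
        return word


mem = CoreMemory(rows=4, cols=4, bits=4)
mem.write(1, 2, [1, 0, 1, 1])
print(mem.read(1, 2))
```

Half-selected cores (same X or same Y line only) see 0.5 units, which stays under the threshold, so they keep their state.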

If this isn't more than you ever wanted to know about core, I've written about the core memory in the IBM 1401 in detail here: http://www.righto.com/2015/08/examining-core-memory-module-i...

Great explanation and article. Discussion (2015) here: https://news.ycombinator.com/item?id=10143700

> Basically permanent magnets.

Basically iron, which can be magnetized or demagnetized just like you can use a magnet to magnetize a steel needle.

It's worth pointing out that there is still plenty of static RAM in modern chips, and it obviously doesn't require refreshing. The CPU cache would be an example, but there are also plenty of internal SRAMs in ASICs and FPGAs everywhere.

The refresh needed by DRAM is also why you can't suspend-to-RAM by merely keeping it powered up: the bits will flip. You need a special self-refresh mode that keeps updating every cell.

While implementing suspend-to-RAM on an embedded system I wanted to test that my self-refresh was working correctly, so I decided to test the null hypothesis: I'd suspend my ROM code without enabling auto-refresh, wait a second or so, wake up and smell the ashes. Except... it seemed to work anyway. The data was still here.

It turns out that while the refresh period of DRAM is usually a handful of ms at most, you will still have a relatively low bit-flip rate even after several seconds. In hindsight it makes sense: in production you want an error rate of effectively 0% (especially if you don't have ECC) across potentially billions of cells. Most cells can probably hold their value for much, much longer than the refresh period but you need to catch the full bell curve and it needs to keep working across the entire temperature and voltage range for all chips.

I have a few war stories about this from back in the day when I worked in hardware design: we made a chip that was supposed to have embedded SRAM but the cell design vs process was a bit off and it turned out to leak charge slowly, making it a DRAM. No problem: we modified the software to hit every row every so often, which in effect implemented refresh. Also the converse problem: if you have a bug in your hardware that results in refresh being disabled it can be hard to detect because, as noted above, cells typically have a retention time on the order of tens of seconds. Add in the effect of programs running, which causes accesses that as a side effect refresh the row targeted, and it can take a long time before you realize there is no refresh. The symptoms can be subtle and confusing, because for the most part memory _is_ being refreshed, just not in the way you expect.

I think you have just finally solved a problem with a piece of hardware that I was asked to work on decades ago. Never could figure out exactly what caused the problem but reading this I suspect you are correct in that that particular system had a broken refresh design. It worked most of the time and then would crash in the weirdest ways, and it did contain 64K of dynamic RAM, I never thought to check the refresh circuitry. I was tasked with debugging the software end of things, and you know the proverb about programmers faced with a hardware problem. Extra silly of me because I really should have started with checking the hardware first, the assumption that the problem was software related clearly may not have been a valid one.

The refresh circuitry was provided by a video chip and this is most likely where things went wrong, either in the configuration or in the actual wiring.

Thank you!

This is relevant to the hack you can do when trying to retrieve secrets from RAM from a live computer: pull the DIMMs out, blow some cold air on them and then stick them in another computer that's ready to dump the contents. I've heard it's a good attack against full-drive encryption because the keys are sitting in plaintext in RAM. I've also heard it works for up to minutes after the computer has been turned off.

What you're referring to is called a "cold boot attack".

Cold boot attack usually refers to reading the DIMMs from the machine they are installed in, by rebooting it. The GP described a slightly different thing: physically removing the RAM modules and dumping their contents with a different host.

It would be cool if computers were designed to shutdown and clear RAM if they detected rapid cooling.

Ideally this monitoring process would continue during hibernation.

I guess there wouldn’t be anything possible if power was completely removed, but at least then an attacker is dealing with only being able to begin cooling after shutdown and cells start being lost.

The system should also balk if restarted within a short period of time but after a rapid temperature reduction. This would help prevent non-removal attacks (e.g. rebooting with a minimal OS to pull whatever one could -> great for MACs).

Generally you pull the ram before you cool it, so this wouldn’t do much.

Paranoid folk crazy glue their RAM, and embed it in alumina-filled epoxy (which conducts heat well).

For all the hate the modern macbook design gets, this is one positive of having soldered components.

i was about to poo poo this idea but yeah, for non-removal this would be good

There is progress towards encrypting memory, although there remain limitations, complexity and performance-penalty challenges: https://lwn.net/Articles/752683/

Gaming consoles have had encrypted RAM for more than a decade.

You say "gaming consoles", but really I think I've only seen it in the Xbox 360. Was it deployed anywhere else that I missed?

PS4 and XB One both have "secret" memory areas which I think are encrypted.

We use these features to store small amounts of very sensitive data, key material for example.

Didn’t the Wii do something similar? Also it’s not just “gaming consoles”: https://www.cl.cam.ac.uk/~rja14/tamper.html

Wii does not do RAM encryption (which is somewhat understandable given the fact that the RAM is mostly embedded 1T-SRAM that is not directly accessible anyway).

As far as I know the only two systems that use cryptographically meaningful memory encryption are Xbox 360 and zArchitecture mainframes.

Various secure crypto processors usually have some support for external "encrypted" SRAM, but as this typically involves microcontroller platforms and byte/word-wide accesses to said memory the implementation in the best case boils down to CTR mode (usually of some secret semi-proprietary algorithm of questionable cryptographic qualities) without IV or any authentication.

Oh wow. Thank you. I've been wondering all week why the CNC machines at my work store their data and info on battery powered SRAM after getting a warning that I'll lose everything if I don't change said battery in a couple weeks.

I always assumed everything was written to disk and loaded into memory on startup but apparently not. Everything critical to the machine running is stored on this SRAM.

It actually makes a lot of sense now why the manufacturers did it this way. I was really confused why in this day and age I have to change an SRAM battery stashed somewhere deep in the computer inside the control panel.

That's also how saves worked on some old consoles like the Game Boy. No flash, you have a SRAM with a battery, when the battery runs out you lose your save.

That can't be right, I thought. Because it didn't match up with my childhood memories.

But, it seems the cartridge itself (and not the gameboy which I incorrectly assumed from your post) has a battery which according to internet[1] lasts about 15 years.


Right. The SRAM was also on the cartridge. Nothing was saved to the GameBoy itself.

Also, in terms of old video games, this is a visualization of cpu instructions per scan line. https://www.youtube.com/watch?v=Q8ph2OVqZeM

Around 8m24s you can see how the CPU pauses while RAM is refreshed on the SNES. You can see that it’s basically once a scanline. I found it helpful to visualize how often RAM has to be refreshed (ballparked).

Ya I knew that. I learned that the hard way after my SNES games started erasing the save data when I powered off the system. My NES was fairly dead by that point so I missed those. Then I learned my N64 games would be next.... :(

I just couldn't figure why fairly modern CNC machines used the same technology as old video game cartridges.

When did they start erasing your save data? I have a few SNES titles that still save correctly.

It depends on if they used SRAM at all, or how the SRAM was used. For example, Pokemon was notorious for killing its SRAM battery early because of running things like a real-time clock. SRAM itself doesn't take a whole lot of juice.

Also don't worry too much about your n64 games, most of them used EEPROM instead of SRAM. Some NES and SNES cartridges used EEPROM too, but it was much more expensive at the time.

If you have any soldering skills, or even if you don't, picking up a few batteries and a soldering iron is dirt cheap and you could fix your old games pretty easily.

I’ve heard that Nintendo’s standard was that the battery should last ten years, but I don’t know how true that is. They definitely last much longer. I would imagine that the chemistry in the battery would have a problem (e.g. corrosion) before it runs out of energy.

My free with Nintendo power dragon warrior cartridge[1] still saves. In fact of about 20 NES carts with save games, only 1 has needed a battery so far. I expect most to die soon though.

[1] 1990; it's the one game I know the year I got it for sure, since it was a promotional giveaway.

There are methods to hot-swap the batteries without losing your saves.

The annoying thing is that you need a soldering iron on the games. The CNC machine probably doesn’t.

I always thought it would be a fun project to replace the SRAM with MRAM, but MRAM is all 3.3V or less and GB and NES SRAMs were typically 5V.

Three diodes in series should take care of that, and if you find you need to step up the output voltage it should be possible to either put some fast buffers in between or, alternatively, to replace whatever buffers there are already on the board with ones that are more tolerant to voltage differences on their inputs. It remains to be seen how much the MRAM will like the 5V inputs; that is something the datasheet should be able to clear up.

whatever it is would have to be bidirectional, since the same pins are used for input and output. They probably make a level converter designed specifically for sram, but I haven't had time to look for one.

A 245 or something like that, of course the speed requirements will be critical. It will be a tricky modification to do properly if you want reliability. In the end it might not be worth it.

For gameboy, the pokemon cartridge I checked uses a Hynix GM76C256CLFW, which is 85ns for address acquisition and 45ns for output-enable to valid data. Tightest timing is it requires only 5ns for ~OE being asserted to the outputs being high-Z.

It doesn't multiplex address and data pins, so there are only 8 pins that need bidirectional conversion.

The TI SN74LVC8T245 has a max time for OH on the 5V pins of 10ns, which is probably good enough.

All the NES games I just tried to check require a security bit I don't have to open, so I couldn't check any of them.


I ordered the right driver bit and amazon suggested CR2032 batteries, which I found amusing.

Cool, curious how that will work out.

> I always assumed everything was written to disk and loaded into memory on startup but apparently not. Everything critical to the machine running is stored on this SRAM.

I don't know your machine but the old process controllers I worked on a long time ago would store machine state in battery backed RAM. This was so that the machine could safely recover from a fault or power outage.

This is because a hard drive unit would cost too much for the CNC machine.

Really? A SD card is a few dollars.

Reverse engineering the 30 year old CNC and making a proprietary-SRAM-card emulator probably costs several hundred thousand (consulting overhead, a few engineer months of effort, downtime risk). It would likely be cheaper to simply replace the machine.

A "hard drive unit" probably refers to a contemporary piece of kit that can be purchased new-old-stock for several thousand.

Or someone is selling a $30 kit on eBay that nobody has thought to look for.

Hard to tell, but one thing is certain: the cost of an SD card wouldn't even cause a rounding error in this equation.

I can do it for $1K by overcharging you by x3. For starters SRAM is not proprietary :), and the reason we used to use SRAM for settings storage was always the price of durable component.

Sure, just pop it open, read the SRAM engraving, RTFM, maybe probe around a bit, whack together an arduino sketch, and wire up some glue circuitry. Now why didn't I think of that?

That's interesting, makes me want to see the graph of power vs cost for extra error correcting data but with longer refresh cycles on the memory (for that suspend-to-RAM application).

It is worth pointing out that CPUs have much more pronounced hiccups, like SMIs (system management interrupts) on Intel ones. These can be turned off at the cost of making the server able to actually overheat (SMIs are for example used to check if the CPU is on fire).

Good servers do not need periodic SMIs; stuff like temperature sensors is handled by the BMC. Of course it depends on the benches and it can be a pain to find them, but people using the Linux realtime patch, for example, really need servers that do not abuse SMIs.

Periodic SMIs in theory should only be used for stupid stuff like emulating a PS/2 keyboard and mouse from a USB one, which are safe to disable. Beyond that, SMIs are used to handle machine checks, access to persistent storage (including UEFI variables), and various RAS (reliability/availability/serviceability) features.

Well, I don't know for a fact everything those interrupts do. What I know is that I had to turn those off (along with a bunch of other things) to meet strict realtime guarantees for my proof-of-concept algorithmic trading framework I did for a brokerage house. This was a few years back on Haswell on the best hardware you could buy, including top bin Xeons dedicated for algotrading that were clocking 5GHz by default (frequency locked, sleep states turned off, etc).

On the other hand I never noticed anything funky with regards to memory latency that would cause me to investigate. This is probably because I would already treat memory like a remote database and try to do as much as possible within L2 and L3.

The budget for entire transaction (from the moment bytes reached NIC to the moment bytes left NIC as measured by external switch) was 5us so 0.1us was below noise threshold.

Yes, you have to turn the periodic SMI off but it's not a problem with respect to overheating or detecting memory errors.

> FFT requires input data to be sampled with a constant sampling interval.

You can use something like Lomb-Scargle to get a periodogram without needing it to be evenly spaced. This also has the benefit of potentially making the Nyquist limit much softer (depending on how randomly-variating the sample interval is)[1].
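For anyone who wants to try this, below is a minimal pure-Python sketch of the classic (unnormalized) Lomb-Scargle periodogram. The 3 Hz test signal, the random sample times and the frequency grid are all made-up assumptions for the demo; a library implementation would be the better choice for real data.

```python
import math
import random


def lomb_scargle(times, values, freqs_hz):
    """Unnormalized Lomb-Scargle power at each frequency, for uneven sampling."""
    mean = sum(values) / len(values)
    y = [v - mean for v in values]
    powers = []
    for f in freqs_hz:
        w = 2 * math.pi * f
        # The offset tau makes the sine and cosine terms orthogonal.
        tau = math.atan2(sum(math.sin(2 * w * t) for t in times),
                         sum(math.cos(2 * w * t) for t in times)) / (2 * w)
        c = [math.cos(w * (t - tau)) for t in times]
        s = [math.sin(w * (t - tau)) for t in times]
        yc = sum(a * b for a, b in zip(y, c))
        ys = sum(a * b for a, b in zip(y, s))
        powers.append(0.5 * (yc * yc / sum(a * a for a in c)
                             + ys * ys / sum(a * a for a in s)))
    return powers


rng = random.Random(0)
times = sorted(rng.uniform(0.0, 10.0) for _ in range(200))  # irregular sample times
values = [math.sin(2 * math.pi * 3.0 * t) for t in times]   # hypothetical 3 Hz signal
freqs = [0.5 + 0.1 * k for k in range(96)]                  # 0.5 Hz .. 10.0 Hz grid
powers = lomb_scargle(times, values, freqs)
peak_hz = freqs[powers.index(max(powers))]
print(f"strongest component near {peak_hz:.1f} Hz")
```

No resampling or interpolation onto a regular grid is needed, which is the point of the method.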

[1]: https://arxiv.org/abs/1508.02717

>clock_gettime(CLOCK_MONOTONIC, &ts);

IIRC this call has an overhead of a few ns. Isn't that close to or even higher than the time it takes to perform the action being measured on every loop? If so the author is just measuring clock_gettime. It can be confirmed on the author's system by simply calling clock_gettime without calling the measured action (movntdqa / cache flush) and comparing the results. An alternative approach is to call clock_gettime before and after the loop, not on every iteration, and then take an average.
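The check suggested here (read the clock back to back with nothing in between) can be sketched like this. It is Python rather than the article's C, so the figures include interpreter overhead and will be far higher than the raw cost of `clock_gettime`; the function name is made up for illustration.

```python
import statistics
import time


def clock_overhead_ns(samples=100_000):
    """Return (min, median) gap in ns between back-to-back clock reads."""
    gaps = []
    prev = time.perf_counter_ns()
    for _ in range(samples):
        now = time.perf_counter_ns()
        gaps.append(now - prev)   # nothing is measured except the clock itself
        prev = now
    return min(gaps), statistics.median(gaps)


lo, med = clock_overhead_ns()
print(f"min {lo} ns, median {med} ns per read (includes interpreter overhead)")
```

If the gap with the measured action added is not clearly larger than this baseline, the loop is mostly timing the clock.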

So long as the overhead is predictable (which is an interesting assumption) then it'll just contribute noise and not affect the frequency analysis.

Taking an average is the wrong technique here. The author is trying to measure variance between equivalent code due to hardware events. I mean, on average, DRAM access is very fast! But sometimes it's not, and that's what the article is about.
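A small synthetic example of why the average hides this. The numbers are invented for illustration (a 140 ns loop with a 360 ns excursion every 56th iteration, roughly one per 7.8 μs), not measurements:

```python
import statistics

# Synthetic values, not measurements: 56 iterations * 140 ns is about 7.8 us.
BASE_NS, SPIKE_NS, PERIOD = 140, 360, 56

durations = [SPIKE_NS if i % PERIOD == 0 else BASE_NS for i in range(5600)]

mean_ns = statistics.mean(durations)
spikes = [i for i, d in enumerate(durations) if d > 2 * BASE_NS]
gaps = {b - a for a, b in zip(spikes, spikes[1:])}

print(f"mean {mean_ns:.1f} ns, max {max(durations)} ns, spike spacing {gaps}")
```

The mean moves by only a few nanoseconds, while the per-iteration series shows both the size of the excursions and their regular spacing.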

And in any case on a modern x86 system that call (which is in the VDSO, it's not a syscall) just reads the TSC and scales the result. That's going to be reliable and predictable, and much faster than the DRAM refresh excursions being measured.

clock_gettime overhead is on the order of ~20 ns (or about 60 cycles) on most systems implementing VDSO gettime, and is quite stable. The author is trying to measure something on the order of 75 ns.

I've actually been measuring clock_gettime(CLOCK_REALTIME) vDSO call lately, and when called and already hot (ie L1I cache) it is still 350 ticks as measured by rdtscp. I even had an open stack overflow question on this.


How are you getting 60 cycles?

I have measured it several times in various places with fairly consistent results. Of course, if you are on a platform which doesn't offer VDSO for your clock, or which disables or virtualizes `rdtsc` then the results could be much longer.

One of the places I measure it is in uarch-bench [1], where running `uarch-bench --clock-overhead` produces this output:

    ----- Clock Stats --------
                                                      Resolution (ns)               Runtime (ns)
                           Name                        min/  med/  avg/  max         min/  med/  avg/  max
                     StdClockAdapt<system_clock>      25.0/ 27.0/ 27.0/ 29.0        27.1/ 27.4/ 27.6/ 30.6
                     StdClockAdapt<steady_clock>      25.0/ 26.0/ 26.9/ 94.0        27.0/ 27.0/ 27.1/ 32.6
            StdClockAdapt<high_resolution_clock>      26.0/ 27.0/ 27.0/ 28.0        27.1/ 27.5/ 27.7/ 30.0
                  GettimeAdapter<CLOCK_REALTIME>      25.0/ 26.0/ 25.7/ 27.0        25.1/ 25.5/ 25.6/ 48.3
           GettimeAdapter<CLOCK_REALTIME_COARSE>       0.0/  0.0/  0.0/  0.0         7.2/  7.3/  7.3/  7.3
                 GettimeAdapter<CLOCK_MONOTONIC>      24.0/ 25.0/ 25.5/ 27.0        24.7/ 24.7/ 24.9/ 27.2
          GettimeAdapter<CLOCK_MONOTONIC_COARSE>       0.0/  0.0/  0.0/  0.0         7.0/  7.2/  7.2/  7.3
             GettimeAdapter<CLOCK_MONOTONIC_RAW>     355.0/358.0/357.8/361.0       357.4/358.2/358.1/360.5
        GettimeAdapter<CLOCK_PROCESS_CPUTIME_ID>     432.0/437.0/436.4/440.0       434.7/436.0/436.2/440.9
         GettimeAdapter<CLOCK_THREAD_CPUTIME_ID>     422.0/426.0/426.1/431.0       424.6/427.1/427.2/430.4
                  GettimeAdapter<CLOCK_BOOTTIME>     363.0/365.0/365.3/368.0       364.2/364.5/364.7/367.7
                                       DumbClock       0.0/  0.0/  0.0/  0.0         0.0/  0.0/  0.0/  0.0

The Runtime column shows the cost. Ignoring DumbClock (which is a dummy inline implementation returning constant zero), note that the clocks basically group themselves into 3 groups: around 7 ns, 25-27 ns and 300-400 ns.

The 7 ns group are those that are implemented just by reading a shared memory location, and don't need any rdtsc call at all. The downside, of course, is that this location is only updated periodically (usually during the scheduler tick), so the resolution is limited.

The 25ish ns group are those that are implemented in the VDSO - they need to do an rdtsc call, which is maybe half the time, and then do some math to turn this into a usable time. Note that CLOCK_REALTIME falls into this group on my system.

The 300+ ns group are those that need a system call. This used to be ~100 ns until Spectre and Meltdown mitigations happened. Some of these cannot easily be implemented in VDSO (e.g., those that return process-specific data), and some could be, but simply haven't.

For what it's worth, I wasn't able to reproduce your results from the SO question. Using your own test program (only modified to print the time per call), running it with no sleep and 10000 loops gives:

    $ ./clockt 0 10 10000
    init run 15256
    trial 0 took 659834 (65 cycles per call)
    trial 1 took 659674 (65 cycles per call)
    trial 2 took 659578 (65 cycles per call)
    trial 3 took 659550 (65 cycles per call)
    trial 4 took 659548 (65 cycles per call)
    trial 5 took 659556 (65 cycles per call)
    trial 6 took 659552 (65 cycles per call)
    trial 7 took 659556 (65 cycles per call)
    trial 8 took 659546 (65 cycles per call)
    trial 9 took 659544 (65 cycles per call)

On my 2.6 GHz system, 65 cycles corresponds to 25 ns, so those results are exactly consistent with the uarch-bench results shown above. So either your system is weird, or you weren't running enough loops, or ... I'm not sure.
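The conversion, spelled out as a quick sketch (values taken from the output above; the variable names are mine):

```python
cycles_total = 659_544   # last trial from the output above
calls = 10_000           # loop count passed to the test program
cpu_ghz = 2.6            # clock speed stated above; GHz is cycles per nanosecond

cycles_per_call = cycles_total // calls   # integer division, as the tool prints it
ns_per_call = cycles_per_call / cpu_ghz

print(f"{cycles_per_call} cycles per call is about {ns_per_call:.1f} ns")
```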

[1] https://github.com/travisdowns/uarch-bench

> Each bit stored in dynamic memory must be refreshed, typically every 64ms (called Static Refresh). This is a rather costly operation. To avoid one major stall every 64ms, this process is divided into 8192 smaller refresh operations.
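The 7.8 μs in the title falls straight out of the two numbers in that quote; a quick check:

```python
refresh_window_ms = 64      # every cell must be refreshed within this window
refresh_operations = 8192   # the window is split into this many smaller refreshes

interval_us = refresh_window_ms * 1000 / refresh_operations
print(f"one refresh operation every {interval_us} us")
```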

It implies that the length of refresh operation is roughly linear in the number of bits being refreshed. Why is it impossible to parallelize this?

> Typically I get ~140ns per loop, periodically the loop duration jumps to ~360ns. Sometimes I get odd readings longer than 3200ns.

What's the cause of the 3200ns+ readings?

DRAM beats SRAM at density, power, and cost by pushing the refresh circuitry out of the N×M memory cells and into the periphery of the N×M cell matrix. There are only M amplifiers per N×M matrix, so you can only refresh M cells at a time and must perform N refreshes per refresh interval to catch every row. Could you put in N×M amplifiers to refresh all the cells at once? Sure, but then we would call it SRAM :-)

This is a very good comment, but I think your joke needs to be explained for most people out there. "Sense Amplifiers" are the part of "DRAM" which can perpetually hold a charge.

The rest of DRAM are tiny capacitors (kinda like batteries) that can only hold a charge for 64-milliseconds. Furthermore, a SINGLE read will destroy the data. So the DRAM design is to transfer the information to "sense amplifiers" each read, and then to transfer the information back at the end when the "row of data is closed".

Once you understand that DRAM capacitors are so incredibly tiny, RAS, CAS, PRECHARGE, and REFRESH suddenly make a LOT of sense.

* "Row" is all of your sense amplifiers.

* RAS: Transfer "one row" from DRAM into the sense amplifiers.

* CAS: Read from the sense amplifiers

* PRECHARGE: Write the sense-amplifiers back into DRAM. Sense-amplifiers are now empty, and ready to hold a new row.

* Refresh: Sense-amplifiers read, and then write, a row to "refresh" the data, as per the 64-millisecond data-loss issue. According to Micron, all Sense Amplifiers must be in the idle state (i.e. after a Pre-charge, when they are empty and ready for reading / writing of new data).
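The command list above can be sketched as a toy model. This follows the simplified description in this comment (cells are drained on activate and written back on precharge), not actual device timing, and the class and method names are invented for illustration.

```python
class ToyDramBank:
    """Toy bank: capacitor cells are drained by a read; one row of sense amps."""

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]
        self.sense_amps = None   # None means precharged: empty and idle
        self.open_row = None

    def activate(self, row):     # RAS: move one row into the sense amplifiers
        assert self.open_row is None, "precharge before opening another row"
        self.sense_amps = list(self.cells[row])
        self.cells[row] = [None] * len(self.sense_amps)  # reading drained the cells
        self.open_row = row

    def read(self, col):         # CAS: served from the sense amplifiers
        return self.sense_amps[col]

    def write(self, col, bit):   # CAS: also lands in the sense amplifiers
        self.sense_amps[col] = bit

    def precharge(self):         # write the row back and empty the amplifiers
        self.cells[self.open_row] = self.sense_amps
        self.sense_amps = None
        self.open_row = None

    def refresh(self, row):      # read a row and immediately write it back
        self.activate(row)
        self.precharge()


bank = ToyDramBank(rows=4, cols=8)
bank.activate(2)
bank.write(3, 1)
bank.precharge()
bank.refresh(2)
print(bank.cells[2])
```

Because there is only one row of sense amplifiers, `refresh` can handle one row per call, which is why refreshing the whole bank takes one operation per row.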

> Could you put in N×M amplifiers to refresh all the cells at once? Sure, but then we would call it SRAM :-)

Indeed. Sense Amplifiers are the "part" of DRAM which act like SRAM. Sense Amplifiers do NOT lose data when they are read from. They do NOT need to be refreshed. Etc. etc. Sense Amplifiers are effectively, the "tiny" SRAM inside of DRAM arrays that makes everything work.

The very point of "DRAM" is to make most of your RAM be these cheap capacitors. So the only solution is to read and write data to the sense amplifiers, as per the protocol.

I'm not an expert in the field but I was under the impression that SRAM worked completely differently, using a bi-stable transistor circuit and no capacitor, something like that: https://upload.wikimedia.org/wikipedia/commons/a/a5/Transist...

Such a circuit is stable and doesn't need any refresh or amplification.

Was I mistaken?

No, I was using "amplifier" in a slightly more general sense to mean "a circuit that uses power to turn a weakly driven signal into a strongly driven signal." You are absolutely correct that a SRAM cell would drive a near zero signal closer to zero and that this behavior differs from a linear amplifier which would drive a near zero signal further away from zero.

Here was my conundrum: a more general term like "active circuit" risked leaving people behind while a more specific term like "buffer" or "driver" didn't highlight the analogy between SRAM and DRAM. I chose "amplifier" as a compromise, hoping that people who were familiar enough to worry about bistability would be comfortable with the generalized definition while people who were barely hanging on would miss that detail entirely but still get my point. Sounds like I caught you in the middle. Sorry for the confusion.

Technically yes, since the gate (control input) of a field effect transistor is functionally a capacitor, with the source-drain connection acting as a very crude sense amplifier. (You can actually observe this with some discrete MOSFETs, by attaching them to a breadboard in series with a LED and tapping the gate line against VCC or ground to turn them on or off.)

> Why is it impossible to parallelize this?

One of the mentioned articles touches on this: http://utaharch.blogspot.com/2013/11/a-dram-refresh-tutorial...

> Upon receiving a refresh command, the DRAM chips enter a refresh mode that has been carefully designed to perform the maximum amount of cell refresh in as little time as possible. During this time, the current carrying capabilities of the power delivery network and the charge pumps are stretched to the limit.

I guess: it's actually hard to deliver power to refresh all the bits at once? Also note that ancient cpus like Z80 had the memory refresh machinery built into CPU as opposed to memory https://en.wikipedia.org/wiki/Memory_refresh

> What's the cause of the 3200ns+ readings?

No idea. Random noise? Timing interrupt? Some peripheral doing DMA transfer? Kernel context switch? System Management Mode?


Feel encouraged to run the code and try to debug it!

> I guess: it's actually hard to deliver power to refresh all the bits at once?

You can't refresh all the bits at once.

You only have something like 256kB worth of sense amplifiers across 2GB of RAM (Guesstimates from my memory: but the point is that you have much much FEWER sense-amplifiers than actual RAM). You need a sense amplifier to read RAM safely.

Each time you read from DRAM, it destroys the data. Only a sense amplifier can read data safely, store it for a bit, and then write it back. Since you only have 256kB of sense amplifiers, you have to refresh the data in chunks of 256kB at a time.
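Taking those guesstimates at face value (they are rough figures from memory, not datasheet values), the number of chunks works out like this:

```python
total_bytes = 2 * 1024**3      # 2 GB of DRAM (rough figure from above)
sense_amp_bytes = 256 * 1024   # 256 kB of sense amplifiers (also a rough figure)

chunks = total_bytes // sense_amp_bytes
print(f"{chunks} chunks must each be refreshed once per refresh window")
```

With these particular numbers the result happens to equal the 8192 refresh operations quoted from the article.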

Not all sense amplifiers are "gang'ed up" together: they're actually broken up into 16-banks of sense amplifiers. But for whatever reason, modern DDR4 spec seems to ask all sense amplifiers to refresh together in a single command. In any case, you can at best, get 16x the parallelism (theoretically: since the spec doesn't allow for this) by imagining a "bank-specific refresh command".

That's a lot of complexity though, and I'm not even sure if you really gain anything from it. It's probably best to just refresh all 16-banks at the same time.

> modern DDR4 spec seems to ask all sense amplifiers to refresh together in a single command

Per-bank auto-refresh could be exploited with an elaborate memory controller algorithm trying to always prioritize refreshing unused/least used banks, except it was broken by design and you couldn't control which bank to refresh. Nobody even bothered implementing per-row refresh counters to skip freshly read rows. Rowhammer is a real shitshow exposing sloppy memory engineering in the industry.

I don’t know modern RAM chip organization, but in the old days the bit cells were laid out in a square and an entire row would refresh in parallel. A 1Mb chip would then refresh 1Kb at a time, taking 1K refresh cycles for the entire chip.

This is correct, and, the size of a dram "page" (which is not the same thing as a tlb page) has scaled up as memory chips have gotten larger.

This is the correct answer.

> Why is it impossible to parallelize this?

Within a single chip, it is impossible by design... not by any physical nature.

The DDR4 spec has 16-banks (organized into 4-bank groups), which could theoretically refresh individually. But that's not how the spec was written: a Refresh command will cause all 16-banks to start refreshing at the same time.

However, it is possible to "parallelize" this rather easily: that's why you have TWO sticks of RAM per channel. While one stick of RAM is going through a Refresh cycle, the other stick of RAM is still fully available for use.

My assumption is that it is a better protocol to tell all banks to refresh at the same time. Otherwise, you'd need to send 16x the Refresh commands (one for each of the 16 banks). At that point, most of your messages would be "Refresh" instead of the RAS / CAS (open-row or open-column) commands needed to read/write data.

If you really want parallelism, get multi-rank RAM or stick more RAM per channel. But even then, if the two sticks of RAM refreshed at the same time, you'd have fewer "Refresh" commands in general. So it still might make more sense for memory controllers to keep sticks of RAM all in sync with regards to Refreshes.

> It implies that the length of refresh operation is roughly linear in the number of bits being refreshed.

I think you are misreading it. The point is to avoid locking all the memory each time; instead, you only lock (via refresh) a small portion of memory each time.

It absolutely is linear with the number of rows being refreshed. When you send your dram chip a refresh command it's effectively being read a bunch of times. You cannot use that dram chip during that period.

You seem to be talking about banking, I think.

Maybe one of the reasons it's hard to parallelize is that the refresh operation requires electricity, and there's obviously a limit on how much of that it can use.

It's because of the grid layout where the sense amps are shared between rows. Thus you can fundamentally only refresh one row at a time. Wikipedia has a good page on DRAM refresh.

Well, you can't really read from multiple addresses simultaneously.

Well, you can parallelize it by adding more sticks of RAM.

Maybe the kernel timer tick?

Hopefully someone from cloudflare (</keyword-trigger> :) ) notices this - the `cloudflare-blog` repo has zero issues and I didn't want to ruin that, and I don't like Disqus, so...

I have a two-core system (ThinkPad T400), which means measure-dram is completely falling apart on the `CPU_SET(2, &set);` and ends with a "sched_setaffinity([0]): Invalid argument" error.

s/2/0/ fixes it completely.
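A more portable alternative to hard-coding a CPU number is to ask the kernel which CPUs the process may run on and fall back to one of those. The sketch below uses Python's Linux-only `os.sched_getaffinity` / `os.sched_setaffinity` rather than the C calls in measure-dram. The helper names `pick_cpu` and `pin_current_process` are hypothetical.

```python
import os

def pick_cpu(preferred: int, allowed: set[int]) -> int:
    """Return `preferred` if it is usable, otherwise the lowest allowed CPU."""
    if not allowed:
        raise ValueError("no CPUs available")
    return preferred if preferred in allowed else min(allowed)

def pin_current_process(preferred: int) -> int:
    """Pin the calling process to one CPU and return the CPU chosen.

    Linux-only: os.sched_getaffinity / os.sched_setaffinity do not exist
    on every platform.
    """
    allowed = os.sched_getaffinity(0)   # CPUs this process may run on
    cpu = pick_cpu(preferred, allowed)
    os.sched_setaffinity(0, {cpu})      # pid 0 means "the calling process"
    return cpu
```

On a two-core machine whose CPUs are numbered 0 and 1, `pin_current_process(2)` would fall back to CPU 0 instead of failing with "Invalid argument".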

Also - `decode-dimms` doesn't work on my system ("Number of SDRAM DIMMs detected and decoded: 0"). `dmidecode -t memory` is fairly informative, however, as is `lshw -C memory`. I'm not sure if both are fishing data out of DMI. I definitely wouldn't mind finding out if my (2x2GB DDR3) responds to I2C from some other program/technique.

These two things aside, this was a very fun article to read. Thanks!

Since you know (or suspect) the frequency, aren't there simpler ways to find the curve than full FFT?

The Goertzel algorithm is O(N) for a single frequency component, which beats the FFT’s O(N log N), where N is the length of the data. The FFT gives you the whole spectrum though, whereas Goertzel “costs” O(N) for each frequency you check. This might matter if the peak is not exactly where you imagine (e.g. 7.8 instead of 8). Furthermore, there’s been a ton of work put into optimizing FFTs for a wide range of platforms, special cases, etc., so the FFT might actually run faster.
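For the curious, a minimal Goertzel sketch in plain Python. It is not the article's code. The function name `goertzel_power`, the 1 kHz sample rate, and the 50 Hz test tone are all made up for illustration.

```python
import math

def goertzel_power(samples, sample_rate, target_freq):
    """Squared magnitude of the DFT bin nearest `target_freq`, in O(N)."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)  # nearest integer bin index
    omega = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(omega)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        # Second-order recurrence: one multiply and two adds per sample.
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

# Synthetic test signal: a unit-amplitude 50 Hz sine sampled at 1 kHz for 1 s.
sample_rate = 1000
signal = [math.sin(2 * math.pi * 50 * i / sample_rate) for i in range(1000)]

on_peak = goertzel_power(signal, sample_rate, 50)    # large: the tone is here
off_peak = goertzel_power(signal, sample_rate, 120)  # close to zero
print(on_peak > 1000 * off_peak)  # True
```

Each call only tells you about one bin, so scanning a range of candidate frequencies around the expected peak costs one O(N) pass per candidate.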

Further reading: http://www.fftw.org/pruned.html

This is a wonderful article!

It gives a very clear and concise example of how we can measure things that are happening at the software level via some simple code, so it also serves as a very comprehensible introduction to how exploits like Rowhammer/Meltdown/etc. function.

Very cool to see such a powerful insight from userspace, although there are quite a few steps before the main loop. I had to google everything: ASLR, MTRR, frequency scaling. Thank you for commenting it so well!

Watch out, there is a bit of cargo cult there. Particularly around the operations preparing the data for the FFT.

The C part is rather straightforward. The pinning is helpful, but not really needed. The MTRR was a dead end (but I left the code since it is fascinating). Disabling ASLR is kinda needed, since ASLR destroys determinism between runs.

With all these tricks I still struggle to get the code running on Kaby Lake. It seems more stable on older CPU generations.

On the early PCs you could easily control the refresh rate (it was done by the timer tick), and some applications, notably games, would reprogram it to refresh less and still get away with it while enjoying a small but possibly critical boost in performance:


Obligatory: What Every Programmer Should Know About Memory

From 2007, but still almost entirely relevant.


DDR4 btw has a few major differences:

1. Banks are now split into Bank Groups. Typically 4 bank groups, with 4 banks per group. That's 16 banks total.

2. While waiting for one bank to finish a command (ex: a RAS command), DDR4 RAM allows you to issue commands quickly to other groups, usually in 2 cycles or less. In effect, Bank Group 0 can work in parallel with Group 1, Group 2, and Group 3.

3. The 4 banks within Bank Group 0 can work in parallel, but at a much slower rate.

4. This allows DDR4 RAM to issue 16 commands in parallel, one to each bank, as long as certain rules are followed. DDR3 RAM only had 8 banks. This allows DDR4 to be clocked much higher.

5. DDR4 uses less energy than DDR3.


That's about it actually. So the vast, vast majority of the information in that document remains relevant today.

Does it help if you have an APU? A processor with integrated graphics. Since scanout needs to happen at a regular rate every 16ms, does that allow reads along with refresh? In effect hiding or negating some of this?

Modern graphics systems are too flexible for this - you can reprogram the graphics memory layout in such a way that would break the refreshing. But this technique was used on early PCs (with CGA graphics) and many of the 80s microcomputers.

Curious to know where the 1 billion ns/s came from. No idea how I would have known to use that in the FFT code.

It's just a conversion factor to make the units come out nicely. He could have easily done T = 1/f to give the refresh interval in seconds, but with such a short duration, it helps to convert it to nanoseconds (1e9 ns = 1 s).
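Spelled out as a tiny Python sketch: the name `period_ns` and the 128 kHz example frequency are assumptions chosen so the numbers come out round. They are not values taken from the article's FFT code.

```python
NS_PER_SECOND = 1e9  # 1 s = 1e9 ns

def period_ns(freq_hz: float) -> float:
    """Convert a frequency in Hz to its period in nanoseconds (T = 1/f)."""
    return NS_PER_SECOND / freq_hz

print(period_ns(128_000))  # 7812.5
```

7812.5 ns is 7.8125 µs, so a peak near 128 kHz in the spectrum would correspond to the roughly 7.8 µs interval in the title.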

Oh, I see now. Thanks a lot. Makes sense. 1bil ns/1s

Maybe this is why CloudFlare thinks my San Jose, CA IPs are in San Jose, JP and blocks my requests?

It's a blog post about something more or less unrelated to CloudFlare that just happens to be on their blog. This comment is lacking in any real substance other than snark.

Have you tried reaching out to them to try and resolve the issue?

I had to reach out to the vendor who is hosting their services behind CloudFlare. They confirmed my IPs were blocked due to a GeoRule, the error looks like this on my end:

  The owner of this website (*service.com*) has banned the country or region your IP address is in (JP) from accessing this website.

This vendor had to white-list me in CloudFlare. I've also reached out to my hosting provider about this oddity. Prior to this hiccup yesterday, I've been using these IPs with the service provider for many months w/o issue.

e: formatting

Most companies I know of use the maxmind db for geo ip lookups. Would be interesting to check if your ip(s) are being mapped to Japan in the recent version of maxmind’s db.

I scare my computer by putting my mouse over the Windows Update button
