
Every 7.8μs your computer’s memory has a hiccup - jgrahamc
https://blog.cloudflare.com/every-7-8us-your-computers-memory-has-a-hiccup/
======
blattimwind
> and why each ring has three wires woven through it (and I still don’t
> understand exactly how these work).

The rings are made from hard-magnetic ferrite. Basically permanent magnets. To
write these, a positive or negative current pulse is applied to a write line.
This will induce a voltage in the read line. The magnitude of the induced
voltage depends on whether the write pulse coincides with the remanent field
(the polarity of the ring). This allows you to store one bit. Note how reading
and writing are the same thing, except that when reading you always use e.g. a
positive current and then re-write the bit immediately after. So writing =
single cycle operation. Reading = destructive and hence two cycles.

This is inefficient though, because for each bit you need two wires. Genius
idea: Let's split the write line up into two write lines, each carrying half
the write current. Lay out cores in a matrix and you can address individual
cores, because _only the core where both wires cross_ will reach the field
strength necessary to be re/de-magnetized. The third wire is the read line,
which is looped through all cores. This works because only a single bit is
read at a time.

~~~
kens
Since I've been maintaining core memory recently, let me add a few comments.
Core memory is built from ferrite rings called cores, which can be magnetized
in two different directions. By passing a current through a core one way, it
can be magnetized the matching direction, and passing a current the other way
magnetizes the core the other direction.

The important property for reading a core is that when a core flips from one
state to the other, the changing magnetic field will induce a voltage on the
sense line threaded through all the cores. But if the core stays the same, you
don't get a voltage. Thus, by flipping a core to 0, you can detect if the core
was in the 1 state previously. This allows you to read a core, but the read is
destructive since the core is now 0.

Another key property of the cores is _hysteresis_. A small current through a
core has no effect on it, but above a threshold the current can flip the core.
This is very important because it means that you can put cores in a grid, and
energize one X line and one Y line. Only where the two lines pass through the
same core will the core receive enough current to potentially flip. This
coincident-core technique is what made core memory practical. This grid of
cores is called a core plane. For a 16-bit computer, you stack up 16 core
planes and read or write them in parallel.

Finally, to write a core, you first access it to ensure it is flipped to 0.
Next you access it with currents in the opposite direction, flipping it back
to 1. But what about the bits in the word that you don't want flipped to 1?
The trick is the _inhibit_ line, which passes through all the cores in a
plane, in the opposite direction. If you put a current through the inhibit
line in a plane at the same time as the write, the currents partially cancel
out, and the core doesn't flip, so it stays as 0.

Thus, each core typically has 4 wires: X and Y wires in a grid to select the
core, a sense line that passes through all the cores in a plane to read the
value, and an inhibit line that passes through all the cores in a plane to
inhibit the writing of a 1. Some systems combine the sense and inhibit lines
into one line, but that increases noise. (You want the sense line to switch
directions so noise from the X and Y lines cancels out; that's why it's
usually diagonal. But the inhibit line is forced to run in a particular
direction so that the write current cancels out.)
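The four-wire scheme above can be sketched as a toy simulation in C. This is a sketch only: the threshold model and names (`pulse`, `core_read`, `core_write`) are illustrative, and the inhibit line is approximated by simply skipping the 1-drive when writing a 0.

```c
/* Toy model of a single core plane. Cores hold -1 ("0") or +1 ("1"). */
#define ROWS 4
#define COLS 4

static int plane[ROWS][COLS];

static void plane_init(void) {
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            plane[r][c] = -1; /* all cores start in the 0 state */
}

/* Drive half-currents down one X and one Y line; sign picks the direction.
 * Only the core where both lines cross sees two half-currents, which is
 * enough to flip it. Returns 1 if a core flipped, i.e. a pulse would
 * appear on the sense line. */
static int pulse(int x, int y, int sign) {
    int sensed = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            int half_currents = (r == x) + (c == y);
            if (half_currents == 2 && plane[r][c] != sign) {
                plane[r][c] = sign; /* flux change induces a sense voltage */
                sensed = 1;
            }
        }
    return sensed;
}

/* Read: drive toward 0. A sense pulse means the core held a 1.
 * Destructive: the core is left in the 0 state. */
static int core_read(int x, int y) { return pulse(x, y, -1); }

/* Write: clear to 0 first, then drive toward 1 unless writing a 0
 * (a real plane cancels the write current with the inhibit line instead). */
static void core_write(int x, int y, int bit) {
    core_read(x, y);
    if (bit) pulse(x, y, +1);
}
```

A read of a 1 returns a sense pulse and leaves the core cleared, matching the destructive-read behavior described above.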

If this isn't more than you ever wanted to know about core, I've written about
the core memory in the IBM 1401 in detail here:
[http://www.righto.com/2015/08/examining-core-memory-module-inside.html](http://www.righto.com/2015/08/examining-core-memory-module-inside.html)

~~~
jmiserez
Great explanation and article. Discussion (2015) here:
[https://news.ycombinator.com/item?id=10143700](https://news.ycombinator.com/item?id=10143700)

------
simias
It's worth pointing out that there is still plenty of static RAM in modern
chips, and that obviously doesn't require refreshing. The CPU cache would be
an example, but there are also plenty of internal SRAMs in ASICs and FPGAs
everywhere.

The refresh needed by DRAM is also why you can't suspend-to-RAM by merely
keeping the RAM powered up: the bits will flip. You need a special self-
refresh mode that keeps updating every cell.

While implementing suspend-to-RAM on an embedded system I wanted to test that
my self-refresh was working correctly, so I decided to test the null
hypothesis: I'd suspend my ROM code without enabling auto-refresh, wait a
second or so, wake up and smell the ashes. Except... it seemed to work
anyway. The data was still there.

It turns out that while the refresh period of DRAM is usually a handful of ms
at most, you will still see a relatively low bit-flip rate even after several
seconds. In hindsight it makes sense: in production you want an error rate of
effectively 0% (especially if you don't have ECC) across potentially
_billions_ of cells. Most cells can probably hold their value for much, much
longer than the refresh period but you need to catch the full bell curve and
it needs to keep working across the entire temperature and voltage range for
all chips.

~~~
gleenn
This is relevant to the hack you can do when trying to retrieve secrets from
ram from a live computer: pull the dimms out, blow some cold air on them and
then stick them in another computer that's ready to dump the contents. I've
heard it's a good attack against full-disk encryption because the keys are
sitting in plaintext in RAM. I've also heard it works for up to minutes after
the computer has been turned off.

~~~
voxadam
What you're referring to is called a "cold boot attack".

~~~
Scoundreller
It would be cool if computers were designed to shutdown and clear RAM if they
detected rapid cooling.

Ideally this monitoring process would continue during hibernation.

I guess nothing would be possible if power were completely removed, but at
least then an attacker is dealing with only being able to begin cooling
_after_ shutdown, as cells start being lost.

The system should also balk if restarted within a short period of time after
a rapid temperature reduction. This would help prevent non-removal attacks
(e.g. rebooting with a minimal OS to pull whatever one could -> great for
Macs).

~~~
celeritascelery
Generally you pull the ram before you cool it, so this wouldn’t do much.

~~~
mirimir
Paranoid folk crazy glue their RAM, and embed in alumina-filled epoxy (which
conducts heat well).

~~~
Phlarp
For all the hate the modern macbook design gets, this is one positive of
having soldered components.

------
lmilcin
It is worth pointing out that CPUs have much more pronounced hiccups, like
SMIs (system management interrupts) on Intel ones. These can be turned off,
at the cost of making the server able to actually overheat (SMIs are, for
example, used to check if the CPU is on fire).

~~~
bonzini
Good servers do not need periodic SMIs; stuff like temperature sensors is
handled by the BMC. Of course it depends on the vendor and it can be a pain
to find them, but people using the Linux realtime patch, for example, really
need servers that do not abuse SMIs.

In theory, periodic SMIs should only be used for stupid stuff like emulating
a PS/2 keyboard and mouse from a USB one, which is safe to disable. SMIs are
also used to handle machine checks, access to persistent storage (including
UEFI variables), and various RAS (reliability/availability/serviceability)
features.

~~~
lmilcin
Well, I don't know for a fact everything those interrupts do. What I know is
that I had to turn them off (along with a bunch of other things) to meet
strict realtime guarantees for a proof-of-concept algorithmic trading
framework I did for a brokerage house. This was a few years back, on Haswell,
on the best hardware you could buy, including top-bin Xeons dedicated to
algotrading that were clocked at 5GHz by default (frequency locked, sleep
states turned off, etc).

On the other hand I never noticed anything funky with regards to memory
latency that would cause me to investigate. This is probably because I would
already treat memory like a remote database and try to do as much as possible
within L2 and L3.

The budget for the entire transaction (from the moment bytes reached the NIC
to the moment bytes left the NIC, as measured by an external switch) was 5us,
so 0.1us was below the noise threshold.

~~~
bonzini
Yes, you have to turn the periodic SMI off but it's not a problem with respect
to overheating or detecting memory errors.

------
cyphar
> FFT requires input data to be sampled with a constant sampling interval.

You can use something like Lomb-Scargle to get a periodogram without needing
the samples to be evenly spaced. This also has the benefit of potentially
making the Nyquist limit much softer (depending on how randomly varying the
sample interval is)[1].

[1]: [https://arxiv.org/abs/1508.02717](https://arxiv.org/abs/1508.02717)

------
afarah
>clock_gettime(CLOCK_MONOTONIC, &ts);

IIRC this call has an overhead of a few ns. Isn't that close to or even higher
than the time it takes to perform the action being measured on every loop? If
so the author is just measuring clock_gettime. It can be confirmed on the
author's system by simply calling clock_gettime without calling the measured
action (movntdqa / cache flush) and comparing the results. An alternative
approach is to call clock_gettime before and after the loop, not on every
iteration, and then take an average.

~~~
BeeOnRope
clock_gettime overhead is on the order of ~20 ns (or about 60 cycles) on most
systems implementing VDSO gettime, and is quite stable. The author is trying
to measure something on the order of 75 ns.

~~~
jnordwick
I've actually been measuring clock_gettime(CLOCK_REALTIME) vDSO call lately,
and when called and already hot (ie L1I cache) it is still 350 ticks as
measured by rdtscp. I even had an open stack overflow question on this.

[https://stackoverflow.com/questions/53252050/why-does-the-call-latency-on-clock-gettimeclock-realtime-vary-so-much](https://stackoverflow.com/questions/53252050/why-does-the-call-latency-on-clock-gettimeclock-realtime-vary-so-much)

How are you getting 60 cycles?

~~~
BeeOnRope
I have measured it several times in various places with fairly consistent
results. Of course, if you are on a platform which doesn't offer VDSO for your
clock, or which disables or virtualizes `rdtsc` then the results could be much
longer.

One of the places I measure it is in uarch-bench [1], where running
`uarch-bench --clock-overhead` produces this output:

    
    
        ----- Clock Stats --------
                                                          Resolution (ns)               Runtime (ns)
                               Name                        min/  med/  avg/  max         min/  med/  avg/  max
                         StdClockAdapt<system_clock>      25.0/ 27.0/ 27.0/ 29.0        27.1/ 27.4/ 27.6/ 30.6
                         StdClockAdapt<steady_clock>      25.0/ 26.0/ 26.9/ 94.0        27.0/ 27.0/ 27.1/ 32.6
                StdClockAdapt<high_resolution_clock>      26.0/ 27.0/ 27.0/ 28.0        27.1/ 27.5/ 27.7/ 30.0
                      GettimeAdapter<CLOCK_REALTIME>      25.0/ 26.0/ 25.7/ 27.0        25.1/ 25.5/ 25.6/ 48.3
               GettimeAdapter<CLOCK_REALTIME_COARSE>       0.0/  0.0/  0.0/  0.0         7.2/  7.3/  7.3/  7.3
                     GettimeAdapter<CLOCK_MONOTONIC>      24.0/ 25.0/ 25.5/ 27.0        24.7/ 24.7/ 24.9/ 27.2
              GettimeAdapter<CLOCK_MONOTONIC_COARSE>       0.0/  0.0/  0.0/  0.0         7.0/  7.2/  7.2/  7.3
                 GettimeAdapter<CLOCK_MONOTONIC_RAW>     355.0/358.0/357.8/361.0       357.4/358.2/358.1/360.5
            GettimeAdapter<CLOCK_PROCESS_CPUTIME_ID>     432.0/437.0/436.4/440.0       434.7/436.0/436.2/440.9
             GettimeAdapter<CLOCK_THREAD_CPUTIME_ID>     422.0/426.0/426.1/431.0       424.6/427.1/427.2/430.4
                      GettimeAdapter<CLOCK_BOOTTIME>     363.0/365.0/365.3/368.0       364.2/364.5/364.7/367.7
                                           DumbClock       0.0/  0.0/  0.0/  0.0         0.0/  0.0/  0.0/  0.0
    
    

The Runtime column shows the cost. Ignoring DumbClock (which is a dummy inline
implementation returning constant zero), note that the clocks basically group
themselves into 3 groups: around 7 ns, 25-27 ns and 300-400 ns.

The 7 ns group are those that are implemented just by reading a shared memory
location, and don't need any rdtsc call at all. The downside, of course, is
that this location is only updated periodically (usually during the scheduler
tick), so the resolution is limited.

The 25ish ns group are those that are implemented in the VDSO - they need to
do an rdtsc call, which is maybe half the time, and then do some math to turn
this into a usable time. Note that CLOCK_REALTIME falls into this group on my
system.

The 300+ ns group are those that need a system call. This used to be ~100 ns
until Spectre and Meltdown mitigations happened. Some of these cannot easily
be implemented in VDSO (e.g., those that return process-specific data), and
some could be, but simply haven't.

For what it's worth, I wasn't able to reproduce your results from the SO
question. Using your own test program (only modified to print the time per
call), running it with no sleep and 10000 loops gives:

    
    
        $ ./clockt 0 10 10000
        init run 15256
        trial 0 took 659834 (65 cycles per call)
        trial 1 took 659674 (65 cycles per call)
        trial 2 took 659578 (65 cycles per call)
        trial 3 took 659550 (65 cycles per call)
        trial 4 took 659548 (65 cycles per call)
        trial 5 took 659556 (65 cycles per call)
        trial 6 took 659552 (65 cycles per call)
        trial 7 took 659556 (65 cycles per call)
        trial 8 took 659546 (65 cycles per call)
        trial 9 took 659544 (65 cycles per call)
    

On my 2.6 GHz system, 65 cycles corresponds to 25 ns, so those results are
exactly consistent with the uarch-bench results shown above. So either your
system is weird, or you weren't running enough loops, or ... I'm not sure.

[1] [https://github.com/travisdowns/uarch-bench](https://github.com/travisdowns/uarch-bench)

------
_cs2017_
> Each bit stored in dynamic memory must be refreshed, typically every 64ms
> (called Static Refresh). This is a rather costly operation. To avoid one
> major stall every 64ms, this process is divided into 8192 smaller refresh
> operations.

This implies that the length of a refresh operation is roughly linear in the
number of bits being refreshed. Why is it impossible to parallelize this?

> Typically I get ~140ns per loop, periodically the loop duration jumps to
> ~360ns. Sometimes I get odd readings longer than 3200ns.

What's the cause of the 3200ns+ readings?

~~~
jjoonathan
DRAM beats SRAM at density, power, and cost by pushing the refresh circuitry
out of the N×M memory cells and into the periphery of the N×M cell matrix.
There are only M amplifiers per N×M matrix, so you can only refresh M cells
at a time and must perform N refreshes per refresh interval to catch every
row. Could you put in N×M amplifiers to refresh all the cells at once? Sure,
but then we would call it SRAM :-)
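This row-at-a-time scheme is where the headline number comes from: with the 64 ms window and 8192 refresh operations quoted earlier in the thread, each refresh lands every ~7.8 µs. As arithmetic:

```c
/* Spread `refreshes` row-refresh operations evenly across a retention
 * window of `window_ms` milliseconds; returns the interval in microseconds. */
static double refresh_interval_us(double window_ms, int refreshes) {
    return window_ms * 1000.0 / refreshes;
}
```

64 ms / 8192 = 7.8125 µs, the hiccup interval in the article's title.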

~~~
simias
I'm not an expert in the field but I was under the impression that SRAM worked
completely differently, using a bi-stable transistor circuit and no capacitor,
something like that:
[https://upload.wikimedia.org/wikipedia/commons/a/a5/Transist...](https://upload.wikimedia.org/wikipedia/commons/a/a5/Transistor_Bistable_interactive_animated-
en.svg)

Such a circuit is stable and doesn't need any refresh or amplification.

Was I mistaken?

~~~
jjoonathan
No, I was using "amplifier" in a slightly more general sense to mean "a
circuit that uses power to turn a weakly driven signal into a strongly driven
signal." You are absolutely correct that an SRAM cell would drive a near-zero
signal closer to zero, and that this behavior differs from a linear
amplifier, which would drive a near-zero signal further away from zero.

Here was my conundrum: a more general term like "active circuit" risked
leaving people behind while a more specific term like "buffer" or "driver"
didn't highlight the analogy between SRAM and DRAM. I chose "amplifier" as a
compromise, hoping that people who were familiar enough to worry about
bistability would be comfortable with the generalized definition, while
people who were barely hanging on would miss that detail entirely but still
get my point. Sounds like I caught you in the middle. Sorry for the
confusion.

------
exikyut
Hopefully someone from cloudflare (</keyword-trigger> :) ) notices this - the
`cloudflare-blog` repo has zero issues and I didn't want to ruin that, and I
don't like Disqus, so...

I have a two-core system (ThinkPad T400), which means measure-dram falls
apart completely on the `CPU_SET(2, &set);` and ends with a
"sched_setaffinity([0]): Invalid argument" error.

s/2/0/ fixes it completely.

Also - `decode-dimms` doesn't work on my system ("Number of SDRAM DIMMs
detected and decoded: 0"). `dmidecode -t memory` is fairly informative,
however, as is `lshw -C memory`. I'm not sure if both are fishing data out of
DMI. I definitely wouldn't mind finding out if my RAM (2x2GB DDR3) responds
to I2C via some other program/technique.

These two things aside, this was a very fun article to read. Thanks!

------
emmelaich
Since you know (or suspect) the frequency, aren't there simpler ways to find
the curve than full FFT?

~~~
mattkrause
The Goertzel algorithm is O(N) for a single frequency component, which beats
the FFT’s O(N log N), where N is the length of the data. The FFT gives you
the whole spectrum, though, whereas Goertzel “costs” O(N) each time. This
might matter if the peak is not exactly where you imagine (e.g. 7.8 instead
of 8). Furthermore, there’s been a ton of work put into optimizing FFTs for a
wide range of platforms, special cases, etc., so the FFT might actually run
faster.

Further reading:
[http://www.fftw.org/pruned.html](http://www.fftw.org/pruned.html)

------
JohnBooty
This is a wonderful article!

It gives a very clear and concise example of how we can measure things that
are happening at the software level via some simple code, so it also serves as
a _very_ comprehensible introduction to how exploits like
Rowhammer/Meltdown/etc. function.

------
jackhalford
Very cool to see such a powerful insight from userspace, although there are
quite a few steps before getting to the main loop. I had to google
everything: ASLR, MTRR, frequency scaling. Thank you for commenting it so
well!

~~~
majke
Watch out, there is a bit of cargo cult there. Particularly around the
operations preparing the data for the FFT.

The C part is rather straightforward. The pinning is helpful, but not really
needed. The MTRR was a dead end (but I left the code in since it is
fascinating). Disabling ASLR is kinda needed, since ASLR destroys determinism
between runs.

With all these tricks I still struggle to get the code running on Kaby Lake.
It seems more stable on older CPU generations.

------
userbinator
On the early PCs you could easily control the refresh rate (it was done by the
timer tick), and some applications, notably games, would reprogram it to
refresh less and still get away with it while enjoying a small but possibly
critical boost in performance:

[https://www.reenigne.org/blog/how-to-get-away-with-disabling-dram-refresh/](https://www.reenigne.org/blog/how-to-get-away-with-disabling-dram-refresh/)

------
quotemstr
Obligatory: What Every Programmer Should Know About Memory

From 2007, but still almost entirely relevant.

[https://lwn.net/Articles/250967/](https://lwn.net/Articles/250967/)

~~~
dragontamer
DDR4 btw has a few major differences:

1. Banks are now split into bank groups. Typically 4 bank groups, with 4
banks per group; that's 16 banks total.

2. While waiting for one bank to finish a command (ex: a RAS command), DDR4
RAM allows you to issue commands quickly to other groups, usually in 2 cycles
or less. In effect, Bank Group 0 can work in parallel with Group 1, Group 2,
and Group 3.

3. The 4 banks within Bank Group 0 can also work in parallel, but at a much
slower rate.

4. This allows DDR4 RAM to issue 16 commands in parallel, one to each bank,
as long as certain rules are followed. DDR3 RAM only had 8 banks. This allows
DDR4 to be clocked much higher.

5. DDR4 uses less energy than DDR3.

-------

That's about it actually. So the vast, vast majority of the information in
that document remains relevant today.

------
phkahler
Does it help if you have an APU (a processor with integrated graphics)? Since
scanout needs to happen at a regular rate, every ~16ms, does that allow reads
along with refresh, in effect hiding or negating some of this?

~~~
ajenner
Modern graphics systems are too flexible for this - you can reprogram the
graphics memory layout in such a way that would break the refreshing. But this
technique was used on early PCs (with CGA graphics) and many of the 80s
microcomputers.

------
throwaway77790
Curious to know where the 1 billion ns/s came from. No idea how I would have
determined to use that in the FFT code.

~~~
theunamedguy
It's just a conversion factor to make the units come out nicely. He could have
easily done T = 1/f to give the refresh interval in seconds, but with such a
short duration, it helps to convert it to nanoseconds (1e9 ns = 1 s).
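As a one-liner sketch of that conversion:

```c
/* Period in nanoseconds of a frequency in Hz: T = 1/f, scaled by 1e9 ns/s. */
static double period_ns(double freq_hz) {
    return 1e9 / freq_hz;
}
```

For example, a spectral peak at 128 kHz maps back to a period of 7812.5 ns, i.e. the ~7.8 µs refresh interval.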

~~~
throwaway77790
Oh, I see now. Thanks a lot. Makes sense. 1 billion ns / 1 s.

------
edoceo
Maybe this is why CloudFlare thinks my San Jose, CA IPs are in San Jose, JP
and blocks my requests?

~~~
Xorlev
It's a blog post about something more or less unrelated to CloudFlare that
just happens to be on their blog. This comment is lacking in any real
substance other than snark.

Have you tried reaching out to them to try and resolve the issue?

~~~
edoceo
I had to reach out to the vendor who is hosting their services behind
CloudFlare. They confirmed my IPs were blocked due to a GeoRule, the error
looks like this on my end:

    
    
      The owner of this website (*service.com*) has banned the country or region your IP address is in (JP) from accessing this website.
    

This vendor had to white-list me in CloudFlare. I've also reached out to my
hosting provider about this oddity. Prior to this hiccup yesterday, I've been
using these IPs with the service provider for many months w/o issue.

e: formatting

~~~
yashap
Most companies I know of use the MaxMind DB for geo-IP lookups. It would be
interesting to check whether your IP(s) are being mapped to Japan in the
recent version of MaxMind's DB.

------
that_lurker
I scare my computer by putting my mouse over the Windows Update button

