The rings are made from hard-magnetic ferrite. Basically permanent magnets. To write, a positive or negative current pulse is applied to a write line. This induces a voltage in the read line. The magnitude of the induced voltage depends on whether the write pulse's polarity matches or opposes the ring's remanent field (the stored polarity of the ring). This lets you store one bit. Note how reading and writing are the same operation, except that when reading you always use e.g. a positive current and then re-write the bit immediately afterwards. So writing = single-cycle operation. Reading = destructive and hence two cycles.
This is inefficient though, because for each bit you need two wires. Genius idea: Let's split the write line up into two write lines, each carrying half the write current. Lay out cores in a matrix and you can address individual cores, because only the core where both wires cross will reach the field strength necessary to be re/de-magnetized. The third wire is the read line, which is looped through all cores. This works because only a single bit is read at a time.
The important property for reading a core is that when a core flips from one state to the other, the changing magnetic field will induce a voltage on the sense line threaded through all the cores. But if the core stays the same, you don't get a voltage. Thus, by flipping a core to 0, you can detect if the core was in the 1 state previously. This allows you to read a core, but the read is destructive since the core is now 0.
Another key property of the cores is hysteresis. A small current through a core has no effect on it, but above a threshold the current can flip the core. This is very important because it means that you can put cores in a grid and energize one X line and one Y line. Only where the two lines pass through the same core will the core receive enough current to potentially flip. This coincident-current technique is what made core memory practical. This grid of cores is called a core plane. For a 16-bit computer, you stack up 16 core planes and read or write them in parallel.
Finally, to write a core, you first access it to ensure it is flipped to 0. Next you access it with currents in the opposite direction, flipping it back to 1. But what about the bits in the word that you don't want flipped to 1? The trick is the inhibit line, which passes through all the cores in a plane, in the opposite direction. If you put a current through the inhibit line in a plane at the same time as the write, the currents partially cancel out, the core doesn't flip, and it stays as 0.
Thus, each core typically has 4 wires: X and Y wires in a grid to select the core, a sense line that passes through all the cores in a plane to read the value, and an inhibit line that passes through all the cores in a plane to inhibit the writing of a 1. Some systems combine the sense and inhibit lines into one line, but that increases noise. (You want the sense line to switch directions so noise from the X and Y lines cancels out; that's why it's usually threaded diagonally. But the inhibit line is forced to run in a particular direction so its current can cancel the drive current, so it can't be routed that way.)
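To make the half-select arithmetic concrete, here is a toy C sketch (entirely my own construction, not from any real machine; all names are made up). A core flips only when the sum of the X and Y half-currents reaches the full-select threshold, and the inhibit current subtracts a half-current to block a write:

    #include <stdio.h>

    /* Toy model of a coincident-current core plane (illustrative only). */
    #define HALF      1            /* half-select current, arbitrary units */
    #define THRESHOLD (2 * HALF)   /* full select = X half + Y half */

    static int plane[4][4];        /* 4x4 plane, each cell holds 0 or 1 */

    /* Drive one X line and one Y line; 'inhibit' subtracts a half
     * current, cancelling the write so the addressed core stays 0. */
    static void drive(int x, int y, int value, int inhibit) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                int current = (i == x) * HALF + (j == y) * HALF
                            - (inhibit ? HALF : 0);
                if (current >= THRESHOLD)
                    plane[i][j] = value;   /* only the crossing core flips */
            }
    }

    /* Destructive read: force the core to 0; in real hardware a flip
     * induces a pulse on the sense line, meaning the core held a 1,
     * which must then be rewritten in a second half-cycle. */
    static int read_core(int x, int y) {
        int was = plane[x][y];
        drive(x, y, 0, 0);             /* clear to 0 (the "read") */
        if (was) drive(x, y, 1, 0);    /* rewrite the 1 */
        return was;
    }

    int main(void) {
        drive(2, 3, 1, 0);   /* write a 1 at (2,3) */
        drive(1, 1, 1, 1);   /* inhibited write: (1,1) stays 0 */
        printf("(2,3)=%d (1,1)=%d\n", read_core(2, 3), read_core(1, 1));
        return 0;
    }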
If this isn't more than you ever wanted to know about core, I've written about the core memory in the IBM 1401 in detail here: http://www.righto.com/2015/08/examining-core-memory-module-i...
Basically iron, which can be magnetized or demagnetized just like you can use a magnet to magnetize a steel needle.
The refresh needed by DRAM is also why you can't suspend-to-RAM by merely keeping it powered up; the bits will flip. You need a special self-refresh mode that keeps updating every cell.
While implementing suspend-to-RAM on an embedded system I wanted to test that my self-refresh was working correctly, so I decided to test the null hypothesis: I'd suspend from my ROM code without enabling self-refresh, wait a second or so, wake up and smell the ashes. Except... it seemed to work anyway. The data was still there.
It turns out that while the specified refresh period of DRAM is usually on the order of tens of milliseconds at most, you will still see a relatively low bitflip rate even after several seconds. In hindsight it makes sense: in production you want an error rate of effectively 0% (especially if you don't have ECC) across potentially billions of cells. Most cells can probably hold their value for much, much longer than the refresh period, but you need to cover the full bell curve, and it needs to keep working across the entire temperature and voltage range for all chips.
The refresh circuitry was provided by a video chip and this is most likely where things went wrong, either in the configuration or in the actual wiring.
Ideally this monitoring process would continue during hibernation.
I guess nothing would be possible if power were completely removed, but at least then an attacker can only begin cooling after shutdown, by which point cells have already started being lost.
The system should also balk if restarted within a short period of time but after a rapid temperature drop. This would help prevent non-removal attacks (e.g. rebooting into a minimal OS to pull whatever one could; great for Macs).
We use these features to store small amounts of very sensitive data, key material for example.
As far as I know, the only two systems that use cryptographically meaningful memory encryption are the Xbox 360 and z/Architecture mainframes.
Various secure crypto processors usually have some support for external "encrypted" SRAM, but since this typically involves microcontroller platforms and byte/word-wide accesses to said memory, the implementation in the best case boils down to CTR mode (usually with some secret semi-proprietary algorithm of questionable cryptographic quality) without an IV or any authentication.
I always assumed everything was written to disk and loaded into memory on startup, but apparently not. Everything critical to the machine's operation is stored in this SRAM.
It actually makes a lot of sense now why the manufacturers did it this way. I was really confused why in this day and age I have to change an SRAM battery stashed somewhere deep in the computer inside the control panel.
But, it seems the cartridge itself (and not the Game Boy, which I incorrectly assumed from your post) has a battery which, according to the internet, lasts about 15 years.
Around 8m24s you can see how the CPU pauses while RAM is refreshed on the SNES. You can see that it's basically once a scanline. I found it helpful for visualizing how often RAM has to be refreshed (ballparked).
I just couldn't figure why fairly modern CNC machines used the same technology as old video game cartridges.
It depends on whether they used SRAM at all, or how the SRAM was used. For example, Pokemon was notorious for killing its SRAM battery early because it ran things like a real-time clock. SRAM itself doesn't take a whole lot of juice.
Also don't worry too much about your N64 games; most of them used EEPROM instead of SRAM. Some NES and SNES cartridges used EEPROM too, but it was much more expensive at the time.
If you have any soldering skills, or even if you don't, picking up a few batteries and a soldering iron is dirt cheap, and you could fix your old games pretty easily.
1990; it's the one game I know the year I got it for sure, since it was a promotional giveaway.
The annoying thing is that you need a soldering iron on the games. The CNC machine probably doesn’t.
It doesn't multiplex address and data pins, so there are only 8 pins that need bidirectional conversion.
The TI SN74LVC8T245 has a max time for OH on the 5V pins of 10ns, which is probably good enough.
All the NES games I just tried to check require a security bit I don't have to open, so I couldn't check any of them.
I ordered the right driver bit and amazon suggested CR2032 batteries, which I found amusing.
I don't know your machine, but the old process controllers I worked on a long time ago would store machine state in battery-backed RAM. This was so that the machine could safely recover from a fault or power outage.
A "hard drive unit" probably refers to a contemporary piece of kit that can be purchased new-old-stock for several thousand.
Or someone is selling a $30 kit on eBay that nobody has thought to look for.
Hard to tell, but one thing is certain: the cost of an SD card wouldn't even cause a rounding error in this equation.
Periodic SMIs in theory should only be used for stupid stuff like emulating a PS/2 keyboard and mouse from a USB one, which are safe to disable. But SMIs in general are also used to handle machine checks, access to persistent storage (including UEFI variables), and various RAS (reliability/availability/serviceability) features.
On the other hand I never noticed anything funky with regards to memory latency that would cause me to investigate. This is probably because I would already treat memory like a remote database and try to do as much as possible within L2 and L3.
The budget for the entire transaction (from the moment bytes reached the NIC to the moment bytes left the NIC, as measured by an external switch) was 5us, so 0.1us was below the noise threshold.
You can use something like Lomb-Scargle to get a periodogram without needing the samples to be evenly spaced. This also has the benefit of potentially making the Nyquist limit much softer (depending on how randomly varying the sample interval is).
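If anyone wants to play with this, here is a minimal C sketch of the textbook Lomb-Scargle formula (the slow version, no fast algorithm; the 5 Hz test signal and the sampling jitter are made up for illustration):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PI 3.14159265358979323846

    /* Lomb normalized periodogram at angular frequency w, for
     * unevenly spaced samples (t[i], y[i]). */
    static double lomb(const double *t, const double *y, int n, double w) {
        double mean = 0, var = 0, s2 = 0, c2 = 0;
        for (int i = 0; i < n; i++) mean += y[i];
        mean /= n;
        for (int i = 0; i < n; i++) var += (y[i] - mean) * (y[i] - mean);
        var /= (n - 1);
        /* Phase offset tau makes the result time-shift invariant. */
        for (int i = 0; i < n; i++) { s2 += sin(2*w*t[i]); c2 += cos(2*w*t[i]); }
        double tau = atan2(s2, c2) / (2 * w);
        double yc = 0, ys = 0, cc = 0, ss = 0;
        for (int i = 0; i < n; i++) {
            double c = cos(w * (t[i] - tau)), s = sin(w * (t[i] - tau));
            yc += (y[i] - mean) * c;  cc += c * c;
            ys += (y[i] - mean) * s;  ss += s * s;
        }
        return (yc*yc/cc + ys*ys/ss) / (2 * var);
    }

    int main(void) {
        enum { N = 256 };
        double t[N], y[N];
        srand(42);
        for (int i = 0; i < N; i++) {
            /* ~100 Hz nominal sampling with random jitter */
            t[i] = i * 0.01 + 0.003 * rand() / RAND_MAX;
            y[i] = sin(2 * PI * 5.0 * t[i]);   /* 5 Hz signal */
        }
        for (double f = 1; f <= 10; f += 1)    /* scan 1..10 Hz */
            printf("%4.1f Hz  P = %8.2f\n", f, lomb(t, y, N, 2 * PI * f));
        return 0;
    }

Compile with -lm; the 5 Hz bin should dominate the output despite the uneven spacing.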
IIRC this call has an overhead of a few ns. Isn't that close to, or even higher than, the time it takes to perform the action being measured on every loop? If so, the author is just measuring clock_gettime. This can be confirmed on the author's system by simply calling clock_gettime without the measured action (movntdqa / cache flush) and comparing the results. An alternative approach is to call clock_gettime before and after the loop, not on every iteration, and then take an average.
And in any case on a modern x86 system that call (which is in the VDSO, it's not a syscall) just reads the TSC and scales the result. That's going to be reliable and predictable, and much faster than the DRAM refresh excursions being measured.
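A minimal sketch of the control experiment suggested above: time a whole loop of clock_gettime calls from outside and divide by the iteration count (the loop count and clock choice here are arbitrary):

    #include <stdio.h>
    #include <time.h>

    #define N 10000000

    static double nsec_between(const struct timespec *a, const struct timespec *b) {
        return (b->tv_sec - a->tv_sec) * 1e9 + (b->tv_nsec - a->tv_nsec);
    }

    int main(void) {
        struct timespec start, end, scratch;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < N; i++)
            clock_gettime(CLOCK_MONOTONIC, &scratch);  /* call under test */
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("%.1f ns per clock_gettime call\n", nsec_between(&start, &end) / N);
        return 0;
    }

Adding the measured action (movntdqa / cache flush) inside the loop and subtracting would then isolate its cost.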
How are you getting 60 cycles?
One of the places I measure it is in uarch-bench, where running `uarch-bench --clock-overhead` produces this output:
----- Clock Stats --------
Resolution (ns) Runtime (ns)
Name min/ med/ avg/ max min/ med/ avg/ max
StdClockAdapt<system_clock> 25.0/ 27.0/ 27.0/ 29.0 27.1/ 27.4/ 27.6/ 30.6
StdClockAdapt<steady_clock> 25.0/ 26.0/ 26.9/ 94.0 27.0/ 27.0/ 27.1/ 32.6
StdClockAdapt<high_resolution_clock> 26.0/ 27.0/ 27.0/ 28.0 27.1/ 27.5/ 27.7/ 30.0
GettimeAdapter<CLOCK_REALTIME> 25.0/ 26.0/ 25.7/ 27.0 25.1/ 25.5/ 25.6/ 48.3
GettimeAdapter<CLOCK_REALTIME_COARSE> 0.0/ 0.0/ 0.0/ 0.0 7.2/ 7.3/ 7.3/ 7.3
GettimeAdapter<CLOCK_MONOTONIC> 24.0/ 25.0/ 25.5/ 27.0 24.7/ 24.7/ 24.9/ 27.2
GettimeAdapter<CLOCK_MONOTONIC_COARSE> 0.0/ 0.0/ 0.0/ 0.0 7.0/ 7.2/ 7.2/ 7.3
GettimeAdapter<CLOCK_MONOTONIC_RAW> 355.0/358.0/357.8/361.0 357.4/358.2/358.1/360.5
GettimeAdapter<CLOCK_PROCESS_CPUTIME_ID> 432.0/437.0/436.4/440.0 434.7/436.0/436.2/440.9
GettimeAdapter<CLOCK_THREAD_CPUTIME_ID> 422.0/426.0/426.1/431.0 424.6/427.1/427.2/430.4
GettimeAdapter<CLOCK_BOOTTIME> 363.0/365.0/365.3/368.0 364.2/364.5/364.7/367.7
DumbClock 0.0/ 0.0/ 0.0/ 0.0 0.0/ 0.0/ 0.0/ 0.0
The 7 ns group are those that are implemented just by reading a shared memory location, and don't need any rdtsc call at all. The downside, of course, is that this location is only updated periodically (usually during the scheduler tick), so the resolution is limited.
The 25ish ns group are those that are implemented in the VDSO - they need to do an rdtsc call, which is maybe half the time, and then do some math to turn this into a usable time. Note that CLOCK_REALTIME falls into this group on my system.
The 300+ ns group are those that need a system call. This used to be ~100 ns until the Spectre and Meltdown mitigations happened. Some of these cannot easily be implemented in the VDSO (e.g., those that return process-specific data), and some could be, but simply haven't been.
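You can see the limited resolution of the coarse clocks directly with clock_getres; a quick sketch (the clock list is just my pick of Linux-specific IDs):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>

    /* The COARSE clocks typically report the scheduler-tick granularity
     * (~1-4 ms), matching their cheap shared-memory implementation;
     * the others report 1 ns. */
    int main(void) {
        const struct { clockid_t id; const char *name; } clocks[] = {
            { CLOCK_MONOTONIC,        "CLOCK_MONOTONIC" },
            { CLOCK_MONOTONIC_COARSE, "CLOCK_MONOTONIC_COARSE" },
            { CLOCK_MONOTONIC_RAW,    "CLOCK_MONOTONIC_RAW" },
            { CLOCK_REALTIME_COARSE,  "CLOCK_REALTIME_COARSE" },
        };
        for (unsigned i = 0; i < sizeof clocks / sizeof clocks[0]; i++) {
            struct timespec res;
            if (clock_getres(clocks[i].id, &res) == 0)
                printf("%-24s %ld.%09ld s\n", clocks[i].name,
                       (long)res.tv_sec, res.tv_nsec);
        }
        return 0;
    }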
For what it's worth, I wasn't able to reproduce your results from the SO question. Using your own test program (only modified to print the time per call), running it with no sleep and 10000 loops gives:
$ ./clockt 0 10 10000
init run 15256
trial 0 took 659834 (65 cycles per call)
trial 1 took 659674 (65 cycles per call)
trial 2 took 659578 (65 cycles per call)
trial 3 took 659550 (65 cycles per call)
trial 4 took 659548 (65 cycles per call)
trial 5 took 659556 (65 cycles per call)
trial 6 took 659552 (65 cycles per call)
trial 7 took 659556 (65 cycles per call)
trial 8 took 659546 (65 cycles per call)
trial 9 took 659544 (65 cycles per call)
It implies that the length of the refresh operation is roughly linear in the number of bits being refreshed. Why is it impossible to parallelize this?
> Typically I get ~140ns per loop, periodically the loop duration jumps to ~360ns. Sometimes I get odd readings longer than 3200ns.
What's the cause of the 3200ns+ readings?
The rest of DRAM are tiny capacitors (kind of like batteries) that can only hold a charge for about 64 milliseconds. Furthermore, a SINGLE read will destroy the data. So the DRAM design is to transfer the information to "sense amplifiers" on each read, and then to transfer the information back at the end when the row of data is closed.
Once you understand that DRAM capacitors are so incredibly tiny, RAS, CAS, PRECHARGE, and REFRESH suddenly make a LOT of sense.
* "Row" is all of your sense amplifiers.
* RAS: Transfer "one row" from DRAM into the sense amplifiers.
* CAS: Read from the sense amplifiers
* PRECHARGE: Write the sense-amplifiers back into DRAM. Sense-amplifiers are now empty, and ready to hold a new row.
* Refresh: Sense amplifiers read, and then write back, a row to "refresh" the data, as per the 64-millisecond data-loss issue. According to Micron, all banks must be precharged and idle before a refresh (i.e., after a precharge the sense amplifiers are empty and ready for reading/writing of new data).
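Here's a toy C model of that protocol (my own simplification: one bank, one row of sense amplifiers, made-up sizes) that captures the destructive read and the read-then-write-back refresh:

    #include <stdio.h>
    #include <string.h>

    #define ROWS 8
    #define COLS 16

    static char cells[ROWS][COLS];   /* the capacitor array */
    static char row_buf[COLS];       /* the sense amplifiers */
    static int  open_row = -1;       /* -1 = precharged / idle */

    static void ras(int row) {           /* ACTIVATE: destructive read */
        memcpy(row_buf, cells[row], COLS);
        memset(cells[row], 0, COLS);     /* cell charge is consumed */
        open_row = row;
    }

    static char cas_read(int col)          { return row_buf[col]; }
    static void cas_write(int col, char v) { row_buf[col] = v; }

    static void precharge(void) {        /* write the row buffer back */
        if (open_row >= 0) memcpy(cells[open_row], row_buf, COLS);
        open_row = -1;
    }

    static void refresh(int row) {       /* read + write back = refresh */
        ras(row);
        precharge();
    }

    int main(void) {
        ras(3); cas_write(5, 'x'); precharge();    /* store 'x' at (3,5) */
        for (int r = 0; r < ROWS; r++) refresh(r); /* one refresh pass */
        ras(3); printf("(3,5)=%c\n", cas_read(5)); precharge();
        return 0;
    }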
> Could you put in NM amplifiers to refresh all the cells at once? Sure, but then we would call it SRAM :-)
Indeed. Sense amplifiers are the "part" of DRAM which acts like SRAM. Sense amplifiers do NOT lose data when they are read from. They do NOT need to be refreshed. Etc. etc. Sense amplifiers are effectively the "tiny" SRAM inside of DRAM arrays that makes everything work.
The very point of "DRAM" is to make most of your RAM be these cheap capacitors. So the only solution is to read and write data to the sense amplifiers, as per the protocol.
Such a circuit is stable and doesn't need any refresh or amplification.
Was I mistaken?
Here was my conundrum: a more general term like "active circuit" risked leaving people behind, while a more specific term like "buffer" or "driver" didn't highlight the analogy between SRAM and DRAM. I chose "amplifier" as a compromise, hoping that people who were familiar enough to worry about bistability would be comfortable with the generalized definition, while people who were barely hanging on would miss that detail entirely but still get my point. Sounds like I caught you in the middle. Sorry for the confusion.
One of the mentioned articles touches on this:
> Upon receiving a refresh command, the DRAM chips enter a refresh mode that has been carefully designed to perform the maximum amount of cell refresh in as little time as possible. During this time, the current carrying capabilities of the power delivery network and the charge pumps are stretched to the limit.
I guess it's actually hard to deliver enough power to refresh all the bits at once? Also note that ancient CPUs like the Z80 had the memory-refresh machinery built into the CPU, as opposed to the memory: https://en.wikipedia.org/wiki/Memory_refresh
> What's the cause of the 3200ns+ readings?
No idea. Random noise? Timing interrupt? Some peripheral doing DMA transfer? Kernel context switch? System Management Mode?
Feel encouraged to run the code and try to debug it!
You can't refresh all the bits at once.
You only have something like 256kB worth of sense amplifiers across 2GB of RAM (guesstimates from memory, but the point is that you have far FEWER sense amplifiers than actual RAM). You need a sense amplifier to read RAM safely.
Each time you read from DRAM, it destroys the data. Only a sense amplifier can read data safely, store it for a bit, and then write it back. Since you only have 256kB of sense amplifiers, you have to refresh the data in chunks of 256kB at a time.
Not all sense amplifiers are ganged together: they're actually broken up into 16 banks of sense amplifiers. But for whatever reason, the modern DDR4 spec seems to ask all sense amplifiers to refresh together on a single command. In any case, you could at best get 16x the parallelism (theoretically, since the spec doesn't allow for it) by imagining a bank-specific refresh command.
That's a lot of complexity though, and I'm not even sure you really gain anything from it. It's probably best to just refresh all 16 banks at the same time.
Per-bank auto-refresh could be exploited with an elaborate memory controller algorithm that always tries to prioritize refreshing unused/least-used banks, except it was broken by design and you couldn't control which bank to refresh. Nobody even bothered implementing per-row refresh counters to skip freshly read rows. Rowhammer is a real shitshow exposing sloppy memory engineering in the industry.
Within a single chip, it is impossible by design... not for any physical reason.
The DDR4 spec has 16 banks (organized into 4 bank groups), which could theoretically refresh individually. But that's not how the spec was written: a Refresh command causes all 16 banks to start refreshing at the same time.
However, it is possible to "parallelize" this rather easily: that's why you have TWO sticks of RAM per channel. While one stick of RAM is going through a Refresh cycle, the other stick of RAM is still fully available for use.
My assumption is that it makes for a better protocol to have all banks refresh at the same time. Otherwise, you'd need to send 16x the Refresh commands (one for each of the 16 banks). At that point, most of your messages would be Refresh instead of the RAS/CAS (row-open / column-access) commands needed to read/write data.
If you really want parallelism, get multi-rank RAM or stick more RAM per channel. But even then, if the two sticks of RAM refreshed at the same time, you'd have fewer "Refresh" commands in general. So it still might make more sense for memory controllers to keep sticks of RAM all in sync with regards to Refreshes.
I think you are misreading it. The point is to avoid locking all the memory each time; instead you only lock (via refresh) a small portion of memory each time.
You seem to be talking about banking, I think.
I have a two-core system (ThinkPad T400), which means measure-dram falls apart completely at the `CPU_SET(2, &set);` and exits with a "sched_setaffinity(): Invalid argument" error.
s/2/0/ fixes it completely.
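For anyone hitting the same error, the fix in context looks something like this (a paraphrase of the relevant pinning code, not the actual measure-dram source):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);   /* was CPU_SET(2, &set): CPU 2 doesn't exist
                               on a two-core machine, so the call fails
                               with EINVAL */
        if (sched_setaffinity(0, sizeof set, &set) != 0) {
            perror("sched_setaffinity()");
            return 1;
        }
        puts("pinned to CPU 0");
        return 0;
    }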
Also - `decode-dimms` doesn't work on my system ("Number of SDRAM DIMMs detected and decoded: 0"). `dmidecode -t memory` is fairly informative, however, as is `lshw -C memory`. I'm not sure if both are fishing data out of DMI. I definitely wouldn't mind finding out if my (2x2GB DDR3) responds to I2C from some other program/technique.
These two things aside, this was a very fun article to read. Thanks!
Further reading: http://www.fftw.org/pruned.html
It gives a very clear and concise example of how we can measure things that are happening at the software level via some simple code, so it also serves as a very comprehensible introduction to how exploits like Rowhammer/Meltdown/etc. function.
The C part is rather straightforward. The pinning is helpful, but not really needed. The MTRR was a dead end (but I left the code in since it is fascinating). Disabling ASLR is kinda needed, since ASLR destroys determinism between runs.
With all these tricks I still struggle to get the code running on Kaby Lake. It seems more stable on older CPU generations.
From 2007, but still almost entirely relevant.
1. Banks are now split into Bank Groups. Typically 4 bank groups, with 4 banks per group. That's 16 banks total.
2. While waiting for one bank to finish a command (ex: a RAS command), DDR4 RAM allows you to issue commands quickly to other groups, usually in 2 cycles or less. In effect, Bank Group 0 can work in parallel with Groups 1, 2, and 3.
3. The 4 banks within a bank group can also work in parallel, but at a much slower rate.
4. This allows DDR4 RAM to have up to 16 commands in flight, one per bank, as long as certain rules are followed. DDR3 RAM only had 8 banks. This allows DDR4 to be clocked much higher.
5. DDR4 uses less energy than DDR3.
That's about it actually. So the vast, vast majority of the information in that document remains relevant today.
Have you tried reaching out to them to try and resolve the issue?
The owner of this website (*service.com*) has banned the country or region your IP address is in (JP) from accessing this website.