
DDR4 SDRAM – Initialization, Training and Calibration - ivank
https://www.systemverilog.io/ddr4-initialization-and-calibration
======
rstuart4133
As the others have said, an excellent read if you want to understand how the
metal works now. These paragraphs blew me away:

> if you program the CAS Write Latency to 9, once the ASIC/uP launches the
> Column Address, it will need to launch the different data bits at different
> times so that they all arrive at the DRAMs at a CWL of 9.

and:

> there could be changes in Voltage and Temperature during its course of
> operation. To keep the signal integrity and data access reliable, some of
> the parameters that were trained during initialization and read/write
> training have to be re-run.

So now the chips have to dynamically tune the circuitry behind each pin to
suit the length of each trace it is connected to, and its temperature.

Keep those fingers well away from the board during operation. In fact I
imagine a cockroach strolling across the motherboard looking for a comfortable
spot at the right temperature could well wreak havoc.

------
tus88
This is insane. We should be thankful there are smart people in the world who
can devise this complexity, so that for the rest of us something apparently as
simple as a RAM stick just works. It's also why I have always been in minor
awe of electrical engineers.

~~~
baybal2
Welcome to the semiconductor industry.

Value every bit of RAM you have, as it was won in an uneven battle against
entropy, and laws of nature.

I think I wrote before that in my teenage years, I thought of studying to
become a process engineer. After living in Singapore, and meeting 2 retired
TSMC senior engineers there, I decided against that.

I had kind of a mentee-mentor relationship with them. They told me of the
horrors of "studying for a PhD for a coffee porter job," and of having to be
ready to endure that for many years to have the remotest chance of getting
into semi RnD.

The industry was just too competitive, and market dynamics keep getting more
and more adverse to newcomers with each passing year.

Each new generation of FABs is getting more and more automated, which means
more and more experienced process specialists are being thrown on the job
market with each new fab closure, and each of them wanting to go from FAB to
RnD.

They themselves said their decision to quit was thanks to that, and it
happened many years prior to them migrating to Singapore, and that it would've
been even worse at the time they were mentoring me.

~~~
willis936
I too started higher education with hopes of semiconductor RnD. I found my way
into a test and measurement lab and learned the stuff necessary for such a
gig, but found myself content with the work instead of hoping for some high
risk halo job. I’m still young but I’m doing relatively low pay work for a
university gig in a new field that interests me, rather than in a higher
paying/more stimulating industry job.

------
65a
Thanks, this is a phenomenally interesting read! You can see the algorithms
for the DDR3 version of this here:
[https://github.com/coreboot/coreboot/tree/master/src/vendorc...](https://github.com/coreboot/coreboot/tree/master/src/vendorcode/amd/agesa/f16kb/Proc/Mem)

------
Koshkin
This is amazing. Reminds me of the engine start sequence of the P-51D
"Mustang" fighter plane. (I have indeed wondered what is the process, in a
complex electronic device, of absorbing the initial shock of a power-up and
bringing itself into a stable working state.)

~~~
inamberclad
Hell, the P-51 is easier. Two fuel pumps, two magnetos, and twelve cylinders.
Switch the mags to both, start turning the engine, and add fuel [0].

[0]
[https://www.youtube.com/watch?v=iompxanAQgQ](https://www.youtube.com/watch?v=iompxanAQgQ)

------
subbdue
Hello .. I’m the author of this article. Reading all your comments was an
absolute blast.

Since there is some curiosity around temperature and voltage variation - here
are some more details for you folks to geek out on.

When you build a system with a DRAM interface, you typically specify 2
parameters:

- A temperature range you guarantee its operation within. For example, this
  range could be 0C-80C.
- Maximum rate of change of temperature your system can handle. Example,
  +/-2C/min.

Now, to test if the system can withstand the above 2 parameters, while the
firmware is being developed it is put in a Thermal Chamber and experiments
such as this are conducted:

- Do a cold soak for a few hours (i.e., power down the system and leave it in
  a 0C chamber for a few hours).
- Then power on the system and let the DRAM interface calibrate at this low
  temp.
- Then start a stress test which reads and writes to the memory, and
  simultaneously ramp up the temperature of the chamber at the specified rate
  up to your maximum (2C/min up to 80C in our example here).
- If the test fails, it typically means the signal integrity is not good
  enough. Then you go back to the lab, probe the DRAM interface and observe
  the signals on an oscilloscope (if you have to). Then re-calculate/fiddle
  around with 6 parameters until you have it all working.

These parameters are:

1. The drive strength of transistors at the Processor when it's writing data
   to memory
2. The termination resistance of the transistors at the Processor when it is
   Reading data back from memory
3. Voltage reference (Vref) - This is the value the PHY[++] uses to decide if
   a voltage level is a binary-0 or 1
4. The set of 3 parameters above exist on the DRAMs as well. Making it a total
   of 6.
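
To give a feel for how slow one iteration of this experiment is, here is a rough sketch of the time budget using the example numbers above (0C to 80C at 2C/min). The function name and the 3-hour soak are assumptions for illustration; the comment only says "a few hours", and this is not any real chamber controller's API.

```python
def ramp_minutes(start_c: float, end_c: float, rate_c_per_min: float) -> float:
    """Minutes needed to move the chamber from start_c to end_c at a fixed rate."""
    if rate_c_per_min <= 0:
        raise ValueError("ramp rate must be positive")
    return abs(end_c - start_c) / rate_c_per_min


# Example numbers from the comment: 0C to 80C at 2C/min.
ramp = ramp_minutes(0.0, 80.0, 2.0)

# Assumed soak time; the comment only says "a few hours".
soak_hours = 3.0
total_hours = soak_hours + ramp / 60.0

print(f"ramp: {ramp:.0f} min, one full cold-soak run: {total_hours:.2f} h")
```

With these numbers the ramp alone takes 40 minutes, so a single failed run costs most of a working morning before any parameter can be changed and retried.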

It is easy to imagine what drive strength and termination of transistors mean.
But Vref is a bit more interesting.

In DDR4, binary-1 is represented by a 1.2V signal, but binary-0 is a floating
voltage value. It could be 0.2V or 0.4V, or whatever. It depends on the
termination at either end of the PCB trace. This type of a circuit is called
POD (Pseudo Open Drain). Since the level of binary-0 is variable, the DDR
controller calibration logic has to figure out where to place Vref so it can
reliably decode 1s and 0s.
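
A minimal sketch of why the binary-0 level moves around, assuming an idealized DC model: the receiver terminates to VDDQ through a resistance `r_tt`, and the driver pulls to ground through `r_on`, so the low level is a plain voltage divider. The resistor values and function names below are illustrative assumptions, and a real controller finds Vref by sweeping and testing rather than by computing it like this.

```python
def pod_low_level(vddq: float, r_on: float, r_tt: float) -> float:
    """Voltage at the receiver when the driver pulls low.

    Modelled as a simple divider: the receiver terminates to VDDQ through
    r_tt, and the driver pulls to ground through r_on.
    """
    return vddq * r_on / (r_on + r_tt)


def midpoint_vref(vddq: float, r_on: float, r_tt: float) -> float:
    """Vref placed halfway between the high level (VDDQ) and the low level."""
    return (vddq + pod_low_level(vddq, r_on, r_tt)) / 2.0


VDDQ = 1.2  # volts, from the comment

# Assumed illustrative (driver, termination) pairs in ohms.
for r_on, r_tt in [(34.0, 48.0), (34.0, 120.0), (48.0, 240.0)]:
    low = pod_low_level(VDDQ, r_on, r_tt)
    vref = midpoint_vref(VDDQ, r_on, r_tt)
    print(f"Ron={r_on:.0f} Rtt={r_tt:.0f}: low={low:.3f} V, Vref={vref:.3f} V")
```

Changing either resistance shifts the low level, and the ideal Vref shifts with it, which is why Vref is trained together with drive strength and termination instead of being fixed.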

Lastly, just like the cold soak experiment, we also do hot-soaks with a ramp
down and several other modalities to ensure the system is solid.

The PHY has delay registers within it which you can read to figure out the
result of calibration. When you power on a system after a cold-soak vs a
hot-soak, you'll see different values in these delay registers.

PHYs these days are very robust. They typically don't need periodic
calibration (re-tuning of delay registers) while operating in a typical data
center environment. Of course, it's a different story if the system is sitting
somewhere off on an oil rig.

— [++] The PHY is separate from the DDR controller. This is the actual analog
circuits at the edge of the processor sending out and receiving signals on the
PCB.

~~~
retSava
Thanks for the write-up, very interesting!

Wow, the cold/warm soaks sound like they make for a very slow iterative
process when it doesn't work on the first try. Do you have several systems
soaked at the same time? So if the first you test fails, you can adjust
something and test on a second setup while the first is re-soaked?

I also thought much of the difference in length on the PCB is compensated by
those wiggly traces (so all have equal-ish length), but you still need to
compensate for it? Or is it just to gain a larger error margin?

~~~
subbdue
> _Wow, the cold/warm soaks sound like they make for a very slow iterative
> process when it doesn't work on the first try. Do you have several systems
> soaked at the same time? So if the first you test fails, you can adjust
> something and test on a second setup while the first is re-soaked?_

Thermal chambers are quite expensive, around $100,000 per unit. So bigger
shops such as Intel, AMD, Qualcomm probably have many. But I would be
surprised if smaller companies have more than a couple.

It is a painful process when a company develops their first system. As you
would guess, once they have a proven PCB design with DDR controller firmware,
the DDR sub-system design is reused in subsequent systems.

Now say you've been shipping the system for a couple of years. There is one
situation under which the above experiments will need to be performed again.

Say your system uses a 16GB DIMM. Micron and Samsung, the DIMM makers, are
always trying to improve their manufacturing process, moving to the next node
(14nm to 7nm) and so on. So every couple of years you'll find them EOL-ing
(End of life) a certain 16GB DIMM for a newer one. There is a chance you'll
start seeing failures with the new 16GB DIMM.

> _I also thought much of the difference in length on the PCB is compensated
> by those wiggly traces (so all have equal-ish length), but you still
> need to compensate for it? Or is it just to gain a larger error margin?_

You are partially correct.

Check out this image:
[https://www.systemverilog.io/ddr4-initialization-and-calibra...](https://www.systemverilog.io/ddr4-initialization-and-calibration#why-training)

PCB designers match the length of the data lines, which are hooked up in
a star-topology from the processor to different DRAMs on the DIMM.

But the address lines are hooked up using _fly-by topology_. So data signals
launched from the processor arrive at all the DRAMs at the same time. But the
clocks and address signals that are launched from the processor will reach
each DRAM on the DIMM at different times. So, initial calibration compensates
for this.
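
To put rough numbers on the fly-by skew described above, here is a small sketch. The propagation delay per millimetre and the route lengths are assumed values chosen for illustration, and the function name is made up; real boards and real training results will differ.

```python
def flyby_skew_ps(stub_distances_mm, ps_per_mm):
    """Arrival time of clock/address at each DRAM, relative to the first DRAM."""
    arrivals = [d * ps_per_mm for d in stub_distances_mm]
    first = arrivals[0]
    return [t - first for t in arrivals]


# Assumed values for illustration only.
PS_PER_MM = 6.5                      # rough propagation delay on a PCB trace
DISTANCES_MM = [40, 52, 64, 76]      # hypothetical route length to each DRAM

# Data signals arrive at all DRAMs together, so the data for each DRAM has to
# be delayed by the amount that DRAM's clock lags the first one.
for dram, skew in enumerate(flyby_skew_ps(DISTANCES_MM, PS_PER_MM)):
    print(f"DRAM {dram}: clock lags by {skew:.0f} ps, delay data to match")
```

Under these assumptions the last DRAM sees the clock a couple of hundred picoseconds after the first, which is a large fraction of a bit time at DDR4 speeds and more than length matching of the data lines alone can hide.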

------
stmw
This is really worth reading to see how far DRAM has come from the Apple II or
IBM PC days.

------
jhallenworld
For Intel, one part of the UEFI BIOS is the MRC (Memory Reference Code). There
are POST codes related to this that you can find. This code performs the DDR3
/ DDR4 initial calibration described in the article. There is a "rank margin
test" mode for it, to determine the eye width for each pin. To validate a
population of DIMMs, this code should be run over temperature.
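
As a sketch of what "eye width" means in practice: sweep a delay setting across its range, run a pass/fail test at each step, and take the widest contiguous passing window. This is a generic illustration of margining, not the MRC's actual algorithm; the sweep data and function name below are made up.

```python
def widest_passing_window(results):
    """Return (start, width) of the longest run of True in a pass/fail sweep."""
    best_start, best_width = 0, 0
    run_start = None
    for i, ok in enumerate(results):
        if ok:
            if run_start is None:
                run_start = i
            width = i - run_start + 1
            if width > best_width:
                best_start, best_width = run_start, width
        else:
            run_start = None
    return best_start, best_width


# Hypothetical sweep of one pin across 16 delay taps (True = test passed).
sweep = [False, False, False, True, True, True, True, True,
         True, True, False, False, True, False, False, False]

start, width = widest_passing_window(sweep)
center = start + (width - 1) // 2
print(f"eye opens at tap {start}, width {width} taps, center tap {center}")
```

A narrow window on one pin, or a window that shrinks as the chamber temperature changes, is the kind of result that would flag a marginal DIMM.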

------
Filligree
Hold on. This references periodic re-training as something switches and
similar should do.

Then what about a PC? Might my workstation become unstable if I don't restart
it if the room temperature changes?

~~~
WillSlim95
That is the function of the DDR controller; it does all that stuff for you.

------
classified
Wow. I had no idea RAM is such a complex topic. Thanks for all the work that
went into this article!

------
kabdib
I remember reading a guide from IBM about training the memory path of some
Power architecture system. The tl;dr; was essentially "Follow <big set of
rules for circuit board layout> _exactly_ , as if your life depended on it,
and spend six months tuning and qualifying parts."

It's amazing that anything works at all.

