Modern CPUs have a backstage cast (devever.net)
212 points by hlandau on May 30, 2023 | 85 comments



"...this is interesting is because POWER9 is basically the first time the public got a real view of how sophisticated the backstage cast actually is of a modern server CPU."

Not quite correct; the OpenSPARC T1 and T2 were publicly released and available by 2008.

https://www.oracle.com/servers/technologies/opensparc.html

"Large parts of this process are handled by vendor-supplied mystery firmware blobs, which may as well be boxes with “???” written in them.

The maintainers of the me_cleaner script likely have the clearest view of what is known.

https://github.com/corna/me_cleaner


>Not quite correct; the OpenSPARC T1 and T2 were publicly released and available by 2008.

Points for mentioning this! But things have come a long way since 2008. You can get Intel ME-less machines from the 2008 era. Not sure if OpenSPARC T2 has any management cores.

>The maintainers of the me_cleaner script likely have the clearest view of what is known.

Yep, absolutely. Much of what we know is thanks to the efforts of researchers like these. See also the talks on finding the 'Red Unlock' mode of modern Intel CPUs.



> It's responsible for initialising the chip and getting it out of bed enough to the point where at least one of the main cores can run using cache-as-RAM mode

The somewhat surprising but true implication is that on boot, the CPU is initialized before the RAM is initialized. So there is a window of time during boot when the main core on the CPU is running instructions that cannot access the RAM. Even on register-starved x86 it is possible to write code without using RAM, but it certainly seems more convenient to me to treat the cache as RAM.

Documentation for a special compiler that compiles to code that doesn't use RAM: https://github.com/wt/coreboot/blob/master/util/romcc/romcc....
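For flavour, here's a rough, hypothetical sketch of what pre-RAM C in that style looks like: no globals, no arrays, no real stack, all live values in registers, using legacy port I/O (COM1 at 0x3F8) to prove the core is alive. This is my own illustration, not code from coreboot:

    /* Hypothetical sketch of pre-RAM C in the romcc style: no global
     * variables, no stack-allocated arrays, no function pointers -- all
     * live values must fit in registers, because DRAM hasn't been
     * trained yet. Port I/O still works, so we can talk to the legacy
     * COM1 UART at 0x3F8 to prove we're alive. */

    static void outb(unsigned char value, unsigned short port)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
    }

    static unsigned char inb(unsigned short port)
    {
        unsigned char value;
        __asm__ volatile ("inb %1, %0" : "=a"(value) : "Nd"(port));
        return value;
    }

    static void uart_putc(char c)
    {
        /* Bit 5 of the Line Status Register: transmit holding register empty. */
        while (!(inb(0x3F8 + 5) & 0x20))
            ;
        outb((unsigned char)c, 0x3F8);
    }

    static void early_main(void)
    {
        uart_putc('H');
        uart_putc('i');
        /* ...RAM init would go here; only after it succeeds can ordinary
         * stack-using C run. */
    }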


I got some exposure to this at SiCortex, where we had our own MIPS-based processors and so had to do many of these things in software. There was one ColdFire (embedded 68K) processor running μClinux per board, plus 27 of our own. This "Module Service Processor" would boot first, fetch a boot image from the one-per-system "System Service Processor" (pretty generic PC), load that via JTAG into each node's cache, then finally set each one loose to do things like memory registration and interconnect setup. This all set the stage for the actual Linux boot, which itself involved two stages with a switch_root in between. My very first assignment was to work on some of that MSP-to-node stuff, then later I had to dive into memory registration at least twice, even though both were pretty far from my real specialty. Small company, y'know.

This kind of low-level work is significantly more complicated than even most kernel developers realize - hence the need for articles like OP. Ditto for anything on large (more than single-board) systems. The intersection of the two was, frankly, a bit exhausting. Just keeping track of all the moving parts and their respective states induced a cognitive load that made debugging other already-hard problems that much more difficult. My hat's off for anyone who has kept on doing that stuff longer than I did, or who has to do it in an environment where vendors are keeping so many secrets.


That reminds me of a story about SiCortex.

My university (University of Colorado Boulder) bought one of a very few SiCortex systems ever sold. As an undergraduate I competed at SC07 and SC08 in the Cluster Challenge competition.

Our coach was a CU faculty member, Doug, who was also responsible for the SiCortex box we bought. At SC08 he told us that another one of the teams was competing with a SiCortex box. We knew that they would win the LINPACK part of the challenge, but we didn't know that LINPACK was basically the only thing they managed to get working.

We had also heard rumors that SiCortex was in financial trouble at that time. When we were walking the show floor at SC08, we came across the huge SiCortex booth, which had 10 or so machines of different sizes (I believe the smallest was a 64-core workstation and the largest was a ~5000-core whole-rack system).

I remarked to Doug that SiCortex didn't look to be in such bad shape.

Doug turned to me and said, "25% of the machines SiCortex has ever made are in that booth".

The SiCortex idea was like VLIW. On paper the numbers look great. On highly optimized synthetic benchmarks it looks good. On real-world code you find out how hard it is to get good performance.


Sounds about right. The processors were six-core 500MHz single-issue, which was pretty damn slow even by the standards of the time (2006-2008), beefed up with some extra floating point and some relatively very fast interconnect stuff. The key is that there were a lot of processors - 972 in the big machine. So you really, really had to rely on parallelism to get any kind of performance, and a lot of code doesn't parallelize well at all. Even in the HPC space, a surprising number of codes just aren't designed to work on more than about 64 processors.

Also the machine positively sucked for linear integer code - like, say, compilers or OS kernels. One of the first things customers would do, naturally, was compile. Bad first impression.

Also, it was nearly impossible to get a Lustre MDS for a thousand-node cluster (which is what the biggest machine was) to run for any length of time without falling over, because Lustre was designed around the assumption that the MDS would be bigger and beefier than anything else, with "poor man's flow control" in the form of a relatively slow network. In our case the MDS was exactly the same as every other node and completely unprotected, because the interconnect was the fastest part of the system by quite a margin. That was my nightmare for those two years. I've heard that Lustre has since added some flow control ("network request scheduler") but I was never able to benefit from that. PVFS2 worked better, and Gluster (which I worked on for nearly a decade afterward) would probably have been better still because it's more fully distributed and less CPU-hungry.

The reason I mention all this is that there's an important lesson: building a system with a very unusual set of performance characteristics is a terrible idea business-wise, because people won't be able to realize its potential. Not even in a fairly specialized market. They'll just think it's slow. Unless it's truly bespoke, literally a one-off or close to it, nobody will want it.

P.S. I actually had to visit CU-Boulder to debug something on that machine, with the aforementioned Doug. It became one of my favorite "war stories" from a 30-year career, but this has gone on long enough so I'll skip it.


Hey, some of us are listening, and would love to hear that story!


This stuff is certainly pretty rarefied these days. I remember when the PPC970 came out, people were shocked at how difficult it was to bootstrap. IBM didn't really care, as POWER4 (from which it was derived) was not a merchant chip and they had management processors (and very high margins) in all their machines to handle it. Apple was the launch partner and even back then had a lot of in-house expertise doing this sort of work. Everyone else who tried to use it was in for some real pain, and most of them gave up. The guys doing the eval boards with support from IBM literally posted this: https://web.archive.org/web/20060715134515/http://www.970eva...

TL;DR, the last line is "Once all of the above is completed, the processor will be able to successfully fetch instructions from a boot source. You are now effectively at the same point you would have been 5 months ago, had this been a standard 750 bringup... Board bringup from this point should be very straightforward and follow established methods."


That's an amazing document. Practically every sentence, though tersely stated, hints at hours (or worse) of experimentation and head-scratching. The "would have been 5 months ago" bit at the end is remarkably restrained. I'm certain I would have quit (or worse) by that point. Respect and condolences to whoever did this.


I had to do some of this while bringing up a 486 to run my own kernel. Very frustrating, to the point that I had the reset switch of the machine wired to a sustain pedal just so that I didn't have to dive under the desk all the time.


CAR is a gem. It's great for lite OSes in hostile envs.


Funnily enough, a modern CPU doing CAR still has more memory than a PC from the 1980s. Presuming you statically recompiled them, you could run entire SNES games from a modern CPU’s cache!

(And that being said, now I’m wondering whether you could force eviction and retainment into L3 cache on demand, to achieve something like memory bank switching…)


SNES games? You could comfortably run Windows 95 within the L3 cache on many recent Intel processors (the one in my 2019-era MBP has 16 MB of L3 onboard; current generation processors go even bigger and Windows 95 only needs 4 MB).

It's not really clear to me from the limited bits of info that I've read whether or not L3 is guaranteed to be accessible when doing CAR, but, if it is, you've got enough memory available to do a lot of stuff. (And even the L2 cache is starting to get pretty big on the higher-end current-gen chips.)


Well, keep in mind that in the sort of state the computer is in when doing CAR, you don't get to talk to storage devices; nor do you get the benefit of having some kind of ROM on the bus. I know Windows 95 is happy to run from 4MB RAM with access to a hard drive; but how much memory would W95 need for a "bootable live-CD environment" where the disk image must be resident (if compressed) in memory along with all work RAM?

(This is why I compared to the SNES: if you have to map the SNES's RAM and [every bank of] the game's ROM, then you're looking at 4–16MB depending on the game. The SNES is pretty much the newest console whose games would entirely fit, I think.)


I think that the only thing that prevents you from ignoring the memory controller and initializing the rest of the x86 board while still remaining in CAR mode is the sheer ridiculousness of doing that. As for whether you have a memory-mapped ROM available, I'm not exactly sure, but the high-level model of what x86 firmware does seems to imply that the hardware maps a part of the SPI flash at the address range where there was a ROM chip on the 8086/286/386 PCs (the actual address ranges are different).
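You can poke at the legacy alias from a running Linux system, with the caveat that what you read after boot may be a RAM shadow rather than the flash itself. A hypothetical sketch (needs root; kernels with CONFIG_STRICT_DEVMEM may refuse the mapping):

    /* Sketch: dump the last 16 bytes below 1MB -- the legacy 8086 reset
     * vector region, which chipsets historically aliased to part of the
     * boot ROM / SPI flash. On a modern booted system this range may be
     * shadowed RAM rather than the flash itself. Requires root. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        /* Map the top 4KB of the first megabyte (0xFF000-0xFFFFF). */
        uint8_t *p = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, fd, 0xFF000);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        for (int i = 0xFF0; i < 0x1000; i++)   /* 0xFFFF0..0xFFFFF */
            printf("%02x ", p[i]);
        printf("\n");

        munmap(p, 0x1000);
        close(fd);
        return 0;
    }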


About 35 MB, if my memory serves right. Win98 could be stripped down to around 50 MB without a loss of functionality.

NB L3 is unified most of the time, but with L2 you still need to distinguish between data/code.


L2 is unified on all CPUs I'm aware of. Only L1 is split between instructions and data.


Guess I got L1/L2 messed up with L3 (don't drink and comment, kids).

But I found out that Itanium 2 did have a split L2 cache.


For a benchmark test for a British computer magazine in about 1996, I hand-stripped an installation of Windows 95 so that it could fit on a mid-1990s SSD: which is to say, a PCI slot device with 16 MB of DRAM which appeared to the computer as a disk drive. So, effectively, a 16 MB SSD. Not gigabytes, megabytes.

So I can say with some authority that it is possible to have a complete running installation of Windows 95 in 16 megs of disk space; however I had to trim it pretty brutally to fit, so, for example, I think it had 2 fonts left, one of which was used for the widgets that display window-close boxes and so on.

The problem was that the object of the exercise was to benchmark how much quicker it was, but there wasn't enough space left to install any applications, and our benchmark tool used real applications. So, although I did several days' work and I did get paid for it, the effort was at heart in vain. But in principle, yes, you could run Windows 95 entirely out of 16 MB of cache memory, and if you had say 32 MB of cache memory it would be no problem at all.


The Ryzen 5800X3D has 96 MB of L3 cache, enough to fit the biggest NeoGeo or N64 games.

The Zen 4-based Genoa-X will have up to 1 GB of L3 cache! You could comfortably fit Windows 95 there...


In fact, the largest POWER9 CPUs have up to 110MB of L3... and Zen 4's L3 apparently maxes out at 384MB(!!).


> Zen 4's L3 apparently maxes out at 384MB(!!).

That's with them holding back and not adding any v-cache. If you stacked an extra 64MB on each of the 12 compute dies, you'd have 1152MB.


Man now I want to see this.


What does hostile mean in this context?


It might be desirable to not trust RAM as other agents who have access to physical memory could be tampering with it.


Any machine you may not have True Control over: e.g. it's outside of JTAG range, in the cloud, or runs proprietary software.


A modern motherboard can update its BIOS from a USB stick WITHOUT a CPU or memory installed.

Think about that. The motherboard "knows" how to read a FAT file system from a USB mass storage device, verify its digital signature and flash it with no main CPU or memory.
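The FAT part really is small enough for a tiny microcontroller. As a rough illustration (hypothetical, not any vendor's actual code), here's how little arithmetic it takes to get from a FAT16 boot sector to the first data sector:

    /* Hypothetical sketch: the core of FAT16 boot-sector parsing, the
     * kind of thing a board microcontroller could do to find an update
     * image on a USB stick. 'sec' is the 512-byte boot sector, already
     * read in by the USB mass-storage layer. */
    #include <stdint.h>

    static uint16_t rd16(const uint8_t *p) { return p[0] | (p[1] << 8); }

    /* Returns the LBA of the first data sector (i.e. cluster 2). */
    uint32_t fat16_first_data_sector(const uint8_t sec[512])
    {
        uint16_t bytes_per_sec = rd16(sec + 11);  /* BPB_BytsPerSec */
        uint16_t reserved      = rd16(sec + 14);  /* BPB_RsvdSecCnt */
        uint8_t  num_fats      = sec[16];         /* BPB_NumFATs    */
        uint16_t root_entries  = rd16(sec + 17);  /* BPB_RootEntCnt */
        uint16_t fat_sectors   = rd16(sec + 22);  /* BPB_FATSz16    */

        /* Each root directory entry is 32 bytes; round up to sectors. */
        uint32_t root_dir_sectors =
            ((uint32_t)root_entries * 32 + bytes_per_sec - 1) / bytes_per_sec;

        return reserved + (uint32_t)num_fats * fat_sectors + root_dir_sectors;
    }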


I assume this is done with a microcontroller on the board.


And it's also born out of necessity, given that many Intel and AMD boards can't be booted with too new a CPU if the BIOS doesn't know about it - not even for flashing a new BIOS - so you needed to borrow an old CPU just for the sake of upgrading the BIOS.


It was originally to solve the issue where, if you lost power while flashing your BIOS, you'd brick the system irrevocably. Now even if the BIOS is corrupt and the system won't boot, you can reflash a known-good firmware with stock settings to get back up and running.


The ARC processor was formerly in the northbridge of the chipset.

Intel has since replaced this with an 80486 in modern designs; perhaps it is also implemented in the northbridge.

https://en.wikipedia.org/wiki/ARC_(processor)


I think you're talking about the ME but I don't think the ME is responsible for "BIOS" flashing. I think it must be a separate microcontroller. This is kind of the point of the original blog post: don't go looking for "the microcontroller" because there isn't just one; there are many.


Usually this is still host firmware and not a secondary controller, at least on x86 platforms. To detect/use the USB controller you still need to configure the chipset/root complex and do an initial PCIe bus scan to set up PCIe BAR apertures. After that you need a (primitive) USB stack that is able to talk to the USB controller to enumerate the USB devices, as well as block storage and filesystem layers. All of this code is implemented as a collection of DXE drivers that implement various UEFI protocols, which are initialized in the UEFI DXE phase that runs after the SEC and PEI phases. On Intel platforms PEI does things like training the DRAM and PCIe links, so memory is always available to DXE. Unfortunately, there's still a lot of code that needs to run to get this to work.

After FW update binaries are located it's not uncommon to write them to a scratch flash and then reset the system. On reset somewhere in the flow the scratch flash is checked for an update and then the hardware sequenced flashing registers in the chipset are utilized to actually flash the firmware. Another reset is performed to boot from the freshly flashed firmware.

There are variations on this flow depending on the firmware implementation and platform/vendor which can simplify it but that is usually the basic idea. Various microcontrollers are definitely employed for other platforms (even on x86, such as the embedded controller though these perform auxiliary tasks to the host firmware rather than the whole thing).

Updating without a CPU installed is usually done on the embedded controller itself, but that's not a normal update flow IIRC.
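For the curious, the "collection of DXE drivers implementing UEFI protocols" pattern looks roughly like this from the consuming side. A simplified EDK2-style sketch (not any vendor's actual update code): once the USB bus and mass-storage drivers have run, each LUN shows up as a handle carrying Block I/O:

    /* Simplified EDK2-style sketch of a DXE-phase consumer finding block
     * devices. Once the USB bus/mass-storage drivers have run, each LUN
     * appears as a handle carrying EFI_BLOCK_IO_PROTOCOL. */
    #include <Uefi.h>
    #include <Library/UefiBootServicesTableLib.h>
    #include <Protocol/BlockIo.h>

    EFI_STATUS
    EFIAPI
    CountRemovableBlockDevices (OUT UINTN *Removable)
    {
      EFI_STATUS             Status;
      UINTN                  Count;
      UINTN                  Index;
      EFI_HANDLE             *Handles;
      EFI_BLOCK_IO_PROTOCOL  *BlockIo;

      *Removable = 0;

      /* Ask the DXE core for every handle that carries Block I/O. */
      Status = gBS->LocateHandleBuffer (ByProtocol, &gEfiBlockIoProtocolGuid,
                                        NULL, &Count, &Handles);
      if (EFI_ERROR (Status)) {
        return Status;
      }

      for (Index = 0; Index < Count; Index++) {
        Status = gBS->HandleProtocol (Handles[Index],
                                      &gEfiBlockIoProtocolGuid,
                                      (VOID **)&BlockIo);
        if (!EFI_ERROR (Status) && BlockIo->Media->RemovableMedia) {
          (*Removable)++;   /* likely a USB stick, CD, etc. */
        }
      }

      gBS->FreePool (Handles);
      return EFI_SUCCESS;
    }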


This reminds me of the "It's Time for Operating Systems to Rediscover Hardware" talk by Timothy Roscoe:

https://www.usenix.org/conference/osdi21/presentation/fri-ke...


As the author of https://superuser.com/a/347115/38062 and https://superuser.com/a/345333/38062, you have my sympathy about the "pack of lies" involving real mode and several wrong combinations of selector and offset.


It's also worth adding that none of this is new. There's always been a reason that the "C" in "CPU" has stood for "central". The idea that there are other, non-central, processors around the place goes back a long time.

Four particular ones come to mind:

* The DPT range of SCSI host bus adapter cards, many years ago, had a full-blown MC680x0 processor on the card.

* Connor Krukosky famously installed a mainframe in his basement whose console front-end processor was a PC running OS/2.

* PC/AT keyboards had on-board microcontrollers running programs.

* And of course who can forget the BBC Micro's Tube?

It's the short period in history where people thought that computers came with only one processor that is the real oddity. (-:


The Tube used the attached coprocessor as the CPU when one was connected; otherwise the CPU in the BBC Micro itself was the CPU. With a Tube CPU connected (68K, Z80, 65C02, 32016 and more), the BBC's processor served as the I/O processor.

The elegant and well-adhered-to OS calls made this a straightforward process: if your program ran on the BBC standalone, it would work across the Tube for the 65(C)02, but for other coprocessors you had to at a minimum recompile and probably rewrite quite a bit of your code.

https://sites.google.com/site/jamesskingdom/Home/computers-e...

In a typical PC there are >10 actual processors in the various peripheral and controller chips, and then there is the management engine (a full-blown computer in its own right) or equivalent, and almost every peripheral will have one or more processors as well.


IMHO this makes the PiTubeDirect project perhaps even more impressive: it attaches emulated vintage microprocessors to the BBC Micro Tube interface, implementing the Tube circuitry in software.

https://github.com/hoglet67/PiTubeDirect


Had to reverse engineer a real mode PCI option ROM once... that was extremely unpleasant [1]. And then of course there's "Unreal Mode".

Moreover, just this week Intel actually proposed finally removing real mode. [2] I'm a bit worried about what this means for emulation of old 16-bit Windows and DOS software under Wine (one of the great ironies is that Wine can still run Win16 programs on an x64 host OS when Windows can't), though I suspect the performance requirements of such software are so low by modern standards that emulating such programs wouldn't pose any challenge.

[1] https://www.devever.net/~hl/ortega [2] https://www.phoronix.com/news/Intel-X86-S-64-bit-Only


See https://news.ycombinator.com/item?id=36074093 for a more significant worry. Emulating a CPU is not affected as much as code that would otherwise have still run on the bare hardware.


As a mildly related curiosity, why didn't the 4G memory address threshold on PCs get referred to as 'the bar'? I see in the first answer that both the 1M and 4G RAM thresholds get referred to as lines, which matches the terminology mainframes used for the 16M threshold. That would seem to correspond more closely with 1M than 4G...


Much of the openness of POWER7/8/9 was encouraged by Google, who wanted to have control over all the firmware, even the secret firmware. I think Google is also auditing PSP/ME source code, but the public only sees the audit results.


“Turtles all the way down.” Modern CPUs are so complex you need simpler ones to abstract them! Very cool breakdown of how POWER9 does this.


I miss these kinds of articles on the net. Is anyone else reminded of the CPU Praxis articles that were part of Ars Technica's early rise to popularity? I really miss those. This article is, of course, much shorter, but still, I miss that sort of content on the internet.


A while ago I bought some older AMD 8350 systems, which apparently are the last without a PSP, the platform security processor.

I did this as a sort of 'just in case' setup; I was planning to put OpenSolaris on it and run things under Zones or LX zones, and to run it as a backup server. Fast enough to get some work done, and possibly more secure if the PSP is ever used/broken maliciously...


That may well end up being a very prescient move. Be prepared to be labeled a tinfoil hat type until then, but I definitely think you are wise to take a precaution.


> The Self-Boot Engine (SBE) (quantity: 1) is a core which is responsible for booting the entire system. It's responsible for initialising the chip and getting it out of bed enough to the point where at least one of the main cores can run using cache-as-RAM mode; it does little after that point. It has some SRAM to do its work in and uses a slightly custom variant of the 32-bit Power ISA, extended to support 64-bit loads and stores using adjacent GPR pairs. This core design is known as a PPE. It's the first thing that runs on the CPU die.

What I’m curious about is: how does the ‘Self-Boot Engine’ initialize itself in the first few milliseconds?

Maybe a motherboard chip does the actual work, but that just invites the same question: at some point something must initialize itself. How?


Eventually some core boots from a ROM on the die or (less frequently) off-chip flash storage.

As to how that is accomplished, well, the core's logic is designed so that it reads the instruction from the reset vector in ROM on the first clock cycle after reset is released. Software has control after that.
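A toy software model of that hand-off, purely illustrative (the addresses and opcodes here are made up): the only "initialization" the fetch logic needs is a program counter hard-wired to the reset vector.

    /* Toy model of a reset sequence. The only "initialization" the core
     * needs is that the program-counter register is hard-wired to a
     * fixed reset vector when the reset line is released; everything
     * else follows from ordinary fetch/execute. */
    #include <stdint.h>
    #include <stdio.h>

    #define RESET_VECTOR 0x0000u  /* hypothetical: points into on-die ROM */

    static const uint8_t rom[] = { 0x01, 0x02, 0x00 };  /* fake instructions */

    int main(void)
    {
        uint16_t pc = RESET_VECTOR;  /* forced by wiring, not by software */

        for (;;) {
            uint8_t insn = rom[pc++];                /* fetch */
            if (insn == 0x00) break;                 /* fake "halt" */
            printf("executed opcode %02x\n", insn);  /* "execute" */
        }
        return 0;
    }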


I'm still a bit confused about the exact mechanism, how does the core logic 'read' anything before it's initialized?

What is doing the reading?


Instruction fetcher + address lines, and how these things are wired together. If you are really interested, various people have big sets of videos about creating an 8-bit CPU from scratch with TTL logic chips. Alternatively, or on top of that, you can dig up reference/hardware manuals for relatively simple CPUs/SoCs and look at the initialization parts and pieces.


Thanks for the info, though I think the point I'm curious about is how exactly the 'TTL logic chip' itself starts up.

Are there any 'how-to-make TTL logic chips from scratch' resources that you know of?

For example, the die images and schematic of the most basic TTL chip, the SN7400:

https://en.wikipedia.org/wiki/File:SN7400_1965.jpg

https://en.wikipedia.org/wiki/File:TTL-00-die-schema.jpg

show odd patterns, and I can't figure out how pushing electricity across it will make it do anything other than heat up.


Sorry for the late reply!

What you see on these dies can also be recreated manually with transistors, diodes, resistors, and maybe the odd capacitor. https://en.wikipedia.org/wiki/7400-series_integrated_circuit... shows the on-die circuit further down. There is https://en.wikipedia.org/wiki/NAND_gate to give you further examples, plus there are links to TTL logic in general and all sorts of other gates. It should be easy to find pages that show how to recreate this stuff with actual transistors and resistors (or youtube videos for that matter).

The startup is pretty much just powering these chips, and maybe having the input lines default to, for example, "low" with resistors for a defined input and then output state (if the chips don't offer that themselves already -> chip manual). However, all that doesn't happen in zero time, same as signal propagation, which needs time; this is also part of such chip manuals, which tell you about the timings, max frequencies, etc. they can be operated at.


> It should be easy to find pages that show how to recreate this stuff with actual transistors and resistors (or youtube videos for that matter).

Well, I haven't found any; even a how-to guide on linking together the gates on a breadboard and getting them to perform any operation that a TTL chip performs is way beyond the literature I've seen on the internet.

Do you know of some reference source?


Something like this? https://www.dummies.com/article/technology/electronics/circu...

or https://www.youtube.com/watch?v=sTu3LwpF6XI

and there are truckloads more of these.

There used to be electronics sets for kids with breadboards (some of them larger-scale for easier physical handling than standard breadboards), and they had manuals with them explaining all the things. I remember that they also contained descriptions of basic gates like (N)AND, (N)OR, NOT, plus flip-flops for storing 1 bit of data.


These all seem much simpler than replicating the real-world functionality of even the simplest TTL chips.

There's a huge gap between making a single gate or a few gates, and making something even as basic as the SN7400 from 1965.

I don't think I'm the first person in history to have noticed this, but the lack of learning materials is very odd.


Yes, because these chips follow certain design goals the simple examples don't care about. For example, assume you start scaling things up: there is the problem of "load" (aka current). If you look at the chip manuals, you will see how much current the inputs require to work and how much maximum current the outputs are able to handle; now imagine you have to wire many of these chips together or drive other stuff with them. This is why, for example, for LEDs you may need an additional transistor or even a driver IC (for multiple LEDs) to get the LEDs working, because the chip itself is not able to handle the current load. There are things like switching frequencies and load-dependent stable output voltages which may require replacing passive elements (e.g. a resistor) with active ones (diode/transistor), and so on. Read: this goes far beyond the core gate functionality, to actually have something usable in a wide range of applications.


Right, so it seems impossible, currently, to completely understand even the simplest 8 bit CPU's boot-up process.

The difference between the SN7400 and gates on a breadboard seems to be a black box. At least, I could not find any published material anywhere that actually explains this gap.

Have you managed to find any yet?

As an aside, this surprised me: there's no record of anyone, prior to this conversation, even raising the question, across all the major search engines, academic libraries, published books, etc...


If I remember correctly, there are actual simulations of a 6502 (the CPU used in the C64 and other machines of that time) around; e.g. there was even one done in JavaScript. Simulation as in simulating all the gates of the actual chip, and you can follow the process of what is happening.

People have been recreating CPU parts and I think also full blown CPUs in Minecraft with this redstone stuff (and had to "de-pig" their machines). That runs at about the same level as the above simulation, sort of.

There is also no "black magic" to a 7400, but with just superficial knowledge there is no way to break through that wall, i.e. you will have to go down that rabbit hole for quite some time before all the puzzle pieces fall into place. Then you should also be able to recreate a full 7400 on a breadboard, however, it may not look 1:1, but by then you are able to say why.

What I wonder though: What do you hope to gain from that? These are "just" transistor circuits, you provide power, they are "ready". Yes, there are timings involved: Nothing is instant, power is ramped up, signal distribution takes time. There are many 8 bit CPUs/SoCs around, just grab one of the manuals, often they are split into 2, one aiming at the hardware side of things (voltages, currents, frequencies / timings, etc.), and one at the "logic" and how to program all the parts and pieces. One thing that is shared by all of them: A reset line or at least some reset timings for things to settle in the chip to a defined state, because, as noted, things are not instant. Afterwards the initial piece of software takes over.


I don't understand how the behaviour of simulations and Minecraft CPUs relates to the real world.

Can you explain how you know they are correct approximations?

From my perspective it seems to still require knowledge, of the actual SN7400 for example, to verify.


Have a look at http://www.visual6502.org/JSSim/

That aside, I'm still wondering what you are on about. The "internal logic" of something like a 7400 is fully defined by the transistors, so to understand that, you have to understand how transistors work. Luckily, we are not dealing with analogue computers, but digital/binary ones, so things can be simplified to on/off. Transistors are something like switchable resistors, ideally between zero resistance ("on") and infinite one ("off"). This state is controlled by the "base" (BJT) or "gate" (FET). Which means if this base/gate has a defined state, the output will have a defined state, and that is where the specifics of the respective 74XX chip are important -> chip manual will give all the required details. But as such there is no "start-up procedure" or anything with these things as long as they have no internal state, and NAND gates don't have that. Once we talk about flip-flops, latches, registers, and similar things, then we have state, and usually also some way to reset that state. 8 bit CPUs all come with a reset line for some reason, or handle this automatically as part of their power-up.
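A toy model of that stateless-vs-stateful distinction (hypothetical, in C rather than silicon): a NAND output is a pure function of its inputs, so there's nothing to start up; but cross-couple two NANDs into an SR latch and the circuit suddenly remembers, which is exactly why parts with state need a reset or defined power-up behaviour.

    /* Toy model: a NAND gate is stateless -- its output is purely a
     * function of its inputs, so there is nothing to "start up".
     * Cross-couple two NANDs into an SR latch and you get state, which
     * is why parts with storage need a reset (or defined power-up) to
     * reach a known value. */
    #include <stdio.h>

    static int nand(int a, int b) { return !(a && b); }

    int main(void)
    {
        /* SR latch from two NANDs (active-low set/reset inputs). */
        int q = 0, qn = 1;  /* we must *choose* a starting state here;
                               real silicon powers up in an undefined one */

        int s_n = 0, r_n = 1;            /* pulse Set low... */
        for (int i = 0; i < 4; i++) {    /* iterate until the loop settles */
            q  = nand(s_n, qn);
            qn = nand(r_n, q);
        }
        printf("after set:     Q=%d\n", q);  /* Q = 1 */

        s_n = 1;                         /* ...then release Set */
        for (int i = 0; i < 4; i++) {
            q  = nand(s_n, qn);
            qn = nand(r_n, q);
        }
        printf("after release: Q=%d\n", q);  /* still 1: the latch remembers */
        return 0;
    }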

And that is where these "we build stuff from 74XX chips up to an 8-bit CPU" video series come in, because they show all the CPU components and their wiring, and potentially also give access to a full schematic so you can follow the traces and see + understand what happens as part of the power-up.


Thanks for posting the link, that's pretty cool, but it leads back to my question: how do you know they are correct approximations?

> That aside, I'm still wondering what you are on about. The "internal logic" of something like a 7400 is fully defined by the transistors, so to understand that, you have to understand how transistors work.

What I'm on about is precisely that the 'internal logic' of the 7400 is unknown and not fully defined.

You make an assumption here that does not seem to be backed up by any published source: "it's fully defined by the transistors".

For example, it might very well be reliant on other aspects, such as the length of the circuit traces to function correctly.


https://www.ti.com/product/SN74LS00

Datasheet for a 7400 type chip (or actually multiple). Go check it out.


You linked to a completely different product, it says on the site "SN74LS00", "4-ch, 2-input, 4.75-V to 5.25-V bipolar NAND gates".

Are you getting confused as to what a bipolar NAND gate is vs. what we were discussing (the SN7400 TTL chip)?


https://www.ti.com/product/SN7400 is also "4-ch, 2-input, 4.75-V to 5.25-V bipolar NAND gates". I mentioned BJT and FET, for the latter there is https://www.ti.com/product/SN74AHCT00

Might be useful to know what BJT and FET mean; then you will see that standard/original TTL is always "bipolar", for example.


Do you not realize that the design that TI currently sells is different from the 1965 design that we were discussing?

They are even manufactured using very different technology.

Like I said, just because it shares the same part number and is labelled "bipolar NAND gate" does not mean that it's approximately the same.


Well, this obviously has nothing to do with boot-up of a CPU anymore, either, so I'm out now.


A CPU can contain hardcoded memory; it can detect the power-on event.

That we have an incredibly fancy chain of smaller CPUs booting bigger CPUs is the more surprising thing! Things should be initializing themselves. Those are 'just' voltage levels in the chip.

I would 'simply' have the main core of the main CPU come out of power-on-reset in a vaguely usable state, and then let regular firmware on a NAND spin up the RAM and setup everything else!

It's still not clear to me why a second core is needed to set up the first core the right way, instead of it coming out of reset the right way.


Frequently (I work on large SoCs) it's less risky and more flexible to let software do boot.

For example, there is some initialization step in the way between the big core and its instructions, and it is deemed easier/less risky to let software perform this initialization rather than do it in hardware and not have a fallback in case of hardware bugs. Or, if you do have the fallback, then you've just reinvented the simple-core boot mechanism, so you might as well ditch the hardware work.

I hate this trade-off because it means so much irritating work for the software engineer and frequently has downstream impact on other parts of the system (e.g. does PCIe link come up in time?) but the risk of silicon problems drives this design decision.


Since all POWER9 firmware is open, I can actually answer this:

The very first instructions executed by the SBE are from OTPROM (one-time programmable ROM, set at the factory). This is just a few instructions: https://github.com/open-power/sbe/blob/master/src/boot/otpro...

The first SBE code embodied in a mutable nonvolatile storage device executed is here: https://github.com/open-power/sbe/blob/master/src/boot/loade...

Interestingly the SBE code is actually stored on a serial I2C EEPROM on the CPU module itself, not on the host. This is quite unusual from an x86 perspective where there is usually not any mutable nonvolatile state on the CPU itself.

POWER9 is also a little unusual in that just powering on the CPU doesn't cause it to do anything by itself. You have to talk to the CPU over FSI (Flexible Service Interface), a custom IBM clock+data control protocol which is mastered by the BMC and used by the BMC to send a 'Go' signal after powering on the system's power rails. This FSI interface is basically the CPU's "front port" - think of an operator front panel on an old mainframe. It's kind of fascinating how IBM's system designs still reflect this kind of past. Indeed, until very recently (I think this only changed with POWER8, but it may have been POWER7) IBM's own POWER server CPUs weren't designed to boot, in the sense of being brought up in a self-sufficient way - they were externally IPLed by the FSP (IBM's version of a BMC).

Basically, what we think of as boot firmware would actually execute on the FSP - which is a PPC476 running Linux - so you have the boot firmware executing as a userspace Linux program on the FSP, doing register pokes to get the CPU initialised, do memory training, etc. all remotely via the FSI master, rather than it being done on the CPU itself. It would even load the hypervisor, PowerVM. The CPU itself traditionally was essentially held in reset until the service processor had completed all this init.

So the POWER CPUs have traditionally all been initialised externally, rather than via 'bootstrapping' in the literal sense. Even memory training was done this way, all via the CPU's front port. What's really amazing is that when IBM wanted to switch to a self-boot model around the time of POWER8 (I think), they looked at their boot/IPL code, which was designed under the assumption it would be running on an external processor and initialise the CPU by probing it. Their response was to write a C++-level emulation environment for all the internal API calls this firmware uses to access the hardware. Basically, you have hardware initialisation procedures written as though running on a separate machine, and then a C++ framework all of this is hosted in which pretends that this is still the case, but allows all of these hardware initialisation procedures to be reused without having to rewrite them all. Amazingly, it works. This boot firmware is so large, though, that it even has to have a paging mechanism for its own code while running in CAR mode(!). http://github.com/open-power/hostboot

With POWER8/9 since IBM moved to a self-boot model, the FSI interface is often just used to send the 'Go' signal, and I believe by the BMC to query temperature information after boot. But the CPU won't do anything until you send that signal. It's kind of cute, really. Sending the 'Go' signal starts the SBE running. And yes, this means that all of the hardware initialisation procedures which were written to be used from a separate service processor are now never used that way.

This kind of 'front port' based design to system 'IPL' in which a service processor just pokes the CPU remotely to initialise it is pretty fascinating. It's like the idea of automating what a human operator would have done to initialise a mainframe long, long, ago (makes me think of the autopilot in Airplane). Though of course the amount of stuff you'd have to do flipping switches on an operator panel to initialise one of these CPUs may take a lifetime if done manually...

So this is an example of how IBM's technical history very definitely lives on and is reflected in their systems engineering today.

Worth noting an advantage of this FSI interface is that it's also effectively a hardware debugger interface. In fact I believe the Talos/Blackbird systems ship with the 'pdbg' hardware debugger tool accessible via BMC SSH shell. So these systems effectively have a hardware debugger built in, which is just a Linux userspace tool which pokes the CPU's debug registers over FSI. https://github.com/open-power/pdbg

The idea of 'front ports' seems pretty rare nowadays in most SoCs. Or rather, it actually does exist: it's called JTAG. In that regard IBM's traditional boot-via-FSI model isn't so different from the idea of initialising a chip at boot using JTAG. ...I'm fairly sure I've heard some horror stories in which some systems actually do this.


EDIT (leaving old for reference):

The reset pin "master core release" appeared with POWER9 to support BMC-less (IBM docs lingo: SPless) boot process, and depends on tying JTAG_TMS pin high to enable autostart mode (OpenPower and FSP tie that pin low and use FSI magic write to trigger SBE)

I believe the actual "go" signal has been a pin separate from FSI since POWER8. Even in the BMC-less bootstrap (documented for POWER9 but AFAIK not used by anyone) you essentially need to have a bit of logic on the motherboard responsible for sequencing startup and only releasing the CPU reset pins once power is stable, etc.


Yeah. To my knowledge nobody has used the SP-less boot mode. But the 'Go' signal is definitely sent over FSI and not just a GPIO. The BMC code that does this is here:

https://github.com/openbmc/openpower-proc-control/blob/maste...


Thanks for the really detailed reply!

> The very first instructions executed by the SBE are from OTPROM...

How is the very first instruction executed by the SBE in an uninitialized state?


Not sure I follow. The SBE is designed to boot the system. If it requires any initialisation itself, it must be handled by a hardcoded state machine in the RTL logic.


> If it requires any initialisation itself, it must be handled by a hardcoded state machine in the RTL logic.

Thanks, that's what I was trying to ask about.

How does the 'hardcoded state machine in the RTL logic' in the SBE execute the very first instruction?


The SBE is a processor. It fetches instructions and executes them, starting from a reset vector. That reset vector is the OTPROM.


To clarify, I'm asking how the 'hardcoded state machine' functions at the beginning of boot-up, not about the overall SBE that is initialized later.


I understood nothing (as a sysadmin), but this looks like a very interesting article for those who can understand it.


I think you can see a modern CPU as a network. There are some beefy servers doing all the heavy lifting, which is what outsiders see. But there are also a few smaller servers here and there monitoring the system (or even responsible for powering on the entire network).


Author here. This is very much the case for a computer system as a whole also. Basically a network of cooperating microprocessors, including in I/O peripherals etc.

PCIe in particular is literally a packet-switched computer network - it has a physical layer, data link layer, and a transaction layer which is basically packet switched. There are even proprietary solutions for tunnelling PCIe over Ethernet.
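To make "basically packet switched" concrete, here's a rough sketch of composing the first header DWORD of a PCIe memory-read request TLP. Bit positions are quoted from memory, so verify against the PCIe spec before trusting them:

    /* Rough sketch of composing the first DWORD of a PCIe Memory Read
     * Request TLP header, to show how network-like the transaction
     * layer is. Bit positions from memory -- check the PCIe spec. */
    #include <stdint.h>

    uint32_t mrd_dw0(uint32_t length_dw, uint32_t traffic_class)
    {
        uint32_t fmt  = 0x0;  /* 3DW header, no data payload (a read) */
        uint32_t type = 0x00; /* MRd: memory read request */

        return (fmt           << 29)
             | (type          << 24)
             | (traffic_class << 20)   /* TC: like a QoS field in a network */
             | (length_dw & 0x3FF);    /* requested length, in DWORDs */
    }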


To make it even funnier: Digital's last Alpha CPU, the EV7 (essentially the ancestor of the AMD K8, which finally brought "mesh" networking to mainstream PCs), actually had an IP-based internal management network!

Each EV7 computer had, instead of a normal BMC, a bigger management node connected to a 10 Mbit Ethernet hub (twisted-pair Ethernet, fortunately :P), and this network was then connected to things like I/O boards, power control, system boards... including each individual EV7 CPU. Each connected component had a small CPU with Ethernet that was responsible for interfacing its specific component to the network, and when the system booted, part of the process involved prodding the CPUs over Ethernet to put them into an appropriate halt state from which they could start booting.


This kind of thing with functional domains accessible over Ethernet existed in at least one laptop as well, where you could connect to the "nodes" once you busted into it (my article): https://oldvcr.blogspot.com/2023/04/of-sun-ray-laptops-mips-...


And you have smaller ones that basically PXE-boot the bigger ones and manage the power, cooling, etc. It is datacenters all the way down.

As someone who used to do embedded work, there is a reason I felt most at home in Erlang and Elixir.

Their processes, which share nothing and use message passing, are really close to how it looks to build and code for an embedded platform.


Doesn’t mention the special core reserved for the NSA and other national security agencies :-)


I was surprised when I read how the Raspberry Pi's GPU handles booting.

https://forums.raspberrypi.com/viewtopic.php?t=266130



