
Godson-3: A Scalable Multicore RISC Processor with x86 Emulation (2009) - peter_d_sherman
https://www.computer.org/csdl/magazine/mi/2009/02/mmi2009020017/13rRUwvBy5P
======
monocasa
I really like the underlying approach. For those who can't read it (scihub is
having issues with the doi): they ran qemu targeting x86->mips, and then added
new mips instructions to get around what was taking a long time to emulate:
x86 flag generation on arithmetic ops, an x86-like FPU and VU, and additional
addressing modes.
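To see why flag generation is the pain point, here's a rough sketch (my own, not QEMU's or Godson-3's actual code) of what a software emulator has to do after every single x86 32-bit ADD; a dedicated host instruction can collapse all of this into one op:

```c
#include <stdint.h>

/* Illustrative sketch: recomputing x86 arithmetic flags in software
 * after a 32-bit ADD. (AF omitted for brevity.) Bit positions match
 * the x86 EFLAGS layout. */

enum { CF = 1u << 0, PF = 1u << 2, ZF = 1u << 6, SF = 1u << 7, OF = 1u << 11 };

static uint32_t x86_add32_flags(uint32_t a, uint32_t b, uint32_t *out)
{
    uint32_t r = a + b;
    uint32_t f = 0;

    if (r < a)                            f |= CF; /* unsigned carry out */
    if (r == 0)                           f |= ZF;
    if (r & 0x80000000u)                  f |= SF;
    /* signed overflow: operands have the same sign, result differs */
    if (~(a ^ b) & (a ^ r) & 0x80000000u) f |= OF;
    /* PF is set when the low byte has an even number of 1 bits */
    if (!(__builtin_popcount(r & 0xffu) & 1)) f |= PF;

    *out = r;
    return f;
}
```

Every guest add/sub/cmp pays this tax in the emulated path even though the flags are usually never read, which is why strategies like lazy flag evaluation (or, here, hardware help) matter so much.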

What they came up with really reminds me of the inside of a modern x86 core,
which sort of looks RISC-esque after the decoder; they're just doing the
decode in software. It's a really interesting halfway point between a
traditional x86 core and Transmeta's that way.

~~~
jbottoms
The Modcomp computer, 1970+, had a dynamically programmed instruction set
capability using a programmable ROM. It was targeted at applications written
for other systems, and at realtime problems that benefited from an
instruction set crafted for a specific problem.

~~~
monocasa
I've heard the argument from a few chip designers that the main thing
user-microprogrammable machines got you was a Harvard architecture. You got
most of your wins by pushing complex ideas into ucode, where each step fetched
from the ucode ROM instead of the main memory bus. That's why you see a memcpy
instruction on pretty much every microcoded machine: it's the most obvious win
for getting the ifetches out of the way of your data throughput.

Once we started adding instruction caches, that big win stopped being so big.
Ultimately it's the caches that killed off that kind of machine, and the ucode
interpreters you saw on the Smalltalk and Lisp machines.

One idea from ucode that I would like to see make its way over, though, and
hasn't yet, is being able to define interrupt boundaries for the realtime
tasks you allude to.

------
saagarjha
Found a mirror by searching Google for the paper title, for those who aren't
IEEE subscribers:
[https://home.deib.polimi.it/sami/architetture/godson.pdf](https://home.deib.polimi.it/sami/architetture/godson.pdf)

------
snvzz
Tangentially related, RISC-V's J Extension is under development for JIT
acceleration, both for emulating other ISAs and running Java and the like.

~~~
monocasa
Do you know where the working group for that lives? That extension always
seemed questionable to me. Most of those extensions are pretty short-lived and
marketing-driven on other ISAs. The Azul guys, AFAIK, didn't even put anything
interesting like that into their user ISA; their secret sauce was in the MMU.

~~~
dfrage
They had a custom user ISA instruction to make write or read barriers cheaper.

~~~
monocasa
What was that, other than read or write barriers that happened to be cheap?

~~~
dfrage
Only that, but they did more magic at the user level and between the kernel
and user levels. See
the paper _The Pauseless GC Algorithm_ by Cliff Click, Gil Tene, and Michael
Wolf at
[https://www.usenix.org/legacy/events/vee05/full_papers/p46-c...](https://www.usenix.org/legacy/events/vee05/full_papers/p46-click.pdf)
; from the abstract:

> Azul Systems has built a custom system (CPU, chip, board, and OS)
> specifically to run garbage collected virtual machines. The custom CPU
> includes a read barrier instruction. The read barrier enables a highly
> concurrent (no stop-the-world phases), parallel and compacting GC algorithm.

From the paper, including some bits on the MMU you mentioned:

> Our read-barrier allows us to intercept and correct individual stale
> references, and avoids blocking the mutator to fix up entire pages. We also
> support a special GC protection mode to allow fast, non-kernel-mode trap
> handlers that can access protected pages.

> Having the read-barrier implemented in hardware greatly reduces costs. In
> our case the typical cost is roughly that of a single cycle ALU instruction.

> The hardware TLB supports an additional privilege level, the GC-mode,
> between the usual user- and kernel-modes.... Several of the fast user-mode
> traps start the trap handler in GC-mode instead of user-mode.

> The TLB is managed by the OS in the usual ways, with normal kernel-level TLB
> trap handlers being invoked when normal loads and stores fail an address
> translation. Setting the GC privilege mode bit is done by the JVM via calls
> into the OS. TLB violations on GC-protected pages generate fast user-level
> traps instead of OS level exceptions.

> The hardware supports a fast cooperative preemption mechanism via interrupts
> that are taken only on user-selected instructions, allowing us to rapidly
> stop individual threads only at safepoints. Variants of some common
> instructions (e.g., backwards branches, function entries) are flagged as
> safepoints and will check for a pending per-CPU safepoint interrupt. If a
> safepoint interrupt is pending the CPU will take an exception and the OS
> will call into a user-mode safepoint-trap handler.

Then read starting with section 3.3 Hardware Read Barrier for the details.
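To make the fast/slow path split concrete, here's a rough software sketch of what that single-cycle barrier instruction does; the metadata bit name and encoding are my own illustration, not Azul's actual layout (their scheme also checks for references into pages being relocated):

```c
#include <stdint.h>

/* Illustrative sketch of a GC read barrier. Each loaded reference
 * carries a hypothetical "not-marked-through" (NMT) metadata bit; if it
 * doesn't match the current GC cycle's expected value, the reference is
 * stale and gets corrected individually, rather than blocking the
 * mutator to fix up an entire page. */

#define NMT_BIT (1ull << 47)         /* assumed metadata bit position */

static uint64_t gc_expected_nmt = 0; /* flipped at the start of each GC cycle */
static int stale_fixups = 0;         /* counts slow-path entries, for illustration */

static uint64_t fixup_stale_ref(uint64_t ref)
{
    /* Stand-in for the fast user-mode trap handler: record the
     * reference as traversed by updating its NMT bit. */
    stale_fixups++;
    return (ref & ~NMT_BIT) | gc_expected_nmt;
}

static uint64_t read_barrier(uint64_t ref)
{
    if ((ref & NMT_BIT) == gc_expected_nmt)
        return ref;                  /* fast path: ~1 ALU cycle in hardware */
    return fixup_stale_ref(ref);     /* slow path: intercept and correct */
}
```

The point of putting this in hardware is that the fast path, which is overwhelmingly the common case, costs about as much as a single ALU instruction.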

See also _C4: The Continuously Concurrent Compacting Collector_ by Gil Tene,
Balaji Iyengar, and Michael Wolf for how they moved this to x86 hardware:
[https://www.azul.com/files/c4_paper_acm1.pdf](https://www.azul.com/files/c4_paper_acm1.pdf)
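For the cooperative preemption mechanism quoted above, here's roughly what the equivalent looks like when a JIT has to emit it in software, with a poll compiled into backwards branches and function entries; Azul's hardware folds the check into flagged variants of the instructions themselves, so the untaken case costs nothing extra. All the names here are mine, for illustration:

```c
#include <stdatomic.h>

/* Illustrative sketch of cooperative safepoint polling, as a JIT
 * without hardware support would emit it. */

static _Atomic int safepoint_pending = 0;  /* set by the VM to stop a thread */
static int safepoints_taken = 0;

static void safepoint_handler(void)
{
    /* Stand-in for the user-mode safepoint-trap handler: the thread is
     * now at a known-safe point where the GC can inspect its state. */
    safepoints_taken++;
    atomic_store(&safepoint_pending, 0);
}

static void safepoint_poll(void)
{
    if (atomic_load_explicit(&safepoint_pending, memory_order_relaxed))
        safepoint_handler();
}

static long sum_loop(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];
        safepoint_poll();   /* poll emitted at the loop's backwards branch */
    }
    return s;
}
```

Polling only at selected instructions is what makes this "cooperative": threads can only be stopped at points where their stack and register state is precisely known to the GC.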

------
stevenqq
this is a 10yr old paper

~~~
Timothycquinn
Yes. OP: Please add (2009) to title.

~~~
peter_d_sherman
I would if I could, but HN will not allow me to edit the title at this point.
Apparently there's a window after an article is posted during which the title
can be modified, but we're past that point now... My humble apologies, I
would change the title if I could...

------
frabert
Wasn't "x86 emulation" basically the WinChip's and Cyrix's strategy in the
late '90s-early '00s? I don't remember that ending too well, unfortunately...

~~~
Iwan-Zotow
No, they were "real" x86 chips

you're probably thinking about Transmeta

~~~
Symmetry
Which has returned from the dead as Nvidia's ARM-emulating Project Denver.

~~~
CalChris
Denver started out as x86, but Nvidia was scared away by Intel legal and
switched to ARMv8 with an architecture license from ARM.

