CISC-y RISC-ness (tedium.co)
91 points by Amorymeltzer on April 26, 2023 | 71 comments



They're microcontrollers, but the Parallax Propeller ( https://en.wikipedia.org/wiki/Parallax_Propeller ) and Propeller 2 ( https://www.parallax.com/propeller-2/ ) are really interesting architectures for several reasons. For one, they allow the code running on an individual core to be packaged up and reused like a module. They also include a high level language in ROM along with useful lookup tables and very versatile and capable I/O hardware.


The propellers used a really interesting architecture - they had a large number of cores but the cores lacked features like interrupts or core timers. That meant that things that would normally be peripheral functions became code on a core.


Exactly! So many tricks pulled off with a microcontroller require carefully timed, hand-coded assembly loops. Chip Gracey managed to preserve that ever-so-useful fallback mode of microcontroller development while enabling multi-core applications. As a result, many of those carefully hand-timed software blocks are now available for download and reuse.

On single core microcontrollers, all that carefully timed assembly ends up interleaved with other logic in ways that make it nearly impossible to untangle for reuse.
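To make the "peripheral as code on a core" idea concrete, here is a minimal Python sketch of what a software UART transmitter has to compute: the sequence of line levels for one 8N1 frame and the number of clock cycles to hold each level. The function names and the 80 MHz / 115200 baud figures are illustrative assumptions, not taken from any Parallax library; on real hardware this would be a timed loop dedicated to one core.

```python
def uart_8n1_levels(byte):
    """Line levels for one 8N1 frame: start bit (0), 8 data bits LSB first, stop bit (1)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def cycles_per_bit(clock_hz, baud):
    """Clock cycles the software loop must hold each level (rounded to nearest)."""
    return round(clock_hz / baud)


# Hypothetical example values: an 80 MHz core clock and a 115200 baud link.
frame = uart_8n1_levels(0x41)           # ASCII 'A'
hold = cycles_per_bit(80_000_000, 115_200)
print(frame, hold)
```

Because the loop owns its core, nothing else can disturb the timing, which is what makes such a block reusable as a module.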


Sounds like XMOS. The problem is you end up paying the extra cost of doing things in software but 99% of the time you just want a completely standard I2C, SPI, UART, AXI or whatever.


Yeah, XMOS did a similar thing targeting audio (and I guess now AI), where this works brilliantly because sometimes you need very weird interfaces and custom filters, and you also generally need to do tricks to pull tons of performance out of your cores. Parallax attempted to do the same in general-purpose microcontrollers, and I'm not sure they have really come out ahead for it - most of the time you really do just need I2C or SPI, and you have a 100 MHz 32-bit ARM with a full suite of peripherals as your alternative.


Was there ever a compiler that targeted the Crusoe architecture directly without having to go through the translation layer? I remember there was a big deal about how the thing would translate your code the first time you ran it and keep the translation in a cache, but I always had the impression that the translation layer was a bit unoptimized because it was a step away from the original code. Also, everybody ever who has tried to make a VLIW architecture for general computing has ended up with a slow and expensive processor that can't compete.

The article makes a big deal about CISC vs. RISC, but in the end all of the CISC architectures ended up being RISC under the hood with front end instruction translation for the more complex CISC instructions. The transition to 64 bit should offer chip manufacturers a golden opportunity to ditch old disused complexity, like multiple addressing modes and variable length instructions.


A then Transmeta employee (of mild fame) says that the raw Transmeta architecture isn't really suited as a target for normal end user code, as things like memory protection don't actually exist in the "real" architecture. (https://marc.info/?l=linux-kernel&m=105606848227636&w=2)

Some people did do a bit of a dive into the raw architecture (https://www.realworldtech.com/crusoe-exposed/) but I don't think anyone ever really tried to create a compiler for it, and it would be a moving target anyway, as the different Transmeta CPUs seem to use different instruction widths (128 bits wide on Crusoe, 256 bits on Efficeon).


The "Transmeta employee (of mild fame)" is Linus Torvalds. I don't mean to spoil the surprise; I just want to encourage people to read his explanation who might have overlooked it otherwise. :)


CMS was written in C and translated with a custom P95/P2000 backend for good ol' GCC. However, your question is of course about a publicly available compiler. Well, it would be a bit tricky as the ISA was never public and the firmware tried very hard to keep the innards secret.

An interesting fact was that CMS could occasionally (not always, but not rarely either) do a better job than an ahead-of-time compiler, because CMS benefits from statistics collected at runtime and can generate code under assumptions without compensation code. When the assumptions are found to be violated, it can just recompile the code again.

The challenge is always having the compilation overhead be [more than] paid off by the runtime gain. Sometimes it worked, other times it didn't. Unfortunately, people really notice when it doesn't work, which added to a negative perception of Crusoe. Conventional processors (even out-of-order ones) have far more predictable performance.
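The "translate under an assumption, recompile when it breaks" idea described above can be sketched roughly as follows. This is a toy Python model, not how CMS worked internally: the `Block` class, the division example, and the hot threshold of 3 are all made up for illustration, and the explicit guard check stands in for the hardware checkpoint and rollback that let real translations omit compensation code.

```python
def interpret(x, y):
    """Slow, always-correct path: also handles the rare y == 0 case."""
    return 0 if y == 0 else x // y


class Block:
    """One block of guest code with runtime statistics and an optional translation."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold
        self.runs = 0
        self.saw_zero = False      # runtime statistic gathered while interpreting
        self.translated = None     # (guard, code) once the block is hot
        self.translations = 0
        self.rollbacks = 0

    def _translate(self):
        self.translations += 1
        if self.saw_zero:
            # Conservative translation: no assumption, so no guard is needed.
            self.translated = (None, interpret)
        else:
            # Speculative translation: assumes y != 0 and drops the check.
            self.translated = (lambda x, y: y != 0, lambda x, y: x // y)

    def execute(self, x, y):
        self.runs += 1
        if self.translated is None:
            self.saw_zero = self.saw_zero or (y == 0)
            if self.runs >= self.hot_threshold:
                self._translate()
            return interpret(x, y)
        guard, code = self.translated
        if guard is not None and not guard(x, y):
            # Assumption violated: roll back, remember why, and retranslate.
            self.rollbacks += 1
            self.saw_zero = True
            self._translate()
            return interpret(x, y)
        return code(x, y)


block = Block()
results = [block.execute(x, y) for x, y in [(10, 2), (9, 3), (8, 4), (7, 1), (5, 0), (6, 3)]]
print(results, block.translations, block.rollbacks)
```

The payoff question from the comment shows up directly: every call to `_translate` is overhead that only pays for itself if the translated path then runs often enough.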


An interesting slashdot comment[0] about running directly on the Crusoe:

> The article makes it pretty clear why Linux can't run directly on the Crusoe: Linux expects the hardware to have a virtual memory manager, which the Crusoe doesn't have. Consequently, any port of Linux will need to be running on an emulated memory manager.

As a side note, the Crusoe is also missing native support for certain other helpful features: Memory protection -- without that, a segfault can take out the entire OS. Running code from user memory -- without this, any application code will need to be piped through the OS to the CPU.

[0] https://developers.slashdot.org/comments.pl?sid=96231&cid=82...


Interesting. Linus Torvalds was once a Transmeta employee.


That might be the one fact that most people know about Transmeta. Among other things, he wrote the original x86 interpreter that was the first to touch cold code. Before he joined, x86 code was immediately translated (which was not ideal for many reasons).


I still have one running. From the software side, it's effectively an i586:

  vendor_id       : GenuineTMx86
  model name      : Transmeta(tm) Crusoe(tm) Processor TM5800
  cpu MHz         : 800.023
  flags           : fpu vme de pse tsc msr cx8 sep cmov mmx longrun lrti constant_tsc cpuid
  bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
But you can't do much with such a machine nowadays.


I had a Fujitsu Loox laptop with a Crusoe processor when they came out. A friend picked it up for me in Japan. It was awesome because 1) it was physically very small but had a really nice display, 2) it had way above average battery life and 3) it had a built in DVD player in the ancient times when that was a very desirable feature. The downside was...it was really slow compared to 'equivalent' Intel offerings. It ended up being my travel laptop because it could fit comfortably on an airplane tray. It was fine for writing/email and great at watching movies, but it was never going to be my daily driver.


Wait, so it also suffers from the same CPU bugs like Meltdown and Spectre? That's an incredible level of verisimilitude in the CPU emulation.


You do realize that TM5800 just makes up the CPUID in software and the Spectre etc flags are a kernel interpretation of this?

TM5800 might be subject to Spectre, but I posit that it would be very tricky to exploit as the x86 code you write isn't what's being run and in fact the code execution will look vastly different.


I was in Japan in 2005 and saw these tiny laptops running Transmeta's processor in the wild. The university I was studying in had them in their library. They ran windows and everything worked well. I've never seen laptops that small and tiny in the US, even during the netbook phase.


i'm using a 400-gram gpd micropc; they ship with windows but support ubuntu, which is what i'm running. big problem is that the keyboard is too small for touch-typing, but they have three usb-a ports, a usb-c port, hdmi, ethernet, headphone jack, and 9-pin rs-232


I had one that someone had bought for me in Japan. It was a Fujitsu Loox[1]. Not only was it tiny, it had a remarkably good screen and a DVD drive built in. Nice, if slow, machine.

[1] http://museum.ipsj.or.jp/en/computer/personal/0063.html


Here is a paper from 2000 describing the Crusoe arch and its code morphing scheme in very broad strokes:

https://web.stanford.edu/class/cs343/resources/crusoe.pdf


Interestingly there were, I think, 3 companies working on similar projects at the time (including TM) - the others didn't make it to market (mostly due to the dot-com bust and the sudden drop in available funding).

A lot of the IP that people had to create involved getting around Intel's patents, some of which were both basic and likely bogus (prior art etc) but Intel was widely seen as having far more lawyers than anyone could afford so taking them on directly was best avoided

Not really mentioned in this article was TM's biggest score at the time, which was getting Linus to come work for them


There's a great paper from 2004 about reverse engineering the CMS of the Crusoe:

https://www.realworldtech.com/crusoe-intro/

To my knowledge the foreshadowed followup was never released.



That's great to see. Thanks for sharing! It's been almost 20 years since I read the first part.


This was interesting in 1995. By 2005-2010, SoCs included sophisticated cache architecture, memory controller, internal and external bus controller, transceivers, and maybe iGPU. All of which play a role in performance per watt. ISA was no longer preeminent. Plus, if you emulate multicore x86-64 with superscalar vliw scheduling you're still constrained by the memory model, too. There is only so much you can do.


I had a TC1000 with the Crusoe. In tasks like web browsing it was quite slow but it was surprisingly performant in games. I remember being shocked that I could play Star Trek: Bridge Commander with all the settings cranked up.


Given the nature of the JIT and the limited size of the translation cache (2 MiB IIRC), applications with huge code bases and little repeat rate (like browsers and MS Word) will perform less well. The poster child for Crusoe was PowerDVD, but Quake had also gotten a lot of attention. Note, in those days there was a surprising amount of self-modifying code. That stuff is really nasty to deal with for a JIT. There was hardware support for detecting this on (IIRC) a 1 KiB granularity (because code was intermixed with data, making the latter look like self-modifying code).
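A rough Python sketch of why coarse-grained write detection hurts when code and data share a region: any store into a protected granule throws away every translation whose source bytes live there, even if the store only touched data. The 1 KiB granule size is taken from the (IIRC-hedged) figure in the comment above; the class and method names are hypothetical and this is not Transmeta's actual mechanism.

```python
GRANULE = 1024  # bytes; assumed from the "1 KiB granularity" recollection above


class TranslationCache:
    def __init__(self):
        self.by_entry = {}     # guest entry address -> translation (opaque here)
        self.by_granule = {}   # granule index -> set of guest entry addresses

    def insert(self, start, length, translation):
        """Register a translation covering guest bytes [start, start + length)."""
        self.by_entry[start] = translation
        first = start // GRANULE
        last = (start + length - 1) // GRANULE
        for g in range(first, last + 1):
            self.by_granule.setdefault(g, set()).add(start)

    def on_store(self, addr):
        """Drop every translation sourced from the written granule; return how many."""
        removed = 0
        for entry in self.by_granule.pop(addr // GRANULE, set()):
            if entry in self.by_entry:
                del self.by_entry[entry]
                removed += 1
        return removed


tc = TranslationCache()
tc.insert(0x1000, 32, "translation of code at 0x1000")
print(tc.on_store(0x1400))  # different granule: nothing is invalidated
print(tc.on_store(0x1200))  # data write in the same granule as the code
```

The second store never modified the code at 0x1000, yet the translation is gone, which is the "data looks like self-modifying code" effect.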


When NVidia adopted Transmeta's technology to target ARM instructions with their Project Denver they added a facility to just run native code for bits that weren't hot spots.


https://en.wikipedia.org/wiki/Project_Denver : "Project Denver was originally intended to support both ARM and x86 code using code morphing technology from Transmeta, but was changed to the ARMv8-A 64-bit instruction set because Nvidia could not obtain a license to Intel's patents"


Even though I'm not sure what you mean by "native", this is not exactly my understanding. AFAICT, Denver added a hardware Arm instruction decoder. This greatly helps with the first part of the JIT and especially accelerates the interpreter.


"Self-modifying code" might be more common today than it was back then, given the popularity of managed languages that also use some type of JIT. Forcing the output of that JIT through another translation step would be really bad for performance.

If Transmeta had survived, perhaps they could have added support for different instruction sets to run e.g. Java or .NET "natively".


I used a Sony PictureBook for a bit in 2001 running Linux (this exact one: https://www.linuxjournal.com/article/4548) and it was totally fine. Bigger apps like StarOffice were a bit sluggish at start but then were snappy in use. The battery life was really good for the time; I used it as a coffee shop laptop (although at that time there were only maybe five coffee shops with wireless in Seattle and I was responsible for three of them).

https://en.wikipedia.org/wiki/Sony_Vaio_C1_series


I remember seeing those in Popular Science and lusting after them.


I guess the wide architecture exploited the parallelism of games' graphics rendering code very well, just as GPUs do.


I'd submit the Mill, but it's not like it exists.


And like the proposed Mill design, the Crusoe had a skewed pipeline (the Mill folks call it phasing for some reason). That is, different parts of the instruction complete at different clocks, typically loads, then operations, then stores, allowing for single-instruction loops for, say, copying an array while doubling the values in it. Easy for compilers to handle, but I can't imagine writing assembly like that by hand.
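The array-doubling example above can be modeled as a toy simulation in Python. This is only a sketch of the idea as described in the comment, not the real Crusoe or Mill semantics: one wide instruction is repeated every cycle, and its load, multiply, and store slots each work on a different loop iteration.

```python
def copy_doubled_skewed(src):
    """Simulate a one-instruction loop whose load/op/store slots are skewed by a cycle each."""
    n = len(src)
    dst = [None] * n
    loaded = None     # value produced by the load slot in the previous cycle
    computed = None   # value produced by the op slot in the previous cycle
    cycles = 0
    for cycle in range(n + 2):          # two extra cycles drain the pipeline
        # Store slot: finishes iteration (cycle - 2).
        if cycle >= 2:
            dst[cycle - 2] = computed
        # Op slot: doubles the value loaded for iteration (cycle - 1).
        new_computed = loaded * 2 if 1 <= cycle <= n else None
        # Load slot: starts iteration (cycle).
        new_loaded = src[cycle] if cycle < n else None
        loaded, computed = new_loaded, new_computed
        cycles += 1
    return dst, cycles


print(copy_doubled_skewed([1, 2, 3]))
```

In steady state the same "instruction" does a load, a multiply, and a store every cycle, but for three different elements, which is why it is easy for a compiler to generate and awkward to reason about by hand.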


I will buy one if they ever make it.


Are they still hacking on it? Maybe get it running with Hurd when it's done.


The last Mill CPU talk was put on Youtube 5 years ago, but they mentioned they were going silent because the talks were giving too much secret sauce away and racing patent filings by accident.

I never heard from them again.



Now that is super interesting. I'm glad they're still around trying to do what may be a computing revolution.


Then it'll take 25 years for them to put out the required motherboard.


this article is terribly clueless. high-level languages such as basic had made pure assembly language a novelty? the sparc is a 'chipset'? transmeta invented vliw 11 years after multiflow (which isn't even mentioned)?

i don't have any complaint about people writing about things they don't know very much about; it's a great way to learn. i just don't think they should put on this fake authoritative tone, because they can mislead other people. i mean that's how we got the cda and dmca


Ah, the promise of VLIW. I played with one VLIW architecture when it came out (TI 320 C6000) and it looked exciting but after a while we got tired of waiting for the SW magic to materialize, I think similar problems have generally been the downfall of VLIW systems.

We did look at handwriting assembly code, since we were already intimately familiar with the classic TI 320C30s and C40s (fantastic DSPs!) but it was just so crazy. Even TI told us not to bother and wait for their super cool compilers.


It sounds very much like what current CPUs do, with x86 being mapped into execution units dynamically via microcode, except (I assume? The article was very low on detail) in the case of Crusoe it just did it for big VLIW instruction blobs.


No there are major differences. A non-exhaustive list:

* Crusoe/Efficeon/... used a software JIT (called CMS)

* Ran on a very custom architecture (P95/P2000) designed for JITted code with speculation (software controllable checkpoints, "assert" instructions, speculative cache, small lookup tables for translating x86 addresses to P95 etc).

* It had no dynamic scheduling, reorder buffer, renaming, etc.

The thesis was to eliminate the power/area hungry parts of microprocessors by having the "compiler" (= JIT) do it up front.

Some things worked well (very power efficient, small die). Others less so: huge cold code penalty, limited ability to deal with large cache misses (like IA-64).

Transmeta had a huge impact on the industry; Intel has officially credited Transmeta and Transmeta's LongRun with getting Intel focused on power. Both NVIDIA and "another company" have explored the CMS idea.

There is so much to the Transmeta story and the failure was more of a business issue than a technical one. This was a completely new approach and Transmeta needed more iterations to refine the idea.

ADD: A VLIW "instruction" (called a "molecule" in TM terminology) is compiled, fetched, issued, and (mostly) executed as a (parallel) unit. That is very, very different from what modern superscalar microprocessors do. They issue instructions from many issue queues. The instructions that execute in the same cycle usually do so mostly as a consequence of data-flow and the current state of the pipeline. A large x86 instruction translated to microcode typically leads to many µops that do not execute in parallel, but are intermixed with other µops.
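A small Python sketch of the static side of that contrast: the "compiler" decides ahead of time which operations share a molecule, and the hardware simply issues each molecule as a unit. The greedy packer below is a generic list-scheduling toy with only true data dependencies, not CMS's actual scheduler; the function name, the op names, and the width of 2 are illustrative assumptions.

```python
def pack_molecules(ops, width):
    """Greedily pack ops into fixed-width molecules.

    ops: list of (name, set_of_dependency_names) in program order; every
    dependency must appear earlier in the list.
    Returns a list of molecules, each a list of op names issued together.
    """
    molecules = []
    placed = {}   # op name -> index of the molecule it was placed in
    for name, deps in ops:
        # An op can issue no earlier than the molecule after its last producer.
        slot = max((placed[d] + 1 for d in deps), default=0)
        # Skip molecules that are already full.
        while slot < len(molecules) and len(molecules[slot]) >= width:
            slot += 1
        while len(molecules) <= slot:
            molecules.append([])
        molecules[slot].append(name)
        placed[name] = slot
    return molecules


program = [
    ("a", set()),
    ("b", set()),
    ("c", {"a", "b"}),
    ("d", set()),
    ("e", {"c"}),
]
print(pack_molecules(program, width=2))
```

An out-of-order core makes the equivalent grouping decision every cycle in hardware, based on which operands happen to be ready, instead of baking it into the instruction stream.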


Yup, since the NexGen Nx586->AMD K6 and Intel P6 (Pentium Pro) generations, which were contemporaries of Transmeta's founding in ~1995, modern x86 parts are pretty much a dynamic JIT in silicon in front of whatever fancy internal architecture the vendor designed that generation.

Honestly, if you look at recent parts, the internal designs are _super_ wide multi-issue like a VLIW, it's just a difference of sophistication vs. coupling in your JITy thing.

Let's use Intel Sunny Cove as an example because it's recent and there is a good diagram on Wikipedia: https://en.wikipedia.org/wiki/Sunny_Cove_(microarchitecture)

Because they're wide multiple issue (SC: 8 execution units wide), the internal execution is fairly VLIW-like. Unlike an exposed VLIW, it's plausibly possible to actually fill all those pipes because they're doing dynamic out-of-order issue on a window of (sort of hard to count, SC: something like 50) instructions coupled to all the register renaming and memory access scheduling to keep instructions out of each other's way.

Transmeta's proposition was not really super wild, Multiflow (and very briefly Apollo) were building VLIWs in the 80s, so that wasn't crazy. Their core trick was splitting the difference between the dynamic decomposition driven out of order/multi issue/superscalar designs that used relatively dumb, shallow heuristics but were very dynamic and close to the hardware (like above), and the compiler-driven RISC/VLIW designs that tried to do fancier scheduling and optimization on larger units but were much more static and more removed from the execution process.

The fast dumb heuristics close to the hardware won in a _big_ way over all competitors.


For some reason, the thing I most remember about Transmeta is their appearances in After Y2K, where they are some sort of sinister cabal with an underground base:

https://joyoftech.com/geekycomics/Aftery2k/y2Karchives/125.h...

https://joyoftech.com/geekycomics/Aftery2k/y2Karchives/158b....



https://en.wikipedia.org/wiki/VIA_C3 is almost as interesting as the Crusoe, but not quite like the iAPX 432.


In what way? I ran one of those for a while.

It was just a low end, low power CPU. It ran standard X86 code, and I had Linux running on it. I'd say it was still not ideal, as the system I had had a CPU fan and the inside of the case got pretty hot. I recall the CPU supported crypto and RNG acceleration, which I think was a nice and fairly rare feature at the time.

Other than that, VIA's hardware sucked. The CPU is unimpressive and the board I got had an embedded NIC that corrupted network packets. That one was fun to figure out.


The "RISC-ish instruction set underlying the x86 facade"-part: https://en.wikipedia.org/wiki/Alternate_Instruction_Set

Edit: also the "cool chip in a can" packaging from one of the contemporary computer conventions. And possibly the mini-ITX era was also kickstarted with a C3-powered motherboard, I think...?


Interesting!

Unfortunately I developed a searing hatred of VIA hardware back then. Besides the slow CPU and buggy board, I also managed to run into the SB Live + VIA chipset = disk corruption issue.


Also, the successor(?) https://en.wikipedia.org/wiki/Intel_i860, which is where the Windows NT development began.


No, the iAPX 432's successor was the Intel i960, which confusingly is completely unrelated to the i860.

By the way, it's interesting to read the iAPX 432 architecture paper along with the Case for RISC paper, since the 432 paper argues the exact opposite. Essentially, due to increasing software costs, you should put as much as possible into hardware. The semantic gap between hardware and programming languages should be minimized by implementing high-level things such as objects and garbage collection in hardware, which they call the Silicon Operating System. The instruction set should be as complete as possible, with lots of data types.

I should also mention that the 432's instructions were not byte-aligned: they were anywhere from 6 to 321 bits long. Just decoding the instructions took a complex chip, with a second chip to run them. This is the opposite of the RISC idea to make instruction decoding as simple as possible.

The paper says, "The iAPX 432 represents one of the most significant advances in computer architecture since the 1950s."

http://www.bitsavers.org/components/intel/iAPX_432/171821-00...


The capabilities based software architecture was pretty interesting too, particularly in the light of security problems, ubiquitous distributed systems, and the adoption of OO. See also

https://en.wikipedia.org/wiki/IBM_AS/400

Of course Java provided a similar architecture in many ways on a mainstream architecture the same way that Common Lisp made Lisp machines obsolete. Java never really got a universal architecture for serialization and persistence the way the AS/400 did.

That said, I find AVR-8 pretty interesting in that it is the last 8-bit architecture and is, from the viewpoint of the assembly programmer, technically superior to all those machines I had or craved in the 1980s.


And of course David Patterson did a take down of the performance of the 432 (which curiously doesn’t seem to mention RISC).

https://dl.acm.org/doi/pdf/10.1145/641542.641545


> which curiously doesn’t seem to mention RISC

RISC-I probably hadn't been written about publicly when he submitted that paper.


Seems unlikely.

'RISC I: A Reduced Instruction Set VLSI Computer' appeared in May 81

https://dl.acm.org/doi/10.5555/800052.801895

(and of course the original RISC paper was even earlier in 1980)

The iAPX 432 paper in June 82:

https://dl.acm.org/doi/10.1145/641542.641545

I should add that I only skimmed the 432 paper for references so I might be wrong on that!


That's quite an argument. "Software is getting too hard to write, so we will do the work in hardware instead, since hardware is so much easier to develop."

It's pretty ballsy of Intel's marketing department to so blatantly lie to your face like that. Maybe they were big believers of the "Big Lie" rhetorical device? If you say something so outrageous people are less likely to question it because you must be a genius with loads of inside knowledge to assert something so completely crazy on the face of it.


In the late 70s to early 80s that wasn't an obviously wrong idea, especially because in the posts above, as I understand it, when they say hardware it mostly means microcode, which is more software than hardware. At that moment many architectures like the iAPX432, Lisp machines, Xerox machines, the VAX, etc. were microcoded.


It may have been interesting, but I had to work on a Japanese PC-like machine that used one of those ... god was this computer slow.


I was the first person to play Diablo 2 on a Crusoe.


How far along are you? :)


Still waiting for it to load, probably.


There was a scandal where thermal throttling was a big issue. I don't remember if this was a problem from Transmeta or a particular laptop manufacturer, but the CPU could only run at full speed for short bursts.


Five years later (if that!) this was expected behaviour on any x86 laptop!

My original Macbook Air ran its Core2Duo at the rated 1.6 GHz for about 5 seconds before dropping to 1.2 GHz. Thirty seconds later it dropped to 1.0 GHz, and a couple of minutes later to 800 MHz where it could run all day.

In 2019 I had at the same time two machines with i7-8650U CPU: a NUC and a Thinkpad X1 Carbon. The NUC could build riscv-gnu-toolchain (which I did many times a day, as I was helping develop new RISC-V instructions) in 20 minutes with mild throttling (to 3.2 GHz) while the Thinkpad took over 30 minutes with extreme throttling.


I'd think that VLIW would have problems with the instruction cache, long instructions = more cache needed.


This could be an issue with extreme RISC too. You have to execute so many instructions compared to more "balanced" ISAs that it also fills up the cache.


RISC-V has often been described as being "too extreme" in its minimalism but, despite the extra instructions needed (not as many as critics claim via worst case micro-examples), riscv64 code is still considerably smaller than amd64 or arm64 code.

Both amd64's "extreme instruction size variability" and arm64's "extreme only one instruction size" philosophies give worse results than riscv64's "two instruction lengths make small code that is still easy to parse in a wide machine". You'd have thought Arm of all people would have known that, but oh well...

Source: looking at the output of "size" on binaries in the last couple of years of Fedora / Debian / Ubuntu etc. versions that support all three.


And what most do not realize is that RISC-V not only has the best size, but is also competitive in number of instructions, especially so within loops, where it matters the most.



