Ascenium wants to reinvent the CPU and kill instruction sets altogether (tomshardware.com)
125 points by rbanffy on July 15, 2021 | 65 comments


A more technical description:

Hardware: DSP slices (multiply-accumulate units) and look-up tables (LUTs) in fixed-function blocks, connected to registers hooked up to a crossbar switch.

Software: an LLVM target for what amounts to an FPGA with a ton of DSP slices

Company: Has been around for a long time, since 2005(?), under a slightly different company name; the huge marketing push is due to a recent raising of funds.

Learn more: https://llvm.org/ProjectsWithLLVM/Ascenium.pdf


Very informative. Basically they have reinvented an FPGA with fixed blocks...

> The idea of a compiler-based software solution embedded in an architecture would theoretically allow the Aptos processor to interpret workload instructions and distribute them across processing resources in such a way that the amount of work being parallelised is as close as possible to the theoretical maximum, whilst taking advantage of having much lesser architectural inefficiencies than instruction-based processors.

Whether or not there is an instruction set, you will have to have some kind of convention for the bitstream that configures the re-programmable logic, and very likely you will end up re-implementing something very much like an instruction to do that with a high level of efficiency.

Modern CPUs already all run microcode on some kind of programmable fabric in the front end, to give some degree of programmability over how the front end and back end interact.


The actual hardware details are kind of danced around a lot in both this article and the somewhat more detailed article it links to (https://www.nextplatform.com/2021/07/12/gutting-decades-of-a...).

From what I can tell, it appears to be a mix between a systolic array and Itanium. The systolic array piece is that it's a grid of (near?) identical ALUs that communicate with adjacent ALUs in lieu of registers. But there also seems to be an instruction-stream element: something like the function each ALU performs changing every clock cycle or every few cycles? It definitely sounds like there's some non-spatial component that requires faster reconfiguration times than FPGA reconfiguration logic allows.

As for viability, as another commenter points out, GPUs are currently the competitor to beat. And beating them requires either having stellar results in the don't-rewrite-code space or targeting use cases that GPUs don't do well at. The latter cases involve things like divergent branches or irregular/indirect memory accesses, and these cases tend to be handled very poorly by compilers and very well by the standard superscalar architecture anyways.


Not even that.

Xilinx's processors have a VLIW instruction set feeding a SIMD-DSP surrounded by reconfigurable LUTs.

Ascenium will have to try to displace Xilinx's processors, which are clearly aiming at this "systolic processor" like architecture.

https://www.xilinx.com/support/documentation/white_papers/wp...


To be fair, DSPs generally tend to be VLIW even without SIMD, as seen in this video on the Sega Saturn: https://www.youtube.com/watch?v=wU2UoxLtIm8


We (I work for Ascenium) do operate with a notion of an op, in the sense that we specify ALU ops happening at a given location. I'm not sure the idea we're trying to convey when we say "killing the ISA" is easily understood, but it ties in with the fact that, at compile time, we instruct the core in a far more detailed fashion than your typical CPU.

In short, our machine code (control words) is far more imperative than typical RISC instructions. RISC specifies what should happen; ours specifies the what _and_ the how.
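
To make the "what and the how" distinction concrete, here's a purely hypothetical sketch (illustrative only, not our actual control word format): a RISC add names registers and leaves scheduling and routing to the hardware, whereas a control word would also pin down when the op fires, on which ALU, and where the operands and result travel.

    struct RiscAdd {            // "what": rd = rs1 + rs2; hardware decides the rest
        int rd, rs1, rs2;
    };

    struct ControlWord {        // hypothetical "what" plus "how"
        int cycle;              // when the op fires
        int alu_id;             // which ALU in the fabric performs it
        int op;                 // the operation itself
        int in_from_west;       // which neighbouring output feeds the left operand
        int in_from_north;      // which neighbouring output feeds the right operand
        int route_result_east;  // where the result is forwarded next cycle
    };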

Also, I don't think an FPGA is a very good description of what we do, but on the other hand I did use the same analogy to describe what I do when I started working here, so it's not terribly wrong either


Are you hiring (interns)?


We're hiring, but not interns


Gotcha. Thanks for replying.


So I won't pretend to be anywhere close to having the domain knowledge to really understand this, but this feels like a sort of logical extreme of the rationale behind VLIW processors like Itanium, e.g. "remove hard logic from the microprocessor and put it at the compiler level"

It was my understanding that this approach failed, partially because it's really hard to compute/guess certain things about code ahead-of-time, and modern CPUs with their branch predictors, prefetchers and speculative execution do more at runtime than a compiler could effectively do. Has this changed enough for this to be generally useful, or are they hoping to market this for niche use-cases?


I think it's both hard to guess things about the code but also the data that's coming in. The "Sufficiently Advanced Compiler" could make a lot of good decisions if it knew ahead of time the shape of the data, e.g. a big uninterrupted batch of fixed transactions.

That's why the target for this sort of technology is GPGPU work. The streams of data are very regular and largely a series of batch jobs.

For interactive systems with millions of context switches and branches, all the "Sufficiently Advanced Compilers" fall down. There's just not enough commonality between consecutive operations for ahead-of-time optimizations to kick in. Hardware that does great on batch jobs ends up suboptimal for the insanity that is interactive code.


This reminded me especially of https://en.m.wikipedia.org/wiki/Transmeta


Transmeta is probably the only approach that could really make VLIW work in a general purpose system, I think. Dynamic processors get a good chunk of their speed not just from scheduling (maybe not as well as a compiler, in the good times) but from being able to re-schedule in adverse conditions - if you have the power of software to do that speculation, then I think it could work.

Hard to know with Transmeta because they had to implement X86 which is a legal minefield - FWIW I've read a lot of Transmeta engineers saying that they were completely sold on the idea but just couldn't make it stick in time - I'm too young to have been around when it was being produced so I don't know.


And then that design philosophy ended up at NVIDIA. (in the Tegra processor family, with the Denver, Denver 2 and Carmel CPU cores)

I wonder what will happen next... they haven't released a new CPU core since 2018. (and they are on the stock-Arm-cores generation of the cycle, so Tegra Orin gets Cortex-A78AE)


I think that Transmeta (and the others working on the same stuff at the same time) had fixed instruction sets, along with hardware and software to recompile x86 code into that instruction set.

The difference here, I think, is not that the chips don't have an instruction set - they do - but that they don't have an ARCHITECTURAL instruction set: the next version of the chip will have a different instruction set and a matching LLVM back end. They expect you to recompile your code for every new CPU.

What I don't see in the literature is any mention of MMUs and system level stuff - I'm sure it's there


Didn't we already have this with Java? Write once, run anywhere and the hardware would magically adapt the incoming code to run on the new hardware? Except now, it's LLVM byte-code instead of Java (and x86 asm)?

I'm not trying to be cynical here, but I can see how this would sound like that. I guess I'm just confused about what this actually is and how it is new/different than all of the things that have been tried before.


I think the difference (my impression from little data :-) is that this is (mostly) not JIT but instead more in depth static compilation of basic blocks into code.

The big change here is the abandonment of the concept of an architectural ISA - it depends on software people (all of them/us) giving up assembler. I think it's probably the right way to approach high-ILP, VLIW-like CPUs - it means you don't get hung up on your old CPU designs as you move forward.


Yeah, when I saw it was a Software-Defined Instruction set, I immediately thought of Transmeta, as well.


Extracting enough fine grain parallelism is hard. Manually creating it is tedious and difficult. Memory is slow. Latency hiding is hard. Ahead-of-time speculation is hard. Runtime speculation is hard. Good luck.


Your comment is very high quality, despite being short.

> Extracting enough fine grain parallelism is hard. Manually creating it is tedious and difficult.

A big win for NVidia was tricking enough programmers into manually describing large swaths of parallelism available for a GPU to take advantage of.

Manually creating fine grained parallelism through classical structures (mutex, semaphores, async, etc. etc.) is tedious and probably counterproductive. But it seems like describing __some__ extremely common forms of parallelism is in fact very useful in certain applications (matrix multiplications).

OpenMP, CUDA, OpenCL (and even Fortran/Matlab to some extent)... even Tensorflow... all show easy ways to describe a parallel problem.
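
To make "describing a parallel problem" concrete, here's a trivial sketch (mine, using OpenMP): the pragma is the programmer asserting that the iterations are independent, so the compiler/runtime doesn't have to rediscover that fact.

    #include <vector>

    // The pragma *describes* the available parallelism up front instead of
    // hoping the hardware rediscovers it instruction by instruction at runtime.
    void scale(std::vector<float>& v, float k) {
        #pragma omp parallel for
        for (long i = 0; i < (long)v.size(); ++i)
            v[i] *= k;
    }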

CPUs seem to be the best architecture at discovering latent parallelism in general-purpose code that the original programmer wasn't aware of. (Typical programmers are not aware of the multiple pipelines that a modern CPU tries to fill up... and yet that code continues to accelerate from generation to generation... with Apple's M1 now proving that even 8-way decode is reasonable and provides performance gains.) CPU designers seem to be experts at finding fine-grain parallelism as it is, and I don't think anyone can beat them at it. (Only another CPU designer will probably make something better.)

---------

But for these alternative compute paradigms, such as GPU compute, there are far fewer programmers and compiler experts working on the architecture. It's far easier to find little gold nuggets here and there that lead to improved execution.

SIMD languages such as C-Star, Lisp-Star, CUDA, OpenCL, OpenMP... have been describing parallel patterns for decades now. And yet it feels like we've only begun to scratch the surface.


Everything you are saying makes good sense, but I wanted to ask: is taking a computer program and then re-configuring it for some exotic and complex CPU a great problem for AI?

I bet people on this forum know why that is / isn't true and I am curious.


If you consider "compiler theory" an AI problem, then sure.

But most of us just call that compiler theory. If you're going down into FPGA routing, then maybe we also call that synthesis (slightly different than compilers, but very very similar in problem scope).

Today's compilers have an idea of how modern CPUs execute instructions, and optimize accordingly. (Intel's ICC compiler is pretty famous for doing this specifically for Intel chips). Any FPGA synthesis will similarly be written to optimize for resources on the FPGA.

------------

EDIT: Just so that you know... compiler theory and optimization is largely the study of graphs turning into other, provably equivalent but more efficient, graphs.

All computer programs can be described as a graph traversal. Reconfiguring the graph to be shorter / smaller / more parallel (more instruction-level parallelism for the CPU to discover), etc. etc. leads to faster execution.
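
As a tiny self-contained example of that kind of graph-to-graph rewrite (my sketch, nothing to do with Ascenium's toolchain), here is constant folding over an expression tree: any operator node whose children are both constants collapses into a constant, leaving a provably equivalent but smaller graph.

    #include <memory>

    struct Expr {                           // a node in the expression graph
        char op;                            // '+', '*', or 'k' for a constant leaf
        int value;                          // meaningful only when op == 'k'
        std::shared_ptr<Expr> lhs, rhs;
    };

    // Rewrite the graph into an equivalent but smaller one.
    std::shared_ptr<Expr> fold(std::shared_ptr<Expr> e) {
        if (!e || e->op == 'k') return e;
        e->lhs = fold(e->lhs);
        e->rhs = fold(e->rhs);
        if (e->lhs->op == 'k' && e->rhs->op == 'k') {
            int v = (e->op == '+') ? e->lhs->value + e->rhs->value
                                   : e->lhs->value * e->rhs->value;
            return std::make_shared<Expr>(Expr{'k', v, nullptr, nullptr});
        }
        return e;
    }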

You don't really need deep-learning or an AI to do a lot of these graph operations. There are a few NP-complete problems (knapsack problem: minimizing memory moves by optimizing register allocation) along the way that maybe an AI could try to solve. But I have my doubts that throwing a deep-learning AI at NP-complete problems is the best strategy (especially given the strength of modern 3SAT solvers)


> You don't really need deep-learning or an AI to do a lot of these graph operations.

That is right, but we're probably not hunting for absolutely optimal solutions every time.

It usually boils down to heuristics - every time there's a graph reorganization problem where a sub-optimal but better solution is viable, the solutions always involve short-cuts derived from domain knowledge.

This is very similar to how AI training operates: there's a guess from an engineer, experiments to validate and modify the heuristic guess to improve the results, and finally giving up when the results look acceptable.

AIs would do well at parameter tuning, at the very least.

My register allocation work would have near infinite loops reordering priorities, until I started forcing invariants into the system which didn't exist in the original design.

And the usual argument that "code is compiled once, executed millions of times" doesn't quite apply to a JIT.


> If you're going down into FPGA routing, then maybe we also call that synthesis (slightly different than compilers, but very very similar in problem scope).

Synthesis is the process of taking constructs in a high-level language (SystemVerilog, VHDL, etc) and translating the code into the primitives provided by the target architecture (LUT, blockram, DSP, distributed RAM, muxes, etc). I agree that this looks like an ordinary compiler pipeline.

Place and route is a totally different beast. There is potential for an AlphaGo-like structure to help out here tremendously, where an AI has been trained to produce a cost estimate that aids a decision process and steers the search space.


> Place and route is a totally different beast. There is potential for an AlphaGo-like structure to help out here tremendously, where an AI has been trained to produce a cost estimate that aids a decision process and steers the search space.

As far as I can tell, that's just another typical "Damn it, yet ANOTHER NP-complete problem..." that pops up every few minutes when looking through compiler code.

EDIT: Place and route is finite in scope (there's only a finite number of blocks on any FPGA or ASIC design). I haven't thought too hard about the problem, but it seems similar to knapsack problems. It also seems related to non-NP-complete problems like planar graphs (https://en.wikipedia.org/wiki/Planar_graph).

The current state-of-the-art 3SAT solvers are a different branch of AI than deep learning. I'm sure deep learning can have some application somewhere, but... the state of the art is pretty damn good.

There's probably a lot of provably optimal subproblems (ex: Planar) that are simpler than NP-complete. But then the final solution is going to be some complicated beast of a problem that comes down to guess and check (of which, 3SAT solvers are the current state of the art).


Your assessment is pretty much correct; it boils down to a constrained optimization problem (a la linear programming or SAT). And this is a family of problem-solving techniques for which we already have decently powerful algorithms.

Historically, elements of constrained optimization have been considered AI--most intro to AI courses will probably cover some variant of backtracking solver or various hill-climbing techniques--but it doesn't really mesh with what modern AI tends to focus on. Deep learning isn't likely to provide any wowza-level results in such a space.
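
For a flavour of what hill climbing looks like in this domain, here's a toy placement sketch (illustrative only; real P&R tools use far more sophisticated annealing and analytic methods): propose swapping two cells and keep the swap only if total wirelength improves.

    #include <cstdlib>
    #include <cmath>
    #include <utility>
    #include <vector>

    struct Net { int a, b; };                    // a two-pin net between cells a and b

    double wirelength(const std::vector<int>& pos, const std::vector<Net>& nets) {
        double total = 0;
        for (const Net& n : nets) total += std::abs(pos[n.a] - pos[n.b]);
        return total;
    }

    // pos[c] is the slot cell c currently occupies; greedily improve the placement.
    void hill_climb(std::vector<int>& pos, const std::vector<Net>& nets, int iters) {
        for (int k = 0; k < iters; ++k) {
            int i = std::rand() % pos.size(), j = std::rand() % pos.size();
            double before = wirelength(pos, nets);
            std::swap(pos[i], pos[j]);           // propose a move
            if (wirelength(pos, nets) > before)  // reject if it made things worse
                std::swap(pos[i], pos[j]);
        }
    }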


I post this in the spirit of being corrected if I am wrong, the Internet's core skill.

If I'm reading this correctly, the competition for this technology isn't CPUs, it's GPUs. If that's the case, it seems like they'd be able to show a lot of artificially inflated numbers simply by running already-known-GPU-heavy workloads on their hardware and showing massive performance gains over CPUs... but what about against an optimal GPU algorithm?

Traditionally, hardware that supports reconfigurability at a highly granular level (FPGAs, for instance) has never been able to catch up with conventional silicon, because you can't recover the quite significant costs of all that reconfigurability through improved architecture. The conventional silicon will still beat you with much faster speeds, even if it takes a penalty vs. some theoretical maximum it could have achieved with some other architecture.

Plus, again, GPUs really cut into the whole space of "things CPU can't do well". They haven't claimed the whole space, but they certainly put their marker down on the second-most profitable segment of that space. Between a strong CPU and a strong GPU you've got quite a lot of computing power across a very wide set of possible computation tasks.

I am not an expert in this space, but it seems like the only algorithms I've seen lately people proposing custom hardware for at scale are neural network evaluation. By "at scale", I mean, there's always "inner loop" sorts of places where someone has some specific task they need an ASIC for, encryption being one of the classic examples of something that has swung back and forth between being done in CPUs and being done in ASICs for decades now. But those problems aren't really a candidate for this sort of tech because they are so in need of hyperoptimization that they do in fact go straight to ASICs. It doesn't seem to me like there's a huge middle ground here anymore between stuff worth dedicating the effort to make ASICs for (which almost by definition will outperform anything else), and the stuff covered by some combination of CPU & GPU.

So... how am I misreading this, oh great and mightily contentious Internet?


You are correct that the main competing technology is throughput compute (GPUs, TPUs, ML processors) rather than "general-purpose" CPU-style compute with a batch size of 1. The way we use CPUs is fairly unique, though: we time-share a lot of different code on each core of a system. Trying to move CPU applications to an FPGA-style device would be a disaster.

I used to believe that FPGAs could compete with GPUs in throughput compute cases thanks to high-level synthesis and OpenCL, but it looks like most applications have gone with something more ASIC-like than general-purpose (eg TPUs and matrix units in GPUs for ML).


CPUs __ARE__ ASICs.

You can get a commodity CPU today at 4.7+ GHz clock and 64 MB of L3 cache (AMD Ryzen 9 5950X). There's no FPGA / configurable logic in the world that comes even close. AMD's even shown off a test chip with 2x96 MB of L3 cache.

GPUs __ARE__ ASICs too, just with a different configuration than CPUs. An AMD MI100 comes with 120 compute units; each CU has 4x4x16 SIMD lanes, and each lane has 256 x 32-bit __registers__. (AMD runs each lane once per 4 clock ticks.)

That's 32MB of __registers__, accessible once every 4th clock tick. Sure, a clock of 1.5GHz is much slower, but you ain't getting that number of registers or that much SRAM on any FPGA.

> I am not an expert in this space, but it seems like the only algorithms I've seen lately people proposing custom hardware for at scale are neural network evaluation

The FPGA / ASIC stuff is almost always "pipelined systolic array". The systolic array is even more parallel than a GPU, and requires data to move "exactly" as planned. The idea is that instead of moving data to / from central registers, you move data across the systolic array to each area that needs it.

----------

CPUs: Today's CPUs __discover__ parallelism in standard assembly code (aka: Instruction level parallelism). It spends a lot of energy "decoding", which is really "scheduling" which instruction should run in which of some 8 to 16 pipelines (depending on AMD Zen vs Skylake vs Apple M1). Each pipeline has different characteristics (this one can add / multiply, but another one can add / multiply / divide). Sometimes some instructions take up the whole pipeline (divide), other times, multiply takes 5 clock ticks but can "take an instruction" every clock tick.

CPUs spend a huge amount of effort discovering the optimal parallel execution for any chunk of assembly code, and today that means scanning ~400 to 700 instructions to do so (the rough size of a typical CPU's reorder buffer). That's why branch prediction is so important.

In effect: CPUs are today's ultimate MIMD (multiple instructions / multiple data) computers... with a big decoder in front ensuring the various pipelines remain full.
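
A small example of the latent parallelism that machinery digs out: the four partial sums below have no data dependences on each other, so a wide out-of-order core can keep several adders busy at once, whereas a single accumulator would serialize on its own add latency.

    // Summing with four independent chains (tail elements omitted for brevity).
    float sum4(const float* a, int n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i + 3 < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }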

-----------

GPUs: SIMD execution. If parallelism is known ahead of time, this is a better architecture. Instead of spending so much space and energy on discovering parallelism, you have the programmer explicitly write a parallel program from the start.

GPUs are today's ultimate SIMD (single instruction / multiple data) computers.

-------------

Systolic Array: Will never be a general processor; it must be designed for each specific task at hand. It can only work if all data movements are set in stone ahead of time (such as matrix multiplication). Under these conditions, it is even more parallel and efficient than a GPU.
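
For anyone who hasn't seen one, here's a toy software simulation (my sketch) of an output-stationary systolic array computing C = A x B: every cell only ever sees operands handed over by its left and upper neighbours, and the entire data movement schedule is fixed before the first cycle.

    #include <vector>

    using Mat = std::vector<std::vector<int>>;

    Mat systolic_matmul(const Mat& A, const Mat& B) {        // both n x n
        int n = (int)A.size();
        Mat C(n, std::vector<int>(n, 0));
        Mat a_in(n, std::vector<int>(n, 0));                 // operand currently held
        Mat b_in(n, std::vector<int>(n, 0));                 //   by each cell
        for (int t = 0; t < 3 * n - 2; ++t) {                // fixed-length schedule
            Mat a_next = a_in, b_next = b_in;
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j) {
                    // boundary cells take skewed inputs; interior cells take
                    // whatever their left / upper neighbour held last cycle
                    int a = (j == 0)
                        ? ((t - i >= 0 && t - i < n) ? A[i][t - i] : 0)
                        : a_in[i][j - 1];
                    int b = (i == 0)
                        ? ((t - j >= 0 && t - j < n) ? B[t - j][j] : 0)
                        : b_in[i - 1][j];
                    C[i][j] += a * b;                        // local multiply-accumulate
                    a_next[i][j] = a;                        // pass right next cycle
                    b_next[i][j] = b;                        // pass down next cycle
                }
            a_in = a_next;
            b_in = b_next;
        }
        return C;
    }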


Remind me what the ASIC acronym means


Application specific integrated circuit.

The Intel / AMD CPUs are ASICs that execute x86 as quickly as possible. Turns out to be a very important application :-)

My point being: you aren't going to beat a CPU at a CPU's job. A CPU is an ASIC that executes assembly language as quickly as possible. Everything in that chip is designed to accelerate the processing of assembly language.

--------

You win vs CPUs by grossly changing the architecture. Traditionally, by using a systolic array. (Which is very difficult to make a general purpose processor for).


I think it's at least a fair question to ask what makes "execute x86" such an important application?

Sure, the x86 CPU is the best way to execute x86 instructions, but so what?

I do not actually care about x86. I don't write it or read it.

I write bash and c and kicad and openscad and markdown etc..., and really even those are just today's convenient means of expression. The same "what's so untouchable about x86" is true for c.

I actually care about manipulating data.

I don't mean databases, I mean everything, like the input from a sensor and the output to an actuator or display is all just manipulating data at the lowest level.

Maybe this new architecture idea can not perform my freecad modelling task faster or more efficiently than my i7, but I see nothing about the macroscopic job that dictates it's already mapped to hardware in the most elegant way possible by translating to the x86 ISA and executing on an x86 asic.


> I think it's at least a fair question to ask what makes "execute x86" such an important application?

x86, ARM, POWER9, and RISC-V are all the same class of assembly languages. There's really not much difference today in their architectures.

All of them are heavily pipelined, heavily branch predicted, superscalar out-of-order speculative processors with cache coherence / snooping to provide some kind of memory model that's standardizing upon Acquire/Release semantics. (Though x86 remains in Total-store ordering model instead).
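
For anyone who hasn't met the memory-model jargon, acquire/release is visible straight from portable code; a minimal sketch: the release store publishes the data written before it, and the acquire load guarantees that data is visible afterwards, whether the hardware underneath is TSO x86 or weakly ordered ARM/POWER.

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                    // ordinary write
        ready.store(true, std::memory_order_release);    // publish the write above
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {}  // wait for the publish
        // payload is guaranteed to read 42 here
    }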

It has been demonstrated that this architecture is the fastest for executing high-level code from Bash, C, Java, Python, etc. etc. Any language that compiles down into a set of registers / jumps / calls (including indirect calls) and supports threads of execution is inevitably going to look a hell of a lot like x86 / ARM / POWER9.

----------

If you're willing to change to OpenCL / CUDA, then you can execute on SIMD computers such as NVidia Ampere or AMD CDNA. It's a completely different execution model than x86 / ARM / POWER9 / RISC-V, with a different language to support the differences in performance (ex: x86 / POWER9 have very fast spinlocks; CUDA / OpenCL have very fast thread-barriers).

There's a CUDA-like compiler for x86 AVX + ARM NEON called "ispc" for people who want CUDA-like programming on a CPU. But it executes slower, because CPUs have much smaller SIMD arrays than a GPU. (But there's virtually no latency, because the x86 AVX / ARM NEON SIMD registers are in the same core as the rest of the register space. Like... 1 or 2 clocks of latency, but nothing like the 10,000+ clock ticks it takes to communicate with a remote GPU.)

----------

Look, if webservers and databases and Java JITs / Bash interpreters / Python interpreters could be executed by something different (ex: a systolic array), I'm sure someone would have tried by now.

But look at the companies who made Java: IBM and Sun. What kind of computers did they make? POWER9 / SPARC. That's the fruit of their research: the computer architecture that best suits Java programming, according to at least two different sets of researchers.

And what is POWER9? Its a heavily pipelined, heavily branch predicted, superscalar out-of-order speculative core with cache-coherent acquire/release semantics for multicore communication. Basically the same model as a x86 processor.

POWER9 even has the same AES-acceleration and 128-bit vector SIMD units similar to x86's AVX or SSE instructions.

You get a few differences (SMT4 on POWER9 and bigger L3 cache), but the overall gameplan is extremely similar to x86.


There are some strange assertions there, apart from the definition of ASIC. ISAs aren't assembly languages. I could believe there may be more ARM and, particularly, RISC-V chips without all those features than with. Since when has C, the language, specified what it compiles to, so as to exclude Lisp machines? I read about Oak (later Java) on my second or third generation of SPARC workstation, when I'd used RS/6000. Few people remember the hardware Sun did market to run Java exclusively. IBM might argue about the similarity of our POWER9 to x86, but I don't much care.


In comparing with CPUs I would also want to see thorough measurement of latency too.


Yes you are


When they say "kill instruction sets", do they mean "abandon trying to replicate the instruction sets of other manufacturers"?

I don't grok how a processor could not have instructions; I thought an instruction set was simply the set of instructions implemented by some processor model. I read the article because I was intrigued at the idea of a processor with no instructions. Or perhaps, with some kind of fluid instruction set, or loadable instruction set.

But that doesn't seem to be what it's about; the article says that what they want to abandon is "deep pipelines".


there are whole other paradigms, eg https://m.youtube.com/watch?v=O3tVctB_VSU


Here's a comprehensive list of every time the "magic compiler will make our CPU competitive"-approach worked out:


Yes. This has been tried before. A lot. It's straightforward to put a lot of loosely coupled compute units on a single chip, or at least a single box. The question is, then what?

- ILLIAC 4 (64-cpu mainframe, 1970s): "A matrix multiply is a master's thesis, a matrix inversion is a PhD thesis, and a compiler may be beyond the power of the human mind".

- Connection Machine. (SIMD in lockstep, which just wasn't that useful.)

- NCube (I tried using one of those: 64 CPUs in an array, each with local memory and message-passing hardware. It was donated to Stanford because some oil company couldn't find a use for it. Someone got a chess program going, which works as a distributed search problem.)

- The Cell CPU in the Playstation. (Not enough memory per CPU to do much locally, and slow access to main memory. Tied up the entire staff of Sony Computer Entertainment America for years trying to figure out a way to make it useful.)

- Itanium. (I went to a talk once by the compiler group from HP. Optimal instruction ordering for the thing seemed to be NP-hard, without easy approximate solutions.)

Those are just the major ones that made it to production.

But then came GPUs, which do useful things with a lot of loosely coupled compute units on a single chip. GPUs have turned out to be useful for a reasonable range of compute-heavy tasks other than graphics. They're good for neural nets, which are a simple inner loop with massive parallelism and not too much data sharing. So there may now be a market in this space for architectures which failed at "general purpose computing".


Also Multiflow and more recently Convey.

http://www.multiflowthebook.com/ https://www.dmagazine.com/publications/d-ceo/2012/december/c...

Multiflow definitely reached production. I even worked at a company that contracted to write some software for them (though I personally wasn't involved). They were not a total flop, but obviously not a stellar long-term success either. I'm not sure if Convey actually reached production, but their approach seems much more similar to what Ascenium is trying to do.


> The Cell CPU in the Playstation. (Not enough memory per CPU to do much locally, and slow access to main memory. Tied up the entire staff of Sony Computer Entertainment America for years trying to figure out a way to make it useful.)

At least towards EOL, developers did figure out what to use it for, like using the massive readback performance on the SRAM for tiled deferred shading:

https://de.slideshare.net/DICEStudio/spubased-deferred-shadi...

Good luck teaching that trick to your magic compiler.


NVidia's CUDA into PTX into SASS is pretty darn good.

The SASS assembly in NVidia's instruction set seems to encode read/write barriers manually at the assembly-program level instead of leaving it up to the decoder. PTX exists as the intermediate step before SASS for a reason: all of that read/write barrier placement is extremely complicated.

CUDA cheats by making the programmer do a significant amount of the heavy lifting: the programmer needs to write in an implicitly parallel style. But once written in that manner, the compiler / computer can execute the code in parallel. The biggest win for NVidia was convincing enough programmers to change paradigms and write code differently.
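
A minimal sketch of that implicitly parallel style (the standard SAXPY example, not NVidia's code): the programmer writes the body for a single element, and the launch configuration spreads it across thousands of threads.

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: the launch itself expresses the n-way parallelism.
    //   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);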


ICC

When Intel shipped a compiler that artificially, and purposefully, crippled performance of resulting binaries when run on an AMD CPU.

Not what you had in mind, but that dirty trick compiler did give Intel an advantage.


While ICC did do AMD processors dirty by intentionally disabling optimizations that they (technically) supported in a clearly subversive move, this "worked" much better as a tactic because ICC did legitimately produce better code for Intel CPUs than either MSVC or GCC, and can still produce better optimizations.

At least years ago, if you really wanted to rice up your system on Gentoo, you'd combine ICC with -O3 and get a small but measurable performance bump.


The Intel compiler(s) -- I don't know whether ifort and icc are really distinct -- may produce better code than GCC, and vice versa. The bottom line I got for a set of Fortran benchmarks on SKX was essentially a tie with options as similar as I could make them. (It's a set that seemed to be used for marketing proprietary compilers.) If icc is as reliable as ifort, I wouldn't want to build my OS with it.


> Not what you had in mind, but that dirty trick compiler did give Intel an advantage.

You can only pull off that trick if you already effectively own the market. Otherwise nobody would use your compiler, at least not exclusively.


There were dozens of incidents of compilers that looked for known benchmark code and optimized the hell out of those cases. At the time I remember some debate as to whether it was "fair", on one side people saying it was not, on the other people saying that they actually improved benchmark-like code.


It has somewhat worked out for GPUs, if you phrase it more charitably (eg "our chip needs specialized compilers and programming languages to perform well, but will pay off well enough that a critical mass of developers will use them").

Not that GPUs and their proprietary fragmented & buggy tooling are that nice for developers even now, 15-20 years into the attempt, and the vast majority of apps still don't bother with it. And of course the whole GPGPU thing was just riding on the wing of gaming for most of its existence so had a really long artificial runway.


Are the hyperscalers (proclaimed target market?) likely to be willing to give up control of the compiler stack to a third party like that? Generally the trend seems to be keeping software expertise in-house.


They'll insist that the toolchain be open source, then they'll make their own local modifications which they "never get around to" releasing back.


They could write their own compilers, as long as the thing is well documented.


> The company, helmed by Peter Foley, CEO and co-founder, who previously worked on Apple's Apple I and Apple II computers as well as a long list of hardware-design focused companies.

This doesn't seem right... As far as I know, the Apple I was pretty much exclusively designed by Steve Wozniak.


According to his LinkedIn, he "Developed chips for the Mac and Mac II, including the Apple Sound Chip." Presumably the author here didn't catch the difference between the Mac and the original Apple computers.


http://www.byrdsight.com/apple-macintosh/ describes his work at Apple.

Nothing on Apple 1, mostly on Mac


I've worked with Pete, he's the real deal: worked on early Mac hardware, did the 'Hobbit' chip - a CRISP implementation intended for the Newton (cancelled after working silicon came back)


Hobbit was used by EO for the EO 440 and EO 880. https://en.wikipedia.org/wiki/EO_Personal_Communicator and the chips, for their time, were astounding: https://en.wikipedia.org/wiki/AT%26T_Hobbit


I don't think that they used the chip that Pete did for Apple


Related to Reconfigurable Computing?

https://en.wikipedia.org/wiki/Reconfigurable_computing

It was big in the 1990s, but never took off. Maybe its time has come?


Nice, NISC for "No Instruction Set Computing".

I can see it possibly disrupting the CPU industry in 10-20 years. Seems like a classic scenario right out of Clayton M. Christensen's "Innovator's Dilemma" book.


If you run serverless NoSQL on NISC before the rooster crows, you must go outside and weep bitterly.


The Next Platform has an interview with the CEO, which goes about more into detail regarding the motivations: https://www.nextplatform.com/2021/07/12/gutting-decades-of-a...


So I am not an expert, but from their description, they move the more complex parts to the compiler... To me that sounds a little bit like a Forth chip: just an accumulator and some basic instructions? Is that fair to say, or am I not understanding it?


Do I understand correctly that this reinvention of the CPU moves the microcode into the code being run (such as an OS kernel) rather than the CPU itself, giving the compiler the responsibility to use the CPU efficiently?


With this processor when you get pwned you get pwned hard - that's what I see happening anyway, but I have no domain knowledge so I could be way off base.



