The processor pipeline has its out of order execution handled by the compiler, not by hardware, so there is some debate about whether this is an in order or out of order processor. Danilak says that instruction parallelism in the Prodigy chip is extracted using poison bits, which were popular with the Itanium chip, which this core resembles in some ways, and which are also used in Nvidia GPUs. The Prodigy instruction set has 32 integer registers at 64 bits and 32 vector registers that can be 256 bits or 512 bits wide, plus seven vector mask registers. The explicit parallelism (again, echoes of Itanium) is extracted by the compiler and instructions are bundled up in sizes of 3, 8, 12, or 16 bytes.
So it’s a big VLIW chip. Maybe interesting for the domain, but the crazy specs make sense in that context. With deep OOO designs hitting almost 5 GHz on TSMC 5 nm, it’s not surprising to see an in order VLIW design hit 5.7 GHz. Impressive effort by a new company if it pans out, though. It would be good to see a renaissance in high end chips and architectures like the mid 1990s.
Yeah. I wonder if it's a classic VLIW or if the pipeline is exposed and/or skewed.
exposed: The results of operations that take more than one clock cycle don't necessarily appear at their destination the cycle after the instruction is executed. Think branch delay slots but potentially for multiplies and loads too.
skewed: Loads, processing, and stores can happen on subsequent clock ticks so simple loops don't necessarily need prologues and epilogues.
A compiler can handle either of these for a particular CPU pretty easily, but they tend to eliminate binary compatibility. Code morphing designs in the Transmeta lineage liked to use both of these. Other VLIWs with barrel multithreading, like the Hexagon DSPs in Snapdragon SoCs, don't need them.
EDIT: The Mill guys have a video on how this works on their system. They've got their own names for things for some reason but they do a good job of explaining how this all works: https://millcomputing.com/docs/execution/
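For the "exposed" case, here is a toy model (latencies and names are made up for illustration; nothing here reflects any real ISA): a load's result only lands in its register several cycles after issue, and the compiler has to fill the delay slots with independent work.

```python
# Toy model of an "exposed" pipeline: a load issued at cycle t only
# lands in its register at cycle t + LOAD_LATENCY. The "compiler"
# (here, us) must schedule independent work into the delay slots;
# reading the register too early would see a stale or missing value.

LOAD_LATENCY = 3

class ExposedMachine:
    def __init__(self, memory):
        self.mem = memory
        self.regs = {}
        self.pending = []   # (ready_cycle, reg, value) loads in flight
        self.cycle = 0

    def tick(self):
        self.cycle += 1
        # retire loads whose latency has elapsed
        still = []
        for ready, reg, val in self.pending:
            if ready <= self.cycle:
                self.regs[reg] = val
            else:
                still.append((ready, reg, val))
        self.pending = still

    def load(self, reg, addr):
        self.pending.append((self.cycle + LOAD_LATENCY, reg, self.mem[addr]))
        self.tick()

    def add(self, dst, a, b):
        # reading an unset register returns 0 -- the hazard is exposed
        self.regs[dst] = self.regs.get(a, 0) + self.regs.get(b, 0)
        self.tick()

    def nop(self):
        self.tick()

m = ExposedMachine({0: 10})
m.regs["r2"] = 5
m.load("r1", 0)        # r1 <- mem[0], lands 3 cycles from now
m.nop(); m.nop()       # delay slots: independent work would go here
m.add("r3", "r1", "r2")
print(m.regs["r3"])    # 15: the load landed just in time
```

With only one nop the add would read the register before the load lands, which is exactly the kind of scheduling the compiler has to get right on such a machine.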
In my time as an FPGA developer, I wrote a lot of tiny processing cores that were exposed and skewed VLIW designs for specific applications. If the code isn't changing much, they work extremely well in a very small silicon area and power footprint. However, they are awful to program.
In supercomputing, this sort of thing sometimes appears and works well (like the Pezy-SC2 chips that recently came out). However, they usually fail on general purpose computing tasks.
Yeah. Normally sophisticated software scheduling works great on DSP-like tasks where the memory access patterns are very predictable but suffer a lot where you tend to have unexpected cache misses that a deep OoO system could just paper over.
I think the benefits of both could be had with "software assisted branch prediction/caching".
I.e. you transpile your x86 code to native code for your VLIW machine. Then you run that code for a few hundred clock cycles until, bam, a "cache not ready on time" exception fires when you try to execute an instruction that expects to read memory from the cache, but the cache isn't yet populated with that value. Then you re-run your transpiler, which produces new code that either does a better job of reading the data into the cache ahead of time, or issues different instructions which take more time and read data directly from RAM.
Remember a software transpiler sometimes has more information than a typical deep OoO CPU, because it can use a lot more memory for state (eg. remembering that a particular branch or memory access won't be cached), and it can even persist state across system reboots. It can also do far more expensive optimizations and save the results, something a deep OoO CPU can't do because all the optimizations need to be doable in hardware.
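A minimal sketch of that retranslation feedback loop (every name here is hypothetical): run the translated trace, trap "cache not ready", record the offending access in a profile that outlives the run, and retranslate with a prefetch hoisted ahead of the hot load.

```python
# Hypothetical feedback-directed retranslation loop. The profile dict
# stands in for persistent state (which, as noted above, a real system
# could even keep across reboots).

class CacheNotReady(Exception):
    pass

class Machine:
    def __init__(self):
        self.cache = set()

    def prefetch(self, addr):
        self.cache.add(addr)

    def load(self, addr):
        if addr not in self.cache:
            self.cache.add(addr)   # line gets filled as a side effect
            raise CacheNotReady(addr)
        return addr * 2            # stand-in for the loaded value

profile = {}                       # persists across "runs"

def transpile(addrs):
    # Emit prefetches for addresses the profile flagged, then the loads.
    hoisted = [a for a in addrs if profile.get(a)]
    return hoisted, addrs

def run(machine, addrs):
    hoisted, loads = transpile(addrs)
    for a in hoisted:
        machine.prefetch(a)
    out = []
    for a in loads:
        try:
            out.append(machine.load(a))
        except CacheNotReady:
            profile[a] = True      # remember for the next translation
            return None            # abort; caller retranslates and reruns
    return out

trace = [100, 200]
m = Machine()
result = run(m, trace)
while result is None:              # retranslate until the trace runs clean
    m = Machine()                  # fresh machine = cold caches each run
    result = run(m, trace)
print(result)                      # [200, 400]
```

Each retry here "learns" one more miss, which is roughly the claimed advantage over hardware: the profile can grow without being bounded by a fixed amount of on-chip state.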
This seems like it would give fantastic benchmark results, if the benchmark result shared was the final version and the benchmark was identical from run-to-run.
Regarding the “too good to be true” part, I wouldn't take them seriously based on what they have published (and more crucially what they have withheld, like an ISA manual or sources of a GNU toolchain port), except they managed to hire a very senior GCC developer. I trust that he did his due diligence before joining them, so I must assume what they are building is real.
Did they make any performance claims about the cross-ISA support? It might just be QEMU port with qemu-user and TCG.
> except they managed to hire a very senior GCC developer. I trust that he did his due diligence before joining them, so I must assume what they are building is real.
"Lots of money" and interesting challenge can be the due diligence, it's not like the product has to be realistic to get paid (it helps in the long run, but in the short run VC money pays the bills).
Hell, they might even believe in the product. Linus spent 6 years at Transmeta.
But if you actually look at the thing, it does not actually seem like an out-and-out scam, a 5.7GHz VLIW design is not even remotely in the realm of the impossible.
And rosy promises which end up crashing and burning are exactly what Transmeta achieved.
I wonder how they manage that? It's straightforward to do if a page fault represents an error that halts the execution of a thread but not if you need to be able to keep going after the OS reads in the data from the hard drive. That is, if you speculatively load values that aren't going to be used in a loop then the fact that you had poisoned values sitting around in some registers doesn't matter. But if that data "should" have been there but wasn't and you store or branch based on it then you've got to roll back to the load, page in the data, then continue from the checkpoint. And if you're able to do all of that why not just go full out of order?
I'm not sure if this answers your question but VLIW processors feature speculative loads.
A "speculative load" can be seen as a data prefetch to a register, but with an extra instruction placed before reading/using the register. This instruction checks that the memory content has been received; otherwise it stalls the pipeline. It allows a load to be reordered before a conditional branch.
To determine whether the load has been received, there is an additional structure in the microarchitecture that tracks speculative loads in-flight.
When a context switch occurs, this data-structure is overwritten, so there are additional mechanisms to replay the load in such cases.
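The pattern looks roughly like the ld.s / chk.s pair (using Itanium-style names; this is a sketch of the general idea, not Tachyum's actual mechanism): the load is hoisted above the branch, and the check on the consuming path stalls until the data arrives or surfaces the deferred fault.

```python
# Sketch of a speculative load plus check. A bad address never faults
# at the load itself; it just poisons the token, and the fault only
# surfaces if a path that actually uses the value runs the check.

class SpecToken:
    def __init__(self, value=None, fault=False, ready_at=0):
        self.value, self.fault, self.ready_at = value, fault, ready_at

def spec_load(mem, addr, now, latency=3):
    # ld.s: hoistable above branches; never traps immediately
    if addr not in mem:
        return SpecToken(fault=True)
    return SpecToken(value=mem[addr], ready_at=now + latency)

def chk(token, now):
    # chk.s: stall until the data arrives; raise the deferred fault here
    if token.fault:
        raise MemoryError("deferred fault at check")
    stalls = max(0, token.ready_at - now)
    return token.value, stalls

mem = {0x10: 42}
t = spec_load(mem, 0x10, now=0)    # hoisted above the branch
# ... branch condition computed here ...
take_branch = True
if take_branch:
    value, stalls = chk(t, now=1)  # checked only on the path that uses it
    print(value, stalls)           # 42 2
else:
    pass                           # token discarded; a bad address never faults
```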
The pipeline state can be checkpointed with a fixed amount of logic gates that’s proportional to the amount of pipeline state. Out of order execution requires tracking dependencies in structures that scale superlinearly with the size of the out of order window.
It isn't bound by the pipeline length. Even without pipelining at all, with poisoning you can do a faulting load into a register and then just mark the value in the register as poisoned. As you perform operations with the value any results are also poisoned. Then maybe dozens or hundreds of cycles later when you store or branch on one of the poisoned values the fault occurs. So pipeline length really doesn't have anything to do with it.
On Itanium you essentially had to double check your values for poison before operating on them in a lot of cases and that caused big performance penalties on conventional code. Other systems restrict the shenanigans you can get up to with your MMU and solve the problem that way but that means you can't run a traditional OS with mmaped files and memory paging and such. I'm not sure what the Tachyum people are doing or if they've got some clever idea to get around all of this.
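The poison propagation described two comments up can be modeled in a few lines (purely illustrative; not Itanium's or Tachyum's actual semantics): a faulting load doesn't trap, it poisons its destination, poison flows through arithmetic, and only a store or branch on a poisoned value raises the deferred fault.

```python
# Toy poison-bit model: faults are deferred from the load to the use.

class Poison:
    pass
POISON = Poison()

regs = {}

def load(dst, mem, addr):
    regs[dst] = mem.get(addr, POISON)   # bad address -> poison, no trap yet

def add(dst, a, b):
    if isinstance(regs[a], Poison) or isinstance(regs[b], Poison):
        regs[dst] = POISON              # poison propagates through results
    else:
        regs[dst] = regs[a] + regs[b]

def store(src):
    if isinstance(regs[src], Poison):   # the fault surfaces only at use
        raise MemoryError("deferred fault on store")
    return regs[src]

mem = {0: 7}
load("r1", mem, 0)
load("r2", mem, 999)      # faulting speculative load: no trap, r2 poisoned
add("r3", "r1", "r1")     # untainted computation
add("r4", "r1", "r2")     # r4 silently poisoned
print(store("r3"))        # 14: the untainted path is unaffected
try:
    store("r4")
except MemoryError as e:
    print(e)              # deferred fault on store
```

Note how the fault can surface arbitrarily many instructions after the load, which is exactly why pipeline length doesn't bound the scheme, and also why, as the Itanium experience above suggests, recovering cleanly (e.g. paging the data in and resuming) is the hard part.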
This sounds fairly obviously fake. 128 cores at a TDP of 950W means only ~7.4W per core, which would be a massive improvement on the i9-9990XE - which can only do 5GHz - and requires 18.4W per core. A tiny, hitherto unknown company with only a handful of engineers managing to outperform Intel by this huge of a margin strains credulity past breaking point.
They also talk about air-cooling four of the 600W model in a 2U chassis! I'm not convinced it's possible to handle 600W with air cooling at all for any reasonable socket size - the watts per mm2 ratios are going to be very high - but the idea that you can air-cool 2.4kW from the CPUs alone (and at these speeds there are going to be a lot of other very warm components in there) in a 2U chassis is simply crazy.
> This sounds fairly obviously fake. 128 cores at a TDP of 950W means only ~7.4W per core, which would be a massive improvement on the i9-9990XE - which can only do 5GHz - and requires 18.4W per core
You’re comparing apples and oranges. The i9-9990XE is on a 14 nm process; this is on 5 nm. Beyond that, this appears to be an in order VLIW core, while the i9-9990XE is a deeply out of order core. In a modern processor, much of the power budget is taken up by the structures used for out of order execution, such as reorder buffers and schedulers. These structures often grow non-linearly with the reordering capability of the CPU. Jettisoning them entirely saves huge amounts of power.
Well, the process size numbers between different companies have been somewhat meaningless for a while now, as they all measure different things. Intel's 14nm process is still larger than the 5nm TSMC process used here, but the difference is considerably smaller than the numbers suggest. It's a fair point about VLIW cores being a lot cheaper in terms of power budget though. I'm still extremely sceptical but I concede that at least on paper it's not impossible.
IBM had 5GHz in 2014. [1] Clock speed alone is not a measure of performance.
Besides, most of the work in reaching a certain clock speed or target can be owed to the foundry (in this case TSMC which is world-leading, certainly beating Intel on most metrics at the moment.)
Comparing to the over-tuned enthusiast SKU of 2018 is not a fair comparison for either.
Also, 600W is not impossible to cool; there have been GPUs at that level of power for a while now.
That's the advantage of chopping your pipeline stages in half so that each is 10 FO4s long rather than the 16 FO4s most people use. You've generally got 2 FO4s of latching and 2 of clock skew, so IBM was seeing 6 FO4s of useful work per stage compared to 12 with Intel. Or at least the overhead was 4 per stage in the mid 2000s; I've got no idea what it is in the early 2020s.
And, if you have enough threads per core, it's relatively simple to switch to another thread when an instruction stalls. Unfortunately, most of our software is designed for machines with few fast cores.
Given the main comparison target seems to be the H100 and
> The processor pipeline has its out of order execution handled by the compiler, not by hardware
This is not a direct competitor to general purpose CPUs, which makes the frequency claims a lot more realistic but also completely useless as a basis for comparison.
> The processor pipeline has its out of order execution handled by the compiler, not by hardware
I wonder whether compilers have improved at generating good ILP for VLIW designs. The Itanium suffered from poor performance because it was very difficult to generate optimally ordered code for it.
OTOH, with enough GHz, even a simple, inefficient, in-order architecture can be fast enough. That was what made RISC attractive in the 80's and 90's.
By signing a contract with TSMC? That's the fab according to the article. 5nm is not bleeding edge and as others move to N5P or N4, capacity gets freed at N5.
The M1 Max (8+2 cores) tops out at a CPU TDP of around 30W IIRC, on the 5nm node. These numbers aren't that unrealistic, are they? Intel isn't exactly known for power efficiency currently, aside from E-cores.
The M1 Max runs 0.6-3.2GHz; that is its efficiency range. Running higher requires more voltage, and power goes up with the square of the voltage on top of scaling linearly with frequency: the infamous cubic power scaling. 5.5GHz is just unrealistic.
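A quick back-of-envelope for that cubic scaling (dynamic power P ~ C * V^2 * f, with V tracking f roughly linearly once you leave the efficiency range; the numbers are purely illustrative, not measurements of any real chip):

```python
# Under the cubic model, power scales with the cube of the frequency
# ratio once voltage has to rise linearly with clock speed.

def power_ratio(f_new, f_old):
    return (f_new / f_old) ** 3

# Pushing from 3.2 GHz to 5.5 GHz: ~1.7x the clock for ~5x the power.
print(round(power_ratio(5.5, 3.2), 1))   # 5.1
```

Which is why the top of the frequency range is so disproportionately expensive, and why a 5.5GHz claim at a modest per-core power budget deserves skepticism.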
Citing another comment ->
"Tachyum is writing press releases like a machine gun (2-3 a month), but they don't participate in MLPerf and avoid comparisons altogether. They talk about big facilities, but they have 43 employees (including ex-employees) on LinkedIn."
For me, it's really hard to make even one solution that will be market viable, meaning it will sell on the real open market in quantities large enough to make the project profitable.
For example, it would be possible in some fantastic scenario in which some big rich country totally prohibits all other architectures and forces everyone to buy only this one.
But I only know of big poor countries that might consider such experiments.
This sounds like yet another company aiming to take advantage of the everything bubble, wherein a company can make a claim without any shipping product, without any proven ROI, and still get a multi-billion dollar valuation. The team becomes exceedingly rich, and then they either get bought or go bankrupt (though still individually wealthy). A claim this large needs very substantial proof, and there’s nothing to show that this is more than an overzealous teenager mouthing off about something he/she/ze doesn’t understand.
Everything, even the claims to be better than CPUs, GPUs, and TPUs at once.
But making all those claims in one business, in one solution, looks like a scam.
Every vector needs its own very effective leader, and in real life they will compete for shared resources. It is extremely difficult to organize things so that the overall system works smoothly and is economically viable.
In the best case, I think it's possible for this company to deliver some sort of very good number-crunching accelerator, and only then to consider something more.
But right now I see too much self-confidence and too many hidden details.
For example, all commodity hardware suffers from slow commodity DRAM and buses.
It is possible to make a complex solution with custom DRAM (HBM), a custom high-speed bus, etc., but its cost will be prohibitive for the commodity market.
So you need some killer app that tolerates such a high cost for some more important reason.
That could be something for a big corporation, or a government, or the military, but I cannot see any fit that isn't already handled by existing solutions.
For example, I could imagine we gained knowledge of 11-dimensional travel, but it needed far more computational power than the current best hardware provides, so something much more powerful would pay off.
But I don't see any signs of such knowledge.
By having its own compiler in the background. Transmeta did this 20 years ago; sadly the IP is in a bin somewhere inside Intel (?) or some other giant.
When we talk about speculation and out of order execution we imagine the CPU is a kind of giant dependency-graph engine but this isn't actually true. The genius of modern hardware design (although kicked off by Tomasulo in the 60s) is that you can "compress" this idea into a real circuit with a finite number of gates and SRAM etc.
The quality of the branch prediction lets you get away with a lot on a fairly dumb "throw stuff on a buffer, speculate across memory accesses, pop (commit) off the buffer when the pie's finished cooking" model inside the processor.
The draw of this and originally Transmeta is that you have an extremely wide dumb processor in front of a smart software frontend that can do much more complicated work and scheduling based on the all-important runtime information that static VLIW sorely lacked (and thus led to Intel doing all kinds of stuff with Itanium, hence it ending up EPIC rather than a true VLIW spiritually).
Now, software is slow, the way Transmeta got around this is by having a physical cache (the "Tcache") for storing translated instructions in.
Not really. AFAIK it all amounts to something between 600 and 800 MHz for real-world code, at best. About the same for affordable FPGAs.
That aside, I don't really get this nostalgia for these systems. I don't care about Doom, or some port of Quake. While 68K assembly was much nicer for me than anything common today, what do I get from that without a usable browser, office suite, or "daily driver" apps? Show me how to port Firefox, Chromium or something functionally equivalent to these, and how those perform! :-)
If we're talking about an actual, modern 68060 CPU running at multiple GHz, then it would be trivial to run Firefox or Chromium -- just install Debian m68k and compile. :)
Apart from the nostalgia factor, I suspect there would be no actual benefit from such a system. I doubt m68k would compare well to ARM or x64 in terms of compatibility or modern-app performance.
Yes, mostly. It was essentially a Transmeta Efficeon with some facilities to run ARM code natively the first time through in case it wasn't going to be re-used.
They were going to have their cores be able to run both ARM and x86 binaries, but IIRC patent issues with Intel prevented them from doing both, so they just went with ARM. Whether you need the x86 patents to have a Transmeta-style processor running x86 might be something that can only be settled by going to court.
They are missing critical statements like "The emulation allows booting x64 Linux and gets XXXX score on [industry benchmark]".
They have no documentation on their native ISA, nor any indication of any software or even compilers ported to it.
I would guess the 4 year delay is them struggling to get the software side working with any reasonable performance. It isn't too hard to make a toy processor with a high clock rate in a simulator... But getting it to run linux and win benchmarks is much harder.
This triggers my bullshit detector. However, even delivering anything in this sector is deeply impressive, so hats off to them, even if they're cheating, as long as they ship.
I'm no foundry insider, but in other forms of manufacturing it's not entirely uncommon to sell slack time on your lines where you have tiny amounts of capacity unused by other contracts on an as-available basis. Yes, I'm suggesting their manufacturing may be the contract foundry equivalent of flying on standby.
The founder/CEO of this has already (co-)founded and sold 2 hardware companies: SandForce and Skyera, he probably has the right connections and can afford it.
Skyera was another one with ridiculous promises of density/price (flash storage arrays) that never really went anywhere. That hardly fills me with confidence for Tachyum’s legitimacy.
Seems bullshit; it's obviously impossible to beat state-of-the-art CPUs, GPUs, and even TPUs (lol) all with a single architecture.
They can maybe do that if the architecture is configurable and they are using different configurations for their CPU, GPU and TPU competitors, but then it's still highly implausible that they can deliver more than 2x improvement over competitors, unless perhaps if it also costs proportionally more (but even then, you get decreasing yields for larger chips and inefficiencies if you make an SMP system).
> Seems bullshit; it's obviously impossible to beat state-of-the-art CPUs, GPUs, and even TPUs (lol) all with a single architecture.
Unseating x86/AMD64 is going to be hard because of the inertia they have. Whatever you make, even if it's faster, at least has to be able to run x86/64 binaries at a reasonable speed or it won't get wide adoption, unless it's a very captive market (mobile/Apple).
I'm skeptical of their claims but I would just say, the M1 has proven that if you have an appetite for throwing away compatibility and a great engineering team you can accomplish quite a bit. It seems you definitely can throw away the baggage that comes with needing compatibility and make fast, successful processors.
Might have something where hand-tuned assembly code can beat top tier CPUs and GPUs for very specific workloads (Although I guess it would have to be something straight-line enough that it can be statically scheduled, while not falling into the bucket of things that GPUs blow through).
There have been a good number of VLIW attempts that had great peak possible performance and efficiency but fell down in many consumer use cases: Itanium, Nvidia Denver, and many GPU/DSP architectures (though those focused on what that sort of architecture did well instead of trying to wedge it into other use cases).
Think about how wide the execution unit of a modern Intel/AMD CPU is compared to its average IPC. If you could magic up an instruction stream that filled every port, and remove all the out-of-order/reordering/hazard logic, it would make the cores a fair bit smaller and more power efficient while giving the "same" peak possible performance. But more ALU ports don't seem to be the bottleneck in current consumer software - keeping them filled is. Hence all the OoO/multithreading/big caches.
And the "Sufficiently Advanced Compiler" making up for OoO scheduling in hardware has been a pipe dream for many years. Maybe this will be the one to break through the apparent barrier? But it's not like a LOT of money and very clever people haven't been thrown at this exact thing before - and they have all fallen by the wayside compared to current large OoO CPU cores.
So to me it's just another cookie cutter attempt at VLIW, nothing special.
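The port-filling point above can be made concrete with a toy in-order issue model (the width and instruction streams are invented for illustration): a wide machine only reaches IPC near its issue width when the stream has enough independent work, while a dependent chain pins IPC at 1 no matter how many ports exist.

```python
# Toy 6-wide in-order machine. Instructions are (dest, src) pairs;
# src is None for instructions with no input dependency. Results become
# visible to later instructions one cycle after issue.

WIDTH = 6

def run_in_order(instrs):
    ready = set()              # registers whose values are available
    cycles, i = 0, 0
    while i < len(instrs):
        issued = []
        while i < len(instrs) and len(issued) < WIDTH:
            dst, src = instrs[i]
            if src is not None and src not in ready:
                break          # in-order: a stalled instruction blocks the group
            issued.append(dst)
            i += 1
        ready.update(issued)   # results visible next cycle
        cycles += 1
    return len(instrs) / cycles   # IPC

independent = [(f"r{n}", None) for n in range(12)]
chain = [("r0", None)] + [(f"r{n}", f"r{n-1}") for n in range(1, 12)]
print(run_in_order(independent))   # 6.0  (every port filled)
print(run_in_order(chain))         # 1.0  (dependency-limited)
```

The silicon cost of the six ports is the same in both runs; only the instruction stream differs, which is the whole argument for why keeping the ports fed, not adding more of them, is the bottleneck.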
The market has changed. There are now applications for filling every ALU port, and those usually run on GPUs, which have awful DX. This chip can run x64, ARM, or RISC-V whenever performance does not matter, but also run native binaries compiled with GCC.
I agree that the VLIW approach will likely fail to deliver in most other applications, but I do not see the product marketed to them either.
I'm surprised that wccftech doesn't get auto-flagged on here like many other sites do. It is well known for getting its content from the rumor mill.
Does the dupe detector work on title text, or URLS? I posted the same article last night: https://news.ycombinator.com/item?id=31719519 The URL is the same, but I shortened the text differently (the article's title was too long for the submission form).
I see nobody pointing out the relatively low number of integer registers (32 64-bit registers [1]) for a VLIW that is intended to emulate a high-end Out-of-Order machine.
Modern high-end Out-of-Order processors feature over 160 physical integer registers.
This number is high because OoO processors use a very large instruction window, and every instruction in this window has to be free from register-name false dependencies (Write-after-Read and Write-after-Write). That is register renaming's job: introducing extra physical registers to hide false dependencies.
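That renaming step can be sketched in a few lines (a toy model, not any particular microarchitecture): every write to an architectural register gets a fresh physical register, so reuse of the same architectural name stops being a dependency.

```python
# Minimal register renamer. Instructions are (dest, [sources]) over
# architectural register names; the output uses physical register ids.

def rename(instrs, n_phys=160):
    free = list(range(n_phys))
    table = {}                  # architectural -> current physical
    renamed = []
    for dst, srcs in instrs:
        phys_srcs = [table[s] for s in srcs]   # read the current mappings
        p = free.pop(0)                        # fresh physical reg per write
        table[dst] = p
        renamed.append((p, phys_srcs))
    return renamed

# 'r1' is written twice (a WAW hazard by name); after renaming the two
# writes target different physical registers and can be in flight at
# the same time, and 'r3' correctly reads the second one.
code = [("r1", []), ("r2", ["r1"]), ("r1", []), ("r3", ["r1"])]
for line in rename(code):
    print(line)
```

A real renamer also has to free physical registers at retirement, which is exactly the early-allocate/late-release inefficiency mentioned below; the point here is just that the window size drives the physical register count well past the 32 architectural names.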
If they really emulate the behavior of an OoO processor we should expect more registers.
The large instruction window of a high-end OoO processor provides latency tolerance. With VLIW processors, to increase latency tolerance we usually increase the register count and employ more aggressive static scheduling techniques.
It is known that register-renaming technique allocates on average more registers than necessary, due to early allocation and late release, so with static scheduling it should be possible to make more efficient allocate/release.
But here the register count difference is too large, low register allocation/release efficiency of register-renaming is not enough to explain the difference.
As a comparison, Transmeta's VLIW core had 64 integer registers and 48 integer shadow registers, which is equivalent to the physical register count of OoO processors of its time.
So either they don't really emulate an out of order processor, or they do some magic, or they don't really perform that well.
(or documents that mention 32 integer registers are incorrect)
The most likely hypothesis to me is that their benchmarks numbers are based on a selection of benchmarks that can exploit their vector unit, and other uncompetitive bench results have been left out.
Meaning that performance on typical x86 general purpose applications would be extremely low with this processor.
I wonder if you could make VLIW processors work better for general purpose loads by exposing a higher-level machine code with an on-board JIT compiler that could optimize for runtime behaviour. In theory JITs should be even more effective for in-order VLIW processors, because runtime behaviour should have an even larger effect on what kinds of optimizations are needed, and the difficult and costly nature of optimizing such code could benefit from very accurate hotspot optimization. You could of course also expose the lower-level architecture for programs that do their own JIT.
The list of specs definitely sound fake, but maybe they actually did figure out how to statically schedule VLIW instructions well. I guess we’ll see in a year or two.
I think the "thing that has changed" is that we now measure cpu performance on much longer and more predictable runs. Not so much booting windows, more doing a hundred trillion FPMAC. But I still don't believe it.
Fake news, yeah maybe. What makes me incredulous is how bad the Tachyum name is. Although people without multiple startups before them can't come up with good names, like https://fgemm.com (coming soon). Apart from that!
I presume it would not be possible, but mixing and matching ISA opcodes across all the implemented ones would be interesting. Pick the opcode that is most advantageous for the task happening right there.
Multiple clock regions is one option. Others pointed out large silicon areas. We also don't know how the packaging is done - it may be a bunch of smaller, 32-core, chiplets separated by interposers or memory controllers and IO modules running at lower speeds.
What's depressing is that these hardware specs are millions of times faster/more cores/more storage/more ram than 20 years ago but we still manage to produce laggy applications and websites.
How can something like Visual Studio, whose solution should easily fit into RAM, take about 5 seconds to start, 5 seconds to search the MRU, and 10+ seconds to load IntelliSense?
How can "search the combo box" on Octopus Deploy take a noticeable number of seconds sometimes, even though it is purely an in-browser operation?
That's just two examples, there are plenty of others.
Some interesting parts, although issues related to memory bandwidth and disk IO are not as relevant today.
I guess the most obvious parallel is that current OSs were designed for hardware that worked the way it did 30 years ago and now that those bottlenecks have disappeared, it is not trivial to rework the design to take advantage of these things.
I suppose the OS suffers from the tragedy of the commons in that each application developer assumes access to enough RAM and CPU to just work without considering how their large resource usage might affect all other applications/services on the system.
> How can something like Visual Studio, whose solution should easily fit into RAM, take about 5 seconds to start, 5 seconds to search the MRU, and 10+ seconds to load IntelliSense?
You could start all three in three threads and have everything finish in 10+ seconds, provided you have cores and memory bandwidth available.
Maybe Microsoft is providing the Visual Studio team with desktops that are too good. ;-)
I frequently advocate developers should, as an exercise, work on machines with more cores and lower single-thread performance. Core counts are only going to increase and being able to put all cores to work at the same time is an important competitive advantage.
Interesting bit I found researching them: https://www.nextplatform.com/2020/04/02/tachyum-starts-from-...
The processor pipeline has its out of order execution handled by the compiler, not by hardware, so there is some debate about whether this is an in order or out of order processor. Danilak says that instruction parallelism in the Prodigy chip is extracted using poison bits, which were popular with the Itanium chip, which this core resembles in some ways, and which are also used in Nvidia GPUs. The Prodigy instruction set has 32 integer registers at 64 bits and 32 vector registers that can be 256 bits or 512 bits wide, plus seven vector mask registers. The explicit parallelism (again, echoes of Itanium) is extracted by the compiler and instructions are bundled up in sizes of 3, 8, 12, or 16 bytes.