
Tachyum starts from scratch to etch a universal processor - rbanffy
https://www.nextplatform.com/2020/04/02/tachyum-starts-from-scratch-to-etch-a-universal-processor/
======
trishume
I really hope they open source their compiler and allow people to program
directly to the machine code.

Part of the problem with GPUs (and to a greater extent FPGAs) is that the
toolchains are often terrible, buggy and opaque. They also make it really hard
for people to write easier abstractions on top of them. I guess CUDA does do
better along many of those axes than alternatives at the cost of being vendor-
specific, but anything for this would be vendor-specific too.

So much of the code we write doesn't take advantage of GPU power because the GPU
is harder to program for, and you also pay a latency cost for transferring your
data to/from it. If this architecture makes GPU-style programming easier, in
that you just switch to a different style of programming in the middle of your
code and the CPU simply uses different instructions without a big latency
penalty, that would be very cool.

~~~
joe_the_user
You can look at Nvidia PTX as well as CUDA. PTX is their macro assembler code
(still higher level than machine code, but translated by the chip itself, I
think).

As far as open-sourcing the compiler goes, I suspect that having good
documentation about what all the low-level instructions do would be as
important as, if not more important than, the compiler source code. The
compiler code alone wouldn't tell you why they emit a given set of operations.

Moreover, if a group is designing the chip and compiler together, you may wind
up in a situation where they only know that what the compiler emits works;
they don't know what happens if you do various other things.

~~~
verall
The PTX is JIT-compiled by the GPU driver for your device's specific
architecture, but it's pretty fast. If you know you are targeting a specific
device, it's possible to AOT-compile your PTX.
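
For anyone curious what that driver JIT looks like from the host side, here is
a minimal sketch using the CUDA Driver API. The PTX text in ptx_source is
assumed to come from something like "nvcc --ptx kernel.cu", the kernel name
"my_kernel" is made up for the example, and error handling is omitted:

    /* Minimal sketch: loading PTX text through the CUDA Driver API, which
     * JIT-compiles it for whatever GPU is present.  ptx_source is assumed to
     * hold valid PTX containing an entry point named "my_kernel". */
    #include <cuda.h>

    extern const char *ptx_source;  /* assumed: PTX text generated offline */

    int main(void) {
        CUdevice   dev;
        CUcontext  ctx;
        CUmodule   mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* The driver JIT-compiles the PTX to native code for this device here. */
        cuModuleLoadData(&mod, ptx_source);
        cuModuleGetFunction(&fn, mod, "my_kernel");

        /* ... set up arguments and launch with cuLaunchKernel ... */

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }

The AOT path is essentially the same flow, except you run ptxas (or build with
nvcc targeting a specific sm_XX) ahead of time and hand cuModuleLoadData the
resulting cubin instead of PTX text.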

------
en4bz
My understanding of this product is that it is a revisiting of Very Long
Instruction Word (VLIW) as seen in Itanium 20 years ago. I think VLIW was a
good idea that failed at the time due to the sheer momentum of x86 and Moore's
law.

Now that Moore's law is basically at its end and x86 is partially stagnating,
at least from Intel, other platforms like ARM are gaining traction, and it
seems like a good time to revisit VLIW.

I think another key factor is that most applications now run on top of
platforms/frameworks rather than at the native level. This means you only need
to port Linux, the JVM, node, python, and a few others and you have captured a
pretty large potential audience. Compare this to the mid 00's, when moving to
Itanium meant porting all your native apps.

~~~
ur-whale
> VLIW was a good idea that failed at the time due to the pure momentum of x86
> and Moore's law.

It also failed because of Intel's very poor handling of the developers who
wanted to switch to the Itanium architecture; many eventually gave up because
there was only support for big shops.

~~~
_ph_
Looking at the many architectures that have failed over time, I do think this
strongly correlates with not having a critical mass of users/developers. SPARC
and PowerPC could have had a much better and longer life if there had been
affordable motherboards in a common form factor available to enthusiasts. That
is especially important if your CPU architecture is vastly different from the
mainstream. So if Intel had made reasonably priced Itaniums available for the
Linux crowd, software support could have increased considerably. I think a lot
of scientists would have loved Itanium systems for their great floating point
performance. But the times where scientists got equipped with expensive
workstations ended somewhere in the mid-90s.

Likewise, I hope that this new CPU becomes easily available to enthusiasts; I
think its viability may even depend on it.

------
cwzwarich
> The processor pipeline has its out of order execution handled by the
> compiler, not by hardware, so there is some debate about whether this is an
> in order or out of order processor.

The usual problem with these sorts of CPU microarchitectures for general-
purpose computing is that they can't absorb variable cache/memory load
latency. How is this one any different?

There is no ordinary compilation scheme that will solve this, even with
complete omniscience, since the same function with different arguments will
observe different latency. Maybe some magical feedback-driven JIT could do it,
but that was tried in the Itanium era and never really worked either.
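
A toy illustration of the point (mine, not from the article): the function
below compiles to exactly the same code whether it is handed a small
cache-resident list or one scattered across gigabytes of heap, so no static
schedule, VLIW bundle or otherwise, can know how far apart to place the loads
and their uses.

    /* The same compiled code sees completely different load latencies depending
     * on its argument: a short, cache-hot list hits L1 on every node, while a
     * large list scattered across the heap misses to DRAM on most of them.
     * A compiler scheduling this statically cannot know which case it is
     * scheduling for. */
    struct node { long value; struct node *next; };

    long sum_list(const struct node *n) {
        long total = 0;
        while (n) {
            total += n->value;   /* load latency: a few cycles or a few hundred */
            n = n->next;         /* next address unknown until this load returns */
        }
        return total;
    }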

~~~
ema
The Mill architecture has some neat ideas here: you can issue a load in cycle
10 but specify that it should retire in, say, cycle 80, so the CPU can stay
busy for another 70 cycles regardless of whether the loaded value is in the
cache.
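
That delay is encoded in the Mill's ISA itself, so there is no direct
equivalent in conventional code, but a rough software analogue of "issue now,
use later" is hoisting a prefetch well ahead of the use. A small sketch (the
distance of 16 iterations is arbitrary, and __builtin_prefetch is a GCC/Clang
builtin):

    /* Not the Mill mechanism, just an analogue: start fetching data well before
     * it is needed so the memory latency overlaps with other work. */
    void scale(double *dst, const double *src, int n) {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&src[i + 16]);  /* request the data early */
            dst[i] = src[i] * 2.0;                 /* use it many cycles later */
        }
    }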

------
_ph_
It is great to see a new CPU architecture come to market, and very interesting
that they are picking up the VLIW approach again. For true progress, we need
very different approaches to compete. As they are using the TSMC 7nm process,
the processor will be produced on a cutting-edge process, so it won't be held
back by running on an inferior one. Many good designs have been killed in the
past because they were produced on a process that couldn't compete with the
market leaders. I wonder how well the Itanium could perform if it were ported
to 7nm. Many designs which didn't work 10 to 20 years ago could perform vastly
better on a current process.

~~~
cwzwarich
Why would you think that VLIW microarchitectures would benefit from newer
process nodes more than other microarchitectures?

~~~
_ph_
Different architectures scale differently with transistor count. Clock speed
for traditional desktop/server chips hasn't increased a lot over the last
decade. A lot of transistors have been thrown at speculative execution to
increase effective speed by implicitly creating parallelism. Besides the
security problems associated with speculative execution, this is an uphill
battle. Explicit parallelism is an interesting alternative. One reason the
Itanium failed was that it was considered huge and power-hungry, and indeed its
transistor count was high compared to the x86 chips of the time. But times have
changed: the 3 billion transistors of the last-gen Itanium look almost tiny
compared to the over 8 billion transistors in Apple's A13.

------
philipkglass
Itanium was good for high performance computing -- numerical simulations of
physical phenomena. That's what I used it for. It was ok-to-poor for other
workloads. I seem to recall basic utilities like "grep" being slower on my
expensive employer-provided Linux/Itanium 2 workstation than on my budget
x86/Linux desktop.

I can believe that a new VLIW processor can indeed perform well on HPC and ML
workloads. But that doesn't sound particularly "universal" to me. Will people
get good performance running relational databases on it? Graph algorithms?
Compilers? Existing Java applications?

------
jabl
The big question is why this startup would succeed where Intel, with their
near-bottomless coffers, failed. AFAICT there have been no major improvements
since then in compiling efficient general-purpose code for VLIW architectures.

~~~
_ph_
They are using the TSMC 7nm process, so they would even have a slight process
advantage over the best Intel chips. The Itanium never made it beyond 32nm, and
trade-offs at 7nm might work out quite differently than at 32nm. The latest
Itanium chips had 3 billion transistors - that is small compared to the over 8
billion transistors of an iPhone A13 processor. So the "monster" chip Itanium
- large and power-consuming - would possibly make a decent mobile processor
with today's technology.

~~~
jabl
The question isn't whether it's going to be better than Itanium. Of course it
will be (assuming it ever becomes a real product), since it's using a process
several generations better.

Itanium was never particularly impressive compared to its (roughly)
contemporary competitors on (roughly) the same process, so why would Tachyum be
any better?

You say that tradeoffs change as a function of the process, e.g. wires becoming
ever more expensive compared to transistors. Is that enough to tilt the playing
field in favour of VLIW? I can see VLIW having an advantage here if the VLIW
instruction bundle lines up 1:1 with the hardware pipelines (less routing
inside the chip) and the workload is static, so you can use profile-guided
optimization to work around the lack of dynamic scheduling (OoO).

However, once you try to create a VLIW ISA that maps to several generations
and/or high/low-end implementations, you lose that 1:1 mapping, and for
general-purpose code compile-time scheduling isn't particularly good. As Intel
found out.
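
For reference, the profile-guided flow mentioned above looks roughly like the
sketch below with GCC (the flag names are GCC's; Clang has equivalents).
Whether that kind of static feedback can stand in for dynamic scheduling on
general-purpose code is exactly the open question.

    /* Rough sketch of a profile-guided build:
     *
     *   gcc -O2 -fprofile-generate hot_loop.c -o hot_loop    (instrumented build)
     *   ./hot_loop typical_input.dat                         (writes .gcda profile data)
     *   gcc -O2 -fprofile-use hot_loop.c -o hot_loop         (recompile using the profile)
     *
     * With the profile in hand, the compiler can schedule and lay out the code
     * for the branches it actually observed, which is the static stand-in for
     * what an OoO core discovers at run time. */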

~~~
ema
The Mill architecture uses install-time specialization, compiling a
not-quite-machine-code distribution format down to real machine code, to
maintain that 1:1 mapping across different implementations.

------
innovator116
This will inevitably be compared to the different approaches taken by RISC-V
projects. IMHO, only real-world tests at scale can show whether it can live up
to its universal-processor claims.

