The 32-bit aligned instruction assumption is probably baked into their low-level caches, branch predictors etc. That might mean much more significant work for switching to 16-bit instructions than they are willing to do.
I don't think anyone bakes instruction alignment into their caches since the early 2000s, and adding an extra bit to the branch predictors isn't that big of a deal. It's got to be the first or second stage of their front end right before the decoders.
Why not bake instruction alignment into the cache? When you can assume instructions will always be 32bit aligned, then you can simplify the icache read port and simplify the data path from the read port to the instruction decoder. Seems like it would be an oversight to not optimise for that.
Though, I suspect that's an easy problem to fix. The more pressing issue is what happens after the decoders. I understand this is a very wide design, decoding say 10 instructions per cycle.
There might be a single 16-bit instruction in the middle of that 40-byte block, changing the alignment halfway through. To keep the same throughput, Qualcomm now needs 20 decoders, one attempting to decode at every 16-bit boundary. The extra decoders waste power and die space.
Even worse, they somehow need to collect the first 10 valid instructions from those 20 decoders. I really doubt they have enough slack to do that inside the decode stage, or the next stage, so Qualcomm might find themselves adding an entire extra pipeline stage (probably before decode, so they can have 20 simpler length decoders feeding into 10 full decoders in the next stage) just to deal with possibly misaligned instructions.
I don't know how flexible their design is, it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders to 32bit RISC-V.
Because RISC-V was designed to make length decoding trivial: you simply look at the bottom two bits of each 16-bit word to tell whether it starts a 32-bit or a 16-bit instruction. At that point, spending the extra I$ budget isn't worth it. Those 20 'simple decoders' are each literally just a single 2-input NAND gate. Adding complexity to the I$ hasn't made sense even for x86 in two decades, because of the extra area needed in the I$ versus the extra decode logic. And x86 is a case where this extra length decode legitimately is an extra pipeline stage.
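The length rule being described can be sketched in a few lines of plain Python. This only illustrates the encoding rule from the base RISC-V spec, not any real decoder hardware:

```python
def insn_length(parcel: int) -> int:
    """Instruction length implied by the first 16-bit parcel.

    In the standard RISC-V encoding, an instruction whose lowest two
    bits are 0b11 is a 32-bit instruction; anything else is a 16-bit
    compressed instruction. (The spec reserves longer 48/64-bit
    formats too, which are ignored here.)
    """
    return 32 if (parcel & 0b11) == 0b11 else 16

# 0x0013 is the low parcel of `addi x0, x0, 0` (bits [1:0] == 11) -> 32
# 0x0001 is `c.nop` (bits [1:0] == 01) -> 16
```

That single 2-bit check per 16-bit boundary is why the 'length decoders' are so cheap: each is just a test that two bits are both 1.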
> I don't know how flexible their design is, it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders to 32bit RISC-V.
I'm sure it is legitimately simpler for them. I'm not sure we should bend over backwards and bring down the rest of the industry because they don't want to do it. Veyron and Tenstorrent were showing off high-perf designs with RV-C.
It doesn't matter how optimised the length decoding is. Not doing it is still faster.
For an 8-wide or 10-wide design, the propagation delays get too long to do it all in a single cycle, so you need the extra pipeline stage. The longer pipeline translates to more cycles wasted on branch mispredicts.
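The "longer pipeline costs you on mispredicts" claim is easy to put rough numbers on. A back-of-the-envelope sketch, where every number (branch frequency, mispredict rate, stage count) is an assumption chosen purely for illustration:

```python
def mispredict_cost_per_insn(frontend_stages, branch_frac, mispredict_rate):
    # Assume a mispredict flushes the front end, so the penalty in
    # cycles grows linearly with the number of stages before execute.
    penalty_cycles = frontend_stages
    return branch_frac * mispredict_rate * penalty_cycles

# Illustrative only: 1 in 5 instructions is a branch, 3% mispredicted.
base  = mispredict_cost_per_insn(12, 0.20, 0.03)  # 12-stage front end
extra = mispredict_cost_per_insn(13, 0.20, 0.03)  # +1 stage for length decode
# extra - base is the added cost, in cycles per instruction, of the stage
```

Under these made-up numbers the extra stage costs a fraction of a percent of IPC, which is exactly the kind of margin wide cores fight over.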
RISC-V code is only about 6-14% denser than AArch64 [1], so I'm really not sure the extra complexity is worth it. Especially since AArch64 still ends up with a lower instruction count, so it will be faster whenever you are decode limited instead of icache limited.
> Adding complexity to the I$ hasn't even made sense for x86 in two decades
Hang on. Limiting the Icache to only 32bit aligned access actually simplifies it.
And since the NUVIA core was originally an aarch64 core, why wouldn't they optimise for hardcoded 32bit alignment and get a slightly smaller Icache?
> Hang on. Limiting the Icache to only 32bit aligned access actually simplifies it.
Even x86 only reads 16- or 32-byte aligned fields out of the I$, then shifts them. There's no extra I$ complexity. You still have to do that shift at some point, in case you jump to an address that isn't 32-byte aligned. You also ideally don't want to hit peak decode bandwidth only when starting at aligned 32-byte program counters, so that whole shift-register arrangement is pretty much a requirement. And that's where most of the propagation delays are.
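The aligned-read-then-shift scheme being described can be modeled in a few lines. This is a toy software model of the datapath, not a claim about any specific core:

```python
def fetch_window(imem: bytes, pc: int, block: int = 32) -> bytes:
    """Toy model: the I$ read port only ever delivers `block`-byte
    aligned lines; a shifter after the read port then rotates the
    line so decode starts exactly at the PC."""
    base = pc & ~(block - 1)           # aligned read out of the cache
    line = imem[base : base + block]
    return line[pc - base :]           # the shift after the read port

mem = bytes(range(64))
# Jumping to pc=10 still reads the aligned line at 0, then shifts by 10.
assert fetch_window(mem, 10)[0] == 10
```

The point is that the shifter exists regardless of instruction alignment, so restricting the I$ itself to 32-bit-aligned reads buys very little.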
> RISC-V code is only about 6-14% denser than AArch64 [1], so I'm really not sure the extra complexity is worth it. Especially since AArch64 still ends up with a lower instruction count, so it will be faster whenever you are decode limited instead of icache limited.
There's heavy use of fusion, and fwiw, the M1 also fuses heavily into micro-ops (and I'm sure the AArch64 morph of NUVIA's cores does too).
Under classic RISC architectures you can't jump to non-aligned addresses. That lets you specify jumps that are 4 times longer for the same number of bits in your jump instruction. Here's MIPS as an example:
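As a sketch of the MIPS case (this is a reconstruction from the MIPS32 J-type encoding, not the original poster's example): the J instruction has a 26-bit target field, and because instructions are word aligned, the hardware shifts that field left by 2, so those 26 bits address a 2^28-byte region instead of 2^26.

```python
def mips_j_target(pc: int, target26: int) -> int:
    """Effective address of a MIPS J instruction: the 26-bit target
    field is shifted left by 2 (word alignment buys those 2 bits),
    then combined with the top 4 bits of PC+4."""
    return ((pc + 4) & 0xF000_0000) | ((target26 & 0x03FF_FFFF) << 2)

# 26 bits of immediate reach a 2**28-byte (256 MB) region:
assert mips_j_target(0x0040_0000, 0x03FF_FFFF) == 0x0FFF_FFFC
```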
And it's not only a way of decreasing code size. It helps with security too. If you can have an innocuous-looking bit of binary starting at address X that turns into a piece of malware if you jump to address X+1, that's a serious problem.
RISC-V, I'm pretty sure, enforces 16-bit alignment and is self-synchronizing, so it doesn't suffer from this despite being variable length. But if it allowed the PC to be pointed at an instruction with a 1-byte offset, then it might be vulnerable.
As far as I'm aware, every RISC ISA that's had any commercial success does this: PA-RISC, SPARC, POWER, MIPS, Arm, RISC-V, etc.
> And it's not only a way of decreasing code size.
And RISC-V has better code density than AArch64.
> It helps with security too.
> If you can have an innocuous-looking bit of binary starting at address X that turns into a piece of malware if you jump to address X+1, that's a serious problem.
Not all JIT spraying relies on byte offsets to get past JIT filters, the attack I gave is just an example.
And NanoMips requires instructions to be word aligned just like everybody else, it's just that it requires 16 bit alignment rather than 32. Attempting to access an odd PC address will result in an access error according to this:
> And NanoMips requires instructions to be word aligned just like everybody else, it's just that it requires 16 bit alignment rather than 32. Attempting to access an odd PC address will result in an access error according to this:
Right, and I mentioned RISC-V as yet another sane RISC architecture that requires word alignment for instruction access. But the fact that it requires alignment means that the word size has implications for the instruction cache design and the complexity of the pipelining there.
I don't have a strong opinion on whether the C extension is a net good or bad for high performance designs, but I do strongly believe that it comes with costs as well as benefits.
Back in 2019, RISC-V was 15-20% smaller than x86 (up to 85% smaller in some cases) and was 20-30% smaller than ARM64 (up to 50% smaller in some cases).
Coq is fairly generic; it has a long history and has made it possible to write some really cool proofs, such as a proof of the four-color theorem, but writing crypto proofs in Coq is really hard.
For symbolic verification Tamarin and ProVerif are the tools of choice; I used ProVerif.
For proofs of security for protocols, EasyCrypt and CryptoVerif can be used. CryptoVerif, ProVerif and Coq were developed at the same institute, by the way: Inria Paris.
Although it would be a great exercise in hybrid SPARK+Coq proof. If (that's a big if) you can specify your algorithm in SPARK then (I think) you can use either the SPARK automated/guided prover, or when it can't discharge the proof, use some predefined lemmas, and barring that go down to the interactive Coq environment (or Isabelle, I've seen it done once) and discharge the verification conditions.
Not sure anyone has published such a multi-layer spec and proof effort /with crypto code in the mix/.
My biggest question after looking at the readme: What happens if your computer crashes while dura is making a commit? Can it corrupt your local git repository?
From my own experience, git is not crash safe, i.e., it can leave the .git directory in an inconsistent state if the computer crashes during certain git operations.
There is a lot more to nuclear waste than just the fuel rods. Lots of materials that are either used by workers at the plant or are part of the plant become radioactive and can no longer go to a regular landfill or other trash-processing facilities. That is why some energy companies are trying to default rather than pay for decommissioning plants once they have reached the maximum age they were constructed for.
> if the two of them don't decide to add their own proprietary "extensions" to the language.
icc has always had its own dialect of C++, which in practice means that there is "C++" code that only compiles on icc but is rejected by clang++ and g++.
With Intel switching to the clang frontend, I would hope that their interpretation of C++ becomes more, not less, standards-conforming.
> At best you can do the same with one reduced canonical simplified verilog source-to-source translation.
Parsing Verilog and generating valid Verilog is fairly difficult. If you want to stay with Verilog, the most realistic alternative to firrtl right now is RTLIL, the representation used inside yosys.
> at worst it does not do the primary function of "being" IR for RTL, because it should've been a graph, not another language with simplified syntax.
Canonicalized LoFirrtl (i.e., the representation the compiler lowers Chisel to) is essentially SSA (static single assignment), which encodes a dataflow DAG. So at the per-module level, firrtl does represent the circuit as a graph.
What you might be talking about is the fact that this graph isn't global. Having a global circuit graph could make some analyses easier, but it might require essentially inlining the whole circuit, which is something a lot of designers are opposed to.
Even small optimizations like removing unused pins from internal modules are often opposed.
Chris Lattner and others are currently working on an "industry" version of firrtl as part of the CIRCT hardware compiler framework: https://github.com/llvm/circt
As you can see they did not decide to go with a global graph based IR and instead opted to just represent local data-flow graphs as SSA.
> Access to any global variable should always occur direct from memory.
What if your function takes a pointer that might be pointing to a global variable? Does that mean that all accesses through a pointer are now exempt from optimization unless the compiler can prove that the pointer will never point to a global variable?
I remember one of the reasons you did not want to use firrtl was that its compiler is implemented in Scala and thus hard to integrate into other projects. CIRCT will solve that problem by providing a firrtl compiler implemented in C++. Other languages like Verilog/VHDL and new high-level languages for HLS-like designs are also on the todo list.
I contribute to CIRCT, so I feel like I should chime in here. I personally hope that it can provide exactly the kind of unifying IRs we are all hoping for in the open-source community. The fact that the tools are implemented in C++ may be a win for some, but I think the CIRCT project is compelling for much deeper reasons. The README states the ambition clearly:
> By working together, we hope that we can build a new center of gravity to draw contributions from the small (but enthusiastic!) community of people who work on open hardware tooling.
There are weekly community meetings that are open to the public, and we have guest speakers from all sorts of interesting projects in the open-source community. Many of those are leading to collaborations and contributions to CIRCT.
There hasn't been much (any?) discussion of CIRCT on HN, but rather than present the reasons I think it's so great here, I'll point to a talk[1] I gave earlier this year and a much better talk[2] Chris Lattner gave shortly thereafter, both of which lead up to the "Why CIRCT?" question in the second half.
Looking back at that SymbiFlow thread, I see familiar faces that are now actively contributing to CIRCT. There are mentions of many different hardware IRs in some of the posts, but at least three have first-class support in CIRCT today: FIRRTL[3], LLHD[4], and Calyx[5]. This is all very recent and experimental, but I would say the results are already promising.
It was in fact the EdgeTPU (which is different from the TPU used in data centers).
The talk from a google engineer can be found on youtube: https://www.youtube.com/watch?v=x85342Cny8c
Please note that they were using a version of Chisel from 5+ years ago and many things have changed since then.
It is still true though that Chisel can be hard to learn for typical hardware engineers, which is why it may be best suited for small and highly dedicated teams rather than large hardware companies.
I'm very surprised by the reported cost of verification. One of the major reasons to move to a high-level-synthesis design strategy is to ensure equivalence between your models and RTL.
I have experience with C++ HLS and it has enabled massive reductions in verification time.
Chisel is not HLS. It is a Scala library that lets you generate circuits on an RTL abstraction level. That means that you explicitly define every state element like registers and memories. But you can generate N registers inside a loop (or a map/foreach) instead of only 1 at a time. In HLS the compiler needs to somehow infer your registers and memories.
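The "generate N registers in a loop, but every state element is still explicit" idea can be mimicked in plain Python. This is not Chisel (which is a Scala library); it's a made-up toy to show that a hardware construction language is just host-language code building a circuit data structure:

```python
from dataclasses import dataclass, field

@dataclass
class Reg:
    name: str
    width: int

@dataclass
class Module:
    name: str
    regs: list = field(default_factory=list)

def shift_register(n: int, width: int) -> Module:
    """Generate N pipeline registers in an ordinary loop. The host
    language does the elaboration, but every state element is still
    declared explicitly, which is what keeps this RTL rather than HLS."""
    m = Module(f"shift{n}")
    for i in range(n):
        m.regs.append(Reg(f"stage{i}", width))
    return m

# One parameterized generator, many concrete circuits:
assert len(shift_register(8, 32).regs) == 8
```

In real Chisel the loop body would instantiate `Reg(...)` hardware nodes, but the structure of the program is the same: ordinary code constructing a netlist description.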
That said, I think one of the problems the Google team was struggling with is that in traditional HW development there is a design team and a separate verification team. The design team bought into Chisel since it would let them generate hardware more quickly, but the verification team just tried to apply their traditional verification methods to the _generated_ Verilog. This is almost like testing the assembly that a C++ compiler generates instead of testing the C++ program, since all your testing infrastructure is set up for testing assembly code and that is "what we have always been doing".
In order to catch verification up to modern Hardware Construction Languages [0] we need more powerful verification libraries that can allow us to build tests that automatically adapt to the parameters that were supplied to the hardware generator. There are different groups working on this right now. The jury is still out on how best to solve the "verification gap". In case you are interested:
I am probably missing some approaches from the nmigen world that I am not familiar with. You can always write cocotb [1] tests in python, but I am not sure if they can directly interface with nmigen generators to adapt to their parameterization.
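A minimal sketch of what "tests that adapt to generator parameters" means: the testbench derives its stimulus and checks from the same parameter object the generator was elaborated with. All names here are invented for illustration; no real framework is being described:

```python
def make_adder(width: int):
    """Toy 'generator': returns a behavioral stand-in for the
    elaborated hardware plus the parameters it was built with."""
    mask = (1 << width) - 1
    def adder(a, b):
        return (a + b) & mask
    return adder, {"width": width}

def check(dut, params):
    """The test reads the generator's parameters, so widening the
    design automatically widens the stimulus and the checks."""
    mask = (1 << params["width"]) - 1
    for a, b in [(1, 2), (mask, 1), (mask, mask)]:
        assert dut(a, b) == (a + b) & mask

for w in (8, 16, 32):   # one test, many parameterizations
    check(*make_adder(w))
```

The same pattern is what a cocotb-style testbench would need from an nmigen or Chisel generator: access to the elaboration parameters, not just the emitted Verilog.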
Meanwhile there are surprisingly many programming languages with a "synthesizable subset" meant to "replace Verilog". From my point of view it would make more sense to look for a dedicated language, which does not have the uncertainty of synthesizability (because the language was originally intended for something completely different), and which does not aim at the most general case of digital design, but e.g. only RTL and behavioral level for synchronous circuits. This might be an example: https://people.inf.ethz.ch/wirth/Lola/index.html.
Chisel imho solves the "synthesizable subset" problem quite elegantly: All Chisel constructs are simple, synthesizable circuit elements and boolean functions (+ functions on fixed size integers that can be converted into boolean functions).
All automation happens in the meta-language which in this case is Scala. Chisel was always intended for synthesizable hardware!
Scala is a general-purpose programming language. Chisel is therefore its "synthesizable subset". To use Chisel you need to know Scala as well as the HW-design-specific constructs Chisel offers.
EDIT: and Scala and the functional programming paradigm are not something design or verification engineers are usually familiar with, which was apparently also a major issue in the Google project according to the referenced talk (quote: "frankly, most hardware engineers don't really get past this yellow [Chisel learning] curve")
The one thing Chisel adds on top of an Object hierarchy that describes a circuit is what PL people normally call "syntactic sugar". I.e., Chisel makes constructing this circuit object look more like a Verilog circuit by taking advantage of some nice Scala features. However, in the background, we are just constructing a data structure that represents a circuit, just like in a GUI library you might construct a data structure that describes your widget hierarchy. Chisel is not High Level Synthesis.
What I think the other user is getting at is that Chisel isn't a "synthesizable subset" of Scala. You can use any Scala you want as Chisel is just a library. The model isn't a subset of Scala that can be turned into a netlist with the right compiler, but of a library for metaprogramming netlist graphs.
Thanks for clarifying; I was under the (false) impression that Chisel was used for HLS. I know of a few companies working with SystemC and UVM-ML to accelerate verification. Having closer ties between design and verification becomes absolutely essential when your flow is accelerated.
I’ve personally used HLS C++, GTest and UVM. It’s pretty effective but there’s definitely room for improvement.