
XLS: Accelerated HW Synthesis - victor82
https://google.github.io/xls/
======
Traster
>XLS is used inside of Google for generating feed-forward pipelines from
"building block" routines

For those that aren't familiar, control flow - anything that isn't a directed
acyclic graph - is the hard part of HLS. This looks like a fairly nice syntax
compared to the bastardisations of C that Intel and Xilinx pursue for HLS, but
I'm not sure this is bringing anything new to the table.

As for the examples, I'm kind of flummoxed that they haven't given any details
on what the examples synthesize to. For example, how many logic blocks does
the CRC32 use? How many clock cycles? What about the throughput? I'm going to
sound like a grumpy old man now, but it's important because it's very
difficult to get performant code as a hardware engineer. Generally it involves
having a fair idea of how the code is going to synthesize. What is damn near
impossible is figuring out what you want to synthesize to, and then guessing
the shibboleth that the compiler wants in order to produce that code. Given
that they haven't tackled the difficult problems like control flow, folding,
resource sharing, etc., it makes me hesitant to believe they've produced
something phenomenal.

~~~
learyg
Hi, one of the collaborators here, thanks for the good points.

We have been targeting some Lattice FPGAs for prototyping purposes, but we've
mostly been doing designs for ASIC processes, which is why details are a
little sparse for off-the-shelf FPGAs; it's a priority for us to fill those
in. We have some interactive demos that show FPGA synthesis stats (cell
counts, generated Verilog), let you toy with the pipeline frequency, and
integrate with the
[IR visualizer](https://google.github.io/xls/ir_visualization/#screenshot);
we'll try to open source those as soon as possible. The OSS tools (SymbiFlow)
that some of our colleagues collaborate on can do synthesis in just a few
seconds, so it can feel pretty cool to see these things in near-real-time.

We fold resources over time with a sequential generator, but we still have a
ways to go. We expect a bunch of problems will map nicely onto concurrent
processes; they're Turing complete and nice for the compiler to reason about.

I'm a big believer that phenomenal is really effort and solving real-world
pain points integrated over time -- it's a journey! We're intending to do blog
posts as we hit big milestones, so keep an eye out!

~~~
Traster
Do you mind me asking what applications Google uses this for internally? Is
this used in a flow that's ended up in production? Also, what are your
thoughts on integrating optimized RTL blocks?

~~~
learyg
One of the things we have on our short list is "good FFI" for instantiating
existing RTL blocks (and making their timing characteristics known to the
compiler) and building import flows for Verilog/SystemVerilog types. The
latter may be a bit specific to your particular Verilog flow, but we think
there are some universal components you can provide that folks can slot into
their flows as appropriate.

Being able to re-time pipelines without a rewrite is a useful capability.
Although it's still experimental and we're actively building it out, we have
it in real designs with important datapaths.

------
Connect12A22
I love their RISC-V implementation in 500 lines of code:
[https://github.com/google/xls/blob/main/xls/examples/riscv_s...](https://github.com/google/xls/blob/main/xls/examples/riscv_simple.x)

~~~
Traster
It's kind of a good demonstration of the problem with software versus
hardware. Here's the XLS solution (just for one function):

    
    
      fn decode_i_instruction(ins: u32) -> (u12, u5, u3, u5, u7) {
        let imm_11_0 = (ins >> u32:20);
        let rs1 = (ins >> u32:15) & u32:0x1F;
        let funct3 = (ins >> u32:12) & u32:0x07;
        let rd = (ins >> u32:7) & u32:0x1F;
        let opcode = ins & u32:0x7F;
        (imm_11_0 as u12, rs1 as u5, funct3 as u3, rd as u5, opcode as u7)
      }

Here's the SystemVerilog solution:

    
    
      {imm_11_0, rs1, funct3, rd, opcode} <= ins;
    

Obviously, in software, you can't slice data in the same way: as far as I can
tell, it assumes all variables are a certain size, so there's no natural way
of bit slicing.

~~~
learyg
Thanks again for the detailed thought! We actually [developed more advanced
bit slicing syntax](https://github.com/google/xls/blob/1b6859dc384fe8fa39fb901af3de8453661ff345/xls/dslx/interpreter/tests/bit_slice_syntax.x#L32)
since that example was written; you can do things like a standard slice
`x[5:8]` or a Verilog-style "width slice" with explicit signedness, `x[i +:
u8]`. There's currently no facility for "destructuring" structs as bitfields
like pattern matches, but there's no conceptual reason it can't be done; I
think that'd be an interesting thing to prioritize if there's good bang for
the buck. [GitHub issue to
track!](https://github.com/google/xls/issues/131) Let me know if I missed any
details or rationale, thanks!
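
For illustration, the decode function from upthread written with the newer
slice syntax would look roughly like this (an untested sketch; `x[m:n]` takes
bits m through n-1, and the result widths are inferred from the bounds):

    
      fn decode_i_instruction(ins: u32) -> (u12, u5, u3, u5, u7) {
        let imm_11_0 = ins[20:32];  // bits 20..31, inferred as u12
        let rs1 = ins[15:20];       // bits 15..19, inferred as u5
        let funct3 = ins[12:15];    // bits 12..14, inferred as u3
        let rd = ins[7:12];         // bits 7..11, inferred as u5
        let opcode = ins[0:7];      // bits 0..6, inferred as u7
        (imm_11_0, rs1, funct3, rd, opcode)
      }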

~~~
Traster
Hey, thanks for replying, the project looks like it has a lot of potential.
You're right, bit slicing gets you like 99% of the way there (the rest is just
syntax sugar). It's interesting because, from what I remember, there were some
non-trivial issues for people using LLVM as their IR because of fundamental
assumptions in the representation, but bit slicing is the core functionality.
Is there a reason you guys decided to build your own IR?

------
jashmenn
I've been programming for 20 years and yet I have no idea what this does. Can
someone ELI5?

~~~
zelly
Verilog for codemonkeys

~~~
FullyFunctional
That's a complete mischaracterization. The point of any and all HLSes is to
raise the level of abstraction so you can be more productive. Even for highly
skilled Verilog "monkeys", writing in an HLS is a great deal faster and less
error prone (assuming comparable mastery of the language), simply because you
do not need to deal with a lot of low-level details.

The $1M question, however, is how this experience pans out as you try to
squeeze out the last bit of timing margin. I don't know, but I'm eager to find
out.

ADD: this parallels the situation with CUDA, where writing a first working
implementation is usually easy, but by the time you have a heavily optimized
version ...

------
mmastrac
I love this. I did something similar using Java to build RTL:

[https://github.com/mmastrac/oblivious-cpu/blob/master/hidecpu2/src/main/java/com/grack/hidecpu2/CPU.java](https://github.com/mmastrac/oblivious-cpu/blob/master/hidecpu2/src/main/java/com/grack/hidecpu2/CPU.java)

I was thinking about turning it into a full language at some point, but they
beat me to it (and I love the Rust syntax!).

------
jeffreyrogers
This is interesting. Overall I'm bearish on high-level synthesis for anything
requiring high performance, since you typically need to think about how your
code will be mapped to hardware if you want it to perform well, and adding
abstractions interferes with that. I would like to know more about how Google
uses this, since it doesn't seem like a good fit for the type of stuff I work
on.

~~~
typon
This doesn't seem like HLS, more like a new HDL that's based on Rust. This has
been done many times before with other functional languages (Clash, Chisel,
Spinal, Hardcaml and others). These projects never take off because hardware
designers are inherently conservative and they won't let go of their horrible
language (Verilog or SystemVerilog) no matter what.

I'm sure Google will use XLS for their internal digital design work, but I
don't expect this to ever gain widespread support (not because HLS is
inherently bad, but because of the culture).

~~~
Traster
> These projects never take off because hardware designers are inherently
> conservative and they won't let go of their horrible language (Verilog or
> SystemVerilog) no matter what.

This is categorically not true. There have been repeated projects to re-invent
hardware description languages. They don't fail because hardware engineers are
conservative, they fail because they don't produce good enough results.

Intel has a team of hundreds of engineers working on HLS; Xilinx probably has
almost as many; and there are lots of smaller companies working on their own
things, like Maxeler. They haven't taken off because it's an unsolved problem
to automate some of the things you do in Verilog efficiently.

Take this language, for example: it cannot express any control flow. It's
feed-forward only. Which essentially means it is impossible to express most of
the difficult parts of the problems people solve in hardware. I hate Verilog,
and I would love a better solution, but this language is like designing a
software programming language that has no concept of run-time conditionals.
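
To be concrete: DSLX does have counted for loops, but the bounds are
compile-time constants and the body is fully unrolled, so even something like
this (a rough sketch of the syntax) elaborates to pure feed-forward logic - a
fixed chain of four adders, with no runtime branching:

    
      // Unrolls at elaboration time into four adders - no runtime control
      // flow, just a deeper feed-forward datapath.
      fn sum4(xs: u32[4]) -> u32 {
        for (i, acc): (u32, u32) in range(u32:0, u32:4) {
          acc + xs[i]
        }(u32:0)
      }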

~~~
aseipp
I mean, languages like Bluespec are very close to actual SystemVerilog
semantically, and others like Clash are essentially structural by design, not
behavioral (I can't speak for other alt-RTLs). You are in full control of
using DFFs, the language perfectly reflects where combinatorial logic is done,
the mappings of DFFs or IP to underlying RTL and device primitives can easily
be done so there's no synthesis ambiguity, etc. In the hands of an experienced
RTL engineer you can more or less exactly understand/infer their logic
footprint just from reading the code, just like Verilog. You can do Verilog
annotations that get persisted in the compiler output to help the synthesizer
and all that stuff. Despite that, you still hear all the exact same complaints
("not good enough" because it used a few extra LUTs due to the synthesizer
being needy, despite the fact that RTL people admit to spending stupid amounts
of time pleasing synthesizers already). Dyed-in-the-wool RTL engineers are
certainly a conservative bunch, and cagey about this stuff no matter what;
it's undeniable.

I think a bigger problem is things like tooling, which is deeply invested in
existing RTLs. High-end verification tools are more important than the
languages themselves, but they're also very difficult to replicate, extend,
and acquire. That includes simulation, debuggers, formal tools, etc.
Verification is where all the actual effort goes, anyway. You make that
problem simpler, and you'll have a winner regardless of what anyone says.

You mention Intel's and Xilinx's software groups, but frankly I believe
they're a good example of the bigger culture/market problem in the FPGA world.
FPGA companies desperately want to own every single part of the toolchain in a
bid for vertical integration; in theory it seems nice, but in practice it
sucks. This is the root of why everyone says Quartus/Vivado are shitware,
despite being technically impressive engineering feats. Intel PSG and Xilinx
just aren't software companies, even if they employ a lot of smart
programmers. They aren't going to be the ones to encourage or support
alternative RTLs, deliver integrated tools for verification, etc. It also
creates perverse incentives where they can fuel device sales through the
software. (Xilinx IP uses too much space? Guess you gotta buy a bigger
device!) Oh sure, Xilinx _wants_ you to believe that they're uniquely capable
of delivering P&R tools nobody else can (the way RTL engineers talk about the
mythical P&R algorithms, you'd think Xilinx programmers were godly superhumans
or were getting paid by Xilinx themselves), and that revealing chip details
would immediately mean their designs would be copied by Other Electronics
Companies and they would crumble overnight, despite the literal billions you
would need up-front to establish profitability and a market position, and so
on. The ASIC world figured out a long time ago that controlling the software
just meant the software was substandard.

------
thotypous
Google is also investing some developer time in Bluespec since it was
open-sourced ([https://github.com/B-Lang-org/bsc](https://github.com/B-Lang-org/bsc)).
I wonder if these projects are part of a bigger plan at Google.

------
rbanffy
When I started playing with MAME, I somewhat dreamed of a way to turn its
highly structured code into something that could not only be compiled into an
emulator as it is, but also synthesized into hardware.

The possibility of using a single codebase to generate both a software
emulator and a hardware implementation is incredible from a hardware
preservation point of view.

------
asdfman123
If they rename it XLSM they can embed some neat VBA scripts into it and
squeeze out more functionality.

(I'm sorry.)

------
w_t_payne
I've got a Kahn-process-network-based "simulation" framework, intended to
provide a smooth conveyor belt of product maturation, from prototypes written
in high-level scripting languages like Python or MATLAB through to production
code written in C or Ada. (Sort of like Simulink, but with a different set of
warts.) Having some hardware synthesis capability is very much on the roadmap,
and this looks like it's going to be worth investigating for that. Very
excited to dive into it!

------
ampdepolymerase
Reminds me of the old reconfigure.io, which took the ideas and syntax of Go's
CSP and transformed them into async HDL code. Unfortunately, the startup has
been shuttered.

[http://docs.reconfigure.io/](http://docs.reconfigure.io/)

~~~
navidr
What happened to them?

~~~
ampdepolymerase
They shut down. Here's the founder:
[https://twitter.com/robtaylor78](https://twitter.com/robtaylor78)

~~~
navidr
Thanks. Do you know the reason?

------
simonw
XLS as an acronym for Accelerated HW Synthesis is a bit of a stretch!

~~~
dirtypersian
I believe it might come from the fact that this process of going from a
high-level programming language to hardware is called "high-level synthesis".
I think the "X" is meant to make it more generic, i.e. "X-level synthesis".

~~~
simonw
That makes sense. Accelerated => XL just about works for me.

------
rowanG077
DSLX seems like a nightmare. Does it support arbitrary C++?

------
R0b0t1
See also
[https://github.com/SpinalHDL/SpinalHDL](https://github.com/SpinalHDL/SpinalHDL).

