Hacker News
Ask HN: Asynchronous FPGAs and flowchart programming
63 points by JaCaLet on May 13, 2022 | 52 comments
Hello everyone! I've been working to make an FPGA run asynchronously. I think this will be the fastest way to compute.

I interned at You Know Solutions and learned the flowchart programming environment they use. Now they have a new technology patented and I'm trying to help realize its potential. The flowchart programs are asynchronous by design and can create parallel computations. I've been trying to reproduce a flowchart program on an FPGA.

Does anyone use flowchart programming anymore? Has anyone used a FPGA to run parallel processes or asynchronously?




Parallel, yes: as other people mentioned, this is almost the entire point of using FPGAs. Regarding asynchronous, it depends on what you mean. Xilinx (AMD) and Altera (Intel) FPGAs are designed from the bottom up to be synchronously clocked. The fabric and tools are designed to use synchronous pipeline registers everywhere to minimize combinatorial delay and increase throughput. You might want a design with multiple asynchronous clock domains, but this increases complexity and requires care whenever you cross between clock domains. Trying to force an asynchronous design into an FPGA seems counterproductive. What would be the advantage of asynchronous design?


There are some advantages.

For example, on the Alpha AXP they measured that 60% of the energy spent in the device was due to clock propagation. No clocks to tick - no energy spent. Why do we even need to clock the FPU? Or the bus, if we are in a loop that fits in cache?

Another example: in an async design, a ripple-carry adder will exhibit O(log N) expected time, with a worst case of O(N), and most of the time it will be even less, O(log L), where L is the number of non-zero bits. Basically, adding 1 will be as fast as doing an AND and an XOR in parallel. For a clocked design you need to make the adder more complicated (e.g. carry-lookahead) to make sure the worst case is O(log N).
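The expected-time claim is easy to check with a quick Monte Carlo sketch (Python, illustrative only): the settle time of an unclocked ripple-carry adder is bounded by the longest run of consecutive carries, and for random 64-bit operands that averages near log2(64), far below the 64-bit worst case.

```python
import random

def longest_carry_chain(a: int, b: int, bits: int = 64) -> int:
    """Longest run of consecutive carry-outs when adding a + b,
    a proxy for how long an unclocked ripple-carry adder must settle."""
    carry = run = longest = 0
    for i in range(bits):
        x, y = (a >> i) & 1, (b >> i) & 1
        carry = (x & y) | (carry & (x ^ y))   # generate or propagate
        run = run + 1 if carry else 0
        longest = max(longest, run)
    return longest

random.seed(42)
n = 10_000
avg = sum(longest_carry_chain(random.getrandbits(64), random.getrandbits(64))
          for _ in range(n)) / n
print(f"average longest carry chain: {avg:.2f} of 64 bits")
```

The worst case (all-ones plus one) still ripples through all 64 bits, which is exactly what a synchronous design must budget its clock period for.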

The same is true for other parts as well: a multiplier may not even need to wait for values multiplied by zero bits. You may end up with an O(log N), or even faster, average-case multiplier.

Your design does not need strict adherence to global timing requirements: if you have a seldom-used slow part, your chip will still work fast most of the time (on average). I know of one case where the clock frequency of a synchronous design had to be turned down because of placement problems in a, you guessed it, infrequently used part of the chip: a long bus line to some I/O controller that operated at the main clock frequency. This means your asynchronous design can be more modular.


> No clocks to tick - no energy spent

If only it were that simple. Logic gates take time to settle, and each input gate switch or transient will have a ripple effect on all its downstream gates, which can be many in a complex circuit. Synchronous logic elements such as latches will block the spurious transients from propagating beyond the next clock barrier, but if you lack those, you also lose the protection against propagating logic transients. And every transient draws a little bit of power.

Imagine the ripple effects in a 64-bit 2-operand multiplier (simple ripple-carry, as it's the easiest to reason about). Since the inputs are probably not gated either, each of the 4096 adder-tree inputs may arrive at a different time, and each input has an average of 96 downstream gates (64/2 adder-tree height, 128/2 carry propagation length). The carry propagation is done through AND gates, which have an attenuating effect on the propagation length (each input bit flip only has a 50% chance of propagating the change), but the XOR gates for the adder propagate every transient. On average, you still get 64 transients per adder input transient, and 2048 (64 and-gates * 50%) transients for every operand bit flip. That's a lot to account for in your worst-case power envelope.
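The attenuation difference between the two gate types can be illustrated with a toy model (Python, my own illustrative numbers, not the multiplier above): a single input flip enters a chain of gates whose other inputs are random. XOR gates pass every transient; each AND gate masks it with 50% probability.

```python
import random

def avg_gates_glitched(gate: str, depth: int = 64, trials: int = 20_000) -> float:
    """Average number of gates in a chain disturbed by one input flip.
    Each gate's second input is an independent random bit."""
    random.seed(0)
    total = 0
    for _ in range(trials):
        for _stage in range(depth):
            if gate == "and" and random.getrandbits(1) == 0:
                break            # AND with a 0 side input masks the flip
            total += 1           # XOR always passes the flip; AND with 1 does too
    return total / trials

print("xor chain:", avg_gates_glitched("xor"))   # every stage glitches: 64.0
print("and chain:", avg_gates_glitched("and"))   # geometric attenuation: about 1
```

This is why adder trees (XOR-heavy) dominate the transient count while carry chains (AND-gated) die out geometrically.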

Yes, asynchronous designs are more flexible to work with. But they are less predictable in operation, not just in propagation delay but also in power usage. And you still need some form of inter-module communication, and that communication needs to account for differences in signal path length -- which is much easier to do if you can refer to a global clock.

I'm sure there have been successful asynchronous designs for specific applications (e.g. analog feedback control loops), and I haven't kept up with the last ten years of IC development which is a lifetime, but most asynchronous logic designs weren't necessarily faster than their synchronous implementations last time I checked.


Contemporary inter-module designs are pipelined and message-oriented exactly because it is hard to predict differences in signal path length over long paths. I am talking about high-speed buses from ARM; I think I read about them around 2016.

The same can be done with asynchronous designs, in a more relaxed way.

You said that asynchronous designs are less predictable in their use of power. Can you elaborate on that?


> The same can be done with asynchronous designs, in a more relaxed way.

Sure, just ask these guys:

https://chronostech.com/technology

Chronos Link: A QDI Interconnect for Modern SoCs https://ieeexplore.ieee.org/document/9179196

It's compatible with TileLink, which is SiFive's Fabric. https://bar.eecs.berkeley.edu/projects/tilelink.html


Another advantage is higher yield, due to higher tolerance to production defects.


This implies yield loss is mostly due to small delay defects and not stuck-at faults. Are you sure this is the case?


> What would be the advantage of asynchronous design?

Just the regular advantages, only with an FPGA, which means one can choose how logic elements are interconnected and what the logic of the chip is. Among the regular advantages are the absence of clocks (fewer devices, no need to synchronize...) and the fact that energy is used only when and where switching happens.

A friend of mine unsuccessfully tried to squeeze asynchronous designs into some mainstream FPGA a few years ago. The tooling wasn't cooperative, and when he used workarounds to avoid generating clocks, it simply crashed. I don't think it's hopeless, and not for lack of trying - but asynchronous circuits in FPGAs are certainly not common.


> choose how logical elements are interconnected

On the RTL level, you can already do that with FPGAs. On the physical level, you can't do that with an asynchronous design either.

> absence of clocks (fewer devices

The clocks are still there physically and consume space, even if you don't use them.

> no need to synchronize

Synchronization becomes very easy when the clocks are aligned and the frequencies are multiples of each other. FPGAs have delay elements in the clock blocks to help with the alignment.

> energy is used when and where the switching happens.

There are several points of energy use:

* The clock network - you are right about this. Does anyone know how much of the total energy use goes into the clock network?

* Registers and downstream logic - these behave the same, whether clocked synchronously or asynchronously. A register that doesn't "flip" will not consume energy for that, and the downstream logic will not flip either.

* Whatever the asynchronous logic needs for coordination - don't forget that this is not free.

Analyze energy consumption first, before jumping to conclusions, let alone countermeasures. The whole energy topic reeks of premature optimization.


I think I see where you are coming from. There certainly could be some power reduction if you limit the amount of switching. It's just combinatorial logic, so there shouldn't be any tool issues. The real challenge would be verification. Usually, timing constraints drive a static timing analysis of your design that attempts to guarantee everything will work across worst-case temperature and process variations. Without timing verification there would be a lot of uncertainty in the actual path delays, so just because the design works on one device that was tested doesn't guarantee it will work consistently. There would be a ton of glitches and phantom pulses to contend with, and every time you change something, the routing delays change! But maybe you have a method to deal with this.


The combinational part of an async design is built to be self-synchronizing: you derive a clock signal to write the computed value from the computed signals themselves.

The combinational part is also synthesized as a monotone function without ringing: voltages there never go down after they have gone up during a computation, and they never go down, then up, then down again when the computation is reset.

This means that timing guarantees can be local, related only to the parts next to concrete registers.
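A common way to get that "write clock derived from the data itself" is dual-rail encoding with completion detection. Here is a behavioral Python sketch (names are mine; a real QDI implementation builds this from C-elements, not if-statements): each bit travels as a (true, false) rail pair, with (0, 0) meaning "not yet computed", so rails only ever rise during a computation and "done" falls out of the data encoding.

```python
EMPTY = (0, 0)   # neither rail high: value not yet computed

def encode(v: bool):
    """Dual-rail encoding of a settled boolean: exactly one rail high."""
    return (1, 0) if v else (0, 1)

def dr_and(a, b):
    """Behavioral dual-rail AND: output stays EMPTY until both inputs
    have arrived, then exactly one rail rises (monotone, no glitches)."""
    if a == EMPTY or b == EMPTY:
        return EMPTY
    at, _af = a
    bt, _bf = b
    return encode(bool(at and bt))

def done(x) -> bool:
    """Completion detection: the output itself signals it is valid."""
    return x != EMPTY

assert dr_and(encode(True), encode(False)) == encode(False)
assert not done(dr_and(encode(True), EMPTY))   # still waiting for an input
```

The point is that `done` over all output bits plays the role the clock edge plays in a synchronous design, with no timing assumption beyond the monotone rails.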

Usually, asynchronously designed chips work in the first batch. They also often work underpowered, when the supply voltage is slightly lower than the nominal switching voltage - because that voltage is set so a typical transistor works at the required speed. Async designs are usually much less speed-dependent and can work while "officially underpowered".


Yes, makes sense. I can see how that could be beneficial in some situations.


You can clock FPGA FFs from non-clock signals in Xilinx/AMD FPGAs. Not sure how well it scales, but it's possible.


LabVIEW provides a dataflow programming environment that can be used to program FPGAs (e.g. in the CompactRIO line of hardware).

Not saying it's great (it's not), but it works in its niche: programming of one-off test fixtures by people who really have no idea about programming, digital circuits, or FPGAs but might need the performance they afford (e.g. real-time, high-frequency control of a 100K+ RPM prototype gas turbine). You also have to be willing to shell out for their hardware (and LabVIEW itself), but in the testing world I've found their hardware to actually be on the "pretty affordable for what you get" side, compared to the likes of HBM eDAQ or Siemens LMS setups.


Async adds a lot of overhead to propagate readiness information alongside every data path and calculate it through every piece of logic.

Unless things have changed drastically in the last decade and a half (or the professor in that class was wrong), it's way more efficient to just precompute all that and shift logic between pipeline stages so everything lines up as closely as possible against a shared clock signal.


"Way more efficient" as a conceptual and implementation model.

But synchronous logic always leaves some speed on the table, since you have to choose the global clock rate for the longest data path that must complete in one clock; for any given path that may be active on a clock, it's likely to not be the longest.

In practice it's possible to architect synchronous critical paths with pipelining to keep the worst single-clock data path reasonable: basically, spread the work over multiple clock periods so you can select a faster clock rate, reducing the wasted time on average to the point nobody cares.
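That tradeoff fits on the back of an envelope (Python; the delay numbers are made up for illustration): the clock period is set by the slowest stage plus register overhead, so cutting one long combinational path into stages buys clock rate at the cost of latency in cycles.

```python
def min_period_ns(stage_delays, t_reg=0.5):
    """Minimum clock period: the slowest stage's combinational delay
    plus register setup + clock-to-Q overhead (t_reg, in ns)."""
    return max(stage_delays) + t_reg

flat  = [12.0]              # whole computation in one clock period
piped = [4.5, 4.0, 3.5]     # same logic cut into three pipeline stages

for name, stages in [("flat", flat), ("pipelined", piped)]:
    period = min_period_ns(stages)
    print(f"{name}: period {period} ns, "
          f"f_max {1000 / period:.0f} MHz, latency {len(stages)} cycles")
```

Note the pipelined version's total latency (3 x 5.0 ns) is actually worse than the flat one's 12.5 ns; the win is throughput, one result per 5 ns instead of per 12.5 ns.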

OP sounds like he stands zero chance of implementing any nontrivial async design if he's talking about 'flowchart' RTL generation in the same breath.


You forgot that the clock period is computed for worst-case register and combinational-logic behavior, e.g. "99.9999% of all registers need this amount of time for the signal to be stable before a write", "99.9999% of instances of this gate in real silicon will have this delay".

The chart I saw several years ago put about half of typical clock cycle delay into this "reserve time" part. I guess things did not change much since then.


It depends what you are doing, but with async it's possible to synthesize SR latches that don't have these setup/hold and metastability problems: the output of the logic directly strobes the set or reset of the storage element.


FPGA is literally as parallel as you can make a computation, the ultimate jaunt towards the space end of the time/space tradeoff. Don't like waiting two cycles for your ALU to finish working on previous data? Put another ALU right next to it.

So yes, you can totally make a parallel program on an FPGA. As long as there isn't a data or control dependency between two statements, they can be implemented to execute simultaneously.

As far as flowchart programming, I'm not sure what advantages that would confer over existing HLS tooling.


The flowchart programming is built to be parallel. It’s inherent to the ordering of execution.


"Fastest way to compute"

This isn't necessarily true, especially considering the architecture of an FPGA. You have no control over the routing of the circuit and you're extremely restricted by the tools (which have decades of work towards synchronous circuits). More often than not, a synchronous circuit will end up being faster and more practical (there's a lot of overhead for async as well).

Another issue is that a lot of fundamental asynchronous primitives, like the Muller C-element and latches, aren't really feasible to implement (easily) on an FPGA. The C-element requires a feedback loop on the LUT, which is really hard to constrain properly, and the tools will fight you for doing that.

There's a cryptography paper out there comparing synchronous and asynchronous implementations of ciphers, and the conclusion was that the synchronous one was easier to implement and had higher throughput.


There's a lot of interesting research out there, as designers have been toying with asynchronous logic for decades.

For example, this one, sponsored by Intel, where they put an asynchronous instruction-length decoder into a Pentium.

https://my.eng.utah.edu/~kstevens/docs/rappid.pdf

They won on latency and power with comparable area. The issue that blocked it was that the DFT CAD tooling and ATE infrastructure needed for asynchronous designs don't exist.


You may want to look up asynchronous logic - it's more complicated than you realize. In fact, you'll quickly understand why 99% of all digital designs use synchronous logic design instead: it has a far smaller gate count for the same function.

There are legitimate places where asynchronous logic can be very useful: specifically when you are interacting with the "real world" which is not synchronous. But once you get beyond that, going back to synchronous design is usually best.

http://www2.imm.dtu.dk/pubdb/edoc/imm855.pdf

https://www.researchgate.net/publication/245530456_Asynchron...

https://www.researchgate.net/publication/331181568_Asynchron...


Before you go too far down this path you should look into previous works and understand why they failed. There’s a ton of information out there on this. Here’s something to get you started:

https://www.eetimes.com/startups-try-to-revive-null-conventi...


And ARM had clockless processor prototypes in 2006; I remember learning about them at university. https://www.eetimes.com/arm-clockless-core-cuts-power-to-abo...


"I interned at You Know Solutions and learned the flowchart programming environment they use. "

And what do they use? How does it work, do you click your flowcharts together?


Their patent seems to mention flowcharts:

https://www.freepatentsonline.com/9003383.html


I am familiar with the You Know Solutions patents, and the one you refer to is for parallelizing C code and other languages into flowcharts. Here is a link to the asynchronous design patent, "Processing circuits for parallel asynchronous modeling and execution": https://www.freepatentsonline.com/10181003.html


Interesting, but not great news for me, since I work on something related and was not aware of any patents in that area. It should be distinct enough, I hope. (I don't target FPGAs, for example, but I have zero experience with patents, so let's see how that works out.)


Yup, the flowcharts are the key to parallelizing the code.


It’s similar to ladder logic. It’s called FlowPro. You can combine flowcharts together; that’s easy.


It's not really similar to ladder logic, but in those days it did the same thing: controlled machines. In ladder logic, all of the rungs of the ladder are evaluated all of the time, but with flowcharts only the necessary part of each flowchart is executed as the machine cycles.


First of all, Google is your friend: search for Muller C-element, asynchronous communication.

As a starter, have a look at these papers:

http://www2.imm.dtu.dk/pubdb/edoc/imm7126.pdf

https://essay.utwente.nl/79740/1/YADAV_MA_EEMCS.pdf


Not sure if you are actually talking about clockless logic. Maybe you are talking about asynchrony at a higher level of granularity.

But in fact there was a company founded to make FPGAs based on clockless logic: Achronix. They found that their customers wanted to map clocked designs onto their FPGAs, and they don't make any noise about clockless anymore - possibly their designs still use it under the hood, possibly not.


Yes, clockless logic, no handshaking.


Even the most trivial design will need some form of synchronization which implies handshaking of some kind.

Asynchronous design is a really interesting field where it's pretty easy to get wins at the circuit level, but it's much harder to win at the system level. Especially when you realize there is no rule that says the system needs only one clock domain, and the period of those domains doesn't actually have to be constant.

I highly recommend you spend some time with a recent overview in the field if you're serious about it. Here is a good one:

http://www.cs.columbia.edu/~nowick/nowick-singh-async-IEEE-D...


What handshaking? Never heard of that before.


Looks like you want to implement "asynchronous circuit": https://en.wikipedia.org/wiki/Asynchronous_circuit

These basically need handshaking logic for every independent data path.
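For reference, the simplest such scheme is a four-phase (return-to-zero) req/ack handshake per channel. A minimal event-level trace in Python (illustrative model only; real implementations pair this with bundled data or a QDI encoding):

```python
def four_phase_channel(items):
    """Trace of a four-phase req/ack handshake moving items one at a
    time from sender to receiver. Each transfer costs four signal edges."""
    trace, received = [], []
    for data in items:
        trace.append("req+")        # 1. sender: data is valid, raise req
        received.append(data)       # 2. receiver latches the data...
        trace.append("ack+")        #    ...and raises ack
        trace.append("req-")        # 3. sender releases req
        trace.append("ack-")        # 4. receiver releases ack; channel idle
    return trace, received

trace, received = four_phase_channel(["A", "B"])
print(received)                     # ['A', 'B']
print(len(trace), "edges")          # 8 edges: four per item
```

Two-phase (transition-signaling) variants halve the edge count per transfer, at the cost of more complex receiver logic.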


Take a look at the You Know Solutions patent. It doesn't use handshaking, and the design can be clockless. https://www.freepatentsonline.com/10181003.html


The main issue with the comments is that people are mixing terms without knowing it.

To many people, a single instruction executed on a CPU is an atomic event. This is not the case for a circuit designer (and FPGAs are closer to circuit design, since technically what you're doing is configuring them). For us, an instruction on a CPU is a sequence of many smaller events, sometimes happening in parallel, which all need to be properly ordered to get a correct result. The most basic example is adding two multi-bit numbers, as in tremon's earlier comment: how does the next circuit know that all the bits in the result are ready to be consumed? To us those are parallel processes too, and we synchronize them. Sometimes by design (i.e. this process is guaranteed to complete before the next tick) and sometimes with a separate handshaking circuit. But no matter what, there is always some form of synchronization present in the machine itself.


If you step back and look at a flowchart, or thousands of flowcharts representing parallel tasks, I think the object of the patent is to get those flowcharts to propagate (i.e. execute on their own) without a processor. The propagation flow is always forward (not requiring a handshake) until a loopback is reached on the flowchart. A new propagation then begins at the loopback's destination block. The new propagation flow may or may not follow the same flowchart path, depending on decision events. Synchronization takes place at the flowchart level, not at the circuit level: to synchronize, one flowchart sets a variable and other flowcharts can test that variable and decide what to do. The flowcharts are the code, and they synthesize directly to action, test, and task objects without Boolean or state-machine structures. These structures (circuits) are synchronous when the flowcharts are implemented in a standard FPGA, but the flowcharts themselves remain asynchronous. The patent mentions an FPFA (field programmable flowchart array) that would use clockless circuitry.


Is the flowchart system different from a transition system? https://en.wikipedia.org/wiki/Transition_system

If not, I don't know of a way to make that machine without some timing assumptions. https://authors.library.caltech.edu/26721/2/postscript.pdf

Maybe other people do though...


Yes, it is: transition systems are based on state, while flowcharts are stateless (but can easily be made stateful). I'm not a PhD, but here is a PhD who states this, although I don't agree with all of his conclusions. http://www.stateworks.com/technology/TN9-Flowchart-is-not-St...

That's the point of asynchronous flowchart programming. Everything is event-driven unless timing is specifically specified on the flowchart, and synchronization takes place on the flowchart. Flowcharts do not represent the flow of time; they represent the flow of events. Timing closure then becomes ensuring that every atomic path on every flowchart meets a system throughput requirement. The flowcharts are partially ordered according to an algorithm that follows the flowchart lines, which ensures that these partial orders (atomic paths) are pipelined.


You might want to ask Maya Posch about her experiences. See: https://mayaposch.wordpress.com/category/programming/vhdl/


You should check out GreenArrays. Their G144A12 is an amazing little asynchronous chip. It's not an FPGA though.


I am aware of these guys, and they have been around for years. The only similarity is that they have an asynchronous chip, but it is implemented in the old classic asynchronous design style, with specialty circuits for their language. The language they use is Forth, a stack-based language.


Isn’t the purpose of an FPGA to run parallel processes? Parallel meaning two actions taking place at the same time.


Not necessarily. It's just that CPUs are better at sequential work, so you'd only really use an FPGA if you had grossly parallel plans. And GPUs are better at certain sets of parallel operations.

So you'd only use an FPGA if you needed to do something GPUs and CPUs couldn't do, especially because of how cheap CPUs / GPUs are.


FPGAs can be low cost too. Another reason to use them is when you need a ton of IO, or my preferred application, real time control systems. In many cases I find it’s easier to get very precise, and predictable, timing in an FPGA.


Thank you sir/ma’am for the explanation!




I know of people who built a microprocessor in Simulink.



