A Simulated Annealing FPGA Placer (stefanabikaram.com)
103 points by stefanpie 3 months ago | 35 comments



Simulated annealing has been useful for decades. My undergrad VLSI project was to lay out a 32-bit CRC chip and test it with SPICE (an analog simulation tool). To get the capacitance down, you need short wires. But a CRC-32 has many, many interdependencies, and laying one out by hand would take a very long time.

So my partner wrote the Verilog code for the CRC and I wrote a C++ program that attached virtual springs to each gate, iteratively moving gates closer together if they were far apart. At first, the movements are dramatic, but over time, the total length of all the wires converges to an asymptotic limit that is much better than the starting point.
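For anyone curious, the core of the spring idea can be sketched in a few lines (a toy Python version, not our original C++; the gate counts and step size are made up, and in a real flow the pads would be pinned so everything doesn't collapse onto a point):

    import random

    # Toy sketch of the virtual-spring idea (not the original C++ code): every
    # connection acts like a spring, and each pass nudges a gate toward the
    # centroid of the gates it connects to, shrinking the total wire length.
    def spring_place(num_gates, edges, iterations=200, step=0.3):
        pos = {g: [random.uniform(0, 100), random.uniform(0, 100)] for g in range(num_gates)}
        neighbors = {g: [] for g in range(num_gates)}
        for a, b in edges:
            neighbors[a].append(b)
            neighbors[b].append(a)
        for _ in range(iterations):
            for g, nbrs in neighbors.items():
                if not nbrs:
                    continue
                cx = sum(pos[n][0] for n in nbrs) / len(nbrs)
                cy = sum(pos[n][1] for n in nbrs) / len(nbrs)
                pos[g][0] += step * (cx - pos[g][0])   # relax the springs a little
                pos[g][1] += step * (cy - pos[g][1])
        wirelength = sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                         for a, b in edges)
        return pos, wirelength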

Once we had a gate layout implemented in an obscure EDM language, we were able to bring it into SPICE, and damn right it worked the first time! I think the professor was somewhat mystified why we didn’t just build a simple 4-bit adder instead, but spending 50 hours on this project was a lot more fun than doing a hand layout.


And here is the layout: https://www.dropbox.com/scl/fi/ksxru0lmnp0oha356fdby/layout....

With apologies, it was a CRC-8. Still so many nodes and so many wires…


Wow, that looks complicated. I ended up implementing the 32-bit Ethernet CRC in Verilog for use on a Kintex-7 a few years ago, but I never looked at the mess of wiring it ended up generating. The code had to work with an arbitrarily wide bus, since the PCIe core required a 256-bit-wide bus to close timing at an achievable clock speed while still fully utilizing PCIe x4 gen 2.
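For reference, the wide-bus behavior boils down to something like this bit-serial Python model (not my Verilog; the bus width and message are just examples). One "clock cycle" consumes bus_bits input bits, and the hardware simply unrolls the inner loop into a big XOR network:

    import zlib

    POLY = 0xEDB88320  # reflected IEEE 802.3 CRC-32 polynomial

    # Bit-serial reference model of the Ethernet CRC-32 that consumes `bus_bits`
    # input bits per "clock cycle"; a wide-bus Verilog implementation is the
    # inner loop unrolled into combinational logic.
    def crc32_wide(data: bytes, bus_bits: int = 256) -> int:
        bits = [(byte >> i) & 1 for byte in data for i in range(8)]  # LSB-first, as transmitted
        crc = 0xFFFFFFFF
        for start in range(0, len(bits), bus_bits):
            for bit in bits[start:start + bus_bits]:                 # one bus beat
                crc = (crc >> 1) ^ (POLY if (crc ^ bit) & 1 else 0)
        return crc ^ 0xFFFFFFFF

    msg = b"123456789"
    assert crc32_wide(msg) == zlib.crc32(msg)  # matches the standard check value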


Years ago I figured out how to get the best first approximation ... borrow a gym for a day, make 1000s of metal plates with hooks and a bar code, one for each gate, have a robot take rubber bands and use the netlist to hook up the plates. Then pick up the whole thing by the pads and give it a bit of a shake, lay it all carefully back down and have the robot go around and read the position of each bar code ....


I used simulated annealing in my previous work doing molecular dynamics; it was great for not getting your protein trapped in a local minimum, which gradient descent is prone to doing. I got excellent results for very little work.

I asked my intern, who was knowledgeable in deep networks as well as molecular stuff, "it looks like ML training mainly does gradient descent, how can that work, don't you get stuck in local minima?" and they said "loss functions in ML are generally believed to be bowl-shaped" and I've been wondering how that could be true.

It's interesting to read up on the real-world use of annealing for steel; it's striking how much you can change steel's properties through heat treatment. Want it really strong? Quench it fast; that will lock it into an unstable structure that's still strong. Cool it slowly and it will find a more stable minimum and be more ductile.


Their explanation is relatively incomplete, I think. The best explanation I can come up with for why gradient descent works is that the optimization space is extremely high-dimensional, which tends to force functions that would look weird in low dimensions to be more well-behaved. Points taken randomly in high-dimensional space also tend to be closer together.

The numerical benefits of dimensionality are my belief as to why the "stack more layers" crowd has been so successful.


High dimensionality allows gradient descent to make progress independently in each dimension. When you get stuck in one dimension, you make progress in another until you can move again in the previously blocked one. Also, even if you do get stuck in X dimensions, you are unlikely to significantly reduce that X by finding a better starting point, and if there is redundancy, it doesn't matter that you hit a local minimum in one dimension but not another. The end result is that the quality differences between the local minima end up insignificant: they are all sufficiently good compared to the global minimum.


Modern DL architectures are complicated. If you want to gain insight into whether a loss function is convex (bowl-shaped), you can compute the matrix of second derivatives (the Hessian) at arbitrary points and test that all its eigenvalues are non-negative. For a simple linear regression that's true, but modern DL architectures are not always convex, and of course computing Hessians in very high dimensions is unrealistic. A simpler numerical test is to take the line segment connecting two arbitrary points in parameter space and check that the value of the loss along that segment never exceeds the straight line (chord) interpolating the loss values at the two endpoints. It is also hard to make this test rigorous in a practical high-dimensional setting, but one could get lucky and, e.g., connect two local minima to show that the function is not convex.
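A rough numerical sketch of that line test (the toy 1-D "loss" here is made up just to show the mechanics):

    import numpy as np

    # Sketch of the line test described above: sample the loss along the segment
    # between parameter vectors a and b and compare against the chord through
    # f(a) and f(b). Returning False disproves convexity; True proves nothing.
    def looks_convex_along_line(loss, a, b, samples=50):
        lams = np.linspace(0.0, 1.0, samples)
        chord = lams * loss(b) + (1 - lams) * loss(a)
        values = np.array([loss(lam * b + (1 - lam) * a) for lam in lams])
        return bool(np.all(values <= chord + 1e-9))

    w_shaped = lambda x: float((x[0] ** 2 - 1.0) ** 2)   # made-up non-convex "loss"
    print(looks_convex_along_line(w_shaped, np.array([-1.0]), np.array([1.0])))  # False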

The practical reason people haven't used annealing on DL problems at scale is that the simpler, faster methods already give satisfactory empirical results.


Does this actually result in a routable design when used on a real FPGA?

Optimizing for the lowest value of a distance metric isn't necessarily going to be ideal - a highly compact placement (like the results shown) is going to require a lot of wires to pass through the center of the design. Some FPGAs may not have sufficient routing resources to support this, especially if there are many wires that cross over the center without interacting with it.
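For reference, the distance metric placers typically minimize is per-net half-perimeter wirelength (HPWL); a minimal sketch with a made-up net structure, which indeed says nothing about how congested the middle of the device gets:

    # Minimal sketch of the half-perimeter wirelength (HPWL) cost placers usually
    # minimize; `nets` maps each net to the (x, y) locations of the cells it
    # connects (made-up data structure, not the post's code).
    def hpwl(nets):
        total = 0
        for pins in nets.values():
            xs = [x for x, _ in pins]
            ys = [y for _, y in pins]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total

    print(hpwl({"n1": [(0, 0), (3, 2)], "n2": [(1, 1), (1, 4), (5, 1)]}))  # 5 + 7 = 12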


My understanding of simulated annealing is that solutions that are not improvements are still accepted with some probability in early steps, but that this probability decreases as the "temperature" drops. Looking at your description (but not the code), I did not see that aspect; it looked like you only accept improvements of the cost function. Is that correct, or where does your solution accept slight regressions with some probability, too?


Based on the other comments, they are correct: when doing annealing you usually want to accept some moves that don't improve (or even regress) the objective, in order to escape local minima early on.

I abused the definition of annealing a lot in the post but I briefly touched on the idea:

"At first, you might want to make moves or swaps over large distances or you might want to accept some percent of moves that don't improve the objective, but as time goes on, you want to make smaller moves and be less likely to select moves that don't improve the objective. This is the "annealing" part of simulated annealing in the context of FPGA placement."

I think I might have made the writing confusing because I mixed the original definition of the annealing approach (accepting moves that don't improve the objective) with the use of "annealing" for other things like action parameters (e.g., the swap distance between two nodes). Something I should edit to clarify better.

Note that, yes, the thing I implemented doesn't do any annealing but rather just picks actions that improve the objective. I am working on some extensions to add real annealing, but that turned out to involve a lot of in-depth technical work that is not obvious.
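For reference, the standard Metropolis acceptance rule that real annealing would add looks roughly like this (a sketch, not the code in the repo):

    import math
    import random

    # Standard Metropolis acceptance rule: worsening moves are accepted with
    # probability exp(-delta / T), so regressions are common while T is high
    # and rare once it has cooled.
    def accept(delta_cost, temperature):
        if delta_cost <= 0:          # improving or neutral move: always take it
            return True
        if temperature <= 0:         # zero temperature degenerates to greedy hill climbing
            return False
        return random.random() < math.exp(-delta_cost / temperature)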


Yes, that's the annealing part. Otherwise it's just local search


In the "Simulated Annealing" section:

"At first, you might want to make moves or swaps over large distances or you might want to accept some percent of moves that don't improve the objective, but as time goes on ...

However, as it turns out, you technically don't need this annealing part to make FPGA placement work. You can just randomly try different moves and accept or reject them based on whether they improve the objective function. This is what I did in my toy implementation of an FPGA placer just to keep it simple."


The argument for annealing in the original paper is that accepting regressions is essential to escape from local minima.

https://www2.stat.duke.edu/~scs/Courses/Stat376/Papers/Tempe...

"Annealing, as implemented by the Metropolis procedure, differs from iterative improvement in that the procedure need not get stuck since transitions out of a local optimum are always possible at nonzero temperature. A second and more important feature is that a sort of adaptive divide-and-conquer occurs. Gross features of the eventual state of the system appear at higher tempera-tures; fine details develop at lower tem-peratures. This will be discussed with specific examples."


Yes, without a temperature and cooling schedule, how can it be annealing? It's in the name. It may sound harsh, but I'd call it an abuse of the term to do hillclimbing but call it annealing. It also seems lazy, since doing it right is an all but trivial addition to the code. Finding the best cooling schedule might require some experimentation though.


It's odd that in a field as important as optimization we still think in terms of "escaping from local minima". Also (as a total outsider) the progress in general optimization algorithms/implementations appears to be very slow (I was shocked how old ipopt is). I was wondering whether all the low-hanging inductive biases (for real-world problems) have already been exploited, or whether we just have no good ways of expressing them. Maybe learning them from data in a fuzzy way might work?


Unless you come up with some new take on the P ?= NP problem, there isn't much we can improve in generic optimization.

There are all kinds of possibilities for specific problems, but if you want something generic, you have to traverse the possibility space and use its topology to get into an optimum. And if the topology is chaotic, you are out of luck, and if it's completely random, there's no hope.


Couldn't there be something between chaotic and completely random, let's call it correlated, where e.g. (conditional) independence structures are similar in real-world problems?


You mean something that is well behaved in practical situations but intractable in general?

There is plenty of stuff like that; things don't even need to be chaotic for that. Anyway, chaotic and random are just two specific categories; there are many different ones. Nature happens to like those two (or rather, not random exactly, but it surely does like things that look like it), which is why I pointed to them.


Yes.

And beyond this intuition (escape from local optima), the reason that annealing matters is that you can show that, under certain conditions, with the right annealing schedule (a rather slow one, T ~ 1/log(N_epoch) iirc?) you will converge to the global optimum.

I'm not well-versed enough to recall the conditions, but it wouldn't surprise me if they are quite restrictive and/or hard to implement (e.g., with no explicit guidance for choosing a specific temperature schedule).
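For reference, the slow theoretical schedule versus the geometric one most implementations actually use could be sketched like this (the constants c, T0, and alpha are illustrative):

    import math

    # The "rather slow" theoretical schedule mentioned above, T_k = c / log(k + 2),
    # versus the geometric schedule most implementations use in practice.
    def logarithmic_temperature(k, c=1.0):
        return c / math.log(k + 2)          # k = 0, 1, 2, ...; offset keeps the log positive

    def geometric_temperature(k, t0=10.0, alpha=0.95):
        return t0 * alpha ** k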


If you want to keep the pedants at bay, call this "simulated annealing at zero temperature." Or just call it a greedy algorithm.


My final year undergrad project (1992?) used an FPGA. The router for that used simulated annealing.


To the author: the source link https://github.com/stefanpie/sa-placer appears to be private.


Should be fixed now, thank you for the note!


If you're getting into the territory of global optimization algorithms like SA, it can be helpful to implement your problem generically enough that it can be plugged into a general-purpose optimization package like pagmo2, where you can experiment with a dozen or more similar algorithms.
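A hedged sketch of what that looks like with the pygmo (pagmo2) Python bindings; the toy problem and defaults are illustrative, and note that the built-in simulated_annealing expects a continuous box-bounded problem, so a real placement problem would need its own encoding:

    import pygmo as pg

    # Illustrative pygmo/pagmo2 workflow (toy problem, not an FPGA placer): wrap a
    # cost function as a user-defined problem, then swap algorithms freely.
    class ToyProblem:
        def fitness(self, x):
            return [sum(xi * xi for xi in x)]        # minimize a simple quadratic
        def get_bounds(self):
            return ([-5.0] * 4, [5.0] * 4)

    prob = pg.problem(ToyProblem())
    algo = pg.algorithm(pg.simulated_annealing())    # or pg.sade(), pg.pso(), ...
    pop = pg.population(prob, size=1)
    pop = algo.evolve(pop)
    print(pop.champion_f)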


Wondering how this compares to nextpnr (https://github.com/YosysHQ/nextpnr)?


I believe that nextpnr has both a simulated annealing placer and an analytical placer integrated (the latter based on this publication from 2019: https://arxiv.org/abs/1903.10407; see here for the work the analytical placer builds on: https://ieeexplore.ieee.org/abstract/document/6339278). I think they are also working on electrostatic placement, but I'm not sure what the current status of that is.


There has been initial electrostatic placement support for iCE40 and ECP5 in upstream nextpnr for a few months now, but there hasn't been a stable release including it just yet.


Force/spring-based placement also seems to work pretty well; I'm surprised that wasn't simpler to implement as a toy method.


Do any methods exist that utilize deep learning to drive the optimization?


Yes, see DREAMPlace -- "DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement".[1] The technique reformulates VLSI placement as a non-linear optimization problem, which is (broadly) how ML frameworks work: optimizing approximations to high-dimensional non-linear functions. So it's not like shoving the netlist into an LLM or an existing network or anything.

Note that DREAMPlace is a global placer; it also comes with a detail placer but global placement is what it is targeted at. I don't know of an appropriate research analogue for the routing phase of the problem that follows placing, but maybe someone else does.

[1] https://github.com/limbo018/DREAMPlace
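Roughly, the reformulation can be sketched like this (not DREAMPlace itself; the netlist and constants are made up, and a real placer also adds a density term so cells don't pile on top of each other):

    import numpy as np

    # Sketch of the reformulation described above: the non-differentiable
    # half-perimeter wirelength is replaced by a smooth log-sum-exp approximation
    # so cell coordinates can be optimized by gradient methods, the kind of
    # computation ML frameworks accelerate well.
    def smooth_span(coords, gamma=1.0):
        # Differentiable stand-in for max(coords) - min(coords).
        return gamma * (np.log(np.exp(coords / gamma).sum())
                        + np.log(np.exp(-coords / gamma).sum()))

    def smooth_wirelength(pos, nets, gamma=1.0):
        # pos: (num_cells, 2) coordinates; nets: lists of cell indices per net.
        return sum(smooth_span(pos[net, 0], gamma) + smooth_span(pos[net, 1], gamma)
                   for net in nets)

    rng = np.random.default_rng(0)
    pos = rng.uniform(0.0, 10.0, size=(6, 2))
    nets = [[0, 1, 2], [2, 3], [3, 4, 5]]           # made-up netlist

    lr, eps = 0.05, 1e-4
    for _ in range(100):
        # Finite-difference gradient to keep the sketch short; real tools use
        # automatic differentiation on the GPU, plus a density/overlap term.
        base = smooth_wirelength(pos, nets)
        grad = np.zeros_like(pos)
        for idx in np.ndindex(*pos.shape):
            bumped = pos.copy()
            bumped[idx] += eps
            grad[idx] = (smooth_wirelength(bumped, nets) - base) / eps
        pos -= lr * grad

    print(round(float(smooth_wirelength(pos, nets)), 2))  # wirelength drops from its random start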


I also want to note that some people are looking at using ML/DL to generate placement solutions directly. DREAMPlace is still a key recent work in this area, but at its core it uses analytical placement, just cleverly implemented with deep learning libraries and features.

For example, the VTR people are looking at RL to pick what actions to take during FPGA placement for their simulated annealing placer (see: https://ieeexplore.ieee.org/document/9415585).

I also know this paper/poster was indexed recently on OpenReview, but I'm still waiting for some more implementation details to look at: https://openreview.net/forum?id=6GR8KqWCWf

For FPGA routing, I don't think anyone has touched ML/DL directly, but I do know there is some talk about using ML/DL models alongside current routing approaches to replace search heuristics (think replacing or augmenting something like A*) or to do routability prediction. Certainly, there are probably many ways to use RL in routing, since current algorithms have many places where heuristic decisions could be made more intelligently.

Edit: I also want to note that there are a ton of works that use ML/DL to tune the "hyperparameters" of EDA tools such as placers and routers. Think ML/DL for black-box non-linear optimization.


> The technique reformulates VLSI placement as a non-linear optimization problem, which is (broadly) how ML frameworks work: optimizing approximations to high-dimensional non-linear functions. So it's not like shoving the netlist into an LLM or an existing network or anything.

Do you mean it uses gradient descent to find the optimum?


You can’t really do that for nonlinear functions


Well everybody in DL is doing it, and neural networks are nonlinear functions.



