
STOKE – a stochastic optimizer for x86_64 assembly - ingve
https://github.com/StanfordPL/stoke-release
======
dikaiosune
I was very curious to skim the papers that (I presume) cover results on
non-toy programs. Unfortunately, all three links 404 :(.

EDIT: Took a look at the issues for the repo, and here are links that work:

[https://cs.stanford.edu/people/eschkufz/docs/asplos291-schku...](https://cs.stanford.edu/people/eschkufz/docs/asplos291-schkufza.pdf)

[https://cs.stanford.edu/people/eschkufz/docs/oopsla011-sharm...](https://cs.stanford.edu/people/eschkufz/docs/oopsla011-sharma.pdf)

[https://cs.stanford.edu/people/eschkufz/docs/pldi52-schkufza...](https://cs.stanford.edu/people/eschkufz/docs/pldi52-schkufza.pdf)

In the issue he also mentioned this paper that's linked from his page:

[https://cs.stanford.edu/people/eschkufz/docs/cove.pdf](https://cs.stanford.edu/people/eschkufz/docs/cove.pdf)

Going to start reading through these now, seems pretty interesting and I've
never come across something like this before.

~~~
viraptor
> I've never come across something like this before.

Have a look at Souper. I believe it takes a slightly different approach, but
with the same idea/result.
([https://github.com/google/souper](https://github.com/google/souper))

Actually, I'm a bit disappointed that none of the papers reference it (I did
a quick text search only).

~~~
UniQP
That's because a) the initial STOKE paper (2013) is older than Souper
(initial commit 2014), and b) there is no Souper paper so far.

------
derf_
I applied this to some pieces of Daala's entropy coder:
[https://review.xiph.org/763/#msg2](https://review.xiph.org/763/#msg2)

The output needs a bit of manual review (see the mixing of 32-bit and 16-bit
accesses to the same register in the second example), but it was definitely
able to find something non-obvious in a sequence far too long for exhaustive
search.

------
nly
> This will produce a directory named bins that contains the text of every
> function contained in the binary ./a.out.

This is pretty cool functionality in itself!

------
nickpsecurity
"Stochastic optimization," "equivalence checking..."

Assembler optimization is starting to sound like the ASIC tooling I've been
reading about. That's exciting given what's been accomplished with that. :)

~~~
RossBencina
Can you recommend related reading in the ASIC tooling literature?

~~~
nickpsecurity
I'm not sure what your background is or how much you know about this stuff.
Unlike software, almost every aspect of the tooling for hardware is Mega-HARD.
Also, I collected a _ton_ of shit on that this year in hopes academics or FOSS
types will make OSS alternatives to Big Three. So, here's some intro and
example links for you to go through to get an idea of processes and techniques
involved. Only then, if you're still committed to learning, will I try to
organize and send you some links on specific techniques for achieving them.

ASIC flow and EDA techniques intro
[http://cc.ee.ntu.edu.tw/~ywchang/Courses/PD/EDA_Chapter1.pdf](http://cc.ee.ntu.edu.tw/~ywchang/Courses/PD/EDA_Chapter1.pdf)

Another covering most flows
[http://www.csl.cornell.edu/courses/ece5745/handouts/ece5745-...](http://www.csl.cornell.edu/courses/ece5745/handouts/ece5745-T05-methodology-auto.pdf)

Modern example of a Standard Cell flow
[http://www.cs.berkeley.edu/~yunsup/papers/riscv-esscirc2014....](http://www.cs.berkeley.edu/~yunsup/papers/riscv-esscirc2014.pdf)

A flow that's a lot closer to the metal with some custom work
[http://lejpt.academicdirect.org/A19/133_143.pdf](http://lejpt.academicdirect.org/A19/133_143.pdf)

Note: I was going to post a custom flow with an example MIPS processor, but
the damned thing is totally gone from Google and DuckDuckGo. And my
filesharing service is down, lol. Put an email address, public or disposable,
in your profile and I'll send it to you manually. Or just email the address
in my profile with your HN name in the subject or body. It's too good to pass
up. Do read the two intros first so you appreciate the work that level
requires.

~~~
RossBencina
I've finished reading the links. The first one was great. Thanks, I learnt
something. It's not my field, but I'm vaguely familiar with the workflow,
having friends who design ASICs, work at Xilinx, etc. I've encountered some
algorithms in advanced compiler texts.

But what I was specifically curious about was applications of stochastic
optimization to EDA workflow. It sounded like you'd seen papers on that topic.

My interest is purely in applications of stochastic optimization. I'm not
about to jump into EDA.

~~~
nickpsecurity
OK, now that I know your focus: I basically just used my fast-googling
technique and searched through recent archives to see if anything popped up.
Here are various things in hardware, software, and architecture benefiting
from stochastic optimization, to learn from or improve on in your own work.

Really detailed write-up of how cells can be modeled and transformed, with a
stochastic tie-in
[http://www.eecs.berkeley.edu/~keutzer/classes/244fa2004/pdf/...](http://www.eecs.berkeley.edu/~keutzer/classes/244fa2004/pdf/3-1-partition.pdf)

Parallel, stochastic FPGA placement with great speedup
[http://www.eecg.utoronto.ca/~vaughn/papers/fccm2014_parallel...](http://www.eecg.utoronto.ca/~vaughn/papers/fccm2014_parallel_place.pdf)

Multi-optimization scheme for nanometer
[https://repositories.lib.utexas.edu/handle/2152/3688](https://repositories.lib.utexas.edu/handle/2152/3688)

Note: I don't think this one uses stochastic methods, but rather a number of
others. It's significant because it specifically mentions that modern,
typically large, designs avoid stochastic optimization for global issues
because it's cost-prohibitive compared to geometric methods. That might have
significant implications outside EDA, too: stochastic methods might only be
good for local optimizations that are tied together some other way.

Synthesis and control of system design for industrial project (software)
[http://www.scielo.org.ar/pdf/laar/v40n2/v40n2a07.pdf](http://www.scielo.org.ar/pdf/laar/v40n2/v40n2a07.pdf)

RTL verification by minimizing multiway decision graphs (2006)
[http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=D9E...](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=D9E92C9F833CED1CDC02512ACBABDADB?doi=10.1.1.495.3582&rep=rep1&type=pdf)

Use in clock-tree synthesis for multi-FPGA's with focus on thermal properties
[http://thescipub.com/PDF/ajassp.2013.1604.1615.pdf](http://thescipub.com/PDF/ajassp.2013.1604.1615.pdf)

Synthesis of high-performance analog circuits (1996)
[https://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-aotcad96.pd...](https://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-aotcad96.pdf)

System-level hardware and software partitioning with simulated annealing and
tabu search
[http://www.gstitt.ece.ufl.edu/courses/fall08/eel4930_5934/re...](http://www.gstitt.ece.ufl.edu/courses/fall08/eel4930_5934/reading/tabu.pdf)

Simulated annealing for aiding genetic algorithms in software architecture
synthesis
[http://www.sis.uta.fi/~tp54752/pub/sa/SA_aiding_GA_acta.pdf](http://www.sis.uta.fi/~tp54752/pub/sa/SA_aiding_GA_acta.pdf)

Synthesize implementations of hard real-time stuff
[http://link.springer.com/article/10.1007/BF00872288?no-acces...](http://link.springer.com/article/10.1007/BF00872288?no-access=true)

Humie awards from Genetic Programming. Most of these could probably be ported
to stochastic optimization in general.
[http://www.genetic-programming.org/combined.html](http://www.genetic-programming.org/combined.html)

Accelerate floating point problems
[https://web.stanford.edu/~sharmar/pubs/fp.pdf](https://web.stanford.edu/~sharmar/pubs/fp.pdf)

Stochastic super-optimization of loop-free binaries
[https://web.stanford.edu/~sharmar/pubs/asplos291-schkufza.pd...](https://web.stanford.edu/~sharmar/pubs/asplos291-schkufza.pdf)

Note: I'm keeping that one for sure. :)

------
tslug
Took a brief look. Seems like a cool idea, but it smells like early days.

The example listed on GitHub (population count of 1s in a 64-bit register) I
thought was a little underwhelming. It took a poorly optimized loop and, for
some reason, came to the conclusion that 0s would be more common than they
would be with a uniform distribution of inputs, as you'd get in a compressed
stream of bytes, for example (did I miss something there?). So it added some
overhead to special case 0 even though 0 only takes one iteration through the
loop anyway, and the loop still required up to 64 iterations if you have the
high bit set. If doing population counts on compressed data (most of what
traverses the internet), odds are you'll generally have at least one bit set
in every byte, so it'll usually be doing a lot of iterations.

If the code matters, then on an x86_64 arch, you can definitely burn a few
cache lines on lookup tables for considerably more speed: a 256-byte lookup
table (4 cache lines) brings the loop down to 8 iterations, and a 64 KB table
(1k cache lines) brings it down to 4, assuming you're confident you can warm
that latter table up in the dcache with enough activity.

I guess it isn't making guesses with lookup tables yet, but that's when I'd
get all nipply.

~~~
barsonme
> I thought was a little underwhelming.

I agree, but...

> so it added some overhead to special case 0 even though 0 only takes one
> iteration through the loop anyway, and the loop still required up to 64
> iterations if you have the high bit set.

Didn't it simply replace the body of `size_t popcnt(uint64_t x)` with the
`popcnt` instruction? Or am I looking at the wrong test?

------
erichocean
It seems crazy that stochastic optimization written by a couple of grad
students is better than 30 years of the very best programmers writing
optimizing compilers by hand.

Take something like Inria's certified C compiler, run STOKE over its output
(with loop optimizations available), and you've got a provably correct
_optimizing_ compiler that's as fast as or faster than GCC, Clang, VS, and
Intel, but without the compilation bugs.

Really cool stuff.

~~~
UniQP
There are a lot of compiler optimizations that rely on global analyses not
covered by optimizing a sequence of assembly instructions. Thus, I doubt that
CompCert + STOKE would reach the code quality of production compilers.

Nevertheless, I agree that superoptimizers often find optimizations that are
not considered by compiler engineers. Another good example is the Optgen
tool [1], which also identified missing (machine-independent) optimizations
in all production compilers.

[1] [http://pp.ipd.kit.edu/optgen/](http://pp.ipd.kit.edu/optgen/)

------
pklausler
Reminds me of the "superoptimizer" from the '80s.

~~~
akkartik
Ah, thanks for the reminder. This is useful background:
[http://blog.regehr.org/archives/923](http://blog.regehr.org/archives/923)

------
daguava
Its sandboxing behavior is a really nice touch. This is a really cool
project; interested to see if it continues.

------
frozenport
This is great and reminds me of the ATLAS BLAS [1].

I would really like the authors to discuss the effects of power-saving
features, for example dynamic clocking of the CPU.

Indeed, ATLAS will often fail to converge, or get the wrong tuning, when
automatic power saving is not disabled. In a modern system it is becoming
increasingly difficult to manage concurrent power-saving features, for
example things like Intel's Turbo Boost or the clock on your mobile device.

[1][http://math-atlas.sourceforge.net/](http://math-atlas.sourceforge.net/)

~~~
jepler
Do you have a reference for this problem? I would love to hear more. Is it
thought to be a hardware bug, like this recent snafu with arithmetic (though
power tuning was not mentioned as contributing there)?
[http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-cau...](http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/)

~~~
frozenport
So, modern ATLAS has a CPU-throttling check, for example:
[http://stackoverflow.com/questions/14592401/atlas-install-re...](http://stackoverflow.com/questions/14592401/atlas-install-really-need-to-get-past-cpu-throttle-check)

------
zump
Seems a bit involved to get a result. It should be as easy to use as American
Fuzzy Lop.

