
Mill vs. Spectre: Performance and Security [video] - leoc
https://www.youtube.com/watch?v=8E4qs2irmpc
======
UnquietTinkerer
For anyone interested, here are links to the slides and the accompanying white
paper.

[Slides] [https://millcomputing.com/blog/wp-
content/uploads/2018/04/20...](https://millcomputing.com/blog/wp-
content/uploads/2018/04/2018-03-14.Spectre.03.pptx)

[White Paper] [https://millcomputing.com/blog/wp-
content/uploads/2018/01/Sp...](https://millcomputing.com/blog/wp-
content/uploads/2018/01/Spectre.03.pdf)

I haven't read the paper yet; hopefully it offers more detail than the talk
does because I am still confused about how the Mill avoids cache pollution
from speculative loads.

EDIT: Here is my attempt at a summary of the relevant bits of the whitepaper:

The Mill is immune to Meltdown for the same reason AMD et al. are; it does
permission checks before loading rather than in parallel and thus the load
faults before going to memory.

The Mill is immune to Spectre because "Current Mill configurations will
[speculatively] issue, and revoke, a maximum of two instructions. Revocation
includes all cache and other micro-architectural side effects."

Neither of those points is covered in the talk. I don't know enough about the
subject to judge, but the arguments in the paper seem a bit glib. I'd like to
hear from an expert on the subject.
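
For anyone who wants the concrete shape of the attack the paper is defending against, this is roughly the classic Spectre v1 gadget (a minimal sketch in C; the `array1`/`array2` names follow the original Spectre paper and have nothing Mill-specific about them):

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
size_t  array1_size = 16;
uint8_t array2[256 * 512];
volatile uint8_t temp;

void victim(size_t index) {
    if (index < array1_size) {          /* branch may be mispredicted        */
        uint8_t value = array1[index];  /* speculative out-of-bounds read    */
        temp &= array2[value * 512];    /* second load leaves a cache        */
                                        /* footprint keyed on 'value'        */
    }
}
```

The point of the quoted revocation claim is that on the Mill the cache footprint of those two speculatively issued loads is supposed to be rolled back along with the instructions themselves.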

~~~
strstr
I’m pretty surprised if they don’t leave speculatively loaded (and still
correct) data in the cache. My understanding of speculation is that this was sort
of the point: often you won’t compute the right value (because you have to be
right in every instance) but you will have loaded nearly all of the relevant
data into the cache, so it’s comparatively fast the second time around.

~~~
Veedrac
This argument holds better for an OoO CPU that is speculating 100 instructions
ahead, so there's significant work done in this window. When your speculative
execution is only 2 cycles ahead, you aren't throwing away much work; you'd be
lucky to even _have_ work to throw away by that point, at least as it applies
to cache misses.

------
evancox100
(Started watching at 24:30, thanks y'all).

Everything he's saying misses the mark. If the only issue was hiding the
memory latency of a load when you know the address, you could solve this with
existing techniques like simultaneous multithreading (a la HyperThreading),
prefetch hints, etc.

The need for speculation arises when you do not know ahead of time which
address to access, which branch to take, etc. For example, you're accessing an
element in an array and need to multiply the index by the element size. You
don't know which address to load until the multiply completes, so you
speculate. I don't see how the Mill's deferred load semantics help you any
more than a prefetch or dummy load would. Actually, unless I'm missing
something you couldn't even use the deferred load because, again, you don't
have the address.
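
A tiny C sketch of the dependency being described (hypothetical names, just to make the point concrete): the load address isn't known until the index arithmetic finishes, so an in-order machine either waits for it or guesses.

```c
#include <stddef.h>

struct elem { int key; int payload[15]; };

int lookup(struct elem *base, size_t idx) {
    /* The address of base[idx] depends on idx * sizeof(struct elem);    */
    /* the load cannot be issued (non-speculatively) until that product  */
    /* is available, no matter how far ahead the load is "deferred".     */
    return base[idx].key;
}
```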

~~~
snuxoll
> The need for speculation arises when you do not know ahead of time which
> address to access, which branch to take, etc. For example, you're accessing
> an element in an array and need to multiply the index by the element size.
> You don't know which address to load until the multiply completes, so you
> speculate. I don't see how the Mill's deferred load semantics help you any
> more than a prefetch or dummy load would. Actually, unless I'm missing
> something you couldn't even use the deferred load because, again, you don't
> have the address.

You've touched on three different issues here. There are three completely
different scenarios I can think of off the top of my head, and the Mill design
ties with out-of-order designs in the worst case and beats them in the other
two.

1\. Random I/O on array elements - nobody wins here, because branch prediction
and speculative loads will consistently fail; you hope your data is in cache
and everybody stalls if not.

2\. Sequential I/O on array elements - the Mill can perform equally to an
out-of-order design in most cases and beat it in others; you don't rely on the
CPU seeing far enough ahead to reorder loads, and you have much better
facilities for parallelizing common operations (their strstr example using
their smear instruction, NaR values and pervasive vectors is truly mindblowing
- see the conceptual sketch at the end of this comment).

3\. Switch statements with jump tables - the Mill's wide-issue design handles
many of these cases without needing jump tables to begin with, especially when
paired with speculative operations on potential NaRs. When you need to call
code at another address you are again at the mercy of the branch predictor and
instruction prefetch, which the Mill does do and has some novel designs for,
with a low mispredict penalty and purportedly better prediction results.
Ultimately though, if you keep hitting mispredicts you're in the same worst
case as you are on out-of-order designs.

The Mill can't beat out-of-order designs where your code just thrashes cache,
causes mispredicts all the time, etc, but it can match them without eating
gobs of power.
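
For readers who haven't seen the strstr talk: as I understand it, smear takes a vector of per-lane flags and propagates a set flag to all later lanes, which is what lets a vectorized loop cut itself off at the first match (or at a NaR). A rough scalar model of that behavior in C - just a conceptual sketch of the semantics, not Mill code, and I may be off on whether the first lane itself is included:

```c
#include <stdbool.h>
#include <stddef.h>

/* Rough scalar model of a "smear" over a vector of lane flags: once any   */
/* lane is set, every following lane reads as set too. On the Mill this is */
/* a single vector operation; the loop here only illustrates the idea.     */
void smear(bool *lanes, size_t n) {
    bool seen = false;
    for (size_t i = 0; i < n; i++) {
        seen = seen || lanes[i];
        lanes[i] = seen;
    }
}
```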

~~~
twtw
You talk about the mill as if it exists. There is no hardware, there are no
benchmarks. Bloviating about the excellent performance of the mill is not
valuable - showing SPEC CPU results is. VLIW performance was great too, until
it wasn't. You can statically schedule everything in theory and the
performance will be great, but experience suggests that giving hardware the
capability to react dynamically cannot be replaced by static scheduling,
except in code with limited branching and a known execution pattern. This is
why VLIW works nicely for DSP, and fairly poorly for general purpose
computing.

The mill has been in development for 15 years, and almost done for 5. Forgive
me for not holding my breath.

~~~
deepnotderp
I don't understand why people are so willing to say "it won't work" without
actually taking the time to understand it. They spend part of practically every
one of their talks addressing how they overcome traditional VLIW problems.

~~~
evancox100
He's not saying it won't work; he's saying it doesn't exist yet, so talking
about it as if it does is a bit silly.

------
__s
Skip to 24:30 for the info about the Mill architecture; everything before that
is building context by explaining Spectre.

------
analognoise
Does this thing even exist on an FPGA yet?

~~~
Veedrac
No.

------
jcranmer
Does anyone have a link to the slides? I find that a much preferable way to
access this sort of stuff...

~~~
leoc
Presumably it will show up at
[https://millcomputing.com/docs/](https://millcomputing.com/docs/) eventually
but it doesn't seem to be there yet.

------
gizmo686
Discussion on Mill begins at about 24:30.

------
ptc
This Mill guy is the gift that keeps on giving. With any luck he’ll still be
around to explain how the soon-to-be-released Mill 1.0 cpu would have avoided
the year 2038 problem.

~~~
gbrown_
> With any luck he’ll still be around to explain how the soon-to-be-released
> Mill 1.0 cpu would have avoided the year 2038 problem.

What? Software working with a 64-bit time_t is not the CPU's problem.

------
twtw
Talk is cheap.

The TL;DW is that the mill cpu will have better performance than existing CPUs
without speculative execution because it has "deferred loads," while the straw
man not-mill architecture doesn't and therefore stalls after every load. Also,
newsflash - Spectre doesn't impact architectures that don't speculate.

This is great, except that existing CPUs don't stall after issuing a load.
Scoreboarding + prefetch are together capable of more than this "deferred
load," and require less work from the compiler. If you have independent
instructions following a load, any existing architecture worth its salt will
notice that and execute them while the load is in progress.
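
To make that concrete, here's the kind of overlap any scoreboarded or out-of-order core already exploits, with an explicit software prefetch thrown in (a small sketch using the GCC/Clang `__builtin_prefetch` hint; the function and array names are made up):

```c
/* Independent work proceeds while a load is in flight on any machine that   */
/* tracks register dependencies; the software prefetch just starts a likely  */
/* future miss even earlier. Compiles with GCC or Clang.                     */
long sum_indirect(const int *idx, const long *table, int n) {
    long acc = 0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&table[idx[i + 8]]);  /* hint: warm a future line */
        long t = table[idx[i]];    /* this load may miss the cache...            */
        acc += (long)idx[i] * 3;   /* ...independent work overlaps the miss      */
        acc += t;                  /* first use of the loaded value              */
    }
    return acc;
}
```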

It's potentially a neat idea to include the number of cycles until the load
retires in the instruction, but it's a joke to pretend that it's higher
than what x86 does and will get you back all the performance lost by not
speculating.

I can't help but think that the mill architecture gets a lot of hype from a
lot of people that don't know very much about computer architecture. There
have been lots of great ideas that didn't pan out for general purpose
computing, and I'm not sure that this vaporware architecture deserves to be
thought about.

~~~
snuxoll
> This is great, except for that existing CPUs don't stall after issuing a
> load. Scoreboarding + prefetch are together capable of more than this
> "deferred load" [...]

Except existing CPUs spend a lot of die space and power budget on speculative
execution to hide the stall; the point of a deferred load is that you don't
need all that hardware to extract the same performance.

> and require less work from the compiler. [...]

Three words: static single assignment. If you can work out the dataflow of a
function you already have everything you need to order loads in the most
efficient way possible; this is why all of Mill Computing's work has been
around LLVM, because LLVM IR _forces_ SSA by design. Hell, your compiler
doesn't even need to think about the ordering if it relies on LLVM to do the
native code generation, because the Mill backend is supposed to do all of this
for you.
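
For what it's worth, the compiler-side job being described is basically classic load hoisting over independent work. A toy C-level illustration of the schedule a deferred load is aiming for (hypothetical function; the real transformation of course happens on LLVM IR / machine code, not C source):

```c
/* Naive order: the load sits right next to its use, so its latency is exposed. */
int combine_naive(const int *p, int a, int b) {
    int x = a * b;   /* independent work                 */
    int y = *p;      /* load issued late...              */
    return x + y;    /* ...and used immediately after    */
}

/* The schedule a deferred load wants: issue the load as early as the dataflow  */
/* allows and fill the delay with independent work, so the value arrives just   */
/* in time for its use.                                                         */
int combine_scheduled(const int *p, int a, int b) {
    int y = *p;      /* load issued first, latency runs  */
    int x = a * b;   /* independent work hides it        */
    return x + y;
}
```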

> It's potentially a neat idea to include the number of cycles until load
> retire in the instruction, but it's a joke to pretend that it's higher
> performance than what x86 does and will get you back all the performance
> lost by not speculating.

Deferred loads alone aren't there to beat x86 in terms of performance; they're
there to avoid needing all the costly out-of-order hardware while still
avoiding the memory stalls that previous statically scheduled/in-order machines
incur. There are other features in the architecture to bring better
performance, but that's all around the VLIW-like design.

~~~
gpderetta
As far as I know (and I don't know much because I'm not a compiler guy) LLVM
(and most compilers) doesn't keep everything in SSA form. Any value whose
address has escaped (most things not on the C stack, and even some local
variables as well) must be treated as memory. I think that non-automatic,
not-recently-used variables would also be the values that would benefit the
most from deferred loads. So IIRC the Mill has hardware to help with aliasing,
but it wouldn't in fact plug into LLVM out of the box.
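
A quick illustration of the escape problem (made-up names): once a local's address leaks to opaque code, the compiler has to treat later reads of it as real memory loads rather than SSA values.

```c
/* Defined in another translation unit; may write through the pointer. */
void record(int *p);

int example(void) {
    int local = 42;
    record(&local);   /* the address escapes here                           */
    return local;     /* this read must be a real load: record() may have   */
                      /* modified 'local', so it can't stay an SSA value    */
}
```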

Do the Mill guys even have an LLVM backend yet? Or even any compiler at all?

~~~
Veedrac
I believe the latest we've heard is that their LLVM backend is mostly working
but still pretty buggy.

