Hacker News new | comments | show | ask | jobs | submit login
The Mill CPU Architecture: Switches [video] (millcomputing.com)
156 points by willvarfar on July 18, 2017 | hide | past | web | favorite | 73 comments



Well, I'd love to get some discussion here.

One of the perennial questions that I've never received a satisfactory answer to is: Where's the ILP?

Everytime I get asked this, I get referred to the "Phasing" talk, but that isn't a satisfactory response either.

I think quantitative analysis will prove to be far more nuanced and damning than expected. Take this paper (https://pdfs.semanticscholar.org/1002/50a822251d1302c146ab49...). This paper analyzes a processor with much STRONGER characteristics than the Mill: It dynamically identifies these dependencies and then schedules them.

The result? It gets smashed by a classic OoO.


> This paper analyzes a processor with much STRONGER characteristics than the Mill: It dynamically identifies these dependencies and then schedules them.

I'm not really sure what this means, especially "stronger characteristics". The paper seems to have started with an idealized OoO processor, then subtracted from its capabilities and demonstrated that performance suffered. No surprise there, but that's not easy to translate into a comparison against the Mill.

The Mill is not expected to ever compete against a full OoO processor of equivalent width; the Mill is expected to always be much wider.

I haven't fully thought through the implications, but I think there might be a significant difference in the instruction scheduling algorithm between the paper's model and a Mill. The paper's model will queue an instruction as soon as at least one of its dependencies has been scheduled, or slot it into an empty queue (where it may cause head of line blocking). By contrast, the ahead-of-time scheduling done by a compiler for the Mill won't issue an instruction until all of its dependencies are in the pipeline and due to finish in time (or stall for memory fetches), and uses explicit NOPs instead of head of line blocking on an issue queue.


The "wider" claim is one that I highly suspect.

Mill has very specific storage for arguments and results. For each operation you need 3 lines to that storage - two for arguments and one for result. The storage has O(num-of-wires^2) wiring overhead and you cannot get away from that in general. You can get away with less (Alpha AXP (OoO) has, for example, split register file where results migrated from one half to another behind the scenes) but it is very complex to implement. The VLIW architectures has some nice wideness and THEY ALL suffer from problem that register file of that width presents. You can see that in the clock speeds - Pentiums run around the Itanium in terms of clock speeds. It is precisely because Itanium has large register file (128 registers) and wide issue.

I keep coming to the discussion of the Mill and keep finding the argument that Mill will be wider. And no one bother to address my arguments that for wideness you need O(width^2) wires in the storage, which makes storage impractical or too complex to implement.


Mill has no register file, and is essentially a bypass network internally. That bypass network has a source X sink complexity (not N^2 because the number of sources need not match the number of sinks). You are right that the cross product is a limit on Mill scaling. Our internal work gives us some confidence that we can handle 30-wide issue with tolerable clock impact; beyond that is unclear, and indeed we may hit other constraints that preclude going further; memory bandwidth is a likely issue.


What makes you think you can extract that much instruction level parallelism in the first place?


One major one is speculation. Each Mill operand, or element in a vector operand, has a 'Not a Result' (NaR) flag.


You still have quadratic complexity of the network, the "need not match" argument can be reduced to a constant multiplier.


The reason I bring up the comparison is that both the Mill and this paper rely upon the existence of these sorts of dependency chains (I like to call them "picothreads" or "femtothreads") in general purpose code in order to get good performance. This paper shows that the existence of such chains are far too wide and shallow to be useful for their intended purpose.

FWIW, someone (disjoint from the Mill team!) has actually made a model of the Mill's phasing and found that it fails to deliver.

And "stronger characteristics" means that it dynamically identifies dependencies.


> FWIW, someone (disjoint from the Mill team!) has actually made a model of the Mill's phasing and found that it fails to deliver.

Citation is definitely needed, here.


Not really, publication quality. Just a Sunday-morning effort to understand the ceiling on the phased-execution model. All resources like decode, #pipes, cache and branch prediction are unmodeled, assumed infinite. I'm sure anyone can improve on it, but I lost interest with disappointing results.

https://pastebin.com/14E7X41j

I dump the instruction stream for coremark out of LLI interpreter, hacked a tiny bit. To save trouble, here's one iteration of coremark:

https://drive.google.com/open?id=0B0ygb7T_Ab3kM01DQnlvQi1teG...

gunzip the data, run it through the script and count the lines out vs lines in. (I get "IPC" ~2.12.) For a bit more insight, the number on the left of each output line is how many instructions fused. You can make a histogram like:

    cat tmp.out | cut -f 1 -d " " | sort -n|uniq -c

    76056 1
    95805 2
    32931 3
    4401 4
    1172 5
    970 6
    ...


Hi Michael,

To think I messed around with Valgrind ;) This is much easier.

I look at the particulars of the grouping and think 'no, that's not right', and 'that can't be with this', and 'that can be with that', and so on. The results are far too back-of-the-envelope to be right. But the approach is excellent, and that's the main thing :)

I'm going to crunch through the IR recording and work out the ILP with/without speculation etc. Will leave pipelining and auto-vectorization off the table. As this is all public stuff I reckon the results can be made public.

Wanna work with me on this? I'd like you back. This time it might work. Mail me :)

/Will


> This paper shows that the existence of such chains are far too wide and shallow to be useful for their intended purpose.

But doesn't this paper model a machine that is substantially narrower and deeper than a high-end Mill is expected to be? And what you judge to be the depth of a dependency chain depends on whether you're looking at a 128-instruction reorder buffer or if you're a compiler looking at the entire function.

Figure 4 in the paper particularly seems to indicate that the failing of the dependence-based model is entirely in the front-end, which seems to be the least Mill-like part of the model.


Read the next sections of the paper, it shows that the reason for that anguishing front end is due to the stalled instructions.


> it shows that the reason for that anguishing front end is due to the stalled instructions.

Stalled by what, though? Stalled waiting for memory (partially alleviated by the Mill's deferred loads), stalled waiting for dispatch from the queues that the Mill doesn't have, or stalled when at the head of the dispatch queue and ready to issue when the assigned execution unit is busy but some other execution unit is free?

The whole dispatch vs issue mismatch they're complaining about doesn't exist on the Mill or any other statically-scheduled machine (though there's a related problem in how many instruction slots end up as NOPs). There are at least three other features of the Mill that make the example in figure 6(a) inapplicable, and they didn't diagram the dependencies from the real code used to generate figure 6(b).


Stalled by scheduling constraints.


I don't think those scheduling constraints exist on the Mill. Certainly the ones illustrated in figure 6(a) don't. Add operation number 5 could issue during the empty slot on cycle number 5 instead of 6 thanks to deferred loads, or thanks to the fact that the Mill uses a multitude of specialized execution units rather than a handful of general-purpose execution units—a busy load unit could never stall an independent add operation to begin with.

It looks like add operation number 5 can't use the empty cycle on lane 2 because compare number 3 would already be queued in that lane. The time-reversed/consumers first scheduling heuristic that a Mill compiler uses wouldn't paint itself into this corner the way the this model's dispatcher does.

Figure 6(a) for a Mill would look like an ALU pipeline doing an add operation every cycle without stall, a load pipeline issuing a load operation every cycle without stall, another ALU pipeline doing compare operations every cycle but lagging behind by several cycles instead of just one, and a branch unit doing the conditional jump every cycle immediately after the corresponding compare. And on a mid-level or high-end Mill there would be plenty of execution units to spare for a non-trivial loop body without reducing throughput, and the whole thing would be vectorized too.


It seems to me that the main fundamentally new idea of the Mill is the deferred load mechanism described in the Memory talk. They claim that it can complete with OoO execution with a much simpler implementation. It does require some of the other Mill mechanisms for full efficiency, e.g. it uses the belt to defer loads across calls.

If this mechanism doesn't work as expected with real software, then there's no reason to think that the Mill will fare any better for general-purpose computation than any earlier VLIW/EPIC machine, for all the same reasons as before. And in that same talk they make some claims about OoO that seem a bit naive. The state-of-the-art has evolved a lot since the company was founded in 2003.


Copy paste:

Fun fact, the Mill's proposed method for hiding that latency, "deferred loads" has already been done by Duke's Architecture Group: http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf (warning PDF link). The big gain? A measly ~8%.


IIRC the Mill's presentation about deferred loads predates this paper, though the paper is a lot more detailed and has simulations. It's not clear how the Mill's gain from deferred loads would compare (it differs in a lot of other ways that would interact).


8% seems huge for an architectural change to me.


8% for an unoptimized PoC is pretty substansial, or it could be nothing when all details are accounted for. If it comes with a simpler silicon as well it's better and surely worth evaluating for GP processors.


I'm pretty out of my depth here, but I thought the deferred loads latency was part of the 12% gained from dynamic scheduling, not the 88% that the Mill claims to tackle with its phasing thing. In that case, 8% doesn't sound measly at all.

But like I said, I wouldn't be surprised if I'm just not understanding what I'm reading/watching.


Something equivalent to deferred loads is a part of AMD's GCN architecture, which has existed for many years now. (It's not exactly a deferred load; rather, there's an explicit wait instruction that can block a thread until a previous load has returned its results. But in the end that's almost the same thing.)

Of course, the characteristics of memory accesses are quite different in a GPU: very high bandwidth at the cost of very high latency.


Actually only ~12% of the performance obtained by OoO machines comes from its dynamic reaction capabilities (eg cache misses and branch mispredicts), ~88% comes from better schedules.

I encourage you to read McFarlin's publications, they're very good OoO statistics.


It doesn't look like the architecture in this paper supports pervasive speculation, which the Mill can via its "Not-a-Result" (NAR) bits on each data element. Vectors on the Mill have such a bit both for the entire vector, and for each element within the vector, which (along with other novelties) is supposed to help both with vectorizing and software pipelining code that can't be on conventional architectures. Ivan emphasizes in the videos that these features are what make the Mill capable of high ILP, which is why they go very wide (33 on "gold").


Short loops which have very few iterations and are not very long running comprise anywhere from 50-60% of the execution time on SpecINT type workloads (we'll extrapolate that to general purpose code). These are the types of things that kill software pipelining. Furthermore, sometimes these loops have dependencies between iterations.

As a sidenote, the primary advantage of the Mill in software pipelining seems to be that they can software pipeline loops without a need for a prologue or epilogue and thus save a lot of cache space.

Elbrus (https://www.google.ch/patents/US20020133813) had this too, didn't help them get good ILP.


Why don’t you come up with a concrete example of code that you think will perform poorly and post it on their forum and see what they say?


Basically any piece of general purpose code.

Oh, and someone (disjoint from the Mill team!) has actually made a model of the Mill's phasing and found that it fails to r.


Do you have a link to that?



If true, that'd be a very high number. Intel can get 2 on some benchmarks, but system-average IPC tends towards 1. (In terms of instructions standardized for cross-platform comparison. I take data on ARM servers and I'm jellous of 1.)

Are any Intel engineers on this thread saying, whelp if they can do 6 we may as well give up? Or does the lack of any data help you sleep soundly and skeptically?


Sure. In theory Skylake can deliver ~8uOPs per cycle in parallel.

Do they in real code?

No.

The Mill team claims (with no running general purpose code OR data of any sort) that they can. YMMV, but I bet Intel sleeps safe. Which sucks because I actually want innovation in computer architecture, which is a notoriously uninnovative industry, and the Mill has some genuinely good ideas, but I don't think they'll work here.


> In theory Skylake can deliver ~8uOPs per cycle in parallel.

4 µop/cycle after fusion.

http://www.agner.org/optimize/blog/read.php?i=650


ILP and IPC are not the same thing. ILP (instruction-level parallelism) is the number of instructions executed in a non-stalled clock cycle. IPC (instructions per clock) is the average number of instructions executed per clock, including clock cycles spent waiting for main memory and recovering from branch mis-prediction.


See slide 44:

https://riscv.org/wp-content/uploads/2016/01/Wed1345-RISCV-W...

It is very dependent on the workload.


[deleted]


Decades of optimization on a crap architecture. The Mill has been able to learn from the mistakes made in every architecture to date. The kind of comparison you make doesn't mean anything as a result.


Intel has also learned from each generation.

CISC is very efficient. They do the most ops per transistor.


But Intel is constrained by backwards compatibility.

"CISC is very efficient" is a meaningless statement. The specific instruction set and its implementation is or is not efficient.


Disclaimer: Opinions are solely my own and do not reflect my employer's


That paper isn't particularly relevant, as the non-OoO CPU described therein is not much like the Mill at all. It would be interesting indeed, though, to see the Mill folks produce performance figures for some common software.

Remember that a VLIW CPU has a lot of sub-ops in each instruction word, so it by design has more ILP, though only when the workload allows for many of those sub-ops to be filled, and how much work those sub-ops do can muddy the waters.


It's good he recognizes that most code won't benefit from switch optimization. But certain tasks like pattern matching benefit greatly. For that, I suggest:

http://www.micronautomata.com/

Micron will eat 1000 nested switches 100 cases wide every cycle. This sort of TCAM tech was rare 5 years ago, but everywhere now.


I'm biased on this topic, as I work for Intel on regular expression matching, so I suppose I am 100% invested in the traditional Von Neumann architecture for this task.

I'd say the Micron Automata processor (which never really shipped from Micron in performant form - there were some "too slow to provide real numbers" units out there) is very interesting. But many of the applications for it in pattern matching as elsewhere are contingent on doing useful parallel work if they are to outcompete an optimized software implementation. So if you have a giant NFA where you need to run 16384 states all the time, the Micron AP looks great. However, if you can manage to find sleazy optimization tricks that allow you to cheat and get down to running the occasional 256-bit NFA over a few bits of your input, the AP isn't competitive.

More of the pattern matching task that we're familiar with (i.e. for network security) looks like the latter case (tricks being possible).

I'm very interested in seeing the tech get out there, but its been spun out into a startup and I don't know what the future looks like for it.


"We postulate that the original study produced good results because it evaluated the new scheduler in a machine whose performance bottleneck lay outside the execution core. The problems introduced by the steering logic were, as a result, hidden."

Any new microarchitecture needs to first answer that big question -- how is the core going to deal with memory?


Fun fact, the Mill's proposed method for hiding that latency, "deferred loads" has already been done by Duke's Architecture Group: http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf (warning PDF link).

The big gain? A measly ~8%.


This doesn't decouple across function calls like the Mill claims to, right? Although doing that optimization on existing C/C++ programs will be difficult.


I don't think that's enough of a difference, YMMV, but nothing indicates that it should.


Can the mill not execute instructions ahead of others? Maybe i dont understand what it means to be out of order, but the Mill can decode up to 30ops per cycle and issue many of the simultaneously.


The Mill is pipelined and superscalar, which are the key features to having multiple operations in-flight, and it is also speculative in that it will start feeding stuff into the pipeline before it's known whether those are definitely the right instructions to be executing.

Out-of-order execution means a CPU will re-order instructions on-chip to account for stalls due to cache misses or (on superscalar processors) due to a mismatch between the kinds of instructions and the kinds of execution units available. OoO execution require analyzing instruction dependency information pretty early in the pipeline and very quickly to keep up a steady stream of ready-to-execute operations.

An in-order processor like the Mill only pays attention to instruction dependencies to determine whether a memory fetch has been completed, and if not to stall the pipeline until it has. The Mill compensates for that performance disadvantage by being statically scheduled: all the decisions about how to schedule the instructions are handled by the compiler toolchain, which has to know the exact instruction latencies and mix of available functional units on that model. An in-order processor that isn't statically scheduled could be superscalar but it would be hard to get several functional units in use simultaneously. For example, the original Pentium was in-order but could issue up to two instructions simultaneously, with restrictions on what instructions could be executed in what pipe and if the two instructions in a pair didn't take the same amount of time a third (and maybe fourth) instruction couldn't start until both had finished.


> ... all the decisions about how to schedule the instructions are handled by the compiler toolchain, which has to know the exact instruction latencies and mix of available functional units on that model.

That sounds pretty bad from a performance standpoint. You'd have to compile a binary for every single CPU to get the best performance. Maybe it could work with JIT languages running on JVM and CLR.


Binaries would be distributed in a family-member-independent form, and then they would be "specialized" on each individual machine.


Wouldn't the individual latencies also depend on clock speed?

How would you have a CPU be able to clock-up and clock-down and still maintain good performance?


I'm unsure. When you increse RAM speed the timings are relaxed, and I would think that cache is the same.


That is, more or less exactly what running on a VM is. You distribute Jave bytecode or CIL.


You dont need a VM. You just push the last stage of compilation to the user. The user can then compile to their architecture before starting execution. Optimal compilation involves some NP complete problems, so depending on how good there heuristics are, this could cause a performance issue in loading code, or in running underoptimized code. However, these issues could be solved by moving compilation to install time.


Pushing a final compilation to the user is not a good idea - either it will happen everytime you run, or at install time. I have a lot of programs that are just executeables, and will not be installed. Having that overhead every time something is run is just a waste. Install compilation means that I cannot expect reasonable performance if I upgrade the CPU, if it'll work at all.

So for any reasonable implementation we're looking at JIT with caching, for running platform independent code. The very definition of a process VM is "to execute computer programs in a platform-independent environment"[0].

[0]: https://en.wikipedia.org/wiki/Virtual_machine


You've really missed the point. The point is to lower power consumption and increase performance on the CPU level. Running a VM on top of x86 does not achieve that goal because you are still on x86.


I know very well that is the goal, I've been following this project for years now. My point still stands, you need to compile specifically for the CPU it's run on, or you need to move the final compilation to the end user. The only way to achieve reasonable performance if you distribute platform independent code is by using a process VM.

I'm not saying it will not be faster, just that it is a weakpoint in the Mill architecture compared to x86.


The Mill isn't an “out-of-order” machine because it's not taking linear code and scheduling it to run out-of-order dynamically (i.e. when it executes). Instead, the compiler does the scheduling.


> It gets smashed by a classic OoO.

Keep in mind that if its implementation is much simpler, it'll be smashed by an OoO x86 single thread performance, but may still win on real world workloads.


>> One of the perennial questions that I've never received a satisfactory answer to is: Where's the ILP?

Anyone with any experience in the field can come up with an instruction encoding structure that produces more instruction level parallelism in theory. It's usually a simple mental exercise to determine that such a CPU isn't implementable in the "real world." That's exactly what is happening here with Mill except they have really dived off the deep end and do so much handwaving that it's hard to tell exactly how mill is supposed to work.

Take branch prediction: Last time I checked they wanted the processor to load what they described as an "exit table" associated with a module of code that serves as a branch prediction unit. The problem is that this doesn't work at all. You can't have a CPU load a separate data structure that replaces its internal state like this.

First off, this is way too complex as a piece of front end logic for the processor. Trying to access this as code is running is basically impossible. Secondly, A branch predictor has to produce a prediction immediately, you can't just handwave and say you will ask an internal data store structure to produce a result. Their description of this is that it is in some way a "hash table" (http://millcomputing.com/wiki/Prediction). That's really interesting because you would never ever want to implement a hash table in a CPU. Either a) they described it analogously as acting like a hash table; which doesn't make a lot of sense here in a technical description b) they really want to implement a hash table lookup as a branch predictor which be horrendously complex and would never work latency wise or c) they are are calling an associative cache a hash table. They also want to somehow update this "hash table" with a correct prediction on a mispredict. So somehow the processor will update a separate data structure (using a hash table lookup) every time there is a misprediction. Which again, doesn't work latency wise.

By the way, all of this somehow has to be loaded and unloaded seamlessly into main memory.

Notice that there is no description of any advanced logic that is used in modern processors to dramatically increase the accuracy of predictors. I don't even understand how they would possibly implement any of that advanced logic with the branch prediction unit so complex and decoupled from the front end of the processor.

Of course, this kind of compiler driven prediction has been tried before and it doesn't work for obvious reasons. The compiler has a lot less information about the current state of the program than the processor does.

So to recap: they described a feature that doesn't work as described, if it did work would be horrendously complex and wouldn't properly work as a branch predictor and finally doesn't actually provide better predictions at all.

This is basically Mill in a nutshell.


Disclaimer: Going off memory of the presentations myself, not at all an expert, etc.

Ivan Goddard's descriptions of the branch predictor sounds quite simple that I can remember; The predictor itself is completely detached from the compiler, outside of the instruction layout falling into EBBs. A point that (I think is on the Mill forum somewhere) they've said about the prediction algorithim itself, is that they're doing simulations with nearly the simplest known prediction algorithm, and getting something like 90% success rates from that compared to the 98% the state-of-the-art sees. The difference is entirely in what data is carried by the prediction. (I believe this is in the forum thread on Switch statements, referenced in this video) What the predictor stores and delivers as output is the logical equivalent to an instruction pointer, and where in the EBB they exit, instead of a boolean (or an entire exit table). Which is more state, but not replacing the entire state of the processor. And, as explained in this presentation, the impact of having (effectively) an instruction pointer from the branch predictor is you know what to prefetch from memory. And that pointer, being associated with some other EBB that the predictor might know about, is also enough to ask the predictor for the next prediction, get a second prefetch, and so on.

That also has implications to a point Goddard made in this presentation, in that some entire Switch statements have only one "EBB" (although that's loose terminology itself), and thus only one prediction associated with it. The prediction is to where the block will exit, not whether any specific prediction will succeed, thus there is no case where the processor will stall multiple times before finding the correct control flow. It can only fail once before it determines the correct exit to the block (in general, at least...)

Which is I believe how they are expecting good results, rather than anything peculiar about how they determine the prediction from history.


I'm not sure I follow about the benefits of this branch prediction scheme in the first place. I already have a branch target from the branch target buffer( assuming I'm in a conventional processor), so either the frontend can begin immediately loading code from the cache or I stall while I load from main memory. I'm never prefetching anything(at least in terms of an initial target)

Let's assume that Mill can produce a prefetch chain though that can load additional code segments beyond the initial branch target. Does this really benefit the processor? I don't understand how a static prediction of where I will branch to next after an initial branch target is even useful. If I have already branched to that location before the code will be sitting in the L1 cache waiting for me, if it isn't, well, how does the branch predictor know more about the behavior of my program than the L1 cache?

To put it another way: if the instruction set doesn't fit in the L1 cache is there some predictable way that I can load and unload the cache just based on a given branch target( that in turn implies other branch targets will be loaded next with a total instruction set larger than the cache). I don't really think there is any possible way to predict the behavior of an entire program like this. Larger instruction segments and more branch targets imply more complex programs that are correspondingly harder to predict so I'm always better off just relying on the instruction cache.

Assuming Mill tries to load entire chains of instructions into the cache based on a simple static target I would be much better off just disabling this and relying on the instruction cache in almost all cases therefore.


The target is not static; it is dynamically updated as part of history.

Prefetch chaining is to get code out of DRAM, and it runs DRAM-latency ahead of execution. Separately, fetch chaining is to get the code up to the L1, and it runs L2/L3-latency ahead of execution. Load chaining gets the lines from the L1 to the L0 micro-caches and the decoders, and runs decode-latency (3 cycles on the Mill) ahead of execution.

The Mill stages instructions like this because the further ahead in time an action is the more likely that the action is down a path that execution will not take. We don't prefetch to the L1 because we don't have enough confidence that we will need the code to be willing to spam a scarce resource like the L1. But avoiding a full DRAM hit is important too, so we stage the fetching. It doesn't matter at all in small codes that spend all their time in a five-line loop, but that's not the only kind of codes there are :-)


Perhaps you could revisit the prediction talk. Short summary:

Table reload is part of our support for the micro-process coding style. It is essentially irrelevant for typical long running benchmarks, especially those that discard the first billion instructions or so to "warm up the caches".

Table reload provides faster response for cold code (either new processes or running processes at a program phase boundary) than simply letting the predictor accumulate history experience. There are heuristics that decide whether execution has entered a new phase and should table-load; the table is not reloaded on every miss. Like any, the heuristics may be better or worse for a given code.

The historical prediction information is in the load module file and is mapped into DRAM at process load time, just like the code and static data sections. Table-load is memory-to-predictor in hardware and is no more difficult than any of the other memory-to-cache-like-structure loading that all cores use, such as loading translation-table entries to a TLB.

While a newly-compiled load module file gets a prediction table from the compiler, purely as being better than nothing, the memory image from the file is continually updated during execution based on execution experience. When the process terminates, this newly-augmented history is retained in the file, so a subsequent run of the same load module is in effect self-profiling to take advantage of actual execution history. Of course, programs behave differently from run to run and the saved profile experience may be inapt for the next run; there are heuristics that try to cope with that too, although we have insufficient experience as yet to know how well those work. However, we are reasonably confident that even inapt history will be better than the random predictions made by a conventional predictor on cold code.

As always in the Mill, we welcome in the Forum (millcomputing.com/forum) posts of the form "I don't understand how <feature> works - don't you have trouble with <problem>?". Unfortunately, time and audience constraints don't let us go as deep into the details in our talks as we'd like, but the details are available for you. If, after you have understood what a feature is for and how it works, you still see a problem that we have overlooked (as has happened a lot over the years; part of the reason it's been years) then we'd really welcome your ideas about what to do about it, too.


Is there really anything wrong with conflating a hash table and an associative cache in this context? An associative cache uses a portion of the address as the key, and has a usually limited capability for handling collisions. From a software perspective this is pretty similar to how a hash table works, even if the circuit that you visualize when hearing "hash table" is different.

If we assume that they really do mean "associative cache" for the predictor table, then that's obviously something that really works in hardware, and can be updated on mis-predicts. Flushing the predictor state out to some OS-accessible area when the task quits doesn't sound like magic to me. Pre-loading the cache with warm data when the task is launched next time would be very latency-sensitive and is thus more questionable, but I don't think you've adequately explained why it's impossible to pull off.


>>If we assume that they really do mean "associative cache" for the predictor table, then that's obviously something that really works in hardware, and can be updated on mis-predicts. Flushing the predictor state out to some OS-accessible area when the task quits doesn't sound like magic to me.

That is quite a complex operation though. How large is the associative table that is being flushed and how often does the flushing occur? Assuming we are talking about something in the general size of a BTB loading this entire segment into the branch predictor on very context switch is possible(although how this is implemented on silicon would have to be quite complex)

The data structure described in the paper must be quite a lot larger than a simple BTB though if the predictor is to work as described. It has to hold significantly more information than a BTB if Mill wants to replace all the other associated branch prediction machinery with a table that needs to predict every target. I don't believe a prediction table this size would be feasible to load on every context switch. It's possible they want a smaller table and to replicate some of the branch prediction machinery in a modern CPU. But this seems to defeat the entire point of having a prediction table produced by a compiler and updated by successive runs of the program.

It's very hard to tell exactly what they are loading or what the table will actually do. That's why I'm not surprised they described it as a hash table. It's a simple explanation for how their solution works that avoids having to explain tricky implementation details.

>> but I don't think you've adequately explained why it's impossible to pull off.

I think it's theoretically possible in the sense that they could design silicon that does something similar to what they described. I totally understand what they are trying to do:

Have the compiler produce a complete set of predicted branch targets and improve the targets over time by instrumenting the process. It sounds great superficially, you don't have to bake any branch prediction logic into the cpu at all and you can use the all the information the compiler has combined with how the program actually runs. The problem(performance-wise) is that most branches are executed many times and have dynamic behavior that can't be exploited by just producing a table of branches.

Silicon-wise I don't see any way they could integrate such a complex piece of standalone logic into the frontend of the CPU and still maintain acceptable latency. The branch predictor of modern CPUs is incredibly well engineered and optimized and this just isn't in the same ballpark and can't perform as well.


> It has to hold significantly more information than a BTB if Mill wants to replace all the other associated branch prediction machinery with a table that needs to predict every target.

The exit table and the prediction cache that they've described are only intended to be the common subset of prediction mechanisms that differ from a typical CPU, while a high-end Mill could also be incorporating many of the tricks you would expect on a desktop-class CPU, but those details were beyond the scope of the prediction talk.

High-end x86 CPUs already have pretty complex BTBs, often multi-level caches with different capacity for different branch types and limits on how many branches per I$ line. The only extra thing the Mill is adding with its prediction cache is the length information on how many bytes of instructions to prefetch and decode to get to the expected branch point (whereupon the prediction cache will be consulted a second time to start on the next EBB). As with x86 cores, I don't think the prediction cache/BTB would always be the same buffer or same size as the history pattern table or other model-specific prediction mechanisms.

The exit table is further from the execution pipeline and isn't on the critical path for recovering from a mispredict. It gets feedback from the core but only directly feeds into the prefetcher. This is the cache that will have the save/restore mechanisms and will load data from RAM in batches. It sounds a lot like a TLB in these respects, though the context switch that triggers loading new data is a function call rather than only a process switch. It definitely doesn't need to predict everything. In the talk it was mentioned that the exit table would have victim buffers and has a 3-cycle latency to the prefetcher. (And remember that Mill clock cycles will probably be pretty long in initial implementations.)


There seems to be no hardware implementation right now, not even on FPGA (last message about it dating from 2014). I also couldn't find a GCC or LLVM backend for its architecture and no software simulator with or without cycle counters.

As much as I'd love to see something new in the general-purpose CPU arena (was Itanic the last attempt?) I will remain skeptical of such claims until they can be demonstrated by actual benchmarks.


Sorry there isn't a glut of easily-accessible material.

The Mill is statically-scheduled, so WYSIWYG and you can just count cycles from the assembly.

Here's an example of someone hand-rolling assembly and counting the ILP: http://millcomputing.com/topic/switches-orig/#post-2846

He gets his ops, phases, latencies etc from http://millcomputing.com/wiki/Instruction_Set (there's a slots/latencies table at the bottom of the description of each op)

The instruction-set doc is slightly out-of-date and we know that and we're regenerating it soon, but most ops have same latencies now as documented. We also know there is scant examples for those who don't want to hand-roll stuff; we are working on doing some prettily-rendered examples, initially without auto-vectorization and loop-pipelining (so not really unlocking the full potential of the Mill, but better than nothing).


The Mill team seem to have a strong aim for commercial success. Talks and documentation are typically only released after they have secured their patents. I'm also skeptical as to if they will make a viable product at the end of the day. None the less these videos have been some of the most enjoyable technical talks I've come across.

But I sincerely hope they do find commercial success.


They've got a huge mountain of patents to file and sort out before they can start going public, I think.


On top of this, they have gotten patents filed, and as they do go through the process, information gets released about what is now covered by patent.

As in, they are making good on the promise of actually proving it isn't vaporware, it is just a slow and painful process.

Anything that gets leaked ahead of time can be used to derail their patent process and depending on what it is, it could be the difference between being a profitable venture or not.

Everyone who makes certain types of processors (which includes Intel and AMD both, along with all the ARM licensees) have a vested interest in ruining this effort, no matter if Mill ends up being successful or not (as in, Mill holding certain patents even if their implementation was not successful, is still a highly profitable venture).

As a side note, a lot of people use Itanium as an excuse to shit on Mill: just remember, quite a few things engineered for Itanium have found their way back into today's x86 designs. As in, things Intel patented in Itanium are still in use and lived on. Even though the Mill is strange, if it doesn't succeed, parts of it will live on in other designs.


> I also couldn't find a GCC or LLVM backend for its architecture and no software simulator with or without cycle counters.

They do have a compiler, and they do have a simulator. Or rather compilers and simulators: one for each Mill family member. The simulator, compiler and related intermediary tooling (and the specification they use to produce them) are the first “real” things they've written and gotten working, and the simulator defines the instruction set. They've done a live demo of editing the specification and producing a new simulator, so I'm inclined to believe there's no subterfuge here.

However, they're not public. Possibly because they haven't finished filing all their patents?




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: