One of the perennial questions that I've never received a satisfactory answer to is: Where's the ILP?
Every time I get asked this, I get referred to the "Phasing" talk, but that isn't a satisfactory response either.
I think quantitative analysis will prove to be far more nuanced and damning than expected. Take this paper (https://pdfs.semanticscholar.org/1002/50a822251d1302c146ab49...). It analyzes a processor with much STRONGER characteristics than the Mill: it dynamically identifies instruction dependencies and then schedules around them.
The result? It gets smashed by a classic OoO.
I'm not really sure what this means, especially "stronger characteristics". The paper seems to have started with an idealized OoO processor, then subtracted from its capabilities and demonstrated that performance suffered. No surprise there, but that's not easy to translate into a comparison against the Mill.
The Mill is not expected to ever compete against a full OoO processor of equivalent width; the Mill is expected to always be much wider.
I haven't fully thought through the implications, but I think there might be a significant difference in the instruction scheduling algorithm between the paper's model and a Mill. The paper's model will queue an instruction as soon as at least one of its dependencies has been scheduled, or slot it into an empty queue (where it may cause head of line blocking). By contrast, the ahead-of-time scheduling done by a compiler for the Mill won't issue an instruction until all of its dependencies are in the pipeline and due to finish in time (or stall for memory fetches), and uses explicit NOPs instead of head of line blocking on an issue queue.
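A minimal sketch of that difference (all op names and latencies are invented for illustration, and this is single-issue for clarity; it is not a model of any real Mill): the scheduler refuses to place an op until all of its producers are due to finish, padding with explicit NOPs rather than letting a queue head block.

```python
# Hypothetical ahead-of-time scheduling in the style described above:
# an op is issued only once every dependency's result is available,
# with explicit NOPs filling the gaps instead of stalled issue queues.
def schedule(ops, latency):
    """ops: dict op-name -> list of dependency names, in program order."""
    done_at = {}   # op -> cycle its result becomes available
    slots = []     # one op (or "nop") per cycle
    cycle = 0
    for op, deps in ops.items():
        # earliest cycle at which all inputs are ready
        ready = max((done_at[d] for d in deps), default=0)
        while cycle < ready:
            slots.append("nop")   # explicit NOP, no head-of-line blocking
            cycle += 1
        slots.append(op)
        done_at[op] = cycle + latency[op]
        cycle += 1
    return slots

ops = {"load": [], "add": ["load"], "cmp": ["add"], "br": ["cmp"]}
latency = {"load": 3, "add": 1, "cmp": 1, "br": 1}
print(schedule(ops, latency))
# -> ['load', 'nop', 'nop', 'add', 'cmp', 'br']
```

The point of the sketch is that the NOP slots are decided entirely at compile time; the queue-based model in the paper would instead discover those bubbles at runtime, one queue head at a time.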
The Mill has very specific storage for arguments and results. Each operation needs three lines to that storage - two for arguments and one for the result. The storage has O(num-of-wires^2) wiring overhead, and you cannot get away from that in general. You can get away with less (the Alpha AXP (OoO), for example, had a split register file where results migrated from one half to the other behind the scenes), but it is very complex to implement. VLIW architectures have some nice width, and they all suffer from the problem that a register file of that width presents. You can see it in the clock speeds - Pentiums ran circles around the Itanium in terms of clock speed, precisely because the Itanium has a large register file (128 registers) and wide issue.
I keep coming back to discussions of the Mill and keep finding the argument that the Mill will be wider. And no one bothers to address my argument that width requires O(width^2) wires in the storage, which makes the storage impractical or too complex to implement.
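The scaling argument can be made concrete with a back-of-the-envelope calculation (the 2-read/1-write port count per op is the usual textbook assumption, not anything Mill-specific, and "area" here is only a relative figure of merit):

```python
# Back-of-the-envelope: register-file wiring growth with issue width.
# Each issue slot needs ~2 read ports and 1 write port, and per-bitcell
# wire/area overhead grows roughly with (total ports)^2.
def relative_rf_area(width, reads_per_op=2, writes_per_op=1):
    ports = width * (reads_per_op + writes_per_op)
    return ports * ports   # the O(width^2) term in the argument above

for w in (1, 4, 8, 16):
    print(w, relative_rf_area(w))
# 1 -> 9, 4 -> 144, 8 -> 576, 16 -> 2304: 16x the width, ~256x the wiring
```

Real designs dodge this with banking, clustering, or (as with the Alpha example above) split register files, but each dodge brings its own complexity, which is exactly the point being made.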
FWIW, someone (disjoint from the Mill team!) has actually made a model of the Mill's phasing and found that it fails to deliver.
And "stronger characteristics" means that it dynamically identifies dependencies.
Citation is definitely needed, here.
I dumped the instruction stream for coremark out of the LLI interpreter, hacked a tiny bit. To save trouble, here's one iteration of coremark:
gunzip the data, run it through the script and count the lines out vs lines in. (I get "IPC" ~2.12.) For a bit more insight, the number on the left of each output line is how many instructions fused. You can make a histogram like:
cat tmp.out | cut -f 1 -d " " | sort -n|uniq -c
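For anyone who'd rather do it in Python, the same IPC estimate and fusion histogram can be computed directly, assuming the output format described above (fused-op count as the first whitespace-separated field of each output line):

```python
# Recompute "IPC" (input ops per output bundle) and the fusion histogram
# from the script's output. The sample lines below are made up to show
# the assumed format; they are not real coremark data.
from collections import Counter

def ipc_and_histogram(out_lines, in_line_count):
    counts = [int(line.split()[0]) for line in out_lines]
    ipc = in_line_count / len(out_lines)   # lines in vs. lines out
    return ipc, Counter(counts)

bundles = ["3 add mul ld", "2 cmp br", "1 st"]   # hypothetical output
ipc, hist = ipc_and_histogram(bundles, in_line_count=6)
print(round(ipc, 2), dict(hist))   # -> 2.0 {3: 1, 2: 1, 1: 1}
```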
To think I messed around with Valgrind ;) This is much easier.
I look at the particulars of the grouping and think 'no, that's not right', and 'that can't be with this', and 'that can be with that', and so on. The results are far too back-of-the-envelope to be right. But the approach is excellent, and that's the main thing :)
I'm going to crunch through the IR recording and work out the ILP with/without speculation etc. Will leave pipelining and auto-vectorization off the table. As this is all public stuff I reckon the results can be made public.
Wanna work with me on this? I'd like you back. This time it might work. Mail me :)
But doesn't this paper model a machine that is substantially narrower and deeper than a high-end Mill is expected to be? And what you judge to be the depth of a dependency chain depends on whether you're looking at a 128-instruction reorder buffer or if you're a compiler looking at the entire function.
Figure 4 in the paper particularly seems to indicate that the failing of the dependence-based model is entirely in the front-end, which seems to be the least Mill-like part of the model.
Stalled by what, though? Stalled waiting for memory (partially alleviated by the Mill's deferred loads), stalled waiting for dispatch from the queues that the Mill doesn't have, or stalled when at the head of the dispatch queue and ready to issue when the assigned execution unit is busy but some other execution unit is free?
The whole dispatch vs issue mismatch they're complaining about doesn't exist on the Mill or any other statically-scheduled machine (though there's a related problem in how many instruction slots end up as NOPs). There are at least three other features of the Mill that make the example in figure 6(a) inapplicable, and they didn't diagram the dependencies from the real code used to generate figure 6(b).
It looks like add operation number 5 can't use the empty cycle on lane 2 because compare number 3 would already be queued in that lane. The time-reversed, consumers-first scheduling heuristic that a Mill compiler uses wouldn't paint itself into this corner the way this model's dispatcher does.
Figure 6(a) for a Mill would look like an ALU pipeline doing an add operation every cycle without stall, a load pipeline issuing a load operation every cycle without stall, another ALU pipeline doing compare operations every cycle but lagging behind by several cycles instead of just one, and a branch unit doing the conditional jump every cycle immediately after the corresponding compare. And on a mid-level or high-end Mill there would be plenty of execution units to spare for a non-trivial loop body without reducing throughput, and the whole thing would be vectorized too.
If this mechanism doesn't work as expected with real software, then there's no reason to think that the Mill will fare any better for general-purpose computation than any earlier VLIW/EPIC machine, for all the same reasons as before. And in that same talk they make some claims about OoO that seem a bit naive. The state-of-the-art has evolved a lot since the company was founded in 2003.
Fun fact: the Mill's proposed method for hiding that latency, "deferred loads", has already been done by Duke's Architecture Group: http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf (warning: PDF link).
The big gain? A measly ~8%.
But like I said, I wouldn't be surprised if I'm just not understanding what I'm reading/watching.
Of course, the characteristics of memory accesses are quite different in a GPU: very high bandwidth at the cost of very high latency.
I encourage you to read McFarlin's publications; they contain very good OoO statistics.
As a sidenote, the primary advantage of the Mill in software pipelining seems to be that they can software pipeline loops without a need for a prologue or epilogue and thus save a lot of cache space.
Elbrus (https://www.google.ch/patents/US20020133813) had this too, didn't help them get good ILP.
Oh, and someone (disjoint from the Mill team!) has actually made a model of the Mill's phasing and found that it fails to deliver.
Are any Intel engineers on this thread saying, whelp if they can do 6 we may as well give up? Or does the lack of any data help you sleep soundly and skeptically?
Do they in real code?
The Mill team claims (with no running general purpose code OR data of any sort) that they can. YMMV, but I bet Intel sleeps safe. Which sucks because I actually want innovation in computer architecture, which is a notoriously uninnovative industry, and the Mill has some genuinely good ideas, but I don't think they'll work here.
4 µop/cycle after fusion.
It is very dependent on the workload.
CISC is very efficient. CISC CPUs do the most ops per transistor.
"CISC is very efficient" is a meaningless statement. The specific instruction set and its implementation is or is not efficient.
Remember that a VLIW CPU has a lot of sub-ops in each instruction word, so by design it has more ILP - though only when the workload allows many of those sub-op slots to be filled, and how much work each sub-op does can muddy the waters.
Micron will eat 1000 nested switches 100 cases wide every cycle. This sort of TCAM tech was rare 5 years ago, but is everywhere now.
I'd say the Micron Automata processor (which never really shipped from Micron in performant form - there were some "too slow to provide real numbers" units out there) is very interesting. But many of the applications for it, in pattern matching and elsewhere, are contingent on doing useful parallel work if they are to outcompete an optimized software implementation. So if you have a giant NFA where you need to run 16384 states all the time, the Micron AP looks great. However, if you can manage to find sleazy optimization tricks that allow you to cheat and get down to running the occasional 256-bit NFA over a few bits of your input, the AP isn't competitive.
More of the pattern matching task that we're familiar with (i.e. for network security) looks like the latter case (tricks being possible).
I'm very interested in seeing the tech get out there, but it's been spun out into a startup and I don't know what the future looks like for it.
Any new microarchitecture needs to first answer that big question -- how is the core going to deal with memory?
Out-of-order execution means a CPU will re-order instructions on-chip to account for stalls due to cache misses or (on superscalar processors) due to a mismatch between the kinds of instructions and the kinds of execution units available. OoO execution requires analyzing instruction dependency information pretty early in the pipeline, and very quickly, to keep up a steady stream of ready-to-execute operations.
An in-order processor like the Mill only pays attention to instruction dependencies to determine whether a memory fetch has completed, and if not, to stall the pipeline until it has. The Mill compensates for that performance disadvantage by being statically scheduled: all the decisions about how to schedule the instructions are handled by the compiler toolchain, which has to know the exact instruction latencies and mix of available functional units on that model. An in-order processor that isn't statically scheduled could be superscalar, but it would be hard to get several functional units in use simultaneously. For example, the original Pentium was in-order but could issue up to two instructions simultaneously, with restrictions on which instructions could be executed in which pipe, and if the two instructions in a pair didn't take the same amount of time, a third (and maybe fourth) instruction couldn't start until both had finished.
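That pairing restriction can be modeled crudely (this is a caricature for the sake of the argument; the real Pentium's pairing rules involved instruction types, pipes, and more):

```python
# Toy model of original-Pentium-style dual issue: instructions pair up
# two at a time, and the next pair can't start until the slower member
# of the current pair has finished. Latencies below are made up.
def pentium_cycles(latencies):
    cycle = 0
    for i in range(0, len(latencies), 2):
        pair = latencies[i:i + 2]
        cycle += max(pair)   # both pipes wait for the slower instruction
    return cycle

print(pentium_cycles([1, 1, 1, 1]))   # four 1-cycle ops pair perfectly -> 2
print(pentium_cycles([1, 3, 1, 1]))   # a 3-cycle op drags its pair -> 4
```

A statically scheduled compiler would try to avoid creating the second situation in the first place, by pairing the 3-cycle op with other long-latency work.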
That sounds pretty bad from a performance standpoint. You'd have to compile a binary for every single CPU to get the best performance. Maybe it could work with JIT languages running on the JVM and CLR.
How would you have a CPU be able to clock-up and clock-down and still maintain good performance?
So for any reasonable implementation we're looking at JIT with caching, for running platform independent code. The very definition of a process VM is "to execute computer programs in a platform-independent environment".
I'm not saying it will not be faster, just that it is a weak point in the Mill architecture compared to x86.
Keep in mind that even if its implementation is much simpler, it'll be smashed by an OoO x86 in single-thread performance, but may still win on real-world workloads.
Anyone with any experience in the field can come up with an instruction encoding structure that produces more instruction-level parallelism in theory. It's usually a simple mental exercise to determine that such a CPU isn't implementable in the "real world." That's exactly what is happening here with the Mill, except they have really dived off the deep end and do so much handwaving that it's hard to tell exactly how the Mill is supposed to work.
Take branch prediction: Last time I checked they wanted the processor to load what they described as an "exit table" associated with a module of code that serves as a branch prediction unit. The problem is that this doesn't work at all. You can't have a CPU load a separate data structure that replaces its internal state like this.
First off, this is way too complex as a piece of front-end logic for the processor. Trying to access this as code is running is basically impossible. Secondly, a branch predictor has to produce a prediction immediately; you can't just handwave and say you will ask an internal data structure to produce a result. Their description is that it is in some way a "hash table" (http://millcomputing.com/wiki/Prediction). That's really interesting, because you would never ever want to implement a hash table in a CPU. Either a) they described it analogously as acting like a hash table, which doesn't make a lot of sense in a technical description; b) they really want to implement a hash table lookup as a branch predictor, which would be horrendously complex and would never work latency-wise; or c) they are calling an associative cache a hash table. They also want to somehow update this "hash table" with a correct prediction on a mispredict. So somehow the processor will update a separate data structure (using a hash table lookup) every time there is a misprediction. Which, again, doesn't work latency-wise.
By the way, all of this somehow has to be loaded and unloaded seamlessly into main memory.
Notice that there is no description of any advanced logic that is used in modern processors to dramatically increase the accuracy of predictors. I don't even understand how they would possibly implement any of that advanced logic with the branch prediction unit so complex and decoupled from the front end of the processor.
Of course, this kind of compiler driven prediction has been tried before and it doesn't work for obvious reasons. The compiler has a lot less information about the current state of the program than the processor does.
So to recap: they described a feature that doesn't work as described, that would be horrendously complex if it did work, that wouldn't properly function as a branch predictor, and that doesn't actually provide better predictions at all.
This is basically Mill in a nutshell.
Ivan Godard's descriptions of the branch predictor sound quite simple, from what I can remember; the predictor itself is completely detached from the compiler, outside of the instruction layout falling into EBBs. A point they've made about the prediction algorithm itself (I think it's on the Mill forum somewhere) is that they're doing simulations with nearly the simplest known prediction algorithm, and getting something like 90% success rates from that, compared to the 98% the state-of-the-art sees. The difference is entirely in what data is carried by the prediction. (I believe this is in the forum thread on switch statements, referenced in this video.) What the predictor stores and delivers as output is the logical equivalent of an instruction pointer and where in the EBB they exit, instead of a boolean (or an entire exit table). Which is more state, but not replacing the entire state of the processor. And, as explained in this presentation, the impact of having (effectively) an instruction pointer from the branch predictor is that you know what to prefetch from memory. And that pointer, being associated with some other EBB that the predictor might know about, is also enough to ask the predictor for the next prediction, get a second prefetch, and so on.
That also bears on a point Godard made in this presentation: some entire switch statements have only one "EBB" (although that's loose terminology itself), and thus only one prediction associated with them. The prediction is of where the block will exit, not whether any specific branch will be taken, so there is no case where the processor stalls multiple times before finding the correct control flow. It can only fail once before it determines the correct exit from the block (in general, at least...).
Which is, I believe, how they are expecting good results, rather than anything peculiar about how they determine the prediction from history.
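As I read it, the core mechanism could be modeled roughly like this (a dict stands in for what would be a set-associative, cache-like structure in hardware, and all names and addresses here are mine, not Mill's):

```python
# Sketch of an exit predictor as described above: keyed by EBB entry
# address, it returns the predicted exit target, which is itself the
# entry address of the next EBB - so predictions can be chained to
# drive prefetch several EBBs ahead.
class ExitPredictor:
    def __init__(self):
        self.table = {}   # dict as a stand-in for a set-associative cache

    def predict(self, ebb_entry):
        return self.table.get(ebb_entry)   # None on a cold miss

    def update(self, ebb_entry, actual_exit):
        # called with the true exit target on retire/mispredict
        self.table[ebb_entry] = actual_exit

    def prefetch_chain(self, start, depth):
        """Follow predictions EBB-to-EBB to build a prefetch target list."""
        chain, cur = [], start
        for _ in range(depth):
            nxt = self.predict(cur)
            if nxt is None:
                break   # no prediction; prefetching stops here
            chain.append(nxt)
            cur = nxt
        return chain

p = ExitPredictor()
p.update(0x1000, 0x2000)
p.update(0x2000, 0x3000)
print(p.prefetch_chain(0x1000, 4))   # -> [8192, 12288], i.e. 0x2000, 0x3000
```

The key design point is that one lookup yields both a prediction and the key for the next lookup, which is what lets prefetch run ahead of execution.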
Let's assume that the Mill can produce a prefetch chain that loads additional code segments beyond the initial branch target. Does this really benefit the processor? I don't understand how a static prediction of where I will branch to next after an initial branch target is even useful. If I have already branched to that location before, the code will be sitting in the L1 cache waiting for me; if it isn't, well, how does the branch predictor know more about the behavior of my program than the L1 cache?
To put it another way: if the code doesn't fit in the L1 cache, is there some predictable way I can load and unload the cache based on a given branch target (that target in turn implying other branch targets will be loaded next, with a total footprint larger than the cache)? I don't really think there is any possible way to predict the behavior of an entire program like this. Larger instruction segments and more branch targets imply more complex programs that are correspondingly harder to predict, so I'm always better off just relying on the instruction cache.
Assuming the Mill tries to load entire chains of instructions into the cache based on a simple static target, I would therefore be much better off just disabling this and relying on the instruction cache in almost all cases.
Prefetch chaining is to get code out of DRAM, and it runs DRAM-latency ahead of execution. Separately, fetch chaining is to get the code up to the L1, and it runs L2/L3-latency ahead of execution. Load chaining gets the lines from the L1 to the L0 micro-caches and the decoders, and runs decode-latency (3 cycles on the Mill) ahead of execution.
The Mill stages instructions like this because the further ahead in time an action is the more likely that the action is down a path that execution will not take. We don't prefetch to the L1 because we don't have enough confidence that we will need the code to be willing to spam a scarce resource like the L1. But avoiding a full DRAM hit is important too, so we stage the fetching. It doesn't matter at all in small codes that spend all their time in a five-line loop, but that's not the only kind of codes there are :-)
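The staging described above can be sketched as three engines running at fixed distances ahead of execution (the distances and cycles-per-EBB below are made-up round numbers for illustration, not Mill specifics):

```python
# Sketch of staged code fetch: each engine runs a different latency's
# worth of cycles ahead of execution, so only near-certain code reaches
# the scarce L1/L0, while speculative prefetch stops at L2/L3.
STAGES = {
    "prefetch": 300,  # DRAM -> L2/L3, ~DRAM-latency ahead of execution
    "fetch": 30,      # L2/L3 -> L1, ~L2/L3-latency ahead
    "load": 3,        # L1 -> L0/decoders, ~decode-latency ahead
}

def stage_positions(exec_cycle, cycles_per_ebb=10):
    """Which EBB index each stage is working on at a given cycle."""
    return {name: (exec_cycle + ahead) // cycles_per_ebb
            for name, ahead in STAGES.items()}

print(stage_positions(exec_cycle=100))
# -> {'prefetch': 40, 'fetch': 13, 'load': 10}
# execution is in EBB 10; prefetch is ~30 EBBs down the predicted path,
# which is exactly where predictions are least likely to still be right.
```

That last comment is the trade-off in the paragraph above: the further ahead a stage runs, the cheaper the resource it is allowed to fill.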
Table reload is part of our support for the micro-process coding style. It is essentially irrelevant for typical long running benchmarks, especially those that discard the first billion instructions or so to "warm up the caches".
Table reload provides faster response for cold code (either new processes or running processes at a program phase boundary) than simply letting the predictor accumulate history. There are heuristics that decide whether execution has entered a new phase and should table-load; the table is not reloaded on every miss. Like any heuristics, these may be better or worse for a given code.
The historical prediction information is in the load module file and is mapped into DRAM at process load time, just like the code and static data sections. Table-load is memory-to-predictor in hardware and is no more difficult than any of the other memory-to-cache-like-structure loading that all cores use, such as loading translation-table entries to a TLB.
While a newly-compiled load module file gets a prediction table from the compiler, purely as being better than nothing, the memory image from the file is continually updated during execution based on execution experience. When the process terminates, this newly-augmented history is retained in the file, so a subsequent run of the same load module is in effect self-profiling to take advantage of actual execution history. Of course, programs behave differently from run to run and the saved profile experience may be inapt for the next run; there are heuristics that try to cope with that too, although we have insufficient experience as yet to know how well those work. However, we are reasonably confident that even inapt history will be better than the random predictions made by a conventional predictor on cold code.
As always in the Mill, we welcome in the Forum (millcomputing.com/forum) posts of the form "I don't understand how <feature> works - don't you have trouble with <problem>?". Unfortunately, time and audience constraints don't let us go as deep into the details in our talks as we'd like, but the details are available for you. If, after you have understood what a feature is for and how it works, you still see a problem that we have overlooked (as has happened a lot over the years; part of the reason it's been years) then we'd really welcome your ideas about what to do about it, too.
If we assume that they really do mean "associative cache" for the predictor table, then that's obviously something that really works in hardware, and can be updated on mis-predicts. Flushing the predictor state out to some OS-accessible area when the task quits doesn't sound like magic to me. Pre-loading the cache with warm data when the task is launched next time would be very latency-sensitive and is thus more questionable, but I don't think you've adequately explained why it's impossible to pull off.
That is quite a complex operation though. How large is the associative table that is being flushed, and how often does the flushing occur? Assuming we are talking about something in the general size of a BTB, loading this entire segment into the branch predictor on every context switch is possible (although how this is implemented in silicon would have to be quite complex).
The data structure described in the paper must be quite a lot larger than a simple BTB though if the predictor is to work as described. It has to hold significantly more information than a BTB if Mill wants to replace all the other associated branch prediction machinery with a table that needs to predict every target. I don't believe a prediction table this size would be feasible to load on every context switch. It's possible they want a smaller table and to replicate some of the branch prediction machinery in a modern CPU. But this seems to defeat the entire point of having a prediction table produced by a compiler and updated by successive runs of the program.
It's very hard to tell exactly what they are loading or what the table will actually do. That's why I'm not surprised they described it as a hash table. It's a simple explanation for how their solution works that avoids having to explain tricky implementation details.
>> but I don't think you've adequately explained why it's impossible to pull off.
I think it's theoretically possible in the sense that they could design silicon that does something similar to what they described. I totally understand what they are trying to do:
Have the compiler produce a complete set of predicted branch targets and improve the targets over time by instrumenting the process. It sounds great superficially: you don't have to bake any branch prediction logic into the CPU at all, and you can use all the information the compiler has, combined with how the program actually runs. The problem (performance-wise) is that most branches are executed many times and have dynamic behavior that can't be exploited by just producing a table of branches.
Silicon-wise I don't see any way they could integrate such a complex piece of standalone logic into the frontend of the CPU and still maintain acceptable latency. The branch predictor of modern CPUs is incredibly well engineered and optimized and this just isn't in the same ballpark and can't perform as well.
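The dynamic-behavior point can be shown with a toy comparison (entirely my own construction, not from the Mill material): a branch whose behavior changes mid-run defeats any single compiler-supplied prediction, while even the simplest dynamic scheme adapts.

```python
# A branch taken for the first phase of a run and never taken afterwards.
# A fixed static prediction is stuck at 50%; a trivial 2-bit saturating
# counter re-learns the new behavior after two mispredictions.
def static_accuracy(outcomes, predicted):
    return sum(o == predicted for o in outcomes) / len(outcomes)

def two_bit_accuracy(outcomes):
    state, hits = 2, 0   # states 0-1 predict not-taken, 2-3 predict taken
    for o in outcomes:
        hits += (state >= 2) == o
        state = min(state + 1, 3) if o else max(state - 1, 0)
    return hits / len(outcomes)

outcomes = [True] * 100 + [False] * 100   # phase change halfway through
print(static_accuracy(outcomes, True))    # -> 0.5
print(two_bit_accuracy(outcomes))         # -> 0.99
```

Of course this is the worst case for a static table; the open question raised above is how much real code behaves this way versus staying stable enough for profile-derived predictions to hold.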
The exit table and the prediction cache that they've described are only intended to be the common subset of prediction mechanisms that differ from a typical CPU, while a high-end Mill could also be incorporating many of the tricks you would expect on a desktop-class CPU, but those details were beyond the scope of the prediction talk.
High-end x86 CPUs already have pretty complex BTBs, often multi-level caches with different capacity for different branch types and limits on how many branches per I$ line. The only extra thing the Mill is adding with its prediction cache is the length information on how many bytes of instructions to prefetch and decode to get to the expected branch point (whereupon the prediction cache will be consulted a second time to start on the next EBB). As with x86 cores, I don't think the prediction cache/BTB would always be the same buffer or same size as the history pattern table or other model-specific prediction mechanisms.
The exit table is further from the execution pipeline and isn't on the critical path for recovering from a mispredict. It gets feedback from the core but only directly feeds into the prefetcher. This is the cache that will have the save/restore mechanisms and will load data from RAM in batches. It sounds a lot like a TLB in these respects, though the context switch that triggers loading new data is a function call rather than only a process switch. It definitely doesn't need to predict everything. In the talk it was mentioned that the exit table would have victim buffers and has a 3-cycle latency to the prefetcher. (And remember that Mill clock cycles will probably be pretty long in initial implementations.)
As much as I'd love to see something new in the general-purpose CPU arena (was Itanic the last attempt?) I will remain skeptical of such claims until they can be demonstrated by actual benchmarks.
The Mill is statically-scheduled, so WYSIWYG and you can just count cycles from the assembly.
Here's an example of someone hand-rolling assembly and counting the ILP: http://millcomputing.com/topic/switches-orig/#post-2846
He gets his ops, phases, latencies etc from http://millcomputing.com/wiki/Instruction_Set (there's a slots/latencies table at the bottom of the description of each op)
The instruction-set doc is slightly out of date; we know that and are regenerating it soon, but most ops have the same latencies now as documented. We also know there are scant examples for those who don't want to hand-roll stuff; we are working on some prettily-rendered examples, initially without auto-vectorization and loop pipelining (so not really unlocking the full potential of the Mill, but better than nothing).
They do have a compiler, and they do have a simulator. Or rather compilers and simulators: one for each Mill family member. The simulator, compiler and related intermediary tooling (and the specification they use to produce them) are the first “real” things they've written and gotten working, and the simulator defines the instruction set. They've done a live demo of editing the specification and producing a new simulator, so I'm inclined to believe there's no subterfuge here.
However, they're not public. Possibly because they haven't finished filing all their patents?
But I sincerely hope they do find commercial success.
As in, they are making good on the promise of actually proving it isn't vaporware; it is just a slow and painful process.
Anything that gets leaked ahead of time can be used to derail their patent process and depending on what it is, it could be the difference between being a profitable venture or not.
Everyone who makes certain types of processors (which includes both Intel and AMD, along with all the ARM licensees) has a vested interest in ruining this effort, no matter whether the Mill ends up being successful (as in, Mill holding certain patents, even if their implementation was not successful, is still a highly profitable venture).
As a side note, a lot of people use Itanium as an excuse to shit on Mill: just remember, quite a few things engineered for Itanium have found their way back into today's x86 designs. As in, things Intel patented in Itanium are still in use and lived on. Even though the Mill is strange, if it doesn't succeed, parts of it will live on in other designs.