Seems like a _fascinating_ architecture. I don't get the double program counter bit; it also seems like too much cognitive overhead to me. The belt (a queue of un-named slots), and function calls for free with multiple return results so that regular ops and user-defined functions work the same way, now that sure is clever.
Godard mentions that the CPU space has seen little to no innovation. But what about Transmeta? They were doing some cool stuff, no? And the DEC Alpha, that was super interesting. Probably lots more besides. I'm sure if I bothered to check that Wikipedia would have a CPU graveyard page :) Thing is, x86 has been the only game in town for a long time with Motorola challenging in the past and ARM challenging now. Maybe time for a three horse race?
Otherwise, CPU design is almost the poster-child for technical debt. The problems are more social and economic than technical. Once an architecture dominates the market it's almost impossible to replace it with something better. The best hope is to do what ARM are doing and sneak up on the market sideways, starting from a segment with much less competition.
The GHz barrier isn't caused by lack of talent, it's caused by basic thermal limits of silicon chips.
100 GHz vs. the ~3 GHz limit on silicon (which is the reason why we sometimes use strained silicon-germanium).
Once something is baked into a CPU design, that seems to be it forever, alright; I take it that's what you mean by technical debt. Besides, everybody up the stack (and that means everybody, because you can't get much lower down the stack than the CPU) has got to re-tool. Mill Computing have been working on this for 12 years! Transmeta had Linus Torvalds, for crying out loud. Best-in-the-world does not cut it when it comes to CPU design. Compatibility and small incremental logical changes seem to trump all. Or do what ARM are doing, as you say.
Exotic computing (CPU and/or memory) architectures are easier (still hard, but easier) to dream up than to implement on real hardware with a real software ecosystem (while still keeping the theoretical benefits).
As a tech geek I think what they are doing with the Mill is pretty fascinating, but it "looks like a duck, quacks like a duck" in the way of Transmeta, Alpha, Itanium, etc., and I'm old and jaded, so I'd honestly be pretty shocked if anything came of it.
Hopefully this lets them pivot and take advantage of new opportunities better than previous efforts to make a superior CPU.
While this "IoT" might be the most stupid buzzword I've ever heard, what it stands for is true: everything will be networked in the future. The only thing that matters here is that devices can talk together, and this might be their biggest shot. These days microcontrollers are already being replaced by cheap ARMs that run circles around them and are easier to develop for. This is an area where the semiconductor industry will grow massively. If they're able to prove that their design offers the same processing power for less power, a whole lot of companies could be interested.
You could have:
offset : -----000000000111111
offset : 54321012345678901234
contents: 64212iijkkllllmmmmmm (digits are instruction lengths, letters are instructions)
With instruction sizes of 1-4 bytes, you could store instruction sizes in 2 bits per instruction. That would lower the pressure on your cache for the extra information you store. There also would be more space in your instruction set because you no longer would have to worry about creating an instruction that has a prefix that also is a valid instruction. For example, you could use 0x01 both for INC and ADD IMMEDIATE VALUE, where it means the former if the instruction length is 1, the latter if it is longer, with the size of the immediate determined by the size of the instruction.
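As a rough illustration of that proposal (all opcode names and encodings below are invented for illustration, not from any real ISA), a decoder with out-of-band 2-bit length codes might look like this:

```python
# Sketch of the commenter's scheme: instruction lengths (1-4 bytes)
# stored out-of-band as 2-bit codes, so the same opcode byte can mean
# different things depending on the length.

def decode(instr_bytes, length_bits):
    """Decode a byte stream using a separate 2-bit-per-instruction
    length array. A code of 0..3 encodes a length of 1..4 bytes."""
    out = []
    pos = 0
    for code in length_bits:
        length = code + 1                     # 2 bits encode lengths 1..4
        chunk = instr_bytes[pos:pos + length]
        opcode = chunk[0]
        if opcode == 0x01:
            if length == 1:                   # 0x01 alone: INC
                out.append(("INC",))
            else:                             # 0x01 + immediate: ADD IMM
                imm = int.from_bytes(chunk[1:], "little")
                out.append(("ADD_IMM", imm))
        else:
            out.append(("OP", opcode, chunk[1:]))
        pos += length
    return out

# 0x01 (length 1) -> INC; 0x01 0x05 (length 2) -> ADD_IMM 5
print(decode(bytes([0x01, 0x01, 0x05]), [0, 1]))
```

Note the same byte 0x01 decodes two ways purely based on the side-channel length, which is the reuse-the-prefix-space win described above.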
Would that be a net win? I wouldn't know, but it seems simpler (though 'simpler' looks like a negative in their worldview) than creating those two streams that somehow must work together to do what a single statement sequence in the source code does.
You are right that you need to parse all previous instructions to get an offset. Again, this is far from my area of expertise (first time I read about CPU design), but it seems this problem could easily be alleviated by slight modifications to Huffman codes, for example by setting "blocks" of instructions of certain sizes and assigning a prefix to each block containing the number of instructions in each -- you then can get offsets very quickly.
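A toy version of that block-header idea (the header format and block size here are guesses, purely for illustration): if each fixed-size block of instructions is prefixed with its total byte length, you can skip whole blocks without parsing individual variable-length codes.

```python
# Sketch: group instructions into blocks of `per_block`, each with a
# header recording the block's total byte length, so finding an
# instruction's offset only walks block headers, not every instruction.

def find_offset(blocks, instr_index, per_block=4):
    """blocks: list of (total_bytes, [instr_lengths]) headers.
    Skip whole blocks via their headers, then sum lengths within
    the final block."""
    offset = 0
    block, slot = divmod(instr_index, per_block)
    for total, _ in blocks[:block]:
        offset += total              # O(#blocks), not O(#instructions)
    return offset + sum(blocks[block][1][:slot])

blocks = [(9, [2, 1, 2, 4]), (13, [6, 3, 3, 1])]
print(find_offset(blocks, 5))  # skips block 0 (9 bytes) + 6 = 15
```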
This is hardware. There aren't branches, there are wires. Decoding those n bits can be constant time at the cost of quadratic "space" usage.
Quadratic is O(n^2), not O(4^n). n is the number of bits.
The bits are decoded in parallel, ergo in constant time. There may be a linear step afterwards (e.g. numbering each instruction result) but the actual decode happens in parallel.
I'm a bit confused about why you think it's linear time. (I suppose technically it even has to be constant time, since the number of bits is capped at a low constant, but I suppose that's also beside the point.)
It's not actually quite that simple, since it's actually a 3-phase decode and there is some interdependence. However, the point remains that shifting the read along one instruction at a time would be too slow and parallel decode is key to fixing that.
I suppose you might instead be asking why it's not linear hardware space complexity (rather than quadratic); the answer to that is a little more subtle.
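A software analogy for that constant-time-via-quadratic-space trade-off (my own sketch, not how any real decoder is built): instantiate a candidate decode at every possible start offset, so all candidates exist up front and only selection remains. In hardware all the candidates are wires evaluating simultaneously; the sequential loop below only models the final selection mux.

```python
# Toy model: precompute a candidate decode for every (start, length)
# pair, then select the valid chain. The candidate table is
# O(n * max_len) units, i.e. quadratic when max_len scales with n.

def parallel_decode(bits, max_len=4):
    n = len(bits)
    # "Hardware": a candidate decode exists for every position/length.
    candidates = {(s, l): bits[s:s + l]
                  for s in range(n) for l in range(1, max_len + 1)}
    # Selection: follow the chain of actual lengths.
    # (Toy length rule: first bit of each instruction, plus one.)
    decoded, pos = [], 0
    while pos < n:
        length = bits[pos] + 1
        decoded.append(candidates[(pos, length)])
        pos += length
    return decoded

print(parallel_decode([1, 0, 0, 1, 1]))  # [[1, 0], [0], [1, 1]]
```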
All I will say is that they seem to have really tried to simplify and regularize the Mill _except_ in the opcode stream / decode step, where they say that two streams is optimal! It seems like it would be a nightmare and introduce all sorts of complexities. It was the only bit of the talk I didn't grok, whereas most other bits had me nodding my head and thinking "oh, neat".
Each half is grouped with the same half of other instructions to make a stream, and the two streams decode together, step by step, so each cycle one instruction, comprising both halves, decodes and issues.
The result is to double the available top level instruction cache, and cut in half the work that has to be handled by each of the two decoders.
With this two-stream approach, you can split the bits that describe the instruction lengths from the bits that are the actual instructions, and move the former into the downward-moving 'CPU counter'. That way, you have a simple array containing those bits. That allows you to find the length of an instruction or, alternatively, its offset within its group in O(1) fashion.
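To make that O(1) claim concrete, here is a minimal sketch (assuming a plain array of 2-bit length codes; the Mill's actual header format differs): prefix sums over the length array give every instruction's offset. In hardware this would be a small parallel prefix adder rather than a sequential loop.

```python
# Sketch: once the length codes live in their own contiguous array,
# a prefix-sum table gives the byte offset of every instruction in
# its group without parsing the instructions themselves.

def build_offsets(length_codes):
    """offsets[i] is the byte offset where instruction i starts;
    a 2-bit code of 0..3 encodes a length of 1..4 bytes."""
    offsets = [0]
    for code in length_codes:
        offsets.append(offsets[-1] + code + 1)
    return offsets

# lengths 2,1,2,4,6 (the ii j kk llll mmmmmm example further up)
print(build_offsets([1, 0, 1, 3, 5]))  # [0, 2, 3, 5, 9, 15]
```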
From another comment, it seems that doesn't solve the main problem, which is that, with variable-length instructions, you need the ability to shift about any contiguous sequence of bits from a cache line (worst case: cache lines) into an execution unit. That wouldn't be that hard, if it didn't have to be very fast, too.
The problem is not figuring out what sizes things are, but the routing. When you have 30+ instructions a cycle, you need to route a lot to a lot of different places. You also can't do a parallel read since the size of previous instructions affects the placement of the next instruction - you will eventually fix this with routing but it's not easy.
Consider a 30-wide instruction. Decoding it involves knowing where in the instruction stream it starts. This can only be known once the prior instruction is partially parsed.
Consider its latter half: this can only be read after the first half is read. In your case, you actually only need to read the 2-bit prefixes, but the difficulty remains.
With a split instruction stream, there are two independent (at least until a branch) program counters, so you have twice the contextual information. You already know where the second half starts from the prior instruction.
One more great thing about this is that the contextual information is preserved in jumps, without increasing the size of jump instructions. Even if prior instructions had metadata about splitting the next instruction for parsing (at yet more space wastage), this would be lost across jumps. That would worsen jump latency, which is a common bottleneck.
The above point is way more important if you expect a flow instruction almost every bundle, due to the massive ILP the Mill has.
> With instruction sizes of 1-4 bytes, you could store instruction sizes in 2 bits per instruction.
However, the Mill does not constrain its instruction sizes to multiples of 8 bits. This means either you'd lose that bonus or you'd need even more bits.
The Mill also splits which instructions are allowed in each half. Not only does this give you a free implicit bit, it again halves the amount of work for each decoder. This could probably happen with interleaved halves like you describe, but not without difficulty.
> For example, you could use 0x01 both for INC and ADD IMMEDIATE VALUE, where it means the former if the instruction length is 1, the latter if it is longer, with the size of the immediate determined by the size of the instruction. Would that be a net win?
The answer is no. The instruction codes are generated to be space-optimal. I'm not sure how this is done, but it has been stated. They also leave it open for metadata like you mention to be included in each instruction stream, and in fact their blocks do have headers.
> creating those two streams that somehow must work together to do what a single statement sequence in the source code does
Given the instructions are disjoint and they only need synchronization at jumps, I would be very surprised if this was problematic.
"Given the instructions are disjoint and they only need synchronization at jumps"
But they are (AFAIK) executing a single thread of execution, so creating those disjoint streams in general will be far from easy, won't it? If it were possible to do that, we would see a 2x speed-up of single-threaded code on CPUs with two cores.
One great thing about the Mill is that instructions within a bundle are mostly disjoint. The Mill has no data hazards and each instruction takes from the belt, which is not changed until the bundle is finished. There are "gangs" that allow combining multiple instructions, but I don't know how they are handled.
In general, this means a Mill has less instruction interdependence than an x86. Although an x86 could have a split-stream instruction cache, data dependence prevents a traditional architecture from parallelizing itself as much as the Mill, and hazards prevent the kind of speculation the Mill can do. Since instruction decode isn't a big bottleneck on fixed-width architectures, there's no real pressure to innovate there.
I haven't read all documentation on the Mill, but I would expect that that sharing part means that both streams can read or one can write each register.
The Mill is a Harvard architecture.
Each Mill core has two physically separate instruction caches that each feed a dedicated decoder, which decodes for a distinct set of pipelines.
I apologize for my mistake; I hope that it hasn't caused too much confusion (I take it that 'distinct set of pipelines' means distinct operations and 'belt', so the registers aren't shared).
The pipelines are distinct, but the set of ops that any particular pipeline supports is arbitrary (as in picked by the HW guys who have to fit the FUs in) and the belt is shared. The belt is where the two sides join.
We tend to get good entropy putting the flow on one side and exec on the other, and that's nice for humans to model too, but there is no technical requirement.
The thing about the Mill is it's easily customisable. If you have some new FU (let's say you want a particular op that assists in, say, encryption) you can choose where you put it, how many pipelines you put it in, etc.
But what do they know?
Mill has...videos. And patent applications. A strong social media presence. And a simulator that they really, really promise you'll be able to use real soon now and is going to totally own everything. And they're going to fab the chips themselves, long after everyone but Intel figured out that was stupid. Or something.
Because that sort of "we have the off-the-wall architecture that will change fucking everything in computing forever" attitude worked so well for the iAPX432, Transputer, Rekursiv, i860, everything Chuck Moore has ever done, the TI9900, most (not all) things called VLIW, etc. Right?
Best of all? A completely uncritical, fawning audience on HN.
Yup, you have good ideas. How about you ship? Then we can see if the ideas meet reality and work.
I'm friends with one of the guys who did the P6, which became the basis for pretty much every x86 chip since. He shipped stuff.
Ideas are easy, execution is hard, I have yet to see the Mill people execute.
I'd be crazy happy if they stepped up and made me look stupid because they shipped a kick ass chip.
It's a major architectural change as well. Most existing systems rely on virtual memory, which the Mill doesn't support.
I believe you're mistaken. The Mill does support virtual memory; the main differences to most systems are that its caches use virtual addresses (the TLB is between the L2 cache and the DRAM), and that it's a single-address-space architecture.
In particular the Mill does not have the same notion of a process, the CPU itself is aware of threading, it has its own security design that isn't anything like user/supervisor mode, there are no syscalls, and programs have to be specialised before they can be run, a la ART on Android.
I suspect once they have LLVM up and running, one of the next biggest wins they'll want to go for is porting HotSpot. The only other company I'm aware of that was able to actually make money selling a really crazy custom CPU architecture is Azul with their Vega arch, and they did it by making Java bytecode their exposed bytecode. The actual CPU opcodes were an implementation detail and everything compiled on the fly.
The nice thing about porting the JVM is tons of business-critical, performance sensitive code would Just Work™. And then you have a ready made market for companies with huge Java (e.g. Hadoop) jobs that are very power/performance tradeoff sensitive.
Azul made their money from selling high-performance JVMs. They have their "VEGA" compute appliances for Java applications, but I've never actually seen the ISA documentation for the hardware, which would be a very interesting thing to see...
Further, the Mill has no false aliasing, has consistent caches, is always sequentially consistent, and has no aliasing races. Some of this is covered in http://millcomputing.com/wiki/Memory.
The more criticisms I hear of this architecture, the more impressed I get.
Reading the Memory page on the wiki, it is indeed SAS with an MPU. Obvious difference from Blackfin is that you have 2^28 times as many addresses available.
It would also be interesting to see how much smaller a PLB entry is compared to a typical TLB. TLB misses are responsible for a surprising amount of performance loss on some workloads as L2 caches have gotten so much larger, so if they can have e.g. an order of magnitude more entries in the PLB it would be a big performance win.
[edit3] It just occurred to me that your PLBs could be as slow as your L1 cache with little impact on performance. This would also let them be physically larger than your TLBs.
Assuming the ISA is designed for efficient PIC/PID a unix-like shouldn't be too terrible to port from that difference alone. I know there have been SAS unixes before.
I have run into the 46-bit userspace limit on linux AMD64 (think mmap), so I'm somewhat concerned 64-bits may not be enough in the future as disks get bigger, though that's a long way off as you have 2^14 times that room under this scheme.
[edit2] It's a 60-bit SAS, not a 64-bit SAS
Afraid this answer is quite late, but hopefully it finds the curious passing through anyway :)
The really big gains from splitting protection from translation are twofold:
First, you can do the protection lookup in parallel to the cache lookup, rather than before it. This allows the PLB to be much bigger than a conventional TLB could ever be.
Second, the PLB entries do not need to be fixed power-of-two sizes. Each PLB entry has a start and stop offset so protection can be as granular as protecting one byte. In general, though, a single PLB entry will be quite large, covering the whole mapped region.
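A rough model of that difference (the field names and rights sets below are assumptions, purely illustrative): a PLB entry carries explicit start/stop offsets rather than a page number, so one entry can cover anything from a whole mapped region down to a single byte.

```python
# Toy PLB entry: arbitrary [start, stop) byte range plus a rights set,
# versus a TLB entry's fixed power-of-two page.

class PlbEntry:
    def __init__(self, start, stop, rights):
        self.start, self.stop, self.rights = start, stop, rights

    def permits(self, addr, right):
        """Protection check: in range and right granted."""
        return self.start <= addr < self.stop and right in self.rights

# One entry protects a whole 10 MB mapped region ...
region = PlbEntry(0x4000_0000, 0x4000_0000 + 10 * 2**20, {"r", "x"})
print(region.permits(0x4050_0000, "r"))   # True
print(region.permits(0x4050_0000, "w"))   # False
# ... or a single byte.
byte = PlbEntry(0x5000_0000, 0x5000_0001, {"r"})
print(byte.permits(0x5000_0000, "r"))     # True
```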
The Mill still has a TLB, but it sits between the chip and the off-chip memory. This allows it to be larger than a conventional TLB, as its taking slightly longer will not have the same impact. Additionally, as a small implementation optimisation, the TLB on the bigger Mill family members will do a lot of things like allocating-from-a-pool dynamically, so you only trap to the OS when things get dire.
The Mill address has a special bit to signify if it is a global address, or whether it is "local" to the process (using a special register which is XORed into the address). This allows the Mill to fork() and all the other things that traditionally have been hard on a SASOS.
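A toy model of that local-bit scheme (the bit position and the offset values below are assumptions; the talks give the real details): one address bit marks "global", and local addresses are XORed with a per-process offset register, so two forked processes can use the same local pointers without aliasing each other in the single address space.

```python
# Sketch of local/global addressing on a single-address-space machine.

GLOBAL_BIT = 1 << 59  # assuming the 60-bit SAS mentioned elsewhere

def resolve(addr, local_offset):
    """Translate a program-visible address to its SAS address."""
    if addr & GLOBAL_BIT:
        return addr                 # global: shared between processes
    return addr ^ local_offset      # local: private per process

parent_off = 0x1_0000_0000
child_off  = 0x2_0000_0000          # OS picks a fresh offset at fork()

p = resolve(0x1234, parent_off)
c = resolve(0x1234, child_off)
print(p != c)                                   # True: no aliasing
print(resolve(GLOBAL_BIT | 0x99, parent_off) ==
      resolve(GLOBAL_BIT | 0x99, child_off))    # True: shared
```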
We are porting Linux, and it will be Linux. Only a very very small part of the HAL will need to know anything Mill-specific :)
This seems orthogonal to the choice of PLB vs. TLB. You could make a TLB with a start and stop offset if you really wanted to (though I suppose the increased size per entry would be problematic, with TLBs needing to be faster than PLBs, which only need to be as fast as an L1 cache hit).
> The Mill address has a special bit to signify if it is a global address, or whether it is "local" to the process (using a special register which is XORed into the address). This allows the Mill to fork() and all the other things that traditionally have been hard on a SASOS.
Would shared-memory (in the unix sense) be considered global or local? Would read-only text sections be marked global to improve fork performance? I assume CoW is out-of-the-question for non-read-only sections and fork()? If there's a wiki page relating to fork, just point me to it :)
> We are porting Linux, and it will be Linux. Only a very very small part of the HAL will need to know anything Mill-specific :)
The feasibility of this seems obvious to me, as I've seen Linux run on other SAS platforms that are far more constrained than the Mill.
Re non-paged TLB:
The Mill has non-paged protection in the PLB and paged translation in the TLB.
It is true that a 'conventional' machine could have non-paged translation, but non-paged translation would be slow. Really slow. I shudder to imagine how it would work internally :)
Conventional machines are multi-address-space (MAS). In a MAS, addresses can alias one another. Not so on the Mill (at least, if you deliberately alias in the Mill TLB, you'll have to know that reads and writes will merge as they go off-chip, etc.; tricky, but it can be useful in select low-level scenarios).
On the Mill, everything that is shared between processes is global. On a Unix port, everything that is local to a process is 'local'. So MAP_SHARED gets addresses without the local bit set, and MAP_PRIVATE gets addresses with it set.
When we fork(), we have some choices. The OS picks a new local offset for the new process, meaning that local pointers don't alias one another.
We could straight away clone the protection entries and page tables.
Or, as an implementation detail, depending on the extent to which the PLB self-manages on the particular Mill member, just let the PLB trap. As the new forked process accesses memory ranges for the first time the PLB will miss and there will be the classic trap. In the trap, the handler can set up COW lazily and for the memory regions that are actually touched by the child.
I am simplifying it a bit ;)
Most fork() is just going to exec(), and the latter will win. Some people are still using fork() in other ways, and they may benefit from an aggressive COW up front on fork(). Swings and roundabouts.
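For the curious, here's a minimal sketch (an interpretation, heavily simplified, with all names invented) of the lazy-COW-on-PLB-miss option described above: the child starts with no protection entries, and each first touch traps and copies only the region actually accessed.

```python
# Toy model of fork() with lazy copy-on-write driven by PLB misses.

class Process:
    def __init__(self, memory):
        self.memory = memory        # region name -> bytes
        self.plb = set()            # regions this process may access

def plb_miss(child, parent, region):
    """Trap handler: first touch, so copy the region from the parent
    and grant access. Only touched regions ever get copied."""
    child.memory[region] = bytes(parent.memory[region])
    child.plb.add(region)

def read(proc, parent, region):
    if region not in proc.plb:
        plb_miss(proc, parent, region)   # the classic lazy trap
    return proc.memory[region]

parent = Process({"heap": b"hello", "stack": b"xyz"})
parent.plb.update(parent.memory)
child = Process({})                      # forked: nothing copied yet
print(read(child, parent, "heap"))       # copied on first touch
print("stack" in child.memory)           # False: never touched, never copied
```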
The term "virtual memory" refers to a dynamic mapping between physical and virtual addresses. SAS doesn't preclude virtual memory; you can still have many of the advantages of virtual memory without multiple address spaces (for example, memory mapped files).
EDIT: Updated with complete video link
Edit: I've changed the URL to the truncated one temporarily, because even a truncated most-of is probably better than a large file download. http://llvm.org/devmtg/2015-04/Videos/SD/day%202/Ivan_1.mp4 is a URL for the whole thing.
They aren't? Really?