Microsoft ports Windows 10, Linux to homegrown “E2” CPU design (theregister.co.uk)
281 points by cpeterso 8 months ago | 73 comments

Just in case you missed it, there's an update at the bottom to the effect that work on E2 has wound down and there aren't any plans to turn it into a product:

> After publication, a spokeswoman for Microsoft got back to us with some extra details. "E2 is currently a research project, and there are currently no plans to productize it," she said.

> "E2 has been a research project where we did a bunch of engineering to understand whether this type of architecture could actually run a real stack, and we have wound down the Qualcomm partnership since the research questions have been answered."

> As for the missing webpage, she added: "Given much of the research work has wound down, we decided to take down the web page to minimize assumptions that this research would be in conflict with our existing silicon partners.

> "We expect to be able to incorporate learnings from the work into our ongoing research."

They don’t want to scare the piss out of their current silicon partners. Who knows what will come of this, but I’m doubtful it’s nothing.

A sort of blackmail leverage is what comes out of this. You can say to Intel: if you don't give us a nice enough discount on the current order of 100,000 Xeons for Azure, we will go back to these plans and eventually your revenue stream will dry up.

While that may be true for something that's more production ready (like an AMD or Qualcomm chip), this project seems unlikely to affect negotiations.

However, it's likely to cause panic at Intel over the long term. It may prompt Intel to move into Microsoft's territory, or partner more closely with competing OS vendors (Chrome OS?)

I like that this architecture tackles a large problem in existing architectures, which Mike Pall put rather nicely:

> All modern and advanced compilers convert source code through various stages and representation into an internal data-flow representation, usually a variant of SSA. The compiler backend converts that back to an imperative representation, i.e. machine code. That entails many complicated transforms e.g. register allocation, instruction selection, instruction scheduling and so on. Lots of heuristics are used to tame their NP-complete nature. That implies missing some optimization opportunities, of course.

> OTOH a modern CPU uses super-scalar and out-of-order execution. So the first thing it has to do, is to perform data-flow analysis on the machine code to turn that back into an (implicit) data-flow representation! Otherwise the CPU cannot analyze the dependencies between instructions.

> Sounds wasteful? Oh, yes, it is. Mainly due to the impedance loss between the various stages, representations and abstractions.
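Pall's point can be made concrete with a toy sketch (invented three-address format, not any real ISA): an out-of-order front end must rebuild the data-flow graph that the compiler already had and threw away, by tracking which earlier instruction last wrote each source register.

```python
# Toy illustration of the redundant data-flow analysis an OoO core performs.
# Each instruction is (dest, op, sources); instruction i depends on the most
# recent earlier writer of each of its source registers (a RAW dependency).

def build_dataflow(instrs):
    """Return {index: set of indices it depends on} for a linear sequence."""
    last_writer = {}  # register -> index of its most recent write
    deps = {}
    for i, (dest, _op, srcs) in enumerate(instrs):
        deps[i] = {last_writer[r] for r in srcs if r in last_writer}
        last_writer[dest] = i
    return deps

# r2 = r0 + r1; r3 = r0 * r2; r4 = r3 - r1
program = [
    ("r2", "add", ("r0", "r1")),
    ("r3", "mul", ("r0", "r2")),
    ("r4", "sub", ("r3", "r1")),
]

print(build_dataflow(program))  # {0: set(), 1: {0}, 2: {1}}
```

The compiler's SSA form already encoded exactly this graph; the hardware recomputes it, every time the code runs.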


The obvious answer is to compile directly to the instructions the CPU uses internally. That would mean recompiling for every CPU revision though.

That approach has (sort of) been tried before for superscalar architectures. VLIW architectures were the Next Big Thing in the early 1990s. The general idea is that the machine code explicitly told the CPU what to do with each of its execution units. It seems like a good idea. Intel released their i860 processor and waited for the cash to roll in.

The trouble was that, because the machine code is specific to the internal structure of the CPU, every time they released a new major revision of the CPU the existing executables all had to be recompiled. All of their customers needed an entirely new OS and new third-party software, and they had to recompile all their own code - everything. This proved too much of a burden for many, and the popularity of the architecture suffered.

The other problem with these kinds of architectures is that they're quite inefficient at encoding code which lacks inherent parallelism. The instructions are long, and most of their slots have to be NOPs whenever most of the execution units are idle - which is a lot of the time. This code bloat in turn wastes memory bandwidth and instruction cache space, making them overall less efficient than architectures with more compact instructions.
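The NOP-bloat effect is easy to sketch. Here's a toy greedy bundle packer for a hypothetical 4-issue VLIW (the width and packing rule are invented for illustration): a dependency inside the current bundle forces a new bundle, and every unfilled slot would be encoded as a NOP.

```python
BUNDLE_WIDTH = 4  # hypothetical 4-issue machine

def pack(deps):
    """Greedy bundle packing. deps[i] is the set of earlier instructions
    that instruction i depends on; a dependency within the bundle being
    filled forces a new bundle. Leftover slots become NOPs."""
    bundles, current = [], []
    for i, d in enumerate(deps):
        if len(current) == BUNDLE_WIDTH or d & set(current):
            bundles.append(current)
            current = []
        current.append(i)
    bundles.append(current)
    return bundles

serial   = [set(), {0}, {1}, {2}]       # a dependent chain: no ILP
parallel = [set(), set(), set(), set()]  # four independent instructions

for deps in (serial, parallel):
    b = pack(deps)
    nops = len(b) * BUNDLE_WIDTH - len(deps)
    print(len(b), "bundles,", nops, "NOP slots")
```

For the serial chain, 12 of 16 slots are NOPs; the parallel case fits in one full bundle with none. Low-ILP code pays the full width of the machine in encoding space.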

Those weren't the only problems with VLIW (and more specifically Itanium-type VLIW).

Recompiling and increased code size would've been fine if those were the only concerns.

The big problem is that so far, on general-purpose workloads (the type with heavy control flow and gnarly memory access), VLIW was never able to extract enough ILP to match OoO in performance.

That was in the 90s though. We haven't had much "design" in CPUs for five years. We are stuck at 4-5 GHz max, and IPC seems to have hit its limit. And now we ran into security issues with too much clever hardware optimisation.

And it seems recompiling is relatively easy for servers, where everything is in a controlled environment?

> And now we ran into security issues with too much clever hardware optimisation.

I contest "now". It seems we ran into security issues at latest with Sandy Bridge (lazy FP) but many even earlier.

There have been some remarkable enhancements toward cache coherency and feeding data to large core counts while keeping latency from completely going crazy.

Kind of interesting to think of the CPU as just another peripheral.

Yeah, it was "obvious" to Intel too when they made Itanium. And the rest is history.

> the "Itanium" approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write

- Donald Knuth, source: http://www.informit.com/articles/article.aspx?p=1193856

No, because the CPU has information the compiler doesn't. The compiler throws away useful information while it compiles, yes, but the CPU gets new information at runtime which the compiler doesn't have.

Wasn't that the idea behind Transmeta - to JIT x86 operations in response to runtime feedback?

Nvidia's Project Denver also attempted something similar. It shipped in the form of the HTC Nexus 9, but it wasn't any faster than the competition in benchmarks (where you'd expect a JIT to shine). In the real world it's even worse because JITs are not good for responsiveness, particularly JITs that are not scheduled by the kernel and are unable to be scheduled behind important work. The CPU just randomly stopped running code to run a JIT instead.

But I was more just talking about things like branch predictors. You can be informed by runtime data without being a full on JIT, which is the world very nearly every modern CPU lives in these days.
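The classic example of that runtime feedback is a 2-bit saturating-counter branch predictor; this is a toy model of the idea, not any particular CPU's implementation. The compiler can't know this branch history ahead of time; the hardware learns it while the program runs.

```python
# A 2-bit saturating counter: two mispredictions in a row are needed to
# flip the prediction, so an occasional anomaly doesn't retrain it.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 = predict not-taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
history = [True] * 8 + [False] + [True] * 8  # mostly-taken loop branch
for taken in history:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "/", len(history))  # 14 / 17
```

After warming up, the single not-taken iteration costs one miss but doesn't flip the prediction, which is exactly the hysteresis the two bits buy.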

It was the idea behind HP Dynamo.

There's a good article here spoiled by the bitrotted formatting: http://archive.arstechnica.com/reviews/1q00/dynamo/dynamo-1.... It looks better if you use Firefox Reader view.

I have a feeling where they were going with that was to push the internal data-flow representation to the CPU and have it figure out the register allocation, instruction selection and scheduling etc. right before execution.

Itanium 2.0?

It's not exactly clear what you mean by that comment, but this is almost the polar opposite of the Itanium approach, which was to push the low-level execution unit scheduling right up the stack to the compiler.

No, it wouldn't mean recompiling for every CPU revision.

Today's CPUs contain a kind of JIT compiler implemented in hardware. There is no reason why this JIT compiler couldn't be implemented in software.

A lot of business software already runs in some kind of VM with a JIT compiler. These JIT compilers could just as easily emit the hardware-dependent low-level code instead of e.g. x64 assembly. So a new hardware revision would only require an update to the runtime's JIT compiler.

I think IBM has been doing something similar for decades now with their mainframes. "Native" executables are "only" some bytecode, so you don't ever need to recompile your applications when you upgrade the hardware.
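The JIT-backend idea above can be sketched in a few lines. Everything here is invented for illustration (the revision names, the toy bytecode, both "backends"): the point is only that when the runtime is the sole producer of machine code, a new CPU revision means swapping one backend, not recompiling the world.

```python
# Hypothetical two hardware revisions of the same machine: rev1 wants
# stack-style micro-ops, rev2 has a fused three-operand instruction.

def backend_rev1(op, a, b):
    return [("load", a), ("load", b), (op,), ("store",)]

def backend_rev2(op, a, b):
    return [(op, a, b)]

BACKENDS = {"rev1": backend_rev1, "rev2": backend_rev2}

def jit(bytecode, cpu_revision):
    """Lower portable bytecode to 'machine code' for one revision."""
    out = []
    for op, a, b in bytecode:
        out.extend(BACKENDS[cpu_revision](op, a, b))
    return out

print(jit([("add", "x", "y")], "rev1"))  # 4 micro-ops
print(jit([("add", "x", "y")], "rev2"))  # 1 fused op
```

The application ships only the bytecode; which backend runs is decided on the machine it lands on.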

That introduces portability and CPU-update issues. One of the reasons for high-level languages and the trend toward virtual machines/runtimes (JVM, .NET) is to decouple the hardware and the software, increasing portability, stability, and of course productivity. It's a tremendous logical gain in exchange for some physical performance loss. The trend has been clear for the past few decades. I don't see the pendulum swinging in the other direction any time soon.

So Microsoft is highlighting the value of software freedom and the harm of software non-freedom: we're all granted permission to port free software (as Microsoft claims to have done with the Linux kernel, BusyBox, and more) but only Microsoft -- the proprietor -- is allowed to port Windows.

So long as we value software freedom we can take steps to defend software freedom and continue to see practical gains. But proprietary software leads us to a dead-end waiting for a willing proprietor to do something for us (and thus providing a result we cannot trust).

I see this proprietary/free duality as a fundamental weakness of copyright law and intellectual-property-based business models. As things currently stand, software that provides the most value to society also often happens to make the least money (and vice versa). To truly prosper, I wonder if society will somehow have to move beyond copyrights, patents, and intellectual property rights as they currently exist.

I’m not sure what the solution is, but a recent HN post caught my attention — it was about how quickly Germany caught up to England’s industrialization in the 19th century when books weren’t copyrightable. This may provide a clue. [1]

[1] https://news.ycombinator.com/item?id=17329345

Microsoft is pretty open compared to its competitors. It has a long history of providing Windows source code to universities (I recall Win2000). It also provides debug symbols for its libraries.

Try to get source code for Apple, Amazon or Google products...

Like Darwin source code?

Granted they don't make it very easy to run, but it can be done with a HELL of a lot of effort.

Here's the Wikipedia page on EDGE:


This seems to be a fairly close descendant of the older dataflow designs that also inspired out of order computing. The problem with dataflow processors, as I understand it, was that the fact that they weren't pretending to execute instructions in order meant that they didn't provide for the precise exceptions you need for memory protection, multiplexing across threads, etc. Is there a standard EDGE solution to that?

What is interesting is that the new breed of timing attacks shows that the pretense of ordering isn't as tight as we had imagined. So maybe we might once again prefer processors like this, since the software is going to have to deal with that complexity anyway.

Not that I know of; one of THE problems with TRIPS/EDGE was the imprecise exceptions.

How similar is this architecture to the Mill? https://en.wikipedia.org/wiki/Mill_architecture

It's unrelated except to the extent that both are communicating the dependency chain to the processor through the ISA in a way other than sequential order.

Compiler hints for data dependencies between groups of instructions sounds very much like the mistake that sank the Itanium. What is different this time around?

In VLIW like Itanium, the compiler had to determine not only instruction dependencies but it had to fully schedule when each instruction would execute, even though critical information like cache access time cannot be known by the compiler. In EDGE architectures the processor still does dynamic "out of order" instruction scheduling which gives much better performance.
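A toy model of that split (invented numbers, unlimited issue width for simplicity): the compiler supplies only the dependency graph, and the "hardware" below fires any instruction whose producers have finished, so one slow load delays only its consumers rather than stalling a statically scheduled bundle.

```python
# Dataflow-firing sketch: instructions execute as soon as their inputs
# are ready, which is the dynamic scheduling an EDGE core does inside
# a block (and a VLIW compiler would have had to guess at build time).

def schedule(deps, latency):
    """deps[i]: set of producers of instruction i; latency[i]: cycles.
    Returns completion cycle per instruction, assuming unlimited issue."""
    done = {}
    while len(done) < len(deps):
        for i in range(len(deps)):
            if i not in done and deps[i] <= set(done):
                start = max((done[d] for d in deps[i]), default=0)
                done[i] = start + latency[i]
    return done

deps    = [set(), set(), {0}, {1, 2}]
latency = [10, 1, 1, 1]          # instruction 0 is a slow load
print(schedule(deps, latency))   # {0: 10, 1: 1, 2: 11, 3: 12}
```

Instruction 1 finishes at cycle 1 without waiting on the load, whereas a static schedule built around a guessed load latency would either stall or mis-speculate.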

The iceberg that sunk the Itanic was the HP-Oracle-Intel deal.

HP was the only one who could sell them and Oracle put restrictions on how they could be used.

When you have an ultra-niche product that also happens to require a lot of groundwork, and which is utterly incompatible with everything else on the market, you aren't likely to succeed.

If anything given the handicap that Itanium was playing with it can be considered an astonishing success.

Huh? Itaniums weren't available just from HP. At some point, basically everyone sold them, although HP is probably the only one left still selling systems based on it (Integrity/Superdome).

Outside of “super-computers”, all the other OEMs were limited to a low CPU/core count, which made their offerings essentially moot.

That wasn't a mistake, and Itanium did just fine. Circa 2008 the fastest database servers in the world ran on Itanium.

Intel was just unlucky that AMD exists and came out with AMD64.

I believe if that wasn't the case, we would be using Itanium laptops nowadays.

I don't think they would ever have made Itanium mobile; instead we'd be stuck with that PAE nonsense, living in a 32-bit world, because 'general users don't need 64 bits' ...

I think Intel wanted to eventually phase out x86 and push Itanium all the way down the stack, not for technical reasons but because it was a proprietary architecture that would never have competing implementations.

You are thinking of Intel as having a single mind. The reality, as with most large organizations, is that the ant hill is teeming with divergent and competing opinions. In particular, there were two powerful, competing camps: x86 and Itanium. The combined effect of Itanium performance not living up to promises [1] and AMD introducing a 64-bit x86 extension shifted the power balance, but it wasn't an inevitable outcome.

[1] The first iteration (I have one in the garage) was ~ 486 level fast and, sure, the final iterations were fast, but that took an AWFUL lot of silicon. The perf/transistor is terrible, even worse than x86 (which is bad).

Maybe, but the Itanium came out at a time when laptops were really becoming important, and there was never an Itanium laptop. I look at the Itanium like I look at the G5 (970): it doomed itself by having no laptop possibilities.

It mostly doomed itself by being a backward incompatible mess. You couldn't run your existing software on it, it was expensive, and slow; being in a laptop would not have mitigated any of these.

Sure, you couldn't run your i386 machine code on it or at least not at competitive speed. But that is true for basically everything that isn't an x86.

That you can't run object code for one ISA on a totally different architecture is kind of the default state and you solve it by recompiling your code. It's why we have portable programming languages. No need to re-write your software except for some assembly routines in the OS and support libraries.

The Wintel dominance really fucked up the industry, badly.

Some of it is a self-inflicted wound by chip makers. Skipping the Alpha, since it was doomed by HP, it doesn't seem like many chip makers know how to put out a motherboard that can be bought by developers. MIPS theoretically had the Magnum boards, but even those were not generally available. Even today, getting a PC board for POWER or ARM is an adventure for your wallet.

Remembering the history, they had pretty much everyone signed up (including Sun, for some reason) to convert their software. They were expensive, and like most manufacturers they didn't have a good on-ramp for hobbyists and early developers. I think if they had noticed the trend toward laptops, the rest would have followed.

Sun was in their blunder years at the time, also releasing x86 Solaris and open-sourcing it, so it's not surprising they signed up (I trust you when you say they did, I don't remember). HP signed up as part of the deal of killing their own processor division. Digital Alpha was somewhere between dying and dead/purchased by Intel. MIPS wasn't faring very well already, especially with SGI committing slow, long-winded suicide by moving to Intel.

Obviously everyone is entitled to their opinion, but I don't find your argument convincing. ARM had a much better story, even with early portables/laptops running Windows CE, and they were a dismal failure -- just like the later Surface RT.

ARM had a power/battery-life advantage for laptops and support from Microsoft, and that didn't help; switching to Itanium would not even have had that. If Itanium had found any general market success, it would have only been because of Intel's dominance. Luckily for us, that didn't happen.

Sun: https://www.linuxtoday.com/developer/1999090200706PS

ARM has just as crappy a story, just from the opposite direction: they had the power consumption but not the performance or a PC presence, and they didn't duplicate their early PC support in the US. Palm and Windows CE handhelds did ok but got clobbered by the iPhone. ARM never had a big launch with multiple PC OS vendors. ARM also didn't have motherboards available in a PC configuration for a lot of its lifetime. That's what made the Raspberry Pi so special: it was an actual motherboard available to people, not some dev/eval board. Now we see ARM showing up in the server and PC market because the performance is there, smartphones have paid for the research, and ARM is customizable / multi-vendor.

If we had been lucky, IBM would have chosen the 68000.

> Palm and Windows CE handhelds did ok but got clobbered by the iPhone.

Palm and Windows CE were there since 1998, maybe even before; the iPhone came out in 2007. There were multiple ARM PDAs at the time (Compaq's iPaq, HP's Jornada, Palm's Pilot, many Windows CE devices).

> If we had been lucky, IBM would have chosen the 68000.

Totally agree. The 68000 ISA is so elegant compared to the mess that is x86.

The original Palms were Dragonball machines (68000) and CE actually ran pretty well on MIPS.

Palm and CE are 1996. Like I said, they did ok, but never sold in the volume that the iPhone and later Android achieved.

I dearly miss the 68000 and sometimes seriously wish the whole PowerPC thing hadn't happened, in favor of a continued 68000 line. If x86 proved anything, it's that CISC vs. RISC wasn't exactly clear cut. The 88000 wasn't that bad, except for a price that makes me think they were doing a bit of hallucinogens. I never got to play with ColdFire.

Forgot about the Dragonball times ... Thanks.

Indeed, and it would be interesting to see how the most recent Itanium would perform if it were ported to the current 14nm process instead of 32nm. While it was once considered a monster chip in transistor count, it no longer is.

Didn't Itanium suffer from the same issues as the Netburst architecture?

No, the issues were completely different. Itanium was pretty low-frequency.

Static scheduling plays badly with variable memory latencies, and Itanium's speculative memory features just didn't work well enough. But its instruction encoding was very inefficient and hurt a lot on workloads sensitive to i-cache pressure. If I understand EDGE correctly, they don't try to statically schedule with respect to loads, so this isn't a problem; they're just explicit about where the parallelism is.

One thing that has changed: we routinely see hardware outliving software these days (e.g. Android phones that would still be perfectly adequate if software updates were available), this was hardly imaginable in the days of the "Itanic". Back then new hardware was typically bought to run old software faster.

Also, JIT runtimes: once you have a hardware-specific build of your browser (or of your server stack), you are 90% there.

I haven’t read the relevant papers, but I imagine they were targeting something that presented more like a fine-grained MIMD and less like VLIW. That is, many tiny cores with independent execution pipelines, not a few large cores with very wide paths of functional units.

It's right there in the announcement, all the compiler support is already there this time around.

> E2 uses an instruction set architecture known as explicit data graph execution, aka EDGE which isn't to be confused with Microsoft's Edge browser.

I chuckled at this one.

Let us give thanks that Ballmer is no longer around. He woulda made sure it was named Windows.


People interested in VLIW (or EDGE, which sounds like VLIW but less strict) should also check out the Mill architecture.

The designers claim a 10x speedup compared to existing processors. Their presentations are online on YouTube; especially the security-related ones are amazing, since they throw away a lot of the current design in x86/ARM CPUs.

They actually claim a 10x power/performance advantage: what I gather is that it's supposed to be higher performance at much lower power. It isn't out-of-order, so it has to wait a long time on last-level cache misses. But it has insane throughput when it doesn't have to wait for memory.

Most likely they'll mitigate the waiting problem by hyperthreading. GPUs do extreme hyperthreading to stay busy without any kind of out-of-order execution.
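A back-of-envelope sketch of that latency-hiding trick (all numbers invented): if each thread alternates a few cycles of compute with a long memory stall, a core that can switch among enough threads stays fully busy with no out-of-order machinery at all.

```python
def utilization(threads, compute_cycles=4, miss_latency=100):
    """Fraction of cycles a core does useful work if every thread repeats
    compute_cycles of work followed by one miss of miss_latency cycles,
    and the core round-robins to another ready thread on each stall."""
    period = compute_cycles + miss_latency
    return min(1.0, threads * compute_cycles / period)

print(utilization(1))    # one thread: ~4% busy, mostly waiting on memory
print(utilization(26))   # 26 threads: 1.0, latency fully hidden
```

This is exactly the GPU regime: enough resident threads that the miss latency of any one of them is covered by the others' compute.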

To be fair, they've addressed memory access issues, and have proposed "deferred loads" (works well: http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf) and other thus far unnamed techniques for mitigating it.

I’m not a hardware expert, so I’m not sure how interesting this is here, but along the lines of MS architectures and FOSS operating systems, see also their work with eMIPS and NetBSD [0].

[0] https://blog.netbsd.org/tnf/entry/support_for_microsoft_emip...

On a related note, another interesting CPU design - http://multiclet.com/index.php/en/support/general-technical-...

That's really good news. We need more progress in CPU design.

This is written in a baby-talk style but scroll to the end, there are links to research publications.

This system seems to be the same as VLIW: https://en.wikipedia.org/wiki/Very_long_instruction_word

VLIW compilers construct a "bundle" of instructions which are to be executed simultaneously, and thus must be completely independent of each other. The VLIW compiler must further space "bundles" apart such that any use-bubbles are respected (i.e., a load won't be ready for 3 cycles, etc.).

EDGE compilers construct "blocks", each made up of many instructions which are tightly connected, and send the block to an execution cluster which will dynamically schedule each instruction within the block as it sees fit.

Communication of dependencies across blocks is more expensive. Hopefully there is enough parallelism across blocks that you can execute multiple blocks simultaneously, in addition to ILP within the block itself.

Like VLIW there's a notion of explicit parallelism but unlike VLIW there isn't a strict prescribed order of execution.

Sounds a lot like...the Itanic

One big problem with the Itanic was that the compilers weren't (yet) smart enough to produce good code for it. That does not seem to be as much of a problem here, as this seems pretty close to how compilers keep their intermediate representation anyway.
