Coming Soon: AWS Graviton2 Processor for AWS (amazon.com)
231 points by ChuckMcM 5 months ago | hide | past | web | favorite | 184 comments

If Intel makes it out of the current soup with its dominance intact, it will make for an interesting case study. I think they are facing a perfect storm on multiple fronts.

a. The 10nm process fiasco

b. Missing the chiplet concept

c. ISA fragmentation - AVX512, which was supposed to be the next big thing, was server-side only until recently, and it downclocks the entire chip when used, making it extremely hard to reason about whether it makes sense for mixed workloads.

d. With a move to more distributed programming models the network is often in the critical path now, making cpu performance less relevant for many workloads

e. ARM eating their lunch in mobile

f. Nvidia eating their lunch with GPGPU

g. Big players are increasingly building accelerators for critical workloads that offer better power / performance

h. Strong execution from AMD for the first time in over a decade

i. GUIs moving to browsers and a lot of compute moving to runtime-based languages like Python/JS means that any new ISA features are initially hard to utilize in killer apps on the client side. So if Intel ships a killer new set of instructions, there is a significant lag before they do anything useful for the user.

This comment seems kinda slanted. AVX-512 debuted on Xeon because datacenter operators asked for it. It does not “downclock a whole chip”; it gates the core where it is active, and there’s not even that penalty on current-generation parts. “10nm” is marketing fluff which has little or nothing to do with actual semiconductor construction. “Chiplet” is also marketing-speak for “wow this memory topology is hard to program around”. Not sure they should feel too bad about missing that boat.

What Intel really should be worried about is the client side being their largest revenue segment. That’s a dead business, eventually. And the bets they made didn’t pan out: FPGAs aren’t popular because the people sophisticated enough to use them are also smart enough to tape out ASICs. IoT is not a thing.

Uhh, source on AVX512 not downclocking on modern CPUs? We benchmarked ML workloads on the newest chips the cloud had to offer, and the slowdown was a significant problem because, as the parent comment said, it is very hard to reason about whether the benefits of vectorized ops will outweigh the reduced clock speed. Sometimes they do and sometimes they don't - which is a major problem when you have to specify the instruction set when you build the ML library from source.

Maybe you know something I don't, but that FPGA statement makes zero sense to me. The ASIC development cycle is measured in years - that's why FPGAs are valuable (and I thought they were relatively heavily used).

It is a major problem to figure out what instructions to use but it's a lot more nuanced than you seem to imply. In the first place you seem to assume that "not running at the max turbo speed that's printed in the marketing literature" is equivalent to "downclocking". However, there are a huge number of reasons why a core might not clock up, including the number of active cores on the package. The first Xeons that shipped with AVX-512 had turbo clocks that were 25% lower than the headline turbo clocks, e.g. 2400MHz instead of 3200MHz. This is still pretty good, and the base clock is 2100MHz.

With the newest Ice Lake processors ("10th generation") the all-cores-active, all-avx-512 max clock speeds are the same as max scalar clock speeds. You can try this out yourself with the avx-turbo program.

No, when we used MKL, the workload was slower and turning off MKL made the workload faster. The marketing is irrelevant - using vectorized instructions slowed down the workload in practice which is all that really matters. The Intel teams we were working with explained it as being due to the slower clock speeds caused by vectorized instructions. I don't really know, but it seems fair to assume that they do.

It will be interesting to test Ice Lake when they make it to the cloud, hopefully some time late next year, but until we can actually use Ice Lake, Sky Lake is what AVX512 will be judged on.

It's a good thing you measured it :-) Programs that do a little bit of 512-bit FMA mixed in with other stuff will not benefit from AVX-512 but can suffer from the heat it generates, or from the hiccup when the CPU turns the FMA unit on and back off.

Codes that can do a lot of 512b FMA consecutively will benefit very greatly, and pay a small penalty (up to 25%) in terms of throughput for everything else.

Codes that use non-multiplier stuff that's just marketed as AVX-512, like VBMI2, also benefit greatly and without any penalty.

People with AMD CPUs don't get a choice. Hard to see how this accrues to Intel's mistakes column.

It's not really an Intel mistake, but it is an Intel problem. In ML, the ASICs are coming. NVIDIA is pretty much guaranteed to maintain a leadership position in this space because their software layers are dominant. Intel's ML leadership position is quite tenuous because the killer ML features don't work quite well enough for the premium. MKL should be a solid moat, similar to NVIDIA's CUDA and CUDNN, but if it requires serious effort to get the benefits, it becomes more palatable to spend that effort on ARM-based servers or custom hardware like Inferentia which are meaningfully cheaper. Maybe Ice Lake will fix this, but Intel is running out of time to convince people that Intel chips should remain the first choice in ML.

AMD isn't relevant in this space AFAIK.

Djb’s article linked above answers this in detail. In short, vectorized instructions don’t have to hit ram to do things like addition, but they can heat up the processor beyond its ability to cool to operating temperatures. The clock speed drop is necessary to avoid overheating.

For more reading, check out cpu pipelining, as well as how vectorized instructions actually work. The performance benefit for well implemented vectorized instructions overcomes the clock speed hit by leaps and bounds, which is why mobile systems make such heavy use of them, for example.

FPGAs are in a tough place. Like OP said, most people writing RTL make ASICs, or at least an ASIC that's programmable. The FPGA target market is getting slimmer, since we have programmable ASICs, like GPUs and TPUs, that are as performant with easier programming. They will still serve a niche market, but "write C++ and run it on an FPGA" will likely never take off.

I thought the main market for FPGAs was that period between "we have a problem that needs custom hardware" and "we have custom hardware being produced at the scale we need". I guess that's a relatively niche market?

There's also the market for high performance things that need to be in-service upgradeable. I understand mobile base stations are a significant customer for large FPGAs, to enable deployment of new standards revisions/modulation schemes without a truck roll.

Pretty much. The problem set they're useful for is low-latency high-throughput stuff, and/or connectivity to high speed digital signals, for things that there isn't an existing custom solution and where you don't care about area or power consumption. That's not a huge market.

We do use them at my employer, a multinational chip company - but only in very small numbers, like one $50k board gets shared around project groups who use it for a few weeks each. Most of the work is done in simulation.

Chiplets are more significant than you credit them for. They allow higher yields and make the production economics much more favorable for AMD, whereas Intel is throwing out a lot more silicon.

Yield might be part of it, but I'm sure Intel can ship partially functional chips with a core here/there disabled.

Another of the big advantages for AMD is that their products aren't reticle-limited. The basic design lets them have a single die they bolt into dozens of configurations that scale larger than what Intel can fit on a single die. Hence 64 "big" cores in a single socket.

There are likely other advantages too (cooling?) that partially make up for the longer more complex core->core latencies.

Do a simulation: chiplets extract multiples more revenue than binning off failed functional units does. Lots of functional blocks are NOT redundant, so a defect there means total loss of the part. At these small feature sizes and massive chip areas, yields are down. Chiplets avoid this.
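The yield argument can be sketched with the classic Poisson defect-density model. The defect density and die areas below are illustrative placeholders, not real AMD or Intel figures:

```python
import math

def die_yield(area_mm2, defects_per_mm2):
    """Poisson model: probability a die of the given area has zero defects."""
    return math.exp(-defects_per_mm2 * area_mm2)

D = 0.002  # illustrative defect density per mm^2, not a real process figure

# One large monolithic die vs. small chiplets covering the same total area.
monolithic = die_yield(800, D)
chiplet = die_yield(100, D)

# Fraction of wafer area that ends up in good dies: with chiplets you
# discard only the bad small die, not the whole big one.
print(f"good-silicon fraction, monolithic: {monolithic:.2f}")
print(f"good-silicon fraction, chiplets:   {chiplet:.2f}")
```

With these made-up numbers, about 20% of monolithic dies come out defect-free versus about 82% of the small chiplet dies, which is the gap binning alone cannot close when the defect lands on a non-redundant block.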

I don't have anywhere close to enough information to know what the actual yield numbers are for AMD's products vs Intel's (you can probably count on one hand the number of people who know such things). For sure it's much harder to make a perfect large die, which is part of why most of the 7nm parts are so small (or experiencing really low yields). But the situations are completely different. Intel is on a very mature process with a larger feature size, and so much of their large-die chips _ARE_ consumed by things that can be disabled (cache slices, cores, etc.) that the probability of a defect landing on some critical portion of the die and completely junking it is probably fairly low - otherwise we would be seeing a glut of lower-core-count parts, and Intel doesn't really seem to have a problem sourcing upper-mid-range Xeon parts.

Bottom line, I don't believe Intel's product-line prices in any way reflect the actual yield curves.

Chips with problematic cores are sold as lower end chips. For the same production cost, you are getting less revenue - failure rate plays a big role in profit margins.

Vs throwing the whole die away because you don't sell enough systems that small?

It's hard to tell, but Intel still has a strong markup on 24-core parts being sold from 28-core dies. Intel has often been "caught" down-selling parts to protect their higher-margin parts (AKA they are selling parts with working features disabled).

They weren't "caught" - binning is a common practice in the CPU industry. This isn't a problem.

I wasn't talking about binning, I was talking about when you have binned at a certain level, but the product is sold under its capability because you want to maintain the illusion of scarcity of the better parts.

AKA it's a perfect part, but it's being sold with a couple cores disabled or at a frequency below what it's capable of.

You won't even know if it's a perfect part. There is likely a microscopic defect on a part of the cpu they can turn off that disqualified it from being perfect or having a feature of the high end part.

That’s part of binning and is common across manufacturers. There used to be some Nvidia chips where you could reprogram the firmware and have a decent chance of getting a Quadro for a fraction of the cost.

Ah yes the anti-vexxer argument. A good overview on AVX-512 and the criticism can be found here: https://blog.cr.yp.to/20190430-vectorize.html

> AVX-512 debuted on Xeon because datacenter operators asked for it.

It debuted on workstation accelerator cards.

> It does not “downclock a whole chip”, it gates the core where it is active and there’s not even that penalty on the current generation parts.

It very much could thermally throttle more than the one core.

> “10nm” is marketing fluff which has little or nothing to do with actual semiconductor construction.

"10nm", even as a proper noun, is a very important component of Intel's woes right now. They aren't getting the yields they were expecting, a major competitor (TSMC) surpassed them for the first time ever, and that's how AMD is killing them right now.

> “Chiplet” is also marketing-speak for “wow this memory topology is hard to program around”. Not sure they should feel too bad about missing that boat.

No, it's marketing-speak for "near-EUV process nodes have terrible yields compared to previous nodes, and need smaller dies combined on a multi-chip module to get anything worthwhile for an acceptable cost". Current EPYC chips are a single NUMA node again, but still chiplets. They are absolutely kicking themselves for not bucking the trend and going chiplet, because then they would have been competitive with TSMC on yield per area. A single chip is putting all your eggs in one basket, but splitting the dies means you throw away far fewer chips. (Another way out is FPGAs and GPUs, which can practically bin off way more of the chip.)

> And the bets they made didn’t pan out: FPGAs aren’t popular because the people sophisticated enough to use them are also smart enough to tape out ASICs.

FPGAs are very interesting in a post Moore's law world. Their ability to dynamically reconfigure makes them interesting in cases where ASICs don't make sense. High level logic can be treated like code from a continuous delivery perspective (like Alibaba does with their memcache like FPGAs sitting on RDMA fabric). Data can be encoded in combinatorial logic and treated like any other infrastructure deployments (like Azure does with their routing CAMesque logic in their SDN FPGAs). ASICs don't give you anywhere near that flexibility, even in a world where they're a commodity. Don't confuse their tooling immaturity for a lack of usefulness.

> IoT is not a thing.

It's very much a thing; once again, just an extremely immature ecosystem. Once high-end CPUs are commodities that can be shopped around from each of the fabs, IoT external customer designs will almost certainly be a very important revenue stream for Intel. A modern fab is nothing to sneeze at; basically only countries with $20B to spend will have one, so we'll be seeing one or two per continent. It won't make sense for anyone else in the US to compete. As for how that affects IoT, tiny nodes will be amazing for little smart-dust chips once the capital investment of these end nodes has been paid off.

IoT is a thing .. that Intel failed to get into. They don't have anything that scales down that well. That market is dominated by ARM, implemented by all sorts of lower tier vendors like MediaTek.

FPGAs really need a tooling unlock to take off so they can be useful to people who haven't been on the ASIC design course.

> It won't make sense for anyone else in the US to compete


TSMC isn't a US company.

Correct, but they do a lot of fab work for other companies. It's not absolutely necessary to have your own fab to be competitive unless your volumes are huge... at which point you can afford it.

Apple have $245bn on hand, so they could have a dozen $20bn fabs if they felt the need. Bezos has $180bn and no idea what to do with it.


They could, but the answer to "what should I spend $20b on" for both pretty clearly isn't "a fab where at the end of the day you might get a fraction of a percent better costs on bare dies from the vertical integration versus just buying from TSMC or whoever"

Fabs are both a commodity, and take a high capital investment. Unless you have a geopolitical reason to create one and you get .gov kickbacks to make it happen, then there's way better uses of your money.

Thank God for this comment; I couldn't upvote it enough. It goes into much more detail than I could have bothered to put in a reply.

The 2nd comment on the page, and literally everything said in it was wrong.

Basically all that stuff you said about IoT has been said verbatim for decades and yet here we are. Remember the "SmartMote"? Neither does anyone else. By the way that was _also_ an Intel-funded project.

Well, for 25-30 years everyone's been crowing about how Intel was doomed to fail because of x86 being a CISC design. Intel killed off most RISC competitors as Moore's Law held in the '90s and they could just add more transistors to turn x86 into a facade. But that increase in transistors, and doing whatever possible to keep x86 performant, meant they were doomed to lose the power/performance war that came with mobile processors. ARM held on just long enough in the low-margin embedded world to then conquer mobile. Intel was able to win price/performance through superior fabrication and economies of scale, but power/performance really does require a superior design.

If Intel recovers, it may need to jettison x86, or offload x86 interpretation to a secondary unit for desktop processors, with it gone from server procs entirely.

This would make me happy because it’s been crushing to think I might go my entire career/lifetime with little endian processors being the mainstream. :-/

Little endian has nothing to do with CISC vs RISC or x86 though. Everybody is little endian today.

x86 helped with that historically, certainly, but fundamentally endianness just doesn't matter most of the time, and when it does, little endian makes more sense from first principles.

I think big endian is better: with little endian, treating a pointer to one size integer as a pointer to a different size integer has a decent chance of working, which allows bugs and sloppiness to fester. With big endian, screwing up your pointer type is very unlikely to work, which is IMO a good thing.
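That failure mode is easy to simulate; here is a Python sketch with `struct` standing in for a mis-typed pointer dereference:

```python
import struct

value = 42
# Store the value as a 64-bit integer in each byte order.
le_bytes = struct.pack('<Q', value)
be_bytes = struct.pack('>Q', value)

# Reinterpret the same starting address (the first 4 bytes) as a 32-bit
# integer, as a sloppy pointer cast would.
le_as_32 = struct.unpack('<I', le_bytes[:4])[0]
be_as_32 = struct.unpack('>I', be_bytes[:4])[0]

print(le_as_32)  # 42 - little endian: the bug silently "works"
print(be_as_32)  # 0  - big endian: the mistake is immediately visible
```

For small values, the little-endian read returns the right answer despite the wrong width, which is exactly the festering-bug scenario described above.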

> Everybody is little endian today.

ACKCHYUALLY, PowerPC (POWER9 included) supports both big and little endian. Software wise, last I checked the Linux kernel, Debian (PPC port), Gentoo, FreeBSD, and several other major projects supported both. I believe KVM also supports flipping VM endianness versus the hypervisor.

Acknowledged, of course, that the spirit of what you said is indeed correct - the majority of systems in active use are little endian.

ARM too.


See: > AArch64 GNU/Linux big-endian target (aarch64_be-linux-gnu)

>and when it does, little endian makes more sense from first principles

Could you explain why little endian would make more sense? While I don't take sides in this debate, I have always thought big endian would make more sense from first principles.

There are two main applications where I find little endian is more logical: big integer implementations, and bit stream encoding.

With big integers, the logic is simple: if you use little endian, you can operate on the same memory representation of the bigints quite easily with machine integers of different sizes.

A similar phenomenon happens with bit encoding. Let's say you want to encode a sequence of 25-bit integers tightly packed in memory. How do you do that? With little endian, you get a somewhat more natural representation especially for seeking into the bitstream (especially for architectures that allow unaligned memory accesses).
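A minimal sketch of that 25-bit packing (the helper names `pack_le` and `read_le` are made up for illustration):

```python
def pack_le(values, width):
    """Pack fixed-width integers into bytes, little-endian bit order."""
    acc = 0
    for i, v in enumerate(values):
        acc |= (v & ((1 << width) - 1)) << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return acc.to_bytes(nbytes, 'little')

def read_le(buf, index, width):
    """Seek straight to element `index` with one unaligned-style read."""
    bit = index * width
    byte, shift = divmod(bit, 8)
    # Read just enough bytes to cover the field, like an unaligned load.
    window = int.from_bytes(buf[byte:byte + (width + shift + 7) // 8], 'little')
    return (window >> shift) & ((1 << width) - 1)

vals = [1, 2**25 - 1, 12345, 0, 999999]
buf = pack_le(vals, 25)
assert all(read_le(buf, i, 25) == v for i, v in enumerate(vals))
```

The seek is a byte offset plus a small shift; nothing has to be scanned or byte-swapped, which is the "more natural representation" claimed above.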

> With big integers, the logic is simple: if you use little endian, you can operate on the same memory representation of the bigints quite easily with machine integers of different sizes.

That's true with big endian too, but you just have to store your integers in the opposite way as you would on little endian: with the MSB at the lowest address (which of course is the same as the distinction between little and big endian in the first place).

Basically, the bigint layout has to be compatible with the endianness.

I don't know what bignum implementations do in practice on big endian systems though.

If anyone wants to see why little endian is better, try implementing a bigint. Big endian makes everything backwards in terms of iteration order.
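For instance, a toy bigint adder over little-endian limbs iterates forward (the 32-bit limb size and the name `add_le` are just for illustration):

```python
def add_le(a, b):
    """Add two bigints stored as little-endian lists of 32-bit limbs.
    Iteration runs forward: result limb i depends only on limbs 0..i."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s & 0xFFFFFFFF)
        carry = s >> 32
    if carry:
        out.append(carry)
    return out

# (2^32 - 1) + 1 carries into a new limb
assert add_le([0xFFFFFFFF], [1]) == [0, 1]
```

With big-endian limbs the same carry propagation would have to walk the arrays backwards, which is the "backwards iteration order" complaint.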

You are presenting an argument in favor of little endian, yet you provide only enough information to be useful to someone who already knows the answer.

The answer is also only useful to those who know it? I mean, nerds love to shout about the superiority of this or that, but they're mostly just parroting their tribe motto. If you have reason to really care about endian order, I would encourage you to do more research than you'll find in an HN comment. Perhaps there's even more than one right answer depending on circumstance.

I don't have too much knowledge in that area, but my understanding is as follows: for a software engineer, big endian is simpler to use; for a hardware engineer, little endian is simpler to implement.

I think it's similar to scientific calculator vs RPN calculator.

I don't think big vs little endian makes a difference in terms of hardware implementation. It's just a simple swap of wires.

It's more optimal when you're pipelining. For example, to add two numbers you start from the lowest byte, add, and move up; with little endian the bytes are already positioned in the order you want.

The x86 architecture was also big on backward compatibility. You have registers that can be accessed as 8-bit, 16-bit, 32-bit (and 64-bit) views; adding the extra bytes afterwards is similarly easier, because the least significant bits are always in the same place.

Again, I don't have extensive understanding of hardware side, so I might be wrong.
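The register-aliasing point can be mimicked in software: with a little-endian layout, the 8-, 16-, and 32-bit views of a value all start at the same offset. A Python sketch of the memory layout, not actual register behavior:

```python
import struct

value = 0x12345678
mem = struct.pack('<I', value)  # little-endian, as x86 stores it

# The 8-bit, 16-bit, and 32-bit views all begin at offset 0,
# like AL, AX, and EAX aliasing the same register.
al = mem[0]
ax = struct.unpack('<H', mem[:2])[0]
eax = struct.unpack('<I', mem[:4])[0]

assert al == value & 0xFF          # 0x78
assert ax == value & 0xFFFF        # 0x5678
assert eax == value                # 0x12345678
```

On a big-endian layout, each wider view would have to start at a different offset to keep the low byte in a fixed place.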

I know endianness is separate from CISC vs RISC, and mostly a pointless and useless debate.

Except MIPS and the network.

MIPS processors have mostly been replaced by ARM already.

Not in routers.

I work in that area. All consumer-grade router/gateway SoCs from Broadcom have been exclusively ARM for several years now. They were MIPS before.

Cavium also went ARM, and is now owned by Marvell.

ARM has never enjoyed an Op/J advantage over x86. They have low-power designs, yes, but they don’t do more work for a given amount of energy.

x86 won fair and square. The RISC people failed to foresee that instruction density would be extremely important to performance. Intel didn’t beat them with physics. CISC is just fundamentally better.

<i>ARM has never enjoyed a Op/J advantage over x86.</i>

I'm not sure that's exactly accurate; it's more accurate to say that there was a lack of market crossover that allowed similar power or perf envelopes.

That is because ARM did/does make much more efficient CPUs; they just aren't anywhere close to the perf of common x86 cores. AKA a low-clocked, in-order ARM with small caches etc. is more efficient per op, but it can't touch even a medium-size x86. Intel sort of was in that market for a bit and their cores were efficient too, but the main selling point for an architecture is the software around it, and a 50MHz in-order x86 can't exactly run modern Windows in a reasonable way.

Now that ARM & friends are building higher perf parts, the power efficiency keeps getting worse. When someone makes a 5Ghz ARM core it will likely consume more than a couple mW.

The perf/power ratios have more to do with culture and market than ISA.

I suspect that, at least for lower-power or lower-area designs, a compact ISA with a friendlier encoding than x86 would be a win. A significant problem with x86 is that a high-performance core needs to decode multiple instructions per cycle, but x86 has a nasty problem that the length of an instruction can’t be determined until it’s fully decoded. I think that modern front-ends try all possible offsets at once and throw out the wrong guesses. This costs area and power.

A design where all instructions have one of just a few sizes and where the first byte unambiguously encodes the length would be nicer.

RISC-V is decent in this respect.
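As a toy illustration of such a scheme (this encoding is made up, and simpler than real RISC-V, which keys instruction length off the low bits of the first 16-bit parcel), a decoder can find every instruction boundary from the first byte alone:

```python
def decode(stream):
    """Toy decoder for a hypothetical ISA where the low 2 bits of the
    first byte give the instruction length: one table lookup, no full
    decode needed just to locate the next instruction."""
    LENGTHS = {0: 1, 1: 2, 2: 4, 3: 8}
    pc, insns = 0, []
    while pc < len(stream):
        length = LENGTHS[stream[pc] & 0b11]
        insns.append(bytes(stream[pc:pc + length]))
        pc += length
    return insns

# Three made-up instructions: 1-byte, 2-byte, and 4-byte.
prog = bytes([0x00, 0x01, 0xAA, 0x02, 0xBB, 0xCC, 0xDD])
assert [len(i) for i in decode(prog)] == [1, 2, 4]
```

An x86-style front end, by contrast, cannot know where instruction 2 starts until instruction 1 is substantially decoded, which is why parallel decoders speculate on every possible offset.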

FWIW, x86’s legacy is a security problem, too. The ISA is so overcomplicated that nasty interactions cause all manner of security bugs. As a recent example, the sequence mov (ptr), %ss; syscall with a data breakpoint at ptr could be used to root most kernels. With virtualization, this type of thing is much worse. A hypervisor needs to handle all the nasty corner cases in a guest user program without crashing itself or the guest kernel, and it needs to handle all the nasty corner cases in guest kernels without dying. There are various ways that native kernels can literally put the microcode in an infinite loop, and hypervisors need complex mitigations because an infinite-looping microcode bug triggered by a guest can’t be preempted by the host, and it will take down the system.

So yes, x86 is not fantastic.

Can you provide a link to the infinite-loop microcode issue?

I had assumed the opposite, that ARM was better in terms of Op/J. Can you provide a reference to make it certain?

Also, I wonder if the low density of RISC could be countered by introducing execution of compressed/zipped machine code. Some compressors like brotli are highly tuned to the expected type of data to be compressed and are very compact. All entry points to basic blocks in the code are generally known at compile time, so it can be ensured that the jump destinations are decompressible without any context, avoiding the slow process of needing to scan backwards to start decompressing...

Arm Thumb is a compressed ISA. Intel has been decompressing x86 instructions into an internal RISC machine since the Pentium Pro. The benefit is getting more instructions through the chokepoint from DRAM to L1 cache.

The Apple arm chips are way more efficient than any x86. They have comparable performance with a much, much lower power requirement.

The writing was on the wall in the early 90s. The SuperH ISA basically gave you the best of both worlds: fixed-width, relatively easy-to-decode instructions that were 16 bits rather than the 32 bits of many RISC ISAs.

All modern x86 CPUs are essentially “RISC” CPUs that translate x86 instructions into uOPS so they already do what you want to do.

I’m not seeing why the higher level x86 instruction set needs to be jettisoned or killed off...

And ARM processors these days tend to do the same, translating from the Thumb-2 variable-width instruction encoding into whatever each processor uses internally.

Thumb-2 is for microcontrollers (and not all of them even). Anything unix-capable (Cortex-A series and custom cores) reads fixed width AArch64 instructions.

> All modern x86 CPUs are essentially “RISC” CPUs

Where can I read more about that?

If you can work around the formatting issues, there's some classic articles on Ars Technica, e.g. http://archive.arstechnica.com/cpu/4q99/risc-cisc/rvc-1.html

Intel “won” because they had the money to invest in processors from owning the PC market. Now, Apple by itself ships more ARM devices than all personal computer sales combined (not counting servers - it's hard to find numbers on servers). Apple by itself can afford to pour money into ARM designs.

Now we see that Amazon makes enough money and has enough volume to invest in server processors.

This space will be interesting to watch! We live in extremely interesting times hardware wise.

> ARM eating their lunch in mobile.

Not just mobile. Obviously, that's the subject of the OP. But ARM-based servers have been available for quite a while at dirt-cheap prices[1], and they are great for workloads where ARM libs are available.

Edge ML inference[2].



Also, the end of Moore's law is slowing the upgrade cycle and making CPUs more of a commodity business.

I'll add a caveat to (d) because distributed can also mean distributed across cores, and whoever handles that the best gets a strong differentiator.

That is a good point.

> With a move to more distributed programming models the network is often in the critical path now, making cpu performance less relevant for many workloads

Oddly, Intel ought to have an edge here, because they have been selling chips with embedded 100 Gbps Ethernet controllers on-die.

However, I've never seen such a beast in data centre, I suspect they've "reserved" this feature for HPC workloads only.

That was 100G Omni-Path (not Ethernet compatible) which has been subsequently canceled, I guess due to lack of demand. Intel's 100G Ethernet NICs were years behind other vendors due to the 10 nm fiasco.

Intel had always said they were behind because they wanted to be. In other words, they didn't think 100G was on the rise. I tend to believe them, since NIC ASICs are much easier to make than an x86 processor.

It's a good tenet of personal responsibility to say that one's own actions are the result of one's own choices.

So by that logic, I suppose they wanted to be? :-)

Also, Intel is bizarrely behind on PCIe. Historically, a data center machine with big expensive GPUs or other accelerators also had a couple of expensive Xeons in it. Right now, AMD has a much better offering even ignoring cost. I’m sure someone outside x86 (Amazon? ARM itself? POWER?) will jump in, too.

POWER has had PCIe gen4 for a couple years now too.

I think they were intentionally behind. None of their products used or supported PCIe 4, so it was to their advantage to limit their competitors.

I would bet they were not; PCIe 4.0 has always been on their roadmap for 10nm+. The whole 10nm fiasco pushed their entire product roadmap - modems, chipsets, controllers, custom foundry - off track.

About point i: using languages other than native ones is really weird, and in the age of green computing really not efficient at all.

UIs based on JavaScript, Python, or even Java are heavier and less efficient than what a platform can do with natively compiled code.

There are now also loads of ways to build native code and squeeze out the best performance, especially for critical systems.

I still don't understand why there's such hype around using JavaScript or Python on microcontrollers as well.

Speed of development currently trumps everything else. Developers are expensive, electricity is cheap, and software changes at a high rate. If we could reduce the "churn" perhaps people could transition to optimised versions, but optimisation makes rapid change harder.

As a PoC, sure, but for a production system I would always rely on the best performance out there.

Not every system is critical, and not every application needs native compiled code.

> i. GUIs moving to browsers and a lot of compute moving to runtime-based languages like Python/JS means that any new ISA features are initially hard to utilize in killer apps on the client side. So if Intel ships a killer new set of instructions, there is a significant lag before they do anything useful for the user.

I’d argue that the deployment of new native code has never been easier, because it will be the browsers that first make use of new instructions, and they have large sophisticated dev teams who push out frequent releases with minimal user effort to adopt.

For desktop computing, Intel (or possibly "h. AMD") is still the best solution (imho)

This is an interesting twist in the processor wars: here is the #1 cloud company turning the profits from its cloud services into processor R&D to build new CPUs.

This is a much more important announcement than the press is giving it credit for.

I have argued in the past that Intel "lost" the smartphone CPU war because Apple decided not to wait for them to come up with some high-margin processor compromise and instead re-invested profits into the development of bespoke processors that gave their products an edge over their rivals. Others like Samsung followed suit with their own processors for Android.

Doing this both takes away Intel's ability to 'gate' what is and isn't a cellphone processor, and their ability to set the margins they would like.

Amazon, Facebook, and Google have been designing and building their own server designs for years. This takes away Intel's ability to gate their choices at the Server manufacturer. As a result more AMD server processors were deployed by these three companies than the rest of the market combined.

Now Amazon is taking profits from the AWS service and re-investing them in bespoke CPUs that are tuned to the workloads they can see customers running on their infrastructure. As a result they will not only enhance their cost/power edge over Google, Microsoft, and everyone else, their infrastructure can be better than anything you can buy in order to run your own workload, locking you into their service (a moat if you will for keeping you there).

If they succeed, Google and Facebook will follow suit. (I am guessing Google already is well down this path, knowing them but also knowing their secrecy about such things)

If you take 50% of the enterprise server market out of Intel's portfolio they are left fighting for enthusiast/gamer share and AMD is eating their lunch there.

It is going to be really interesting to watch this play out.

> I have argued in the past the Intel "lost" the smartphone CPU war because Apple decided not to wait for them to come up with some high margin processor compromise and instead to re-invest profits into the development of bespoke processors that gave their products an edge over their rivals. Others like Samsung followed suit with own processors for Android.

Kind of a weird way to phrase it given that iPhone 1 through 3GS ran on Samsung SoCs. It wasn't until the iPhone 4 that Apple used their own SoCs. Not sure how Samsung could be following suit in that case.

Edit: also, Apple was never going to wait around for Intel. Apple was actually one of the early investors in ARM, back in the early nineties, had used ARM in the whole iPod line (as well as the ill-fated Newton), and the iPhone lined up with ARM cores in mobile TDPs starting to have full MMUs.

At a certain time, a mobile x86 product had a lot of appeal because of the potential to reuse the software already optimized for x86. The choice then was, wait for Intel to produce good x86 mobile chips to leverage that, or accept that you have to continue using a different architecture. At the point you are deciding to give up on Intel, rolling your own chips starts to look a lot more appealing, since ARM just licenses the IP anyway (and it becomes a differentiator between their product and Samsung's, which is arguably their closest competitor).

There were two good paths to take, but one depended on Intel, and they never provided it.

I mean, see my edit. They were an early investor in ARM and had used ARM pretty much exclusively in their mobile products for about a decade by the release of the iPhone. They were never going to wait for Intel.

Edit: Like they had been using ARM since before Intel released the original Pentium.

Everyone was waiting for Intel, either to switch or to know how to respond and compete.

I think if Intel had come out with a compelling product, Apple would have switched. I suspect they invested in ARM because it made sense at the iPod level, but once you start converging mobile and desktop experiences, there are major benefits to having the same architecture for both. Just look at how rampant the ARM MacBook rumors have been for the last few years.

I think the ARM investment by Apple was good regardless of whether they wanted to switch to Intel for higher end mobile, so I don't see it as evidence as to why they didn't want to.

Intel didn't care. They dabbled in it with XScale, at the time the best mobile ARM processors, then gave up and sold the whole thing. Shortsighted; they should've kept at it.

They had been using ARM in released products since before Intel released the original Pentium.

Nobody was waiting for Intel to get into the cellphone market. The TDP of Intel chips just never made sense, and it was clear that this was because they just didn't institutionally care about that market segment in a real way.

Maybe pundits were waiting, but nobody serious.

Edit: and all these ARM desktop rumors are just that, rumors. They have the ability to ship a competitive low- to mid-end laptop on their own ARM chips today, and aren't. Their wide OoO designs would run beautifully in a form factor that isn't thermally throttled so much. IMO they don't want to switch ISAs and are waiting out the x86_64 patents.

"They had been using ARM in released products since before Intel released the original Pentium."


...I'm not sure it was before the Pentium though, because it appears they both came out in 1993.

TDP of Intel chips like the PXA270, amirite

> Apple is actually one of the early investors in ARM

This is such an understatement. From Wikipedia: The company was founded in November 1990 as Advanced RISC Machines Ltd and structured as a joint venture between Acorn Computers, Apple Computer (now Apple Inc.) and VLSI Technology.

Apple is a cofounder of ARM.

I guess not everyone read books about Steve Jobs.

Steve's view was that Intel had the best chips and fabs, and he wanted the best in the iPhone. He was willing to wait for Intel to give him what he demanded; the price he asked for was way lower than Intel's usual margin but very high for Apple's BOM cost.

Ultimately one of his top engineers put his name badge on the table and said that if Steve insisted on Intel, he would quit. Steve Jobs backed off. They then bought P.A. Semi, and the rest is history.

Intel's failure to capture the iPhone SoC opportunity was described by Paul Otellini, Intel's CEO at the time, as the biggest missed opportunity / failure of his life. (And I am still quite pissed they forced out Pat Gelsinger.)

If Steve had persisted he would have repeated the same Apple Lisa mistake. And the world might not be quite the same.

*Makes me rather sentimental while typing this. I miss Steve Jobs.

Samsung has Exynos; however, due to Qualcomm and patents, you don't ever see their phones equipped with it here in America.

I mean, now you don't. But earlier cellphones hadn't integrated the baseband yet so it was a moot point either way.

Still not seeing how Samsung followed Apple in SoC design when Apple started by using Samsung SoCs.

Apple followed Samsung in SoC design, but Samsung (in 2016) followed Apple (2011) in core design. I expect that’s what they’re getting at.

AMD is also eating Intel's lunch in datacenter. Last time I checked (I don't keep track of Intel pricing) AMD had a 5x price/performance advantage over Intel at the launch of EPYC ROME.

Winning a benchmark isn't winning. Winning sales is winning. Last I heard, most sales still go to Intel.

But exciting times if this flips, competition is good for all

Winning a benchmark by a wide margin is pretty close if you have your foot in the door, which of course AMD does not.

Their success here is going to be nonlinear depending on how long they stay in the lead.

AWS has AMD offerings. That's plenty foot in the door.

In my experience, it's still very hard to get prebuilt AMD products into the enterprise.

I assume there is a pretty big lag between AMD CPUs being much better and prebuilts including them and businesses noticing they are better.

It’s generally hard to buy AMD CPUs period.

I can buy lots of AMD CPUs and motherboards online - tons available. But prebuilt AMD from Dell? No SKUs. I think there's going to be a MAJOR lag there.

In part this may be because AMD BIOS stability has historically been poor - perhaps folks don't want the headaches? But Intel BIOS security with the ME has also been poor - so...

Last time I visited a Best Buy in NYC, they had a large desk full of AMD-based laptops, several featuring Ryzen.

Maybe they finally figured out availability, at least for systems producers.

They're probably speaking about epyc. In general, the availability of epyc, and the number of servers made for it, are very small.

True, but these things take time to move mostly due to time-constrained contracts.

Have there been any published benchmarks of relevant datacenter workloads on this CPU? I certainly haven’t seen any. The previous generation EPYC was really slow on branchy code, despite looking ok on ffmpeg benchmarks and the like.

Intel became too big and too slow. Their R&D efforts are not as efficient as Apple's or in this case Amazon's. This story also proves that in many cases you are better off with a specific design instead of a generic one. If the marketing message is true Amazon has a huge edge.[1]

1. 40% better price performance over comparable current generation instances

Keep in mind that getting 40% better price performance doesn't mean their tech is better. It could just mean they're cutting margins (to drive adoption).

Absolutely. From the customer perspective, does it matter? If pricing later goes back to Intel levels, then it means other cloud vendors can compete without a custom CPU. What worries me is: what if the advantage is real, and not just unsustainably thin margins?

This is entirely Intel's fault for trying to get fat by choking out the market with a gazillion SKUs and a ton of markup. They exhausted their own food supply.

Intel's gazillion SKUs are really just a few abstract-ideal "optimal product, highest price" makes of processor, binned down into many different "sub-optimal product, lower price" channels. If you collapse the binnings together, they're really only producing ~four CPUs at a time.

At least for commercial availability. There's tons of enterprise SKUs just to serve customers like AWS, of course, but that's not to "squeeze the market"; that's because the customer wants custom IP cores in their chips.

With how 10nm is going for Intel, I bet the SKUs really are just the result of binning, not something driven by the sales and marketing side.

they also made bank, like good monopolists ought

But who cares about a R&D processor that you can only use on a specific cloud hosting provider? All people will care about is if they will have to modify their applications to utilize it, and knowing Amazon that is precisely what they'll try to do. AWS is anti-competitive and it should raise alarms with many in the tech community.

Agree and to add a few more things.

These ARM CPUs are built with an ARM core design (Neoverse N1), which means there is no reason why Google or Microsoft, or even Apple and Facebook, can't have their own N1 chips. And they will, because of the competitive advantage you mentioned.

This solves the chicken-and-egg problem where ARM CPUs from Qualcomm, Ampere, and other vendors could not get into the server market due to software compatibility: no one was willing to invest in porting and making sure every piece of software ran well on these chips without hardware in the field. (One of the reasons Qualcomm exited the server market.) Now Amazon is doing both; all the open source web stack AWS offers will support ARM, assuming Amazon upstreams all that work.

And once that software is ARM-ready, other smaller cloud providers such as DigitalOcean, OVH, Linode, etc. will offer and switch to ARM once ARM CPUs become available to them.

Off the top of my head, the last estimate was that hyperscalers represent 50% of Intel's DC and enterprise revenue. Intel's yearly revenue from DC is roughly $20B, so we are talking about a potential loss of $10B+ over the next decade. And DC has been the most profitable segment for Intel.

Microsoft now also has new incentives to get Windows working well on ARM. (A lot of Azure customers run Windows VMs.) They are already on their way with Windows on Snapdragon, although I think it will be a few years before that becomes stable enough and mainstream.

So 5 years down the road, you have servers on AWS running on ARM and all mobile and tablet devices on ARM. Intel is selling fewer units over which to amortise its R&D cost on each leading node, with AMD on a better node fighting for market share. If Windows on ARM works, it is only a matter of time before Apple moves its whole Mac line across to ARM as well, maybe 2025+. (Pro application developers will have an incentive to port their work to ARM on both platforms.)

If Andy Grove were still alive, or his apprentice Pat Gelsinger still around at Intel, they would recognize the same dilemma they faced decades ago: should Intel invest more in protecting the memory business and fight the Japanese manufacturers, or should the company flee the memory market and create a new growth market?

And this time around: should Intel invest more in protecting the x86 business and fight ARM, or should the company flee x86 and create a new growth market?

It was nearly 5 years ago that Intel announced Custom Foundry. They had a chance to take on and compete with TSMC, but they were far too worried their best tech and IP would leak out; they would rather do it all themselves. Five years later, I doubt any customer would want to sign up even if they reopened Custom Foundry.

Sorry if this is a grim outlook for Intel. I am sure they will stick around, just like other once-great companies (IBM, Sun, HP): a long tail of decline until they become mostly irrelevant.

The pricing level for the ARM instances should be interesting. If Amazon prices them well below their Intel and AMD instance types, they could really drive adoption and lock-in.

> If Amazon prices them well below their Intel and AMD instance types, they could really drive adoption and lock-in.

Adoption? Sure.

Lock-in? Umm...

Most servers run Linux, and most software on Linux is distributed as source. The same reason that people can easily move to ARM - they can just recompile/download software for the correct architecture and everything mostly works - is the same reason they can leave easily.

All the other AWS proprietary APIs: sure. It'd take a stupendous amount of work for Netflix to migrate away from AWS. But running on ARM isn't really part of that.

> Most servers run Linux, and most software on Linux is distributed as source.

Being distributed as source code does not mean the source code is not architecture-specific. A lot of software has SIMD-optimized code paths, and it's common for these SIMD code paths to target SSE2 as the least common denominator (since AMD included SSE2 as part of the baseline of its 64-bit evolution of Intel's 32-bit architecture, every 64-bit AMD or Intel CPU on the market will have at least SSE2), so they have to be ported to NEON. And that's before considering things like JIT or FFI code.

> and most software on Linux is distributed as source

Clearly you aren't purchasing much software. If you're one of the (majority of) large software companies that use precompiled binaries from vendors for any of the components in your systems, you're at the mercy of which architectures your vendor(s) support.

Yeah exactly, you're not going to skimp on the CPU by choosing ARM to save $500 a year when you have software that costs you thousands every single year. You're going to stick with x86_64 instead of the crappy budget option that doesn't even run your software.

Twist: enterprise software goes the way of IBM mainframes, with a small number of large customers paying crazy prices for compliance and customization, and everything in the hardware/software stack being 10+x more expensive than its commodity equivalent.

Not every corporation has proprietary x86 binaries they need to run on a server. Every company I have worked at has had entirely open source server software and a few proprietary javascript libraries.

How many of those "precompiled binaries" are truly performance critical? If it's just a single component, you can run it in qemu-user mode and still come out ahead overall.

How many of those precompiled binaries are performance critical? How about all?

From the software that runs for hours to compile a model, the software that calculates results, to the interactive analysis software that needs to load GBs of data.

The whole reason to run these things on a server farm is that you need large and fast machines that are better shared to make sure they get optimal use of the machine and of the license pool.

Is this the archetypal AWS customer that drives the majority of intel’s sales to AMZN? That not only has lock-in with a closed source vendor, but that vendor is not agile enough to give ARM binaries for AWS workloads? Doubtful IMO. Maybe for EDA and CAD/CAM setups, but somehow I doubt those are enough to keep Intel as we know them afloat. Intel has huge reason to fear ARM on the server.

I was thinking about server farms in general, not AWS specifically. You’re probably right that AWS-type server are more likely to skew towards software that is available in source form.

A commercial EDA tool was one of the use cases benchmarked in the source article.

Does anyone run proprietary software in a cloud context?

The whole point is that the hardware becomes as flexible as the software, spin up, spin down, blow it away and deploy a fresh instance if something goes wrong.

I can't see many people trying to do that with proprietary licensing?

Absolutely. You can even get the license and support contract costs built into the instance’s hourly rate. RHEL, Oracle DB, Windows Server, and SQL Server are all offered this way.

All the software on the Linux server farm of my company that truly matters for my job is extremely expensive proprietary binary-only software.

Amazon doesn't have to offer ARM-based CPU instances to customers to benefit. If it just runs its own offerings on ARM, like the servers that run load balancers, SNS, SQS, etc., where it controls the software, it can save a lot of money.

> Based on these results, we are planning to use these instances to power Amazon EMR, Elastic Load Balancing, Amazon ElastiCache, and other AWS services.

Clouds are migrating to AMD and Amazon kicks back with home grown cores. Ouch. Chip vendors have to let go of their ISAs, no one cares, as long as you can run Debian on it.

Whatever happened to the Qualcomm Centriq ARM-based server CPU? It never became available at any reasonable price in small quantities for software development/test/prototyping, as something where you could actually buy an ATX motherboard + CPU together and install Red Hat, CentOS, Debian, FreeBSD, or whatever on it.

Amazon can afford to go all-in on building their own chips, with a solid goal of moving (almost?) all their AWS services over to them (I would bet good money that will become a requirement for services). They're also more forward thinking and experimental than most companies are, or can afford to be. The x86 is the "no one ever got fired for buying IBM" of CPUs.

This promises to be quite interesting. If Amazon can prove ARM as viable on their scale, that'll help companies like Qualcomm who were struggling to drum up attention from the more traditional markets.

Yes, that's great for amazon, but what about everyone else on the planet who has x86-64 hypervisors and would like to have the option of purchasing a 3rd choice of CPU which is not Intel or AMD, at a price/performance ratio that is somewhat competitive with existing options?

There are other x86-64 CPU manufacturers, but you're making a massive trade off for performance. Intel and AMD really have polished their stuff until it shines.

If Amazon can prove ARM can cope with server scale operations, and even outperform x86-64, you can have a high degree of confidence that other companies will get in the game. There are many more companies manufacturing ARM chips than x86-64 (due to licensing agreements, in part)

Qualcomm abandoned it when they became a target of an LBO.

It made sense for them to reuse their design expertise and hedge into other markets so that they didn't get stuck exclusively with trends in mobile devices. AAPL told their suppliers to withhold royalties, hurting revenue. And once AVGO took aim, they needed better focus. Centriq was a long term bet that could've worked out (cloudflare had a positive review) but it just wasn't in the cards.

We just released our test results of the new Graviton2-based M6g server. Pretty solid numbers. This is a game-changer.


I wish individual consumers could go out and buy these processors. Amazon really has little incentive to sell them - they are not primarily a chip manufacturing business, and anyways, why give away their hard-fought competitive advantage? - but damn, it would be cool to stick one of these in a workstation and watch it fly.

I expect Ampere eMAG 2 to be pretty similar to Graviton2, so you can buy that instead. (Or pretend that you're going to buy it until you see the price.)

Oh, interesting. Is Graviton designed in collaboration with Ampere?

No, but one has 64 N1 cores and the other has 80 N1 cores, probably with similar memory controllers and PCIe as well.

You can get rack mount servers easily enough, the ThunderX2 processors are pretty damn fast!


Stay tuned and look into Nuvia.


An ARM workstation is going to fly like a Cessna.

So it'll be popular with rich hobbyists?

In case pvg's "yesterday" comment isn't clear, this link is to a previous discussion of the same story.

I also meant the apparent distance to all my troubles was significantly greater.

Looks like they're going to stick around for a while.

What % of the server cost for Amazon is the CPU? I mean you have memory, disk, network, power...

I'm guessing power is the big one. A marginally cheaper cpu is not that big of a deal, but a cpu that costs marginally less per hour to run will save them a lot at scale.

AWS is running on very thin margins, and every couple percent of extra productivity could easily mean doubling profits.

> AWS is running on very thin margins

Incorrect. Amazon is a public company, and their own report tells us that AWS margin is 25%, as of 2019 Q3.

Are we thinking Intel is Amazon’s next Barnes and Noble?

AMD already doing that to Intel

Ya I get it but they are in the chip biz.

I’m always fascinated when an incumbent gets beat by a company that views that core business as an ancillary means to their own core business.

Another example, I think, is Netflix creating its own content. It's a streaming company... but now it produces award-winning content.

> when an incumbent gets beat by a company that views that core business as an ancillary means to their own core business.

Early on, book sales were Amazon's core business.

What is "Graviton2" in this context?

There's no Wikipedia page for it, and the regular google search results all loop back to this AWS announcement.

I am gathering that it's a chip, a CPU, apparently ARM architecture rather than Intel x86, and that it is (or will be) an important chip because of this. I don't know why, though.

I'm still reading between the lines here, which is frustrating and doesn't reflect well on the journalism; So you say:

"AWS announced in late 2018 the EC2 A1 instances, featuring their own AWS-manufactured Arm silicon"


"AWS during its annual re:Invent conference announced the availability of their new class of Arm-based servers, the M6g and M6gd instances among others, based on the Graviton2 processor."

So the Graviton2 is made by Amazon as well? Under licence from ARM, I mean, like other ARM-designed chips.

_edit_ Looks like it, yes. Here is the context necessary to understand the story:



Complexity and cost have cooled my enthusiasm for AWS.

It's interesting, but I wouldn't say Graviton2 is likely to succeed.

Every ARM-based PC I've seen has been disappointingly slow, which I don't really understand, because phones are usually amazingly quick.

DEC Alpha, Sun SPARC, MIPS, and PowerPC couldn't keep up with Intel either.

Intel CPUs aren't cheap, but they're a fraction of the cost of a server.

All adds up to interesting innovation but not necessarily the future.

> Every ARM based PC I've seen has been disappointingly slow

Most recent ARM PCs were explicitly low-power low-performance. That was the goal, not a description of the architecture. The mobile part pretty much confirms that.

The ARM chips that Amazon is putting in these servers are nothing like what you'll find in a consumer ARM laptop.

The capital cost is only one part of the equation. These modern ARM cores have insane per-watt performance and that’s what adds up for horizontally scalable services.

RISC-V chips will handily beat ARM on performance per watt, given similar specs and fabrication technology. The margin is not insanely large but it's definitely substantial in existing benchmarks.

Which will be great, just as soon as someone comes out with a competitive server CPU + chipset.

Surely these are in development, but based on the ARM experience, it'll be at least 5 years from availability of reasonable parts to any decent volume.


> Intel CPUs aren't cheap, but they're a fraction of the cost of a server.

Depending what what performance level you're looking at, they're a very large fraction of the server cost. Get a dual or quad socket board and buy the higher end CPUs and you might be looking at well over half the cost.

I think the point is that, if you're buying a quad socket, the 2TB of ECC RAM you stick in it will dwarf the CPU prices; similarly, it's not hard to drop $50k on U.2 NVMe, etc.

It's possible to shift this argument different ways depending on the part of the market you're looking at.

I'm not sure about MIPS, but over the years there were times when Alpha, Sparc, and PowerPC were each ahead of x86 in performance. Intel outpaced them by using their huge volume sales to invest in fabrication improvement at a scale none of them could hope to match, going to ever more complicated internal architectures, and a dash of perfectly legal but slightly expensive tricks (like hiring away key designers).

Intel stumbled in fabrication, and the complicated architecture is having a bit of a growing pain. The CPU market doesn't make revenue like it used to – very few desktop users feel the need to upgrade every two years, and for many applications pretty much any CPU is fast enough. Growth in the cloud market could be good, but anything that threatens their market share of the cloud is a threat to Intel.

> Every ARM based PC I've seen has been disappointingly slow,

The only ARM based PCs I've ever seen are the Raspberry Pi and other products intended to compete with the Pi, and one of the core features of these PCs is being a complete computer for under $100. Keeping the price low was always the top priority, not performance.

The Graviton2 is about performance. We did tests. This thing blitzes.


> Every ARM based PC I've seen has been disappointingly slow, which I dont really understand because phones usually are amazingly quick.

Maybe it’s a latency vs throughput thing?

As far as I know, ARM is great at balancing latency and power consumption, but behind x64 in throughput. [citation needed]
