AMD Discloses Initial Zen 2 Details (wikichip.org)
446 points by throwaway2048 on Nov 18, 2018 | 197 comments

This is a pretty bold change and really makes Intel chips less appealing. With the recent kernel changes for Intel's Spectre issues and this chip coming up, I think it's a distinct possibility that AMD will take back a ton of marketshare in the server space.

If these chips really do have a ~25% performance gain over Ryzen (not counting IPC gains), I might just upgrade, and I think a lot of other people are in the same boat.

This could be a fantastic win for AMD. I was worried that Zen 2 was going to be a moderate improvement, but this seems to be pretty significant. It's been a while since AMD has been first to market with a big winner...

What's disappointing is that none of the major cloud vendors has made any real commitments to AMD, at least none that I've seen.

If speculative execution is a problem and you need to give each VM its own exclusive cores, great when you can source parts with 64+ cores on them!

AWS, Microsoft, Oracle, Packet... all have AMD shapes, in bare metal and VM to varying degrees. The list goes on and on.

Oracle: https://www.amd.com/en/press-releases/2018-10-23-amd-and-ora...

Azure: https://azure.microsoft.com/en-us/blog/announcing-the-lv2-se...

AWS: https://aws.amazon.com/about-aws/whats-new/2018/11/introduci...

Packet: https://www.packet.com/hardware/amd/

LV2 is invite only on Azure, eg. not generally available.

Off-topic but eg = for example, ie = in other words.

Both those are from Latin; I knew what e.g. is short for, but now I got around to looking up i.e. too:

ie => id est => that is

eg => exempli gratia => for example

Thanks, as a non native speaker this is very useful for me

FWIW nobody's been a native Latin speaker for a thousand-plus years. You're in the same boat as everybody else! ;-)

True, but the problem is that my language (Italian) only uses "es." for "e.g." and doesn't have an equivalent acronym for "i.e." (we usually use other complete words that roughly translate to "therefore"), so it's easy for me to mix them up since I can't easily map them.

Thanks, I always mix them up.

AWS began releasing a new AMD generation a few weeks ago.. https://aws.amazon.com/blogs/aws/new-lower-cost-amd-powered-...

Fantastic news. Thanks for the link!

How could they? AMD has been off the map for generations.

Providers will begin uptake of AMD again, it just takes time for contracts to roll over and the new hardware to reach production.

What's even more disappointing is that Zen 2 doesn't include support for new tech like the upcoming DDR5 and PCIe 5.0, which it seems will be supported in Zen 4.


IC: AMD has already committed that Milan, the next generation after Rome, will have the same socket as Rome. Can you make the same commitment with Zen 4 that was shown on the roadmap slides?

MP: We’re certainly committed to that socket continuity through Milan, and we haven’t commented beyond that. Obviously at some point the industry transitions to PCIe 5.0 and DDR5 which will necessitate a socket change.

IC: So one might assume that an intercept might occur with Zen 4?

MP: No comment (!)

Hoping for DDR 5 or PCI-E 5 support in Zen 2 is equivalent to hoping that Zen 2 is at least a 2020 architecture rather than a 2019 one. This is because:

1) PCI-E 5 isn't even standardized yet

2) Mass production of DDR5 memory chips might begin ramping up in 2019, but it's unlikely that DDR 5 memory sticks will be readily available on the market until 2020.

3) It takes a really long time to engineer a cpu architecture and bring it to market (generally a year or more for even just a revision of an existing architecture).

4) Adding support for a new memory or PCI-E version requires a new socket and a new chipset, which creates even more work that by definition can't be "done" until things are fully standardized.

AMD are going to support DDR5 and PCI-E 5 eventually... But they want to release a new architecture revision on their current socket and chipset now before it's reasonably possible to do that.

Edit: All this is to say nothing of AMD's commitment to potential partners to keep the underlying socket and chipset the same for a reasonable period of time, which is critical to planning supply chains and upgrade paths for major partnerships.

> Adding support for a new memory or PCI-E version requires a new socket and a new chipset

Memory, sure, you don't want compatibility issues. Why PCIe, when the important lanes are all driven directly off the CPU?

Ah, fair. I guess I should say "can require".

PCIe 4.0 was only standardized last year, so it's not surprising that Zen 2 won't support 5.0, which is expected to be introduced mid next year.

Zen 2 will be the first CPUs supporting 4.0.

Power9 was the first to support it, over a year ago. But you're right, 5 is not finalized, so it's not possible for AMD to support it.

Release date for PCIe 5.0 is Q1 2019


On June 5th, 2018, the PCI SIG released version 0.7 of the PCIe 5.0 specification to its members.

PLDA announced the availability of their XpressRICH5 PCIe 5.0 Controller IP based on draft 0.7 of the PCIe 5.0 specification on the same day



Historically, the earliest adopters of a new PCIe specification generally begin designing with the Draft 0.5 as they can confidently build up their application logic around the new bandwidth definition and often even start developing for any new protocol features. At the Draft 0.5 stage, however, there is still a strong likelihood of changes in the actual PCIe protocol layer implementation, so designers responsible for developing these blocks internally may be more hesitant to begin work than those using interface IP from external sources.

From here: https://en.wikipedia.org/wiki/PCI_Express

AMD had plenty of time to include PCIe 5.0 even in the Draft 0.7 stage which is pretty much the same as final draft/release, but decided not to.

Yes. AMD could have simply switched the chip over to non-finalized standards between June 5th and now. Clearly

No possible way that could backfire or be impossible

> AMD had plenty of time to include PCIe 5.0 even in the Draft 0.7 stage which is pretty much the same as final draft/release, but decided not to.

... Just how long do you think it takes to make that kind of design change? This requires a major change in silicon. It would have had to be done more than a year ago.

Not to mention there are no PCIe switches with it. You'd have to wait for Avago/PLX/Broadcom/whatever it's called now to update them.

Do the new 8th gen Intel chips finally deal with Spectre/Meltdown issues in hardware or do they still depend on OS level fixes?

Yes. I expect both vendors to reach parity with respect to mitigations of known speculative execution vulnerabilities. Whether their architectures have been fundamentally redesigned to make these issues more difficult to exploit is anyone's guess.

Variant 1: Both AMD and Intel will continue to rely on OS patches.

Variant 2: Both AMD and Intel will fix in silicon. Intel's fix comes in Cascade Lake.

Variant 3 (aka Meltdown): AMD was not affected, Intel will fix it in silicon in Coffee Lake Refresh and Cascade Lake.

Variant 4: Both AMD and Intel will rely on OS and Microcode patches

L1TF: AMD was not affected, Intel will fix in silicon in Coffee Lake Refresh and Cascade Lake.

I'll note that for the AMD products, this is my best guess from Lisa Su's comments. I do wish they put out a table like Intel did.

[1]: https://www.anandtech.com/show/13450/intels-new-core-and-xeo...

[2]: https://wccftech.com/amd-zen-2-cpus-fix-spectre-exploit/

EDIT: Replaced the Intel source with a better link

Don't forget the recent seven additions to meltdown/spectre. It'll be interesting to see if the latest round of hardware protections help there.

8th gen (Coffee Lake) have the same uarch as Skylake and have no hardware fixes for Meltdown or Spectre.

And no pcie4 either, unfortunately.

Sure they take care of it real well... in the OS.. can’t wait till this comes to windows by default.


Most of what I do is floating point operations, so I'll see >100% performance gains with the avx boost. I will upgrade.

Any chance your workload would see a benefit from porting the code to a GPU?

I work on tech to automate that. The generic answer is yes as long as you overcome the PCIe bottleneck (or can hide the latency).

I assume that should be "and can hide latency"?

While GPU offloading is extremely powerful in terms of compute, you have to deal with both bandwidth limitations (de facto ~12 GB/s for x16 PCIe 3.0) and the latency of launching compute kernels and waiting for them to complete.

That depends on the problem space. For some problems latency does not need to be an issue: if the problem can be pipelined, then you can keep all your cores running at a high percentage of their theoretical maximum load even if latency is high.

For real time applications this is usually not an option, there latency > some threshold will kill your proposed solution.

In those cases where the latency is not a problem, it tends to be precisely because you can hide it e.g. via pipelining :)
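To put rough numbers on the bandwidth and latency points above, here is a minimal sketch of a two-stage copy/compute model. The ~12 GB/s figure is the one quoted earlier in the thread; the chunk size, compute time, and launch overhead are made-up illustrative inputs, and the function names are hypothetical. It only shows how overlapping the copy of one chunk with the compute of the previous one hides transfer time; it is not a model of any particular GPU or driver.

```python
PCIE3_X16_GB_PER_S = 12.0  # effective x16 PCIe 3.0 bandwidth quoted in the thread


def transfer_seconds(n_bytes, gb_per_s=PCIE3_X16_GB_PER_S):
    """Time to move n_bytes over the link at the given effective bandwidth."""
    return n_bytes / (gb_per_s * 1e9)


def serial_time(n_chunks, chunk_bytes, compute_s, launch_s):
    """No overlap: every chunk pays copy + kernel launch + compute in sequence."""
    per_chunk = transfer_seconds(chunk_bytes) + launch_s + compute_s
    return n_chunks * per_chunk


def pipelined_time(n_chunks, chunk_bytes, compute_s, launch_s):
    """Two-stage pipeline: the copy of chunk i+1 overlaps the compute of chunk i.

    The first copy and the last compute cannot be hidden; in between,
    the slower of the two stages sets the pace.
    """
    copy = transfer_seconds(chunk_bytes)
    work = launch_s + compute_s
    return copy + (n_chunks - 1) * max(copy, work) + work


if __name__ == "__main__":
    # Hypothetical workload: 10 chunks of 120 MB, 10 ms of compute each,
    # 10 microseconds of launch overhead per kernel.
    args = (10, 120_000_000, 0.010, 10e-6)
    print("serial    :", serial_time(*args))
    print("pipelined :", pipelined_time(*args))
```

When copy and compute take about the same time, pipelining roughly halves the total; when the copy dominates, the link stays the bottleneck no matter how well latency is hidden.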

You overcome the bottleneck by doing as much as possible on the GPU. Even things you don’t think are fully suited to it.

I humbly request in every conversation whenever someone uses "distinct possibility" that they quantify it with a number.

This phrase may be worse than useless: it creates the impression that people agree, when their estimates could be anywhere from 0.0001% to 10%.

I know this isn't central to what you're saying, I just want a better conversational norm around probabilities during conversations in the world.

Intel has a chip shortage, so AMD might be able to sell their CPUs and motherboards while Intel cannot meet demand.

I always wanted to make an Amiga with an AMD chip and move AmigaOS to x86-64 and away from PowerPC, so it could sell cheaper and faster systems. Compete with Apple, at least, by merging AmigaOS with GNU/Linux, and work with Valve on SteamOS to run AmigaOS apps and get some classic Amiga games into SteamOS and GNU/Linux.

Amiga hardware is what made it special; a generic OS on generic hardware is just yet another OS.

I was actually underwhelmed by the post, what exactly is the "pretty bold" change?

Separating the analog logic (IO die) from the digital logic (CPU cores). Analog shrinks really badly with smaller nodes, whereas digital logic shrinks really well.

New process nodes get more and more expensive, especially if the node is new (bad yields etc.). So you want to get as much out of it as possible, which AMD does with this strategy.

For example, Zen 1 8-core chips are 213 mm2 on 14nm. On 7nm these chips would be > 106 mm2 (because the analog elements don't scale). Zen 2 chiplets should be at most 75 mm2 (while also increasing cache, widening AVX etc.), which is mobile SoC territory, which means they are cheaper to produce (better yields etc.). Since Intel uses monolithic dies, AMD already had a cost advantage; with Zen 2 it's not even funny.
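As a rough illustration of why smaller dies are cheaper per working chip, here is a minimal sketch using the textbook Poisson yield model, Y = exp(-A * D0). The die areas are the ones quoted above; the defect density D0 is a made-up illustrative value, not a figure for any real process, and the function names are hypothetical. The model ignores wafer-edge loss, binning, and packaging cost.

```python
import math


def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under a Poisson defect model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_die(area_mm2, defects_per_cm2):
    """Wafer area (mm2) consumed per working die, ignoring edge loss."""
    return area_mm2 / poisson_yield(area_mm2, defects_per_cm2)


if __name__ == "__main__":
    D0 = 0.5  # hypothetical defects per cm2 for an immature node
    for area in (213, 106, 75):  # die areas quoted in the comment above
        print(
            f"{area:4d} mm2: yield {poisson_yield(area, D0):.2f}, "
            f"{silicon_per_good_die(area, D0):.0f} mm2 of wafer per good die"
        )
```

Under this model the silicon spent per good die falls faster than the die area itself, because the smaller die also yields better.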

Especially if the guesses of many people are true and AMD designed the IO chip to be cut down, which means they can reuse it for their desktop CPUs (Ryzen on AM4). Another nice advantage of this would be that they can bin the chip even better for EPYC, Threadripper and Ryzen.

AdoredTV does a good job of aggregating all the infos/ideas that are floating around and presenting them:


There's also the clever business reason: they're obligated to keep buying silicon from GlobalFoundries, but GF has bowed out of pursuing 7nm. So here they can use GF's 14nm for enough silicon to cover their obligations.

The I/O module is not analog, but most of your points still stand. It is mostly for increasing yields and reducing costs. With this separation they are reducing chiplet size quite a bit, so that they will have superlinear cost reduction on the 7nm process.

IO is 100% analog. Slew rates, crosstalk, reflections, routing, etc. It is literally called Black Magic in the EE community.

Must read "High Speed Digital Design: A Handbook of Black Magic": https://books.google.pl/books/about/High_speed_Digital_Desig...

I am not sure our definitions of analog are the same. Are you claiming that the gigantic chip shown in the Zen 2 picture is an analog chip?

The whole CPU is analog, just hidden away under an abstraction of vendor-supplied (Cadence/Mentor/Synopsys/etc.) RTL IP building blocks (GDS libraries) whose specification is highly customized to a particular fab house. This lets you design the logic part of the circuit without getting into analog hell.

The difference between the logic part of the CPU and the IO is in the amount of work spent on RTL vs direct custom layout of geometric structures on the chip.

There is nothing analog about the IO die. It interfaces the memory, and the system's busses like PCIe. They are all digital.

Sorry, I meant analog circuitry, which I've read about again and again in connection with CPUs; see below for one example. It was also mentioned in some articles about the IO die on Zen 2.

> Intel’s analog designers are responsible for creating circuits such as PLLs (for clocking) and PCIe (for I/O) that are critical for modern SoCs on a process that isn’t optimized for analog


It's all analog at that level.

Yeah, anything measured in GHz going off-chip is well into analog signal processing territory.

The incorporation of the IO chiplet is a sea change for processor tech. It means many physically separate core chiplets get the same path to RAM, which overcomes many of the issues with NUMA while avoiding massive and expensive CPU dies.

The UMA benefit is a double-edged sword. Best-case RAM latency probably gets worse, all else being equal. But maybe they've made Infinity Fabric fast enough in Zen 2 to eliminate any additional latency compared with the local memory domain in Zen 1.

"Oh the places you'll go" :-)

Assuming this chip doesn't trip all over itself moving things around, it will be an astonishing amount of computer power in a reasonably sized package. This is for me, the only reason to work at an internet giant; because they will build a tricked out motherboard with two sockets and up to 8TB of RAM and with say a petabyte of attached non-volatile storage available, field solvers, CFD apps, EM analysis would just melt away. Whether it is designing a rocket engine, or folding a protein, or annealing a semiconductor on the quantum level. So often these programs use approximations to allow them to run in finite time, and now more and more of the approximations are replaced with exact numerical solutions making the models more and more accurate.

Five years from now when people are throwing out these machines to replace them with Zen3 or Zen4 machines, I'm going to be super glad to get one and play with it.

> exact numerical solutions

Isn't that an oxymoron? Numerical solutions always involve some sort of rounding errors because of limited floating-point precision, so they cannot be exact.

Reminds me of the mathematician joke, "We know that one side of one house is painted brown." :-)

You are correct in that most of these systems don't have a closed form solution that would yield an exact result, and they rely on either an interpolated value which is inexact depending on deviation or numerical solutions which typically iterate to the nearest point, that the system can represent in its limited 80 or 128 bit floating point system.

One can always build a counter-example, for instance when the candidate solution space is discrete but continuous-space solvers are fast and convenient. Then as soon as you get within half a discretization step, you get an exact solution.

A computer can be as exact as you want it to be; that it supports floating point in hardware doesn't mean that is as accurate as it can get.

It is far more apt to bring up that most "interesting" solutions do not have an exact closed-form expression.
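A small sketch of the points made in this subthread, using only the Python standard library. It shows hardware floats rounding, exact rational arithmetic avoiding that rounding, and the discrete-solution case where a continuous solver that lands within half a step of an integer answer can be rounded and then verified exactly. The helper names are hypothetical, and Newton's method for a square root is just a stand-in for "a fast continuous solver".

```python
from fractions import Fraction


def float_sum_is_exact():
    """Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly."""
    return 0.1 + 0.2 == 0.3


def rational_sum_is_exact():
    """Exact rational arithmetic in software has no rounding error here."""
    return Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)


def newton_sqrt(n, iterations=30):
    """Approximate sqrt(n) for n > 0 with Newton's method on floats."""
    x = float(n)
    for _ in range(iterations):
        x = 0.5 * (x + n / x)
    return x


def exact_integer_root(n):
    """Round the continuous answer, then verify it with exact integer math.

    Returns the integer square root if n is a perfect square, else None.
    """
    candidate = round(newton_sqrt(n))
    return candidate if candidate * candidate == n else None


if __name__ == "__main__":
    print(float_sum_is_exact())      # floats round
    print(rational_sum_is_exact())   # rationals do not
    print(exact_integer_root(144))   # discrete answer recovered exactly
    print(exact_integer_root(2))     # no integer solution exists
```

The final check in `exact_integer_root` is what makes the result exact rather than merely close: the floating-point solver only proposes a candidate.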

I'm excited. My next build shall be AMD.

I hope AMD has something in their video chip line-up against nVidia as well. I'm curious whether they knew nVidia was working on ray tracing and plan to implement the same API, or whether they have some other tech up their sleeve that they've been working on with vendors. I hate how nVidia has such a monopoly on the video chip market. Who knows, maybe Intel will finally get back into the gamer 3D market and we might finally have three options again.

AMD almost certainly knew about it as it's in DirectX. This standard, called DirectX Raytracing (or DXR for short) is actually what the very first game to ship with real-time raytracing uses. This is what Battlefield V is using. Not Nvidia's proprietary extensions.

Since nobody pushes to add a fancy new thing to DirectX unless they plan to ship hardware specifically to do that new thing, AMD would have to have been blind & deaf to not know Nvidia was going to ship RTX this year.

I hate the Cuda monopoly in machine learning right now. I hope some of the mature libraries (TF, Pytorch) start officially supporting AMD GPUs.

Nvidia is looting customers that want to use GPUs for Machine Learning on the cloud (like AWS, GCP)

Nvidia is cashing in their investments for software and integration.

It's AMD's job to make machine learning work on their GPUs. If they don't believe in it and spend the necessary time and effort, nobody else will. Radeon Open Compute Platform (ROCm) has existed for years, but apparently it's not good enough.

ps. Tensorflow has ROCm backend support. https://hub.docker.com/r/rocm/tensorflow/ but is MI25 competitive? https://www.amd.com/en/products/professional-graphics/instin...

No clue but if consumer vega is competitive http://blog.gpueater.com/en/2018/04/23/00011_tech_cifar10_be... MI25 has a chance of being

> Tensorflow has ROCm backend support.

Why is it a separate repo rather than being contributed upstream to TF?

PyTorch has AMD support in the main repo: https://github.com/pytorch/pytorch/tree/master/tools/amd_bui...

AMD is planning to upstream their work and some is already there, but are still behind in versions they support.

There are some efforts to improve the situation: https://www.phoronix.com/scan.php?page=news_item&px=Red-Hat-...

Isn't ROCm a viable alternative?

Not really, at least not yet. It doesn't exactly provide the same API capabilities as CUDA yet (there are some unsupported functions) and no one has realistic comparisons of benchmarks but the consensus is that there's some performance gap between similar chips from Nvidia and AMD.

Both are being actively worked on, and I expect to see this gap shrink in the next few months to a year.

Khronos is to blame for focusing too much on C, without convenient tooling.

CUDA had C, C++ and Fortran support since the early days, followed by the PTX bytecode format for any compiler vendor that wanted to support CUDA on their languages.

They had to lose the race before they would focus on C++ and come up with SPIR and SYCL.

And tools still seem not to be on par with what NVidia offers.

Khronos is a standards group. It's their job to make standards, not tooling. Individual vendors that are part of Khronos can make tooling.

Then they made a bad standard to start with.

Also regarding tooling, that isn't quite true, as they have engaged with LunarG for Vulkan, because they have learned from lack of OpenGL adoption, that the large majority of games developers wouldn't even bother without a proper SDK.

> Then they made a bad standard to start with.

Nonsense. You'll be hard-pressed to find a platform that doesn't support C, or a competent system design or even graphics design engineer who isn't comfortable with C.

If their goal is to establish standards, they build upon standards.

Developers jumping on CUDA due to lack of C++ and Fortran support on OpenCL proves how good that decision was.

Likewise for game consoles, Windows and now Apple graphics APIs.

Defining standards does not mean they must provide a lowest denominator single implementation.

They can be defined in an IDL kind of way, like e.g. WebIDL, abstractly e.g. Internet RFCs, or define mappings to all major languages in the GPGPU field, namely C, C++ and Fortran.

> They can be defined in an IDL kind of way, like e.g. WebIDL, abstractly e.g. Internet RFCs, or define mappings to all major languages in the GPGPU field, namely C, C++ and Fortran.

That's already done. Mozilla's Obsidian is an object-oriented WebIDL interface to Vulkan: https://github.com/KhronosGroup/WebGLNext-Proposals/tree/mas...

Again, Khronos is just a standards group. It does what its members tell it to. Khronos is far more lightweight of an organization than, say, ISO's C++ committee. Instead of blaming Khronos, you could perhaps encourage them to focus on initiatives like Obsidian.

I find it unlikely that AMD would be able to release a proper peer competitor to nVidia's best any time soon.

Prior to the Ryzen launch, AMD lived through some really lean years. They saved wherever they could, and their GPU department saw very little investment on development for a long period. Since then, they have recapitalized the department, and are probably designing a proper next gen GPU architecture to replace GCN. However, it takes ~4-5 years to get a completely new arch on the market, and they only got to start a few years ago. Vega and Navi are both still shoestring budget designs.

If you hadn't heard, Intel has actually thrown its hat into the GPU ring again. It's set to launch in 2020.

By all accounts RTX delivers negligible visual improvement for games, for a huge performance hit.

The technology is a lot more interesting in professional 3D modelling where light baking and rendering can be done much faster.

The hardware raytracing support is a huge step in the right direction for real time graphics. The shaky jenga tower of hacks on top of the rasterization pipeline that is in the current generation of rendering engines is barely sustainable. Fast raytracing provides an alternative that is much closer to the actual physical model of light transport in the real world and does not require as many nasty approximations and cheats as rasterization for the same quality.

Mind you that you are seeing the early generation of hardware based real time raytracing now and it already can beat the quality of rasterization.

There is still a ton of room for improvements. One or two hardware generations from now we may finally see realtime generated images for which the term photorealistic is not just a silly marketing hype. They could truly fool you.

The currently available generation of hardware based real time raytracing cannot beat the quality of rasterization at the same framerates. NVidia built a cool bit of hardware, but their media blitz was seriously misleading. It's much faster than anything you can do on earlier hardware, but that doesn't mean it's actually fast yet for general game engines.

One of the problems is that raytracing is inherently noisy, so you actually have to shoot many rays (or trace many paths) for each pixel on the screen, and you really need to get into the high double digits for acceptable results purely from raytracing.

Part of NVidia's sales pitch is that you can reduce the number of samples / rays using advanced post-processing filters based on CNNs. But then you're back to building a jenga tower of hacks, just this time on top of raytracing instead of on top of rasterization. You may prefer the raytracing jenga tower perhaps, but truth in advertisement should make it clear that raytracing alone just isn't there yet.

It's true though that this current generation of hardware is promising, and things are bound to get better quickly. For now, the most you're going to get is hybrid techniques where rasterization is used for the bulk of the rendering, and raytracing is used selectively e.g. for parts of a scene or for shadows.
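The point about noise and ray counts can be illustrated with a toy Monte Carlo estimate. This is a minimal sketch, not a renderer: each "ray" is just a uniform random sample whose true mean is 0.5, and the function names, trial count, and seed are arbitrary choices for illustration. What it shows is the general Monte Carlo behaviour that the error shrinks only with the square root of the sample count, so halving the noise costs four times the rays.

```python
import random


def render_pixel(samples, rng):
    """Average `samples` noisy stand-in radiance values (true mean 0.5)."""
    return sum(rng.random() for _ in range(samples)) / samples


def rms_error(samples, trials=2000, seed=1):
    """Root-mean-square error of the pixel estimate over many trials."""
    rng = random.Random(seed)
    squared = [(render_pixel(samples, rng) - 0.5) ** 2 for _ in range(trials)]
    return (sum(squared) / trials) ** 0.5


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        print(f"{n:4d} samples per pixel: RMS error {rms_error(n):.4f}")
```

Going from 4 to 64 samples per pixel is 16 times the work for roughly a quarter of the noise, which is why denoising filters and hybrid rasterization are attractive at real-time frame rates.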

I personally know some of the guys who developed the CNN based denoising. This stuff really works and it is actually pretty simple to implement compared to the truly horribly complicated hacks that you need to get e.g. realtime reflections even remotely to a level that is not outright jarring and obviously wrong looking.

What has me worried is not the performance of the ray-scene intersection, but the dynamic BVH update that is required for each frame. Building acceleration structures for ray intersection tests is a hard tradeoff between fast construction times and fast intersection tests. With animated/deforming objects, this is quite an interesting challenge. But we might eventually see some dedicated hardware for that, too. Academic implementations exist.

For a gamer, Nvidia's latest card offering is vastly disappointing: technology that is not there and won't be there for maybe the next 6 years (if they stick to a 3-year release cycle), offering some shiny surfaces at best, at a massively increased price. I really hope AMD will come up with some good, cheaper competition.

Gamers want fast framerates, high polygon count, high textures much more than some currently-useless raytracing. Solve those into near-reality quality in at least 1440p resolution, and then we can talk about these gimmicks. I'll happily skip this (and next) gen and will be OK with some 1080 card.

I don't get the ridiculous pricing of the current generation of nVidia chips, either. But this generation of hardware is at the same time an important step towards getting rendering algorithms that have vastly superior quality while having a much simpler structure. I know that anybody that works on high quality real time renderers and knows what they are doing wants this switch to happen.

Heck, I've only read papers about most rasterization based "realism" tricks and I want this switch to happen. Polemically speaking, current realtime rendering pipelines are based on sprite rotoscalers and the end user being too busy to notice the errors. Raytracing is based on how light works.

The literature conveniently neglects to mention how fiddly and difficult it is to implement that stuff reliably. Debugging the rendering pipeline is not exactly easy, especially when the data is not just positions and colors, but more abstract stuff. Also, there's tons of edge cases that need hacks and workarounds, crazy driver bugs etc...

Buying an 800mm2 Die along with 12GB GDDR6 Memory for $999 isn't really ridiculous. Try getting Intel or AMD to sell you an 800mm2 CPU Die for $999. As a matter of fact I thought it was reasonably priced in terms of Die Size and Transistors. Now whether those 800mm2 perform up to our expectation is an entirely different matter.

Honestly, once we have a suitable HMD (aka VR goggles), we can use a cluster of raytracers to experiment with actual photorealistic reality. My previous best guess would have been a clustered version of Luxrender, but my most recent projections yielded ~20MW of current-gen CPU cluster (with a hint of GPU (about 3:1 electrical CPU:GPU) to offload ray intersection for a 2~3x speed boost) as what would be needed to feed a display that can reach the far end of the uncanny valley as far as the visuals themselves go. This would be _expensive_ and quite likely considered a useless waste of computing, but I seriously think we should try to get to the point where we can do this. And it truly requires ray tracing to handle the vast amount of detail the scene would require.

Also, testing scales down to a workstation much better for photorealism than for textured and shaded polygon technology. The main reason is just that most of ray tracing is parallel, save for the sequential tracing of each individual ray. The result is that much more of the behavior is scale-free, which enables you to test materials much quicker and more interactively than with shaded polygons. The other part of that is how a material's behavior doesn't depend on other/nearby objects as much, so you can mock those better and have few surprises as far as their interaction is concerned.

Luxrender specifically was used years ago for such things as rendering images for, IIRC, a perfume bottle catalogue, as it was easier to specify the material properties and adapt for the colors/opacities and label prints than to get regular, symmetrical photos from a photographer in a studio. The requirement was that the realism allowed skipping a disclaimer about the images not being photographs. And that implies confidence in the mathematical models used in the "simulation".

RTX cards might allow porting Luxrender in a way that retains its spectral treatment, which actually physically simulated a dispersive prism generating a rainbow from a ray of white light, without more than specifying the dispersion of the glass. It goes without saying that such effects are inherently very noisy, but there might be ways using automatic differentiation (cf. Julia (language)) combined with advanced numerical integration to make use of calculus to reduce the noise in individual samples.

If anyone knows about attempts to combine Metropolis light transport (MLT) with automatic differentiation, or just about a more concrete idea to incorporate advanced numerical integration with MLT, or a potentially suitable MLT implementation in Julia, please let me know. I'd like to check it out, and I actually consider it a suitable "toy" project for learning Julia. The ease of efficient GPU use from this high level just allows so much flexibility, e.g. complex material node graphs getting handled with no extra work on the integrator.

Your power estimates are off by a factor of ~10 with current technology. Current DGX workstations deliver a much higher speedup for ray-scene intersection than 3x.

MLT is not used for animations because it is temporally unstable in a perceptually unfavorable way: the artifacts are more blotchy than noise, while human vision is more tolerant of noise. Recent developments around temporal MLT should mitigate that, but these require you to render multiple frames simultaneously. This makes them unsuitable for real-time applications.

Also, the thing about MLT/bidirectional/unidirectional path tracing is that there is no universally best method. All of them have weaknesses. There are examples where each method is worse than the others in an equal-time comparison.

The best performance improvements that you can get with any of these Monte Carlo methods are always based on improved strategies for drawing the samples (importance sampling, QMC). Incorporating non-local information for local importance sampling will be the next big thing.

As for color noise: it is an undesirable artefact, but if you importance sample the spectral color response curves for the human eye (or your display device) correctly, then the spectral noise vanishes at least as fast as the other sources of noise in your scene.
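A toy example of why the sampling strategy matters so much for these Monte Carlo methods. This is a one-dimensional sketch under stated assumptions, not renderer code: the integrand f(x) = 3x^2 on [0, 1] (exact integral 1) and the sampling density p(x) = 2x are arbitrary choices for illustration, and the function names are hypothetical. Both estimators are unbiased; the importance-sampled one draws more samples where the integrand is large and so has much lower variance.

```python
import math
import random


def f(x):
    """Toy integrand; its exact integral over [0, 1] is 1."""
    return 3.0 * x * x


def uniform_samples(n, rng):
    """Per-sample estimates f(x) / 1 with x drawn uniformly on (0, 1]."""
    return [f(1.0 - rng.random()) for _ in range(n)]


def importance_samples(n, rng):
    """Per-sample estimates f(x) / p(x) with x drawn from p(x) = 2x.

    Inverting the CDF x^2 gives x = sqrt(u) for u uniform on (0, 1].
    """
    out = []
    for _ in range(n):
        x = math.sqrt(1.0 - rng.random())
        out.append(f(x) / (2.0 * x))
    return out


def mean_and_variance(values):
    """Sample mean and (population) variance of a list of estimates."""
    m = sum(values) / len(values)
    v = sum((s - m) ** 2 for s in values) / len(values)
    return m, v


if __name__ == "__main__":
    rng = random.Random(7)
    print("uniform   :", mean_and_variance(uniform_samples(20000, rng)))
    print("importance:", mean_and_variance(importance_samples(20000, rng)))
```

For this integrand the analytic per-sample variance drops from 0.8 with uniform sampling to 0.125 with the chosen density, so the same noise level needs several times fewer samples.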

Can't help but think of this:

> The only real performance advantages the GeForce3 currently offers exist in three situations: 1) very high-resolutions, 2) with AA enabled or 3) in DX8 specific benchmarks. You should honestly not concern yourself with the latter, simply because you buy a video card to play games, not to run 3DMark.

> The GeForce3 is consistently a couple of frames slower than the older GeForce2 cards

> Between now and the release of the GeForce3's successor, it is doubtful that there will be many games that absolutely require the programmable pixel and vertex shaders of the GeForce3


"between now and the release of the successor" being about a year, so that sounds pretty accurate.

It would help if there was more explanation of what the RTX units are actually good at. Not every GPU innovation actually ends up widely used. It would be amusing if they lost out by being not programmable enough...

Nobody is arguing that ray tracing is not going to be good, or that it's not eventually going to be supported by more games. I think most people are just saying that the push will come from developers when the fps hit isn't so bad, and when more console hardware supports ray tracing.

Damn, with 64-core/128-thread parts becoming widely available, a lot of Windows software will have to be updated to use them because of the "64 bits should be enough for everybody" kind of decision Windows made when implementing affinity. You can't get more than 64 threads in OpenMP when compiling with MinGW. You can with Clang, but the implementation wasn't very efficient when I tested it. I suspect the problem is there in most Windows thread pool implementations.

Last I saw, MS SQL licensing, among others like it, has a per-CPU-core pricing structure...

That’s going to need to be modified I think!

Doubt they will go away from per core pricing. Most SQL servers are VMs so you can assign as many cores as you want. I do expect them to give bulk pricing for more cores. So 4 core=$x, 16cores=$3x, 64cores=$10x.

Oracle says "Haha, no."

PostgreSQL replies "Haha, thanks." :)

Oracle says "Looks like our profits will double each time AMD doubles the core count of their flagship chip."

Larry Ellison needs a bigger yacht, time to re-up your Oracle per core licensing agreement.

Hah, yacht. He needs a bigger island.

Just run 2 Windows instances. Problem solved.

Windows can use more than 64 hardware threads, but a single process kinda can't.

The KAFFINITY in the GROUP_AFFINITY struct is a ULONG_PTR bitmap, so 64 bits on 64-bit archs and 32 bits on 32-bit archs.

Yes but that means every group is 64 CPUs. Notice you can set affinity for any group (16-bit index).

Yes, starting from Windows 7 / Server 2008 R2 it is possible to support more than 64 CPUs within a single process by using new Win32 APIs supporting processor groups. You'd need to manually set the affinities for the threads, but after that you are all set.
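As a rough illustration of the group/mask split discussed above, here is a small sketch that maps a flat logical-processor index onto a (group, mask) pair. It assumes every group holds exactly 64 processors filled in order, which is a simplification: real Windows assigns processors to groups based on topology. The helper names are hypothetical; an actual program would pass the result to the Win32 processor-group APIs such as SetThreadGroupAffinity.

```python
GROUP_SIZE = 64  # width of one affinity mask on 64-bit Windows

def to_group_affinity(cpu_index):
    # Hypothetical helper: split a flat index into (group number, one-bit mask).
    if cpu_index < 0:
        raise ValueError("cpu_index must be non-negative")
    group, bit = divmod(cpu_index, GROUP_SIZE)
    return group, 1 << bit

def group_masks(cpu_indices):
    # Combine several flat indices into one mask per group.
    masks = {}
    for index in cpu_indices:
        group, mask = to_group_affinity(index)
        masks[group] = masks.get(group, 0) | mask
    return masks
```

Under this assumption a 128-thread part needs two groups, each with a full 64-bit mask, which is why a single mask can never describe all of its logical processors.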

Unfortunately, unaware (legacy) software is still limited to 64 processors.

But AFAIK, if I have 128 logical cores, I can't say "schedule this thread on any of the 128 cores because you as the scheduler should know better than me". You have to manually manage the two thread groups each of 64 cores because new threads are round robin assigned to the two groups on thread creation and aren't migrated as circumstances change.

Huh, I don't really know much about this to be honest, but are you sure that's not a consequence of preferring not to migrate threads across NUMA nodes rather than across thread groups? Are you aware if there's any documentation on the behavior you're mentioning?

Oof, I hadn't read the whole page. Interesting, thanks!

It can, but you have to take care to change process affinity groups every time you want to work with threads in another group, as a group is limited to 64 virtual cores. It complicates thread pool implementation significantly.

But bringing it back to the parent comment thread it's easier to manage affinity of your processes than it is to synchronize them across multiple OS instances.

No such inherent limitation on windows


No, it most definitely is not.

Anyone know why we aren't seeing Intel/AMD go the 'ultra-wide' route that we've seen in ARM processors? (Apple's in particular)

e.g. we've seen Apple's A12 processor expand ALUs from 4->6 along with what seems like a strong focus on cache latency, and these changes seem to be rather beneficial in real code. Why aren't we seeing the same from Intel / AMD? As someone who isn't particularly well informed on the topic, my guess is that AMD/Intel are scared of selling a wider-core but lower clocked CPU given how much marketing is attached to clock speed, but I imagine there are architectural issues as well.

Well, AMD has gone wider - to 5 or 6 wide depending on how you count it in Ryzen. Intel hasn't really been able to make any major uarch changes in years, since they've been stuck on Skylake (and all the various Skylake++ variants that followed which were Skylake in name only).

In Skylake, most of the effort seemed focused on bringing AVX-512 to bear, and outside of that the basic design is largely the same as Haswell.

Here's hoping that Apple's chips portend an increase in width in mainstream x86 chips as well.

Skylake also got rid of the ring core architecture, and went to the grid. That was a pretty drastic change, and very different from their previous generations.

Yes, true - that showed up in the SKX uncore. I wonder if it is being used in CNL client parts as well?

Instruction set and the number of registers visible to the programmer influence the practical limits to issue width. AMD64 (x86_64) only has 16 general purpose registers, so there are limits to how many instructions could possibly execute at one time. If I recall correctly the ARM ISA has 32 registers, so there is potential for a lot more data sitting there ready to do something on any given cycle. There are limits imposed by software as well - lots of real world program code simply doesn't have opportunities to do many things in parallel.

Having said all that, any extra execution units can be used more effectively with multi-threading. It sounds neat to have twice as many threads as cores, but on my workloads that's only about a 20 percent performance increase. Going wider would probably help the second thread quite a bit, but what would be sacrificed is deep in the details of a given design. I suspect they increase width so long as it doesn't impact single thread performance.

I've heard from a CPU designer that the CISC nature of x86 lets it punch above its weight in terms of what you're talking about. There's a lot of instructions that don't reference any architectural registers, but get allocated physical registers (and would have architectural registers allocated when compiled to something RISC). He claimed it was about equivalent to a 32 register RISC for that reason.

x86 has plenty of instructions that use data from memory as one of the operands. I'm sure that offsets the limited number of registers somewhat.

In the end it's all deep in the details. What I'd really like to see is a RISC-V implementation done by a full team at Intel, AMD, or IBM. Even an ARM implementation by those teams would make a great comparison, but that seems even less likely ;-)

> x86 has plenty of instructions that use data from memory as one of the operands.

And the fact that you can have full-register-width immediates in a single instruction means that you don't have to allocate an architectural register for intermediate immediate construction.

Only somewhat true. There is register renaming which increases the size of the virtual register set. rax assigned to in line 1 can be a different virtual register from rax assigned to in line 5. AMD and Intel should both be very good at exploiting it to the limits, they had to in their superscalar 32 bit x86 designs :)
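A toy sketch of the renaming idea, assuming a simplified model where every instruction has one destination and a list of sources, and where the pool of physical registers is unlimited (real hardware has a finite physical register file and recycles entries). The function name and the instruction format are made up for illustration.

```python
def rename(instructions, arch_regs):
    # instructions: list of (dest, [sources]) using architectural register names.
    # Returns the same program rewritten with physical register names.
    mapping = {}
    next_phys = 0
    for reg in arch_regs:
        # Give every architectural register an initial physical register.
        mapping[reg] = f"p{next_phys}"
        next_phys += 1
    renamed = []
    for dest, sources in instructions:
        # Sources read whichever physical register currently holds the value.
        phys_sources = [mapping[s] for s in sources]
        # Every write gets a fresh physical register, which removes
        # write-after-write and write-after-read hazards on the same name.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], phys_sources))
    return renamed

program = [
    ("rax", ["rbx"]),  # first write to rax
    ("rcx", ["rax"]),  # depends on the first rax
    ("rax", ["rdx"]),  # second, unrelated write to rax
    ("rsi", ["rax"]),  # depends on the second rax
]
renamed_program = rename(program, ["rax", "rbx", "rcx", "rdx", "rsi"])
```

In the renamed program the two writes to rax land in different physical registers, so the third instruction no longer has to wait for the first two even though all of them name rax.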

Intel cores have had eight EU ports since Haswell. Zen has a split pipeline, so it's not comparable. Zen has ten units, some µops use multiple EUs (e.g. AGU and ALU).

Is Apple ahead of x86/64 on width?

The A10X article on anandtech seemed to imply that every other ARM architecture was far behind on width while Apple was mostly on par (or slightly behind) x86/64.

Apple doesn't release the details but Anandtech estimates A12 has 13 execution ports (https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-re...). I'm guessing the numbers for 6 ALU ports are more accurate than the overall since it's hard to blackbox determine the port sharing structure. Recent Intel / Zen have 4 ALU ports AFAIK.

A10X or A12X? But there are a number of benchmarks showing the A12X outperforming most of Apple's laptop lineup and many competitors. The A12X doesn't have the liberty of a desktop form factor, so it is definitely a mobile part, but it seems reasonable that Apple could make another chip that scales the performance up even further if they wanted to, without really altering the underlying architecture.

64 cores with 128 threads. Hope that comes to Threadripper. I love AMD for bringing back competition to the CPU market. We've bought a ton of AMD machines in the last year, great bargains in our space.

I really think this is necessary for exascale computing. The core floodgate would have never opened without Intel getting several kicks in the pants.

You can do the math but my estimates are that the energy costs would still be too high for exascale with pure-CPU compute. Accelerators are the way anyone actually planning to build an exascale machine is getting there.

I think it’s doable with tightly integrated compute nodes, ie specially engineered.

Benchmarks on the 32-core TR are... disappointing, to say the least.[1] If you’re purely compute bound, it can be a win over the 16-core version, but if memory access is a factor, it’s a wash due to the extra hops to memory. And to my mind, there are very few pure-compute applications that wouldn’t benefit more from AVX2 and the like... in which case a cheaper Intel CPU would still wipe the floor with the 32-core chip.

I’m a huge fan of the Threadripper concept, just wish they hadn’t neutered the 32-core chip compared to its Epyc counterpart.

[1]: https://www.anandtech.com/show/13124/the-amd-threadripper-29...

Those (Anandtech) benchmarks were performed on Windows. All threadripper benchmarks on Linux show that it is nowhere near as awful a performer as on Windows and most compute workloads do scale okay. Seen multiple ideas thrown around like Windows not being NUMA aware with this processor or just plain bad core scheduling

For reference, here is a link to the Phoronix Windows vs Linux on the 2990WX article:


They did a follow up changing the scheduling policy for thread 0 (again, still on Windows) and it didn’t make a difference for almost all their workloads: https://www.anandtech.com/show/13446/the-quiz-on-cpu-0-playi...

AnandTech really needs to hire a Linux-focused editor to do some benchmarks there too, especially for these large systems that are unlikely to be running Windows anyways.

The Phoronix benchmarks are quite clear,[0] I don't know why you keep linking to AnandTech's Windows benchmarks. I say this as someone who reads tons of AnandTech reviews because they're great, but Windows just doesn't do well with high core count hardware at all.

[0]: https://www.phoronix.com/scan.php?page=article&item=2990wx-l...

This seems like a familiar issue I've run into with workstations I've used in the past running Xeons. Not sure how NTOSKRNL handles scheduling of parallel tasks. I'd venture a guess and say it's hybrid (M:N threads), where multiple userland application threads are mapped to some "virtual processor" in kernelmode. That leads to priority inversion between the userland and kernelmode threads, which could explain why Windows benchmarks are terrible when dealing with multiple physical cores.

As far as I know Win NT threads are 1:1.

Not even sure how it would work or even make any sense to have N:M handled by the kernel. N:M is usually mainly a userspace thing. And Windows is even less likely to use that kind of convolution, because IIRC it can call back from kernel to userspace (that design I would not recommend, btw, but oh well). You have fibers, of course, but that's a different thing.

Windows probably does not scale simply because the kernel is full of "big" locks (at least not small enough...) everywhere, and they have far less fancy structures and algorithms than Linux (is there any equivalent of RCU that is widely used in there? - not sure). Cf. the classic posts of the Chrome developer who every now and then encounters a ridiculous slowdown of his builds on moderately big computers, sometimes because of badly placed mutexes.

> Not even sure how it would work or even make any sense to have N:M handled by the kernel. N:M is usually mainly a userspace thing.

Correct, I meant that the benchmarking program itself probably used that implementation. Not the Win NT kernel’s implementation of OS threads.

See the phoronix benchmarks, linux performs much better on same workloads.

Threadripper’s 4 die variant had to be “neutered” to remain compatible with the TR4 socket without splitting memory bandwidth. The new I/O chiplet would solve that issue entirely, assuming it comes to Threadripper 3.

I was just thinking this. Yes it'll have half the lanes of EPYC still, okay, but now the lanes will be connected to all the dies equally rather than to just 1/2 the dies. This hopefully means Threadripper 3 can perform almost as well as EPYC with just 1/2 the memory bandwidth - that isn't insignificant but it's a big improvement on Threadripper 2.

An interesting point is that TR owners are much more likely to run their RAM at higher, out of JEDEC spec speeds, so the difference can be a lot smaller than 1/2.

The neutering isn't that bad, you can still get an improvement by adding more cores. Here's some of my results on a task doing a lot of string comparisons (plus some other stuff): https://image.ibb.co/ecH9VL/coretest.png

Predicted = Theoretical improvement (2 cores are 2x faster than 1, etc)

10k/100k = number of entries being searched (there's actually a variable number of strings per entry)

It is not just about 32-core compute performance, if you compare a $385 threadripper motherboard + $650 CPU to the equivalent Intel competition, for single socket, the Threadripper is far ahead in terms of high bandwidth I/O. The threadripper has 64 PCI-Express 3.0 lanes direct to the CPU, for use with things like 25/40/100GbE network interfaces, or very large numbers of 10GbE interfaces. Or low-latency cluster interconnect fabric cards.

They are direct to 1/2 of the CPUs, actually.

This is part of what the GP was complaining about. The 2 dies which do not have memory also do not have PCIe wired out of the socket, so they are one hop away from I/O as well. If you are trying to max out I/O, you'd probably be better off with a low-end Epyc that had fewer cores enabled per die.

It is interesting to note that the memoryless dies do have PCIe root ports, for built-in things like the PSP crypto engines. That surprised me somewhat when I first noticed it, but it makes sense in retrospect, since the issue is that the TR socket was designed for 2 dies.

The interconnect latencies give you something like an extra 50 nanoseconds[1]. That matters when accessing memory. Not so much for PCIe, where your base latency is most of a microsecond[2][3]. There seems to be plenty of bandwidth to handle it, too.

[1] https://www.servethehome.com/amd-epyc-infinity-fabric-latenc...

[2] https://forum.stanford.edu/events/posterslides/LowLatencyNet...

[3] https://gianniantichi.github.io/files/papers/pciebench.pdf

My comment was addressing the parent comment about bandwidth, not latency.

Because you only have 1/2 the memory and I/O bandwidth on threadripper, if maxing out I/O is your concern, you would likely be better off with a low-end epyc that still had all the memory controllers and pcie lanes wired up.

> My comment was addressing the parent comment about bandwidth, not latency.

Oh, it is? It really seemed to be about direct vs. non-direct, and nothing else. And that pretty much only affects latency, because it only loads the infinity fabric by about 20-25% to route 16 PCIe lanes to each die.

I suspect the main problem I tackle at work, robotic grasp planning, could make use of all those threads despite the bandwidth constraint. Now if only someone would put an Epyc in an industrial PC...

The most interesting thing about this is the speculation about the I/O die being flexible enough to take other workloads - it's basically the "chipset" of old.

I wonder if they'll license it - with Apple's A12 already on the 7nm TSMC process, building a Xeon crushing ARM monster for the new Mac Pro by swapping the Ryzen dies for ARM dies seems like a great bit of leverage, assuming Cook and Su could arrange it.

A question from a desktop/casual system builder/potential new MacBook Pro buyer: will these surpass Intel's Core i5/i7 offerings in single thread + power efficiency, given Intel is stuck on 14 nm for a while?

Thinking of upgrading my 2014 era stuff due to massive improvements in SSD, memory etc. but not sure it's worth dropping so much money on Coffee Lake or just waiting for something better x86 (or ARM...)-wise.

Does 1.25x performance at the same power refer to clock speed? Does that mean that we can expect 5 GHz in Ryzen 3000?

I don't know a lot about CPUs, but my guess is No. That just sounds too good to be true. I'd love to be wrong though.

Look up Dennard Scaling and how it broke.

Dennard Scaling would have been 2x clock-rate from an improved node.

1.25x scaling from an improved node is way, way, way worse than Dennard Scaling of the past.

Intel Pentium III Coppermine (1999) went from 733 MHz on the 180nm node to Pentium III Tualatin (2001) 1400 MHz on the 130nm node. THAT was Dennard scaling.

Today, we "only" get double-digit gains from an improved process node. Dennard Scaling was triple-digit gains. Furthermore, most CPU makers focus on the power-saving aspects (which seem to be scaling somewhat well still).

If you look at GP's question, (s)he was asking about 5 GHz in Zen 2. Since Epyc 1 is ~3 GHz, a 1.6x increase in clock speed for a ~1.4x smaller process size (AFAIK "7nm" is overselling it compared to 14nm) to me smells like Dennard scaling and thus cannot be expected anymore (if you disregard tricks like turbo boost where a bunch of hardware gets disabled such that the rest can be boosted).

They're asking about a 1.25x increase on a 1.4x smaller process.

Even if you completely ignore boost clocks, the 1900X has a base clock of 3.8GHz, so interpret it as "4.75GHz base clock on a non-Epyc part" if you must.
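For reference, the ratios being argued about in this subthread can be checked with a few lines of arithmetic. This is only a sketch using the clock figures quoted in the comments above; the variable names are made up.

```python
def ratio(new, old):
    # Relative speedup of `new` over `old` (same units for both).
    return new / old

# Pentium III Coppermine (733 MHz) to Tualatin (1400 MHz), as quoted above.
coppermine_to_tualatin = ratio(1400, 733)   # about 1.91x

# A hypothetical 5 GHz part versus the ~3 GHz Epyc figure quoted above.
asked_vs_epyc = ratio(5.0, 3.0)             # about 1.67x

# The 1.25x claim applied to the 1900X base clock of 3.8 GHz.
base_1900x_scaled = 3.8 * 1.25              # 4.75 GHz
```

So the Pentium III example is close to a doubling, the 5 GHz versus 3 GHz comparison would need roughly a 1.67x jump, and the 1.25x figure applied to a 3.8 GHz base clock gives the 4.75 GHz mentioned above.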

But I don't think you should ignore boost clocks. They're not a trick to make the silicon seem more capable. The silicon really is that capable and boost clocks are a trick to cap power draw. It's entirely fair to look at the 4.2 boost clock on the 2990WX and conclude that the silicon is capable of 4GHz under non-exotic conditions.

>On the security side, Zen 2 introduces in-silicon enhanced Spectre mitigations that were originally offered in firmware and software in Zen.

does intel have something comparable on their roadmap?

Are there any more concrete proofs for the HBM2 on package possibility?

Do you mean the comment under the article that mentions HBM? If so I think they refer to these papers by AMD:



Is it going to support ECC memory?

Almost certainly. Zen 1 supported ECC on everything down to low-end consumer chips. It's unlikely but within the realm of possibility that ECC support will move up-market slightly with Zen 2, but it is almost unthinkable that Zen 2 won't support it at an architectural level.

All AMD processors have supported ECC for a very long time. It's trivial to support; Intel have just decided to gate it as a premium feature.

Sort of - they let mainboard vendors decide whether to support it or not, which means it can be a crapshoot. For example, MSI's been known in the past to kill ECC support with a BIOS update; some vendors have tested that you can use ECC RAM but won't enable any of the error correction (for example, Gigabyte say this in [1]: "non-ECC mode".)

Selfishly I really wish they would make it easier, because I'm in the market for a new personal-use storage machine and I've spent far too long researching all this crap but it's looking like I'll have much more certainty that it'll all just work if I buy a Xeon E3/E-2000 series and that's unfortunate.

1: http://download.gigabyte.eu/FileList/Manual/mb_manual_ga-h11...

Asrock have been good in my experience at enabling all features on their boards. Back when Intel's Vt-d support depended on your board they reliably had support, and I believe they support ECC on all their new AMD boards.

I can confirm that ECC at least works for their Threadripper mainboards.

I have ECC working on an AB350M Pro4. See also https://www.hardwarecanucks.com/forum/hardware-canucks-revie... The 'edac_mce_amd' module needs to be loaded on Linux.
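One way to check from Linux whether the EDAC driver has actually registered a memory controller is to read the counters it exposes in sysfs. The sketch below assumes the usual layout under /sys/devices/system/edac/mc (one mcN directory per controller with ce_count and ue_count files); the function name is made up, and an empty result only hints that ECC reporting is inactive, it is not proof either way.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    # Return {controller: {"ce_count": int, "ue_count": int}} for each
    # memory controller found, or an empty dict if none are registered.
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for controller in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                # ce = corrected errors, ue = uncorrected errors
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts
```

Calling read_edac_counts() with no arguments reads the live system; the root parameter exists only so the function can be pointed at a test directory.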

I have one too, seems to be trucking along fine.

If it helps, I have a Gigabyte Designare X399 EX and it has full ECC support (have 64GB installed at the moment, edac-util checks out).

It is TR4 though. I grabbed a 1920X after the price drops.

If you’re wanting a storage box usually it’s just less headache to buy used server gear off eBay. Plenty of bays, DDR3 RDIMM’s are cheap, and power efficient ivy/sandy bridge systems are finally in affordable price ranges.

I’m the kind of person that lurks in /r/homelab though - so I’ve also got a 25U rack to keep all my gear. If you want a tower to stuff in a corner things get more dicey.

2011-1 systems really aren't that power efficient. 16 DIMMs, 2 sockets will draw ~120 W idle.

Don’t populate all 16 DIMM’s unless you need that much memory? Both my single-socket R320 and dual-socket R520 idle at 70W each with 6 DIMM’s installed.

Most people probably don’t need the gobs of memory I have either. A R520 with one socket populated and 2x8 or 16GB dual-ranked RDIMM’s would be more than sufficient.

I actually misspoke. The system above was configured with only 8 DIMMs (8 GB each) at the time. That's the lower limit for this platform before performance is degraded.

Single socket Ivy/Sandy Bridge-EN servers exist and only need three DIMM’s for max performance, that’s what I run FreeNAS on (R320 with a E5-2450L, mind you I have six DIMM’s).

I recently built a home server with Ryzen 5 2500. Pretty happy with it. Not using ecc ram though.

I recall hearing that certain motherboard manufacturers were disabling ECC for their lower-end AMD boards.

Does anyone know if this was ever confirmed or not?

My newly purchased Gigabyte B450M DS3H, says clearly 'Dual Channel Non-ECC Unbuffered DDR4, 4 DIMMs'. https://www.gigabyte.com/us/Motherboard/B450M-DS3H-rev-10#kf

It also mentions that:

>Support for ECC Un-buffered DIMM 1Rx8/2Rx8 memory modules (operate in non-ECC mode)

Note the operate in non-ECC mode remark.

Seems pretty clear to me. The CPU might allow ECC but now it's the motherboard playing tricks.

I also have a Gigabyte motherboard. Pretty disappointed that it doesn't have ECC support. For my next machine I will look more closely.

playing tricks as in not routing the extra lines

I don't know the hardware details, but as a buyer this is like moving the goalpost.

It was always the CPU that didn't have ECC, but now apparently your whole system has to be designed to be a fancy workstation.

Everybody keeps repeating AMD supports ECC out of the box, but the fact that you will most likely not have an AM4 motherboard which does have ECC enabled is new to me.

the irony ..

Looks good. Side note: would love to see some ML benchmarks in future CPU comparison articles.

Given that AMD has introduced their own version of Intel's IME, there's very little incentive for me to consider their CPUs. At least there are workarounds for some of Intel's CPUs.

You can disable the PSP in newer BIOSes. I can do it on my X370 Taichi. The link below is old but accurate. Go AMD and don't look back.


That doesn't really change much in all honesty; it just disables support for things like the fTPM, secure sleep states, and some communication mailbox primitives (that allow things like offloading encryption to the PSP coprocessor, through the Linux crypto API subsystem -- this is all supported in upstream Linux.)

The PSP is still essential to the boot process and probably many other things (power management, etc), and you aren't going to just magically turn it off with a UEFI option. If your concern is that the PSP is a blackbox covert channel, it probably changes almost nothing.

Both AMD and Intel are functionally equivalent here, as far as I'm concerned.

(My ASRock X399 board also has this UEFI option and specifically calls out that it only disables a few key features.)

Sources are conflicting, but the general consensus I've seen is that the Intel ME has direct access to your networking hardware, while the AMD PSP does not, and basically just exposes interfaces to the CPU.

If accurate, this is a significant functional difference as far as I'm concerned.

But, what's the old saying again? "The only truly secure system is one that is powered off, cast in a block of concrete and sealed in a lead-lined room with armed guards - and even then I have my doubts."

And the PSP also takes part in the memory-training process during boot, AFAIK. So you can't disable it completely.

that doesn't actually disable the PSP. read the Reddit thread

I'm in the same boat as you.. me_cleaner.py

Yeah, I've got a Libreboot'ed X200 and an X230 that I've Coreboot'ed+me_clean'ed. My desktops aren't in such good shape w.r.t. BIOS firmware though.

Talos using POWER is about the only option for us.

I know, and I have my eye on it. But it's a bit out of my price range at the moment.

> meaning 256-bit AVX operations no longer need to be cracked into two 128-bit micro-ops per instruction

The instruction set stayed the same. And in the current instruction set, a lot of these AVX instructions still operate on 128-bit lanes. Instructions like vpshufd, vshufps, vpblendw only shuffle/blend/permute within 128-bit lanes, and so do their AVX512 equivalents.
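To show what "within 128-bit lanes" means, here is a small software model of the 256-bit form of vpshufd. It is only a sketch of the element selection, with a made-up function name; elements are listed lowest first.

```python
def vpshufd_256(src, imm8):
    # src: eight 32-bit elements (lowest element first).
    # Each 128-bit lane holds four elements and is shuffled independently
    # using the same 2-bit selectors taken from imm8.
    if len(src) != 8:
        raise ValueError("expected eight 32-bit elements")
    out = []
    for lane in range(2):
        base = lane * 4
        for i in range(4):
            selector = (imm8 >> (2 * i)) & 0b11
            # The selector can only pick from the same lane as the destination.
            out.append(src[base + selector])
    return out

reversed_in_lanes = vpshufd_256(list(range(8)), 0x1B)
```

With the immediate 0x1B each lane is reversed on its own, so the result is [3, 2, 1, 0, 7, 6, 5, 4] rather than a full eight-element reversal: no element ever crosses from one 128-bit half to the other.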

Yes, but that was not what the article was referring to. Zen1 CPUs only have 128-bit FPU lanes, and execute wider instructions by splitting them into two in the frontend.
