If these chips really do have a ~25% performance gain over Ryzen (not counting IPC gains), I might just upgrade, and I think a lot of other people are in the same boat.
This could be a fantastic win for AMD. I was worried that Zen 2 was going to be a moderate improvement, but this seems to be pretty significant. It's been a while since AMD has been first to market with a big winner...
If speculative execution is a problem and you need to give each VM its own exclusive cores, it's great to be able to source parts with 64+ cores on them!
i.e. => id est => that is
e.g. => exempli gratia => for example
Providers will begin uptake of AMD again; it just takes time for contracts to roll over and the new hardware to reach production.
IC: AMD has already committed that Milan, the next generation after Rome, will have the same socket as Rome. Can you make the same commitment with Zen 4 that was shown on the roadmap slides?
MP: We’re certainly committed to that socket continuity through Milan, and we haven’t commented beyond that. Obviously at some point the industry transitions to PCIe 5.0 and DDR5 which will necessitate a socket change.
IC: So one might assume that an intercept might occur with Zen 4?
MP: No comment (!)
1) PCI-E 5 isn't even standardized yet
2) Mass production of DDR5 memory chips might begin ramping up in 2019, but it's unlikely that DDR5 memory sticks will be readily available on the market until 2020.
3) It takes a really long time to engineer a CPU architecture and bring it to market (generally a year or more even for just a revision of an existing architecture).
4) Adding support for a new memory or PCI-E version requires a new socket and a new chipset, which creates even more work that by definition can't be "done" until things are fully standardized.
AMD are going to support DDR5 and PCI-E 5 eventually... but they want to release a new architecture revision on their current socket and chipset now, before supporting those is reasonably possible.
Edit: All this is to say nothing of AMD's commitment to potential partners to keep the underlying socket and chipset the same for a reasonable period of time, which is critical to planning supply chains and upgrade paths for major partnerships.
Memory, sure, you don't want compatibility issues. Why PCIe, when the important lanes are all driven directly off the CPU?
Zen 2 will be the first CPUs to support PCIe 4.0.
On June 5th, 2018, the PCI SIG released version 0.7 of the PCIe 5.0 specification to its members.
PLDA announced the availability of their XpressRICH5 PCIe 5.0 Controller IP based on draft 0.7 of the PCIe 5.0 specification on the same day.
Historically, the earliest adopters of a new PCIe specification generally begin designing with the Draft 0.5 as they can confidently build up their application logic around the new bandwidth definition and often even start developing for any new protocol features. At the Draft 0.5 stage, however, there is still a strong likelihood of changes in the actual PCIe protocol layer implementation, so designers responsible for developing these blocks internally may be more hesitant to begin work than those using interface IP from external sources.
From here: https://en.wikipedia.org/wiki/PCI_Express
AMD had plenty of time to include PCIe 5.0 even at the Draft 0.7 stage, which is pretty much the same as the final draft/release, but decided not to.
No possible way that could backfire or be impossible
... Just how long do you think it takes to make that kind of design change? This requires a major change in silicon. It would have had to be done more than a year ago.
Variant 1: Both AMD and Intel will continue to rely on OS patches.
Variant 2: Both AMD and Intel will fix in silicon. Intel's fix comes in Cascade Lake.
Variant 3 (aka Meltdown): AMD was not affected, Intel will fix it in silicon in Coffee Lake Refresh and Cascade Lake.
Variant 4: Both AMD and Intel will rely on OS and microcode patches.
L1TF: AMD was not affected, Intel will fix in silicon in Coffee Lake Refresh and Cascade Lake.
I'll note that for the AMD products, this is my best guess from Lisa Su's comments. I do wish they would put out a table like Intel did.
EDIT: Replaced the Intel source with a better link
While GPU offloading is extremely powerful in terms of compute, you have to deal with both bandwidth limitations (de facto ~12 GB/s for x16 PCIe 3.0) and the latency of launching compute kernels and waiting for them to complete.
For real-time applications this is usually not an option: latency above some threshold will kill your proposed solution.
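To make that concrete, here's a rough back-of-the-envelope sketch (in Python); the bandwidth, launch-latency and frame-budget numbers are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope check: can a GPU offload fit a real-time frame budget
# once PCIe transfer time and kernel-launch overhead are accounted for?
# All constants are illustrative assumptions, not measurements.

PCIE3_X16_BPS = 12e9        # effective host<->device bandwidth, ~12 GB/s
LAUNCH_OVERHEAD_S = 20e-6   # assumed kernel launch + completion overhead
FRAME_BUDGET_S = 1 / 240    # e.g. a 240 Hz control/render loop

def offload_round_trip(bytes_down, bytes_up, kernel_time_s):
    """Wall time for one copy-in / compute / copy-out round trip."""
    transfer = (bytes_down + bytes_up) / PCIE3_X16_BPS
    return transfer + LAUNCH_OVERHEAD_S + kernel_time_s

# Example: ship 32 MB down, 8 MB back, around a 1 ms kernel.
t = offload_round_trip(32e6, 8e6, 1e-3)
print(f"round trip {t*1e3:.2f} ms vs budget {FRAME_BUDGET_S*1e3:.2f} ms: "
      f"{'fits' if t < FRAME_BUDGET_S else 'misses the deadline'}")
```

With those assumed numbers the transfer alone eats most of the frame, which is the point: past some data size, the copy dominates no matter how fast the GPU is.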
This phrase may be worse than useless: it creates the impression between people that they agree, but could hide a discrepancy of anywhere from 0.0001% to 10%.
I know this isn't central to what you're saying; I just want a better conversational norm around probabilities.
I always wanted to make an Amiga with an AMD chip and move AmigaOS to x86-64 and away from PowerPC, so it could sell cheaper and faster systems. Compete with Apple at least by merging AmigaOS with GNU/Linux, work with Valve on SteamOS to run AmigaOS apps, and get some classic Amiga games onto SteamOS and GNU/Linux.
New process nodes get more and more expensive - especially early in a node's life (bad yields etc.). So you want to get as much out of it as possible, which AMD does with this strategy.
For example, Zen 1 8-core chips are 213 mm2 on 14nm. On 7nm these chips would still be > 106 mm2 (because the analog elements don't scale). Zen 2 chiplets should be at most 75 mm2 (while also increasing cache, widening AVX, etc.), which is mobile-SoC territory and means they are cheaper to produce (better yields etc.). Since Intel uses monolithic dies, AMD already had a cost advantage; with Zen 2 it's not even funny.
Especially if the guesses of many people are true and AMD designed the IO chip to be cut down, which means they can reuse it for their desktop CPUs (Ryzen on AM4). Another nice advantage of this would be that they can bin the chips even better for EPYC, Threadripper and Ryzen.
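A minimal sketch of the yield side of that argument, using a simple Poisson defect model; the defect density below is an assumed illustrative value, not foundry data:

```python
import math

# yield ~ exp(-defect_density * die_area), a simple Poisson defect model
DEFECTS_PER_CM2 = 0.3  # assumed defect density for a young 7nm-class node

def poisson_yield(die_area_mm2, d0=DEFECTS_PER_CM2):
    return math.exp(-d0 * die_area_mm2 / 100.0)  # mm^2 -> cm^2

for name, area_mm2 in [("Zen 1 die (14nm)", 213),
                       ("hypothetical monolithic 7nm shrink", 106),
                       ("Zen 2 chiplet (7nm)", 75)]:
    print(f"{name:38s} {area_mm2:4d} mm^2 -> ~{poisson_yield(area_mm2):.0%} yield")
```

Even with a made-up defect density, the shape of the result is the point: cutting die area roughly in half again pushes yields up substantially, on top of getting more dies per wafer.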
AdoredTV does a good job of aggregating all the info/ideas that are floating around and presenting them:
Must read "High Speed Digital Design: A Handbook of Black Magic": https://books.google.pl/books/about/High_speed_Digital_Desig...
The difference between the logic part of a CPU and the IO part is in the amount of work spent on RTL vs. direct custom layout of geometric structures on the chip.
> Intel’s analog designers are responsible for creating circuits such as PLLs (for clocking) and PCIe (for I/O) that are critical for modern SoCs on a process that isn’t optimized for analog
Assuming this chip doesn't trip all over itself moving things around, it will be an astonishing amount of compute power in a reasonably sized package. This is, for me, the only reason to work at an internet giant: they will build a tricked-out motherboard with two sockets, up to 8TB of RAM and, say, a petabyte of attached non-volatile storage, and field solvers, CFD apps, and EM analysis would just melt away. Whether it is designing a rocket engine, folding a protein, or annealing a semiconductor at the quantum level: so often these programs use approximations to allow them to run in finite time, and now more and more of those approximations are being replaced with exact numerical solutions, making the models more and more accurate.
Five years from now when people are throwing out these machines to replace them with Zen3 or Zen4 machines, I'm going to be super glad to get one and play with it.
Isn't that an oxymoron? Numerical solutions always involve some sort of rounding errors because of limited floating-point precision, so they cannot be exact.
You are correct in that most of these systems don't have a closed-form solution that would yield an exact result; they rely on either an interpolated value, which is inexact depending on the deviation, or numerical solutions, which typically iterate to the nearest point that the system can represent in its limited 80- or 128-bit floating-point format.
It is far more apt to bring up that most "interesting" solutions do not have an exact closed-form expression.
Since nobody pushes to add a fancy new thing to DirectX unless they plan to ship hardware specifically to do that new thing, AMD would have to have been blind & deaf to not know Nvidia was going to ship RTX this year.
Nvidia is looting customers that want to use GPUs for machine learning in the cloud (like AWS, GCP).
It's AMD's job to make machine learning work on their GPUs. If they don't believe in it and spend the necessary time and effort, nobody else will. The Radeon Open Compute Platform (ROCm) has existed for years, but apparently it's not good enough.
PS: TensorFlow has ROCm backend support (https://hub.docker.com/r/rocm/tensorflow/), but is the MI25 competitive? https://www.amd.com/en/products/professional-graphics/instin...
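A quick sanity check I'd run inside that container (assuming a TF 2.x ROCm image; the ROCm build exposes AMD GPUs through the normal TensorFlow device API, so nothing CUDA-specific is needed just to see the device):

```python
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))  # the Instinct card should show up here

# Tiny op to confirm work actually lands on the GPU:
with tf.device("/GPU:0"):
    x = tf.random.normal((1024, 1024))
    y = tf.linalg.matmul(x, x)
print(y.shape)
```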
Why is it a separate repo rather than being contributed upstream to TF?
Both are being actively worked on, and I expect to see this gap shrink in the next few months to a year.
CUDA had C, C++ and Fortran support since the early days, followed by the PTX bytecode format for any compiler vendor that wanted to support CUDA on their languages.
It was necessary for them to lose the race before focusing on C++ and coming up with SPIR and SYCL.
And tools still seem not to be on par with what NVidia offers.
Also, regarding tooling, that isn't quite true: they have engaged with LunarG for Vulkan, because they learned from the lack of OpenGL adoption that the large majority of game developers wouldn't even bother without a proper SDK.
Nonsense. You'll be hard-pressed to find a platform that doesn't support C, or a competent system design or graphics design engineer who isn't comfortable with C.
If their goal is to establish standards, they build upon standards.
Likewise for game consoles, Windows and now Apple graphics APIs.
Defining standards does not mean they must provide a lowest-common-denominator single implementation.
They can be defined in an IDL kind of way (e.g. WebIDL), abstractly (e.g. Internet RFCs), or by defining mappings to all major languages in the GPGPU field, namely C, C++ and Fortran.
That's already done. Mozilla's Obsidian is an object-oriented WebIDL interface to Vulkan: https://github.com/KhronosGroup/WebGLNext-Proposals/tree/mas...
Again, Khronos is just a standards group. It does what its members tell it to. Khronos is far more lightweight of an organization than, say, ISO's C++ committee. Instead of blaming Khronos, you could perhaps encourage them to focus on initiatives like Obsidian.
Prior to the Ryzen launch, AMD lived through some really lean years. They saved wherever they could, and their GPU department saw very little investment in development for a long period. Since then, they have recapitalized the department, and are probably designing a proper next-gen GPU architecture to replace GCN. However, it takes ~4-5 years to get a completely new arch to market, and they only got to start a few years ago. Vega and Navi are both still shoestring-budget designs.
The technology is a lot more interesting in professional 3D modelling where light baking and rendering can be done much faster.
Mind you, you are seeing the early generation of hardware-based real-time raytracing now, and it can already beat the quality of rasterization.
There is still a ton of room for improvement. One or two hardware generations from now we may finally see realtime-generated images for which the term photorealistic is not just silly marketing hype. They could truly fool you.
One of the problems is that raytracing is inherently noisy, so you actually have to shoot many rays (or trace many paths) for each pixel on the screen, and you really need to get into the high double digits for acceptable results purely from raytracing.
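A toy sketch of that behaviour: Monte Carlo noise falls off only as 1/sqrt(N), which is why the sample counts climb so quickly. The "radiance" here is just a stand-in random variable, not a real light-transport integrand:

```python
import random, statistics

def path_sample():
    # stand-in for one traced path's radiance estimate (assumed toy model, true mean 1.0)
    return random.expovariate(1.0)

def render_pixel(spp):
    return statistics.mean(path_sample() for _ in range(spp))

random.seed(0)
for spp in (1, 4, 16, 64, 256):
    estimates = [render_pixel(spp) for _ in range(2000)]
    noise = statistics.pstdev(estimates)   # pixel-to-pixel variation
    print(f"{spp:4d} spp -> noise ~ {noise:.3f}")
```

Quadrupling the samples only halves the noise, which is exactly why pure raytracing needs so many rays per pixel before it looks clean.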
Part of NVidia's sales pitch is that you can reduce the number of samples/rays using advanced post-processing filters based on CNNs. But then you're back to building a Jenga tower of hacks, just this time on top of raytracing instead of on top of rasterization. You may prefer the raytracing Jenga tower, perhaps, but truth in advertising should make it clear that raytracing alone just isn't there yet.
It's true though that this current generation of hardware is promising, and things are bound to get better quickly. For now, the most you're going to get is hybrid techniques where rasterization is used for the bulk of the rendering, and raytracing is used selectively e.g. for parts of a scene or for shadows.
What has me worried is not the performance of the ray-scene intersection, but the dynamic BVH update that is required for each frame. Building acceleration structures for ray intersection tests is a hard tradeoff between fast construction times and fast intersection tests. With animated/deforming objects, this is quite an interesting challenge. But we might eventually see some dedicated hardware for that, too. Academic implementations exist.
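A minimal sketch of the "refit" half of that trade-off, where the tree topology is kept and only the bounding boxes are recomputed bottom-up after the geometry moves (names and structure are illustrative, not any particular engine's API; building the tree in the first place is omitted):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
AABB = Tuple[Vec3, Vec3]  # (min corner, max corner)

@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prim_ids: Optional[List[int]] = None   # set on leaves only
    box: Optional[AABB] = None

def union(a: AABB, b: AABB) -> AABB:
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))

def refit(node: Node, prim_bounds: List[AABB]) -> AABB:
    """Recompute every node's AABB after primitives moved, keeping the topology.

    Cheap (one pass over the tree), but tree quality degrades as objects deform,
    which is why engines periodically rebuild instead of refitting forever.
    """
    if node.prim_ids is not None:                      # leaf
        box = prim_bounds[node.prim_ids[0]]
        for pid in node.prim_ids[1:]:
            box = union(box, prim_bounds[pid])
    else:                                              # inner node
        box = union(refit(node.left, prim_bounds),
                    refit(node.right, prim_bounds))
    node.box = box
    return box
```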
Gamers want fast framerates, high polygon counts, and high-resolution textures much more than some currently-useless raytracing. Solve those to near-reality quality at 1440p or better, and then we can talk about these gimmicks. I'll happily skip this (and next) gen and will be OK with some 1080 card.
RTX cards might allow porting Luxrender in a way that retains its spectral treatment, which actually physically simulates a dispersive prism generating a rainbow from a ray of white light, without specifying more than the dispersion of the glass.
It goes without saying that such effects are inherently very noisy, but there might be ways of using automatic differentiation (cf. the Julia language) combined with advanced numerical integration to make use of calculus to reduce the noise in individual samples.
If anyone knows about attempts to combine Metropolis light transport (MLT) with automatic differentiation, or just about a more concrete idea for incorporating advanced numerical integration into MLT, or a potentially suitable MLT implementation in Julia, please let me know; I'd like to check it out, and I'd actually consider it a suitable "toy" project for learning Julia. The ease of efficient GPU use from this high a level allows so much flexibility, e.g. complex material node graphs get handled with no extra work in the integrator.
MLT is not used for animations because it is temporally unstable in a perceptually unfavorable way: the artifacts are more blotchy than noise, while human vision is more tolerant of noise. Recent developments around temporal MLT should mitigate that, but these require you to render multiple frames simultaneously. This makes them unsuitable for real-time applications.
Also, the thing about MLT/bidirectional/unidirectional pathtracing is that there is no universally best method. All of them have weaknesses. There are examples where each method is worse than the others in an equal-time comparison.
The best performance improvements that you can get with any of these Monte Carlo methods are always based on improved strategies for drawing the samples (importance sampling, QMC). Incorporating non-local information for local importance sampling will be the next big thing.
As for color noise: it is an undesirable artefact, but if you importance sample the spectral color response curves for the human eye (or your display device) correctly, then the spectral noise vanishes at least as fast as the other sources of noise in your scene.
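A toy sketch of that spectral importance-sampling point: estimate the integral of radiance(lambda) * response(lambda) either by drawing lambda uniformly or proportionally to the response curve. Both curves below are made up, and the sampler is a deliberately naive rejection sampler:

```python
import math, random, statistics

LO, HI = 380.0, 780.0                     # visible range in nm

def response(lam):                        # stand-in for an eye/display response curve
    return math.exp(-((lam - 555.0) / 60.0) ** 2)

def radiance(lam):                        # stand-in spectral radiance of the scene
    return 1.0 + 0.5 * math.sin(lam / 25.0)

def estimate_uniform(n):
    total = 0.0
    for _ in range(n):
        lam = random.uniform(LO, HI)
        total += radiance(lam) * response(lam)
    return total * (HI - LO) / n

R_NORM = sum(response(LO + i) for i in range(int(HI - LO)))  # ~ integral of response

def sample_response():                    # rejection sampling, pdf proportional to response
    while True:
        lam = random.uniform(LO, HI)
        if random.random() < response(lam):
            return lam

def estimate_importance(n):
    # f/pdf = radiance * response / (response / R_NORM) = radiance * R_NORM
    return statistics.mean(radiance(sample_response()) * R_NORM for _ in range(n))

random.seed(1)
for est in (estimate_uniform, estimate_importance):
    runs = [est(64) for _ in range(500)]
    print(f"{est.__name__:20s} mean {statistics.mean(runs):6.2f} "
          f"stddev {statistics.pstdev(runs):6.3f}")
```

Both estimators agree on the mean, but drawing wavelengths in proportion to the response curve shows visibly less spread, which is the mechanism behind the spectral noise vanishing quickly.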
> The only real performance advantages the GeForce3 currently offers exist in three situations: 1) very high-resolutions, 2) with AA enabled or 3) in DX8 specific benchmarks. You should honestly not concern yourself with the latter, simply because you buy a video card to play games, not to run 3DMark.
> The GeForce3 is consistently a couple of frames slower than the older GeForce2 cards
> Between now and the release of the GeForce3's successor, it is doubtful that there will be many games that absolutely require the programmable pixel and vertex shaders of the GeForce3
It would help if there was more explanation of what the RTX units are actually good at. Not every GPU innovation actually ends up widely used. It would be amusing if they lost out by being not programmable enough...
That’s going to need to be modified I think!
Unfortunately, unaware (legacy) software is still limited to 64 processors.
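For context, that 64-processor ceiling is (I believe) the Windows processor-group limit: a group holds at most 64 logical processors, and group-unaware software only sees and schedules within one group. A small ctypes sketch (Windows-only, purely illustrative) of the difference:

```python
import ctypes

kernel32 = ctypes.windll.kernel32
ALL_PROCESSOR_GROUPS = 0xFFFF

groups = kernel32.GetActiveProcessorGroupCount()
total = kernel32.GetActiveProcessorCount(ALL_PROCESSOR_GROUPS)
print(f"{groups} processor group(s), {total} logical processors total")

# What a group-unaware program effectively gets to run on:
for g in range(groups):
    print(f"  group {g}: {kernel32.GetActiveProcessorCount(g)} logical processors")
```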
e.g. we've seen Apple's A12 processor expand its ALUs from 4 to 6, with what seems like a strong focus on cache latency, and these changes seem to be rather beneficial in real code. Why aren't we seeing the same from Intel/AMD? As someone who isn't particularly well informed on the topic, my guess is that AMD/Intel are scared of selling a wider-core but lower-clocked CPU given how much marketing is attached to clock speed, but I imagine there are architectural issues as well.
In Skylake, most of the effort seemed focused on bringing AVX-512 to bear, and outside of that the basic design is largely the same as Haswell.
Here's hoping that Apple's chips portend an increase in width in mainstream x86 chips as well.
Having said all that, any extra execution units can be used more effectively with multi-threading. It sounds neat to have twice as many threads as cores, but on my workloads that's only about a 20 percent performance increase. Going wider would probably help the second thread quite a bit, but what would be sacrificed is deep in the details of a given design. I suspect they increase width so long as it doesn't impact single thread performance.
In the end it's all deep in the details. What I'd really like to see is a RISC-V implementation done by a full team at Intel, AMD, or IBM. Even an ARM implementation by those teams would make a great comparison, but that seems even less likely ;-)
And the fact that you can have full-register-width immediates in a single instruction means that you don't have to allocate an architectural register for intermediate immediate construction.
The A10X article on anandtech seemed to imply that every other ARM architecture was far behind on width while Apple was mostly on par (or slightly behind) x86/64.
I’m a huge fan of the Threadripper concept, just wish they hadn’t neutered the 32-core chip compared to its Epyc counterpart.
The Phoronix benchmarks are quite clear, I don't know why you keep linking to AnandTech's Windows benchmarks. I say this as someone who reads tons of AnandTech reviews because they're great, but Windows just doesn't do well with high core count hardware at all.
Not even sure how it would work, or even make any sense, to have N:M handled by the kernel. N:M is usually mainly a userspace thing. And Windows is even less likely to use that kind of convolution, because IIRC it can call back from kernel to userspace (a design I would not recommend, btw, but oh well). You have fibers, of course, but that's a different thing.
Windows probably does not scale simply because the kernel is full of "big" locks (at least not small enough ones) everywhere, and it has far fewer fancy structures and algorithms than Linux (is there any equivalent of RCU that is widely used in there? Not sure). Cf. the classic posts from the Chrome developer who every now and then encounters a ridiculous slowdown of his builds on moderately big machines, sometimes because of badly placed mutexes.
Correct, I meant that the benchmarking program itself probably used that implementation. Not the Win NT kernel’s implementation of OS threads.
Predicted = Theoretical improvement (2 cores are 2x faster than 1, etc)
10k/00k = number of entries being searched (there's actually a variable number of strings per entry)
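For anyone eyeballing those columns, a small sketch of what "Predicted" assumes (ideal linear scaling) versus Amdahl's law with a serial fraction; the 5% figure is just an illustrative assumption, not derived from the benchmark:

```python
def predicted(cores):
    return cores  # "2 cores are 2x faster than 1, etc"

def amdahl(cores, serial_fraction=0.05):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} cores: predicted {predicted(n):4.1f}x, "
          f"Amdahl w/ 5% serial {amdahl(n):5.2f}x")
```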
This is part of what the GP was complaining about. The 2 dies which do not have memory also do not have PCIe wired out of the socket, so they are one hop away from I/O as well. If you are trying to max out I/O, you'd probably be better off with a low-end Epyc that had fewer cores enabled per die.
It is interesting to note that the memoryless dies do have PCIe root ports, for built-in things like the PSP crypto engines. That surprised me somewhat when I first noticed it, but it makes sense in retrospect, since the issue is that the TR socket was designed for 2 dies.
Because you only have 1/2 the memory and I/O bandwidth on Threadripper: if maxing out I/O is your concern, you would likely be better off with a low-end Epyc that still had all the memory controllers and PCIe lanes wired up.
Oh, it is? It really seemed to be about direct vs. non-direct, and nothing else. And that pretty much only affects latency, because it only loads the infinity fabric by about 20-25% to route 16 PCIe lanes to each die.
I wonder if they'll license it - with Apple's A12 already on the 7nm TSMC process, building a Xeon crushing ARM monster for the new Mac Pro by swapping the Ryzen dies for ARM dies seems like a great bit of leverage, assuming Cook and Su could arrange it.
Thinking of upgrading my 2014-era stuff due to massive improvements in SSDs, memory, etc., but not sure it's worth dropping so much money on Coffee Lake versus just waiting for something better x86 (or ARM...)-wise.
1.25x scaling from an improved node is way, way, way worse than Dennard Scaling of the past.
Intel Pentium III Coppermine (1999) went from 733 MHz on the 180nm node to Pentium III Tualatin (2001) 1400 MHz on the 130nm node. THAT was Dennard scaling.
Today, we "only" get double-digit gains from an improved process node. Dennard Scaling was triple-digit gains. Furthermore, most CPU makers focus on the power-saving aspects (which seem to be scaling somewhat well still).
Even if you completely ignore boost clocks, the 1900X has a base clock of 3.8GHz, so interpret it as "4.75GHz base clock on a non-Epyc part" if you must.
But I don't think you should ignore boost clocks. They're not a trick to make the silicon seem more capable. The silicon really is that capable and boost clocks are a trick to cap power draw. It's entirely fair to look at the 4.2 boost clock on the 2990WX and conclude that the silicon is capable of 4GHz under non-exotic conditions.
Does Intel have something comparable on their roadmap?
Selfishly I really wish they would make it easier, because I'm in the market for a new personal-use storage machine and I've spent far too long researching all this crap but it's looking like I'll have much more certainty that it'll all just work if I buy a Xeon E3/E-2000 series and that's unfortunate.
It is TR4 though. I grabbed a 1920X after the price drops.
I’m the kind of person that lurks in /r/homelab though - so I’ve also got a 25U rack to keep all my gear. If you want a tower to stuff in a corner things get more dicey.
Most people probably don't need the gobs of memory I have either. An R520 with one socket populated and 2x8 or 16GB dual-ranked RDIMMs would be more than sufficient.
Does anyone know if this was ever confirmed or not?
It also mentions that:
>Support for ECC Un-buffered DIMM 1Rx8/2Rx8 memory modules (operate in non-ECC mode)
Note the operate in non-ECC mode remark.
Seems pretty clear to me. The CPU might allow ECC but now it's the motherboard playing tricks.
It was always the CPU that didn't have ECC, but now apparently your whole system has to be designed to be a fancy workstation.
Everybody keeps repeating AMD supports ECC out of the box, but the fact that you will most likely not have an AM4 motherboard which does have ECC enabled is new to me.
The PSP is still essential to the boot process and many probably other things (power management, etc), and you aren't going to just magically turn it off with a UEFI option. If your concern is that the PSP is a blackbox covert channel, it probably changes almost nothing.
Both AMD and Intel are functionally equivalent here, as far as I'm concerned.
(My ASRock X399 board also has this UEFI option and specifically calls out that it only disables a few key features.)
If accurate, this is a significant functional difference as far as I'm concerned.
But, what's the old saying again? "The only truly secure system is one that is powered off, cast in a block of concrete and sealed in a lead-lined room with armed guards - and even then I have my doubts."
The instruction set stayed the same. And in the current instruction set, a lot of these AVX instructions still operate on 128-bit lanes. Instructions like vpshufd, vshufps, and vpblendw only shuffle/blend/permute within 128-bit lanes, and so do their AVX-512 equivalents.
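A toy emulation of what "within 128-bit lanes" means for a vpshufd-style 32-bit shuffle on a 256-bit register (pure illustration in Python, not the intrinsics API): the same 4-element pattern is applied to each lane independently, so elements never cross the lane boundary.

```python
def shuffle_epi32_256(src, imm8):
    """src: 8 x 32-bit elements (low lane first); imm8: four packed 2-bit selectors."""
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]   # decoded once, reused per lane
    out = []
    for lane in (src[0:4], src[4:8]):                    # two independent 128-bit lanes
        out.extend(lane[s] for s in sel)
    return out

ymm = [0, 1, 2, 3, 10, 11, 12, 13]
# Reverse within each lane (selectors 3,2,1,0 -> imm8 = 0b00011011):
print(shuffle_epi32_256(ymm, 0b00011011))   # [3, 2, 1, 0, 13, 12, 11, 10]
```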