Because the two NUMA nodes are almost entirely independent, the chip can run two independent processes at full speed. In practice, that means lower latencies and less jitter, and it's been noticeable. Folklore has it that single-thread performance is the most important aspect of desktop performance, but that isn't what I've observed.
...it's also useful when I, e.g., decide to run a Factorio server on my desktop.
I don't understand. From my (admittedly little better than layperson's) knowledge, I'm guessing the cores of most multicore processors have to compete for memory access...? Is there a good search term I can use to help me understand what's going on here?
Threadripper is able to switch between NUMA (non-uniform memory access) mode and "regular" mode. In NUMA mode, the OS knows that two memory channels are attached to one die and two to the other, which allows lower latencies because the OS can allocate RAM local to whichever core the process is running on.
For Windows, it is the other way around. I hope they'll improve their NUMA handling, but I'm not holding my breath.
The Linux kernel is clever about this. You can get some idea of what it does by looking at numactl, which exposes the various placement policies -- though in practice the kernel does a great job without any user overrides, and actually using the command is likely to slow things down.
Which is not to say that it can't occasionally be helpful, if you're trying to optimize the speed of a single thread. At a minimum, you can choose between optimizing for bandwidth (interleaving data on all four memory channels) or latency (putting everything in the local node). Usually you want the latter.
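Roughly what that choice looks like with numactl -- a minimal sketch, assuming a Linux box with numactl installed and a hypothetical ./my_app workload:

    import subprocess

    APP = ["./my_app"]  # hypothetical workload binary

    # Bandwidth: interleave pages round-robin across all nodes' memory channels.
    subprocess.run(["numactl", "--interleave=all", *APP], check=True)

    # Latency: keep both the threads and their memory on node 0.
    subprocess.run(["numactl", "--cpunodebind=0", "--membind=0", *APP], check=True)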
Judging by the performance I'm (not) getting, Windows does a very poor job with NUMA.
In practice this means that memory controllers are partitioned amongst groups of cores, with some slower and often otherwise busy interconnect between those groups.
The software implication is that if task X uses some bit of memory a lot, then that bit of memory better be node-local, i.e. easy to access for the core where task X is running.
Threadripper and Epyc present themselves as 2 or 4 separate NUMA nodes depending on model. Spreading a single task across multiple NUMA nodes usually hurts performance significantly (often slower than just running it on a single node using fewer threads), but you can run 2/4 separate tasks at pretty much full speed.
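On Linux, the cheap way to get that node-locality for a single task is to pin it to one node's cores and let the default first-touch allocation policy do the rest. A rough illustration -- the sysfs path is standard, the node number is whatever your box reports:

    import os

    def node_cpus(node):
        """Parse the kernel's cpulist for a NUMA node, e.g. '0-7,16-23'."""
        cpus = set()
        with open("/sys/devices/system/node/node%d/cpulist" % node) as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    # Pin this process to node 0's cores; with the kernel's default local
    # allocation policy, the memory it touches from here on lands on node 0.
    os.sched_setaffinity(0, node_cpus(0))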
The new WX processors are a little weird because two of the NUMA nodes have no direct access to RAM at all; they have to ask the other two dies to do it for them and pass the data over.
There are a few places where it matters. We talk of "gaming" but that is too broad a term imho. Many AAA games do leverage AMD chips well; they know how to split loads across threads to maximize performance. But in recent years the hot games have come from indie developers, and pre-release titles from tiny developers aren't properly optimized. If you want to play Kerbal Space Program, single-thread performance is vital (KSP's physics is still single-threaded iirc). If you want to surf the web while running a Factorio server, AMD will work fine. But if you want to maximize the performance of that Factorio server, the faster single-thread performance of Intel chips would be better.
I hear the (Windows/console only?) version is faster for all the reasons you'd expect, but I'll probably end up never playing it.
I wondered if it would be a hit, since 8-bit graphics are part of the charm and concept of Minecraft, but I didn't consider the mod ecosystem.
For example, they decided to forcibly split everything into server thread(s) and client thread(s), even when you're playing single player. This made things more consistent, and sometimes helped performance, even though it broke a huge amount of mods.
While making this clean break it would have been easy to put world generation into its own thread, or even let it use multiple threads. Minecraft's world generation is inherently very parallel, and it bogs down servers all the time. Responsiveness could be improved by a huge amount by cutting it out of the simulation loop.
Multithreading aside: KSP's approach to physics also doesn't help them. Rather than make creative decisions about what should and shouldn't be calculated, they just hammer the physics thread with lots of calculations that have zero impact on the user experience. At a larger dev shop these would be optimized away.
The 1950x is looking like a viable upgrade right now as it’s becoming cheaper.
If you aren't concerned with getting absolute maximum speed from the servers, then I'd think you could run 16 of 'em with only moderate degradation; perhaps 20-30% slower (per server) than a single server. Feel free to imagine the Clusterio applications.
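Something like the loop below is all it'd take -- one headless server per port, pinned round-robin across the four NUMA nodes. The install path and save names are made up; --start-server and --port are the standard headless flags:

    import subprocess

    FACTORIO = "/opt/factorio/bin/x64/factorio"   # assumed install path
    NUM_SERVERS = 16
    NUM_NODES = 4

    procs = []
    for i in range(NUM_SERVERS):
        node = i % NUM_NODES
        procs.append(subprocess.Popen([
            "numactl", "--cpunodebind=%d" % node, "--membind=%d" % node,
            FACTORIO,
            "--start-server", "saves/world%d.zip" % i,   # hypothetical save files
            "--port", str(34197 + i),
        ]))
        # (each server probably also wants its own config/write directory)

    for p in procs:
        p.wait()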
That's been the real blocker for me. I don't want to disable them entirely, but worlds need to keep running while there are no players in them. So...?
Use thick walls, with a mixture of solid walls on the inside and every-other spacing on the outer layer.
The main base's exterior walls use a mixture of uranium-ammo turrets, flamethrower turrets, and laser turrets.
Inside the base, use laser turrets wrapped around large power poles, but no interior walls. Make redundant connections using large power poles; substations are for ordinary structures.
It still is. It's just that we've reached the end of the single-thread performance curve, or it has gotten close enough to its competitor's, that it's worth trading ~20% of maximum single-thread performance for double the core count.
I do wonder if there is a point of diminishing returns though. 24, 32 cores, is it really worth the extra money?
Streaming would be a better example.
And for reference all my background tabs in chrome are using plenty of memory but <1% CPU total. It helps that the browser has aggressive throttling for background tabs.
Edit: it's even more aggressive than I thought https://arstechnica.com/information-technology/2017/03/chrom...
1. The problem must have been caused by Chrome actively spending CPU and memory cycles, not just by having so much memory allocated that you ran low, because NUMA itself can't change memory consumption.
Caveat: it's possible the NUMA setting would also cause Chrome's RAM use to be capped. If that's what causes the improvement, then you could do the same capping on a non-NUMA machine.
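If you want to test that theory on an ordinary (non-NUMA) Linux box, a transient cgroup limit is one way to do the capping; the 4G figure is just an example and the binary name depends on your distro:

    import subprocess

    # Launch Chrome in a transient systemd scope with a hard memory ceiling.
    subprocess.run([
        "systemd-run", "--user", "--scope",
        "-p", "MemoryMax=4G",    # example cap, tune to taste
        "google-chrome",         # binary name varies by distro/channel
    ])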
2. The foreground page in chrome was something like a strategy guide, not something more intense like video streaming. Are there pages that are both strategy guides and intensive to calculate?
3. This happened in the last year or so since version 57 came out.
Not on the modern web it doesn't. A lot of pages are continuously running background tasks and refreshing over time.
Sure enough, many of the benchmarks feature dramatically better performance on Linux. Michael also writes:
"The Windows 10 vs. Linux tests were done out of opportunity with having that Windows installation around without giving it too much thought, but on this 2990WX launch-day it's been surprising to see some of the performance results from some of the Windows publications. Had I known how poorly Windows 10 works on current high core count NUMA environments under some workloads, I would have certainly ran more benchmarks. But that will come in another article then as well as possibly looking at the Windows Server 2016 vs. Linux performance on the 2990WX to see if Windows behaves better there for this NUMA box. So treat this as the introductory article and more Windows vs. Linux benchmarks will be on the way as time allows."
"Me being an idiot and leaving the plastic cover on my cooler, but it completed a set of benchmarks. I pick through the data to see if it was as bad as I expected"
It made it perfectly clear. You were reading from a list of what wasn't in the review yet: "But here's what there is to look forward to:"
1) Some VM-based tests for this kind of CPU -- 32, 64, or 128 VMs all running some kind of web/db/redis benchmark inside.
2) Some compilation testing -- time a clean build of AOSP, BSD, or some very complex Linux app; use the max jobs setting for parallel compilation and time how long it takes to finish, measuring overall CPU usage at the same time (rough timing sketch below).
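The timing itself is trivial -- something along these lines, assuming the tree is already configured and make-based; run mpstat alongside it if you want the CPU-usage trace:

    import os
    import subprocess
    import time

    jobs = os.cpu_count()                      # e.g. 64 threads on a 2990WX
    subprocess.run(["make", "clean"], check=True)

    start = time.monotonic()
    subprocess.run(["make", "-j%d" % jobs], check=True)
    print("clean build with -j%d: %.1fs" % (jobs, time.monotonic() - start))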
My initial gut feeling reading the results on Anandtech is that Windows' scheduler may not yet be able to exploit the processor's architecture effectively.
I suppose it makes the most sense to frame this as an iteration of the original Threadripper with all the design compromises that entails to provide backward compatibility with existing motherboards. I'm definitely more excited for the next generation HEDT chips that hopefully have a more balanced layout like Epyc.
AMD have certainly come out of the Bulldozer era swinging, and we all benefit from their aggressive pricing.
I think this means that scaling processors to more cores (nodes) in a lightweight-NUMA scheme just isn't going to work.
My guess going forward is that the core counts will stay at the current level for quite a while, and that both AMD and Intel will optimize the heck out of their interconnects, using most power budget gains to bolster the actual cores.
As you optimize one part of the system so that it is blazingly fast or lower power or whatever, then the amount taken up by the rest of the system increases proportionally.
(In other words, real performance wins happen gradually as all the parts receive their own little optimizations which add up.)
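That's essentially Amdahl's law: if a fraction p of the work is sped up by a factor s, the overall speedup is

    S = 1 / ((1 - p) + p / s)

so even an infinite speedup of one part caps out at 1/(1 - p), and the un-optimized remainder dominates.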
The 7nm parts (incl. Ryzen on up) are rumored to use 16-core CCXes, so we should see 64-core EPYC and possibly Threadripper with no worse interconnect power use. Hopefully they'll make the interconnect better as well.
7nm EPYC being fabbed on TSMC does give some chance for a big monolithic die. TSMC has a long history making large dies and their 7nm process should be mature enough for decent yields at that point. I am speculating out of my depth, though.
At that point, Chromium required Visual Studio 2015 Update 3 or later. Anandtech used VS Community 2015.3. IIRC, by default it doesn't have LTCG enabled.
By default, release builds are almost entirely statically linked (with a few shared libraries, platform dependent), which makes linking time pretty significant. As a result, it ends up more as a linking benchmark than a compilation benchmark.
The article mentions that, due to the die packaging, only 16 of the cores have direct access to RAM. So on the 32-core version, half the cores are memory-starved and have to go through the 'connected' cores (which also impacts those), while the 16-core version doesn't have that problem and can keep all cores fully fed under any load.
One of the projects I compile at work can take an hour running on 4 threads; jack that up to 16 and you take it down to not much over 17-18 minutes. That's a whole heap of developer time you just got back that would otherwise have been wasted on compiler sword fights.
The other one is running VMs / a Docker swarm locally for development.
If you are making a 30-second animation at 24 frames per second, that's 720 frames; at roughly 20 minutes per frame, that works out to roughly 240 hours (10 days) of rendering. 30 seconds is roughly the length of a standard commercial.
If you have a computer that is 2x or 4x faster, that cuts the time down to 5 days or 2.5 days, which is dramatically different. It's mostly a CPU-intensive problem with relatively low RAM-bandwidth demands. (It's RAM-heavy, especially with HDR skymaps, so you need lots of RAM but not necessarily fast RAM.)
The Threadripper 1950x (16-core) is faster than the 1080 Ti in several tests. Fishy Cat for instance is faster on Threadripper, as well as the difficult "Barbershop Interior".
With all the updates in Zen+, higher clocks, and now 32 cores, I bet that the 2990WX will be incredible and give GPUs a run for their money.
Besides, you'll need a good CPU to handle physics (cloth, fluid, etc. etc.). Not everything can be done on the GPU yet.
CPUs also have the benefit that RAM is super-cheap. You can get 64GB of DDR4, but it's basically impossible to get that amount of RAM on a GPU. This allows you to run multiple Blender instances to handle multiple frames quite easily. A portion of rendering is still single-thread bound, so an animation can be rendered slightly faster if you allocate a Blender instance per NUMA node.
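A rough version of the per-node split, with made-up file names -- it divides the frame range explicitly between the instances rather than using Blender's placeholder/no-overwrite settings:

    import subprocess

    BLEND = "shot.blend"        # hypothetical scene file
    FIRST, LAST = 1, 720        # frame range to render
    NODES = 4                   # NUMA nodes on a 2990WX

    chunk = (LAST - FIRST + 1) // NODES
    procs = []
    for n in range(NODES):
        s = FIRST + n * chunk
        e = LAST if n == NODES - 1 else s + chunk - 1
        procs.append(subprocess.Popen([
            "numactl", "--cpunodebind=%d" % n, "--membind=%d" % n,
            "blender", "-b", BLEND,
            "-o", "//render/frame_####",     # '####' is frame-number padding
            "-s", str(s), "-e", str(e), "-a",
        ]))

    for p in procs:
        p.wait()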
If you do have a GPU, you can still do CPU+GPU rendering by simply running Blender twice, once with GPU rendering and a second time with CPU rendering. With the proper settings, the two instances will generate the animation's .png frames independently, which allows for nearly perfect scaling.
Every x399 board I've seen supports quad-GPUs. So you can totally build a beast rig with 4x GPUs + 32 CPU Cores for the best rendering speed possible.
Instead I built an awesome Hackintosh with 16GB RAM, 8 cores, an NVMe drive, and a 1080 for like $1600 that runs High Sierra. It's definitely more work to set up initially but pretty low hassle afterwards. No regrets.
- Intel 7700k
- Geforce 1080
- Asus ROG Strix Z270E
- Samsung Evo 960 NVMe
1) Make a standard install USB
2) Run Clover Configurator on the USB with standard settings + tweaks based on your GPU and motherboard (you can find suggestions on /r/hackintosh and the TonyMac forums)
3) Install and boot
4) Tweak the Clover configuration on the EFI partition to fix any random remaining issues you find like USB or audio.
This is still the case in 10, and one of the reasons I have been seriously looking at Bitwig (Linux support being the other).
It's very apparent that my 2-core 4-thread i5 5200U is a bit too weak; I'd love to be using a 16- or 32-core machine.
That’s all I’ve got.
My 16-thread Threadripper is mostly idle in my video-editing tests. I mean, it's a great processor, but video editing isn't "heavy enough" for me to recommend a 16-core or bigger processor.
The best upgrade I’ve made for my video editing workstation has been going to 4x GPUs for DaVinci Resolve.
I don't know enough about video games, but I would naively think that the memory latency would be a big deal there as well.
At any rate, I learned a lot from this article. Anandtech's reviews always seem well-written and well-researched.
It really depends on what else you're doing. If you're just playing a game then Disk and GPU tend to be the biggest bottlenecks in video gaming. Even a reasonably fast modern CPU is sufficient for most games.
Atm I'm running about 15 VMs on about 8 cores in a dedicated box somewhere and it's definitely noticeable. I would love to move some core services home and have 32 cores to play with to give me some more headroom.
There are also reasons for having some more isolation between guest OSes.
On my ESXi box at home I have:
* A VM that hosts my NAS shares. This does nothing other than host the NAS shares, as I want to be sure no silly experiment of mine interferes with that.
* A general-purpose VM, where I do run some containers out of (UniFi controller, Plex, etc)
* A VM running Windows Server for my Domain Controller
* A secondary vSwitch, isolated with no uplink to the rest of the network. This is my mini malware-testing lab.
* A VM running pfSense that I'll sometimes use to allow selective access out of the isolated vSwitch to the internet, but not to the rest of the network.
Can't do all that with containers.
I'm using FreeBSD, but these apply just as well to Linux. I wanted to run ZoneMinder, which is not available for FreeBSD, so I simply spun up a CentOS VM and installed it.
On the flip side, I wanted to run Home Assistant, Node-RED, and some related utility programs. All of these are happy to run on FreeBSD, so they can live happily in a Jail (FreeBSD's equivalent to a container).
Some people virtualize their router by dedicating a NIC to the appropriate VM. I don't know if this would even be possible in a container.
I currently run 4 Linux VMs for my Kubernetes cluster and a 4-core macOS VM with passthrough for my GTX 1080 Ti. I have 64 GB of memory, so the only thing stopping me from running my Windows 10 and Arch desktop VMs at the same time is more cores.
At least 3 VMs need patched kernels, or more recent kernels/more regular kernel updates than the host provides.
Additionally, VMs provide a bit more isolation than a simple container (at least unless you use unpriv'd containers).
I do have containers too, about 20 of them, half of them unpriv'd, all of them LXC. Docker is not suitable for my use case at all and frankly I don't think you should suggest someone should switch to Docker without knowing their use cases.
You can do a lot with one box and administer just that one thing, without having lots of individual boxes doing stuff; for me it'd be way faster and a single cost, so it'd work out as a big improvement.
In a 'money is no object, all the time in the world' scenario it would probably be better to have something dedicated to each task, but that's not as flexible, on top of the other drawbacks (cost in money and time).
"Don't be a patsy who pay for heating your place, be paid for it. Order our heating device for just $2999!"
There are clearly workloads where the 32 core TR chip does not perform well (probably due to the memory configuration) but it seems pretty good at rendering.
This means AMD manages to execute more work per watt (it's more energy efficient), and each AMD core uses less power than each Intel core.
¹ Anandtech wrongly lists the TDP as 140W. It's in fact 165W: https://ark.intel.com/products/126699/Intel-Core-i9-7980XE-E...
Its power consumption was 19% higher than the i9-7980XE.
Tom's Hardware saw a stock 2990WX at a lower power consumption than a stock i9-7980XE during a Prime95 "torture loop". Overclocked, the AMD part was higher than the Intel one, but only slightly.
Where have you seen that it has double the power consumption? Under what workloads?
Personally, I don't care about "weaker cores". If a system has 2048 cores clocked at 700 MHz and it is 20% faster at my workload than a single-core CPU at 7 THz, it is faster.
The fact that the "weaker cored" system is cheaper than the "burly muscly" single core system is a bonus.
Power consumption doesn't even matter that much either. It is the equivalent to a single 60W light bulb (or several of those new-fangled LED bulbs). Big whoop.
But more importantly, The Tech Report looked at task energy for the Threadripper in rendering tasks and found that it took less energy to finish a render than its competitors. Its power draw was higher, but the time was shorter to an even greater extent (energy = power x time, so e.g. 19% more power over 30% less time still comes out to roughly 17% less energy).
So if you're so serious about rendering that you're willing to spend thousands on a good rig for it there really isn't any reason not to use this boy.
Intel on the other hand needs to manufacture a monolithic CPU that not only is fault free in enough cores, but performs well. That's harder and yields are way lower.
80% yield on a 4-core block is 0.8^8 ≈ 17% yield on a 32-core block - and that's before binning.
AMD has only been doing this "infinity fabric" thing for a year. Intel was caught with their pants down. It seems like Intel is researching chiplet technology and trying to recreate AMD's success here.
It takes several years to create chips. So Intel realistically won't be able to copy the strategy until 2020 or later. But you better bet that Intel is going to be investing heavily into chiplet technology, now that AMD demonstrated how successful it can be.
AMD "upgraded" HyperTransport to Infinity Fabric. Which IIRC uses a bit less power (taking advantage of the shorter, more efficient die-to-die interposer).
Intel has UPI (upgrade over Intel QuickPath), but it hasn't been "shrunk" to chiplet level yet. Intel has EMIB as a physical technology to connect chiplets together... but Intel still needs to create dies and a lower-power protocol for interposer (or maybe EMIB-based) communications.
So Intel has a lot of the technology ready to create a chiplet (like AMD's Zeppelin dies). But Intel wasn't gunning for chiplets as hard as AMD was. Still, Intel demonstrated their chiplet prowess with the Xeon+FPGA over EMIB. So Intel definitely "can" do the chiplet thing, they just are a little bit behind AMD for now.
Also because they didn't have to innovate - no competition since early Opterons.
Any kind of process that's batchable too.
My ideal workstation probably uses ~128 cores, but that isn't practical for home use yet. A board with four 2990WXs would be heaven.
Personally I'm looking forward to something based on 2200GE for home use.
Hey everyone, sorry for leaving a few pages blank right now. Jet lag hit me hard over the weekend from Flash Memory Summit. Will be filling in the blanks and the analysis throughout today.
I am disappointed by that, as I was looking forward to reading the test setup and power draw sections. I have a 2990WX on order and I'm dithering over what motherboard to get (I'd prefer an older one, which better matches the features I want -- e.g. clear support for ECC and no bling), but there is some concern that older motherboards will be too close to the edge in terms of the power draw of the 2990WX.
For raw performance though, I would guess we will see some rather extreme cooling actually becoming more mainstream in future in the workstation space.
"After core counts, the next battle will be on the interconnect. Low power, scalable, and high performance: process node scaling will mean nothing if the interconnect becomes 90% of the total chip power."
Meaning "highly vectorizable number crunching that you can't practically run on a GPU for some reason" is probably fairly niche. In LINPACK the even the 1950x beat the 7980XE on phoronix:
and I'd have thought LINPACK would really reward AVX...
Maybe it isn't optimized to take advantage of it?
I love wasting time playing with number-crunching microbenchmarks, and when you do that AVX-512 seems so unassailably impressive that it's surprising to see this not borne out in larger benchmarks. (Also worth pointing out that you need to tell gcc -mprefer-vector-width=512 on top of -march=skylake-avx512, otherwise it'll prefer 256-bit vectors.)
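If you want to poke at the comparison at home, a LAPACK-backed dense solve is a rough stand-in for what LINPACK measures; the problem size and the 2/3·n^3 flop convention here are mine, not from the Phoronix run:

    import time
    import numpy as np

    n = 8000
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    start = time.monotonic()
    x = np.linalg.solve(A, b)      # LU factorization + solve, LINPACK-style
    elapsed = time.monotonic() - start

    gflops = (2 / 3) * n ** 3 / elapsed / 1e9
    print("n=%d: %.2fs, ~%.1f GFLOPS" % (n, elapsed, gflops))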
Anyway, those benchmarks definitely support compiling and number crunching. Pretty ideal for stuff like monte carlo.