Intel 9th Gen Review (anandtech.com)
221 points by endorphone 4 months ago | 174 comments



140 comments, and not a single word about TDP.

It seems people are OK with Intel stating a 95W TDP when its max could be 225W. We have known for a long time that TDP isn't a reliable figure and had come to expect a 20-30% variance, but now we are looking at a ~2.4x difference....

After all the lies Intel has told, after all the disappointment and anger that Intel has been sitting there not innovating, cutting costs on even the simplest thermal measures, their nearly irresponsible sales pitch to enterprise, the cries for AMD to bring competition.

And they still buy an Intel Chip.

The lesson here is brand building and stickiness. I can only wish AMD the best of luck. I am waiting for Zen 2.


Anandtech noticed an error in the power measurements and retested with another motherboard, which gave better results: https://twitter.com/IanCutress/status/1053427417052798977


The power measurements were correct. What was wrong is that the mainboard supplied a higher voltage to the CPU, something that you only do if you want to overclock and your system is not stable at the original voltage. So all the tests are somewhat suspect, as we don't know what the CPU will actually do with the changed voltages and thermals. The CPU could clock higher due to the higher voltage or reduce the clock due to the increased temperatures. Nobody really knows without rerunning all the tests.


166W when fully loaded on a 95W TDP processor. For me, that is the textbook definition of dishonesty.


It's pretty annoying... but in a lot of Intel marketing and spec sheets you'll find small print stating that TDP is measured at base clock speed. The base clock has fallen since the 6700k, where it was 4.0GHz.

In the end, it seems to me like the intent is to deceive.


I don't think so. Turbo clocks are "turbo" because they're meant for burst usage. Going above TDP for a few moments is fine; sustaining those clocks is the problem.


Thanks, ~170W is better but still far off from 95W.


It should be tested with an appropriate cooler. One that does remove 95W in the long term when the case reaches the maximum design temperature as this is what TDP intuitively means.


I'm comparing the i7 here to the i7 in my current computer, built in April 2011(!)

https://ark.intel.com/compare/186604,52214

Kinda thought I would see a lot bigger differences. There was a time in my life when I expected I would have to build a new computer every two years. Weird.


The i7-2600k doesn't even have 256-bit busses on the inside. The i7-2600k had barely begun to implement AVX (and doesn't implement AVX2 at all).

The i9-9900k is way better. 3x 256-bit execution ports, 64-byte (512-bit) messages going around the Ringbus, DDR4 RAM (double memory bandwidth).

The raw clocks ignore the huge advancements in IPC between Sandy Bridge and Coffee Lake. And even then, the 5GHz clock on the i9-9900k makes the ancient Sandy Bridge completely incomparable to the modern Coffee Lake.


>The i9-9900k is way better. 3x 256-bit execution ports, 64-byte (512-bit) messages going around the Ringbus, DDR4 RAM (double memory bandwidth).

All those don't matter if they hardly make a 10-15% difference in everyday use, and require specialized code to take advantage of.


Much of what I do is vectorized, so the more FMA units and the wider they are, the faster. So avx512 > avx2 with two 256 bit units > avx2 with two 128 bit units.

But my work (statistics) seems poorly represented in benchmarks and the community at large. Ryzen is in the latter group, and gets rave reviews.

Meanwhile, most press I've seen about avx512 has been negative. Yet I get a lot more flops out of the 10-core i9 7900x than the similarly priced 16-core Threadripper 1950x in simulations.

A big part of that is probably doing most of my work in Julia, where because the code you run is compiled locally the first time a function/argument type combination is called (a) code will be generated for all your CPU's features, and (b) it is easy to encourage aggressive optimizations to vectorized code by the compiler, eg by parameterizing arrays by their bounds.

R benchmarks will not see benefits at all, as much of R's math libraries are terribly unoptimized -- the default BLAS doesn't even use any runtime dispatch! (Thankfully it's easy to change, but the benchmarks all reflect that reality -- and most people's personal experiences too).
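To make "runtime dispatch" concrete, here is a minimal C sketch of the mechanism, assuming GCC or Clang on x86-64. The names `dot`, `dot_generic`, `dot_avx2`, and `select_dot` are made up for illustration and are not taken from any BLAS; the loop bodies are deliberately naive, since the point is only how a library can pick a code path on the machine it actually runs on rather than at build time.

```c
#include <stddef.h>

/* Baseline path: compiled for whatever -march the build uses. */
static double dot_generic(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/* Same loop, but the compiler is allowed to use AVX2 inside this function. */
__attribute__((target("avx2")))
static double dot_avx2(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}
#endif

typedef double (*dot_fn)(const double *, const double *, size_t);

/* Pick an implementation based on the CPU we are running on right now. */
static dot_fn select_dot(void) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return dot_avx2;
#endif
    return dot_generic;
}

/* Public entry point: resolves the function pointer on first use.
   (Not thread-safe as written; a real library would guard this.) */
double dot(const double *x, const double *y, size_t n) {
    static dot_fn impl = NULL;
    if (impl == NULL)
        impl = select_dot();
    return impl(x, y, n);
}
```

A binary built this way can ship one generic build and still use wider instructions where the CPU has them, which is what a BLAS without dispatch cannot do.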

Something similar likely applies to many NumPy benchmarks; depending on the distributor of those binaries, the situation likely isn't better. They are either generic unoptimized binaries, or they ship with something like Intel MKL which, while the fastest linear algebra library for Intel processors, dispatches to slow code paths for AMD processors -- a very unfair choice for benchmarking. I'm less familiar with NumPy though, so perhaps there's a both fair and well optimized distribution.

I fear the future will move against my use cases.


If you want fat vector units you should be on a GPU.


Yes, in principle. A dedicated GPU will always be "fatter" in SIMD than a CPU.

But a CPU has latency benefits. Most noticeably, it takes only a few clock cycles for an Intel CPU to transfer data from its RAX register -> AVX vector registers, compute the solution, and transfer it back to the "scalar world".

This level of data-transfer is done within single-digit nanoseconds.

In contrast, any GPU memory transfer takes hundreds of nanoseconds to microseconds... on the order of 100x to 10,000x slower than transferring between scalar code and AVX code on a CPU.

----------

So GPUs are good for bulk compute (big matrix multiplies, Deep Learning, etc. etc.)... but I'm definitely interested in the low-latency uses of SIMD.

For example: consider SIMD-based Bloom filters to augment a hash table (or any data structure, really). Multi-hash systems like cuckoo hashes have also been successfully implemented with CPU-SIMD acceleration.

You can use CPU-accelerated SIMD to accelerate CPU-based algorithms. And CPUs certainly can benefit from fatter, and better designed, SIMD instruction sets.
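As a rough illustration of that low-latency pattern (scalar in, vector compare, scalar out), here is a small C sketch of probing a 4-slot hash bucket in one vector comparison. It assumes GCC or Clang vector extensions; `bucket4` and `bucket_find` are hypothetical names, and a real cuckoo or Bloom implementation would differ in layout and hashing.

```c
#include <stdint.h>
#include <string.h>

/* Four 32-bit lanes; GCC and Clang lower this to whatever SIMD is available. */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* A hypothetical 4-slot hash bucket, as in a bucketed cuckoo table. */
typedef struct {
    int32_t keys[4];
} bucket4;

/* Return the slot index holding `key`, or -1 if the bucket lacks it. */
int bucket_find(const bucket4 *b, int32_t key) {
    v4i32 slots;
    memcpy(&slots, b->keys, sizeof slots);   /* one vector load */
    v4i32 needle = {key, key, key, key};     /* scalar -> vector broadcast */
    v4i32 hit = (slots == needle);           /* lane is -1 on match, else 0 */
    for (int i = 0; i < 4; i++)              /* vector -> scalar result */
        if (hit[i])
            return i;
    return -1;
}
```

The whole round trip stays inside the core's registers and L1 cache, which is the latency advantage over shipping the same lookup to a GPU.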


> it takes only a few clock cycles for an Intel CPU to transfer data from its RAX register -> AVX vector registers, compute the solution, and transfer it back to the "scalar world".

unfortunately, it takes millions of cycles for the CPU to switch to the higher-power state necessary for AVX instructions, so your latency really isn't that much better.

the radfft guy did a good writeup https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e0...


Yes. My code is mostly AVX, so regardless of down-clocking I see substantial performance boosts.

But I've found getting into GPU computing extremely difficult. Admittedly, part of that may be having bought an AMD Vega graphics card instead of NVIDIA (not a fan of vendor lock-in, and AMD is building an open source stack with ROCm/HIP).

My code does lots of different things, many small, but given a billion iterations, it can add up. If you're running a Monte Carlo simulation over millions of different data sets, fitting each iteration with Markov Chain Monte Carlo, with a model that has a few long for loops and requires (automatic) differentiation and some number of special functions... It's not hard to rack up slow execution times out of small pieces.

Part of the problem is I need to actually find a GPU project that's accessible for me, so I can gain some experience, and learn what's actually possible. How I could actually organize computation flow. It's easy with a CPU. Break up Monte Carlo simulations among processors. Depending on the model being fit, as well as the method, different computations can vary dramatically in width. But optimizing is as simple as breaking up all those computations into the appropriately sized AVX units (and libraries + compilers auto-vectorization will often handle most of that!), so wider units directly translate to faster performance.

Part of the problem is I don't really know how to think about a GPU. Can I think of the Vega GPU as a processor with 64 cores, a SIMD width of 64 (64 * 64 = advertised "4096"), and more efficient gather loads/scatter store operations?

If there were a CPU like that, compiler + library calls and macros (including libraries you've written or wrapped yourself) go a long way so you can quickly write well optimized code.

I really need to dedicate the time to learn more. My 7900x processor is about 4x faster at gemm than my 1950x, but 10x slower than the Vega graphics card that cost less money. I see the potential.

To start, I need to find a GPU project that lets me really get experimenting and figure out how programs can even look.

Is there an Agner Fog of GPGPU?


If the MC simulations vary only in the values of parameters (as opposed to memory access or execution path through an if/else), then they are ideal for GPU. You write a kernel which looks like it’s doing one of those simulations, and call it on arrays of parameters, and you’re in business.

Concretely, I’ve used this for parameter sweeps of large systems of ODEs and SDEs.
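A minimal C sketch of that "one kernel, arrays of parameters" shape, written as a host-side loop so it runs anywhere. The ODE (dx/dt = -k * x with forward Euler) and the names `decay_kernel` and `sweep` are assumptions chosen for illustration, not the systems mentioned above; on a GPU each loop iteration in `sweep` would be one thread.

```c
#include <stddef.h>

/* "Kernel": forward-Euler integration of dx/dt = -k * x for one parameter k.
   No branch depends on k, so every instance follows the same path. */
static double decay_kernel(double k, double x0, double dt, int steps) {
    double x = x0;
    for (int s = 0; s < steps; s++)
        x += dt * (-k * x);
    return x;
}

/* Parameter sweep: apply the same kernel to every entry of k[]. */
void sweep(const double *k, double *out, size_t n,
           double x0, double dt, int steps) {
    for (size_t i = 0; i < n; i++)
        out[i] = decay_kernel(k[i], x0, dt, steps);
}
```

Because the instances differ only in data, not in control flow, all lanes of a SIMD group stay converged for the whole run.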


Did you see https://news.ycombinator.com/item?id=18249651 about GPU programming with Julia?

You still need some understanding of gpus, but at least you can write your code in Julia rather than CUDA or OpenCL.


As a longtime user of Julia I'm convinced that the julia approach to gpus is wrong -- for being insufficiently Julian. I would love for the gpus to be registered as independent GPU nodes accessible by the Distributed module, to which you dispatch async compute tasks.


There was an "Intro to GPU programming in Julia" post on HN one or two days ago


> Part of the problem is I don't really know how to think about a GPU. Can I think of the Vega GPU as a processor with 64 cores, a SIMD width of 64 (64 * 64 = advertised "4096), and more efficient gather loads/scatter store operations?

From my limited experience, that's the wrong way of looking at things.

SIMD is good at looking at "converged" instruction streams, and bad at "divergent" instruction streams.

"Converged" is when you've got your bitmask (in AMD / OpenCL: the scalar bitmask register) executing all 64-threads as much as possible. This means all 64-threads take "if" statements together, they loop together, etc. etc.

"Diverged" is when you have a tree of if-statements, and different threads take different paths. The SIMD processor has to execute both the "Then" AND the "Else" statements. And if you have a tree of "if / else" statements, the threads have less-and-less in common.

-------

That's it. You try to group code together such that as much code converges as possible, and diverges as little as possible. It helps to know that the work-group size on AMD can be anywhere from 64 to 256 (so keep your "thread groups" together as large as 256 at a time).

The OpenCL compiler will automatically translate most code into conditional moves and do its best to avoid "real if/else branches". But as the programmer, you just have to realize that nested if-statements and nested-loops can cause thread-divergence.
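To show what "translate into conditional moves" means, here is a small C sketch of the same clamp written with a real branch and with a mask-based select. This is plain CPU code with hypothetical function names, not OpenCL compiler output; it only illustrates that the predicated form computes both sides and blends them, so no lane ever leaves the common instruction stream.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: each element picks one side of the if/else. */
void clamp_branchy(const int32_t *in, int32_t *out, size_t n, int32_t limit) {
    for (size_t i = 0; i < n; i++) {
        if (in[i] > limit)
            out[i] = limit;
        else
            out[i] = in[i];
    }
}

/* Predicated form: evaluate both sides, then blend with an all-ones or
   all-zeros mask, so control flow is identical for every element. */
void clamp_select(const int32_t *in, int32_t *out, size_t n, int32_t limit) {
    for (size_t i = 0; i < n; i++) {
        int32_t then_val = limit;                  /* "then" result */
        int32_t else_val = in[i];                  /* "else" result */
        int32_t mask = -(int32_t)(in[i] > limit);  /* -1 if true, 0 if false */
        out[i] = (then_val & mask) | (else_val & ~mask);
    }
}
```

The trade-off is that both sides always cost work, which is cheap for short bodies like this and expensive for deep trees of if/else.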

---------

Matrix maths are the easiest for SIMD, because of how regular their structure is. When you get to "real" cases like... well...

> My code does lots of different things, many small, but given a billion iterations, it can add up. If you're running a Monte Carlo simulation over millions of different data sets, fitting each iteration with Markov Chain Monte Carlo, with a model that has a few long for loops and requires (automatic) differentiation and some number of special functions... It's not hard to rack up slow execution times out of small pieces.

Okay, well that's why it's hard to program properly on a GPU. Because they're all diverging. Unless you figure out a way to "converge" these if statements, it won't run on a GPU correctly.

Chances are, if you have a million-wide Monte-carlo, a lot of those threads are going to be converging. Can you split up the steps and "regroup" tasks as appropriate?

Let's say your million threads run into a switch statement:

* 106,038 threads will take path A.

* 348,121 threads take path B.

* 764 threads take path C.

Etc. etc.

Can you "regroup" so that all your threads in pathA start to execute together again? To best take advantage of the SIMD-architecture?

Think about it: a gang of 64 may have thread #0 take PathA, thread#1 take PathB, thread #2 take PathA, etc. etc. So it all diverges and you lose all your SIMD.

But if you "rebatch" everything together... eventually you'll get hundreds of threads ready for PathA. At which point, you gang up your group of 64 threads and execute PathA all together again, taking full advantage of SIMD.

SIMD is an architecture that allows "similar" threads to run at the same time, in groups of 64 or more (up to 256 threads at once). And they truly execute simultaneously as long as they are all taking the same if/else branches and loops. The real hard part is designing your data-structures to maximize this "convergence" of threads.

Something like Chess would practically be impossible to SIMD on a GPU. It's just impossible to handle how divergent the typical chess analysis engine gets due to the precision of chess board positions. But something like Monte-Carlo surely will have hundreds, or thousands, of threads taking any "particular execution path". So I'd have hope for a SIMD-based execution on Monte-Carlo simulations.
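The "regroup" step described above can be sketched on the host side as a counting sort of work-item indices by the path each item will take. This is a minimal C illustration under assumptions: `regroup_by_path` is a hypothetical name, paths are small integer IDs, and a real GPU implementation would typically do this with a parallel prefix sum rather than a serial loop.

```c
#include <stddef.h>

/* Stable counting sort of work-item indices by path ID.
   path[i] must be in [0, num_paths) and num_paths must be <= 256.
   order[] receives all n indices with same-path items contiguous;
   start[p] is where path p begins in order[], and start[num_paths] == n. */
void regroup_by_path(const unsigned char *path, size_t n, size_t num_paths,
                     size_t *order, size_t *start) {
    size_t cursor[256];

    for (size_t p = 0; p <= num_paths; p++)
        start[p] = 0;
    for (size_t i = 0; i < n; i++)           /* histogram, shifted by one */
        start[path[i] + 1]++;
    for (size_t p = 0; p < num_paths; p++)   /* prefix sum -> start offsets */
        start[p + 1] += start[p];

    for (size_t p = 0; p < num_paths; p++)   /* per-path write cursors */
        cursor[p] = start[p];
    for (size_t i = 0; i < n; i++)           /* scatter indices by path */
        order[cursor[path[i]]++] = i;
}
```

After this pass, `order[start[p]]` through `order[start[p + 1] - 1]` are all on path `p`, so they can be dispatched in gangs of 64 that stay converged.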


Using AVX instructions clocks down your CPU by a lot, though. It isn't usually worth it unless you're running mostly AVX.


> So GPUs are good for bulk compute (big matrix multiplies, Deep Learning, etc. etc.)

Maybe. Sparsity of the matrix in question matters a lot. Matrices with very little sparsity, like say an image, do well. Others may not. It’s more like GPUs do well on algorithms with predictable branching.


People aren't benchmarking avx512 because very little is coded to use it yet. You are a small niche.

The future will probably include more avx512, but not yet.


If you're running anything CPU-intensive these days, you are definitely going to use those AVX cores.

Video Editing, Video Games, Graphics, 3d Modeling, Photoshop. Even Stockfish Chess uses new instructions (not SIMD: but the Bit-board popcnt and pext / pdep instructions) to accelerate chess computations.

At a bare minimum, AVX grossly accelerates memcpy and memset operations. (Setting 256-bits per assembly instruction instead of 64-bits per operation is a big improvement). And virtually every program can benefit from faster memcpy and faster memsets.

"Standard Software" isn't written to be very fast. But anything that's even close to CPU-bound is being upgraded to use more and more SIMD instructions.


In theory, yes. I was looking to build a new rig, and I do a lot of video. I'm running a 2697v2 in one machine and a 3930k in another (both built around the same time, ~5-6 years ago). To be honest, I can't really tell much of a difference in using those for heavy video editing and fx work vs the Threadripper and newer Xeons I'm using at other places. I'm not saying they're not faster, but there's not much of a noticeable difference, if any, in using those machines.


> At a bare minimum, AVX grossly accelerates memcpy and memset operations.

Not necessarily, for large operations and depending on processor generation a simple rep stos/movsb will be simpler (no alignment requirements) and saturate your memory bandwidth just as well as any AVX sequence will with less icache pressure.


I wasn't aware of ERMSB "Enhanced Rep MOVSB". Thanks for the tip.

Seems to be a feature in Ivy Bridge and later, which happens to be around the time AVX2 started.


To be fair, it's issuing a long sequence of microcode ops under the hood using 256bit ops.


> At a bare minimum, AVX grossly accelerates memcpy and memset operations. (Setting 256-bits per assembly instruction instead of 64-bits per operation is a big improvement). And virtually every program can benefit from faster memcpy and faster memsets.

How often are memcpy and memset CPU-bound, though?


Whenever it's in L1 or L2 caches, which on Intel, have 64-byte-per-clock bandwidth. (96-byte-per-clock bandwidth to L1 cache, 64-byte-per-clock theoretical L2 bandwidth... dropping to ~29 sustained)

L3-cache on Skylake architectures has 18-bytes-per-clock sustained, which is still greater than 128-bits per clock bandwidth to your L3 cache.

Soooo... I'd say roughly for any memset or memcpy smaller than 256kB or so... you're going to benefit from an AVX-based memset or memcpy.

At ~8MB or so, where you're hitting L3 cache, it probably is a benefit but not much of one. The important thing is Skylake only has one store unit, so you can only write once per clock cycle.

So... do you write one 64-bit value, or do you write one 256-bit (32-byte) value?

----------

To be fair: I'm pretty sure every compiler's default settings today outputs SSE-based memcpy and memsets (128-bit). So AVX doubles that to 256-bit.
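To make the "one 64-bit store or one 256-bit store per clock" question concrete, here is a C sketch of the same fill written both ways. It assumes GCC or Clang vector extensions, and the function names are hypothetical. Whether the wide version really becomes a single 256-bit store depends on compiler flags such as `-mavx2`, and at higher optimization levels the compiler may well auto-vectorize the scalar loop too, so treat this as an illustration of the idea rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-byte vector of four 64-bit lanes (GCC/Clang vector extension). */
typedef uint64_t v4u64 __attribute__((vector_size(32)));

/* Narrow version: one 64-bit store per iteration. */
void fill64_scalar(uint64_t *dst, size_t n, uint64_t value) {
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Wide version: one 32-byte store per iteration, then a scalar tail. */
void fill64_wide(uint64_t *dst, size_t n, uint64_t value) {
    const v4u64 v = {value, value, value, value};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        memcpy(dst + i, &v, sizeof v);   /* unaligned-safe vector store */
    for (; i < n; i++)                   /* leftover 0..3 elements */
        dst[i] = value;
}
```

With a single store port, the wide loop issues a quarter as many stores for the same number of bytes, which is where the cache-resident speedup comes from.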


The issue - and one that a compiler often can't really reason about very well - is that in using AVX (and especially AVX-512) you incur other performance costs due to throttling the CPU down to lower frequencies and having additional latency issues. Deciding when it's actually worth doing so for a memcpy is, in general, a very hard problem.


This is the problem when I have a Threadripper and can't see these issues, lol. One day I'll get a proper AVX512 machine and try this stuff for myself.

Still, it seems like the documentation elsewhere says that AVX 128-bit doesn't cause any clocking issues. So AVX512 (applied to 128-bit) should still lead to good speeds without any clocking problems.


Right, if you stick to 128 bit registers, you avoid most of the issues. Does that get you much benefit for a memcpy over an SSE implementation on a SKX CPU? I don't think so, but I haven't looked into that particular problem, so maybe I'm missing something obvious here.


Simple test, `memcopy.c`:

  /* Copy N doubles from b to a; restrict promises the arrays don't overlap. */
  void memcopy(double* restrict a, double* restrict b, long N){
    for (long n = 0; n < N; n++){
      a[n] = b[n];
    }
  }

You can check the assembly with:

  gcc -march=skylake-avx512 -O2 -ftree-vectorize -mprefer-vector-width=128 \
    -shared -fPIC -S memcopy.c -o memcopy128.s

  gcc -march=skylake-avx512 -O2 -ftree-vectorize -mprefer-vector-width=512 \
    -shared -fPIC -S memcopy.c -o memcopy512.s

to confirm they use `xmm` and `zmm` registers, respectively. (The 512 version also uses xmm to finish off a remainder. Ideally, it would use masked load/stores, but I haven't seen any auto-vectorizers actually do that.)

I'm using gcc 8.2.1. I believe the option `-mprefer-vector-width` was added recently.

Now I compiled both into shared libraries (drop the `-S` and choose an appropriate file name), and benchmarked from within Julia.

  julia> using BenchmarkTools, Random

  julia> memcopy128!(a, b) = ccall((:memcopy, 
  "/home/chriselrod/Documents/progwork/C/libmemcopy128.so"),
  Cvoid, (Ptr{Cdouble},Ptr{Cdouble},Clong), pointer(a), pointer(b), length(a));

  julia> memcopy512!(a, b) = ccall((:memcopy, 
  "/home/chriselrod/Documents/progwork/C/libmemcopy512.so"), 
 Cvoid, (Ptr{Cdouble},Ptr{Cdouble},Clong), pointer(a), pointer(b), length(a));

  julia> b = randn(32); a = similar(b);

  julia> b = randn(32); a = similar(b);

  julia> all(a == b)
  false

  julia> memcopy128!(a, b); all(a == b)
  true

  julia> randn!(a); all(a == b)
  false

  julia> memcopy512!(a, b); all(a == b)
  true

  julia> @btime memcopy128!($a, $b)
    7.517 ns (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    5.277 ns (0 allocations: 0 bytes)

  julia> b = randn(64); a = similar(b);

  julia> @btime memcopy128!($a, $b)
    10.980 ns (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    7.300 ns (0 allocations: 0 bytes)

  julia> b = randn(100); a = similar(b);

  julia> @btime memcopy128!($a, $b)
    15.060 ns (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    14.031 ns (0 allocations: 0 bytes)

  julia> b = randn(200); a = similar(b);

  julia> @btime memcopy128!($a, $b)
    33.867 ns (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    19.546 ns (0 allocations: 0 bytes)

  julia> b = randn(400); a = similar(b);

  julia> @btime memcopy128!($a, $b)
    53.923 ns (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    28.376 ns (0 allocations: 0 bytes)

  julia> b = randn(800); a = similar(b);

  julia> @btime memcopy128!($a, $b)
    98.277 ns (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    57.960 ns (0 allocations: 0 bytes)

  julia> b = randn(2_000); a = similar(b);

  julia> @btime memcopy128!($a, $b)
    232.138 ns (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    133.229 ns (0 allocations: 0 bytes)

  julia> b = randn(20_000); a = similar(b);

  julia> @btime memcopy128!($a, $b)
    3.688 μs (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    2.759 μs (0 allocations: 0 bytes)

  julia> b = randn(200_000); a = similar(b);

  julia> @btime memcopy128!($a, $b)
    112.598 μs (0 allocations: 0 bytes)

  julia> @btime memcopy512!($a, $b)
    117.437 μs (0 allocations: 0 bytes)
The advantage wasn't very impressive here, but it persisted for vectors of length 20,000. At 200,000 we saw the memory bottleneck hit in force, when it busts the per-core L2 cache. The 7900x's per-core L2 cache is 1,048,576 bytes, which translates to 131,072 doubles.


This is a nice benchmark, thanks for doing it, but the point I was making is somewhat orthogonal to this.

Of course there's a performance advantage from using 512 bit registers for a memcpy - but a memcpy is rarely a major performance bottleneck by itself and is usually surrounded by other code. Unless that code is also AVX-512, you've just made it slower by optimizing the memcpy. My point was that a compiler can't usually decide whether it's worth making the optimization in light of the broader context.

The other point was whether using AVX-512 while sticking to xmm registers is faster than just using xmm SSE/AVX code. I don't have an AVX-512 capable machine at the moment, perhaps you'd like to check if your 128 bit version is any faster than just doing "gcc -march=skylake -O2 -mprefer-vector-width=128" (thereby retaining the microarchitecture optimizations, but sticking to AVX2)?


Here are your C examples for easy viewing: https://godbolt.org/z/5j-QUj


I don’t know about the other software types, but games are not typical users of AVX (at client runtime anyway, the tooling used to make the games is different).


If you roll up all of the 'huge advancements' (IPC, clock speed, all of it) over the course of 7 years, that's probably in the neighborhood of what we used to call the 'annual upgrade.'

Ancient? Yes. Incomparable? No.


The Pentium 4 Netburst architecture was relatively stagnant between the years 2000 and 2004. "Prescott" (2004) was still "Netburst" but Prescott includes AMD64 so I'd say that's a major change.

Still, it's reflected in the naming. Netburst, Willamette, Northwood, Prescott, and Tejas were all called "Pentium 4", and spanned the period from 2000 to 2006.

Sure, there were a few additions: Hyperthreading I guess. But otherwise, the main difference in this timeframe was the 180nm to 90nm scaling (which got "free" GHz upgrades back then without any effort)

In short:

> what we used to call the 'annual upgrade.'

You're exaggerating. Progress has slowed but Willamette (year 2000 1.5GHz) vs Northwood (year 2002 2GHz) Pentium 4s weren't that dissimilar.

Prescott (2004) is even more amusing, as it hit 3.8GHz (Intel Pentium 4 570J). But boy, oh boy... you really can't compare Prescott 3.8GHz against Sandy Bridge (2011, 3.6GHz) or Coffee Lake (2018, 3.6GHz ), even if the base clocks are all close.


And luckily we had AMD then, as we do now (only recently), to pick up the slack. Sure I'm exaggerating a bit, but no more than you are in saying 'incomparable'. Intel upgrades for the last 7 years have been modest at best. If you need a new machine, sure they'll make a fine replacement. But hardly a reason to upgrade year over year (or even every 2-3 years) for performance and/or functionality alone.

I've been sitting on an old i7 Sandy Bridge precisely because Intel hasn't been offering me a compelling reason to upgrade. I'm probably going to finally do it next year... but it won't be to another Intel system.


My 5yo i7-4790K is to this day very fast (gtx 1080, and nvme 2 years ago). May upgrade next year for more threads/vms though.


1. Sandy Bridge to Skylake combined yields barely a double-digit gain in IPC, roughly 14%. I would hardly call that "huge advancements" in IPC over 7 years.

2. It has already been shown that a Sandy Bridge overclocked to 4GHz+ performs faster in some cases than the Skylake uArch at the same clock speed. Remember, Sandy Bridge was 32nm.

3. "Netburst, Willamette, Northwood, Prescott, and Tejas": you made it sound as if there were many generations. All of them are part of the Netburst uArch; Willamette was the first gen and Tejas was cancelled. So there is only "Willamette, Northwood, Prescott", each with a much improved clock speed, a better implementation of HT, a better memory controller and cache read improvements, and finally AMD64, all of which translate to substantial performance improvements, regardless of power usage.

Netburst may not have been a good uArch, but it definitely brought many improvements. The writing was on the wall in 2004 and we got Core 2 in 2006. And we have been stuck with quad-core desktop CPUs even though we have had over a 4x improvement in transistor density.


«The i9-9900k is way better...»

You forgot the most important point: the i9-9900k has TWICE the number of cores of the i7-2600k.


CPU speed improvements have slowed down compared to the early 00s, but there's still a big difference in performance between 2011 and today.

On CPU Benchmark the i7-2600K scores an 8449 [1], while the i7-8700K scores a 15970 [2]. The i7-8700K is the predecessor to the i7-9700K reviewed in the article.

[1] https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-2600K...

[2] https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-8700K...


The real score is 1942 against 2704. Not even 50% progress in almost a decade. It's disappointing.

The score you gave is the multi-threaded benchmark, which is higher simply because the chip has 6 cores rather than 4, and regular applications don't use them.


Another point is that the 2600k was good for OC, and 4.8GHz overclocks were pretty common (not sure about 5GHz).

Newer Intel CPUs turbo higher at stock (you could say that OC is sort of built in) so single threaded performance looks better than it really is.

Now 32nm at 4.8Ghz would be quite power hungry but that is not the issue here.

Pretty much all the progress Intel has achieved in the last 7 years since Sandy Bridge has been in achieving the same performance for less wattage focusing on mobile use cases.

(on server side Xeons have increased the core count substantially but again single threaded performance IPC has stagnated)

Thus mobile i5-8250U gets to claim that it does about the same in 15W that 2600k did at 95W.

Remember we are still on the same Skylake Arch from 3-4 years ago, just a few minor process tweaks on 14++++nm plus obviously 2x the number of cores.


Some do. But in those cases, you might be better served with Ryzen.

If you're playing games, you'll get better performance with this i9.

But if you're playing games, you're probably not CPU-bound anyway.

I'd be interested in what niches people here work in and what CPUs they look for because of their unique needs.


In all the truly multithreaded benchmarks the Ryzen and ThreadRipper chips really punch above their weight (price).


the people who care the most about frames per second are playing esports titles that don't demand much from the GPU. counterstrike is still CPU bound in most cases.


Esport is not about graphics quality at all. Try Assassins Creed - it will eat all of your CPU cores.


>But if you're playing games, you're probably not CPU-bound anyway.

We used to say this, but I don't believe it anymore. A lot of games these days make heavy use of simulation and physics engines. Games like KSP that perform continuous physics calculations are absolutely CPU-bound.


You will also be cpu bound often when going for extremely high refresh rate 1080p. E.g. 240hz.


WoW is ST CPU-bound :( until next patch!


Feature differences and bandwidth can still be a significant issue though, and could become more so as raw single thread general CPU performance increases slow down while the surrounding ecosystem continues to improve. More PCIe lanes, more special purpose instructions, more features like ECC and IOMMU support becoming universal standards would all be nice outcomes to see.

This could represent a real additional long term opportunity to AMD as well, because historically Intel has been pretty fond of creating profitable product line segmentation via only allowing certain features on certain chips. And they've been mediocre about boosting interconnect capacity as well. Perhaps AMD doesn't necessarily need to match them as well in single thread performance if they can get the absolute difference down to a few percent, but be much friendlier than Intel in terms of not artificially restricting useful extra CPU features? I wonder if that becoming an area of competition again could bring some nice energy back to the market?

It's been something that's personally making me reconsider at least. I used to stick to Intel more despite price differences because of their performance advantages, but now I look at the rest of the ecosystem and start to calculate adding a few GPUs, NVMe storage, 10G networking, possibly other accelerator/utility cards and future interconnects, and the PCIe lanes seem to get sucked up a lot faster than they used to. With Ryzen/Epyc AMD is offering a full 64 lanes it looks like vs Intel's 24 here, and even with the latest Xeon I think it's now up to 48 lanes from 32? The difference there isn't nothing if overall system performance starts to depend less on just the CPU than on the overall ability to move data around and keep various more special purpose accelerators fed.


My overclocked 6600k vs Ryzen/1800x at stock speeds is a night and day difference in "real world" performance. Even though general clock speeds are not that much different, the Ryzen machine is the one I want to use.

Desktop productivity improves dramatically with a bunch more cores. Multiple browser tabs, background services, code editors open, VM's open... you can throw a lot of things at the Ryzen machine before you start noticing it starting to slow down.

On the 6600 machine it doesn't take much before you notice it chugging.


I'm curious about the amount of memory as well as the drive performance... if you went from 8GB on an HDD to 16+ with an SSD or NVME, huge difference.


Is the amount of RAM the same?


Not the OP but I upgraded from 6600K -> 1800X and honestly they're not even in the same ballpark performance-wise. When you come across software that can take advantage of multiple threads it's not even a contest, the 1800X just walks away laughing.

One example I can think of off the top of my head is gaming and transcoding (Plex server) at the same time. Even with QuickSync enabled and running, transcoding HEVC->x264 would regularly stutter and drop frames while a game is running, while also affecting the game's framerate. The 1800X (without hardware accelerated video transcoding) just plows through it like it's nothing. I've even had it transcode 10-bit HDR video while playing Forza Horizon 4, not a single loss of framerate in either the game or the video.

The extra cores/threads are nothing to scoff at when software can take advantage of them.

My upgrade consisted of only the board and CPU; same RAM, GPU, PSU, SSD, etc.


Same amount of ram, different speeds (2133 vs 3000). I attribute the performance largely to the additional cores... the machine has that processor power sitting ready and waiting.


Check out this benchmark comparison: https://cpu.userbenchmark.com/Compare/Intel-Core-i9-9900K-vs...

It tells you more than the spec sheet comparison.


It's better than indicated in GP, but as an indicator of how good modern CPUs are in comparison to older ones, this page overstates the case. The older CPU is 4-core but 8-thread (hyperthreads), the newer one is 8-core and 8-thread (no hyperthreads). The multi-core benchmarks are much higher for the newer CPU because it has twice as many cores, so that is inflating the overall score quite a bit. That said, it's not as simple as halving that score, as hyperthreads don't scale quite the same way, so it depends on how the multi-core test is formulated.

Edit: Oh, wow, as taspeotis points out it's not even the same two CPUs in the original Intel Ark comparison (which I was working off of from memory when noting cores and threads). The i9-9900K is actually an 8-core and 16-thread (hyperthreads) CPU, so it is actually literally twice as many cores and threads. It's still better performance, but not nearly as lopsided as it even seemed before, given those resources.


GP said “I'm comparing the i7 here to the i7 in my current computer” and you linked to an i9 vs i7 comparison...


53% single core clock improvement over almost 8 years.

Transistor count has increased from 1B to about 7B, which is roughly in line with Moore's law though...
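A quick check of that arithmetic, taking the 1B and 7B figures and the roughly 8-year span from the comments above at face value, and assuming the common "doubling every two years" reading of Moore's law:

```python
import math

def implied_doubling_period(start, end, years):
    """Years per doubling implied by growing from start to end over `years`."""
    return years / math.log2(end / start)

def projected(start, years, doubling_period=2.0):
    """Count after `years` if it doubles every `doubling_period` years."""
    return start * 2 ** (years / doubling_period)

print(round(implied_doubling_period(1e9, 7e9, 8), 2))  # 2.85
print(projected(1e9, 8) / 1e9)                         # 16.0
```

So 7x over eight years works out to a doubling roughly every 2.85 years, a bit slower than the strict two-year version, which would predict about 16x.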


Still rocking my LGA1366 Motherboard from 2008 with a Xeon X5675 @ 4.0ghz that I "upgraded" from an i7-920 for $90 on ebay, 16GB of RAM, SSD, and an nvidia 1070 as the gaming PC. Handles everything at 2560x1600 60fps+ ultra quality (AA generally not necessary). Really don't quite understand the need to go crazy on computer hardware these days. Upgrading your GPU every 2-3 years seems plenty.


There is no way short of some magical overclocking that a 1070 runs everything at 60 fps+ at that res. It has trouble at 1440p with many games. Note the minimum fps (which is the key factor):

https://www.tomshardware.com/reviews/nvidia-geforce-gtx-1070...


I can provide ALL the proof you need on my Asus STRIX 1070. I run everything at 2560x1600 synced to 60fps on my 30" monitor with Ultra settings. Those benchmarks are all enabling AA, which is just wasteful and not required at such resolutions. So yes, with AA, below the 60fps target. But right now my library of 418 Steam games, all run rock solid at those framerates. Metro, Batman, BF1/4/5, Witcher, Alien Isolation, Fallout, GTA, Bioshocks, whatever I throw at it.

Though the point of this discussion was the CPU not being critical anymore. If you need more FPS, buy a more expensive video card, not a CPU.


I'm rocking the 920 with 6gb still. I have been hesitating to upgrade to the Xeon. I have always been so salty that I jumped in just before the EFI boat, I feel my motherboard is so archaic. Is the performance jump to the Xeon very noticeable?


I think the 920 was fine, but the Xeon granted two more cores, and higher overclocking frequency for me, plus it was only $90. So more future proofing, as games take more advantage of multithreading. Increasing your RAM definitely should be on the list though.


Well those stats don't really highlight most real-world performance differences. If your expectation is that performance is tied to clock speed and you wanted to see 8 Ghz by now, then sure that's a disappointment, that's not reality though.

And besides what we're seeing now and why CPU's are viable for much longer is they got extremely powerful and user demands for more power in general computing kind of tapered off. I mean nowadays you can run multiple programs, a browser with dozens of tabs, multiple other utilities and background tasks pretty easily. You can run way more stuff than a person can reasonably multi-task. You have to get into some pretty specific workloads to really utilize all the power a modern CPU has and be wanting more.

And yeah the performance difference between generations has been pretty meager, but over 5, 6, 7 generations those modest improvements do add up. Going from a 2600k to a 8700k or 9700k is a big leap. But if you're not desperate for a lot more power you just might not care about 50% more performance.


I have a Kaby Lake i7 and my last build was a i7-920 (Bloomfield). The old chip is a beast but the power consumption is a lot more. One thing I use is Intel QSV and the intel graphics chip. I don’t need a GPU at this time and it’s nice to have the option.


I am currently using an early 2008 MacPro and I too had a time in my life when I expected I would have to build a new computer every two years.


What are you comparing, because 50% more cache and double the memory and PCIe bandwidth is actually a really big deal?


It's worth noting this is largely a paper launch thus far:

https://www.reddit.com/r/intel/comments/9pidjh/any_9900k_ama...

The October 19th date that everyone was given thus far seems to be a fantasy. The new processors aren't in stock anywhere, and as of yesterday every single retailer said they were shipping today. After waiting weeks for the 19th to arrive, it's rather infuriating.


I saw some 9700k in stock at Microcenter in Brooklyn this morning


9700k is in stock everywhere. 9900k (the i9 with hyperthreading) is still nowhere to be found.


Amazon doesn't even have the 9700k right now. Newegg does, however if memory serves they didn't earlier today.


> When Intel’s own i5-9600K is under half the cost with only two fewer cores,

> or AMD’s R7 2700X is very competitive in almost every test, while they might not be the best, they’re more cost-effective.


It all depends on what you're doing with it. For gaming alone, current games are barely using 6 threads, so a 2700X is complete overkill with 16 threads. It's always a bit revealing that people leap to recommending AMD's flagship processor as some sort of a value king for gaming. You really should save $100 and go with the 2600X instead, there's virtually no difference in gaming benchmarks.

The 9600K, 9700K or 2600X are really what gamers should be looking at. The only thing that's going to make use of a 2700X or a 9900K right now is productivity tasks (video encoding, 3D modelling, etc).

Intel still has a lead in high-refresh gaming (which is a coded way of saying they have better gaming performance that is currently bottlenecked by GPU performance), so if you're primarily a gamer it's still worth leaning to the 9600K/9700K. And the 9900K, while expensive, combines that gaming performance with a ~20-25% lead in productivity tasks. Expensive, but hands-down the fastest thing on the market for at least a year.

Threadripper is great for productivity stuff, but since it's NUMA it really doesn't perform any different than a 2700X for gaming. Also, the 2990WX has an even weirder configuration where half the dies don't have access to RAM, so it's hard to recommend except as a specialty product. 2950X is the generalist recommendation there.

Next year the Zen2 upgrade will hopefully bring AMD up to parity with Intel in gaming, and probably beat them in productivity. But it's hard to recommend planning on buying two different processors to "save money", so if that's your cup of tea, just wait until next year.

So, the recommendations look like:

60 Hz gaming => buy a 2600X

high-refresh gaming => buy a 9700K

60 Hz gaming+productivity or productivity only => buy a 2700X

high refresh gaming and productivity => 9900K

really heavy productivity stuff => 1950X/2950X/wait for 3800X/3950X next year


There are some games, like the new Assassin's Creed, that benefit from having 8 physical cores. The 2700 is probably not the best value proposition to play games that are out right now, but it probably is a good value if you want a chip that will handle new games for the next 4 years without needing an upgrade. Games today may be using 4 or 6 threads but it wasn't that long ago when games ran on only 1 or 2.


Are you saying that when you start a game, you are closing all the programs you have opened and disabling all the services/background processes? If a game uses 6 cores, you can have 2 more for your office, browser and god knows what else running without slowing you down.


Just because the program is running doesn't mean it's using any appreciable amount of CPU. Most applications spend their time waiting for IO, and when you're not interacting with them, there isn't any.


Sure, but are you checking that every single application you have opened does "nothing" before you start a game? The point is to not be worried about such things.


Background services (Discord, etc) use a negligible amount of processor power. It should be like sub-5% on a modern processor, with the processor clocked down to its lowest frequency. With the processor at turbo, you should be pulling 1-3% easily, even actively using multiple tasks.

These days even Chrome is very aggressive about throttling background tabs, much to the consternation of some webdevs here.

https://arstechnica.com/information-technology/2017/03/chrom...


So much this, to say "you only need x cores because the software can only use x cores" is a little shortsighted, because there's always stuff happening in the background.


> For gaming alone, current games are barely using 6 threads, so a 2700X is complete overkill with 16 threads.

But then so is the 9900K, or even the 9700K. If you legitimately don't have anything that can use more than six threads, why are you paying for eight cores? Both vendors offer processors with fewer cores for less money.


In the UK the 2700X is half the price of the i9-9900K and trades blows with it on the stuff I mostly care about.

I'm truly loving the 2700X I built last week, it's a nice bump over the 1700 at work.


Is the 2700X that noticeable of a bump over the 1700? On paper it doesn't look _that_ much better.


The L2 cache dropped from 17-clocks to 12-clocks of latency. It's not a big improvement, but there's just better power-management, better caches, better memory compatibility (!!!), etc. etc. But enough that you get a real boost in practical benchmarks (FPS in games or Handbrake rendering).

The compatibility problem was probably most annoying. The 1700 didn't work with a lot of RAM when it launched, requiring BIOS settings to change. You'll still see AMD builders praising the "Samsung B-Die" (the most compatible RAM with AMD Ryzen).

The 2700x just works overall better. I'd personally consider it due to all of the little changes and little improvements.


You're probably better off waiting for 3rd gen Ryzen which sounds like it will be a nice all-around performance bump. Should be arriving in ~6 mo.


That's what I'm waiting for, around that time my 4790k will be 6yo, and the bump should be worth it... more ram and threads will be nice. May wait for the TR refresh as a consideration too... longest I've ever held on to a desktop PC.


I'm in the same boat. Been running my SB for ~2x longer than I've ever stuck with a primary system before. No complaints, it's been a great system. But I'm really looking forward to getting more threads. Probably won't go with TR though, I think the mainstream Ryzen will do what I need and won't have to deal with the latency idiosyncrasies.


Yep, I'm coming from a 4690 and my plan is similar. Hopefully Zen 2 provides a substantive enough improvement but even if not I would still be quite happy.


Are there any games nowadays that are CPU bound to the extent that you'd see a big performance improvement by using a high-end processor?


CPU performance also tends to matter for competitive gaming where you will normally turn down graphics in order to match your monitor's frame rate (144 or 240hz).

Overwatch is one game where in order to get 240hz, your RAM speed and timing can be a bigger bottleneck than your graphics card.


Anyone that thinks 240hz is useful is a nutcase. Show me a double-blind test that proves it. People will go far defending their outrageous purchases however.


On CRT monitors the difference between 120 and 200Hz was very noticeable in fast arena shooters. On LCDs with their inherent motion blur there's probably less of a difference between 144 and 240Hz, but I'm not going to call it placebo due to my previous experience with high frequency CRTs.


May I ask how old you are?


Games with many different objects interacting with each other tend to stress the CPU. I know Factorio likes having very fast memory and CPU clock speed despite not being too graphically intensive.


Also particularly the fact that Factorio runs a complex sim in "real-time" which means it has something like 17 or 33 ms to calculate every tick.

Games like X-COM have the luxury of only having to advance the simulation every so often. The rest is just rendering the game's view.
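The 17 and 33 ms figures above are just the per-tick time budget at 60 and 30 updates per second. A minimal sketch of that arithmetic, assuming a fixed-timestep simulation (the function names and the 25 ms example cost are made up for illustration):

```python
def tick_budget_ms(updates_per_second):
    """Milliseconds available per simulation tick at a fixed update rate."""
    return 1000.0 / updates_per_second

def max_sustained_ups(tick_cost_ms, target_ups=60):
    """Update rate actually achievable if each tick costs tick_cost_ms."""
    return min(target_ups, 1000.0 / tick_cost_ms)

print(round(tick_budget_ms(60), 1))  # 16.7
print(round(tick_budget_ms(30), 1))  # 33.3
print(max_sustained_ups(25.0))       # 40.0
```

Once a big factory pushes the tick cost past the budget, the whole simulation slows down, which is why single-thread speed and memory latency show up so directly.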


Same for rimworld, though I'm not quite sure as to what extent.


Not many, but a few. Civilization still thinks quite hard in the late game.


i7-9700K offers about 10% better performance than the 2700X for this[0][1] for about the same percentage increase in price. Pretty decent. Worth noting that the i9-9900K and i7-9700K are about the same here.

0: https://www.pcgamesn.com/intel-i9-9900k-review-benchmarks#nn...

1: https://www.tomshardware.com/reviews/intel-core-i9-9900k-9th...


I dunno where you are getting your numbers from but currently 9700k is approximately 50% costlier than 2700x. $420 vs $280. Heck - I can walk into a nearby Microcenter and get both 2700x CPU + motherboard at $350.


10% doesn't matter if it exceeds what your monitor can handle. For example, in Tech Report's benchmarking, even a i5-8400 is well over 60Hz in many current games at 1080p on maximum settings. [0]

If you want to run the latest games on max settings at 4K/120hz, it's a different story. Personally? I remember when reviews gushed over 30fps at 800x600.

[0] https://techreport.com/review/34192/intel-core-i9-9900k-cpu-...


Did you respond to the wrong comment?

No one much cares about the framerate in civ6. It's about AI turn calculations in the end of the game, which is what the benchmarks people were linking are.


But it's turn based— most of the time that you're playing, nothing is happening! Surely it could be running its simulations and AI speculatively and then just tossing out the bits which aren't relevant any more as the player makes changes to the world state.


The number of possible things you can do is a Large Number, so I doubt how possible this would be. Even if it were - it would absolutely murder battery life. Civilization runs on laptops, and now iOS too.


that would be extremely complex to be able to toss out the bits which are no longer relevant. everything might be no longer relevant.


Most turn-based bots are in some way derived from minimax— you peer down the road looking at the space of possible moves and evaluate them based on an assumption that your opponent will respond with the strongest possible move available to them, so each move you can make now is only as good as the game state after the opponent's response, recursively.

I think basically as you're building up this tree you'd add annotations at each branch/ply which call out things the human user (opponent) could do that strengthen or weaken the position. Then as the human performs their actions, the tree is pruned/rebuilt accordingly.
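For anyone who hasn't seen it, plain minimax as described above looks roughly like this. It is a toy sketch on a made-up stone-taking game (take 1 to 3 stones, whoever takes the last one wins); the game, the helper names, and the scoring are all assumptions for illustration, and nothing here reflects how Civilization's AI actually works. The speculative pruning/rebuilding idea is not shown.

```python
def minimax(state, maximizing, moves, apply_move, score):
    """Best score the maximizing player can force from `state`."""
    options = moves(state)
    if not options:
        return score(state, maximizing)
    results = [
        minimax(apply_move(state, m), not maximizing, moves, apply_move, score)
        for m in options
    ]
    # Each side is assumed to pick its strongest reply.
    return max(results) if maximizing else min(results)

# Toy game: state is the number of stones left in a pile.
def moves(pile):
    return [n for n in (1, 2, 3) if n <= pile]

def apply_move(pile, n):
    return pile - n

def score(pile, maximizing):
    # No stones left: whoever is to move has lost (the opponent took the last one).
    return -1 if maximizing else 1

def best_move(pile):
    """Move for the maximizing player with the best minimax value."""
    return max(moves(pile),
               key=lambda m: minimax(apply_move(pile, m), False, moves, apply_move, score))

print(minimax(5, True, moves, apply_move, score))  # 1
print(minimax(4, True, moves, apply_move, score))  # -1
print(best_move(5))                                # 1
```

Even in this tiny game the tree is walked in full for every call; a Civ-sized state space is why the replies here doubt that precomputing it during the player's turn is practical.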


Yes with Chess you can do exactly that no problem, but I bet Civ has a thousand heuristics woven through that which would make it horribly difficult. Maybe not impossible though!

The other complication is it isn't just 1 on 1, it's 1 player vs ~10 players. So your move changes player 2's move changes player 3's move and so on.


Source based games like Counter-strike Global Offensive, mainly due to the engine being older. I actually built a mid-range gaming/VR setup with AMD processor years ago with the thought that cpu didn't matter much but later got into competitive CSGO. Quickly realized I made a mistake as the Intel CPUs of comparable price were much better for that game. Ryzen changed that a bit though I don't have a build with one.


just getting the data to the gpu in a game can use quite a bit of cpu power.

as well competitive gamers want ~140+ FPS so even when gameplay code is not super complex per frame, it is a lot of frames.


Dwarf Fortress! But it's an outlier...


Depends on how many cats you have wandering around


Kerbal Space Program will definitely benefit, especially if you have big part count stations and ships.


Simulators are still very highly single-threaded performance-bound. Particularly flight simulators.


That's surprising to me; wouldn't they be able to represent huge chunks of world state as giant matrices, upon which operations are trivial to parallelize?


Yes, but most flight sims are based on very old codebases. X-Plane has started to change, and Prepar3d too, but I imagine they're so complicated it's difficult to just do without breaking a lot of things.

For the time being, even with latest versions, single-threaded cpu performance is the largest concern in these sims.


I think minecraft with a lot of entities can be pretty CPU bound.


StarCraft 2


World of Warcraft


Only for games that require lots of individual object/ai management, like Ashes of the Singularity or maybe civ games


Cities Skylines crawls to a halt in big cities.


My favorite game, but I hate needing to start over so often because of performance issues. My 6700K just throws the towel when the city gets big.


because the simulation is singlethreaded.


Print link loads about 700 graphs and images and lacks internal navigation...

Alt: https://www.anandtech.com/show/13400/intel-9th-gen-core-i9-9...


Why aren't they benchmarking it against the latest Threadrippers? They've got a i9-7900X up there which costs the same as the 2950x.


I think to give you perspective against previous HEDT Intel chips that had more than 4 cores.


> The other angle is one of the recently discovered side-channel attacks that can occur when hyper-threading is in action. By disabling hyper-threading on the volume production chips, this security issue is no longer present.

This line of reasoning feels unconvincing. If they were really worried about this they should disable it everywhere.


There's no conceptual vulnerability if the system the chip is being integrated into isn't multi-tenant (where a "tenant" can be a literal time-share user, or something like a website's Javascript code executing in a web browser.)

So: cloud computing, and consumers that might be running dumb code from the internet (= i{3,5,7} and Xeon = volume production) should get chips with no hyperthreading; while HPC, locked-down workstations, and gaming PCs that probably don't have any important data on them (= i9 = low-batch-size production) should get chips with hyperthreading.


Unfortunately, it's the multi-tenant applications that tend to gain the most from hyperthreading.

For HPC it's usually turned off, and I think for gaming and workstations it also has little if any use.


I think I've seen either Intel or AMD explain hyperthreads as being useful in the case of a gamer who is also using a software encoder to stream their gameplay.

Also, HPC doesn't necessarily imply a uniform workload; it just implies that all workloads are trusted. Really, modern HPC clusters are still "multi-tenant"; they're just tenants who are all e.g. employees of the same company, or researchers at the same university. So you can describe them, from a security perspective, as one "tenant." None of them would have anything to gain by doing a side-channel attack against one-another.


> Intel is also forgoing hyper-threading on most of its processors. The only Core processors to get hyper-threading will be the Core i9 parts, and perhaps the Pentiums as well ... The other angle is one of the recently discovered side-channel attacks that can occur when hyper-threading is in action. By disabling hyper-threading on the volume production chips, this security issue is no longer present.

Does this give AMD an advantage with Ryzen and Threadripper, since AMD supports hyperthreads?


AMD has a cost/performance advantage. The Ryzen 2700x sells under $280 with a free cooler for 8c/16 threads.

The Intel i9-9900k sells for $550+ to $580+, requires an added cooler, and 8c/16 threads.

The difference is that the i9-9900k can single-thread turbo to 5GHz, which makes a big difference in single-threaded situations. The i9-9900k also has 256-bit AVX pipelines, while AMD Ryzen is 128-bit based AVX pipelines.

So any AVX-heavy code (H264 rendering, Blender benchmarks, etc. etc.) will be hugely beneficial on the Intel platform.

-----------

AMD still wins price/performance. I'd personally still take a Threadripper 2950x for example. But the i9-9900k is impressive because of that 5GHz clock, even if it's a bit restrictive to use.

Anyone who cares about maximum single-threaded speeds is looking at the 5GHz clock. That's huge.
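On the 128-bit vs 256-bit point above, the difference is simply how many values one vector operation can cover. A minimal sketch of that arithmetic; it only counts lanes per operation and assumes nothing about clocks, ports, or real-world throughput:

```python
def lanes(vector_bits, element_bits=32):
    """Number of elements of `element_bits` that fit in one vector register."""
    return vector_bits // element_bits

for bits in (128, 256, 512):
    # width, float32 lanes, float64 lanes
    print(bits, lanes(bits), lanes(bits, 64))
# 128 4 2
# 256 8 4
# 512 16 8
```

So a 256-bit pipeline handles 8 single-precision floats per operation where a 128-bit one handles 4, which is the headroom AVX-heavy encoders can exploit.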


> The difference is that the i9-9900k can single-thread turbo to 5GHz

minor nitpick: the 9900k can actually turbo to 5GHz on two cores simultaneously.


I think so. I'd bet that the reason Intel is removing HT is for market segmentation rather than architectural reasons.


For server usage, where you have multiple processes involved, I seem to find that memory bandwidth is very important. If the CPU can not get the data it cannot work with it...


Probably that's why the Skylake Xeon CPUs have 6 channels.


I can't wait for the Skylake-X refresh. I use my computer for real-time audio and although the processing scales to multiple cores the "real-timeness" is limited by the speed of the single-core. Sadly AMD still is nowhere near Intel in this regard. I was hoping 2990WX could be an upgrade to 1800X, but sadly it won't be as the single core can also be limited by not having access to local memory. It is Intel this year.


I really like that Die Sizes from Wikichip chart. For a quad core processor, die size went down from almost 275mm2 (Nehalem) to 125mm2 (Kaby Lake) over the last decade, but that didn't translate to a cheaper price or much better performance. Intel worked hard to prevent either from happening in their mainstream desktop processor product line.


Would rather save the bucks and go with a Ryzen 2700 unless I was doing avx512 work but who’s doing that these days?


The i9-9900k does NOT support AVX512.

The i9-9900x does. (x, not k). Confusing? Yeah. Blame Intel's marketing.

Still, the i9-9900k is faster at AVX2 (256-bit), which is useful in H264, Photoshop, and a number of graphical productivity applications. And you can see the benefits of it: the i9-9900k gets close to Threadripper performance on Handbrake due to the AVX2 support, even with half the cores.


The Ryzen chips do AVX2 as well though for some reason the Intel parts often do better.

Media applications makes sense to use such cpu features.


From other parts of this thread, AMD's implementation is only 128b, while the lower i9...k is 256, and x also has 512.


Yea marketing is really confusing here.

The 9900k is coffee lake v2. The 9900x is skylake (server) v2.

Two totally different chips.


Could you elaborate why AVX-512 is not important? I guess you mean that GPGPU programming is what everybody is doing right now instead of SIMD on the CPU.


It’s important yes but hard to implement I think and not widely used is what I was getting at. For me compiling and running VMs I’d take the Ryzen over this.


I could be talking out of my butt but it’s my understanding that AVX512 isn’t used in but a few niche programs.


Isn't that niche the same as the niche of programs that can realistically max out a modern desktop CPU? Video encoding, compression, and things like that?


according to this only x265 uses AVX512: https://software.intel.com/en-us/articles/accelerating-x265-...

But I wouldn't doubt that things like Video encoding / compression uses this. Again, I was not saying it's not valuable just saying that for my use case where I don't encode videos for a living or run a large Handbrake/Plex video farm / library the i9 cpu is not worth as much to me as the cheaper, cooler running Ryzen 2700.


Disappointing that it's still only 2 memory channels. Would have loved for quad channel to become the norm.


4-sticks of RAM is cost-prohibitive for a lot of builds. Heck, a lot of laptops still ship with only 1-stick of RAM for cost-saving purposes.

2-memory channels is what is expected on a normal, consumer platform. Especially since DDR4 effectively doubled the bandwidth from DDR3 anyway.

In effect: DDR4 2-channels is roughly the same speed as DDR3 4-channels. So that's how RAM scales: it gets faster without necessarily needing more sticks.

In two or so years, DDR5 is going to double-bandwidth again. So 2-channel DDR5 will be the same bandwidth as 4-channel DDR4.

Latencies are all going to be the same though, but that's the nature of the technology.

----------

If you need 4-channels of RAM, x299 exists (and x399 for AMD Threadripper). If you need 6-channels, you got Xeon Scalable, 8-channels got EPYC, 12-channels (or more) for higher-end Xeons. So these high-end systems exist for people who need them.

But most consumers are fine with the ~40GB/s speeds that 2-sticks of modern DDR4 get you today.
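The channel math above can be sanity-checked with the usual peak-bandwidth formula: channels x transfers per second x 8 bytes per 64-bit transfer. A minimal sketch; the specific speed grades (DDR3-1333, DDR4-2666) are picked for illustration and are not from the comment, and these are theoretical peaks rather than measured numbers.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s, assuming 64-bit (8-byte) channels."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0

ddr4_dual = peak_bandwidth_gbs(2, 2666)  # assumed DDR4-2666, 2 channels
ddr3_quad = peak_bandwidth_gbs(4, 1333)  # assumed DDR3-1333, 4 channels
print(ddr4_dual)               # 42.656
print(ddr3_quad)               # 42.656
print(ddr4_dual == ddr3_quad)  # True
```

That lines up with the ~40GB/s figure for two sticks of DDR4, and shows why doubling the transfer rate is worth as much on paper as doubling the channel count.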


True enough, but I personally hit bandwidth limits on occasion for my most heavy processing jobs (data analysis, wouldn't be surprised if the same applied to people doing media editing or working with databases).

Threadripper or cheap EPYC are looking quite attractive to me because of the high bandwidth per core and price-point. I just wish Intel would also compete on this point.


I agree that Threadripper / EPYC look good on that.

Intel's main price/performance competition are Dual Xeon Silvers. It's not talked about very much, but 2x Silver 4114 (https://www.amazon.com/Intel-Xeon-4114-Deca-core-Processor/d...) isn't really that bad. 10-cores x2 == 20 cores at $1500 total.

That's 12-memory channels too (6-channels per socket). You'd have to buy an expensive dual-socket motherboard, but it's relatively high-end / server-class for a reason.

Xeon Gold and Xeon Platinum are just WTF with the pricing though. I guess those are for people who stopped caring about price/performance.

AMD EPYC still seems better on a price/performance front than Dual Xeon Silvers, if only because you can get a single-socket EPYC with 8-channel RAM. But Xeon Silvers aren't really that far away.


> 10-cores x2 == 20 cores at $1500 total.

Isn't the Epyc 7281 basically the same price with 32 cores and 16 memory channels (16 cores and 8 channels per socket)?


Yes, but you have to consider the NUMA-nodes and how good threading is on the two systems.

10-cores x2 == 20-cores on 2x NUMA nodes.

2x EPYC 7281 basically gives you 8x NUMA nodes. Your threads are really spread out, so its bad for major workloads like Databases (which may communicate heavily, and also benefit from the "combined" L3 caches on the Intel Xeon Silver setup)

In many ways, I'd argue that the "real" competitor to the Dual-Xeon Silver 4114 setup is the single-socket EPYC 7401P for $1200ish.

The EPYC 7401P gives you 4x NUMA nodes with 24-cores / 48-threads total, fits into a single-socket motherboard (way cheaper). This minimizes the NUMA crosstalk issue, still gets you a respectable 8-channel memory system and a large core-count.

The EPYC 7401P is the price/performance champ. Still, AVX512 or Database purposes may keep dual-Xeon Silvers in the running.


> Your threads are really spread out, so its bad for major workloads like Databases (which may communicate heavily, and also benefit from the "combined" L3 caches on the Intel Xeon Silver setup)

As soon as you add the second node for a workload with poor locality like that, half your accesses are on the wrong node. No matter how many more nodes you add, the worst it can cost is that much again.

The unified L3 is also probably immaterial when you have a large database like that which exceeds the L3 by hundreds of gigabytes and it's ~100% cache misses either way.

On the other hand, you have 60% more cores to make up for it.

I'm not saying there are no circumstances or workloads where the Intel system makes sense, but it's kind of telling that it's a matter of finding them as exceptions to the rule.

For the general purpose things like running a bunch of unrelated VMs that aren't affected by number of nodes much if at all (or benefit from them because one misbehaving guest or process thrashing all the caches and flooding memory bandwidth only impacts a single CCX/node), it seems like an obvious choice.

It's also going to be interesting to see how databases optimize for multiple NUMA nodes now that they're common. There should be ways to determine which parts of the database are on which node and then prefer to dispatch queries to cores on the same node, or keep copies of the highest hit rate data on multiple nodes etc.
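The locality argument above can be put in numbers. A minimal sketch, assuming the worst case described there: memory spread evenly across nodes and accesses that land on a uniformly random node (both are simplifying assumptions, and the cost of a remote access is not modeled).

```python
def remote_fraction(nodes):
    """Share of accesses hitting another node's memory under uniform random access."""
    return (nodes - 1) / nodes

for n in (1, 2, 4, 8):
    print(n, remote_fraction(n))
# 1 0.0
# 2 0.5
# 4 0.75
# 8 0.875
```

Going from one node to two already makes half the accesses remote; going from two to eight adds another 37.5 points, and no node count can add more than the remaining 50.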


Everything I've seen indicates DDR5 is behind schedule, seems like it's going to be 3 years at least before it reaches high end parts, which means it could well be 4 or more before it helps the 2 channel parts.


Which xeons have 12? Are you talking about dual socket?


Yes. Dual Socket is 2x6 memory channels.

Which is just as legitimate as AMD's 2x4 memory channels on their quad-NUMA single-socket solution IMO.


I don't think that's a legitimate comparison unless you compare a dual socket epyc.


At first I thought this was super unimpressive, not much faster than previous gen, often slower. Then I realized my own cpu is already 3 generations old! And it is quite a huge jump from mine to the new i9! It has only been a few years!


How do those 8 cores communicate? I remember that AMD uses a NUMA architecture that can bite in rare cases. Does Intel use something similar too?


Only the Threadripper has to worry about NUMA and that's because it has two physical dies with 8 cores each. The regular Ryzen 8-core only has a single physical die and thus NUMA isn't an issue.


Intel consumer class CPUs still use a ring bus architecture. Server chips use a newer mesh architecture.


so it looks like intel is having the same issues as amd where more cores doesn't necessarily mean better gaming. i5 is still the best it looks like lol. this puts amd in a better position because they are at least cheaper for their cores.


Threadripper or Core i9 for my next Linux desktop build?


I'm extremely happy with my 1950x, both for the value and the performance.

Linux runs great (I'm using Ubuntu 16.04 desktop currently)


TR platform has a higher "future upgrade potential", can handle 4xGPUs for Deep Learning, but has some issues with its high-end motherboards. Also, memory is still limited to 128GB ECC UDIMM as larger chips aren't out yet (Samsung is sampling some 32GB module right now). If you need single-threaded performance, 9900K is your best choice.


i7 8700K may be the most sensible choice overall really.


Not sure that I agree.. if you have truly multithreaded workloads, the R7-2700X is probably a better option... in the end, it depends on your use and how much you want to spend. System cost is also a consideration.


On one hand I agree for "truly multithreaded workloads" - on the other hand having strong single-threaded performance is still a nice feature to have (e.g. when writing prototypes or simply when using simple SW).

I did evaluate about 2 months ago whether to go for a 2700X or the usual Intel, and based on the benchmarks I saw, the 2700X is still a bit too weak when running single-threaded SW => I then bought an i7-8700 (no "k") just to be "safe" when running semi-single threaded SW (e.g. my current prototype uses 10 threads but has a single coordination bottleneck).

Let's see what happens in ~January (I think that's when AMD is supposed to present the new CPUs that are supposed to be the "real" next-gen with higher IPC?)


A lot of the time I'm working multi-tier software so will have VM and/or docker containers running in support of what I'm specifically working on. And even then it's mostly node/js based, so the NVME is the single biggest boost to my productivity. In any case, it's often browser + app server + backend services running... so I notice the single-thread performance less. TBH my current CPU (i7-4790k) keeps up reasonably well, though I wouldn't mind more threads and more RAM.

When the GTX 1080 came out, I bumped up to that with a 1TB NVME drive (Samsung) which has been fantastic in limited use. Mostly notice the difference in my npm scripts (ci and build). I'm mostly just itching to upgrade more than feeling pain at this point. I'm pretty sure the Ryzen 2 CPUs (not 2xxx) will be sufficient, as will the mid-upper Intel. However, unless Intel brings their pricing into better competitive range, I may go AMD on my next build.



