Very excited for access to more RAM in a consumer form factor. By comparison, the most RAM you can get on a DDR5 desktop motherboard is 128 GB (actually, there may be 48 GB modules as well, so maybe 192 GB), but at that capacity you're not getting anywhere near the rated speeds. The listed 1 TB of capacity sounds super roomy by comparison.
I have 128 GB in my computer with a 5850X; it allows me to load and run the 180B Falcon and 70B Llama 2 LLMs in llama.cpp, although with different quantizations.
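(A rough sketch of the arithmetic behind "different quantizations", assuming plain bits-per-weight and ignoring KV cache and runtime overhead, so the real numbers are somewhat higher:)

```python
# Back-of-envelope for why quantization matters at 128 GB of RAM.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("Llama-2-70B", 70), ("Falcon-180B", 180)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit ~ {model_size_gb(params, bits):.0f} GB")

# Falcon-180B at 8-bit is ~180 GB (doesn't fit in 128 GB),
# but at ~4-bit it's ~90 GB, which does.
```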
- you don’t get GPU acceleration just by using unified memory. Llama.cpp still only uses the CPU on Apple Silicon chips.
- the difference in tokens/sec is likely attributable to memory bandwidth. Mac Studios with the base Max chip have 400 GB/s memory bandwidth compared to around 50 GB/s for the Ryzen 5000 series CPUs
One underused angle for oodles of memory is the humble ramdisk. If you haven't run into these, you set aside a portion of memory to serve as a disk volume. If you have a temporary work product, some kind of intermediate-stage bit you will not save, shoving it in a ramdisk provides some really amazing speedups. You can put a SQLite database in it on the fly, just for analysis, run at blazing speeds, and keep the results. Image an optical disc into your ramdisk, chew away at it, keep the work product, and just clear the memory.
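A minimal sketch of the SQLite-in-a-ramdisk trick, assuming a Linux box where /dev/shm is a tmpfs mount (it is on most distros); the path, table, and query are made up for illustration:

```python
# Anything written under /dev/shm lives in RAM and vanishes on reboot,
# so copy out whatever you want to keep when you're done.
import sqlite3, shutil

db_path = "/dev/shm/scratch_analysis.db"     # RAM-backed file
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE IF NOT EXISTS samples (id INTEGER, value REAL)")
con.executemany("INSERT INTO samples VALUES (?, ?)",
                ((i, i * 0.5) for i in range(1_000_000)))
con.commit()

# Heavy intermediate queries run at RAM speed...
total, = con.execute("SELECT SUM(value) FROM samples").fetchone()
print(total)
con.close()

# ...then keep only the result you care about on real storage.
shutil.copy(db_path, "./results.db")
```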
Try searching deep into the game tree in go. After a few million nodes you can't actually store them all in 16GB of RAM. That's just a day of searching on a 2060, you can get into the hundreds of millions with a faster GPU and a longer search horizon. But when you put it into swap, it won't be as fast...
It can't be for tabs in Chrome; that browser can't even use all 64 GB I offer it ("putting tabs to sleep" is disabled on purpose, as I need the tabs to stay active).
I do, but they crash for memory-insufficiency reasons while half of the RAM is still free. It's a bug; other people have commented on the issue with no solution so far. Some suspect hardware acceleration to be the culprit, but I ruled that out.
what I meant to convey in the first place: it doesn't matter how much hardware you throw at an issue if the software can't use it
Finally, I've been waiting for a non-Pro Threadripper since the 3000 days. The 64-core 7980X looks like a good deal, priced a bit higher than the 3990X but still not bad. Gotta see the compile and gaming benchmarks first, though.
I heard rumors though that the mainline Ryzen will get 32 cores perhaps next year, so we'll have to see how that does.
I would be very surprised if Ryzen goes for 32 cores, 24 is more probable. If they do go for that many cores, it will likely be "Zen 5C" cores, which may or may not be competitive with Threadripper being full Zen 4 cores.
Regardless, it's the memory channels and extra PCIe lanes that make Threadripper shine. Oh, and now HEDT gets registered ECC, which is fantastic.
> I would be very surprised if Ryzen goes for 32 cores, 24 is more probable.
The rumor is that the CCX is going from 8 to 16 cores, at which point a 24 core ryzen would necessarily be 2 partially-enabled CCXs. That wouldn't make much sense as their high end configuration which so far has always been 2 fully enabled compute chiplets. So if that 16 core CCX change is true then 32 full fat cores for the top end Ryzen is the most likely scenario. There wouldn't really be any reason for AMD to just refuse to put 2 fully enabled dies together, especially since as you noted Threadripper & Epyc already offer more benefits that Ryzen just won't get.
Yes, and it's Zen 4 cores, so a lot faster than the 3000 days. My only concern is the quad-channel memory. Not sure if the 64-core part will be limited by it.
Did they really do this again? They already have a 12-channel and 6-channel DDR5 socket for EPYC, now they're going to have two completely different sockets just for Threadripper with 4 and 8?
This isn't even market segmentation. Threadripper costs just as much as EPYC. It's pointless.
"However, CXL has suffered from a “chicken and egg” problem, according to Srinivas. Last year, there was a great deal of interest from hyperscalers and integrators, but there were no CXL devices ready.. Now that products are sampling, he said, they can work on applications and see benefits. “Now we are seeing practical implementation and how CXL technology can really help provide that additional memory capacity and bandwidth.” ... "CXL 2.0 platforms likely for 2024"
2.) "Even a laptop can run RAM externally thanks to a little-known tech called CXL — but it’s not for sale to mere mortals who only want 1TB of RAM"
I have a few questions about computer architecture for you hardware folks. These many-core architectures get so complex. I'm a lowly scientist who does some fairly heavy compute in both interpreted languages (Python) and compiled languages (C/Julia). Two questions:
When these 'base clocks' are listed, would a lower-core-count CPU with a higher base clock run more calculations at its max thread count than a higher-core-count CPU? Say 64 threads on a 32-core 4 GHz CPU vs a 2.5 GHz 96-core monster. For my use case I'm often running a serial computation on a fixed dataset that can be threaded per data point. So if I have 32 data points it's only that parallel. Something between embarrassingly parallel and completely serial.
And second. For these massive caches, if I'm doing a massive serial calculation does an individual core get to use the full cache if it's being thrashed? As in, are there cache benefits to running a computation on these huge CPUs vs some overclocked 8 core beast?
> When these 'base clocks' are listed, would a lower-core-count CPU with a higher base clock run more calculations at its max thread count than a higher-core-count CPU? Say 64 threads on a 32-core 4 GHz CPU vs a 2.5 GHz 96-core monster.
The base clock is only when all cores are loaded, and especially on Threadripper / Ryzen is far from meaningful in practice as they will permanently turbo. So it's very possible that the 96 core threadripper and the 32 core threadripper when asked to run the same 32 threads will actually end up running around the same clockspeed.
Clock speed isn't a particularly meaningful measurement anymore, and hasn't been for years. For example, an AMD Genoa chip, depending on SKU, may have fairly comparable base/boost clock speeds compared to an Intel Sapphire Rapids - but in practice the single-core performance of the Intel is going to be substantially better for most code.
Your cache question doesn't really have a simple answer either. E.g. an AMD CPU is split into different CCXs. To simplify somewhat, the chip is broken up into several smaller compute units, each with their own caches, hanging off a shared memory controller. Intel has a completely different ring-based approach that's harder to summarise in one sentence.
Overall though, for the sort of work you're describing the limiting factor is often memory bandwidth, not raw compute. Different platforms have very different membw/core figures, and I suspect if you started measuring that then you'd find it easier to predict your code's performance.
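(For example, here's a crude way to get a ballpark memory-bandwidth number on a given box, assuming numpy is installed; a proper multi-threaded STREAM run is the real tool, but this is enough for rough machine-to-machine comparisons:)

```python
# Crude single-thread memory-bandwidth probe.
import time
import numpy as np

n = 256 * 1024 * 1024 // 8          # 256 MB of float64, well past any L3
a = np.ones(n)
b = np.empty_like(a)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(b, a)                 # streams ~512 MB per pass (read + write)
dt = time.perf_counter() - t0

gbytes = reps * 2 * a.nbytes / 1e9
print(f"~{gbytes / dt:.1f} GB/s effective copy bandwidth")
```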
While at equal clock frequency the Intel CPUs are a little faster in single-thread applications, their main weakness is that at equal power consumption their clock frequencies are much lower in multi-threaded applications, which leads to much lower multi-threaded performance.
This can be easily noticed when comparing the base clock frequencies, which are more or less proportional to the actual clock frequencies that will be reached in multi-threaded applications. For instance, a 7950X has 4.5 GHz versus the 3.2 GHz of a 14900K. Similar differences exist between Epyc and Xeon and between Threadripper and Xeon W.
In desktop CPUs Intel can hide their very poor multi-threaded performance by allowing a much higher power consumption. However this method does not work for server and workstation CPUs, because these already have the highest TDP that is possible with the current cooling solutions, so in servers and workstations the bad Intel MT performance is much more visible. Intel hopes that this will change in 2024, when they will launch server and workstation CPUs made with the new Intel 3 CMOS process.
In the absence of actual benchmarks, a good proxy for the multi-threaded performance of a CPU is the product of the base clock frequency and the number of cores. For Intel hybrid CPUs, an E-core should be counted as 0.6 cores. For example, a Threadripper 7960X should be expected to be (24 cores x 4.2 GHz) / (16 cores x 4.5 GHz) = 1.4 times faster than a 7950X in multi-threaded applications that are limited by the CPU cores (but twice as fast in applications that are limited by memory throughput).
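A toy version of that heuristic, with the E-core weighting described above (it's only a proxy; memory bandwidth, cache size, and ISA differences are deliberately ignored):

```python
# Multi-threaded performance proxy: base clock x (P-cores + 0.6 * E-cores).
def mt_proxy(base_ghz: float, p_cores: int, e_cores: int = 0) -> float:
    return base_ghz * (p_cores + 0.6 * e_cores)

tr_7960x = mt_proxy(4.2, 24)      # Threadripper 7960X
r9_7950x = mt_proxy(4.5, 16)      # Ryzen 9 7950X
print(tr_7960x / r9_7950x)        # ~1.4, matching the estimate above
```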
> However this method does not work for server and workstation CPUs, because these have already the highest TDP that is possible with the current cooling solutions, so in servers and workstations the bad Intel MT performance is much more visible
I disagree on this point. I would say this problem is much more critical on Intel's desktop platform than their workstation platform. Xeon Sapphire Rapids is actually very easy to cool, even on air, thanks to the CPU having a much larger surface to dissipate heat than their desktop equivalent.
I have a Xeon w9-3495X, and while power consumption is one of its weakest points, it stays under 60°C with water cooling while I pump 500W into it (25°C ambient), with which I see a +30% to +50% gain in multithreaded performance over the default power limit. (Golden Cove needs around ~10W per core, so the default 350W/56c = 6.25W per core is way below its performance curve.) Noctua has also shown that they're able to achieve ~700W on the U12S DX-4677[1] on this platform.
Not only is this complete and utter rubbish, it should have been obvious from context that we're not talking about desktop CPUs. 96-core desktop CPUs are not a thing, and neither of the product families I mentioned are desktop CPUs either. I neither know nor care what the difference between those desktop cpus are, and I doubt GP does either.
Your metric about clock speed is, I'm afraid to say, so horribly oversimplified as to be flat out wrong. You can't just multiply core count by clock speed like that, as you're failing to take into account all sorts of other scaling factors such as memory bandwidth, cache size, AVX support and so on, which matter as much or more than simple IPS.
> serial computation on a fixed dataset that can be threaded per data point. So if I have 32 data points it's only that parallel
The general rule of thumb is that within a generation, higher clock speed yields better performance per core. If you only have 32 data points, then you will probably get better performance with the 4 GHz 32-core CPU.
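(A sketch of that "threaded per data point" pattern, assuming each point's computation is independent; the `analyze` function here is just a placeholder for whatever serial computation runs per point:)

```python
# With 32 independent data points you can keep at most 32 workers busy,
# so cores beyond that sit idle for this workload.
from concurrent.futures import ProcessPoolExecutor
import os

def analyze(point):
    # placeholder for the real per-data-point serial computation
    return sum(i * point for i in range(10_000_000))

if __name__ == "__main__":
    data_points = list(range(32))
    workers = min(len(data_points), os.cpu_count() or 1)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(analyze, data_points))
    print(len(results), "results")
```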
> if I'm doing a massive serial calculation does an individual core get to use the full cache
Your single core will get to use the entire L3 cache available to it, but L1 and L2 caches are per-core, so the idle cores' L1/L2 won't help the one core doing the work. So yes, there conceivably could be a benefit due to the larger L3 cache.
On a broader note, these kinds of factors (frequency, cache size, parallelism) tend to be extremely workload-specific and unpredictable, so the only real way to find out what's faster is to measure your specific workload.
> And second. For these massive caches, if I'm doing a massive serial calculation does an individual core get to use the full cache if it's being thrashed? As in, are there cache benefits to running a computation on these huge CPUs vs some overclocked 8 core beast?
On AMD chips (I just dunno about the Intel architecture), each chip is divided up into chiplets with their own set of cores and L3 cache. For these Threadrippers, those will be 8 physical cores and 32 MB of L3 cache per chiplet. Each core can access the L3 cache within its own chiplet only.
You'll need to dig into the enabled cores+cache arrangements for any particular chip you might be interested in to figure out what will be good for your workloads.
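If you already have a Linux machine in hand, one way to see the actual core/cache arrangement is to read it straight from sysfs (lscpu shows the same data); a small sketch:

```python
# Linux-only: dump the cache hierarchy as seen by CPU 0.
from pathlib import Path

cpu0 = Path("/sys/devices/system/cpu/cpu0/cache")
for idx in sorted(cpu0.glob("index*")):
    level = (idx / "level").read_text().strip()
    ctype = (idx / "type").read_text().strip()
    size = (idx / "size").read_text().strip()
    shared = (idx / "shared_cpu_list").read_text().strip()
    print(f"L{level} {ctype:<12} {size:>8}  shared by CPUs {shared}")

# On AMD chiplet parts, the L3 line lists only the CPUs that share that
# one chiplet's slice of L3, which is exactly the split described above.
```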
When these 'base clocks' are listed, would a lower-core-count CPU with a higher base clock run more calculations at its max thread count than a higher-core-count CPU? Say 64 threads on a 32-core 4 GHz CPU vs a 2.5 GHz 96-core monster
-----
You know that GPUs are just CPUs with a massive core count, right?
Like in the 1000s of cores
So it depends on what calculations you need done; there is no single answer.
It depends on how the program uses the cores, not just what calculations you need done.
GPU cores are not CPUs, otherwise we could run Linux and a shell on them. A GPU core is much simpler than a CPU core; that's why there can be so many of them on the die.
Something to note about the Threadripper Pro CPUs in the previous generations that may apply here:
The 12 and 16 core variants only had the ability to saturate the IO die with enough data to fully utilize 4 channels of RAM due to fewer CCDs being populated. I would bet that will be the case here as well. So those bottom CPUs will be just the budget IO monsters, and be no better than the non-Pro (for RAM bandwidth) even in the Pro boards. The one caveat is that they probably won’t have the speed penalty with 8 RDIMMs that the others probably will.
These chips are great. You get a lot of cores, without having to compromise on frequency (you get high frequency as well).
I wish cloud / dedicated server providers sold these chips, because most workloads benefit more from higher frequency than from more cores. And you get the best of both with these.
They pretty much have this, e.g. EPYC 9274F has a 4.05GHz base clock. Threadripper with the same number of cores is 4.2GHz, which is less than 4% faster (for a >9% higher TDP).
Threadripper has a significantly higher boost clock, but that only counts when you're not using all the threads, so you only care about this in a workstation running mixed workloads. In a data center you don't use a high core count processor for a single-thread performance-sensitive workload, you put it on something with fewer cores like a Ryzen 7700X because it has a higher boost clock than any Threadripper while costing less and using less power.
EPYC 9334 has the same config, but it's clocked lower and has a correspondingly lower TDP.
Whereas the 7965WX has 24 cores from 4 CCDs, so the 7965WX is presumably made of CCDs where 2 of the 8 cores were defective, i.e. it's a 7975WX made of "defective" CCDs. But so is the EPYC 9254. Or they could have used the same four "defective" CCDs to make a pair of the 12-core Ryzen 9 7900, or four of the Ryzen 5 7600.
The EPYC "F" processors are apparently made entirely of "defective" CCDs, presumably because it gives them more L3 cache per core and helps with heat dissipation by spreading the heat load across more CCDs for the same number of cores.
I'm sort of confused on this one, is this essentially the "home version" of an Epyc? Who is the intended purchaser here? As previously if I wanted to build an AMD workstation at home I would have opted for an Epyc setup.
I mean, usually the intended audience is Us - people who benefit from lots of cores for things like compilation, but still want to have a Regular Computer rather than something in a rackmount form-factor that sounds like an aircraft taking off
This generation of ThreadRipper has higher TDP than last generation of EPYC. There is very little difference between EPYC and ThreadRipper of the same generation. There are plenty of workstation style motherboards for EPYC as well (see asrock offerings).
Some other comments mentioned lower frequency of EPYC, but I'll add another point, the motherboards on EPYC are built for a datacenter environment where you run very loud fans and shove a crap ton of air through the system. Threadripper has larger heatsinks on VRM and chipsets and whatnot on the motherboard so that you can use quieter fans, like on Ryzen.
Relevant to that, a lot of EPYC (and server boards in general) also have the CPU socket rotated 90 degrees compared to non-server boards to support that "Back to Front high airflow" scenario. (That said, a Noctua NH-U14S TR4-SP3 still fits perfectly fine on a ROMED8-2T, but in a regular ATX case, the fan will blow towards the top of the case, not the back - so you need to have an exhaust at the top.)
I do more or less the same - I look for tower servers. The hardware is good, the maintenance is easy, plenty of space for storage and memory, and so on.
> Why would anyone buy anything but the most expensive product?
If I were to build a workstation on it right now, I wouldn't buy the most expensive one. My relatively modest compute needs don't come even close to what that monster is able to do.
If you are IO-bound, why spend $3000 more on a CPU if you can spend that in memory, more/faster NVMe or HDDs?
If you need more than 28 PCIe lanes, then you need the big one. In any case, 28 is still a lot of IO. It makes little sense to pay for what you won't need.
Workstation - that would require some external GPU, so 8 or 16 lanes, plus 4 for chipset/USB. That leaves 16 or 8 PCIe lanes for SSDs; decent NVMe SSDs use 4 lanes each, which means up to 4 fully enabled SSDs. It's not much, especially running RAID.
Not all work for workstations require a 16-lane GPU or four NVMe SSDs. Sometimes all you need is a lot of CPU cores and lots of memory at a reasonable, albeit not ludicrous, bandwidth.
Although I work with 2-3 monitors all the time, I'm happy with low-end GPUs - I don't do much data visualization or 3D rendering. I mostly develop software, so I stand up relatively thin ephemeral VMs and containers all the time. For me the core count and amount of memory are the limiting factors. To completely avoid any swapping, 64GB of RAM is needed, and more would allow me to keep a CI pipeline running all the time. If I need more space, I can spare one or two SSDs as bcache for a larger SATA array.
As much as I'd love the biggest Xeon X, EPYC, AmpereOne, MI-300, or Grace Hopper under my desk, I have no need for that much power.
My initial response was to "If you are IO-bound, why spend $3000 more on a CPU...": the desktop ones just don't have the PCIe lanes to prevent the I/O-bound case. You can run 3 monitors off an iGPU (if you don't care about 3D); overall your setup is a classic desktop, and running lots of VMs just requires memory, as the processes don't do much but exist peacefully. 128GB on a desktop CPU is totally fine with 4 DIMM slots, though there won't be ECC (which DDR5's on-die ECC doesn't fully cover).
This is nothing compared to the range of price points offered for server CPUs. I remember one Intel launch must've included like 50 different SKUs. It's clear that companies do not just buy the most expensive one.
According to the Anandtech coverage[0], both WRX90 and TRX50 have some sort of off-cpu chipset, unlike the EPYC platform. However there's basically no information about what's on the chipset. I suppose it must not be very exciting, then. Just a bunch of USB controllers, probably. It may even be the same die as the desktop Ryzen chipset.
The 96 core part has 480MB of cache and 8 channels of DDR5.
Compare that to a 7950X (not the X3D model) which has 16 cores, but only 80MB of cache and 2 channels of DDR5. The 96-core part has 6X as many cores, but 4X as much memory bandwidth and 6X as much cache. The interfaces have scaled similarly to the core count.
Of course, if your workload isn't really very parallel then you won't see much benefit. Such is life with high core counts.
Waiting for Bruce Dawson aka randomascii's sequel to "24-core CPU and I can't move my mouse", with "96-core CPU and time has been stilled", with event tracing and flame graphs that show which of those cores are slowing down something basic in Windows.
It's a bit weird (but cool!) that AMD seems to expect this chip to be used in ordinary workstations rather than HPC clusters. Hopefully a few of those workstations are bought by kernel and application developers (who have reasonably parallel workloads like compiling operating systems), otherwise you're likely to have more than an order of magnitude more cores than the average person designing your operating system. Usually the problem runs the other way: the devs are on new flagship workstation-class devices, and the users on little edge devices (sometimes still with HDDs instead of SSDs!) have a poor experience. This is kind of the reverse.
> the devs are on new flagship workstation-class devices, and the users on little edge devices
This will continue to be true, I think. The vast majority of people buying threadripper machines at threadripper prices are those with serious work to do on them, like compiling huge codebases, hosting piles of VMs, running simulations, or doing certain creative work. I doubt anyone developing 'serious' software that their customers buy 96 core machines to run wouldn't also run similarly powerful machines.
There are some tech fetishists who will buy this premium hardware and underutilize it, but not too many.
Depends on how memory bandwidth constrained you are with eight channels of DDR5. Keep in mind also that the DDR5 on these can also clock higher than EPYC parts, so the gap between Threadripper's 8 channels and EPYC's 12 channels isn't as wide as one may initially think.
And 8 channels of DDR5! I wonder if any WRX90 Motherboards will have memory overclocking or if that's going to be limited to the quad channel TRX50 enthusiast boards. AMD seems to allow it, but will there be a market?
If the platforms are anything like the desktop topology overclocking the memory increases the L3 cache and CCX<->CCX clock speeds as well. Feed the beast.
> Can the cores even stay busy and fed with work to do at some point?
There are still some embarrassingly parallel applications out there that could benefit from this sort of hardware. I'm not sure if it's absolutely necessary to have this number of cores under each desk though. To me this sounds like a luxury/vanity/status symbol.
Epic that the Pro series goes all the way down to 12 cores. You still get gobs and gobs of memory and I/O bandwidth, at a hopefully somewhat more down-to-earth price.
Also, I think the key point people really miss is that with modern cores you can just say what you want! I really, really enjoyed this article where they scaled an AMD 7840HS laptop core from 51W down to 5W. There looks to be perfectly acceptable single-core performance even down at 10-15W. https://news.ycombinator.com/item?id=37923741
This mirrors what Matthew Dillon of DragonflyBSD was finding 6 years ago with a 2700x. You can just drop the cores target power down from 160W to 85W and it still runs amazingly fast and the efficiency numbers go way way up. https://lists.dragonflybsd.org/pipermail/users/2018-Septembe...
People keep being afraid of high-power cores and they just don't know any better. The situation has gotten better and better, and easier and easier, to dial in an arbitrary desired power level, and you get it. At some cost to multicore performance, but efficiency usually goes way, way up.
It's too late for me to edit my reply, so I'm adding another:
Just wanted to say that you had a good point about modern cores being configurable (for example, on my laptop I set the EPP 100% towards energy efficiency by default, and when I know I'm going to be running some serious long-running computations I'll run a script to make the EPP favor performance). I'll have to wait for these to be released to see how low their power usage can go, since they might have a high floor given that I doubt they were designed for low power usage, but this has at least revived my consideration of these CPUs.
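(For reference, a minimal sketch of the kind of EPP-flipping script I mean, assuming Linux with the amd-pstate or intel_pstate driver in active mode, where the energy_performance_preference files exist, and root privileges; "power" and "performance" are standard EPP values:)

```python
# Set the energy-performance preference for every cpufreq policy.
import sys
from pathlib import Path

pref = sys.argv[1] if len(sys.argv) > 1 else "power"   # or "performance"

for policy in sorted(Path("/sys/devices/system/cpu/cpufreq").glob("policy*")):
    epp = policy / "energy_performance_preference"
    if epp.exists():
        epp.write_text(pref)
        print(f"{policy.name}: EPP set to {pref}")
```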
That being said, there is one other thing to consider: motherboards. I have a first generation threadripper in my desktop and through a series of dumb moves with very heavy graphics cards, I have broken a number of the PCIe slots. I looked into getting a replacement motherboard since I figured they'd be cheap by now since they're so out of date, but first generation threadripper motherboards are like $700-$1000, while the people who bought AM4 socketed CPUs got multiple generations of CPU upgrades AND their motherboards now cost $100-$200. There are some undeniable advantages to being on their more popular consumer lines instead of the prosumer threadripper.
Looking forward to AMD releasing some more budget-oriented options for AM5; currently it's a very expensive platform. I really wanna see how these take on Intel, and if they steamroll them the way they did when Threadrippers first came out.
There are still processes that benefit greatly from CPU advances, something like compiling code, for instance. But even if this weren't the case, high-end CPUs like these can now be used with several GPUs where a lesser CPU might only be able to feed n-1 GPUs. Density is pretty important in data centers and elsewhere. Think ML researcher with 4 big GPUs in their desktop.
There are lots of random bits which are still CPU bound.
https://www.pugetsystems.com/solutions/ - you can click through workflows to specific programs and then look at the Hardware Recommendations to get a rundown.
While Blender itself is a mostly GPU bound situation, a lot of other media creation software is more mixed.
* Photoshop: mostly still limited to 8 cores (some filters being GPU enabled)
* Premiere: GPU used only to handle specific effects (so it depends on your mix of effects)
You'll have to wait for the reviews to confirm, but I believe it's highly unlikely that the ECC will be non-functional. These require RDIMMs, and it's hard to even buy RDIMMs without ECC. So if everybody who is testing it has ECC in their test machine, many of them will explicitly test ECC and it will be noticed. Also, ECC works fine on the EPYC chips these are variants of.
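(One quick way reviewers, or you, can sanity-check that ECC is actually active on Linux is to look at the EDAC counters in sysfs; a rough sketch, assuming the relevant edac driver is loaded:)

```python
# List corrected/uncorrected memory error counters per memory controller.
from pathlib import Path

mcs = sorted(Path("/sys/devices/system/edac/mc").glob("mc*"))
if not mcs:
    print("No EDAC memory controllers registered - ECC likely not active")
for mc in mcs:
    ce = (mc / "ce_count").read_text().strip()
    ue = (mc / "ue_count").read_text().strip()
    print(f"{mc.name}: corrected={ce} uncorrected={ue}")
```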
That was probably the most welcome bit of news. I've got a water loop on my older Threadrippers that can handle that ballpark. After the 5xxx series became unobtainium for those of us who build our own systems, I've got a 7950X that won't even turn on the fans, and I'm waiting for something newer than what I have on the workstation front.
Because at the time of release of the 5000 series, they couldn't spare the silicon to make SKUs at every price point, so they tried to sell at the highest price point. Now that they are not selling much of anything, they will try to get their profits from the HEDT market.
I wish they'd go back to the way they did the original Threadrippers. These absurd prices kinda make them pointless for a large number of people when combined with how AMD has bungled the socket support with Threadripper and how expensive the motherboards are. The lower prices and overlap with the top range of regular Ryzens made threadripper very appealing for applications where you want lots of IO and RAM, but not necessarily a more capable CPU.
7950X3D is $43.75 per core.
The 24-core Threadripper part is $62.50 per core. It's certainly a higher price, but imo it's not outrageous for more memory channels, registered ECC, and more PCIe lanes.
I don't necessarily agree on looking at pricing that way. One reason being that the 24-core part is 8 CCDs with 3 cores each, while the 7950X is 2 CCDs with 8 cores each, so the Threadripper is able to use much lower-quality chips which would otherwise have gone in the bin.
Even if it uses chips with defective cores, it's four times the silicon. The 24-core Threadripper is actually greatly advantaged compared to the 7950x for the right workload. There is four times the L3 cache, and four times the bandwidth from the chiplets to the io die. If you just needed a place to sell chips with a core defect, there are Ryzen SKUs for that.
What do you think the appropriate way of looking at pricing is? Once you get beyond 8 cores, your ability to saturate the CPU comes down to very specific parallelized workflows that can benefit close to linearly from extra cores. If you have that kind of workflow (render farms, encoding, etc) then the marginal cost of a CPU feels relevant.
I'm sure there are other workflows, especially in the cloud where you are renting out fractions of a CPU. But for individual professionals, what else besides marginal core cost would you consider?
You mean the Threadripper has a lot more cache on the 24 core part so it's going to be much better in those workloads that fit into its cache but not into the 7950x's?
I wouldn't hold your breath. As long as there's an insatiable demand for EPYC server CPUs, the workstation parts need to have similar margins because they use the same dies.
Judging by AMD's Q2 financials, there most probably isn't insatiable demand for EPYC CPUs right now. But AMD probably won't sell the chips at a discount as Threadrippers either, whether because they already sit finalized in EPYCs in a warehouse, or to prevent companies from buying lots of Threadrippers and using them in datacenters. They probably want to make the big money in the datacenter, and they're willing to take a short-term revenue hit if it helps motivate buyers to spend in the future.
> Data Center segment revenue was $1.3 billion, down 11% year-over-year primarily due to lower 3rd Gen EPYC processor sales as Enterprise demand was soft and Cloud inventory levels were elevated at some customers.