AMD’s 64-Core Threadripper 3990X, only $3990 Coming February 7th (anandtech.com)
252 points by erik on Jan 7, 2020 | 249 comments



When I left Sun in 1995 their "biggest" iron was the Enterprise 10K (which internally was called "Dragon" because of the Xerox bus). A system with 64 cores and 256GB of RAM was just under 2.5 million dollars list. It needed over 10kW of power provided by a 60A 240V circuit. The power cord weighed in at like 20 lbs. I put together a new desktop with the TR3960 and 128GB of ECC RAM; that motherboard will take the 3990 and 256GB of RAM if I choose to upgrade it. It really boggles my mind what you can fit under your desk these days with a single 120V outlet.


In 2001, the fastest supercomputer in the world was ASCI White. It cost $110M, weighed 106 tons, consumed 3MW of power (plus 3MW for cooling), and had a peak speed of 12.3 TFLOPS.

Right now, sitting under my desk is an RTX 2080 Ti GPU which cost around $1000, weighs 3 pounds, draws a maximum of 250 watts, and has a peak speed of 13.4 TFLOPS [1].

We truly live in amazing times.

[1] Not quite a fair comparison: the GPU is using 32-bit floating-point, while ASCI White used 64-bit. But for many applications, the precision difference doesn't matter.


It's not fast enough. After having access to 160 TPUs, it's physically painful to use anything else.

I hope in 20 years I'll have the equivalent of 160 TPUs under my desk. Hopefully in less than 20 years.

The reason it's not fast enough is that ... there's so much you can do! People don't know. You can't really know until you have access to such a vast amount of horsepower, and can apply it to whatever you want. You might think "What could I possibly use it for?" but there are so many things.

The most important thing you can use it for is fun, and intellectual gratification. You can train ML models just to see what they do. And as AI Dungeon shows, sometimes you win the lottery.

I can't wait for the future. It's going to be so cool.


How do you program a TPU?

I looked to use some for solving PDEs, but Google had literally zero documentation on how to cross-compile C to TPUs, launch kernels, etc.

AFAICT, you either use tensorflow or some other product that supports them, and for which the TPU code is not open source, or you can't use TPUs at all.


Do these help? https://github.com/google/jax/tree/master/cloud_tpu_colabs

The ODE solver might be close to what you want.

I use Tensorflow 1.15. The world has been steadily pushing for Tensorflow 2.0 or Jax, but I like the simplicity of the Session model. It's so simple you can explain it in one sentence: it's an object that runs commands. Tell the session to connect to the TPU, and it will run all those commands on the TPU.
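
For concreteness, here's a minimal sketch of that Session model on a TPU. Assumptions beyond the above: TF 1.15 on Colab (where the TPU address comes from the COLAB_TPU_ADDR env var), and tf.tpu.rewrite doing the actual compilation of the graph for the TPU:

    import os
    import tensorflow as tf  # 1.15

    # Assumption: running on Colab, which exposes the TPU worker via COLAB_TPU_ADDR.
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']

    def computation(x, y):
        return tf.matmul(x, y)

    x = tf.random.uniform([1000, 1000])
    y = tf.random.uniform([1000, 1000])
    result = tf.tpu.rewrite(computation, [x, y])  # compile the graph for the TPU

    with tf.Session(tpu_address) as sess:         # "an object that runs commands"
        sess.run(tf.tpu.initialize_system())
        print(sess.run(result))                   # the matmul runs on the TPU
        sess.run(tf.tpu.shutdown_system())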

Jax is new to me (and to everyone; they just released it). But it looks like Google is pouring some serious R&D into it.

Two things help a lot. One, twitter. You can get a direct line to the people who actually make these beasts. Exploit it when you can. Like you, I dislike using a black box, and I'm intensely interested in the details of how to communicate with a TPU at a low level. I recently asked someone on the jax team about it here: https://twitter.com/theshawwn/status/1213221594052599808

Two, TFRC support has been incredibly helpful. https://www.tensorflow.org/tfrc I don't know who they have working the support channels, but those guys and gals are some of the most helpful and cheerful people I've come across. I often asked them very technical questions and to my surprise, they followed up with an A+ response almost every time, usually the next day.

Pytorch is giving TF a real run for its money, and to be honest I once felt it was a mistake to invest so much time into Tensorflow. But it turned out to be a big advantage due to Google's investment in the overall ecosystem. TPUs are something that only Google has the resources to pull off.

Note that the traditional path towards "just get a TPU up and running and start playing with it" is to use one of their Colab notebooks on the topic. https://cloud.google.com/tpu/docs/colabs I've been implicitly steering you away from these because you seem (like me) to want to know more of the low-level details. Those notebooks are designed to let ML researchers get results quickly, not for hardware enthusiasts to exploit heavy metal. The jax notebooks felt much more satisfying in that regard.


Or you can buy Nvidia gpus which will be much cheaper than tpus in the long run for the same performance.


Great tweets and write-up.

Can you mention how much human dev time is involved?

We have a stupid-basic single-machine deep reinforcement self-play setup. It takes about 24 hrs to run a full experiment. The NN is the bottleneck. Using Tensorflow. Nothing fancy.

How much dev time for a good engineer (backend, kernel, multi-core experience) to get this down to, say, 1hr?

Obviously a very general question. Thanks for any input.


> You can't really know until you have access to such a vast amount of horsepower, and can apply it to whatever you want.

Something I've often wondered about, and there are probably good reasons why, is that billionaire tech moguls - even the ones who are outwardly technical (or were in the past - people like Bill Gates, who we know had technical chops back then) - none of them (that I'm aware of) have ever tried to build "their ultimate computer".

For instance, if I had their kind of money, I've often thought that I would construct a datacenter (or maybe multiple datacenters, networked together) filled with NVidia GPU/TPU/whatever hardware (the best of the best they could sell me) - purely for use as my "personal computer". Completely non-public, non-commercial - just a datacenter I would own with racks filled to the brim with the best computing tech I could stuff into them (on a side note, I've also pondered the idea of such a personal datacenter, but filled with D-Wave quantum computing machines or the like).

What could you do with such a system?

Obviously anything that massive parallelism could be useful for - the usual simulation, machine learning, etc.; but could you make any breakthroughs with it - assuming you had the knowledge to do such work?

Which is probably why none have done it - at least as a personal thing.

I mean, sure, I would bet that people who own large swathes of machines in a datacenter, or those who outright own datacenters (like Google or Amazon) - their founders and likely internal people do run massively parallel experiments or whatnot on a regular basis, ad-hoc, and "free" - but it's a commercial thing, and other stuff is also running on those machines...

But a single person is probably unlikely to have or think of problems that would require such a grand scale before they would just "start a company to do it" or something similar; because in the end, just to maintain and administer everything in such a datacenter, if one were built, would require (I would think) the resources of a large company.

Of course, then I wonder if such companies - especially ones like Google and Amazon, which own and run many datacenters around the world, and also sell their resources for compute purposes - weren't started in some fashion (even if only in the back of their founders' heads) with that idea or goal in mind (that is, to be able to own and use on a whim "the world's largest amount of computing power")...?


Paul Allen kinda did just that, although in a different direction. He built a datacenter and filled it with a bunch of old computers he thought were cool, like the DEC PDP-10. It's now the Living Computer Museum in Seattle.

https://www.pcworld.com/article/3313424/inside-seattle-livin...


I feel like "tech moguls" are the wrong type to expect this kind of interest out of. They got rich on either tools or workflows (i.e. CRUD), not intelligence/analytics/prediction. It's not the same mindset.

If anyone were to own a secret HPC cluster, it'd probably be a finance billionaire. Or the owner of a think-tank who made their money as a subcontractor for state intelligence agencies.


Or as it currently stands you can run the buggier, more resource-intensive equivalents of the software you used to run! Now featuring pervasive spyware that tracks and catalogues your every action! Wanted a permanent copy of the software you paid for? Too bad, it's only available as "A Service", which means you get constant changes you never asked for AND you get to pay for them on a recurring basis whether you like it or not!

Seriously though I feel like most of the gains in hardware have been wasted by shittier software both in terms of quality and in the way the software itself acts against the interests of its users.


A bit off topic, but I am looking at TPUs at the moment. Can I ask, for clarity, whether you mean TPUs are easier to use than GPUs or vice versa?

I thought TPUs were harder to work with because they only support Tensorflow, rather than Tensorflow plus other high-level frameworks as well as low-level CUDA, which are supported by GPUs.


Sure!

TPUs aren't necessarily easier to use – it's about the same – but they're powerful. I've documented some benchmarks in this tweet chain, where I trained GPT-2 1.5B to play chess using a technique called swarm training: https://twitter.com/theshawwn/status/1214013710173425665

The power turned out to be from the fact that every TPU gives you 8 cores at your disposal. I never use the Estimator API. I just scope Tensorflow operations to specific TPU cores. Works great.
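
To give a flavor of what I mean by scoping, here's a rough sketch (not production code): the exact device strings vary, so check sess.list_devices(); model_fn and input_shards here are hypothetical placeholders for your model and per-core input shards.

    import tensorflow as tf

    per_core = []
    for core in range(8):  # a TPUv2-8 / v3-8 exposes 8 cores
        # Assumed device name pattern; confirm the real names with sess.list_devices().
        with tf.device('/job:worker/replica:0/task:0/device:TPU:%d' % core):
            per_core.append(model_fn(input_shards[core]))  # hypothetical

    total = tf.add_n(per_core)  # combine the per-core results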

In terms of actual performance, I was delighted to discover that TPUs can be faster than GPUs when you use all 8 cores: https://twitter.com/theshawwn/status/1196593451174891520 (solution notebook: https://twitter.com/theshawwn/status/1205914446918492170)

It also gives you flexibility. TPUv2-8 can apparently allocate up to 300GB (!) if you don't scope any operations to any cores. Meaning, you run it in a mode where you only get 1 core of performance, but you get 300GB of flexibility. And then you can connect multiple TPUs together as described in the tweet chain, which quickly makes up the difference.

There is also the question of cost savings. A TPUv3-8 seems about as expensive as a V100. Which one is worth it? Well, it depends. In my experience a GPU is easier to use and quicker to set up if you only need one GPU of horsepower. But suppose you wanted to train a massive model in 24 hours. What's your best option? For us, it was TPUs.

The reason is subtle: It's hard to find any single VM that can talk to 140 GPUs simultaneously. But you can talk to 140 TPUs from a single VM no problem. And since you get 800MB/s to and from the VM, you can average the parameters across all TPUs very quickly.
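
The averaging step itself is nothing fancy; conceptually it's just this (a sketch running on the coordinating VM, with fetch_weights/push_weights standing in for however you move weights to and from each TPU worker):

    import numpy as np

    def average_weights(weight_sets):
        # weight_sets: one dict per TPU, mapping variable name -> np.ndarray
        return {name: np.mean([w[name] for w in weight_sets], axis=0)
                for name in weight_sets[0]}

    weight_sets = [fetch_weights(tpu) for tpu in tpu_workers]  # hypothetical helpers
    averaged = average_weights(weight_sets)
    for tpu in tpu_workers:
        push_weights(tpu, averaged)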

This is similar to what TPU pods do internally. And while TPU pods are impressive, they are also impressively expensive. A TPUv3 pod will run you $192/hr at evaluation prices. Whereas you can play with a TPUv3-8 for $2.50/hr. You can also play with a TPUv2-8 for free using Colab: https://github.com/shawwn/colab-tricks

Yesterday I used that notebook to port forward Colab's free TPUv2-8 using ngrok, then trained using the new StyleGAN 2 codebase: https://twitter.com/theshawwn/status/1214245145664802817

I think a swarm of TPUs can cost significantly less than a cluster of V100s with less engineering effort.

That said, right now most codebases are designed to work with V100s. It will take time before TPUs widely proliferate. But speaking as someone who was once skeptical of TPUs and who has spent several months trying to discover their secrets, I feel that TPUs can get the job done quicker and easier than a GPU cluster. The hardware is also more accessible, since you can more easily spin up 100 TPUs than 100 V100s. But mainly I like that it's all coordinated from a single machine. It's conceptually simpler to debug and to implement.

If you run into any issues or have any trouble with TPUs, please feel free to ask here or DM me. I love talking about this stuff.

EDIT: In regards to usability, the new Jax library works with TPUs out of the box. Google seems to be heading in the direction of Jax. My initial reaction was "Not another library..." but first impressions were positive. It's not quite the React of ML – an idea which I hope to see soon – but it does seem easier for certain research purposes.

PyTorch also recently gained TPU support, and as far as I know they've put in some serious efforts to make sure things run quickly. As for how you use all 8 cores of a TPU using PyTorch, I haven't looked into it yet. But I'd be surprised if you couldn't. It seems unlikely that they would design an API that would hamstring you to just 1 out of 8 cores.


Thanks a lot. This is a lot of information to digest. I will check out your Twitter feed, and I think I need to start playing around with TPUs then. Scaling seems to work fantastically for you. I am working more in computer vision and we sometimes run into weird bottlenecks with our GPUs where neither GPUs nor CPUs are under full load. Unfortunately, drilling down on where the bottlenecks come from is not easy at all. I am assuming the profiler from Tensorflow works with TPUs in the same way it does with GPUs?


Oooh, you're so lucky you get to work on those kinds of problems. I know it sometimes feels frustrating to hunt for bottlenecks, but man is it satisfying when you find them.

We had a similar situation at one point. The problem turned out to be that our CPU wasn't generating input data fast enough. So the first step is to confirm that your input pipeline isn't the issue.

The next step would be to break down the problem: Can you extract the smallest part of the codebase into a separate program, and try to make that run under full load?

That's not the technique I used, though. To figure out the multicore stuff, the trick for me was to comment out almost all of the code, until you're left with only a small part that actually runs on the device. Ideally the smallest part.

Basically, change your code so that the model file returns tf.no_op() (or as close to that as possible while still letting your input pipeline run). You want to be in a situation where your training loop is doing an equivalent of while(true) { read_input(); } so that you can verify that your pipeline is able to peg your GPUs to 100% usage.
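
Something like this is the degenerate loop I mean (a sketch for TF 1.x; make_dataset() is a stand-in for your existing tf.data pipeline):

    import tensorflow as tf

    dataset = make_dataset()  # hypothetical: your existing tf.data pipeline
    batch = dataset.make_one_shot_iterator().get_next()

    with tf.Session() as sess:
        while True:
            sess.run(batch)  # read input only; the model is effectively tf.no_op()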

If you get 100% usage, fantastic! That means you're left with an easy problem: start turning parts of the code back on until you find which part is reducing your performance. Then study that part to figure out why.

If you're not at 100% usage, you're either running into a fundamental limitation (which sometimes happens) or the pipeline isn't designed correctly in some way. I would compare it against other popular codebases such as StyleGAN 2 https://github.com/NVlabs/stylegan2 which is designed to use 8 V100s. The optimizer.py file is pretty insightful: https://github.com/NVlabs/stylegan2/blob/eecd09cc8a067e09e12...

Finally, my biggest tip would be to step back from the problem and think: is there something simple you can do to reframe the problem? When I find myself in a situation where I'm spending a lot of time and energy trying to get a certain thing to work, I can sometimes do X instead for 80% of the benefit. Try to find something like that in this case.

FWIW the TPU profiler was the first tool I reached for. I never got it working. The bag of tricks above ended up giving me effective results on a variety of codebases with no profiler. (A usage graph is pretty crucial, though, which Colab TPUs don't provide.)

So those are a bunch of general tips for solving weird bottlenecks blindfolded.

To answer your question directly:

> I am assuming the profiler from Tensorflow works with TPUs in the same way it does with GPUs?

Not really. You're supposed to use cloud_tpu_profiler: https://cloud.google.com/tpu/docs/cloud-tpu-tools

But yeah, if you give specifics (ideally a link to a codebase + dataset + script that runs it) then I can try to look for candidates of what might be the bottleneck.


Sounds awesome and exciting. I can't wait either.

Personal ML bots will be a big thing. The next step in total automation.


For a fair comparison (fp64 vs fp64):

«A single GPU card like the AMD Radeon MI60 has more computing power than year 2000 supercomputer ASCI Red (fastest supercomputer in the TOP500 list of June 2000):

• MI60: 7.4 TFLOPS (FP64)

• ASCI Red: 3.2 TFLOPS (FP64) »

https://mobile.twitter.com/zorinaq/status/112491212518746521...


Comparing GPU floating point performance with CPU floating point performance is comparing apples and oranges. GPUs may have higher raw FLOPS, but they have issues with workloads that aren't massively parallel or require branching.


That's true when you're comparing a GPU to a single CPU. But when you're comparing a GPU to an entire supercomputer the requirement that the workload has massive parallelism to use all available resources is present in both.


It's a little more complicated than that. The main problem is that the way GPUs are designed, their execution units share the same instruction pointer[1]. That's not an issue if you're multiplying matrices, but it's an issue anytime you have branches. Therefore, even when your workload is massively parallel, it doesn't necessarily mean that a GPU cluster would perform nearly as well as a CPU cluster with the same number of FLOPS.

[1] https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...


However, since you seem to count the GPU, fully utilising a modern system is definitely not easy.


It's not easy to fully (or even partially) utilize a supercomputer from 2001 either. Or a modern supercomputer for that matter.


Also, the supercomputer likely had a substantial amount of solid state & fast spinning storage, even back then. That's also often overlooked in these comparisons, not just the difference in precision.


A single 80mm NVMe SSD would likely be faster than a significant amount of that supercomputer's storage. In 2000 a million IOPS was a lofty target. Now we can do it on a single device.


ASCI red had 1TB of DRAM and 12TB of disk. Not bad, but three NVMe drives would clobber it. Putting that much DRAM in a box is still expensive today, but entirely feasible for about $5k-10k.


> while ASCI White used 64-bit.

x87 is 80-bit. :)


ASCI White wasn't Intel (it was IBM Power3)


oh, I was thinking ASCI red! fair enough!


When the univ I went to in Budapest got a second-hand VAX cluster (also around '95) for the bargain basement price of only 50 000 CHF, they needed to dig up the street to the nearest substation and have a new power line installed. http://hampage.hu/oldiron/vaxen/9000_4.jpg - the furthest away cabinet is the power supply. This photo is not even half of the cluster.

A few years before that, at another univ, they put an ancient IBM mainframe in place with a crane, temporarily removing the roof of the building.


I recently "upgraded" from a desktop to a laptop. This marks the end of an era for me. I always had a relatively powerful desktop at home, mostly running 24/7 since I couldn't be bothered to wait for it to boot. This ThinkPad X1 is the first laptop I own which is apparently powerful enough to host all my work in a 1.09kg package that easily fits in my backpack... OK, I am a text mode user, so gfx isn't what keeps my computers busy. Low latency realtime audio synthesis much more so. And I still remember when 33.6kbps were an exciting thing to have :-) Nice to see that tech moves ahead.


Are you sure it was in 1995? I joined Sun in 1997, and the E10k was launched a bit after that.

According to Wikipedia it was launched in 1997 so it does line up. If I remember correctly, the system was bought from Cray after SGI bought the rest, so I didn't know they even had it at Sun in 1995.

Also, the original model of the E10k supported 64 GB RAM.


My company back then bought one of those around that time. It was used to run large EDA software jobs.

On one beautiful day during the weekend, only two weeks after delivery, our system admin noticed that the machine went offline. Logging in remotely didn’t work at all. No ping either.

He drove to work and ... the machine was gone.

Thieves had used a crane to lift it out of the building through a window onto a truck.

Sun told him that this wasn’t the first time such a thing had happened and that somewhere in the chain from order to delivery, an insider tipped off the thieves about where to find the latest.


I'm curious where the black market for something like that would even come from?


Back then, smaller nation-states wanting to do nuclear device simulation and the like would be my guess. Basically, countries that were restricted in some manner on gaining large amounts of parallel processing compute power for such simulations.


My money is on breaking it up and selling it for components. The 256 GB of server-grade RAM alone cost a fortune.


The thing that boggles my mind is what people casually put in their pockets.


I think we already live in the cyberpunk/sci-fi realm in this regard. Soon we will have swarms of tiny AI powered robots...


I was thinking about this for house building: you go to a panel and the bots that make up the house reconfigure themselves to add a swimming pool, or an extra guest room, etc. Could be pretty awesome. Let’s hope they are not used for evil :-/


Until you try to open your door once and it goes "intruder alert"... Wasn't that a Doctor Who episode? :D


OTOH we'd rather have seen the single-core perf keep improving. What's the average performance speedup (vs 1 core) that the Sun customers got, or the AMD users get, on the average software they use? The progress in programming language technology hasn't been very kind to multiprocessing[1].

[1] GPUs are another kettle of fish of course, but have their own wellknown problems that prevent widespread use outside graphics


Could you share the specs for your build, specifically the motherboard and RAM models? Also, have you verified that ECC actually works?

I am tempted to build something similar but AMD's wishy-washy ECC guarantees as well as Linux-specific issues make me unsure.


TR3960X Threadripper (24 core, 48 thread)

ASUS PRIME TRX40-Pro motherboard

8 sticks Kingston KSM26ED8/16ME (16GB, 2666MHz DDR4, ECC)

Dual boot Windows 10 / Ubuntu 18.04

2x Samsung EVO 970 NVMe M.2 1TB SSDs in RAID0 configuration.

Running in a Coolermaster Cosmos case with a stupidly big air cooler at the moment, to be replaced with a decent liquid cooler. (It works; the Cosmos is a huge case because it had to be, to hold a hacked Supermicro dual Xeon server board before - I really wanted ECC for my workstation.)

nVidia 1080Ti+ GPU.

The ECC is detected and claims it is working, although I've yet to see it correct an SBE. I haven't been running non-stop memory tests either, though.

I may end up removing the Linux partition since WSL2 works so well on this box.


I have a 3970X with ECC RAM and got error correction notifications in my Linux logs when I tweaked RAM timings too tight. Note that memtest86 doesn't know about ECC on Ryzen, so you may get unnotified error corrections happening if you use that.


Thanks for the info. Is this the same motherboard as OP?


No, but all TRX40 motherboards should support ECC the same. Mine is a gigabyte aorus pro wifi, FWIW.

BTW, Century Micro has the only unbuffered ECC modules at 3200MHz native speed (at least that was the case on the 39x0X release day). I don't know if they can be sourced outside Japan, though.

Fun story: I actually botched my order and got 2666MHz ones... but on closer inspection, it turned out the chips on the modules were actually native 3200MHz ones, with the SPD EEPROM saying they are 2666MHz. So I ended up overclocking them to their actual native speed, and I tweaked the timings to be a little tighter than what the 3200MHz modules were advertised for.


Thanks. Seeing that we are sharing ECC overclocking information, this post has some useful information for Kingston RAM: https://old.reddit.com/r/ASUS/comments/cw74rl/asus_pro_ws_x5...


What are you going to be using for the liquid cooler?


7 years ago it was doable to build a similar desktop at a similar price from a refurbished server (4x 6-core Opteron CPUs, 128GB DDR2 ECC RAM). It took about 1kW and was loud, but it was a great complement for performance testing.

BTW I've seen 64GB DDR4 sticks on Amazon...


The memory access performance on ThreadRipper (and current Ryzen/Epyc) is much better than anything prior to this generation for workstation loads. Not that the shared I/O controller is without cost, only that it tends to average out better in most cases where multiple cores across chips are in use together.

Just got my 3950X w/ 64gb ram, not sure that I'd be able to practically use any more compute than this for what I play with, which is mostly multiple back-ends and some container orchestration for local dev and occasionally video re-encodes (BR-Ripping for NAS).

Some think $4k for this CPU is too much... but considering the sheer performance that you can get these days for under $10K, there's never been a better time to build or buy a computer. My only regret is wasting time and money on aRGB that I cannot configure in Linux.


Sorry for the buzzkill, but I feel super weird about how something that was once a corporation-class tool is now a casual consumer thingie.

I do wish for exascale in the medical field though.

I even-more-wish that the energy world can see ~similar improvements in efficiency.


The real funny thing is that for most people (not OP) it would still be used mainly for word processing, email, and occasional casual gaming - and it would still be slow to run and boot.

The amount of processing power we each carry in our pockets (even the cheapest throw-away smart phones) would have been almost unthinkable 30 years ago; it's akin to the difference of an Altair of the 1970s vs what was available just 10-20 years prior. What took up a room now sat on a desk and could be purchased for the price of a car.

Now, what took up a room sits in your pocket, and could almost be given away in a box of cereal, it's so cheap.

Heck - think about what's available in the embedded computing realm for pennies (or just a few dollars in single quantities) - it's mind boggling to an extent.


I concur but it depresses me somehow. Some say that it's worth it, to me it's just the same cycle of marketing trying to disguise the things as progress.


The comments are mentioning the Xeons in Mac Pros and how Apple should switch. I have no factual basis for this, but I figure Apple has got to be using AMD's new chips as leverage to get some pretty sweet deals on Intel silicon.


Deals that Apple does not pass on to their customers.


Why should they? The market dictates the price.

If people are buying Apple products at those prices then why should Apple lower their prices? The answer is they shouldn't.


Indeed it does. Which nicely explains the rising popularity of Hackintoshes, particularly among developers and other technologists. AMD Hackintoshes[1] in particular have skyrocketed in maturity and simplicity since Ryzen.

1: https://amd-osx.com/


Apple are locking down the platform via proprietary chips etc. It isn't a long term solution and shouldn't be relied upon.


To be fair, the Hackintosh community is pretty persistent. They added opcode emulation into the kernel, for example, to handle running on CPUs without the expected instructions (older AMD CPUs back in 10.8 or so). I wouldn't be surprised to see the cat-and-mouse continue.


Wonder what happens when Apple starts "enforcing" a Tx chip to boot.

https://en.wikipedia.org/wiki/Apple-designed_processors#Appl...


Indeed, though keep in mind that the current method for running macOS on AMD doesn't use this and instead relies on patching through Clover. The downside being that whilst the OS itself may run, any applications that use an opcode that isn't implemented will simply crash.


This sounds really interesting, do you have any links that I could follow to learn more? Is it setting an invalid opcode exception handler?


They released a whitepaper on it, yeah. It was a bit annoying to dig up, but Archive.org comes to the rescue:

https://web.archive.org/web/20100217014904/http://xnu-dev.go...


It's strange that I can't find any source code for this…



Thanks! I still wonder how usable programs that utilize SSE3 are with this, but it's a pretty cool workaround.


I'd been running Hackintosh and an rMBP for my desktop and laptop respectively... this past year I've passed on both and now running Linux for my personal desktop, and will be getting a new laptop within the next few months, some of the Ryzen Asus laptops coming soon are interesting.

Although still not without issue, my workflow has aligned so much with Linux and it's finally reached a good enough point for my day to day use... not having to use the VM based mac or windows docker has been really nice (WSL isn't good enough imho).


So, explain to me again how "the market dictates the price"? ...


I know you are being facetious, however there are options which the market provides.

1) Other companies that are willing to sell you workstation and laptop computers that you can run other operating systems on such as one of the Linux variants, Windows, BSD etc. Nobody is forcing you to buy a computer from Apple.

2) There is a thriving second hand market of Apple machines just look at ebay, craigslist, gumtree etc.

If a new Apple machine isn't worth it to you, you are free to buy alternatives.

I am going to buy one of the newer Lenovo Thinkpads as I don't think the MacBook pro is worth it to replace my ageing Macbook Pro.


> Which nicely explains the rising popularity of Hackintosh's, particularly among developers and other technologists.

Can you provide a citation for this? Having worked with literally thousands of engineers, I have never seen a hackintosh in real life.


I believe he’s wrong and they are not meaningfully rising in popularity. Hackintoshes started as soon as the Intel transition, and they do exist (I’ve seen a couple personally, both built around the Leopard era when tools were mature and the desirability of iCloud/iMessage integration was lower.) Today there aren’t many people who need to push computing beyond the relatively affordable Mac Mini and iMac configurations. Most Hackintosh practitioners want Apple to release the “XMac,” a cheap and configurable desktop tower* and operate their Hackintosh in its stead.

It’s a respectable hobby, though, like iOS jailbreaking or emulation, and for persistent people it does let them run MacOS on more powerful hardware than they could afford.

* https://arstechnica.com/staff/2005/10/1676/


Not sure a citation exists for such an assertion -- but I'm a decade deep and numerous production (profitable) iOS apps shipped without ever touching Apple hardware. MacOS is a joy but the hardware is consumer rot.


I’m sure if it was indeed popular, there would be some way of demonstrating that. Aside from the fact that I have never seen one in my entire career as a consultant, working with hundreds of organisations, the reason this sounds ridiculous to me is that Apple has a long history of making very little effort to obstruct the hackintosh community. Which suggests very strongly that the community is too small for Apple to bother with. There are a few topics on HN that seem to bring out people claiming that incredibly niche interests are actually very common and popular. Apple is one of them. So I don’t think it’s unreasonable to expect that somebody making such an incredible claim should have at least some way of substantiating it.


> I’m sure if it was indeed popular, there would be some way of demonstrating that.

Google Trends suggests that searches for 'Hackintosh' peaked around 2009 and have steadily declined to around half that level since then. Searches for 'Ubuntu' dominate so much they make the Hackintosh graph look flat by comparison, but Hackintosh seems about as popular as 'Manjaro' (a Linux distribution) currently is, fwiw:

https://trends.google.com/trends/explore?date=all&q=hackinto...


I personally know of an iOS software house where all the developers use hackintoshes, so yeah, it's popular.


Which you can derive from there being at least one hackintosh shop?


It's the only iOS software house I personally know, so from my POV it represents 100% of macOS/iOS shops.

YMMV.


You put together all your information from anecdotally never seeing one.


The only reason I'm using a Mac now is by pure good luck as all my PC components at the time (back in 2011-ish) were OSX Snow Leopard compatible, right down to Wi-Fi, bluetooth, motherboard, soundcard, etc. I admit it wasn't completely vanilla due to the infamous tonymacx86 software method but it did get me running quickly.

I gave OSX a test drive and found it much more simple compared to Windows. As I already had an iPhone/iPad it made sense to switch. Come upgrade time I bought a Macbook Pro and have been on OSX since.


Ah yes, the "market" for macOS machines with PCIe slots.


While you may scoff, enough people are willing to pay the extra for Mac OS and a high-level workstation to justify their prices.


Elasticity?


The market price is the equilibrium of both supply and demand.

What you're saying is the fact people are buying their machines means they shouldn't change price or value proposition... because there are in fact purchases; which, is quite frankly, baffling, to me, a humble idiot.

Mercedes must think their EQC is positioned perfectly in the market with 55 sales? I now imagine Magic Leap will be leaping to raise prices with their next version?


> The market price is the equilibrium of both supply and demand.

Obviously.

> What you're saying is the fact people are buying their machines means they shouldn't change price or value proposition... because there are in fact purchases; which, is quite frankly, baffling, to me, a humble idiot.

If Apple are selling the machines in sufficient quantities at whatever they are priced at (I haven't cared to look) then obviously Apple's customers think they are worth it. It isn't really more complicated than that.


You assume without any facts or supporting evidence that they are achieving optimal sales. What you're saying is literally something you're just making up out of thin air. It's okay to be completely full of shit, just don't market it as truth.


Just doing a quick web search, the company made $224 billion in 2018. Do you honestly think they aren't achieving optimal sales? The proof is in the pudding and they have a very, very large pudding.

> It's okay to be completely full of shit, just don't market it as truth.

I know you think you are being big brained but not everything is "you must provide a citation". It is pretty obvious Apple knows the market well, knows exactly what they can and can't charge for certain products. You pretending otherwise because I haven't provided you with a citation is a complete joke, it is like asking someone to cite for evidence that the sky is blue.


> "Obviously the way it is, is the way it is. Obviously".

In conclusion, you think no company should adjust their prices, ever? Because the current price is the market price which is obviously the right price because it's the market price?

It's circular reasoning, which can be applied to any sales situation - and if it explains everything, it explains nothing.


> In conclusion, you think no company should adjust their prices, ever?

Obviously not. I am saying they have no incentive to change the price if the sales are in line with or above what they would have forecast.

> Because the current price is the market price which is obviously the right price because it's the market price?

> It's circular reasoning, which can be applied to any sales situation - and if it explains everything, it explains nothing.

Again you don't seem to understand basic market economics. Your product is only worth what people are willing to pay for it. There is the odd exception to the rule (Head and Shoulders Shampoo being one of them, which is priced far higher than they originally intended because people assumed it didn't work because it was cheap).

Generally, if there are two or more companies producing product X (in this case computer workstations and laptops), then the market will coalesce around a particular price point for a particular specification. Sure, there are those that will always stick to a brand, but the vast majority of consumers won't be loyal.

Whether or not the company makes a profit on each unit sold is irrelevant to its market price. If they price their product higher than their competitors people will look at the alternatives.

e.g. I bought a MacBook Pro in 2015 because Apple's machine was cheaper than Lenovo, Dell for the same spec and had a better screen than any of the competitors machines.

This really isn't complicated stuff. I think that personal bias seems to cloud people to some basic truths.


> The market price is

No. There is usually no such thing as _a_ market price. There's a distribution of prices for purchases of the same item (and that's when we ignore the cases of transactions involving more than just the transfer of money).

> the equilibrium of both supply and demand.

Supply and demand for specific products are more the _result_ of socio-economic processes and phenomena than their _causes_.


You must have never seen a supply and demand graph; let me enlighten you: https://cdn.britannica.com/70/74270-050-317C4423/Illustratio...

If you're talking to me about socio-economic processes when we're talking about something this simple, it pretty clearly shows you're not very educated in Economics.


You _do_ realize that if you say something with a chart instead of with text it does not become more valid, right?

https://en.wikipedia.org/wiki/Supply_and_demand#Criticism

If you like your charts and economic formalisms, and believe in "market prices", perhaps you should take the time to read the Candide-like "Production of commodities by means of commodities" by Piero Sraffa.


If people continue to pay Apple's (sometimes) outrageous prices, why should they lower them?

(I'm just as guilty, having spent over two grand on MBPs multiple times!)


Despite their price, there was a time when MacBooks were only 5-10% more than the equivalent PC laptop. People who complained about the price were inevitably comparing it to bottom-of-the-barrel PC laptops, not higher-end business laptops with comparable specs.

I haven't priced out any recent macbooks to know if that's still true though. Glancing at the new 16" macbook pro, it seems like it might be reasonably priced for what you're getting.


There is so much that goes into a laptop that doesn't make it to the spec sheet either. People only look at a couple of specs to decide how much it should cost, but some laptop makers put everything into those specs and cheap out on everything else, and you end up with a laptop with a fast CPU but brittle plastic, a TN display, a DAC that hisses and a whole bunch of other nastiness.


Yeah, the MBP 16” is pretty comparable to the Dell XPS in price - at least when comparing base models.

However, the costs go up a lot if you spec out a custom config (+$400 just for 32GB RAM). Then the MBP starts looking quite a bit more expensive. Overall I don’t think they’re a bad buy though if you want macOS.


>Yeah, the MBP 16” is pretty comparable to the Dell XPS in price - at least when comparing base models.

The Dell XPS [1] with a comparable spec costs $1650 compared to the MBP 16" at $2399. In the old days Apple would have priced it closer to $2199 or slightly lower.

Somewhere along the line they started giving the Mac the same margins as the iPhone.

[1] https://www.dell.com/en-us/shop/deals/new-xps-15-laptop/spd/...


That said, the Mac pro, even starting at the base price is pretty outrageous... I mean, I get $500 for the case and $1500 for the MB, but the rest just seems to be too much in aggregate, and even more ridiculous for mfg upgrades out of the box.


>Despite their price, there was a time when macbooks were only 5-10% more than the PC equivalent laptop.

I seriously doubt 5-10%. You are talking about a minimum $50-$100+ difference; that has never happened. Macs have always been roughly 20-30% more expensive than a laptop with comparable specs. So for a $1000 comparable-spec laptop, Apple will sell you one for $1300 (but with more expensive upgrades).

The 30% has been fine for years, the quality and finishing as well as macOS was well worth the price tag. But in recent years it hasn't been 30% at all.


I remember when I bought my first Macbook Pro in 2014, the specs compared to Dell/Lenovo was in the same price range. There was no Apple Premium.

So I decided to give Apple a chance, and I am still using that laptop today, it has really been great value for my money.


As you lower the price, more people can and will buy your product giving you (hypothetically) more profits than before. I’d like to think Apple has done all their homework about what price point to sell at to maximize profit but honestly at this point I think they just make up whatever huge number they want for the Mac Pro price to make it seem cool and go with it.

https://blog.asmartbear.com/price-vs-quantity.html


Don't feel too bad. 386 PCs used to sell for $15k in 2020 dollars.


When I was in high school in the early 2000s I had a friend who told me his parents purchased a 386 when they first came on the market where I live, and they paid around $15k. My jaw just dropped. Then my friend just started chuckling at his parents' folly, since even looking back at that time it was such an expensive paperweight. Heck, thinking about it now, it probably came up in conversation because at the time I had a hobby of picking up old computers that people had thrown out and cobbling together the working parts, and I built a 386 and a 486 that way. Good times.


Apple has been successfully using the Good, Better, Best three-tiered pricing model for quite some time. I remember buying a Powerbook 140 at the time when I really wanted the 170 but could not justify the increased price.

From wikipedia:

"Intended as a replacement for the Portable, the 140 series was identical to the 170, though it compromised a number of the high-end model's features to make it a more affordable mid-range option. The most apparent difference was that the 140 used a cheaper, 10 in (25 cm) diagonal passive matrix display instead of the sharper active matrix version used on the 170. Internally, in addition to a slower 16 MHz processor, the 140 also lacked a Floating Point Unit (FPU) and could not be upgraded. It also came standard with a 20 MB hard drive compared with the 170's 40 MB drive."


So where's the "good" Mac with a PCIe slot?


The MBP isn’t even a very good example of Apple price gouging. Any other laptop that you can get for less money is likely to have rather significant trade-offs.


Not much of a deal, but the price differential on the high end Mac Pro CPUs is actually slightly less than Intel list.


So? As any corporate sack-chortler worth their salt will eagerly tell you, to seek rent is not only their prerogative but their moral duty.


My hunch is that the ARM transition is coming sooner rather than later and instead of spending time and effort into rewriting a bunch of OS functions (like AirPlay Mirroring) to support AMD processors that lack Intel-only features like QuickSync, Apple is just going to drop the iOS pieces it has written for ARM64 into the MacOS (or whatever it's called) that runs on their upcoming ARM-based desktops and laptops.

I mean, I wouldn't re-write code to support hardware-accelerated video transcoding on Ryzen+GPU if I knew that in 2-3 years I was moving from x86-64 to ARM64.


How well would Premiere or Photoshop run on ARM? How long would it take to rewrite these apps to run natively (i.e. with acceptable performance) on an ARM platform? Until you can do video/photo/audio editing faster w/ ARM using existing software products, I do not see it being viable as a replacement for x86 in any of Apple's higher-end products. Perhaps ARM is included for sake of mobile development, but with 32+ modern x86 cores you could just as well emulate ARM and barely feel any overhead.

Sure, there are a lot of neat tricks ARM can do with special instructions and hardware accelerators in very well controlled use cases. But, for the average creative professional who doesn't have time or patience to play with hyper-optimizing their workflow, having an x86 monster that can chew through any arbitrary workload (optimized or otherwise) is going to provide the best experience for the foreseeable future.


Apple has dragged Photoshop kicking and screaming onto different platforms and even OS’ before. If they succeeded when they were nearly dead (OS X Cocoa timeframe) they will have no problems doing it now.


Like sibling comment says, it's not exactly the first cpu transition for Photoshop...

> Photoshop 1

> Photoshop 1 (1990.01) requires a 8 MHz or faster Mac with a color screen and at least 2 MB of RAM. The first release of Photoshop was successful despite some bugs, which were fixed in subsequent updates. Most users ended up using version 1.07. Photoshop was marketed as a tool for the average user, which was reflected in the price ($1,000 compared to competitor Letraset’s ColorStudio, which cost $1,995).

> Photoshop 1.x requires Mac System 6.0.3, 2 MB of RAM, a 68000 processor, and a floppy drive.

Source: https://lowendmac.com/2013/photoshop-for-mac-faq/


They are rewriting all those apps for iPadOS anyway, which is where many of the professionals are moving. It is the same with Autodesk: their CEO said he doesn't know how well iPad sales are doing, but he clearly sees a trend of more pros moving to the iPad. They are everywhere in their industry.

(Unfortunately I can no longer google the link of the video)

Which is not to say they will move to ARM. I am still skeptical of it.


> to support AMD processors that lack Intel-only features like QuickSync

QuickSync is just a hardware video encoder, which AMD has as well - it is called VCN [1]. Not to mention Apple hasn't been using QuickSync for as long as they have been shipping the T2, where Apple uses their own video encoder within the T2. (The T2 is just a rebadged A10.)

[1] https://en.wikipedia.org/wiki/Video_Core_Next


And 100 comments but not a single mention or suggestion as to how Apple would deal with their vested interest in Thunderbolt. I would not be surprised if 90% of the PCs shipped with Thunderbolt were from Apple.

USB 4 is out, the Thunderbolt 3 spec has been out for quite a long time as well, and yet we don't even have a single announcement of a USB 4 controller.


Unfortunately Threadripper only ("only"...) supports 256GB of RAM; Mac Pros can go up to 1.5TB.

So they definitely couldn't switch completely, and supporting both would be expensive for apple due to doubling mobos + testing + drivers etc, and confusing for the consumer because of the differing max RAM capabilities.


You're not considering AMD Epyc chips here, and Supermicro already have dual socket boards out that will take 2TB of RAM.


You're right I wasn't, because I presumed those chips were clocked too low to be useful outside of server farms.

I am not the audience for a Mac Pro though, maybe it's fine?


I'm sure Apple's own chip team works pretty good as leverage too.


“consumer variant of the 64-core EPYC”

At nearly $4000 for just the CPU, is that still consumer territory? I assume only huge companies would spend that much money on a single CPU.


It's clocked at 2.9/4.3 instead of 2.0/3.4 like the 64 core EPYC part. So it's actually a lot faster than the server model. Kind of strange.


The EPYC still has double the bandwidth (8x vs 4x DDR4-3200).


Can someone explain why server variants always have lower base clocks? In particular I'm interested in whether consumer higher-clocked variants are less reliable for long-term 24-hour full-load use. It has to be something like that, and not just power consumption considerations.


You can’t fit the kind of cooler you would need in a server case, which is one reason; power is likely the other.


They do make 4U server cases that would fit watercooling, several GPUs, and a heavily-loaded motherboard with redundant PSUs quite nicely.


AFAIK, you can overclock the Server variants if you have sufficient cooling, not sure on binning or the extra memory bandwidth in terms of CPU overclocking overhead... but there is room there.


My 7742s seem to spend essentially all their time at ~3.2 while in heavy all-core use.

This isn't the same experience I've had with the consumer threadripper at all, so I don't think these numbers make for a simple comparison.


Which power profile do you have set - 225W or 240W?


Not sure, using a H11dsi. I don't recall setting it, though if I did I assume I would have picked the higher amount. :)

My chips are pretty well cooled.


Look for cTDP ("configurable TDP") in the BIOS.


It was set at a mysterious "auto". I've set it to 240 now. I'll see if I notice a difference.


If I'm not mistaken, most of these threadripper systems seem to come with a water cooling system that I'd bet isn't present in server systems. That's my guess.

It might also be that "consumer" workloads will run the cores at the high speed more infrequently than a server which might be running full tilt 24/7. Just a thought


AMD (currently) only has server and consumer CPU lines. So since Threadripper isn't a server product, it falls into the consumer line even though it will mostly be used by professionals and extreme enthusiasts with deep pockets.


Many large CG/VFX firms buy workstations that cost anywhere between 35K - 80K USD. See [1].

[1] https://twitter.com/yiningkarlli/status/1204564015113895936


That’s because we’re buying server racks for virtualized workstations, not deskside systems. Even from Tier 1 vendors a viable deskside workstation for pro VFX doesn’t approach 35K without shoving dual socket 24-core+ Intel chips that have no realistic purpose being in a workstation unless you’re buying for specificity.

The Mac Pro is a specialty item, not the norm.


I'd love to know more about this, because I think it is the future for everyone. What VDI environment are you using, what runs on your desktop vs run on the racks? Are the racks shared or do you have dedicated hardware provisioned to you? Do you use more than one backend rack system at a time?


Does it mean people buy Mac Pros because macOS cannot run under a VM?


People buy Mac Pros specifically to run Mac video editing software. That is the only reason to own a Mac Pro, e.g. if you work for Marvel editing films.


Not sure I would call that consumer. Maybe single user might be more appropriate.


"Consumer" is used there for market segmentation, essentially to sell additional features in an other variant.

More or less like consumer goods companies using the "pro" keyword on pretty much any somewhat evolved product, but the other way around.


There is a huge gamer market. Every hardcore gamer wants to have the fastest CPU available.

Those who can afford it will buy it.

edit: yes, it is definitely overkill for most if not all available games, but in a certain scene "overkill" is considered awesome


It's definitely not for gamers, even silly "I have to have the best thing" gamers. Games are considered "lightly threaded" in these sorts of conversations, so you're looking for max boost clock / IPC.

However, lots of the youtube influencer ruling class will buy it, in part because of what you said ("those who can afford it...").

The real legitimate consumer base for this are people whose work productivity is held back by compute loads that are embarrassingly parallel. If you spend a lot of your time waiting for a (well threaded) compiler to finish, or blender to render something out, or whatever.


To be fair, the YT and Twitch "influencer ruling class" uses all threads for video encoding (you need software encoders for the best quality).


At least for Premiere it's not as useful as you'd think: https://www.youtube.com/watch?v=StJssAQZlZc&t=38s

(at least for now, I don't work in that industry but it may just be a code quality issue)


Twitch / "Influencer" video gamers aren't using Premiere for videos.

Twitch streamers need a "streaming" solution. They play live, and instantly react to the crowd. If someone pays for an emote or something, the Twitch-streamer is expected to look on camera and say thank you to the donor (and maybe repeat the message that the donor paid for).

This means that a Twitch streamer's computer MUST encode the gameplay live. Traditionally, Twitch streamers would buy two computers, one to play video games, and a 2nd computer to process the video stream and upload it to Twitch.

With the advent of 16+ core computers, Twitch streamers have begun to simply buy one computer, lock 8-cores to the video game, and then lock 8-cores to the Twitch encoder.

Presumably, something like a Threadripper (24+ cores) could process the video stream for better quality and lower bandwidth. Maybe live VP9 encoding, for example (8-cores for the video game, 16-cores or more for the encoder)
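
The core-locking part is mundane; for example, a rough sketch with psutil (game_pid and encoder_pid are placeholders for the actual process IDs):

    import psutil

    # Pin the game and the encoder to disjoint core sets.
    psutil.Process(game_pid).cpu_affinity(list(range(0, 8)))      # cores 0-7: game
    psutil.Process(encoder_pid).cpu_affinity(list(range(8, 16)))  # cores 8-15: encoder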


I know what Twitch is, I said YouTube. Someone else said Twitch alongside YT.


streaming is a thing on YT


ffmpeg is a much better way to look at video encoding. Codec libraries themselves have varying ability to use multiple cores effectively.


I'm not sure modern game engines make good use of large numbers of cores, because most consumers don't have that many.

Intel chips are still competitive for single-threaded performance, from what i can tell https://www.pcworld.com/article/3453946/amd-threadripper-397...


You could host your own Eve Online universe in your bedroom.


On a virtual machine cluster, while you use a few of those cores to play a game from the same PC


EVE Online is written in python and can only leverage a single core per zone =)


Good thing the Eve Online universe has lots of zones to use all those cores (~8000 solar systems, iirc).


Consoles have 8 cores and are about to have 8 cores / 16 threads. Multi-core utilization on Android is essential to running at lower clock speeds and not thermal throttling. Engines and graphics APIs are catching up and there are fewer single-thread bottlenecks than there used to be.


For most current games, anything more than 6 cores is overkill according to the benchmarks.

With the next console gen being based on Ryzen instead of the much less efficient Jaguar architecture, maybe 8 cores might be better used.


Nobody wants a 64 core CPU for games. Single core performance is the most important factor by far since most games aren't really optimized for parallel computing.


Does the CPU actually matter that much for modern AAA games? (I haven't played any AAA game in a looooooooooong time.)


I help people build PCs sometimes and people's first pass at picking components almost always overspends on the CPU and underspends on the GPU. For any fixed budget you will usually get (much) better perf by getting a low-to-mid range Ryzen 3/5 or i3/i5 CPU and spending the difference on a better graphics card.

Most games are not particularly CPU intensive (although there are exceptions like Ashes of the Singularity which actually does usually get bottlenecked by the CPU unless you have a really high end one)


Not really. You run the game loop on one core and physics simulation / sounds / world streaming on a handful of other ones. It's more about the maximum single core performance which is not that different on a $150 Ryzen. Going to the most expensive CPU on the market probably only gets you 5 FPS more compared to the top end GPUs that scale rendering in parallel like a dream and can give you 100+ more FPS.


And physics simulation is something trivial?


Physics simulation in most current AAA games is pretty trivial for low-mid range current cpus yes. There are outliers of course.


AMD has only just started upping the thread counts with the Ryzen processors. And now that Ryzen has started being adopted by gamers and streamers, newer games are getting better at utilising these cores. With current-generation games the extra threads won't help, but for newer titles they will.


LinusTechTips tested it, and, IIRC, the answer is CPU bottlenecks aren’t really a thing anymore.


Because "functional decomposition", a method of threading where you allocate threads per function like physics or rendering, fell out of fashion. Modern game engines instead use a task system where tasks are spun off a main thread and asynchronously computed.


They sure are a thing, because programmers. For example, until a patch a month ago, RDR2 needed 6 CPU cores for stutter-free play, thanks to bad console-first optimizations.


The key word was really. There are obviously times where CPU bottlenecks are a thing, but for the most part, they’re not.


I just upgraded my CPU and kept my GPU, an older Nvidia 960 GTX. My CPU was a Q6600 and now I have a Ryzen 2700X. CPU matters so much it's amazing. Some games are heavy on the GPU and some are heavy on the CPU. Depends on what kind of things they have going on.


You'd be hard-pressed to find a current CPU that would be the bottleneck in most AAA titles, though. The Q6600 was released in early 2007.


I haven't had any experience with cheap current-generation CPUs. Do you have any experience with ones such as the Intel Celerons that go for about $30 on Amazon?

https://www.intel.com/content/www/us/en/products/processors/...


If you want a high 240 Hz refresh rate at 1080p, an i7-9700K or i9-9900K is what you want.


The 64-core Threadripper is almost the worst gaming CPU AMD sells.


The original Apple II sold at an adjusted 2019 price of $5,476, and it sold millions. I would say there are plenty of consumers with that much buying power. Of course, that doesn't mean most consumers need that many cores - but that's still true even if it cost $100.


I'd say Pro-sumer HEDT/Workstation, yes... if you need that much compute and can use it, you're probably making money from it.


My fridge cost $4k and it does less than this CPU.


I bet it runs cooler.


I'd be interested in that comparison. A fridge throws off quite a bit of heat when cooling. Luckily they're closed the vast majority of the time and are well insulated.

Which uses more energy at Max load?


Definitely the CPU. A fridge will use +/- 200 watts. The TDP on these chips is upwards of 250.


You significantly overpaid for your fridge.


Go check out the price of built in fridges and re-evaluate your comment.


Miele, a European 'A' brand tops out at about a grand:

https://www.coolblue.nl/koelkasten/inbouw/miele


Do yourself a favor and look up a real fridge like a Thermador.


I see a fridge. Miele -> real fridge. Thermador -> also a real fridge. Difference in value not reflected in either functionality or efficiency, long term negative effect on pocketbook for non-income generating asset -> wasted money.


So wouldn't this be something like 2.6 TFLOPS? I'm wondering if this could replace NVIDIA V100s to train something like ImageNet purely on CPU. However, the V100 has 100 TFLOPS, which seems ~50x more than the 3990X. Perhaps I'm reading the specs wrong?

PS: Although FLOPS is not a good way to measure these stuff, it's a good indication of possible upper bound for deep learning related computation.


DDR4 has a bandwidth of about 25 Gigabytes per second. The memory on a V100 does about 900 Gigabytes per second. Cerebras has 9.6 Petabytes per second of memory bandwidth. For stochastic gradient descent, which typically requires high-frequency read/writes, memory bandwidth is crucial. For ImageNet, you're trying to run well over 1TB of pixels through the processing device as quickly as possible while the processor uses a few gigabytes of scratch space.
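A quick back-of-the-envelope with those numbers (ignoring caching and data reuse, so purely illustrative):

  # Time to stream ~1 TB of training data once through each device,
  # at the bandwidths quoted above.
  dataset_bytes = 1e12
  for name, bw in [("DDR4 channel", 25e9),
                   ("V100 HBM2", 900e9),
                   ("Cerebras (claimed)", 9.6e15)]:
      print(f"{name:>20}: {dataset_bytes / bw:10.4f} s per full pass")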


DDR4 has a bandwidth of about 25 GB/s per channel. You can hit around 100-200 GB/s on Epyc processors if you're utilizing RAM efficiently. GPUs tend to enforce programming models that ensure more sequential accesses, but CPUs can do it too.


oh that's true thanks! I knew GDDR had higher bandwidth but the gap seemed a little high when I looked it up


These stats are true, but the CPU's biggest advantage is L1, L2, and L3 cache.

In particular, the 3990X, the 64-core Threadripper, will have 256 MB of aggregate L3 cache and 512 KB of L2 cache per core (32 MB of L2 in total). Highly optimized kernels may fit large portions of data within L3 cache and rarely even touch DDR4!

Note: each L3 cache slice is only 16 MB, shared between 4 cores. It will take some tricky programming to split a model into 16 MB chunks, but if it can be done, Threadripper would be crazy fast.

True, GPUs have really fat VRAM to work with, but CPUs have really fat L3 cache and L2 cache to work with. And the CPU caches are coherent too, simplifying atomic code. GPUs do have "shared memory" and L2 caches, but they're far smaller than CPU-caches.
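A hedged sketch of the kind of blocking that implies: stream a big fp32 weight matrix through a matrix-vector product in row blocks sized to one ~16 MB L3 slice. The sizes are illustrative, and a real kernel would do this in C/intrinsics rather than NumPy:

  import numpy as np

  L3_SLICE_BYTES = 16 * 2**20                    # ~16 MB per 4-core CCX, per the comment
  rows, cols = 16_384, 4_096                     # illustrative layer: ~256 MB of fp32
  W = np.random.rand(rows, cols).astype(np.float32)
  x = np.random.rand(cols).astype(np.float32)

  rows_per_block = L3_SLICE_BYTES // (cols * 4)  # fp32 rows that fit in one L3 slice
  y = np.empty(rows, dtype=np.float32)
  for r in range(0, rows, rows_per_block):
      block = W[r:r + rows_per_block]            # this block (roughly) fits in L3
      y[r:r + rows_per_block] = block @ x        # reuse it while it's cache-hot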


>> Cerebras has 9.6 Petabytes

More accurately, Cerebras has 9.6 "bullshitobytes" per second. If you can't verify this, it doesn't exist. You could claim insane "bandwidth" by considering your register file to be your "memory". But that doesn't make it so.


Nah, you’re probably not reading the specs wrong. GPU-type devices severely outclass CPUs in raw compute power. This has been true for years, and it’s why deep learning depends on them. But CPUs and GPUs fulfill very different niches computationally; GPUs are incredibly parallel but aren’t well-suited for serial tasks or tasks that require unpredictable branching/looping, which is pretty much exactly what CPUs are good at.


Think of it like this:

CPUs do general computing. They're super flexible, but if you have a specific workload you might be able to use a different piece of silicon to get more performance.

GPUs do more parallelized computing, but they don't do as many different operations. They're really good at doing small, not-super-complex operations fast and massively in parallel (like updating an array of pixels on a screen, for example).

TPUs are even more parallelized, but the operations they do are even more specific and often simpler than the operations GPUs do.


Aren't CPUs also slower because they have security overhead (checking memory access, etc.), which the GPU doesn't have (as much)?


That's not really why they're slower. CPUs are significantly more complicated. Things like branch prediction, transactions, sophisticated prefetching all take up a lot of silicon.


> Things like branch prediction, transactions, sophisticated prefetching all take up a lot of silicon.

Those things make CPUs faster at sequential execution.

In effect: CPUs are latency optimized. GPUs are bandwidth optimized.


This is it exactly.

The more specific you get on the circuit the less flexible it is and the more bandwidth you get at a specific task.

The tradeoff is flexibility for application specific performance. CPUs can do hella stuff but they can’t do a specific thing faster than specialty hardware.


GPUs are surprisingly flexible. A GPU is a full-on Turing machine.

The thing is, GPUs have horrible latency characteristics compared to CPUs. Whenever a GPU "has to wait" for RAM, it only switches to another thread. In contrast, CPUs will search your thread for out-of-order work, speculative work, and even prefetch memory ("guessing" what memory needs to be fetched) to help speed up the thread.

--------

Consider speculative execution. Let's say there is a 50% chance that an if-statement is actually executed. Should your hardware execute the if-statement speculatively?

Since CPUs are latency optimized, of course CPUs should speculate.

GPUs however, are bandwidth optimized. Instead of speculating on the if-statement, the GPU will task switch and operate on another thread. GPUs have 8x to 10x SMT, many many threads waiting to be run.

As such, GPUs would rather "make progress on another thread" rather than speculate to make a particular thread faster.

---------

What problems can be represented in terms of a ton of threads? Well, many simple image-processing algorithms operate on 1920 x 1080 pixel entries, which immediately provides 2,073,600 pixels... or ~2 million items that often can be processed in parallel.

When you have ~2-million items of work (aka: "CUDA Threads") waiting, the GPU is the superior architecture. Its better to make progress on "waiting threads" than to execute speculatively.

But if you're a CPU with latency-optimized characteristics, programmers would rather have that if-statement speculated. The 50% chance of saving latency is worth more to a CPU programmer.


Remember that these millions of items must actually be able to do work independently, i.e. you need very few (or regular and localized) dependencies between the data processed by each thread, so that threads can just wait on things like memory accesses rather than waiting on _other threads_.


> rather than waiting on _other threads_.

Hmm, I think I see what you're trying to say, but maybe more precise language would be better here.

GPU cores have extremely efficient thread-barrier instructions. NVidia PTX has "barrier", while AMD has "S_BARRIER". Both of which allow the ~256 threads of a workgroup to efficiently wait for each other.

-------

The other aspect, is that "Waiting on Memory" (at least, waiting on L2 memory) is globally synchronized in both AMD and NVidia systems. Waiting for an L2 atomic operation to complete IS a synchronization event, because L2 cache has a total memory ordering on both AMD and NVidia platforms.

Tying L2 cache to higher levels allows for memory coherence with the host CPU, or other GPUs even. That is to say: "dependencies" are often turned into memory-sync / memory-barrier events at the lowest level.

Synchronizing threads is one-and-the-same as waiting on memory. (Specifically: creating a load-and-store ordering that all cores can agree upon).

---------

I think what you're trying to say is that dependency chains must be short on GPUs, and that there are many parallel-threads of dependency chains to execute. In these circumstances, an algorithm can run efficiently.

If you have an algorithm that is an explicit dependency chain from beginning to end (ironically: Ethereum hashing satisfies this constraint. You work on the same hash for millions of iterations...), then that particular dependency chain cannot be cut or parallelized.

But Ethereum Hashing is still parallel, because there are trillions of guesses that can all work in parallel. So while its impossible to parallelize a singular ETH Hash... you can run many ETH Hashes in parallel with each other.
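A toy version of that structure in Python, with SHA-256 standing in for Ethash and processes standing in for GPU threads: each chain is strictly sequential, but different nonces are independent, so they all run in parallel.

  import hashlib
  from multiprocessing import Pool

  def chain(nonce, iterations=100_000):
      h = nonce.to_bytes(8, "little")
      for _ in range(iterations):      # sequential: every step depends on the last
          h = hashlib.sha256(h).digest()
      return nonce, h.hex()[:16]

  if __name__ == "__main__":
      with Pool() as pool:             # parallel across independent nonces
          for nonce, digest in pool.map(chain, range(16)):
              print(nonce, digest)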


The specs aren't being read apples-to-apples. OP is using the tensor-core numbers, which are half precision and only for matrix multiplies. Operations that don't fit that will use the standard FP32/16 performance of the chip, which is around 13 TFLOPS: still higher than 2, but nowhere near 100.


64 cores * 2.9 GHz * 8 single-precision lanes * 2 issue * 2 (FMA) = 5.9 TF. This compares with 14 TF for V100 (costs more and needs a host). The 100 TOPS for V100 refers to reduced precision (which may or may not be useful in a given ML training or prediction scenario). The V100 has much (~9x) higher memory bandwidth, but also higher latency.
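The same arithmetic, spelled out (using the figures in the comment above; the lane/port counts describe AVX2-style FP32 FMA throughput):

  cores, clock_hz = 64, 2.9e9
  fp32_lanes, fma_ports, flops_per_fma = 8, 2, 2
  peak = cores * clock_hz * fp32_lanes * fma_ports * flops_per_fma
  print(f"{peak / 1e12:.1f} TFLOPS")   # ~5.9 TFLOPS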


* Radeon VII GPU is 14.2 TFlops for $600 right now.

* NVidia RTX 2070 Super is 9 TFlops for $500

True, Radeon VII and RTX 2070 are "consumer" GPUs... but Threadripper is similarly a "consumer" CPU and commands a lower price as a result.

"Enterprise" products cost more. EPYC costs more than Threadripper, V100 costs more than RTX 2070 Super. If you're aiming at maximum performance at minimum price, you use consumer hardware.


You pay the price in locked down fp64 functionality. (Typically quarter or less of fp32 rather than expected half.)


Most people don't need double-precision.

Similarly, Threadripper loses RDIMM and LRDIMM support and has 1/2 the memory channels and 1/2 the PCIe lanes. Most people don't need those either.

Of course consumer products have fewer features than enterprise products. The chip manufacturers need to leave some features for "enterprise". The general idea is to extract more wealth from the people who can afford it, while providing consumers the features they care about at a lower cost.

-----

Really, the feature enterprise GPUs need seems to be SR-IOV, or other PCIe-splitting technologies. This allows a singular GPU to be split over many VMs.

Double-precision floats are niche to the scientific fields, which is also "enterprise", but I don't think most enterprise customers use double-precision floats.


Some product manager smiled as they entered in the pricing for this product. Screw the margin, make it match the SKU.


True story: AMD's original price was different. I suggested this price in a pre-briefing the night before, and got an email at 4am on the day of the announcement to say it had been changed.


Can someone clarify what exactly is cut here compared to the EPYC 7742, other than PCIe lanes? I don't quite get how AMD wants to avoid competing with its own product.


They also cut multi-socket and half the memory channels. It's only ~10% cheaper than Epyc 7702P so AMD is not hurting themselves.


Ah okay. I missed half of memory channels so it does make sense now.


The EPYC chips have twice the memory bandwidth as well, and allow for dual-socket systems. I don't believe the enterprise security and management features are in their HEDT counterparts.

Chips like these are for a specialty market, for those who are using applications and workloads that can actually take advantage of all those cores and aren't running a personal datacenter. You're not going to see this offered in any of the rack/blade systems offered by the likes of Dell, HPE, SuperMicro, Lenovo, etc where organizations are actually going to be purchasing EPYC chips.


I haven’t seen any reviews taking into account the price of unregistered DDR4 RAM at the densities required to use these things. AMD is suggesting 2GB/core or 128GB, but these HEDT CPUs cannot support the typical registered RAM used by server memory kits and require unregistered desktop RAM (for market segmentation, you see) which can be (depending on SKU) quite a bit more expensive. I feel any price comparisons with Epyc or Xeon need to take that into account.


This is a bit tangential, but maybe someone could give me some advice.

When running with many cores is there some way to get the Linux kernel to run them at a fixed frequency?

I've been doing multicore work lately on AWS - which works pretty well, but you don't get access to event counters, so sometimes I'm having trouble zeroing in on what the performance bottleneck is. At higher core counts I get weird results and I can never tell what's really going on. Running locally I have concerns about random benchmark noise like thermal throttling, "turbo" and other OS/hardware surprises (I know on modern chips single-thread stuff can run really differently from when you load all the cores). I've been thinking of getting something a bit dumber like an old 16-core Xeon (I'm on a bit of a budget) and clocking it down - or maybe there is some better solution?


As was mentioned cpupower frequency-set will sort you out, though you have to do it for every single core/thread combo.

The other option if it's an occasional thing is you can change it in your BIOS and lock them (I believe) to a specific speed.

Since you're not doing it for higher performance just consistent performance you can lock it to your base clock frequency and it should be stable.


if you omit a specific cpu core, all of them are affected by ‘cpupower...’


Really? Maybe I'm thinking of a different tool then, because that didn't work for me (on an AMD CPU)


oh, that's surprising, here is what i did:

  seldon ~ # cpupower frequency-set -f 1.86GHz                                                   
  Setting cpu: 0
  Setting cpu: 1
  Setting cpu: 2
  Setting cpu: 3
  Setting cpu: 4
  Setting cpu: 5
  Setting cpu: 6
  Setting cpu: 7
  seldon ~ # cpupower --cpu all frequency-info | grep -E '^analyzing CPU.*|current CPU frequency'
  analyzing CPU 0:
    current CPU frequency: 1.86 GHz (asserted by call to hardware)
  analyzing CPU 1:
    current CPU frequency: 1.86 GHz (asserted by call to hardware)
  analyzing CPU 2:
    current CPU frequency: 1.86 GHz (asserted by call to hardware)
  analyzing CPU 3:
    current CPU frequency: 1.86 GHz (asserted by call to hardware)
  analyzing CPU 4:
    current CPU frequency: 1.86 GHz (asserted by call to hardware)
  analyzing CPU 5:
    current CPU frequency: 1.86 GHz (asserted by call to hardware)
  analyzing CPU 6:
    current CPU frequency: 1.86 GHz (asserted by call to hardware)
  analyzing CPU 7:
    current CPU frequency: 1.86 GHz (asserted by call to hardware)
This is an oldish Intel(R) Core(TM) i7 CPU. I just tried it on an ODROID, and it seems to do the right thing there as well.


I don't have a answer to your frequency problem but I ran into similar issues with memory usage when trying to process LIDAR data in pandas and keras. I ended up buying an old HP quad xeon 4U with 320gb of DDR3 for $330 shipped. I used https://labgopher.com/ to find the server deals.


Thanks for sharing labgopher... would have made my own search much easier last year.


oh wow. thanks for the info. I saw some decent deals on newegg, but not that low

it's a big chunk of metal, but it would make life easier for some stuff


> When running with many cores is there some way to get the Linux kernel to run them at a fixed frequency?

just a wild thought: have you looked at 'cpupower frequency-set' ? it _might_ help.


You could disable Turbo and EIST in the BIOS - most will have a setting for these, and that will result in the cores running at nominal freq.

I don't know about a nice high-level linux interface, there may well be one, but if you need to access these settings from a running machine there are MSRs you can poke.


I think the smart buy (for most) is still the 32-core Threadripper. It is a lot less money, with much higher clocks, and is less likely to be RAM-throughput starved.


I've been experimenting with some ideas for new ML approaches (not neural networks). I was thinking about playing around with FPGAs, but the high-end ones are really expensive. 64 cores makes me think it's probably not worthwhile to focus on FPGAs, or even necessarily GPU programming, like I was thinking before.


With 64-lanes of PCIe on this Threadripper... why not all three?

Plenty of PCIe space to afford a few GPUs to experiment with your 64-core CPU. You probably can shove 2x GPUs + 1x FPGA into one box.


Right and it is possible that by doing something like that the power of the system would be multiplied by many times. It's really more about the programming model.


I'm not sure if that's a big deal with a single box.

Threadripper is only 300W+ under load; at idle, the whole system draws around 100W. (https://www.kitguru.net/components/cpu/luke-hill/amd-ryzen-t...)

GPUs are similar. GPUs take ~20W while idling, but 300W or 500W while under load. FPGAs are also similar.

-------

The total system idle of a Threadripper + 2x GPU + FPGA rig probably is under 200W.

If you happen to utilize the entire machine, sure, you'll be over 1000W. But you'll probably only utilize parts of the machine as you experiment and try to figure out the optimal solution.

A "mixed rig" that can handle a variety of programming techniques is probably what's needed in the research / development phase of algorithms. Once you've done enough research, you build dedicated boxes that optimize the ultimate solution.


By power I meant compute power. Did not mean power draw.


Ahh, gotcha. I guess I misunderstood.

Your earlier comment makes sense now that I understand what you're saying.


What are these new ML approaches?


16 cores per memory channel seems to be pushing it, if 47% cinebench improvement is any indication.


I want one of these solely to shave days off of waiting for FPGA synthesis. Even on a 3900X with a smaller hobby board, things can take hours.

Now, whether it's worth selling a body organ to kit one out...


How does Windows 10 licensing work for this number of cores?


Honest question: do people who need 64 cores use Windows? I assumed these types of workstations and workloads are 99% Linux based.

IIRC Windows's scheduler wasn't as good as Linux's at managing these kinds of parallel workloads.


Yes, we do :) Well, maybe not 64 cores.

In embedded/automotive, the majority of the tooling does not have a Linux version. Compiling is a bitch. Still, you're probably right about the 99% comment.


Maybe for rendering workloads one could make use of that many cores on a Windows PC.


Windows Server?


AFAIK, Windows isn’t limited by the number of cores, but the number of physical CPUs. Home and Pro allow 1(?) CPU, but Server you pay per CPU.

Although, I'm not sure how the 64 cores are presented to the OS. If they're presented as, say, two 32-core CPUs, Windows will complain.


Windows does "weird things" with high core counts, I don't remember the exact number but when I had a high cored machine before it presented as "2 cores" in the second numa zone and the remainder in the first.

like: zone1: 62 cores, zone2: 2 cores

the OS was also rather unstable.


Windows 7 [something] definitely allows for more than 1 CPU. I mistakenly configured KVM to show 10 CPUs with 2 cores each instead of 1 CPU with 10 cores with 2 threads each, and it definitely showed more than one CPU in the task manager (and less than ten).


I believe you need the windows 10 “pro for workstations” edition

Edit: according to this, the normal windows 10 pro edition would suffice. I feel like I’ve read conflicting things though, so take it with a grain of salt: https://answers.microsoft.com/en-us/windows/forum/windows_10...


Same as always? Win10's license isn't restricted by cores, that's usually the domain of their server OS's and software.


There used to be limitations on Windows regarding CPU count and maximum RAM size. With a consumer-level OS, you could not run more than two processors, for example. Given how earlier Ryzen acted like it was actually a bunch of NUMA nodes in the past, this could pose a problem (the normal version of Windows 7 wouldn't allow for a NUMA setup, you'd need Pro for that).

I don't think Windows 10 has this limitation anymore, but it's a very valid question. Given their weird calculation for the number of licenses you need to run Windows Server, it's probably good to be cautious about licensing when it comes to running Windows on high-performance chips like Threadripper.


> There used to be limitations on Windows regarding CPU count

There still is. And the Threadripper is just one CPU, so you should be ok.

With pro edition of Win 10 there should be support for up to 2 CPUs, with whatever amounts of cores they have.


Having a blast with a Ryzen 5 3600X. The only problem is that Win10 appears to have a bug with it: stutters all over. It only stopped when I reinstalled the newest chipset drivers and set the Ryzen Balanced power profile. Windows' default power profiles all stutter, the fans make too much noise, etc. Now the clock varies from ~3.8 to ~4.2 GHz; before it was fixed at 3.79.


Update, in case anyone is reading this: it was caused by the faulty "AMD SATA Controller" driver, which apparently doesn't work correctly with Win10. Just revert back to the default Microsoft SATA drivers and the problem is solved. I used a piece of software called "Multimon" to track it down.


Have you tried it in Linux? When the first Threadrippers came out, I remember seeing people having all sorts of problems with them in Windows 10, with the same machines running in Linux outputting double the performance.


Yeah, probably a Win10 driver issue. When I updated it, that fixed it. Before, I even got a BSOD because of it.


That was an issue on my (and other people's) super low budget Asrock motherboard. Fan profile in BIOS and/or the energy profile in Windows fixes it.


Yeah, mine is an MSI MPG X570 Gaming Plus, if I remember correctly. Probably a driver bug.


It sounds like your CPU fan isn't seated properly.


Well, doesn't sound like it. I've seen this problem reported by other people. At least it appears to be solved now.


When can I buy the new AMD Vega II without having to purchase a Mac Pro?


It's a bit excessive for the prosumer market, but these will make great, cheap home-lab machines.



