When I left Sun in 1995, their "biggest" iron was the Enterprise 10K (which internally was called "Dragon" because of the Xerox bus). A system with 64 cores and 256GB of RAM was just under 2.5 million dollars list. It needed over 10kW of power provided by a 60A 240V circuit; the power cord weighed in at like 20 lbs. I put together a new desktop with the TR3960 and 128GB of ECC RAM, and that motherboard will take the 3990 and 256GB of RAM if I choose to upgrade it. It really boggles my mind what you can fit under your desk these days with a single 120V outlet.
In 2001, the fastest supercomputer in the world was ASCI White. It cost $110M, weighed 106 tons, consumed 3MW of power (plus 3MW for cooling), and had a peak speed of 12.3 TFLOPS.
Right now, sitting under my desk is an RTX 2080 Ti GPU, which cost around $1000, weighs 3 pounds, draws a maximum of 250 watts, and has a peak speed of 13.4 TFLOPS [1].
We truly live in amazing times.
[1] Not quite a fair comparison: the GPU is using 32-bit floating-point, while ASCI White used 64-bit. But for many applications, the precision difference doesn't matter.
It's not fast enough. After having access to 160 TPUs, it's physically painful to use anything else.
I hope in 20 years I'll have the equivalent of 160 TPUs under my desk. Hopefully sooner.
The reason it's not fast enough is that ... there's so much you can do! People don't know. You can't really know until you have access to such a vast amount of horsepower, and can apply it to whatever you want. You might think "What could I possibly use it for?" but there are so many things.
The most important thing you can use it for is fun, and intellectual gratification. You can train ML models just to see what they do. And as AI Dungeon shows, sometimes you win the lottery.
I can't wait for the future. It's going to be so cool.
I looked into using some for solving PDEs, but Google had literally zero documentation on how to cross-compile C to TPUs, launch kernels, etc.
AFAICT, you either use TensorFlow or some other framework that supports them (and whose TPU code is not open source), or you can't use TPUs at all.
I use Tensorflow 1.15. The world has been steadily pushing for Tensorflow 2.0 or Jax, but I like the simplicity of the Session model. It's so simple you can explain it in one sentence: it's an object that runs commands. Tell the session to connect to the TPU, and it will run all those commands on the TPU.
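For anyone who hasn't touched TF1 in a while, here's a minimal sketch of that Session-as-command-runner idea (the TPU address below is made up; in practice you'd get it from TPUClusterResolver or the ctpu tooling, and whether ops land on the TPU cores or the worker's CPU depends on device placement):

    import tensorflow as tf  # TensorFlow 1.15

    tpu_address = 'grpc://10.240.1.2:8470'  # hypothetical TPU worker address

    graph = tf.Graph()
    with graph.as_default():
        init = tf.tpu.initialize_system()
        shutdown = tf.tpu.shutdown_system()
        x = tf.random.uniform([1024, 1024])
        y = tf.matmul(x, x)

    # The session is just an object that runs commands; pointing it at the
    # TPU worker's address makes the graph execute on that worker.
    with tf.Session(tpu_address, graph=graph) as sess:
        sess.run(init)
        print(sess.run(y).shape)
        sess.run(shutdown)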
Jax is new to me (and to everyone; they just released it). But it looks like Google is pouring some serious R&D into it.
Two things help a lot. One, Twitter. You can get a direct line to the people who actually make these beasts. Exploit it when you can. Like you, I dislike using a black box, and I'm intensely interested in the details of how to communicate with a TPU at a low level. I recently asked someone on the Jax team about it here: https://twitter.com/theshawwn/status/1213221594052599808
Two, TFRC support has been incredibly helpful. https://www.tensorflow.org/tfrc I don't know who they have working the support channels, but those guys and gals are some of the most helpful and cheerful people I've come across. I often asked them very technical questions and to my surprise, they followed up with an A+ response almost every time, usually the next day.
PyTorch is giving TF a real run for its money, and to be honest I once felt it was a mistake to invest so much time into TensorFlow. But it turned out to be a big advantage due to Google's investment in the overall ecosystem. TPUs are something that only Google has the resources to pull off.
Note that the traditional path towards "just get a TPU up and running and start playing with it" is to use one of their Colab notebooks on the topic. https://cloud.google.com/tpu/docs/colabs I've been implicitly steering you away from these because you seem (like me) to want to know more of the low-level details. Those notebooks are designed to let ML researchers get results quickly, not for hardware enthusiasts to exploit heavy metal. The jax notebooks felt much more satisfying in that regard.
Can you mention how much human dev time is involved?
We have a stupid-basic single-machine deep reinforcement self-play setup. It takes about 24 hrs to run a full experiment. The NN is the bottleneck. Using TensorFlow. Nothing fancy.
How much dev time for a good engineer (backend, kernel, multi-core experience) to get this down to, say, 1 hr?
Obviously a very general question. Thanks for any input.
> You can't really know until you have access to such a vast amount of horsepower, and can apply it to whatever you want.
Something I've often wondered (and there are probably good reasons why) is why billionaire tech moguls - even the ones who are outwardly technical, or were in the past, like Bill Gates, who we know had technical chops - have never (that I'm aware of) tried to build "their ultimate computer".
For instance, if I had their kind of money, I've often thought that I would construct a datacenter (or maybe multiple datacenters, networked together) filled with NVidia GPU/TPU/whatever hardware (the best of the best they could sell me) - purely for use as my "personal computer". Completely non-public, non-commercial - just a datacenter I would own with racks filled to the brim with the best computing tech I could stuff into them (on a side note, I've also pondered the idea of such a personal datacenter, but filled with D-Wave quantum computing machines or the like).
What could you do with such a system?
Obviously anything massive parallelism could be useful for - the usual simulation, machine learning, etc; but could you make any breakthroughs with it - assuming you had the knowledge to do such work?
Which is probably why none have done it - at least as a personal thing.
I mean, sure, I would bet that people who own large swathes of machines in a datacenter, or those who outright own datacenters (like Google or Amazon) - their founders and likely internal people do run massively parallel experiments or whatnot on a regular basis, ad-hoc, and "free" - but it's a commercial thing, and other stuff is also running on those machines...
But a single person is probably unlikely to have or think of problems that would require such a grand scale before they would just "start a company to do it" or something similar; because in the end, just to maintain and administer everything in such a datacenter, if one were built, would require (I would think) the resources of a large company.
Of course, then I wonder if such companies - especially ones like Google and Amazon, which own and run many datacenters around the world, and also sell their resources for compute purposes - weren't started in some fashion (even if only in the back of their founders' heads) with that idea or goal in mind (that is, to be able to own and use on a whim "the world's largest amount of computing power")...?
Paul Allen kinda did just that, although in a different direction. He built a datacenter and filled it with a bunch of old computers he thought were cool, like the DEC PDP-10. It's now the Living Computer Museum in Seattle.
I feel like "tech moguls" are the wrong type to expect this kind of interest out of. They got rich on either tools or workflows (i.e. CRUD), not intelligence/analytics/prediction. It's not the same mindset.
If anyone were to own a secret HPC cluster, it'd probably be a finance billionaire. Or the owner of a think-tank who made their money as a subcontractor for state intelligence agencies.
Or, as it currently stands, you can run the buggier, more resource-intensive equivalents of the software you used to run! Now featuring pervasive spyware that tracks and catalogues your every action! Wanted a permanent copy of the software you paid for? Too bad, it's only available as "A Service", which means you get constant changes you never asked for AND you get to pay for them on a recurring basis whether you like it or not!
Seriously though I feel like most of the gains in hardware have been wasted by shittier software both in terms of quality and in the way the software itself acts against the interests of its users.
A bit off topic, but I am looking at TPUs at the moment. Can I ask, for clarity, whether you mean TPUs are easier to use than GPUs or vice versa?
I thought TPUs are harder to work with because they only support TensorFlow, rather than the range of high-level frameworks plus low-level CUDA that GPUs support.
TPUs aren't necessarily easier to use – it's about the same – but they're powerful. I've documented some benchmarks in this tweet chain, where I trained GPT-2 1.5B to play chess using a technique called swarm training: https://twitter.com/theshawwn/status/1214013710173425665
The power comes from the fact that every TPU gives you 8 cores at your disposal. I never use the Estimator API. I just scope Tensorflow operations to specific TPU cores. Works great.
It also gives you flexibility. TPUv2-8 can apparently allocate up to 300GB (!) if you don't scope any operations to any cores. Meaning, you run it in a mode where you only get 1 core of performance, but you get 300GB of flexibility. And then you can connect multiple TPUs together as described in the tweet chain, which quickly makes up the difference.
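Very roughly, "scoping operations to specific TPU cores" looks something like the sketch below. This is illustrative only: the exact device strings depend on how you connect to the TPU, so treat '/device:TPU:n' as an assumption rather than gospel.

    import tensorflow as tf  # TensorFlow 1.15

    # A TPUv2-8 / TPUv3-8 exposes 8 cores; build one op per core.
    partial_sums = []
    for core in range(8):
        with tf.device('/device:TPU:%d' % core):  # illustrative device string
            x = tf.random.uniform([1024, 1024])
            partial_sums.append(tf.reduce_sum(tf.matmul(x, x)))

    # Combine the per-core results; placement of this op is left to TensorFlow.
    total = tf.add_n(partial_sums)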
There is also the question of cost savings. A TPUv3-8 seems about as expensive as a V100. Which one is worth it? Well, it depends. In my experience a GPU is easier to use and quicker to set up if you only need one GPU of horsepower. But suppose you wanted to train a massive model in 24 hours. What's your best option? For us, it was TPUs.
The reason is subtle: It's hard to find any single VM that can talk to 140 GPUs simultaneously. But you can talk to 140 TPUs from a single VM no problem. And since you get 800MB/s to and from the VM, you can average the parameters across all TPUs very quickly.
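Conceptually, the averaging step is nothing fancier than the sketch below (this is not the actual swarm-training code, just numpy on the driver VM, assuming each TPU has already sent its parameter arrays back):

    import numpy as np

    def average_params(per_tpu_params):
        """per_tpu_params: one list of parameter arrays per TPU, shapes matching."""
        # zip(*...) groups the i-th parameter from every TPU together,
        # then each group is averaged elementwise.
        return [np.mean(group, axis=0) for group in zip(*per_tpu_params)]

    # Toy example: 3 TPUs, each with 2 parameter tensors.
    params = [[np.full((2, 2), i, dtype=np.float32), np.full(5, i, dtype=np.float32)]
              for i in range(3)]
    averaged = average_params(params)  # each entry is the elementwise mean (here, 1.0)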
This is similar to what TPU pods do internally. And while TPU pods are impressive, they are also impressively expensive. A TPUv3 pod will run you $192/hr at evaluation prices. Whereas you can play with a TPUv3-8 for $2.50/hr. You can also play with a TPUv2-8 for free using Colab: https://github.com/shawwn/colab-tricks
I think a swarm of TPUs can cost significantly less than a cluster of V100s with less engineering effort.
That said, right now most codebases are designed to work with V100s. It will take time before TPUs widely proliferate. But speaking as someone who was once skeptical of TPUs and who has spent several months trying to discover their secrets, I feel that TPUs can get the job done quicker and easier than a GPU cluster. The hardware is also more accessible, since you can more easily spin up 100 TPUs than 100 V100s. But mainly I like that it's all coordinated from a single machine. It's conceptually simpler to debug and to implement.
If you run into any issues or have any trouble with TPUs, please feel free to ask here or DM me. I love talking about this stuff.
EDIT: In regards to usability, the new Jax library works with TPUs out of the box. Google seems to be heading in the direction of Jax. My initial reaction was "Not another library..." but first impressions were positive. It's not quite the React of ML – an idea which I hope to see soon – but it does seem easier for certain research purposes.
PyTorch also recently gained TPU support, and as far as I know they've put in some serious efforts to make sure things run quickly. As for how you use all 8 cores of a TPU using PyTorch, I haven't looked into it yet. But I'd be surprised if you couldn't. It seems unlikely that they would design an API that would hamstring you to just 1 out of 8 cores.
Thanks a lot. This is a lot of information to digest. I will check out your Twitter feed, and I think I need to start playing around with TPUs then. Scaling seems to work fantastically for you. I am working more in computer vision, and we sometimes run into weird bottlenecks with our GPUs where neither GPUs nor CPUs are under full load. Unfortunately, drilling down on where the bottlenecks come from is not easy at all. I am assuming the profiler from Tensorflow works with TPUs in the same way it does with GPUs?
Oooh, you're so lucky you get to work on those kinds of problems. I know sometimes it feels frustrating to hunt for bottlenecks, but man is it satisfying to find it.
We had a similar situation at one point. The problem turned out to be that our CPU wasn't generating input data fast enough. So the first step is to confirm that your input pipeline isn't the issue.
The next step would be to break down the problem: Can you extract the smallest part of the codebase into a separate program, and try to make that run under full load?
That's not the technique I used, though. To figure out the multicore stuff, the trick for me was to comment out almost all of the code, until you're left with only a small part that actually runs on the device. Ideally the smallest part.
Basically, change your code so that the model file returns tf.no_op() (or as close to that as possible while still letting your input pipeline run). You want to be in a situation where your training loop is doing an equivalent of while(true) { read_input(); } so that you can verify that your pipeline is able to peg your GPUs to 100% usage.
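A minimal sketch of that setup, with a stand-in tf.data pipeline (the filename and batch size here are placeholders; swap in your real input pipeline):

    import tensorflow as tf  # TF 1.x

    # Stand-in for your real input pipeline.
    dataset = tf.data.TFRecordDataset(['train-00000-of-01024.tfrecord'])  # placeholder file
    dataset = dataset.repeat().batch(256).prefetch(tf.data.experimental.AUTOTUNE)
    batch = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()

    train_op = tf.no_op()  # the "model" does nothing at all

    with tf.Session() as sess:
        while True:                        # while(true) { read_input(); }
            sess.run([batch, train_op])    # only the input pipeline is doing work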
If you get 100% usage, fantastic! That means you're left with an easy problem: start turning parts of the code back on until you find which part is reducing your performance. Then study that part to figure out why.
If you're not at 100% usage, you're either running into a fundamental limitation (which sometimes happens) or the pipeline isn't designed correctly in some way. I would compare it against other popular codebases such as StyleGAN 2 https://github.com/NVlabs/stylegan2 which is designed to use 8 V100s. The optimizer.py file is pretty insightful: https://github.com/NVlabs/stylegan2/blob/eecd09cc8a067e09e12...
Finally, my biggest tip would be to step back from the problem and think: is there something simple you can do to reframe the problem? When I find myself in a situation where I'm spending a lot of time and energy trying to get a certain thing to work, I can sometimes do X instead for 80% of the benefit. Try to find something like that in this case.
FWIW the TPU profiler was the first tool I reached for. I never got it working. The bag of tricks above ended up giving me effective results on a variety of codebases with no profiler. (A usage graph is pretty crucial, though, which Colab TPUs don't provide.)
So those are a bunch of general tips for solving weird bottlenecks blindfolded.
To answer your question directly:
> I am assuming the profiler from Tensorflow works with TPUs in the same way it does with GPUs?
Honestly, I can't say; as mentioned above, I never got the TPU profiler working. But yeah, if you give specifics (ideally a link to a codebase + dataset + script that runs it) then I can try to look for candidates for what might be the bottleneck.
«A single GPU card like the AMD Radeon MI60 has more computing power than year 2000 supercomputer ASCI Red (fastest supercomputer in the TOP500 list of June 2000):
Comparing GPU floating point performance with CPU floating point performance is comparing apples and oranges. GPUs may have higher raw FLOPS, but they have issues with workloads that aren't massively parallel or require branching.
That's true when you're comparing a GPU to a single CPU. But when you're comparing a GPU to an entire supercomputer the requirement that the workload has massive parallelism to use all available resources is present in both.
It's a little more complicated than that. The main problem is that the way GPUs are designed, their execution units share the same instruction pointer[1]. That's not an issue if you're multiplying matrices, but it's an issue anytime you have branches. Therefore, even when your workload is massively parallel, it doesn't necessarily mean that a GPU cluster would perform nearly as well as a CPU cluster with the same amount of FLOPS.
Also, the supercomputer likely had a substantial amount of solid-state and fast spinning storage, even back then. That's also often overlooked in these comparisons, not just the difference in precision.
A single 80mm NVMe SSD would likely be faster than a significant amount of that supercomputer's storage. In 2000 a million IOPS was a lofty target. Now we can do it on a single device.
ASCI Red had 1TB of DRAM and 12TB of disk. Not bad, but three NVMe drives would clobber it. Putting that much DRAM in a box is still expensive today, but entirely feasible for about $5k-10k.
When the university I went to in Budapest got a second-hand VAX cluster (also around '95) for the bargain-basement price of only 50,000 CHF, they needed to dig up the street to the nearest substation and have a new power line installed. http://hampage.hu/oldiron/vaxen/9000_4.jpg - the cabinet furthest away is the power supply. This photo is not even half of the cluster.
A few years before that, at another university, they put an ancient IBM mainframe in place with a crane, temporarily removing the roof of the building.
I recently "upgraded" from a desktop to a laptop. This marks the end of an era for me. I always had a relatively powerful desktop at home, mostly running 24/7 since I couldn't be bothered to wait for it to boot. This ThinkPad X1 is the first laptop I own which is apparently powerful enough to host all my work in a 1.09kg package that easily fits in my backpack... OK, I am a text mode user, so gfx isn't what keeps my computers busy. Low latency realtime audio synthesis much more so.
And I still remember when 33.6kbps were an exciting thing to have :-)
Nice to see that tech moves ahead.
Are you sure it was in 1995? I joined Sun in 1997, and the E10k was launched a bit after that.
According to Wikipedia it was launched in 1997, so that lines up. If I remember correctly, the system was bought from Cray after SGI bought the rest, so I didn't think they even had it at Sun in 1995.
Also, the original model of the E10k supported 64 GB RAM.
My company bought one of those around that time. It was used to run large EDA software jobs.
One beautiful weekend day, only two weeks after delivery, our system admin noticed that the machine had gone offline. Logging in remotely didn't work at all. No ping either.
He drove to work and ... the machine was gone.
Thieves had used a crane to lift it out of the building through a window onto a truck.
Sun told him that this wasn't the first time such a thing had happened, and that somewhere in the chain from order to delivery, an insider had tipped off the thieves about where to find the latest hardware.
Back then, smaller nation-states wanting to do nuclear device simulation and the like would be my guess. Basically, countries that were restricted in some manner on gaining large amounts of parallel processing compute power for such simulations.
I was thinking about this for house building: you go to a panel and the bots that make up the house reconfigure themselves to add a swimming pool, or an extra guest room, etc. Could be pretty awesome. Let's hope they are not used for evil :-/
OTOH we'd rather have seen the single-core perf keep improving. What's the average performance speedup (vs 1 core) that the Sun customers got, or the AMD users get, on the average software they use? The progress in programming language technology hasn't been very kind to multiprocessing[1].
[1] GPUs are another kettle of fish of course, but have their own well-known problems that prevent widespread use outside graphics
2x Samsung EVO 970 NvME M.2 1TB SSDs in RAID0 configuration.
Running in a Coolermaster Cosmos case with a stupidly big air cooler at the moment, to be replaced with a decent liquid cooler. (It works; the Cosmos is a huge case because it previously had to hold a hacked Supermicro dual-Xeon server board, since I really wanted ECC for my workstation.)
nVidia 1080Ti+ GPU.
The ECC is detected and claims it is working, although I've yet to see it correct an SBE. Then again, I haven't been running non-stop memory tests either.
I may end up removing the Linux partition since WSL2 works so well on this box.
I have a 3970X with ECC RAM and got error-correction notifications in my Linux logs when I tweaked the RAM timings too tight. Note that memtest86 doesn't know about ECC on Ryzen, so error corrections may happen without you being notified if you use that.
No, but all TRX40 motherboards should support ECC the same. Mine is a Gigabyte Aorus Pro WiFi, FWIW.
BTW, Century Micro has the only unbuffered ECC modules at 3200MHz native speed (at least that was the case on the 39x0X release day). I don't know if they can be sourced outside Japan, though.
Fun story: I actually botched my order and got 2666MHz ones... but on closer inspection, it turned out the chips on the modules were actually native 3200MHz ones, with the SPD EEPROM saying they are 2666MHz. So I ended up overclocking them to their actual native speed, and I tweaked the timings to be a little shorter than what the 3200MHz modules were advertised for.
7 years ago it was doable to build a similar desktop at a similar price from a refurbished server (4x 6-core Opteron CPUs, 128GB DDR2 ECC RAM). It drew about 1kW and was loud, but it was a great complement for performance testing.
The memory access performance on ThreadRipper (and current Ryzen/Epyc) is much better than anything prior to this generation for workstation loads. Not that the shared I/O controller is without cost, only that it tends to average out better in most cases where multiple cores across chips are in use together.
Just got my 3950X w/ 64GB RAM; not sure that I'd be able to practically use any more compute than this for what I play with, which is mostly multiple back-ends and some container orchestration for local dev, plus occasional video re-encodes (Blu-ray ripping for the NAS).
Some think $4k for this CPU is too much... but considering the sheer performance that you can get these days for under $10K, there's never been a better time to build or buy a computer. My only regret is wasting time and money on aRGB that I cannot configure in Linux.
The real funny thing is that for most people (not OP) it would still be used mainly for word processing, email, and occasionally casual gaming - and it would still be slow to run and boot.
The amount of processing power we each carry in our pockets (even the cheapest throw-away smart phones) would have been almost unthinkable 30 years ago; it's akin to the difference of an Altair of the 1970s vs what was available just 10-20 years prior. What took up a room now sat on a desk and could be purchased for the price of a car.
Now, what took up a room sits in your pocket, and could almost be given away in a box of cereal, it's so cheap.
Heck - think about what's available in the embedded computing realm for pennies (or just a few dollars in single quantities) - it's mind boggling to an extent.
I concur but it depresses me somehow. Some say that it's worth it, to me it's just the same cycle of marketing trying to disguise the things as progress.
The comments are mentioning the Xeons in Mac Pros and how Apple should switch. I have no factual basis for this, but I figure Apple has got to be using AMD's new chips as leverage to get some pretty sweet deals on Intel silicon.
Indeed it does. Which nicely explains the rising popularity of Hackintoshes, particularly among developers and other technologists. AMD Hackintoshes[1] in particular have skyrocketed in maturity and simplicity since Ryzen.
To be fair, the Hackintosh community is pretty persistent. They added opcode emulation into the kernel, for example, to handle running on CPUs without the expected instructions (older AMD CPUs back in 10.8 or so). I wouldn't be surprised to see the cat-and-mouse continue.
Indeed, though keep in mind that the current method for running macOS on AMD doesn't use this and instead relies on patching through Clover. The downside is that whilst the OS itself may run, any applications that use an opcode that isn't implemented will simply crash.
I'd been running a Hackintosh and an rMBP for my desktop and laptop respectively... this past year I've passed on both and am now running Linux for my personal desktop, and will be getting a new laptop within the next few months; some of the Ryzen Asus laptops coming soon are interesting.
Although still not without issues, my workflow has aligned so much with Linux, and it's finally reached a good enough point for my day-to-day use... not having to use the VM-based Mac or Windows Docker has been really nice (WSL isn't good enough IMHO).
I know you are being facetious; however, there are options which the market provides.
1) Other companies that are willing to sell you workstation and laptop computers that you can run other operating systems on such as one of the Linux variants, Windows, BSD etc. Nobody is forcing you to buy a computer from Apple.
2) There is a thriving second hand market of Apple machines just look at ebay, craigslist, gumtree etc.
If a new Apple machine isn't worth it to you, you are free to buy alternatives.
I am going to buy one of the newer Lenovo Thinkpads as I don't think the MacBook pro is worth it to replace my ageing Macbook Pro.
I believe he’s wrong and they are not meaningfully rising in popularity. Hackintoshes started as soon as the Intel transition, and they do exist (I’ve seen a couple personally, both built around the Leopard era when tools were mature and the desirability of iCloud/iMessage integration was lower.) Today there aren’t many people who need to push computing beyond the relatively affordable Mac Mini and iMac configurations. Most Hackintosh practitioners want Apple to release the “XMac,” a cheap and configurable desktop tower* and operate their Hackintosh in its stead.
It’s a respectable hobby, though, like iOS jailbreaking or emulation, and for persistent people it does let them run MacOS on more powerful hardware than they could afford.
Not sure a citation exists for such an assertion -- but I'm a decade deep and numerous production (profitable) iOS apps shipped without ever touching Apple hardware. MacOS is a joy but the hardware is consumer rot.
I’m sure if it was indeed popular, there would be some way of demonstrating that. Aside from the fact that I have never seen one in my entire career as a consultant, working with hundreds of organisations, the reason this sounds ridiculous to me is that Apple has a long history of making very little effort to obstruct the hackintosh community. Which suggests very strongly that the community is too small for Apple to bother with. There are a few topics on HN that seem to bring out people claiming that incredibly niche interests are actually very common and popular. Apple is one of them. So I don’t think it’s unreasonable to expect that somebody making such an incredible claim should have at least some way of substantiating it.
> I’m sure if it was indeed popular, there would be some way of demonstrating that.
Google Trends suggests that searches for 'Hackintosh' peaked around 2009 and have been steadily declining to around half that level since then. Searches for 'Ubuntu' dominate so much they make the Hackintosh graph look flat by comparison, but Hackintosh seems about as popular as 'Manjaro' (a Linux distribution) currently is, fwiw:
The only reason I'm using a Mac now is by pure good luck as all my PC components at the time (back in 2011-ish) were OSX Snow Leopard compatible, right down to Wi-Fi, bluetooth, motherboard, soundcard, etc. I admit it wasn't completely vanilla due to the infamous tonymacx86 software method but it did get me running quickly.
I gave OSX a test drive and found it much simpler than Windows. As I already had an iPhone/iPad it made sense to switch. Come upgrade time I bought a MacBook Pro and have been on OSX since.
The market price is the equilibrium of both supply and demand.
What you're saying is the fact people are buying their machines means they shouldn't change price or value proposition... because there are in fact purchases; which, is quite frankly, baffling, to me, a humble idiot.
Mercedes must think their EQC is positioned perfectly in the market with 55 sales? I now imagine Magic Leap will be leaping to raise prices with their next version?
> The market price is the equilibrium of both supply and demand.
Obviously.
> What you're saying is the fact people are buying their machines means they shouldn't change price or value proposition... because there are in fact purchases; which, is quite frankly, baffling, to me, a humble idiot.
If Apple are selling the machines in sufficient quantities at whatever they are priced at (I haven't cared to look) then obviously Apple's customers think they are worth it. It isn't really more complicated than that.
You assume without any facts or supporting evidence that they are achieving optimal sales. What you're saying is literally something you're just making up out of thin air. It's okay to be completely full of shit, just don't market it as truth.
Just doing a quick web search, the company made $224 billion in 2018. Do you honestly think they aren't achieving optimal sales? The proof is in the pudding and they have a very, very large pudding.
> It's okay to be completely full of shit, just don't market it as truth.
I know you think you are being big-brained, but not everything is "you must provide a citation". It is pretty obvious Apple knows the market well and knows exactly what they can and can't charge for certain products. You pretending otherwise because I haven't provided you with a citation is a complete joke; it is like asking someone to cite evidence that the sky is blue.
> "Obviously the way it is, is the way it is. Obviously".
In conclusion, you think no company should adjust their prices, ever? Because the current price is the market price which is obviously the right price because it's the market price?
It's circular reasoning, which can be applied to any sales situation - and if it explains everything, it explains nothing.
> In conclusion, you think no company should adjust their prices, ever?
Obviously not. I am saying they have no incentive to change the price if the sales are in line with or above what they would have forecast.
> Because the current price is the market price which is obviously the right price because it's the market price?
>
> It's circular reasoning, which can be applied to any sales situation - and if it explains everything, it explains nothing.
Again, you don't seem to understand basic market economics. Your product is only worth what people are willing to pay for it. There is the odd exception to the rule (Head and Shoulders shampoo being one of them: it is priced far higher than originally intended because people assumed it didn't work when it was cheap).
Generally if there are two or more companies producing product X (in this case Computer Workstations and Laptops) then the market will coalesce around a particular price point for a particular specification. Sure there are those that will always stick to a brand, but the vast number of consumers won't be loyal.
Whether or not the company makes a profit on each unit sold is irrelevant to its market price. If they price their product higher than their competitors people will look at the alternatives.
e.g. I bought a MacBook Pro in 2015 because Apple's machine was cheaper than Lenovo, Dell for the same spec and had a better screen than any of the competitors machines.
This really isn't complicated stuff. I think that personal bias seems to cloud people to some basic truths.
No. There is usually no such thing as _a_ market price. There's a distribution of prices for purchases of the same item (and that's when we ignore the cases of transactions involving more than just the transfer of money).
> the equilibrium of both supply and demand.
Supply and demand for specific products are more the _result_ of socio-economic processes and phenomena rather than their _causes_.
If you're talking to me about socio-economic processes when we're talking about something simple, it pretty clearly shows you're not very educated in economics.
If you like your charts and economic formalisms, and believe in "market prices", perhaps you should take the time to read the Candide-like "Production of commodities by means of commodities" by Piero Sraffa.
Despite their price, there was a time when MacBooks were only 5-10% more than the equivalent PC laptop. People who complained about their price were inevitably comparing them to bottom-of-the-barrel PC laptops, not higher-end business laptops with comparable specs.
I haven't priced out any recent macbooks to know if that's still true though. Glancing at the new 16" macbook pro, it seems like it might be reasonably priced for what you're getting.
There is so much that goes into a laptop that doesn't make it to the spec sheet either. People only look at a couple of specs to decide how much it should cost, but some laptop makers put everything into those specs and cheap out on everything else, and you end up with a laptop with a fast CPU but brittle plastic, a TN display, a DAC that hisses, and a whole bunch of other nastiness.
Yeah, the MBP 16” is pretty comparable to the Dell XPS in price - at least when comparing base models.
However, the costs go up a lot if you spec out a custom config (+$400 just for 32GB RAM). Then the MBP starts looking quite a bit more expensive. Overall I don’t think they’re a bad buy though if you want macOS.
>Yeah, the MBP 16” is pretty comparable to the Dell XPS in price - at least when comparing base models.
The Dell XPS [1] with a comparable spec costs $1650 compared to the MBP 16" at $2399. In the old days Apple would have priced it closer to $2199 or slightly lower.
Somewhere along the line they started giving the Mac the same margins as the iPhone.
That said, the Mac Pro, even starting at the base price, is pretty outrageous... I mean, I get $500 for the case and $1500 for the motherboard, but the rest just seems to be too much in aggregate, and even more ridiculous for the manufacturer upgrades out of the box.
>Despite their price, there was a time when MacBooks were only 5-10% more than the equivalent PC laptop.
I seriously doubt 5-10%. You are talking about a minimum $50-$100+ difference; that has never happened. The Mac has always been roughly 20-30% more expensive than a laptop with comparable specs. So for a $1000 comparable-spec laptop, Apple will sell you one for $1300 (but with more expensive upgrades).
The 30% has been fine for years: the quality and finish, as well as macOS, were well worth the price tag. But in recent years it hasn't been 30% at all.
As you lower the price, more people can and will buy your product, giving you (hypothetically) more profit than before. I'd like to think Apple has done all their homework about what price point to sell at to maximize profit, but honestly at this point I think they just make up whatever huge number they want for the Mac Pro price to make it seem cool and go with it.
When I was in high school in the early 2000s, I had a friend who told me his parents purchased a 386 when they first came on the market where I live, and they paid around $15k. My jaw just dropped. Then my friend just started chuckling at his parents' folly, since, looking back, even at that time it was such an expensive paperweight. Heck, thinking about it now, it probably came up in conversation because at the time I had a hobby of picking up old computers that people had thrown out and cobbling together the working parts; I built a 386 and a 486 that way. Good times.
Apple has been successfully using the Good, Better, Best three-tiered pricing model for quite some time. I remember buying a Powerbook 140 at the time when I really wanted the 170 but could not justify the increased price.
From Wikipedia:
"Intended as a replacement for the Portable, the 140 series was identical to the 170, though it compromised a number of the high-end model's features to make it a more affordable mid-range option. The most apparent difference was that the 140 used a cheaper, 10 in (25 cm) diagonal passive matrix display instead of the sharper active matrix version used on the 170. Internally, in addition to a slower 16 MHz processor, the 140 also lacked a Floating Point Unit (FPU) and could not be upgraded. It also came standard with a 20 MB hard drive compared with the 170's 40 MB drive."
The MBP isn't even a very good example of Apple price gouging. Any other laptop that you can get for less money is likely to have rather significant trade-offs.
My hunch is that the ARM transition is coming sooner rather than later and instead of spending time and effort into rewriting a bunch of OS functions (like AirPlay Mirroring) to support AMD processors that lack Intel-only features like QuickSync, Apple is just going to drop the iOS pieces it has written for ARM64 into the MacOS (or whatever it's called) that runs on their upcoming ARM-based desktops and laptops.
I mean, I wouldn't re-write code to support hardware-accelerated video transcoding on Ryzen+GPU if I knew that in 2-3 years I was moving from x86-64 to ARM64.
How well would Premiere or Photoshop run on ARM? How long would it take to rewrite these apps to run natively (i.e. with acceptable performance) on an ARM platform? Until you can do video/photo/audio editing faster with ARM using existing software products, I do not see it being viable as a replacement for x86 in any of Apple's higher-end products. Perhaps ARM is included for the sake of mobile development, but with 32+ modern x86 cores you could just as well emulate ARM and barely feel any overhead.
Sure, there are a lot of neat tricks ARM can do with special instructions and hardware accelerators in very well-controlled use cases. But for the average creative professional who doesn't have the time or patience to play with hyper-optimizing their workflow, having an x86 monster that can chew through any arbitrary workload (optimized or otherwise) is going to provide the best experience for the foreseeable future.
Apple has dragged Photoshop kicking and screaming onto different platforms and even OSes before. If they succeeded when they were nearly dead (the OS X Cocoa timeframe), they will have no problem doing it now.
Like sibling comment says, it's not exactly the first cpu transition for Photoshop...
> Photoshop 1
> Photoshop 1 (1990.01) requires a 8 MHz or faster Mac with a color screen and at least 2 MB of RAM. The first release of Photoshop was successful despite some bugs, which were fixed in subsequent updates. Most users ended up using version 1.07. Photoshop was marketed as a tool for the average user, which was reflected in the price ($1,000 compared to competitor Letraset’s ColorStudio, which cost $1,995).
> Photoshop 1.x requires Mac System 6.0.3, 2 MB of RAM, a 68000 processor, and a floppy drive.
They are rewriting all those apps for iPad OS anyway, which is where many of the professionals are moving. It is the same with Autodesk: their CEO said he doesn't know how well iPad sales are doing, but he clearly sees a trend of more pros moving to the iPad. They are everywhere in their industry.
(Unfortunately I can no longer find the link to the video.)
Which is not to say they will move to ARM. I am still skeptical of it.
> to support AMD processors that lack Intel-only features like QuickSync
QuickSync is just a hardware video encoder, which AMD has as well; it is called VCN [1]. Not to mention Apple hasn't been using QuickSync for as long as they have been shipping the T2, where Apple uses their own video encoder within the T2. (The T2 is just a rebadged A10.)
And 100 comments, but not a single mention or suggestion as to how Apple would deal with their vested interest in Thunderbolt. I would not be surprised if 90% of the PCs shipped with Thunderbolt were from Apple.
USB 4 is out, the Thunderbolt 3 spec has been out for quite a long time as well, and yet we don't even have a single announcement of a USB 4 controller.
Unfortunately, Threadripper only ("only"..) supports 256GB of RAM, while Mac Pros can go up to 1.5TB.
So they definitely couldn't switch completely, and supporting both would be expensive for apple due to doubling mobos + testing + drivers etc, and confusing for the consumer because of the differing max RAM capabilities.
Can someone explain why server variants always have lower base clocks? In particular, I'm interested in whether consumer higher-clocked variants are less reliable for long-term 24-hour full-load use. It has to be something like that, and not just power consumption considerations.
AFAIK, you can overclock the Server variants if you have sufficient cooling, not sure on binning or the extra memory bandwidth in terms of CPU overclocking overhead... but there is room there.
If I'm not mistaken, most of these threadripper systems seem to come with a water cooling system that I'd bet isn't present in server systems. That's my guess.
It might also be that "consumer" workloads will run the cores at the high speed more infrequently than a server which might be running full tilt 24/7. Just a thought
AMD (currently) only has server and consumer CPU lines. So since Threadripper isn't a server product, it falls into the consumer line even though it will mostly be used by professionals and extreme enthusiasts with deep pockets.
That's because we're buying server racks for virtualized workstations, not deskside systems. Even from Tier 1 vendors, a viable deskside workstation for pro VFX doesn't approach $35K without shoving in dual-socket 24-core+ Intel chips that have no realistic purpose being in a workstation unless you're buying for specificity.
I'd love to know more about this, because I think it is the future for everyone. What VDI environment are you using, what runs on your desktop vs run on the racks? Are the racks shared or do you have dedicated hardware provisioned to you? Do you use more than one backend rack system at a time?
It's definitely not for gamers, even silly "I have to have the best thing" gamers. Games are considered "lightly threaded" in these sorts of conversations, so you're looking for max boost clock / IPC.
However, lots of the YouTube influencer ruling class will buy it, in part because of what you said ("those who can afford it..").
The real legitimate consumer base for this are people whose work productivity is held back by compute loads that are embarrassingly parallel. If you spend a lot of your time waiting for a (well threaded) compiler to finish, or blender to render something out, or whatever.
Twitch / "Influencer" video gamers aren't using Premiere for videos.
Twitch streamers need a "streaming" solution. They play live, and instantly react to the crowd. If someone pays for an emote or something, the Twitch-streamer is expected to look on camera and say thank you to the donor (and maybe repeat the message that the donor paid for).
This means that a Twitch streamer's computer MUST encode the gameplay live. Traditionally, Twitch streamers would buy two computers, one to play video games, and a 2nd computer to process the video stream and upload it to Twitch.
With the advent of 16+ core computers, Twitch streamers have begun to simply buy one computer, lock 8 cores to the video game, and then lock 8 cores to the Twitch encoder.
Presumably, something like a Threadripper (24+ cores) could process the video stream for better quality and lower bandwidth. Maybe live VP9 encoding, for example (8-cores for the video game, 16-cores or more for the encoder)
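For what it's worth, the core-locking itself is mundane; here's a hypothetical sketch using psutil (the PIDs are placeholders, and psutil is an assumed third-party dependency):

    import psutil  # pip install psutil

    game_pid = 4242     # hypothetical PID of the game process
    encoder_pid = 4343  # hypothetical PID of the encoder (e.g. OBS) process

    # Pin the game to cores 0-7 and the encoder to cores 8-15.
    psutil.Process(game_pid).cpu_affinity(list(range(0, 8)))
    psutil.Process(encoder_pid).cpu_affinity(list(range(8, 16)))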
Consoles have 8 cores and are about to have 8 cores / 16 threads. Multi-core utilization on Android is essential to running at lower clock speeds and not thermal throttling. Engines and graphics APIs are catching up, and there are fewer single-thread bottlenecks than there used to be.
Nobody wants a 64 core CPU for games. Single core performance is the most important factor by far since most games aren't really optimized for parallel computing.
I help people build PCs sometimes, and people's first pass at picking components almost always overspends on the CPU and underspends on the GPU. For any fixed budget you will usually get (much) better perf by getting a low-to-mid-range Ryzen 3/5 or i3/i5 CPU and spending the difference on a better graphics card.
Most games are not particularly CPU intensive (although there are exceptions like Ashes of the Singularity, which actually does usually get bottlenecked by the CPU unless you have a really high-end one).
Not really. You run the game loop on one core and physics simulation / sounds / world streaming on a handful of other ones. It's more about the maximum single core performance which is not that different on a $150 Ryzen. Going to the most expensive CPU on the market probably only gets you 5 FPS more compared to the top end GPUs that scale rendering in parallel like a dream and can give you 100+ more FPS.
AMD has started upping the thread counts with the Ryzen processors only. And now that Ryzen has started being adopted by gamers and streamers, newer games are getting better at utilising these cores. With current-generation games the extra threads won't help, but with newer titles they will.
Because "functional decomposition", a method of threading where you allocate threads per function like physics or rendering, fell out of fashion. Modern game engines instead use a task system where tasks are spun off a main thread and asynchronously computed.
They sure are a thing, because programmers. For example, for a stutter-free experience RDR2 needed 6 CPU cores until a patch one month ago, thanks to bad console-first optimizations.
I just upgraded my CPU and kept my GPU, an older Nvidia 960 GTX. My CPU was a Q6600 and now I have a Ryzen 2700X. CPU matters so much it's amazing. Some games are heavy on the GPU and some are heavy on the CPU. Depends on what kind of things they have going on.
I haven't had any experience with cheap current-generation CPUs. Have you any experience with ones such as the Intel Celerons that are about $30 on Amazon?
The original Apple II sold at an adjusted 2019 price of $5,476, and it sold millions. I would say there are plenty of consumers with that much buying power. Of course, that doesn't mean most consumers need that many cores - but that's still true even if it cost $100.
I'd be interested in that comparison. A fridge throws off quite a bit of heat when cooling. Luckily they're closed the vast majority of the time and are well insulated.
I see a fridge. Miele -> real fridge. Thermador -> also a real fridge. Difference in value not reflected in either functionality or efficiency, long term negative effect on pocketbook for non-income generating asset -> wasted money.
So wouldn't this be like 2.6 TFLOPS? I'm wondering if this can replace NVidia V100s to train something like ImageNet purely on CPU. However, the V100 has 100 TFLOPS, which seems 50x more than the 3990X. Perhaps I'm reading the specs wrong?
PS: Although FLOPS is not a good way to measure this stuff, it's a good indication of the possible upper bound for deep-learning-related computation.
DDR4 has a bandwidth of about 25 Gigabytes per second. The memory on a V100 does about 900 Gigabytes per second. Cerebras has 9.6 Petabytes per second of memory bandwidth. For stochastic gradient descent, which typically requires high-frequency read/writes, memory bandwidth is crucial. For ImageNet, you're trying to run well over 1TB of pixels through the processing device as quickly as possible while the processor uses a few gigabytes of scratch space.
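To make that gap concrete, here's a quick back-of-the-envelope using the numbers above (these are the quoted figures, not measurements, and the ~1TB-per-epoch size is an assumption):

    epoch_bytes = 1e12  # assume ~1 TB of pixels per epoch

    bandwidths_gb_per_s = {
        'single DDR4 channel': 25,
        'V100 HBM2': 900,
        'Cerebras (as claimed)': 9.6e6,  # 9.6 PB/s, disputed below
    }

    for name, gb_per_s in bandwidths_gb_per_s.items():
        seconds = epoch_bytes / (gb_per_s * 1e9)
        print('%s: %.3f s of pure memory traffic per epoch' % (name, seconds))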
DDR4 has a bandwidth of about 25GBps per channel. You can hit around 100-200GBps on Epyc processors if you're utilizing RAM efficiently. GPUs tend to enforce programming models that ensure more sequential accesses, but a CPU can do it too.
These stats are true, but the CPU's biggest advantage is L1, L2, and L3 cache.
In particular, the 3990X, the 64-core Threadripper, will have 256MB of aggregate L3 cache and 512kB of L2 cache per core (32MB of aggregate L2 cache). Highly optimized kernels may fit large portions of data within L3 cache and rarely even touch DDR4!
Note: each L3 cache is only 16MB shared between 4 cores. It will take some tricky programming to split a model into 16MB chunks, but if it can be done, Threadripper would be crazy fast.
True, GPUs have really fat VRAM to work with, but CPUs have really fat L3 cache and L2 cache to work with. And the CPU caches are coherent too, simplifying atomic code. GPUs do have "shared memory" and L2 caches, but they're far smaller than CPU-caches.
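As a quick sanity check on those cache sizes (pure arithmetic, assuming fp32 parameters and the figures quoted above):

    bytes_per_param = 4          # fp32
    l3_per_ccx = 16 * 2**20      # 16 MB of L3 shared by a 4-core CCX
    l3_total = 256 * 2**20       # 256 MB aggregate L3 on the 3990X

    print('params per 16MB chunk: %d' % (l3_per_ccx // bytes_per_param))   # ~4.2 million
    print('params in all of L3:   %d' % (l3_total // bytes_per_param))     # ~67 million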
More accurately, Cerebras has 9.6 "bullshitobytes" per second. If you can't verify this, it doesn't exist. You could claim insane "bandwidth" by considering your register file to be your "memory". But that doesn't make it so.
Nah, you’re probably not reading the specs wrong. GPU-type devices severely outclass CPUs in raw compute power. This has been true for years, and it’s why deep learning depends on them. But CPUs and GPUs fulfill very different niches computationally; GPUs are incredibly parallel but aren’t well-suited for serial tasks or tasks that require unpredictable branching/looping, which is pretty much exactly what CPUs are good at.
CPUs do general computing. They're super flexible, but if you have a specific workload you might be able to use a different piece of silicon to get more performance.
GPUs do more parallelized computing, but they don't do as many different operations. They're really good at doing small, not-super-complex operations fast and massively parallelized (like updating an array of pixels on a screen, for example).
TPUs are even more parallelized, but the operations they do are even more specific and often simpler than the operations GPUs do.
That's not really why they're slower. CPUs are significantly more complicated. Things like branch prediction, transactions, sophisticated prefetching all take up a lot of silicon.
The more specific you get with the circuit, the less flexible it is and the more bandwidth you get at a specific task.
The tradeoff is flexibility for application specific performance. CPUs can do hella stuff but they can’t do a specific thing faster than specialty hardware.
GPUs are surprisingly flexible. GPUs are a full-on Turing machine.
The thing is, GPUs have horrible latency characteristics compared to CPUs. Whenever a GPU "has to wait" for RAM, it only switches to another thread. In contrast, CPUs will search your thread for out-of-order work, speculative work, and even prefetch memory ("guessing" what memory needs to be fetched) to help speed up the thread.
--------
Consider speculative execution. Let's say there is a 50% chance that an if-statement is actually executed. Should your hardware execute the if-statement speculatively?
Since CPUs are latency optimized, of course CPUs should speculate.
GPUs however, are bandwidth optimized. Instead of speculating on the if-statement, the GPU will task switch and operate on another thread. GPUs have 8x to 10x SMT, many many threads waiting to be run.
As such, GPUs would rather "make progress on another thread" rather than speculate to make a particular thread faster.
---------
What problems can be represented in terms of a ton-of-threads ? Well, many simple image processing algorithms operate on 1920 x 1080 pixel entries, which immediately provides 2,073,600 pixels... or ~2-million items that often can be processed in parallel.
When you have ~2-million items of work (aka: "CUDA Threads") waiting, the GPU is the superior architecture. Its better to make progress on "waiting threads" than to execute speculatively.
But if you're a CPU with latency-optimized characteristics, programmers would rather have that if-statement speculated. The 50% chance of saving latency is worth more to a CPU programmer.
Remember that these millions of items must actually be able to do work independently, i.e. you need very few (or regular and localized) dependencies between the data processed by each thread, so that threads can just wait on things like memory accesses rather than waiting on _other threads_.
Hmm, I think I see what you're trying to say, but maybe more precise language would be better here.
GPU cores have extremely efficient thread-barrier instructions. NVidia PTX has "barrier", while AMD has "S_BARRIER". Both of which allow the ~256 threads of a workgroup to efficiently wait for each other.
-------
The other aspect is that "waiting on memory" (at least, waiting on L2 memory) is globally synchronized on both AMD and NVidia systems. Waiting for an L2 atomic operation to complete IS a synchronization event, because L2 cache has a total memory ordering on both AMD and NVidia platforms.
Tying L2 cache to higher levels allows for memory coherence with the host CPU, or other GPUs even. That is to say: "dependencies" are often turned into memory-sync / memory-barrier events at the lowest level.
Synchronizing threads is one-and-the-same as waiting on memory. (Specifically: creating a load-and-store ordering that all cores can agree upon).
---------
I think what you're trying to say is that dependency chains must be short on GPUs, and that there are many parallel-threads of dependency chains to execute. In these circumstances, an algorithm can run efficiently.
If you have an algorithm that is an explicit dependency chain from beginning to end (ironically: Ethereum hashing satisfies this constraint. You work on the same hash for millions of iterations...), then that particular dependency chain cannot be cut or parallelized.
But Ethereum Hashing is still parallel, because there are trillions of guesses that can all work in parallel. So while its impossible to parallelize a singular ETH Hash... you can run many ETH Hashes in parallel with each other.
It is reading the specs wrong, if you want apples to apples. OP is using the tensor-core numbers, which are half precision and only for matrix multiplies. Operations that don't fit that will use the standard fp32/16 performance of the chip, which is around 13 TFLOPS: still higher than 2, but nowhere near 100.
64 cores * 2.9 GHz * 8 single-precision lanes * 2 issue * 2 (FMA) = 5.9 TF. This compares with 14 TF for V100 (costs more and needs a host). The 100 TOPS for V100 refers to reduced precision (which may or may not be useful in a given ML training or prediction scenario). The V100 has much (~9x) higher memory bandwidth, but also higher latency.
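Spelled out as code, that estimate is (the 2.9 GHz all-core clock is the assumption baked into the figure above, not a measured value):

    cores = 64
    clock_hz = 2.9e9   # assumed all-core clock
    sp_lanes = 8       # 256-bit AVX2 = 8 single-precision lanes
    issue = 2          # two FMA pipes per core
    flops_per_fma = 2  # each fused multiply-add counts as 2 FLOPs

    tflops = cores * clock_hz * sp_lanes * issue * flops_per_fma / 1e12
    print('%.1f TFLOPS single precision' % tflops)  # ~5.9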
* Radeon VII GPU is 14.2 TFlops for $600 right now.
* NVidia RTX 2070 Super is 9 TFlops for $500
True, Radeon VII and RTX 2070 are "consumer" GPUs... but Threadripper is similarly a "consumer" CPU and commands a lower price as a result.
"Enterprise" products cost more. EPYC costs more than Threadripper, V100 costs more than RTX 2070 Super. If you're aiming at maximum performance at minimum price, you use consumer hardware.
Similarly, Threadripper loses on RDIMMs and LRDIMMs, and has 1/2 the memory channels and 1/2 the PCIe lanes. Most people don't need that either.
Of course consumer products have fewer features than enterprise products. The chip manufacturers need to leave some features for "enterprise". The general idea is to extract more wealth from the people who can afford it, while providing consumers the features they care about at a lower cost.
-----
Really, the feature enterprise GPUs need seems to be SR-IOV, or other PCIe-splitting technologies. This allows a singular GPU to be split over many VMs.
Double-precision floats are niche to the scientific fields, which are also "enterprise", but I don't think most enterprise customers use double-precision floats.
True story, AMD's original price was different. I suggested this price in a pre-briefing the night before. I got an email at 4am of the announcement to say it had been changed.
Can someone clarify what exactly is cut here compared to the EPYC 7742, other than PCIe lanes? I don't quite get how AMD wants to avoid competing with their own product.
The EPYC chips have twice the memory bandwidth as well, and allow for dual-socket systems. I don't believe the enterprise security and management features are in their HEDT counterparts.
Chips like these are for a specialty market, for those who are using applications and workloads that can actually take advantage of all those cores and aren't running a personal datacenter. You're not going to see this offered in any of the rack/blade systems offered by the likes of Dell, HPE, SuperMicro, Lenovo, etc where organizations are actually going to be purchasing EPYC chips.
I haven’t seen any reviews taking into account the price of unregistered DDR4 RAM at the densities required to use these things. AMD is suggesting 2GB/core or 128GB, but these HEDT CPUs cannot support the typical registered RAM used by server memory kits and require unregistered desktop RAM (for market segmentation, you see) which can be (depending on SKU) quite a bit more expensive. I feel any price comparisons with Epyc or Xeon need to take that into account.
This is a bit tangential but maybe someone could give me a some advice.
When running with many cores is there some way to get the Linux kernel to run them at a fixed frequency?
I've been doing multicore work lately on AWS - which works pretty well but you don't get access to event counters so sometimes I'm having trouble zeroing in on what the performance bottleneck is. At higher core counts I get weird results and I can never tell what's really going on. Running locally I have concerns about random benchmark noise like thermal throttling, "turbo" and other OS/hardware surprises (I know on modern chips single thread stuff can run really different from when you load all the cores). I've been thinking of getting something a bit dumber like an old 16 core Xeon (I'm on a bit of a budget) and clocking it down - or maybe there is some better solution?
seldon ~ # cpupower frequency-set -f 1.86GHz
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
seldon ~ # cpupower --cpu all frequency-info | grep -E '^analyzing CPU.*|current CPU frequency'
analyzing CPU 0:
current CPU frequency: 1.86 GHz (asserted by call to hardware)
analyzing CPU 1:
current CPU frequency: 1.86 GHz (asserted by call to hardware)
analyzing CPU 2:
current CPU frequency: 1.86 GHz (asserted by call to hardware)
analyzing CPU 3:
current CPU frequency: 1.86 GHz (asserted by call to hardware)
analyzing CPU 4:
current CPU frequency: 1.86 GHz (asserted by call to hardware)
analyzing CPU 5:
current CPU frequency: 1.86 GHz (asserted by call to hardware)
analyzing CPU 6:
current CPU frequency: 1.86 GHz (asserted by call to hardware)
analyzing CPU 7:
current CPU frequency: 1.86 GHz (asserted by call to hardware)
This is an oldish Intel(R) Core(TM) i7 CPU. I just tried it on an ODROID, and it seems to do the right thing there as well.
I don't have an answer to your frequency problem, but I ran into similar issues with memory usage when trying to process LIDAR data in pandas and keras. I ended up buying an old HP quad-Xeon 4U with 320GB of DDR3 for $330 shipped. I used https://labgopher.com/ to find the server deals.
You could disable Turbo and EIST in the BIOS - most will have a setting for these, and that will result in the cores running at nominal freq.
I don't know about a nice high-level linux interface, there may well be one, but if you need to access these settings from a running machine there are MSRs you can poke.
I think the smart buy (for most) is still the 32-core Threadripper. It is a lot less money, with much higher clocks, and is less likely to be RAM-throughput starved.
I've been experimenting with some ideas for new ML approaches (not neural networks). I was thinking about playing around with FPGAs, but the high-end ones are really expensive. 64 cores is making me think it probably is not worthwhile to focus on FPGAs, or even necessarily GPU programming like I was thinking before.
Right and it is possible that by doing something like that the power of the system would be multiplied by many times. It's really more about the programming model.
GPUs are similar. GPUs take ~20W while idling, but 300W or 500W while under load. FPGAs are also similar.
-------
The total system idle of a Threadripper + 2x GPU + FPGA rig probably is under 200W.
If you happen to utilize the entire machine, sure, you'll be over 1000W. But you'll probably only utilize parts of the machine as you experiment and try to figure out the optimal solution.
A "mixed rig" that can handle a variety of programming techniques is probably what's needed in the research / development phase of algorithms. Once you've done enough research, you build dedicated boxes that optimize the ultimate solution.
In embedded/automotive, the majority of the tooling does not have a Linux version. Compiling is a bitch. Still, you're probably right about the 99% comment.
Windows does "weird things" with high core counts. I don't remember the exact number, but when I had a high-core-count machine before, it presented as "2 cores" in the second NUMA zone and the remainder in the first.
Windows 7 [something] definitely allows for more than 1 CPU. I mistakenly configured KVM to show 10 CPUs with 2 cores each instead of 1 CPU with 10 cores with 2 threads each, and it definitely showed more than one CPU in the task manager (and less than ten).
There used to be limitations on Windows regarding CPU count and maximum RAM size. With a consumer-level OS, you could not run more than two processors, for example. Given how earlier Ryzen acted like it was actually a bunch of NUMA nodes in the past, this could pose a problem (the normal version of Windows 7 wouldn't allow for a NUMA setup, you'd need Pro for that).
I don't think Windows 10 has this limitation anymore, but it's a very valid question. Given their weird calculation for the number of licenses you need to run Windows Server, it's probably good to be cautious about licensing when it comes to running Windows on high-performance chips like Threadripper.
Having a blast with the Ryzen 5 3600X. Only problem is that Win10 appears to have a bug with it. Stutters all over. It only stopped when I reinstalled the newest chipset drivers and set the Ryzen Balanced energy profile. Windows' default energy profiles all stutter, the fans make too much noise, etc. Now the clock varies from ~3.8 to ~4.2; before, it was fixed at 3.79.
Updating this in case anyone is reading: it was caused by faulty SATA drivers ("AMD SATA Controller"). Apparently it doesn't work correctly with Win10. Just revert back to the default Microsoft SATA drivers and the problem is solved. I used a piece of software called "Multimon" to track it down;
Have you tried it in Linux? When the first Threadrippers came out, I remember seeing people having all sorts of problems with them in Windows 10, with the same machines running in Linux outputting double the performance.