
> It's worth pointing out how extremely far ahead Apple seems to be in terms of CPU power...

I agree that Apple's ARM CPUs are very competitive on scalar instruction throughput and memory latency/bandwidth. However, x86/x64 CPUs have vector instructions up to 512 bits wide, and many programs use vector instructions somewhere deep down in the stack. I guess that the first generation of Apple ARM64 CPUs will offer only ARM NEON vector instructions, which are 128 bits wide and honestly a little pathetic at this point in time. On the other hand, I am very excited about this new competition for x86 CPUs, and I will for sure buy one of these new Macs in order to optimize my software for ARM64.
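As a rough illustration (my own sketch, not from the thread, tail handling omitted): the same array-add loop processes 4 floats per instruction with NEON but 16 with AVX-512, assuming the code is built for the matching architecture.

  #include <stddef.h>
  #if defined(__aarch64__)           /* NEON: 128-bit registers, 4 float lanes */
  #include <arm_neon.h>
  void add_arrays(float *dst, const float *a, const float *b, size_t n) {
      for (size_t i = 0; i + 4 <= n; i += 4)
          vst1q_f32(dst + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
  }
  #elif defined(__AVX512F__)         /* AVX-512: 512-bit registers, 16 float lanes */
  #include <immintrin.h>
  void add_arrays(float *dst, const float *a, const float *b, size_t n) {
      for (size_t i = 0; i + 16 <= n; i += 16)
          _mm512_storeu_ps(dst + i,
              _mm512_add_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i)));
  }
  #endif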




Also, wide vector instructions don't fare that well on laptops: they get thermally throttled, which makes them less useful. https://amp.reddit.com/r/hardware/comments/6mt6nx/why_does_s...


I am more than a little naive on the subject, but is it possible that the vector instructions could be farmed out to a co-processor dedicated to that kind of workload? I suspect that the rich instruction set leads to a higher transistor count and density (is that true?) and thus a higher TDP?

Would love to learn more from sources if people might provide a newb an intro.


The vector instructions can't really be farmed out because they can be scattered inline with regular scalar code. A memcpy of a small to medium-sized struct might be compiled into a handful of 128-bit moves, for example, with the code operating on that copied struct immediately afterwards. If you were to offload that to a different processor, waiting on that work to finish would stall the entire pipeline.
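To make that concrete, here is a hedged sketch (the struct and function names are made up for illustration): the memcpy below will typically be lowered to a few inline 128-bit loads and stores, and the very next scalar instruction consumes the result, so there is no clean point at which to hand the copy off to another processor.

  #include <string.h>

  struct Packet { long id; double values[4]; };   /* hypothetical 40-byte struct */

  double first_value(const struct Packet *src) {
      struct Packet local;
      memcpy(&local, src, sizeof local);  /* typically a handful of 128-bit moves */
      return local.values[0] * 2.0;       /* scalar code uses the copy immediately */
  }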


Could the compiler create a binary that had those instructions running on multiple processors? I see now I have some googling/reading to do about how you even use multiple processors (not cores) in a program.


That's what we call the magic impossible holy grail parallelizing compiler.


Good to know before I run off looking for the answer :)


The technological knowledge to do this is years and years away.


> The vector instructions can't really be farmed out because they can be scattered inline with regular scalar code.

If you believe this, you won't believe what's in this box[1].

[1]: https://www.sonnettech.com/product/egfx-breakaway-puck.html

> A memcpy of a small to medium-sized struct might be compiled into a handful of 128-bit moves, for example, with the code operating on that copied struct immediately afterwards

I'm not sure that's true: rep movs is pretty fast these days.


> If you believe this, you won't believe what's in this box[1].

There's a fundamental difference between GPU code and vector CPU instructions, though. GPU shader instructions aren't interwoven with the CPU instructions.

Yes, if you restrict yourself to not arbitrarily mixing vector code with non-vector code, you can put the vector code on a dedicated processor (the GPU in this case). The GP's point was exactly that the absence of this restriction is what prevents efficiently farming it off to a coprocessor.


> I'm not sure that's true: rep movs is pretty fast these days.

That's only true if you target Skylake and newer. If you target generic x86_64, compilers will only emit rep movs for long copies, because some CPUs have a high baseline cost for it. There's some linker magic that might get you an optimized version when you callq memcpy, but that doesn't help with inlined copies.
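If you want to check this yourself, a tiny sketch like the one below (function name is mine) dropped into a compiler explorer shows the difference: built for a generic x86_64 target you will usually see a call to memcpy or unrolled SIMD moves, while a newer -march target may inline it as rep movsb. The exact output depends on the compiler and version.

  #include <string.h>

  /* Fixed-size 4 KiB copy: large enough that the lowering strategy matters. */
  void copy_block(char *dst, const char *src) {
      memcpy(dst, src, 4096);
  }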


I think people with computers more than five years old already know that their computer is slow.

Why exactly do you think seven years old is too old, but five years old isn't?


That is irrelevant. The default target of compilers is some conservative minimum profile. Any binary you download is compiled for wide compatibility, not to run on your computer only.


That’s different. Rendering happens entirely on the GPU, so the only data transfer is a one-way DMA stream containing scene primitives and instructions.


There's absolutely no reason it _has_ to be one-way: It's not like the CPU intrinsically speaks x86_64 or is directly attached to memory anyway. When inventing a new ISA we can do anything.

And if we're talking about memcpy over (small) ranges that are likely still in L1 you're definitely not going to notice the difference.


By definition a co-processor won't share the L1 cache with another processor.


Exactly.


Then you will face the same problems that GPUs suffer from. Extremely high latency and constrained memory bandwidth. Sending an array with 100 elements to the GPU is rarely worth it. However, processing that array with vector instructions on the CPU is going to give you exactly the speedup you need because you can trivially mix and match scalar and vector instructions. I personally dislike GPU programming because GPUs are simply not flexible enough. Either it runs on a GPU or it doesn't. ML runs well on GPUs because graphics and ML both process big matrices. It's not like someone had an epiphany and somehow made a GPU incompatible algorithm run on a GPU (say deserializing JSON objects). They were a perfect match from the beginning.
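For example, a 100-element scale like the sketch below (an illustrative example, not from this thread) mixes AVX2 and scalar code within a handful of lines; pushing the same work through a GPU would be dominated by transfer latency.

  #include <immintrin.h>
  #include <stddef.h>

  void scale(float *data, size_t n, float factor) {
      __m256 f = _mm256_set1_ps(factor);
      size_t i = 0;
      for (; i + 8 <= n; i += 8)            /* vector part: 8 floats per iteration */
          _mm256_storeu_ps(data + i,
                           _mm256_mul_ps(_mm256_loadu_ps(data + i), f));
      for (; i < n; ++i)                    /* scalar tail for the leftover elements */
          data[i] *= factor;
  }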


This is not an area of expertise for me, so is there a reason to not offload vector processing to the GPU and devote the CPU silicon to what it's good at, which is scalar instructions?


There are many reasons. The latency of getting data back and forth to the GPU is a pretty high threshold to cross before you even see benefits, and many tasks are still CPU bound because they have data dependencies and logic that benefit from good branch prediction and deep pipelines.

Many high-compute tasks are CPU bound. GPUs are only good for lots of dumb math that doesn't change a lot. It turns out that only applies to a small set of problems, so you need to put in a lot of effort to turn your problem into lots of dumb math instead of a little bit of smart math, and to justify the penalty for leaving L1.


Yes, communications overhead. SIMD instructions in the CPU have direct access to all the same registers and data as regular instructions. Moving data to a GPU and back is a very expensive operation relative to that. The chips are just physically further away and have to communicate mostly via memory.

Consider a typical use case for SIMD instructions - you just decrypted an image or bit of audio downloaded over SSL and want to process it for rendering. The data is in the CPU caches already. SIMD will munch it.


For certain professions, like media editing, vector instructions help. But for your average Facebook / Netflix / Microsoft Word user, which is what 95% of users are, vector instructions bring less benefit.


Are you saying Facebook, Netflix and Microsoft Word don't require media processing? Pretty sure you'd see plenty of SIMD instructions being executed in libraries called by those applications.


AVX is widely used in things as basic as string parsing. Does your application touch XML or JSON? Odds are good that it uses AVX.

Does your game use Denuvo? Then it straight-up won't run without AVX.

People are stuck in a 2012 mindset that AVX is some newfangled thing. It's not; it's used everywhere now. And it will be even more widely used once AVX-512 hits the market: even if you are not using the full 512-bit width, AVX-512 adds a bunch of new instruction types that fill in gaps in the existing sets and extend them with GPU-like features (lane masking).
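For a flavor of what AVX in string parsing looks like, here is a hedged sketch (illustrative only; real parsers such as simdjson are far more elaborate) that uses AVX2 to check 32 bytes at a time for quote characters and return their positions as a bitmask.

  #include <immintrin.h>

  /* Returns a 32-bit mask with bit i set where buf[i] == '"'.
     Assumes at least 32 readable bytes at buf. */
  unsigned quote_mask32(const char *buf) {
      __m256i chunk  = _mm256_loadu_si256((const __m256i *)buf);
      __m256i quotes = _mm256_set1_epi8('"');
      return (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, quotes));
  }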


Are you saying that iPhones and iPads are bad at Facebook, Netflix, and Microsoft Word? If they are, the end user certainly can’t tell. If they aren’t, then it doesn’t really matter does it?


Phones are much more reliant on having hardware decoders for things like video while desktops can usually get away with a CPU-based implementation, yes.


Sure but the same is true about performance in general.


That's not really true. Single-threaded scalar performance is still super important for the everyday responsiveness of laptop/desktop systems. Especially for applications like web browsing which run JavaScript.


Your UI is slow because of IO and RAM and O(n^2) code, not CPU. Look at your activity monitor.



