Cold_Miserable's comments

This is ~9.6x faster than "scalar".

    ASM_TestDiv proc                            ;rcx out, rdx A, r8 B
        mov rax,05555555555555555H
        kmovq k1,rax
        vmovdqu8 zmm0,zmmword ptr [rdx]
        vmovdqu8 zmm4,zmmword ptr [r8]
        vpbroadcastw zmm3,word ptr [FLOAT16_F8]
        vmovdqu8 zmm2{k1},zmm0                  ;lower 8-bit
        vmovdqu8 zmm16{k1},zmm4                 ;lower 8-bit
        vpsrlw zmm1,zmm0,8                      ;higher 8-bit
        vpsrlw zmm5,zmm4,8                      ;higher 8-bit
        vpord zmm1,zmm1,zmm3
        vpord zmm2,zmm2,zmm3
        vpord zmm5,zmm5,zmm3
        vpord zmm16,zmm16,zmm3
        vsubph zmm1,zmm1,zmm3{rd-sae}           ;fast conv 16FP
        vsubph zmm2,zmm2,zmm3{rd-sae}
        vsubph zmm5,zmm5,zmm3{ru-sae}
        vsubph zmm16,zmm16,zmm3{ru-sae}
        vrcpph zmm5,zmm5
        vrcpph zmm16,zmm16
        vfmadd213ph zmm1,zmm5,zmm3{rd-sae}
        vfmadd213ph zmm2,zmm16,zmm3{rd-sae}
        vxorpd zmm1,zmm1,zmm3
        vxorpd zmm2,zmm2,zmm3
        vpsllw zmm1,zmm1,8
        vpord zmm1,zmm1,zmm2
        vmovdqu8 zmmword ptr [rcx],zmm1         ;16 8-bit unsigned int
        ret


Heh? Surely fast-converting 8-bit int to 16-bit FP, then rcp + mul/div, is a no-brainer? Edit: make that fast convert, rcp, fma (FP16 constant 1.0) and xor (same constant).
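A minimal scalar sketch of what that trick looks like, assuming the magic constant is FP16 1024.0 (bit pattern 0x6400); the actual FLOAT16_F8 value in the asm above isn't shown. This sketch also uses a plain division with default rounding, where the asm uses vrcpph plus directed rounding ({ru-sae}/{rd-sae}) to force a truncated quotient. Requires a compiler with _Float16 support (e.g. gcc/clang with -mavx512fp16):

    #include <stdint.h>
    #include <string.h>

    /* OR the 8-bit int into the mantissa of 1024.0 (assumed constant),
       subtract 1024.0 to finish the int -> FP16 conversion, divide,
       add 1024.0 back so the quotient lands in the mantissa, then XOR
       away the exponent bits to recover the integer. Note: default
       round-to-nearest gives a rounded quotient here; the asm's
       {rd-sae} variants give a truncated one. */
    static uint8_t div_u8(uint8_t a, uint8_t b) {
        const uint16_t magic = 0x6400;       /* FP16 bits of 1024.0 */
        uint16_t abits = magic | a, bbits = magic | b, qbits;
        _Float16 fa, fb, fm, q;
        memcpy(&fa, &abits, 2);
        memcpy(&fb, &bbits, 2);
        memcpy(&fm, &magic, 2);
        q = (fa - fm) / (fb - fm) + fm;      /* convert, divide, re-bias */
        memcpy(&qbits, &q, 2);
        return (uint8_t)(qbits ^ magic);     /* strip exponent bits */
    }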

Unfortunately none of the hardware used for testing supports FP16 arithmetic. Between Intel and AMD, the only platform that supports AVX512-FP16 is currently Sapphire Rapids.

I tried a similar approach with 32-bit FP before, and the problem here is that fast conversion is only fast in the sense of latency. Throughput-wise, it takes 2 uops instead of one, so in the end, a plain float<->int conversion wins.
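For reference, the classic FP32 flavor of that trick, which is where the two uops (OR + SUB) come from, versus a single convert instruction:

    #include <stdint.h>
    #include <string.h>

    /* For 0 <= x < 2^23: OR x into the mantissa of 8388608.0f (2^23,
       bit pattern 0x4B000000) so the float's value becomes 2^23 + x,
       then subtract 2^23. Both steps are exact. */
    static float fast_u32_to_f32(uint32_t x) {
        uint32_t bits = 0x4B000000u | x;
        float f;
        memcpy(&f, &bits, 4);
        return f - 8388608.0f;
    }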

How is the built-in Administrator account faster than a named account? I wonder if it's a bloatware effect. They need to test with all bloatware removed: no Control Flow Guard, no Defender, no firewall, appearance set to best performance, no sounds, no background, microcode DLLs deleted, mitigations disabled in regedit, storport disabled, every service disabled, every app deleted, Edge deleted, etc.


From the article it sounds like it’s a change to scheduling priorities, not to what is running (from other comments it sounds like there’s a change in latency related to moving processes across cores in different packages?).


Indeed. Intel is trash right now. No AVX-512, thus AMD is infinitely faster.


Define infinitely


Does Zen 5 have cldemote or senduipi ?


Probably not.

The gcc Zen 5 patch adds only AVX-VNNI, MOVDIRI, MOVDIR64B, AVX512-VP2INTERSECT and PREFETCHI as new instructions.

Zen 4 already had AVX-512 VNNI (for ML/AI); AVX-VNNI is only an alternate encoding for programs that use only AVX (because they were compiled for Intel Alder Lake/Raptor Lake).

MOVDIRI and MOVDIR64B can be very useful in some device drivers, less often in general-purpose programs.
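A hedged sketch of the driver-side use, with a hypothetical mmio_window mapping (the _movdir64b intrinsic lives in immintrin.h; compile with -mmovdir64b, and the destination must be 64-byte aligned):

    #include <immintrin.h>
    #include <stdint.h>

    /* MOVDIR64B issues a single, non-torn 64-byte direct store, so a
       whole descriptor can be pushed to a device doorbell/MMIO region
       in one instruction. */
    static void post_descriptor(volatile void *mmio_window,
                                const uint8_t desc[64]) {
        _movdir64b((void *)mmio_window, desc);
    }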


250 is nonsense. 2x FMA per cycle @ ~4.5 GHz = 32 flops/cycle x 4.5 GHz ≈ 144 GFLOPS.
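Back-of-envelope check of that figure, assuming 2 FMA pipes of 8 lanes each (the comment doesn't say FP32 or FP64; 2 pipes x 8 lanes x 2 flops = 32 flops/cycle either way):

    #include <stdio.h>

    /* peak = pipes x lanes x 2 flops/FMA x clock */
    int main(void) {
        double pipes = 2, lanes = 8, flops_per_fma = 2, ghz = 4.5;
        printf("peak: %.0f GFLOPS/core\n",
               pipes * lanes * flops_per_fma * ghz);   /* ~144 */
        return 0;
    }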

Beating cuBLAS is unlikely. You probably made a mistake. Last I tested, cuBLAS was even better than MKL in efficiency.


Yes, the figure was 250 GFLOPS for 4 cores, not per core; I misread. Still impressive, but more reasonable.


The floating-point FMA throughput per desktop CPU socket and per clock cycle has doubled every few years, in the sequence: Athlon 64 (2003) => Athlon 64 X2 (2005) => Core 2 (2006) => Nehalem (2008) => Sandy Bridge (2011) => Haswell (2013) => Coffee Lake Refresh (2018) => Ryzen 9 3950X (2019) => Ryzen 9 9950X (2024), going from 1 FP64 FMA/socket/clock cycle to 256 FP64 FMA/socket/clock cycle, with double those numbers for FP32 FMA (1 FMA/s is counted as 2 Flop/s).


I wish memory bandwidth had also doubled that often on desktops. Instead of 256x (even more, given 2-3x higher core frequency), only a 14x increase: DDR-400 at 6.4 GB/s => DDR5-5600 at 89.6 GB/s. The machine balance keeps falling further.

While flash memory has become very fast in recent years, I haven't heard of any breakthrough technology prototypes promising significant progress in RAM, let alone RAM latency, which has remained roughly constant (+/- a few ns) all these years.
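A quick machine-balance calculation from the figures in this thread (256 FP64 FMA/socket/cycle from the comment above, 89.6 GB/s of DDR5-5600, and an assumed ~4.5 GHz clock):

    #include <stdio.h>

    /* machine balance = peak flop rate / memory bandwidth */
    int main(void) {
        double flops_per_cycle = 256 * 2;      /* FMA counted as 2 flops */
        double peak = flops_per_cycle * 4.5e9; /* ~2.3 Tflop/s FP64      */
        double bw = 89.6e9;                    /* DDR5-5600, bytes/s     */
        printf("%.0f flops per double loaded\n", peak / bw * 8); /* ~206 */
        return 0;
    }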


You are right, which is why in modern CPUs the maximum computational throughput can be reached only when a large part of the operands can be reused, so they can be taken from the L1 or from the L2 cache memories.

Unlike the bandwidth of the main memory or of the shared L3 cache, the bandwidth of the non-shared L1 and L2 caches has increased in exactly the same ratio as the FMA throughput. Almost all CPUs have always been able to do exactly as many loads from the L1 cache per clock cycle as FMA operations per clock cycle (along with, typically, half that number of stores to the L1 cache per clock cycle).

Had this not been true, the computational execution units of the cores would have become useless.

Fortunately, the solution of systems of linear equations and the multiplication of matrices are very frequent operations, and they reuse most of their operands, so they can reach the maximum computational throughput.
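A minimal cache-blocking sketch (tile size illustrative) showing that reuse: each tile of B is reused across a whole block of rows of A, so most operands are served from the L1/L2 caches rather than DRAM:

    #include <stddef.h>

    #define T 64  /* tile size, illustrative */

    static void matmul_blocked(size_t n, const double *A,
                               const double *B, double *C) {
        /* C += A * B, n x n row-major; the i-k-j order keeps a row of
           the B tile streaming through L1 while A[i][k] stays in a
           register */
        for (size_t ii = 0; ii < n; ii += T)
            for (size_t kk = 0; kk < n; kk += T)
                for (size_t jj = 0; jj < n; jj += T)
                    for (size_t i = ii; i < ii + T && i < n; i++)
                        for (size_t k = kk; k < kk + T && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + T && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }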


It's not possible to automate. Log is one example: you can't just use log(x), you have to use log(x+1). There are other problems, like the "zero problem", and it's still possible to devise better approximations using other elementary operations such as |x| or ABS(x).
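If the log point is the usual one about arguments near zero, here is a small illustration of why an approximation has to be built around log(x+1) (log1p) rather than a raw log(x) kernel:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 1e-9;
        /* 1.0 + x throws away the low bits of x before log ever runs */
        printf("log(1+x) = %.17g\n", log(1.0 + x));
        /* log1p keeps them; the result is ~x - x*x/2, fully accurate */
        printf("log1p(x) = %.17g\n", log1p(x));
        return 0;
    }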


Intel makes money selling motherboard chipsets. There's no excuse, just greed.


Even a thicko can figure out this "research". You can't exercise your eyes to change their physical shape.


Biased article. 14nm was delayed and 10nm was delayed by 4 years.

