
AMD is sending Ryzen/Radeon care packages to game developers - valeg
https://www.overclock3d.net/news/cpu_mainboard/amd_is_sending_ryzen_radeon_care_packages_to_game_developers/1
======
brynet
Does anyone @ AMD want to help out our niche developer/gaming community? ;-)

[https://www.reddit.com/r/openbsd_gaming/](https://www.reddit.com/r/openbsd_gaming/)

edit: With OpenBSD developer hat on, hardware donations welcome here too:
[https://www.openbsd.org/want.html](https://www.openbsd.org/want.html)

~~~
voltagex_
I don't think I've seen any AMD employees on HN. I wonder if there's a better
contact.

~~~
OnlyLys
I've seen AMD representatives on Reddit's r/amd often. Maybe parent poster can
try there.

~~~
brynet
Thanks! You never know.

[https://www.reddit.com/r/Amd/comments/8u8c80/amd_is_sending_...](https://www.reddit.com/r/Amd/comments/8u8c80/amd_is_sending_ryzenradeon_care_packages_to_game/e1f24yh/)

------
joombaga
> Several professional tools were also updated, with Zbrush offering a
> 204,772% performance improvement in specific workloads, showcasing how much
> can be gained when software is optimised with Ryzen in mind.

Comments indicate that's not a misprint. I'd love to hear the details of a 204
thousand percent improvement.

~~~
dragontamer
[https://developer.blender.org/T53068](https://developer.blender.org/T53068)

So I dunno about ZBrush. But Blender is a similar 3d program. There was a
multithreading issue which would cause Blender to work for HOURS before
actually completing some trivial tests.

Clearly, AMD Zen / Infinity Fabric has an edge case that older systems can
handle just fine. Note that Blender uses Windows pthreads, which has an
incredibly BAD implementation of spinlocks and mutexes.

Blender fixed this issue by making a customized implementation of spinlocks. I
dunno what ZBrush did, but these kinds of architectural regressions happen
every now and then.

~~~
Dig1t
I'd love to learn more about Windows pthreads' poor implementation of
spinlocks and mutexes. Do you have any cool links?

~~~
dragontamer
Citation: myself. I'm not an expert but I think I know enough to see a bad
implementation when I see one. As such, please take this analysis with a grain
of salt.

Second: I'm not entirely sure if this applies to "modern" pthreads-win32, but
the Blender project uses a relatively old build of pthreads-win32.

Third: my analysis is derived from the Blender diff here: [https://dev-
files.blender.org/file/data/rjmzozouj44bs6qicd3y...](https://dev-
files.blender.org/file/data/rjmzozouj44bs6qicd3y/PHID-FILE-
cre57pid5mvoizjirptg/file)

I'm assuming that the Blender devs are smarter than me and have already tested
this. :-)

\------------

So going through the Blender source code we come across pthread_spin_lock as
part of the diff. Considering that the comment thread (as well as the diff) is
about adding Win32 #ifdefs and such, we can assume this was the cause of the
performance issue.

\---------

I assume... okay, so my logic isn't 100% tight, sorry :-(... that the
implementation uses this pthreads-win32: [http://sourceware.org/pthreads-
win32/](http://sourceware.org/pthreads-win32/)

Which brings us to here: ftp://sourceware.org/pub/pthreads-
win32/sources/pthreads-w32-2-9-1-release/pthread_spin_lock.c

\---------

Let's break down why this is a bad implementation:

1\. No "pause" instruction, causing latency issues when leaving the spinlock.
See [https://msdn.microsoft.com/en-
us/library/windows/desktop/ms6...](https://msdn.microsoft.com/en-
us/library/windows/desktop/ms687419\(v=vs.85\).aspx) or
[https://software.intel.com/en-us/node/524249](https://software.intel.com/en-
us/node/524249) for more information on why "pause" should be used in every
spinlock. Strangely enough, the "pause" instruction is found in Linux
implementations of pthreads (so the author for pthreads-w32 just... didn't
look at the Linux implementation or something?)

2\. PTW32_INTERLOCKED_COMPARE_EXCHANGE_LONG seems inefficient to me. I'd
imagine that the usage should be using a swap instead (aka:
[https://docs.microsoft.com/en-us/windows-
hardware/drivers/dd...](https://docs.microsoft.com/en-us/windows-
hardware/drivers/ddi/content/wdm/nf-wdm-interlockedexchange) ). Although I
don't have any microbenchmark that "proves" this, I'd imagine that a swap is
more efficient than a compare-and-swap.

3\. A potential "slowpath" which includes a mutex-lock. Note that Windows
Mutexes are incredibly heavy: including security features and even a
"filesystem like" naming scheme. (See Windows Object Handles if you don't
believe me). Windows Mutexes are very fully featured: recursive, multiple
levels of security, and more.
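
For reference, points 1 and 2 above can be sketched roughly like this. This is a hypothetical illustration, not the pthreads-win32 or Blender code: `SpinLock`, `CPU_RELAX` and `count_with_spinlock` are made-up names, it uses C++ `std::atomic` rather than the Win32 Interlocked functions, and no performance claim is being made for it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() std::this_thread::yield()  // assumption: fallback for non-x86 targets
#endif

// Hypothetical test-and-test-and-set spinlock (illustration only).
class SpinLock {
public:
    void lock() {
        for (;;) {
            // Point 2: a plain swap. If the old value was false, the lock is ours.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Point 1: wait on a read-only load and "pause" between polls.
            while (locked_.load(std::memory_order_relaxed)) CPU_RELAX();
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Runs `threads` workers that each add 1 to a shared counter `iters` times.
long count_with_spinlock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the spinlock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `count_with_spinlock(4, 10000)` should return 40000 if the lock is doing its job. Whether the swap really beats a compare-and-swap on a given CPU would still need a benchmark.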

This is a common problem when Linux systems coders move to Windows. The
Windows "Critical Section" is a more appropriate replacement for Linux
pthread_mutexes.

I'd say Windows Mutexes might be more similar to a Linux filesystem-level
semaphore. Linux semaphores provide those security features integrated with
the filesystem.

The equivalent to Linux mutexes and spinlocks is something called a "Critical
Section" in Windows land.

[https://docs.microsoft.com/en-
us/windows/desktop/Sync/critic...](https://docs.microsoft.com/en-
us/windows/desktop/Sync/critical-section-objects)

Critical sections are Windows's lightweight yielding / task sleeping
mechanism.

\---------

Finally, Windows provides a spinlock: [https://docs.microsoft.com/en-
us/windows/desktop/api/synchap...](https://docs.microsoft.com/en-
us/windows/desktop/api/synchapi/nf-synchapi-
initializecriticalsectionandspincount)

Which is probably what pthreads-w32 should really be using. I mean, Linux
devs basically just don't know what Windows offers, and the pthread-win32
library is a poor fit at the moment.
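
A rough sketch of that "translation", under some assumptions: `LightLock` and `count_with_lightlock` are hypothetical names, the spin count is an arbitrary illustrative value, and this is not how pthreads-w32 is actually written. On Windows it wraps a Critical Section with a spin count; elsewhere it falls back to a plain pthread mutex so the same code builds on both.

```cpp
#include <cassert>
#include <thread>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#endif

// Hypothetical lightweight lock (illustration only).
class LightLock {
public:
    LightLock() {
#ifdef _WIN32
        // Spin briefly in user mode before falling back to a kernel wait.
        InitializeCriticalSectionAndSpinCount(&cs_, kSpinCount);
#else
        pthread_mutex_init(&mtx_, nullptr);
#endif
    }
    ~LightLock() {
#ifdef _WIN32
        DeleteCriticalSection(&cs_);
#else
        pthread_mutex_destroy(&mtx_);
#endif
    }
    LightLock(const LightLock&) = delete;
    LightLock& operator=(const LightLock&) = delete;

    void lock() {
#ifdef _WIN32
        EnterCriticalSection(&cs_);
#else
        pthread_mutex_lock(&mtx_);
#endif
    }
    void unlock() {
#ifdef _WIN32
        LeaveCriticalSection(&cs_);
#else
        pthread_mutex_unlock(&mtx_);
#endif
    }

private:
#ifdef _WIN32
    static constexpr DWORD kSpinCount = 4000;  // assumption: illustrative value
    CRITICAL_SECTION cs_;
#else
    pthread_mutex_t mtx_;
#endif
};

// Runs `threads` workers that each add 1 to a shared counter `iters` times.
long count_with_lightlock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    LightLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Note this only covers the plain lock/unlock case. It says nothing about condition variables or static initialisation.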

\-----------------

Sorry if you were expecting more complete analysis and testing. Lol. It's
mostly me looking at the thing and making a guess and reverse-engineering the
patch-notes from the Blender discussion.

~~~
caf
These days it would likely be best to implement pthread-win32 mutexes using
WaitOnAddress.

~~~
dragontamer
Huh, I didn't know about this new primitive. Thanks! It seems to have been
introduced into Windows 8 and later, as a competitor against Linux's futex.

I did a brief search online, and Raymond Chen had this to say:

[https://blogs.msdn.microsoft.com/oldnewthing/20160825-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20160825-00/?p=94165)

> [snip] in real life, you should just use EnterCriticalSection because it
> has stuff like spin counts and lock convoy resistance.

Is there anything that pthread_mutex_lock does that EnterCriticalSection
doesn't do? It seems like a good "translation" to me. I'd only use the
WaitOnAddress thing if pthread_mutex_lock had some features that a raw
EnterCriticalSection wouldn't do.

~~~
caf
You need to be able to link your pthread condition variable implementation to
your pthread mutex implementation such that a pthread_cond_wait() does an
atomic wait-on-condition-variable-and-unlock-mutex, noting that another thread
can signal the condition variable without holding any mutex.

In addition, pthread mutexes can be statically initialised which requires some
kind of initialise-on-first-use for those if you implement them using Critical
Section objects.
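
To make both requirements concrete, here is a minimal POSIX-side sketch of what a replacement would have to support. The names (`g_mtx`, `g_cond`, `producer`, `wait_for_payload`) are hypothetical, and this shows only the behaviour callers rely on, not a Windows implementation.

```cpp
#include <cassert>
#include <pthread.h>

// Static initialisation: no init function runs before first use.
static pthread_mutex_t g_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t g_cond = PTHREAD_COND_INITIALIZER;
static bool g_ready = false;  // predicate guarded by g_mtx
static int g_payload = 0;     // data guarded by g_mtx

static void* producer(void* arg) {
    int value = *static_cast<int*>(arg);
    pthread_mutex_lock(&g_mtx);
    g_payload = value;
    g_ready = true;
    pthread_mutex_unlock(&g_mtx);
    // The signal is sent without holding any mutex, which POSIX allows.
    pthread_cond_signal(&g_cond);
    return nullptr;
}

// Starts a producer thread and blocks until it has published `value`.
int wait_for_payload(int value) {
    g_ready = false;  // safe: no other thread exists yet
    pthread_t tid;
    int rc = pthread_create(&tid, nullptr, producer, &value);
    assert(rc == 0);
    (void)rc;

    pthread_mutex_lock(&g_mtx);
    while (!g_ready) {
        // Atomically releases g_mtx and waits; g_mtx is held again on return.
        pthread_cond_wait(&g_cond, &g_mtx);
    }
    int got = g_payload;
    pthread_mutex_unlock(&g_mtx);

    pthread_join(tid, nullptr);
    return got;
}
```

The loop around `pthread_cond_wait()` re-checks the predicate, so it stays correct whether the signal arrives before or after the waiter blocks.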

I believe WaitOnAddress(), rather than being a "competitor" for futex was
actually added (or at least, the underlying infrastructure was) in order to be
able to implement futex() in the WSL.

~~~
dragontamer
Ah yes, I forgot about condition variables. That makes sense then.

------
shmerl
Will such optimizations help in Wine on Linux too?

 _> Game developers also know that next-generation consoles from both Sony and
Microsoft are on the horizon, both of which are likely to use Ryzen/Radeon CPU
and graphics hardware, making now the perfect time to start looking deeper
into Ryzen. Both future consoles are also expected to feature Radeon graphics
features like Rapid Packed Math, which enables FP16 calculations to be
completed 2x faster than standard 32-bit math, allowing some graphical
elements to be accelerated, assuming the extra precision of FP32 isn't
required._

Is anyone working on unblocking Sony's PlayStation lock-in, to provide Vulkan
→ GNM translation layer? There is such effort for D3D12 which will address
Xbox.

~~~
kevingadd
fwiw, SDL2-based games that use regular OpenGL are already portable to XBox
because SDL2 has a UWP backend and you can run ANGLE on UWP to get OpenGL
support. There's even a working implementation of FNA for it which means old
XBox Live Indie Games titles can be ported over easily.

~~~
shmerl
Vulkan translation is important for more demanding engines. UWP doesn't seem
to support Vulkan even on Windows: [https://github.com/KhronosGroup/Vulkan-
Docs/issues/366](https://github.com/KhronosGroup/Vulkan-Docs/issues/366)

On Xbox there is no option but to translate to D3D12 anyway.

~~~
pjmlp
More demanding games are doing just fine with Cocos2d-X, Unreal, Unity,
Ogre3D, CryEngine, Lumberyard....

------
usermac
I love this. I recall when Palm did an SF event and I thought they should have
given every participant a phone–they didn't. Not long after they were gone.

------
trengrj
Does anyone have any good info on Ryzen ECC RAM support? I’ve read a bunch of
conflicting information and that some motherboard vendors have been
misleading.

~~~
ComputerGuru
It’s very straightforward: ECC is fully supported by the CPU and chipset, but
must be accounted for by the motherboard maker. Check the specs/manual for the
motherboard before buying, in particular the tested hardware list.

I’ve been using 32GB of ECC DDR4 with a 1950X and X399 Taichi without a
problem.

~~~
simcop2387
I've also been using ECC with a 1950x, but an X399 Asrock Pro-gaming board.
I've confirmed that it works by watching errors when overclocking the memory.

------
SubiculumCode
A little off-topic, but planning on building a Ryzen 2600x system soon. Does
anyone know if Ubuntu works well with Ryzen processors?

~~~
pja
Just switched my home desktop to a Ryzen 2600X / Gigabyte AX370M-DS3H
motherboard: Only hiccup was having to source a pre-Ryzen AM4 CPU to update
the BIOS to get Ryzen support, which was a pain - I'd advise making sure that
any board you buy has had the BIOS updated if required by the seller.

I run Debian Testing mostly, so a 4.16.16 kernel. Been perfectly stable for me
so far.

NB - For anyone hoping to use the new AMD Vega CPUs with integrated
graphics, the Debian kernels don't have the relevant support compiled in, so
you'll have to compile a custom kernel & grab the latest firmware files from
the linux-firmware git repo if you want accelerated graphics.

~~~
PascLeRasc
If you end up getting a motherboard without a compatible BIOS, AMD will send
you a temporary CPU for free. ([https://support.amd.com/en-us/kb-
articles/Pages/2Gen-Ryzen-A...](https://support.amd.com/en-us/kb-
articles/Pages/2Gen-Ryzen-AM4-System-Bootup.aspx))

~~~
pja
Yes, but online reports suggest the round trip time is a couple of weeks. I
ended up ordering the cheapest possible AM4 CPU from Amazon Warehouse &
returning it after updating the BIOS instead.

~~~
Tijdreiziger
A tech forum that's local to me ran a 'CPU forwarding ring' - you could sign
up to have a CPU sent to you, and after upgrading your BIOS you'd have to send
it to the next person who needed it. I thought that was a genius solution.

------
rasz
Wake me up when they start hiring driver developers.

