
“CPUs are optimized for video games” - zx2c4
https://moderncrypto.org/mail-archive/noise/2016/000699.html
======
sapphireblue
This may be an unpopular opinion, but I find it completely fine and reasonable
that CPUs are optimized for games and weakly optimized for crypto, because
games are what people want.

Sometimes I can't help but wonder what a world with no need to spend endless
billions on "cybersecurity" and "infosec" would look like. Perhaps these
billions would be used to create more value for the people. I find it insane
that so much money and manpower is spent on scrambling data to "secure" it
from vandal-ish script kiddies (sometimes hired by governments); there is
definitely something unhealthy about it.

~~~
camelNotation
People spend a lot of money on physical security as well. They put locks on
their homes and cars, install safes in banks, drive money around in armored
cars, hire armed guards for events, and pay for a police force in every
municipality. The simple fact is that if your money is easy to get, someone
will eventually take it without your permission. That is reality, but calling
it "unhealthy" implies that the current state of things is somehow wrong. I
agree with that premise, but it carries with it a lot of philosophical
implications.

~~~
ythl
> That is reality, but calling it "unhealthy" implies that the current state
> of things is somehow wrong.

I don't spend a lot of money on physical security. I leave my car and front
door unlocked usually, and don't bother with security systems.

If you find yourself having to lock and bolt everything under the sun lest it
get damaged/stolen, then yes, I think it is an indication that the current
state of things is wrong. There is something wrong with the
economy/community/etc. in your area.

I realize that "the internet" doesn't really have boundaries like physical
communities do, but I too wish for a world where security was not an endless
abyss sucking money into it and requiring security updates until the end of
time. In other words - a world where you could leave the front door unlocked
online without having to worry about malicious actors. It will never happen,
of course (at least not until the Second Coming ;)

~~~
ajmurmann
Wow, what country do you live in that you feel comfortable keeping your door
and car unlocked? I've lived in the US and Germany and wouldn't have felt
comfortable doing that in either place.

~~~
dtparr
Countries are not granular enough to use for this sort of thing. I've lived in
several places in the US where unlocked doors were the norm, and then in
several places where it would be a bad idea if you want to keep your things.

------
pcwalton
Games are also representative of the apps that actually squeeze the
performance out of CPUs. When you look at most desktop apps and Web servers,
you see enormous wastes of CPU cycles. This is because development velocity,
ease of development, and language ecosystems (Ruby on Rails, node.js, PHP,
etc.) take priority over using the hardware efficiently in those domains. I
don't think this is necessarily a huge problem; however, it does mean that CPU
vendors are disincentivized to optimize for e.g. your startup's Ruby on Rails
app, since the problem (if there is one) is that Ruby isn't using the
functionality that already exists, not that the hardware doesn't have the
right functionality available.

~~~
nostrademons
Interestingly, the one thing that typical web frameworks _do_ do very
frequently is copy, concatenate, and compare strings. And savvy platform
developers will optimize that heavily. I remember poking around in Google's
codebase and finding replacements for memcmp/memcpy/STL + string utilities
that were all nicely vectorized, comparing/copying the bulk of the string with
SIMD instructions and then using a Duff's Device-like technique to handle the
residual. (Written by Jeff Dean, go figure.)

No idea whether mainstream platforms like Ruby or Python do this...it wouldn't
surprise me if there's relatively low hanging fruit for speeding up almost
every webapp on the planet.
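
For the curious, the general shape of such a routine is roughly this (a
minimal SSE2 sketch, not Google's actual code; real versions handle alignment,
early exits, and the tail far more cleverly):

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstddef>

    // Compare two buffers 16 bytes at a time, then fall back to a byte loop
    // for the residual. Returns true if the buffers are equal.
    bool fast_equal(const char* a, const char* b, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            // Byte-wise equality; movemask yields one bit per byte lane.
            if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
                return false;
        }
        for (; i < n; ++i)  // the leftover bytes, handled scalar
            if (a[i] != b[i]) return false;
        return true;
    }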

~~~
MichaelGG
Why is this even a thing? Copying and the like is such a common operation. Why
don't chip providers offer a single instruction that gets decoded to the
absolute fastest way the chip can do it? That'd even allow them to, maybe, do
some behind-the-scenes optimization, bypassing caches or something. It's
painful that such a common operation needs highly specialized code. I know you
can just REP an operation but apparently CPUs don't optimize this the same
way.

This is too obvious an issue, so there must be a solid reason. What is it?

~~~
Tloewald
This seems like something that compilers should do and CPU instruction sets
should not.

~~~
MichaelGG
Why? It's a common op that requires internal knowledge of every
microarchitecture, isn't it? Seems like something that should be totally
offloaded to the CPU so you're guaranteed best performance.

~~~
anjc
The message you were referring to was talking about code for copying strings.
If you wanted an instruction to copy lots of strings, the CPU would need to
know what a character is (which could be 7 bits, 8 bits, 16 or 32), what a
string is, how it's terminated, what ASCII and Unicode are, be able to
accommodate new character encoding standards, etc. Then you would need other
instructions for other high level datatypes. That's not what CPUs do, because
you're limited by how much more logic/latency you can add to an architecture,
how many distinct instructions you can implement with the bits available per
instruction, how many addressing modes you want etc.

So instead, this information/knowledge about high level data types is
encapsulated by standard libraries and then the compiler below that. Most CPUs
have single instructions to copy a chunk of data from somewhere to somewhere
else and a nice basic way to repeat this process efficiently, and it's up to
the compiler to use this.
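
On x86 that basic primitive is rep movsb, and newer chips with "enhanced REP
MOVSB" do make it fast for large copies. A rough sketch of what the compiler
or libc builds on (x86 with GCC/Clang inline asm assumed; in practice you just
call memcpy and let the library pick the best path):

    #include <cstddef>

    // Copy n bytes with the x86 string-move instruction. Normally the
    // compiler/libc makes this choice for you; this only shows the primitive.
    void rep_copy(void* dst, const void* src, size_t n) {
    #if defined(__x86_64__) || defined(__i386__)
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    #else
        char* d = static_cast<char*>(dst);           // portable fallback
        const char* s = static_cast<const char*>(src);
        while (n--) *d++ = *s++;
    #endif
    }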

------
speeder
As a gamedev I found that... weird.

A CPU for games would have very fast cores, larger caches, faster
(lower-latency) branch prediction, a fast APU, and double-precision floating
point.

Few games care about multicore; many "rules" are completely serial, and more
cores don't help.

Also, gigantic SIMD is nice, but most games never use it unless it is ancient,
because compatibility with old machines is important for reaching a wide
market.

And again, many CPU-demanding games are running serial algorithms on serial
data; matrices are usually only essential to stuff the GPU is doing anyway.

To me, CPUs are instead optimized for Intel's biggest clients (server and
office machines).

~~~
SolarNet
I disagree. As a gamedev writing game logic you are right.

But as an engine programmer, I agree with the linked author. I'll take your
points one at a time.

Most engines are multi-core, but we do different things on each core (and this
is where Intel's hyper-threading, where portions are shared between the
virtual cores, for cheaper than entire new cores, is a solid win). Typically a
game will have at least a game logic thread (what you are used to programming
on) and a "system" thread which is responsible for getting input out of the OS
and pushing the rendering commands to the card along with some other things.
Then we typically have a pool of threads (n - 1, where n is the number of
logical cores on the machine: -2 for the two main threads, +1 to saturate)
which pull work
off of an asynchronous task list: load files from disk, wait for servers to
get back to us, render UI, path-finding, AI decisions, physics and rendering
optimization/pre-processing, etc.
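
A stripped-down sketch of that worker-pool pattern, in generic C++11 (nothing
engine-specific; real engines use lock-free queues, job priorities, and
per-frame synchronization):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // n-1 workers pulling jobs (file loads, pathfinding, physics prep, ...)
    // off a shared queue while the game-logic and "system" threads run.
    class JobPool {
    public:
        explicit JobPool(unsigned workers) {
            for (unsigned i = 0; i < workers; ++i)
                threads_.emplace_back([this] { run(); });
        }
        ~JobPool() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto& t : threads_) t.join();
        }
        void submit(std::function<void()> job) {
            { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty()) return;
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();  // e.g. load a file, run pathfinding, cull a scene
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        std::vector<std::thread> threads_;
        bool done_ = false;
    };

You'd size it roughly as JobPool pool(std::thread::hardware_concurrency() - 1)
and feed it pool.submit(...) closures from the main threads.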

AAA game studios will use up to 4 core threads by carefully orchestrating data
between physics, networking, game logic, systems, and rendering tasks (e.g.
thread A may do some networking (33%) and then rendering (66%), while thread B
does scene traversal (66%) and then input (33%); see the 33% overlap?). They
also do this to better optimize for consoles. But then they have tighter
control over their game devs and can break game logic into different sections
to be better parallelized, whereas consumer game engines have to maintain the
single-thread perception.

SIMD is used everywhere: physics uses it, rendering uses it, UI drawing can
use it, AI algorithms can use it. Many engines (your physics or rendering
library included) will compile the same function 3 or 4 different ways so that
we can use the latest available on load. It's not great for game logic because
it's expensive to load into and out of, but for some key stuff it's amazing
for performance.
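
A hedged sketch of that "build it several ways, pick at load time" trick,
assuming x86 and the GCC/Clang builtins (the routine and its name are made up
for illustration; engines often do the same thing via cpuid by hand):

    #include <immintrin.h>
    #include <cstddef>

    // The same routine built two ways; the AVX2 body is allowed here via the
    // target attribute even if the file isn't compiled with -mavx2.
    __attribute__((target("avx2")))
    static void scale_avx2(float* p, size_t n, float k) {
        __m256 vk = _mm256_set1_ps(k);
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), vk));
        for (; i < n; ++i) p[i] *= k;
    }

    static void scale_scalar(float* p, size_t n, float k) {
        for (size_t i = 0; i < n; ++i) p[i] *= k;
    }

    using scale_fn = void (*)(float*, size_t, float);

    // Resolve once at startup, then always call through the pointer.
    static scale_fn pick_scale() {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx2") ? scale_avx2 : scale_scalar;
    }

    static const scale_fn scale = pick_scale();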

That stuff the GPU is doing eats up a whole core or more of CPU time. So what
if we are generally running serial algorithms? We need to run 6 different
serial algorithms at once, and that's what general-purpose CPUs were built
for.

This is all the stuff you often don't have to deal with because you're coddled
by your game engine, the same way that webdevs don't have to worry about how
the web browser is optimizing their web pages.

~~~
yoklov
Glad somebody wrote this. I agree 100% (well... probably more like 90% -- but
mostly nits that aren't worth getting into).

~~~
SolarNet
To be fair I'm more of a hobbyist - who writes game-engine-esque code (I never
said what kind of engine programmer I am, did I?) for my day job (pays better) -
that just builds game engines for fun (like the last 10 years now... but no
games). So some details are likely wrong, I'm kinda super curious as to your
nits.

~~~
daemin
Sounds like what I did for the past 10 years before joining the gamedev world
about 3 years ago. It is cool to work on your own tech and to learn a lot of
different things, but it's also scary how much can get done with a whole team
working at it.

------
Narann
The real quote would have been:

> Do CPU designers spend area on niche operations such as _binary-field_
> multiplication? Sometimes, yes, but not much area. Given how CPUs are
> actually used, CPU designers see vastly more benefit to spending area on,
> e.g., vectorized floating-point multipliers.

So, CPUs are not "optimized for video games"; they are optimized for
"vectorized floating-point multipliers", something video games (and many other
workloads) benefit from.

~~~
nemothekid
Why are they optimized for vectorized floating-point multipliers? Does the CEO
of Intel just tell all the engineers to do this because he likes
multiplication?

~~~
daveguy
They are optimized for that because a lot of algorithms can make use of them,
from quicksort/mergesort through image rendering and encryption. It is an easy
optimization from a hardware perspective -- simple repetitive hardware
structure. This is why GPUs are so powerful and games are not the only thing
that benefits from this type of optimization. Matrix multiplication is also
used in signal processing. The CEO asked, how can we optimize the use of our
hardware for the most benefit? And SIMD with wide pipes is at the top of the
list. Most of the post is about all the new algorithms that can take advantage
of the hardware push. The hardware push is there because it is an easy use of
hardware resources.

This is also an optimization that compilers can readily take advantage of on a
small scale (similar to pipelining) so the combination of benefit + ability to
use + simplicity/low resource use makes it an inevitability.
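
As a generic illustration (not from the article): a loop as plain as this is
what compilers turn into packed SSE/AVX multiplies and adds on their own with
optimization turned up (-O3 and friends), no intrinsics required.

    #include <cstddef>

    // saxpy: y = a*x + y. GCC/Clang auto-vectorize this into wide
    // floating-point multiplies and adds when optimization is enabled.
    void saxpy(float a, const float* x, float* y, size_t n) {
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }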

~~~
mcguire
A sort that uses vectorized floating point multiplication?

~~~
sbierwagen
From 2008:
[http://www.vldb.org/pvldb/1/1454171.pdf](http://www.vldb.org/pvldb/1/1454171.pdf)

Also: [http://researcher.watson.ibm.com/researcher/files/jp-INOUEHR...](http://researcher.watson.ibm.com/researcher/files/jp-INOUEHRS/PACT2007-SIMDsort.pdf)
and [https://github.com/NumScale/boost.simd](https://github.com/NumScale/boost.simd)

------
joseraul
TL;DR: To please the gaming market, CPUs grow large SIMD units. ChaCha uses
SIMD, so it gets faster. AES needs array lookups (for its S-box) and gets
stuck.
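
For reference, the ChaCha quarter-round (per RFC 7539) is nothing but 32-bit
adds, XORs, and rotates: no secret-dependent table lookups or branches, which
is exactly what maps cleanly onto those SIMD units. A minimal scalar sketch:

    #include <cstdint>

    static inline uint32_t rotl32(uint32_t v, int c) {
        return (v << c) | (v >> (32 - c));
    }

    // One quarter-round: add/xor/rotate only, constant-time by construction,
    // and trivially run on 4 (SSE) or 8 (AVX2) blocks' worth of words at once.
    static inline void quarter_round(uint32_t& a, uint32_t& b,
                                     uint32_t& c, uint32_t& d) {
        a += b; d ^= a; d = rotl32(d, 16);
        c += d; b ^= c; b = rotl32(b, 12);
        a += b; d ^= a; d = rotl32(d, 8);
        c += d; b ^= c; b = rotl32(b, 7);
    }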

------
wmf
Maybe a better headline would be something like "How software crypto can be as
fast as hardware crypto". I was curious about this after the WireGuard
announcement so thanks to DJB for the explanation.

------
nitwit005
Not really. Just look through the feature lists of some newer processors:

AES encryption support:
[https://en.wikipedia.org/wiki/AES_instruction_set](https://en.wikipedia.org/wiki/AES_instruction_set)

Hardware video encoding/decoding support (I presume for phones):
[https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video](https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video)

It's more that it's relatively easy to make some instruction useful to a
variety of video game problems, but difficult to do the same for encryption or
compression. You tend to end up with hardware support for specific standards.

~~~
apendleton
Did you read the post? This is specifically addressed. The AES hardware
support requires a bunch of die area specifically for that purpose and still
isn't that performant. Smaller-area CPUs don't spend the area and perform
abysmally on AES, and even in CPUs that do include AES-NI, Chacha achieves
comparable performance for the same security margin without any custom
hardware support, just using the general vector instructions added to improve
game performance. DJB expects that because vector math continues to improve
while AES hardware does not, Chacha will soon outperform AES even on devices
with hardware support.

~~~
nitwit005
Thank you for pointlessly regurgitating much of his post?

The fact that Intel put an encryption feature in their chip, which does indeed
make that algorithm faster, would tend to indicate they wanted faster
encryption wouldn't it? That some other algorithm could be faster still isn't
really contradicting that.

~~~
brohee
I'd wager that the goal wasn't so much speed (which is very rarely the issue)
but security. It was way too hard to program a constant time AES
implementation without AES-NI.
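
For comparison, with AES-NI each round is a single constant-time instruction,
which is what made the timing problem go away. A minimal sketch (assumes
pre-expanded AES-128 round keys and compiling with -maes; not a complete
implementation):

    #include <wmmintrin.h>  // AES-NI intrinsics

    // Encrypt one 16-byte block given 11 pre-expanded round keys.
    // Each _mm_aesenc_si128 performs a full AES round with no table lookups.
    static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
        block = _mm_xor_si128(block, rk[0]);            // initial AddRoundKey
        for (int i = 1; i < 10; ++i)
            block = _mm_aesenc_si128(block, rk[i]);     // rounds 1..9
        return _mm_aesenclast_si128(block, rk[10]);     // final round
    }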

------
magila
One important aspect DJB ignores is power efficiency. ChaCha achieves its high
speed by using the CPU's vector units, which consume huge amounts of power
when running at peak load. Dedicated AES-GCM hardware can achieve the same
performance at a fraction of the power consumption, which is an important
consideration for both mobile and datacenter applications.

Gamers generally don't care about power consumption. When you've spent $1000
on the hardware an extra dollar or two on your electricity bill is no big
deal.

~~~
acqq
> CPU's vector units consume huge amounts of power when running at peak load.
> Dedicated AES-GCM hardware can achieve the same performance at a fraction of
> the power consumption

Citation needed. Where did you get that idea? Please show how djb's vector
code spends more power vs the built-in AES "dedicated hardware" instruction
when, as he measures:

"* Both ciphers are ~1.7 cycles/byte on Westmere (introduced 2010).

* Both ciphers are ~1.5 cycles/byte on Ivy Bridge (introduced 2012).

* Both ciphers are ~0.8 cycles/byte on Skylake (introduced 2015)."

"even though AES-192 has "hardware support", a smaller key, a smaller block
size, and smaller data limits" (his code is 256 bits and 12 rounds).

~~~
wmf
AVX is so hot that Intel CPUs may have to clock down ~200 MHz when executing
heavy AVX code to stay within their power/thermal limits. I have no idea if
this hits DJB's code in reality.

[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf)

~~~
acqq
Thanks for the link. I can only find "when the processor detect AVX
instruction additional voltage is applied, the processor can run hotter which
can require the frequency to be reduced", but I don't see it mentioned
anywhere that the drop is 200 MHz. If you mean 200 MHz lower than the
TDP-marked frequency, but processing twice as much data, that doesn't sound so
bad; it's still 1.7 times more power efficient than the shorter instructions
spending twice as much time at the marked TDP frequency. And I'd be surprised
if AES magically didn't need serious processing too; otherwise it would
already be implemented to be much faster than it is now.

~~~
honkhonkpants
It really depends on your instruction mix. If only one in twenty instructions
uses AVX, the rest of your instructions are running slower due to the lower
clock and they aren't getting double the throughput. On top of that it could
be some other thread using AVX, clocking down the entire core and harming the
given thread that isn't using AVX.

Intel has done a lot of things to try to balance this. One of those things is
they don't even bother turning half the vector unit on unless you use it a
lot. If you seldom issue an op with 512-bit operands, the CPU will actually
dispatch them as multiple 256-bit operations, in which case you won't incur
the drop in clock, but you also don't get the supposed benefit of double
throughput. Furthermore, performance may be much worse when the CPU does
decide to power up the rest of the vector unit, because the clock drops
dramatically while those units are charging up.

So you can see that for someone trying to wring out every last bit of
performance on a recent Intel CPU using all the advertised vector
capabilities, optimization can become quite complicated.

------
revelation
I thought modern video games are predominantly limited by GPU performance?
Maybe the argument is that while usually CPU performance isn't the most
important part of the equation, video _gamers_ base their purchasing decision
on misguided benchmarks that expose it.

The big CPU hog and prime candidate for these vector operations nowadays seems
to be video encoding.

~~~
skykooler
It depends on the game. Physics-heavy games for example, like Kerbal Space
Program or Besiege, are usually CPU-limited.

~~~
revelation
Those are both built with Unity though, right? Where the game is basically C#.
Are the actual physics even done vectorized?

~~~
dogma1138
No part of a Unity game runs in a .NET VM or any other VM. They chose C# as
the scripting language because C# is one of the most popular programming
languages, it's extremely popular in the non-game-dev development community,
and it's probably the only non-Web language that most code academies teach for
traditional development, maybe other than Java.

Its syntax is also pretty close to C and C++, which means developers with a
game dev background will feel at home, as most game development is done in
C++.

Unreal Engine uses UnrealScript, which is now pretty much C++, but it is also
not compiled directly (although with Unreal Engine 4 onwards it's much closer
to direct compilation than any other scripting language).

The Unity engine has its own interpreter which then builds highly optimized
C++ code and compiles it when you build the game.

Unity Engine is a pretty decent engine with kickass performance when
optimized; without fine optimization, any general-purpose engine, including
Unreal 4, acts like utter crap. I'm alpha/beta testing a few UE4 games atm and
you can see just how bad performance can get even on a solid de facto industry
standard like UE4, for example when dynamic shadows tank a GTX Titan X
(Maxwell) SLI setup to below 20 fps any time there are light sources that are
not properly fenced and culled, e.g. explosions.

~~~
KON_Air
Your last paragraph ticks me off so much about the current non-sequitur
"industry standard". The most recent example I can give of one that doesn't
really care is EDF 4.1. It takes carpet-bombing an entire city to make its FPS
dip, with hundreds if not thousands of giant insect gibs (and four players)
being flung across the map.

Do they really need a bazillion shaders and dynamic shadows on everything?

~~~
dogma1138
Unreal Engine is an industry standard when it comes to commercial
general-purpose engines.

There are more Unreal Engine titles for any given version than titles for any
other engine on the market, on PCs and consoles.

On mobile, Unity is probably bigger atm.

------
joaomacp
Of course. Gamers are the biggest consumers of new, top of the line PC
hardware.

~~~
vonmoltke
I doubt gamers are outspending datacenters and owners of private clusters.

~~~
firethief
Server hardware is mostly a separate market from PC hardware, since there are
different things to optimize for.

~~~
dbenhur
Yet Intel sells essentially the same microarchitecture to both.

~~~
vegabook
...at enormously different price/flop, basically because it restricts RAM size
and disables ECC in the Core chips. It's why we need AMD's Zen to be
competitive again, so that this price gouging ends. Same for Tesla/Geforce at
Nvidia.

------
milesf
And because CPUs are optimized for both gamers and Windows, the world has
access to lots of cheap, powerful hardware. I'm not a Microsoft fan, but I'm
very appreciative to them for making this ecosystem possible.

In fact, games have always driven the modern computer industry. Even Unix
started because of a game
([http://www.unix.org/what_is_unix/history_timeline.html](http://www.unix.org/what_is_unix/history_timeline.html)).

------
rdtsc
Wonder how a POWER8 CPU would handle it or if it is optimized differently. It
obviously is not geared for the gaming market.

~~~
gtirloni
Not sure about Power8 as I wasn't able to find anything conclusive. But if you
believe Oracle's marketing efforts, the SPARC chips do much better than Power8
and Intel on that front.

[https://blogs.oracle.com/BestPerf/entry/20151025_aes_t7_2](https://blogs.oracle.com/BestPerf/entry/20151025_aes_t7_2)

~~~
Symmetry
SPARC chips are much more optimized for a certain intended set of workloads
than x86 or POWER are so that's not surprising.

------
stephenr
Isn't this exactly why HSMs exist: to provide optimised hardware crypto
functionality?

Honestly I would treat this the same as, e.g., Ethernet: high-end cards have
hardware offload capabilities that the software stack can utilise to get
better performance.

------
tgarma1234
I really find it hard to believe that people with such an interest in security
at the CPU level would buy "retail" processors like the ones you and I have
access to. I am no expert in the field, but it just seems weird that there
isn't a market for, and a producer of, specialized processors that are more
militarized or something. Why does everyone have access to the same Intel
chips? I doubt that's actually the case. Am I wrong?

~~~
pcwalton
> I really find it hard to believe that people with such an interest in
> security at the CPU level would buy "retail" processors like the ones you
> and I have access to.

DJB's interest here is specifically in creating algorithms that work well on
general-purpose popular CPUs.

------
Philipp__
ARMA III could be a good example of a CPU bottleneck. Or maybe it is badly
optimized... Then we hit the hot topic of multicore vs singlecore performance.

~~~
ohstopitu
with the latest update, ARMA III seems to have a massive FPS boost. So it was
definitely not optimized earlier.

~~~
swampthinker
I seem to perpetually hear that with ARMA games.

------
wangchow
The form factor of laptop screens is built for media consumption, even though
the squarer form factor is superior for productivity (I found an old Sony Vaio
and its screen form factor felt very pleasant). Seems the general consumption
of media has dominated CPU design _in addition to_ everything else in our
computers.

~~~
digi_owl
Well the wider screen format allows for a keyboard with a numpad now, without
getting a massive "lip" below the keyboard.

------
rphlx
Perhaps that was true in the mid 90s, but today Intel optimizes x86_64 for its
highest margin core business: server/datacenter workloads. Any resulting
benefit to desktop PC gaming is appreciated, but it's a side effect rather
than a primary design goal.

------
wscott
No, Intel CPUs are optimized to simulate CPUs

Here are some stories from back around 2000, when I was designing CPUs at
Intel. Some people did bemoan the fact that little software actually needed
the performance of the processors we were building. One of the benchmarks
where the performance was actually needed was ripping DVDs. That led to the
unofficial saying "The future of CPU performance is in copyright
infringement." (Not seriously, mind you.)

However, here is a case where the CPUs were actually modified to improve one
particular program.

From:
[https://www.cs.rice.edu/~vardi/comp607/bentley.pdf](https://www.cs.rice.edu/~vardi/comp607/bentley.pdf)
(section 2.3)

"We ran these simulation models on either interactive workstations or compute
servers – initially, these were legacy IBM RS6Ks running AIX, but over the
course of the project we transitioned to using mostly Pentium® III based
systems running Linux. The full-chip model ran at speeds ranging from 0.5-0.6
Hz on the oldest RS6K machines to 3-5 Hz on the Pentium® III based systems (we
have recently started to deploy Pentium® 4 based systems into our computing
pool and are seeing full-chip SRTL model simulation speeds of around 15 Hz on
these machines)"

You can see that the P6-based processors (PIII) were a lot faster than the
RS6Ks, and the Wmt version (P4) was faster still. That program is csim, a
program that does a really dumb translation of the SRTL model of the chip
(think Verilog) to C code that then gets compiled with GCC (the Intel compiler
choked). That code was huge, and it had loops with 2M basic blocks. It totally
didn't fit in any processor's instruction cache. Most processors assume they
are running from the instruction cache and stall when reading from memory.
Since running csim was one of the test cases we used when evaluating
performance, the frontend was designed to execute directly from memory. The
frontend would pipeline cacheline fetches from memory, which the decoders
would unpack in parallel. It could execute at the memory read bandwidth. This
was improved further on Wmt. This behavior probably helps some other real
programs now, but at the time this was the only case we saw where it really
mattered.

The end of the section is unrelated but fun:

"By tapeout we were averaging 5-6 billion cycles per week and had accumulated
over 200 billion (to be precise, 2.384 * 10^11) SRTL simulation cycles of all
types. This may sound like a lot, but to put it into perspective, it is
roughly equivalent to 2 minutes on a single 1 GHz CPU!"

Games were important, but at the time most of the performance came from the
graphics card. In recent years Intel has improved the on-chip graphics and
offloaded some of the 3D work to the processor using these vector extensions.
That is to reclaim the money going to the graphics card companies.

------
xenadu02
tl;dr: AES uses branches and is not optimized for vectorization. Other (newer)
algorithms are designed with branchless vectorization in mind, which makes
specialized hardware instructions unnecessary.

------
Philipp__
And what if games are better (or worse) optimised for a certain type of
hardware, so that you buy a new Intel CPU every 3 years? The point is, what if
some games are badly optimised and run badly on certain hardware on purpose?
Maybe it sounds like a conspiracy theory. But look: CPUs are stalling, Intel
wants to sell its things every year, so what if they come to developers and
say "Look, make your game run 10% better on our latest hardware and we'll give
you money"?

~~~
lagadu
[citation needed]

------
DINKDINK
Off-topic: That's a great favicon

