
CPU Usage Differences After Applying Meltdown Patch at Epic Games - _jcwu
https://www.epicgames.com/fortnite/forums/news/announcements/132642-epic-services-stability-update
======
Abishek_Muthian
If it's helpful,

Our Node.js, MongoDB, and Python servers, all with significant network traffic,
showed no measurable impact after the KPTI patches on Amazon Linux, on
T2.medium (burstable), M4.large, and T2.large (burstable) respectively.

Our impact is less than the figures suggested by Red Hat's advisory -
[https://access.redhat.com/articles/3307751](https://access.redhat.com/articles/3307751)

~~~
bartread
Would you be able to post some ballpark figures, please?

~~~
Abishek_Muthian
Actually, there is no measurable difference. Our architecture is completely
stateless, and when compared with an equivalent load there's no difference at
all. I guess the default network throughput bottleneck is itself higher than
the random memory-cache bottleneck in my case. There was no impact on latency
either.

Fearing that an automatic update by AWS would cause performance issues, we
rushed to update all our servers, and gladly there wasn't any performance
impact.

Not everyone has been as fortunate, though; there are still a lot of variables
regarding the performance impact of KPTI. This user running a PHP server
mentions a 50% performance hit -
[https://twitter.com/timgostony/status/948682862844248065](https://twitter.com/timgostony/status/948682862844248065)
and of course the OP's issue is being covered here.

~~~
justin66
> Actually, there is not any measurable difference.

That doesn't strike you as a little odd?

~~~
discoursism
It's the same result as Google apparently observed across their fleet. It
doesn't seem like it should be _that_ odd.

------
contrarian_
Pretty much what I predicted here:
[https://news.ycombinator.com/item?id=16054674](https://news.ycombinator.com/item?id=16054674)

> Sounds like servers handling lots of small UDP packets would be hit pretty
> hard.

~~~
_wmd
There's potential for a little rearchitecting to help, at least in the case of
UDP:

    
    
        NAME
               sendmmsg - send multiple messages on a socket
    
        SYNOPSIS
               #define _GNU_SOURCE          /* See feature_test_macros(7) */
               #include <sys/socket.h>
    
               int sendmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
                            unsigned int flags);

~~~
jacquesm
That only works if messages are independent of answers received and are all
known at the same point in time. In most games this typically would not be the
case, you'd use a message to cram as much state change into it as is known to
keep the game moving fluidly. Packing more than one such message together
would serve no purpose.

~~~
_wmd
I'd have presumed otherwise, but I'm not sure you're understanding the API
correctly... it's not about sending multiple messages to the same destination,
but to multiple destinations in a single call. The msg_hdr struct has room for
specifying the target address.

From userspace's perspective, even if the same data isn't being broadcast to
every client, just building up a big array (perhaps while looping over the
input from recvmmsg()!) and spitting it out at once would have the same
semantics as calling sendmsg() immediately on each, etc.

~~~
jacquesm
Yes, I understand the API correctly. Having implemented it once I think I have
the basics down ;) But that said I was assuming that this would be in the
context of multiple UDP messages sent from a game client to a game server.

~~~
Doxin
The bottleneck for games generally isn't on the client side; the server has
much more network traffic to handle. So even if this performance fix only
works server-side, that might be enough.

------
Darthy
The Meltdown attack requires an attacker to have a piece of code executed on
your server. Epic's servers are used for login, where people send you data,
and for game logic, where people also just send you data like "player x moved
his avatar here, player y shoots etc". If all the server does is execute the
code which Epic wrote themselves and already trust, why would it need to apply
the Meltdown patch?

~~~
drawnwren
This is a horrible approach to security. If you only secure against attacks
you expect, you're gonna have a bad time.

~~~
toomuchtodo
Security is about risk mitigation. You cannot derisk entirely, so you make
tradeoffs. Without knowing all of the parameters, it’s disingenuous to say
it’s a horrible approach to security.

The most secure computer is powered off, enclosed in concrete 6 feet below the
surface of the earth. It is not very useful though.

~~~
em3rgent0rdr
Reminds me of Battlestar Galactica. The humans were so (wisely) fearful of the
Cylons that their ship computers were not networked in any way. For their
risk/reward trade-off, the benefit of having computer networks was not worth
the risk of the Cylons being able to compromise the entire ship. All
communication was either verbal or done via fax printed to paper, and so had
to go through human intermediaries.

~~~
toomuchtodo
Good catch!

------
mmaunder
Post. More. Benchmarks. I'll do the same once I have them. More data on this
across platforms and apps is incredibly helpful for all.

~~~
chrisseaton
Real data like this post is more useful than more benchmarks.

~~~
jacquesm
They _are_ benchmarks, just not microbenchmarks. Actual application
performance is the gold standard for benchmarks.

[https://en.wikipedia.org/wiki/Benchmark_(computing)#Types_of...](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Types_of_benchmark)

~~~
wolfgke
> They are benchmarks, just not microbenchmarks. Actual application
> performance is the gold standard for benchmarks.

This depends on the intended purpose of the benchmark. If you want to evaluate
the performance of a whole system, sure. But one often benchmarks to find the
parts of the system that are critical to performance, so that you can do
further optimization or change the software architecture to avoid those
performance hotspots. For that case, benchmarks that allow an easy,
fine-grained interpretation of the data are surely better.

------
dannyw
Surprise, surprise, Intel’s Meltdown patch has significant and serious
performance impacts for specific kinds of workloads.

All because Intel decided to “optimise” by checking for permissions after
speculative retirement, instead of before like AMD.

~~~
imjasonmiller
If, as you say, AMD is not affected by Meltdown unlike Intel, will this
significantly change the server market? Excuse the pun, but would this make
e.g. AMD EPYC a lot more attractive for such data centers?

~~~
arien
Aren't they still vulnerable to Spectre?

~~~
Tuna-Fish
Yes to the bounds check violation version, so far no to the BTB poisoning
version. The bounds check version only works inside a single process, so is
only relevant when you run untrusted code within the same process as some
private data (such as JS in a web browser).

AMD claims that they believe the way their branch predictor works effectively
makes BTB poisoning unusable, but there is no actual proof, and their
statement regarding it was much more wishy-washy than the one about Meltdown
(which they specifically state they are completely immune to).

~~~
em3rgent0rdr
Regarding branch predictor poisoning: if AMD doesn't update the branch
predictor based on the evaluation of branches that occurred along a
speculative path, then the Spectre exploit (which, from my understanding,
trains the branch predictor using speculative code) won't be able to work.

------
jitbit
Thought I'd share our graph too: [https://www.jitbit.com/alexblog/270-cpu-
usage-stats-after-pa...](https://www.jitbit.com/alexblog/270-cpu-usage-stats-
after-patching-for-meltdown/)

Seeing an almost 40% increase in CPU load.

~~~
simik
Do you mean 40 percentage points? In other words, tripling?

~~~
lucb1e
Looking at the graph: probably. It was already increasing (from 16-19%, one
outlier to 27%, first 8 datapoints) to around 25-30% (the last 6 before the
big jump). Then a big jump to 62-85% (N=20) with most around 75% (N≃13). The
last bit is around 60-72% (N=12) and the very last datapoint is below 40%
again but that is half outside the graph (cherry-picking datapoints?).

So to summarize, from 25% to 65%, or 2.6× the original.

Note that this is all from eyeballing that graph and estimating what
percentage each datapoint is at, since the scale has an interval of 20
percentage points and no minor gridlines or anything.

------
jedisct1
The Meltdown patch also introduces a serious performance hit for DNS servers
and resolvers.

Before giving figures, I want to run the same tests on a more recent CPU, but
my current benchmarks are not great to say the least.

~~~
pixl97
Is this a program -> CPU interaction that is slowing things down, or is it a
CPU -> network interaction?

I'm wondering if there are classes of network drivers that are having a much
larger effect on performance. Network cards these days can do many things to
improve performance, like TCP/UDP offloading, and because of that their
drivers are very complex; I'm going to assume that there will be Meltdown
fallout because of this.

------
erdbeerkuchen
The image didn't show up for me, here is a direct link for anyone with the
same problem:
[https://lh6.googleusercontent.com/MwzsHRXQLVbmJ3pusNuGwn0ZQV...](https://lh6.googleusercontent.com/MwzsHRXQLVbmJ3pusNuGwn0ZQVjo9h8nRJHJhIo4d3XFqbvUYCj8EPq5jV7zeVEEcHAkraNBesbbNDW_UAlIjvw-
hZBd80rKt7ZYl35nBIcfCCVyRvW5V7M7KVejv9tvVBHfgSKr)

~~~
fermienrico
Can someone explain what the 1, 2 and 3 series in the plot mean? What are we
looking at? Why did one of them spike all of a sudden?

~~~
Groxx
They patched a single host.

----------

Text of the forum post:

Attention Fortnite community,

We wanted to provide a bit more context for the most recent login issues and
service instability. All of our cloud services are affected by updates
required to mitigate the Meltdown vulnerability. We heavily rely on cloud
services to run our back-end and we may experience further service issues due
to ongoing updates.

Here is a link to an article[1] which describes the issue in depth.

The following chart shows the significant impact on CPU usage of one of our
back-end services after a host was patched to address the Meltdown
vulnerability.

[the image]

Unexpected issues may occur with our services over the next week as the cloud
services we use are updated. We are working with our cloud service providers
to prevent further issues and will do everything we can to mitigate and
resolve any issues that arise as quickly as possible. Thank you all for
understanding. Follow our twitter @FortniteGame for any future updates
regarding this issue.

Epic suggests following security best practices by always staying up to date
with latest patches. General Recommendations for Computer Security[2]

We will continue to update this thread with similar information as it comes to
us.

[1]: [https://spectreattack.com/](https://spectreattack.com/)

[2]: [https://www.howtogeek.com/173478/10-important-computer-
secur...](https://www.howtogeek.com/173478/10-important-computer-security-
practices-you-should-follow/)

------
viraptor
I know they don't have to share the details, but the "patched" part is not
really clear. Did they update to a new image, a more recent kernel, or
anything else? Much like the Redis post linked on HN before, we don't know if
the impact is because of the "pti turned off/on" change, or whether there are
more moving parts involved.

~~~
arien
If I recall correctly from older posts (I played Fortnite for a while),
they're using AWS.

~~~
viraptor
That's still not answering many questions. I hope they publish a full analysis
at some point.

~~~
lawrenceyan
Since they're using default provisioned EC2 instances, it's likely that the
developers don't necessarily even fully understand their performance
degradation. They just expect the service that they pay for to work properly.

~~~
viraptor
It's true, but it's not what I meant. They wrote "after a host was patched".
This is ambiguous. Do they mean the host as in the instance, or the host as in
the AWS host machine? Did they just reboot to get onto a new/updated VM host,
or did they rebuild to include the PTI fixes as well? Did they upgrade
anything, or did everything else stay on the same version?

------
lathiat
I’d encourage anyone seeing super huge hits to make sure they are not using
paravirt (particularly on Amazon). Since that needs mitigation at the
virtualization level, the impact seems very large.

------
lazyjones
For those who have looked at the patches in detail: do they treat
older-generation CPUs differently than 7th gen? The paper authors who wrote
the KAISER patch expected much worse performance on older CPUs due to
implementation issues...

Unfortunately, Epic Games don't provide any details about their CPUs AFAICT.

~~~
cthalupa
I believe PCID is supposed to help mitigate the performance impact from the
KPTI changes.

Both the hardware and the kernel have to support PCID.

------
forgotpw2018
Huge real-world performance impact; they didn't say how much, but it looks
like close to 100%.

I smell a class action lawsuit coming.

~~~
sillysaurus3
Hopefully not. Lawsuits are a fine way to stifle innovation. Imagine how hard
it would be to push through any idea at Intel.

No one saw this coming. Things happen. It's impossible to predict every
contingency.

They acted in good faith.

Also, heh, users are funny: "all other games i have work fin by the way so
there must be a problem whit fortnite."

~~~
serf
>They acted in good faith.

That's new for a company, and it's not indicated by their PR spin right now.

What I think happened : a company produced a product with a problem. Probably
not out of malice, but ignorance.

One of two things happened after, which can kill the 'good faith' argument ;
the problem was found internally and hushed, or the problem was found
externally and minimized to reduce financial burden arising from fixing the
problem and the PR related.

We have no way of knowing how well it was known about internally, but we can
_all_ see the PR going on from Intel right now, and I hope i'm not the only
one who reads into those press releases to establish intent.

~~~
djsumdog
Well, this type of attack has been theoretical for years. The Project Zero
writeup referenced some papers from the mid-2000s that talked about it. But
the implementation, even today, isn't exactly trivial.

Modern processors are insanely complex systems: branch prediction,
out-of-order execution, hardware virtual memory management, hardware
virtualization, etc. Not to mention that these are side-channel attacks. It's
not a direct vulnerability; it requires executing some code and measuring
timing very precisely, like taking an oscilloscope to a very expensive safe.

Of course Intel is going to be spinning this however they can for damage
control. That's what PR departments do. I still doubt engineers at Intel
really thought this attack was plausible, or else they wouldn't have been
engineering chips this way for the past decade.

~~~
wyager
> Modern processors are insanely complex systems.

And until we align their market incentives properly, silicon vendors are going
to continue to ignore this fact when it comes to verification. Intel is
especially bad here; they’ve had an unreasonable number of hardware bugs in
recent years.

------
simooooo
Wow that's a huge jump. Scared for my web servers now

~~~
lsd5you
Not entirely sure why we need to update/protect most servers, since generally
they won't be running untrusted code, right?

~~~
rocqua
The replies are missing something from the original article.

They are running on the cloud, and Meltdown / spectre means that exploits can
escape a VM. This means you don't just need to trust your VM, but also any VMs
you are sharing the hardware with.

~~~
Asdfbla
I suppose if a Cloud provider could ensure that all your VM instances run on
the same host and no other VM is allowed there then the issue would be a bit
mitigated.

Although this restriction on how the cloud provider is allowed to schedule
your VMs probably would somewhat defeat the point of cloud hosting in the
first place.

Out of curiosity, how many VM/container instances usually run on a physical
host at any given time (for your typical cloud computing provider)?

~~~
Fargren
With AWS, and I assume most other cloud computing providers, you can pay extra
for your instances to run on a host without anyone else's VMs. You probably
should be doing this for any servers where you handle sensitive data, but it
is a place where many will be cutting corners.

~~~
mrep
You don't really even have to pay extra. Just use the biggest instance size
and you are guaranteed to be isolated, because there is no room for anyone
else.

Granted, that only works for workloads that are spread across enough small
instances.

------
jacksmith21006
I'd really like to see some benchmarks from different cloud providers before
and after the patches are applied, to see if there is a material difference.

The one I'm most interested in is Google versus Amazon.

------
herf
The graph says pretty clearly that the patch makes things 60-70% slower, or
2.5-3x the hardware costs! I wonder which CPU family this is?

~~~
jug
This surprised me a lot. I thought 30-50% would be the worst case, and then
only with additive effects from both the Spectre and Meltdown fixes. Not sure
how it can be this bad. I imagine it could get even worse if you are running
in a virtualized environment where the host server is affected in turn, but I
figure that wouldn't show in a CPU graph like this...

~~~
Tuna-Fish
The kernel address-space isolation makes syscalls much more expensive than
they were. A toy program that just repeatedly calls the cheapest syscall in
the kernel would lose much more than half its speed, but that's not what's
reported, because no one actually needs to run that workload. Epic seems to
have been particularly unlucky in how KPTI impacts them.

------
j1vms
Maybe off-topic: Is formal verification viable anywhere in CPU logic design?
Also, could any existing "CPU static analyzers" have caught the issue that
caused Meltdown?

Edit: It looks like the answer to the first is a definite yes.

~~~
andrewaylett
Formal verification of what?

You can only verify properties you've thought of, and no-one conceived of this
particular 'feature' causing issues like this until now. So I don't think
formal verification would have helped: if anyone was in a position to realise
the issue was worth verifying, they'd have been able to raise it without
formal verification too.

~~~
serf
>You can only verify properties you've thought of, and no-one conceived of
this particular 'feature' causing issues like this until now.

I've read for years about supposed insecurities with branch prediction -- it
just wasn't shown practically. To say that it wasn't conceived of is a little
off.

------
ZenoArrow
Considering the performance impact, I wonder how console manufacturers are
going to handle this (assuming that the processors they used are vulnerable to
Spectre/Meltdown).

~~~
rcarmo
Console manufacturers don't run unsigned code, so I expect they'll just sit
still until the next hardware refresh.

~~~
ZenoArrow
Modern consoles now come with web browsers, and the researchers proved that
the attacks could be performed via web browsers, did they not?

~~~
lrem
Next patch: we enhanced your security by disabling execution of JavaScript
from untrusted domains. In unrelated news, we now block ads!

~~~
ZenoArrow
If they are going to block JavaScript, they need to do it for all domains, not
just untrusted domains. For example, if I can MITM my own website activity
(which I can, by having a device that sits between a router and a console),
then I can change the JavaScript coming from trusted domains.

In other words, if I visit a site like Google or Facebook on an affected
device, I can change the JavaScript that is run, and make it still appear like
it came from a trusted domain.

~~~
Thiez
You can't just mitm when the JavaScript is served over https, unless you also
can install your own certificates on the machine you're trying to hack. But
when you have permissions to install new certificates you probably don't need
the hack in the first place.

------
merb
If people still used good old owned hardware without any virtualization,
Spectre and Meltdown would not be as scary as they are on server hardware,
which means that only clients would be affected. But since everything runs in
the cloud, we basically need to update the whole world.

~~~
kazagistar
It's still pretty scary to have computers where all of memory is readable from
every process, even if you own them.

~~~
merb
Well, chances are high that if a process goes rogue, you have more problems
than just "memory is readable from every process".

(Of course that does not apply to clients, where code can run in a JIT
(JavaScript), or to software that communicates with the internet and runs on a
remote machine.) Most servers probably should only run trusted code (of course
that is mostly never the case, because no company evaluates every process
they're running), but chances are high that if someone uses Linux and GNU
stuff, most shady stuff gets caught. (If not, some people could do a lot of
bad stuff; consider a misbehaving systemd, nginx/apache, or database, which
could basically do a lot of harm.)

------
jacksmith21006
Really like to see similar done on the Google Cloud infrastructure and see if
there is any difference.

------
mtgx
Looks like it may be time for Epic Games to consider switching to AMD
Ryzen/EPYC.

~~~
dijit
I have been looking at AMD/Epyc CPUs recently.

(Because making NUMA-aware C++ code is hard, and AMD EPYC is a single socket
on a server with 4 very closely knit NUMA zones, so non-NUMA code will run
better on that vs. Intel.)

Unfortunately, there's no commodity server from HP/Dell available yet, but I
hear one is on the way on the Dell side.

~~~
snuxoll
HP has paper launched the DL385 G10 with EPYC, I'm not sure if it's fully
available through sales channels or not. Supermicro also has had EPYC systems
available for a while - though that won't do you any good at all if you need a
big name OEM for "nobody got fired for buying Cisco" reasons.

------
brendangregg
What's the syscall/sec per CPU rate for this workload? My guess: over 1M.

------
stefantalpalaru
It's time to move to userspace network drivers like
[https://github.com/snabbco/snabb/](https://github.com/snabbco/snabb/)

------
aceoflolo
>We wanted to provide a bit more context for the most recent login issues and
service instability. All of our cloud services are affected by updates
required to mitigate the Meltdown vulnerability. We heavily rely on cloud
services to run our back-end and we may experience further service issues due
to ongoing updates.

So they are saying "you can't log in because we're too cheap to rent more
servers now that this patch has increased CPU usage"

~~~
transpostmeta
Resolving scalability issues is not as simple as "rent more servers".

~~~
aceoflolo
For a CPU-bound problem it mostly is as simple as that, yes

~~~
bhouston
Only if the algorithm can be split across multiple machines easily.

Sometimes people assume that you can use local shared memory or something
between threads in order to synchronize state. You figure out how many
individuals can be on a server at once and then ensure that you can handle
that load on a specific machine.

I've seen this type of stuff for game state before, because they need to keep
everyone in a specific game domain (level or city, depending on the type of
game) synchronized, nearly in real time. It can be hard to pull this off
across different machines without introducing significant latency; Redis or
DBs are too slow for an FPS shooter.

Not saying this is the case but I can see it could be something like that.

~~~
tetha
Yup. When I was in a team working on distributed game servers, this forced us
to shard games instead of distributing individual games.

Terminology wise, if a game session was fully distributed across multiple
instances, each server could accept traffic for this game session. Think
elasticsearch - each node can answer searches for any index in the cluster.

If you shard your games, you just put all games with an even ID on box 1 and
all games with an odd ID on box 2. If box 1 dies, all games on that box
disappear. And then you usually end up with a lobby server as an initial
connection and redirect to the actual game server in the background.

This is a very simple architecture. It's easy to develop for this
architecture, because you don't need to worry about complex clustering issues
- exactly 1 client talks to exactly 1 server and it doesn't matter if there's
50 other servers answering other clients.

This is also very nice to scale. Most server side code for games tends to have
very predictable resource consumption, because it's running a pretty
predictable simulation loop on a pretty predictable and bounded data set.
Especially from this perspective, I can see why the Epic guys are bugged. It's
not pretty to put a factor of 2 into those calculations.

------
ChildOfChaos
I have turned windows updates off so I don’t have to deal with this crap.
Screw you intel + Microsoft.

~~~
bartread
Whilst I sympathise with your frustration, at least with Intel, I can't help
feeling like you might be storing up bigger problems for yourself with this
course of action.

~~~
ChildOfChaos
I get what you are saying, but right now there are no known exploits, and with
so much patching happening, will there ever be? These things are always
overblown in the media, and the reality is that very little damage happens to
the average user.

It's servers, perhaps, that are most at risk.

Also I don’t do much on my Windows, mostly gaming, I run an iMac and boot into
it only for certain tasks, so it’s extremely unlikely I will ever have an
issue.

Any performance hit, even if small is just not worth updating for to me.

~~~
kazagistar
What do you mean, "no known exploits"? The authors of the paper have an
exploit that reads arbitrary system memory from a browser. And even after the
Meltdown patches, the Spectre "fixes" we have seen are only partial
mitigations, and still potentially allow reading of passwords and third-party
cookies. But I guess if you want to wait till it's too late...

