
Postmortem of Service Outage at 3.4M Concurrent Users - johnnyapol
https://www.epicgames.com/fortnite/en-US/news/postmortem-of-service-outage-at-3-4m-ccu
======
mmanfrin
When Bluehole take down PUBG for 5 hours, there's no communication outside of
two tweets. When Epic see degraded performance for less than 2 hours, they
give a postmortem.

There's a difference in the level of respect each company shows its
customers. I play PUBG a lot, but I want to see Epic win in the long run.

~~~
cjsawyer
There isn’t even a custom message for “scheduled server maintenance” vs “our
servers are busy” (which I can get multiple times a night anyway). You NEED to
check the tweets to find out which it is. I love PUBG, I just wish it was made
by more competent developers.

~~~
mmanfrin
It's astounding to me how the last couple patches have made performance worse
and introduced new, gamebreaking bugs. The last patch took everything down for
5 hours. The patch before tanked FPS by like 40%. The patch before caused
2/3rds of games to spawn players below the match or simply crash on load.

They don't do QA, and their 'deploy process' takes everything down for hours.
It's embarrassing.

~~~
smnscu
> They don't do QA, and their 'deploy process' takes everything down for
> hours.

This reminded me of this classic:
[https://engineering.riotgames.com/news/automated-testing-lea...](https://engineering.riotgames.com/news/automated-testing-league-legends)

~~~
reificator
I'm putting it in a to-read list after 2 minutes of skimming, but that looks
like an excellent article.

Sounds like Riot have a good head on their shoulders about how to handle
testing games.

If you come from a different domain, it's important to realize that testing
games is much more difficult than testing a lot of other types of software.
And much like other kinds of software, when it's done well it looks easy.

~~~
eropple
I have no love for Riot-the-games-company, but there are pretty good reasons
why Riot-the-software-company is so well-respected. Bear in mind that stuff
like that blog post is just the tip of the iceberg. They basically
took the notion of Chef best practices _away_ from Opscode (now Chef, Inc.)
and completely redefined them with environment cookbooks and Berkshelf. Things
in devops have moved on somewhat, as they always do--but they were, and I
assume still very much are, serious about building good processes and the
tools necessary to make them better.

------
jasonjayr
> We run Fortnite’s dedicated game servers primarily on thousands of
> c4.8xlarge AWS instances, which scale up and down with our daily peak of
> players.

That's between $572,000 (500 instances, 30 days) - $2,863,800 (2500 instances,
30 days), per month at current prices, and seems like it's only for one aspect
of their infrastructure.

That seems .... excessive? Is that a typical spend with a game server system
like this? That does seem to suggest that once this becomes less than
profitable, it's all going away ...
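
The arithmetic behind those two figures can be reproduced in a few lines. This is only a sketch: the flat on-demand rate of $1.591 per instance-hour is an assumption, back-derived from the upper figure ($2,863,800 / (2500 x 24 x 30)), and `monthly_cost` is a made-up helper name.

```python
# Hypothetical helper: monthly on-demand cost for a fleet of identical instances.
# HOURLY_RATE_USD is an assumption, back-derived from the $2,863,800 figure.
HOURLY_RATE_USD = 1.591


def monthly_cost(instances: int, days: int = 30, rate: float = HOURLY_RATE_USD) -> float:
    """Cost of running `instances` machines 24 hours a day for `days` days."""
    return instances * rate * 24 * days


if __name__ == "__main__":
    for n in (500, 2500):
        print(f"{n} instances: ${monthly_cost(n):,.0f}")
```

At that rate 500 instances come to $572,760 and 2500 instances to $2,863,800, which matches the quoted range after rounding. It ignores reservations, autoscaling below peak, and any negotiated discount.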

~~~
013a
c4.8xlarges are the smallest c4 instance that guarantees 10G networking
performance.

Those costs can easily be cut in half with reservations. It's likely there's a
lower-bound of the number of reserved instances they use as a baseline
performance guarantee, then they use on-demand to scale above that.

No one at their scale actually pays the list price.

Basic math: There are ~100 players in each game. At a 30 tickrate, that's 3000
RPS per game minimum. Each of those requests likely involves a number of 3D
math calculations, including hit detection, collision detection, real-time
cheat detection, etc. Those updates then need to batch back to the players at
30 tick. All of this needs to happen with as little eventual consistency as
possible; a difference of milliseconds degrades the player's confidence that
the server is correctly calculating what is happening in-game.

Point being: multiplayer game programming is an entirely different beast than
normal web programming. The same rules don't apply. It's an n^2 problem:
every additional user in one "lobby" makes total resource utilization grow
quadratically, not linearly, because you need to update every other user's game-state with
the actions of that new player. Additionally, Battle Royale style games are
the most demanding multiplayer games ever created. Games like WoW have way
more players per realm, but servers only need to worry about the interactions
of a select few in your surrounding area, _and_ it doesn't have as stringent
real-time requirements. Games like CoD only load in 5-20 players.
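
A rough way to see the n^2 shape of that math. This sketch assumes every player sends one update per tick and the server naively relays each player's state to every other player, with no interest management or batching; the function names are made up for illustration.

```python
# Back-of-the-envelope message rates for one match.
# Assumption: one update per player per tick, relayed to every other player.


def inbound_per_second(players: int, tick_rate: int) -> int:
    """Client-to-server updates the server must process each second."""
    return players * tick_rate


def outbound_per_second(players: int, tick_rate: int) -> int:
    """Per-player state updates the server sends each second (naive fan-out)."""
    return players * (players - 1) * tick_rate


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, inbound_per_second(n, 30), outbound_per_second(n, 30))
```

With 100 players at 30 ticks, inbound is the 3000 per second quoted above, while the naive fan-out is 297,000 per-player updates per second. Doubling the lobby to 200 players roughly quadruples the fan-out, which is the quadratic growth in question.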

~~~
vvanders
FYI most game network servers tick the network at 10-12Hz (although the
internal sim may run higher).

We generally didn't go higher than 10Hz since you'd start saturating lower end
connections and any well written game will be good at reconciling state via
dead-reckoning. You have to handle 100ms+ spikes anyway so it doesn't make
sense to run the network super-fast.
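
For anyone who hasn't seen dead-reckoning, here is a minimal sketch of the idea: between network updates the client extrapolates each entity from its last known position and velocity. The `Snapshot` layout and the constant-velocity assumption are illustrative, not taken from any particular engine.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Snapshot:
    """Last state received from the server for one entity (hypothetical layout)."""
    time: float                     # seconds
    position: Tuple[float, float]   # world units
    velocity: Tuple[float, float]   # world units per second


def dead_reckon(last: Snapshot, now: float) -> Tuple[float, float]:
    """Extrapolate position, assuming velocity stayed constant since `last`."""
    dt = now - last.time
    return (
        last.position[0] + last.velocity[0] * dt,
        last.position[1] + last.velocity[1] * dt,
    )
```

When the next real snapshot arrives, the client blends from the extrapolated position toward the authoritative one, which is why a dropped packet or a 100ms spike doesn't have to be visible.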

~~~
AlphaSite
Most competitive games go well above 10Hz ticks. OW does it, CSGO does it (I
don't really know the tick rates of other games, so no comment there, but
there are important cases where it matters).

OW just has a restricted rate option for low bandwidth connections (added when
they went from 23 -> 61 Hz).

~~~
vvanders
So that might be the case for LAN games, generally internet tends to stay
lower. Valve for instance recommends 13Hz[1] for Source(CSGO):

> // 60 for updaterate is LAN ONLY use 13 for internet

> // 20 is default but will cut the maxplayers you can handle in 1/2

> // for SRCDS Servers use 30 - you might be able to use 20

> // sv_maxupdaterate 60

> sv_maxupdaterate 13

Like I said, depending on how good your potential-vis code is you may be able
to go higher but you risk saturating your client links and most game netcode
doesn't handle congestion nicely. My info is a little out of date (2-3 years)
but back then we had to keep data rates below 20kb/sec (and ideally 10kb/sec)
if you wanted low jitter across the link.

[1]
[https://support.steampowered.com/kb_article.php?ref=5386-HMJ...](https://support.steampowered.com/kb_article.php?ref=5386-HMJI-5162)

~~~
gsich
This information is very much outdated. (Windows XP is referenced in it)

CSGO matchmaking servers (run at 64 ticks) force you to use cl_updaterate 64.

Even in CS 1.6 days, most leagues and servers enforced updaterate 100 and rate
20000. Nearly every player could handle it. And those who couldn't, well, bad
luck for them. Playing with 100ms+ is no fun for anybody.

~~~
vvanders
Like I said in the root comment, generally speaking 10-12Hz is really common
because you need to handle dropped packets cleanly.

I don't know about the CS 1.6 days, I used to play a ton of early cs and was
an admin on one of the most popular Frontline Force servers. We ran tick rate
of 12 if I recall correctly.

~~~
gsich
It is mainly because PUBG's network code sucks. You can see the effects here
[0]

Unless you had dial-up users, a tickrate of 12 is very bad and not fun. I
don't know if you had hardware restrictions, but tickrate is independent of
updaterate that is sent to the clients. By enforcing a low tickrate you
maintain a very old snapshot of the world on your server (83 ms), which makes
clients interpolate more.

[0]
[https://www.youtube.com/watch?v=u0dWDFDUF8s](https://www.youtube.com/watch?v=u0dWDFDUF8s)
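
The 83 ms figure is just the gap between two server ticks. A tiny sketch of that conversion (the function name is made up for illustration):

```python
def tick_interval_ms(tick_rate_hz: float) -> float:
    """Time between two server simulation steps, in milliseconds."""
    return 1000.0 / tick_rate_hz


for rate in (12, 30, 64):
    print(f"{rate:>3} Hz -> {tick_interval_ms(rate):.1f} ms between snapshots")
```

At tickrate 12 the world state a client receives can be up to about 83 ms old before any network latency is added; at 64 it is under 16 ms.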

------
matt_s
A lot of players of massive online games tend to get hand-wavey when there are
problems and act like "dude just get more servers" is the answer.

This clearly shows how complex a system has to be to handle 3.4
million concurrent, connected users. I think the connected part compounds any
scale problems you have since it is implied they are connected to each other.

~~~
always_good
Agreed. Gamers also are among the most entitled users I've ever had to deal
with.

I'll never again accept donations for a game I've made available for free,
given the number of problems one person caused me after they gave me $5.

~~~
jjjensen90
I never realized the level of feelings of entitlement and melodrama from
gamers until I ran a passion project to make a private server for a dead game.
We don't even accept donations and people still treat the dev team like dirt,
and will often write nasty manifestos about how we are ruining their lives or
we don't know what we are doing and if only we would listen to their every
whim, etc...

It's one of those things, though, that reveals the human inclination to
remember negative events more even among a sea of positive events...
Frequently as a game developer you will get loads of positive and helpful
feedback but the small minority that want to shit on you for no good reason
stick in the front of your mind.

~~~
dleslie
Out of curiosity, what's the project?

~~~
jjjensen90
London 2038 [https://london2038.com/](https://london2038.com/) It's a revival
of the MMOARPG Hellgate: London which released in 2007, closed in 2009, was
revived as a pay-to-win Korean reincarnation in 2011 and shut down again in
2015. Our version matches the version of the game from 2009.

~~~
dleslie
Hey, I've heard of this! Good work. :D

One thing I notice from your site is that you provide _many_ ways of giving
feedback. The Discord widget, the 'Contact Us' menu item, the FAQ is peppered
with communication links, there's a Forum, Wiki, et al. Basically, I'm seeing
an incredibly low barrier to entry for anyone wanting to express themselves to
the developers.

Have you considered paring that back? IE, ensuring all communication channels
are moderated, and limiting it to only one or two services?

~~~
jjjensen90
Great feedback, I have actually done a lot to pare it down since we launched
open alpha last Halloween. Typically people know now to make forum posts for
long-term discussions and to come chat in Discord for quick back and forth.
Nobody has actually used the contact form on the site, now that you mention
it.

My team of community manager volunteers is pretty active with moderation in-
game and out-of-game so it's gotten pretty good. But still, I think I need to
isolate myself and the other developers from the public, because it can be
really demoralizing when you spend 30-40 hours of free time per week on a side
project for free and you hear some terrible things.

~~~
dleslie
Sounds like you've got a sound plan, and I wholeheartedly endorse placing a
barrier and filter between you and your fellow developers and the community at
large. :)

------
pkilgore
Love this because it shows two things 1) competent people are handling
problems and 2) they actually care.

A whole lot better than spoon feeding customers bullshit for weeks while
hamstringing your product rather than investing in it (looks at EA, mumbles
about SimCity).

~~~
fhood
Let's not labor under the impression that Epic invested resources in fixing
these issues purely out of the goodness of their hearts.

The success or failure of the recent Sim City game was hardly EA's number one
concern. Fortnite, however, is probably extremely important to the folks at
Epic.

I'm not defending EA here, but it was definitely in Epic's best interests not
to piss off Fortnite's player base given that there are some other, ahem,
similar titles on the market right now.

~~~
gameswithgo
Why do you think a game that sells for ~$50 was not important to EA, but a
game that sells for $0 is extremely important to Epic?

Epic is a pretty big operation too.

~~~
chickenfries
Because Free For All games are really popular right now, and they want their
engine (Unreal) to be a competitive choice for these and other massively
multiplayer games.

~~~
282883392
Unreal is already an extremely competitive choice for multiplayer games (read:
the best choice). They are putting resources to solving these problems because
they are making millions of dollars from "free" players.

------
tweenagedream
Disclaimer: I work on Google Cloud so I will be speaking from the bias of
knowing those products.

They talk a lot about reducing operating complexity and scaling their
infrastructure. I wonder what the cost of their current infrastructure + the
staff to maintain it might be vs the managed solutions that cloud providers
offer now.

For example, using Cloud Datastore, Spanner, or Bigtable as a persistence
layer: these managed services can definitely scale to the current need and
I've seen them go much higher as well.

For logs ingestion and analysis, BigQuery can be a very powerful tool as
well, and with streaming inserts that data can be queried in near real time.
For things that are less urgent, there are batch queries. For other things,
Dataflow can help with streaming workloads.

I think one of the problems they alluded to though was that at the moment
they're on a single provider, and what they're looking for is a multi-cloud
strategy, which totally makes sense. A lot of the above products create some
kind of lock-in, with some exceptions, like using HBase as an interface to
Bigtable or Beam as an interface to Dataflow. Though I don't know what the
other providers offer that may have these same interfaces.

Another option is Kubernetes, which I believe all providers are pretty
strongly embracing. Having most of the supporting infrastructure be brought up
with a few kubectl commands could help them scale across several cloud
providers quickly.

~~~
ShakataGaNai
All of the managed products from the different cloud providers are, more or
less, great. The problem is they are black boxes. When something goes wrong
you're completely at the whim of support. Ever call Dell/Comcast support and
want to tear your hair out? Yea... it's like that except neither AWS nor
Google have phone numbers to call.

The other problem is that most of these things aren't easy to migrate to. AWS
RDS is much easier because it's just a managed version of whatever you're
already using. But Cloud Spanner? DynamoDB? You have to completely re-architect your application.
Then you have to move your application, and data, to this new system...without
massive outages. It's a lot of work and a lot of cost. So until things go
HORRIBLY sideways, most companies don't have the spare time/money.

Been there, tried that.

~~~
trevyn
If your service is mission-critical, you’ll have a support contract, and there
absolutely is a number you can call. I have high confidence that Google and
Amazon will make every effort to make sure their services perform to spec, the
real problem is the _feeling_ of helplessness when something does go wrong.

~~~
ShakataGaNai
Of course you have a support contract, but that doesn't change how fast you
can get a response. With AWS Business Support, which is very common, a
"Production Down" case has a 1hr SLA for response. That's a long ass time for a
database to be down without doing anything more than sitting on your hands.

~~~
nogbit
Epic would have Enterprise support with the money they are making. That's a
<15min response time for a business-critical production system being down.
[https://aws.amazon.com/premiumsupport/compare-plans/](https://aws.amazon.com/premiumsupport/compare-plans/)

------
victorqhong
Really surprised that they use XMPP. Since you don't really hear anything
about XMPP anymore, I think most people assumed that it's dropped off in
usage/popularity (or people have moved to some other proprietary solution).

I've always thought that XMPP would be useful for games, just surprised to
hear that people are actually doing it.

~~~
solotronics
a ton of services use XMPP underneath the hood like Signal, WhatsApp, Jabber
etc.

it looks like a great business model to take something open source and hard to
use and make a proprietary version that's easy to install and use

[https://en.wikipedia.org/wiki/Signal_Protocol](https://en.wikipedia.org/wiki/Signal_Protocol)
[https://en.wikipedia.org/wiki/WhatsApp](https://en.wikipedia.org/wiki/WhatsApp)
[https://www.cisco.com/c/en/us/about/corporate-strategy-offic...](https://www.cisco.com/c/en/us/about/corporate-strategy-office/acquisitions/jabberinc.html)

~~~
e12e
Signal isn't xmpp based?

Jabber _is_ xmpp.

Not sure about WhatsApp.

~~~
merb
> Not sure about WhatsApp.

whatsapp at least was in the "early" days. they had some extensions to it but
you could connect to it pretty easily (once you had a valid account generated
with a real mobile number) not sure if it is still the case, things changed a
lot I guess, especially with encryption.

~~~
gsich
it still is

[https://github.com/tgalal/yowsup](https://github.com/tgalal/yowsup)

------
Kagerjay
I was playing fortnite on 2-04-18 22:00 UTC during the "Friends Service"
outage.

You couldn't see friends lists at all during that time period. So you couldn't
queue up for a match with friends or people you knew at all; the only options
were either playing solo or using a "filled" team with random players.

I've been playing fortnite as one of the early 60k concurrent users all the
way to the 3.4M, so it's been interesting seeing their load / server issues
over time and then reading this (granted, I don't understand everything
discussed in their blog). They've done an outstanding job handling their
growing traffic.

One thing I've noticed with Fortnite, compared to PUBG or other MMOs, is how
large their patch updates are. They're usually several GB, and they come
fairly frequently, about once a week.

~~~
Kagerjay
Forgot to note that fortnite had an ongoing "friend list" problem before

Most notably, whenever you wanted to add someone to your party, you had to do
the following:

1\. You send the user a friend invite

2\. User would have to accept invite

3\. User would have to disconnect to refresh your friends list (due to friend-
service issue mentioned in blog)

4\. User would relog back on (took approximately 3 mins)

5\. You could then see them on friends list

6\. Send party invite

------
fokinsean
As an addicted Fortnite player this is a neat read. However, as an application
layer dev, the architecture specifics were slightly over my head. My biggest
concern is shipping a working Docker image; the architecture is mostly
abstracted away at our company. This gave me some inspiration to dive deeper
into our architecture.

------
aaossa
Loved the tone of the article. They know they have some problems to work on,
they're being transparent about them and they're explicitly saying that they
need help with it.

~~~
stevenwoo
Look forward to seeing how they fix that MongoDB collection write stalling
problem. Vaguely recall that was still a big problem the last time I was
looking at MongoDB years ago.

------
iBotPeaches
That was an incredibly fun read. Makes me curious about the other failures in
this industry if they could be explained in this detail.

~~~
lazyjones
The EVE developers used to post a lot of details about their setup, upgrades,
software and about failures / congestion issues...

~~~
robotmay
EVE's time dilation system is ingenious:
[https://www.eveonline.com/article/introducing-time-dilation-...](https://www.eveonline.com/article/introducing-time-dilation-tidi/)
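
For readers who haven't run into it, the general idea can be sketched like this: when the work for one tick no longer fits in the tick's real-time budget, the server slows the game clock instead of dropping work. This is only an illustration of the concept, not CCP's implementation; the 10% floor and the 1 s budget are assumptions, and all names are made up.

```python
# Sketch of a time-dilation factor. Both constants are assumptions.
MIN_DILATION = 0.10   # never run slower than 10% of real time
TICK_BUDGET_S = 1.0   # real-time budget for one simulation tick


def dilation_factor(work_seconds: float, budget: float = TICK_BUDGET_S) -> float:
    """Fraction of real time at which the game clock should advance."""
    if work_seconds <= budget:
        return 1.0
    return max(MIN_DILATION, budget / work_seconds)


def simulated_dt(real_dt: float, work_seconds: float) -> float:
    """Game time that elapses during `real_dt` seconds of wall-clock time."""
    return real_dt * dilation_factor(work_seconds)
```

If a tick's work takes 4 s against a 1 s budget, the factor is 0.25, so one wall-clock second advances the game by a quarter of a second. Every action in that system slows down equally, which keeps the fight fair even though it feels sluggish.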

~~~
plopz
It's an absolutely awful experience for a user. Especially when some things are
being affected by tidi while others are not.

~~~
always_good
Worse UX than failing completely any time a solar system gets more than X
players in it, or having a player cap per solar system?

Those would be "absolutely horrible" to me. Time dilation is a compromise, not
just arbitrary lore that became a game mechanic.

~~~
plopz
Sure, it's the least bad system, but it's still a bad system.

~~~
FLUX-YOU
Is there any game that can do 6k+ simultaneous players with a similar level of
physics/combat system that EVE has at a 1s tick rate?

I'm fairly sure there isn't.

~~~
AlphaSite
Eve doesn't really have much in the way of physics; real physics at that
scale would be genuinely impressive.

~~~
FLUX-YOU
By physics, I mean the ship/object sphere bouncy collisions and the velocities
that are calculated from a collision. Ship agility is also technically
physics. You still have to calculate those every tick
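
To give a sense of what "sphere collisions every tick" costs, here is a brute-force sketch: an overlap test between two bounding spheres plus the number of pairs a naive check would visit. The tuple layout and function names are made up for illustration, and a real server would use spatial partitioning rather than testing every pair.

```python
from itertools import combinations
from typing import List, Tuple

Sphere = Tuple[float, float, float, float]  # x, y, z, radius (hypothetical layout)


def spheres_overlap(a: Sphere, b: Sphere) -> bool:
    """True if the two spheres touch or intersect (compares squared distances)."""
    dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    reach = a[3] + b[3]
    return dx * dx + dy * dy + dz * dz <= reach * reach


def colliding_pairs(spheres: List[Sphere]) -> List[Tuple[int, int]]:
    """Indices of every overlapping pair, found by checking all pairs."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(spheres), 2)
        if spheres_overlap(a, b)
    ]


def pair_count(n: int) -> int:
    """Number of distinct pairs among n objects."""
    return n * (n - 1) // 2
```

For 6000 ships a naive pass is `pair_count(6000)`, about 18 million pair tests per tick, before resolving any velocities.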

------
eterm
This is an interesting read, it's always interesting to hear why something
that ought to be fairly heavily federated or sharded can nevertheless fall
over centrally.

------
SilverSurfer972
> "Along with a number of things mentioned, even small performance changes
> over N nodes collectively make large impacts for our services and player
> experience."

I think this is where Stacktical helps with proactively detecting performance
regressions at the CI level, before they hit production:
[https://stacktical.com](https://stacktical.com)

Disclaimer: I am Stacktical's CTO

------
einrealist
Nice read. And nice to see Java running at the backend.

I wonder whether Epic can solve its problems by rearchitecting more into a
CQRS driven system with event sourcing: store events in a more write optimized
DB (e.g. Cassandra) and then process the events for fast reads through
whatever is required for the usecases. Maybe they touched the limits of
MongoDB to handle both, reads and writes at their scale.
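
A toy sketch of what that split could look like, using a friends list as the example. Everything here is hypothetical: the in-memory list stands in for a write-optimized store, and none of the class or field names come from Epic's system.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class FriendEvent:
    kind: str     # "added" or "removed"
    user: str
    friend: str


class EventStore:
    """Write side: an append-only log (stand-in for a write-optimized DB)."""

    def __init__(self) -> None:
        self._log: List[FriendEvent] = []

    def append(self, event: FriendEvent) -> None:
        self._log.append(event)

    def read_from(self, offset: int) -> List[FriendEvent]:
        return self._log[offset:]


class FriendListView:
    """Read side: a projection kept up to date by replaying new events."""

    def __init__(self) -> None:
        self._friends: Dict[str, Set[str]] = defaultdict(set)
        self._offset = 0  # how many events have been applied so far

    def catch_up(self, store: EventStore) -> None:
        for event in store.read_from(self._offset):
            if event.kind == "added":
                self._friends[event.user].add(event.friend)
            elif event.kind == "removed":
                self._friends[event.user].discard(event.friend)
            self._offset += 1

    def friends_of(self, user: str) -> Set[str]:
        return set(self._friends.get(user, set()))
```

Writes only ever append, and reads only ever hit the projection, so each side can be scaled and tuned separately. The cost is that the read view lags the log by however long `catch_up` takes to run.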

------
tlynchpin
This is a great article, lots of detail, props to Epic team for generally
killing it and specifically putting this together.

------
orliesaurus
I never spent a dime on any of these free-to-play games. I am in awe at how
dedicated the team behind Fortnite seems to be when it comes to providing us
data (real data?) of what's happening on their side, while I am sitting on my
couch logging into one of the matches with my keyboard and mouse

------
halflings
Meanwhile, the game is pretty much unplayable [1] on Mac OS while it was
heralded as the first game to support Metal (even featured in Apple's
keynote).

[1] Getting ~16 FPS on medium settings, with the high-end late 2016 MBP 15".

~~~
stats_n_trends
As a counter point I could play quite well on my MBP 17 at medium settings.
(45 FPS)

Note: I did tweak some minor settings like AA to get better performance.

~~~
halflings
Interesting. There shouldn't be any major difference in performance between the
two MBPs. Did you get any AMD driver updates or something like that?

------
aecorredor
How do you get to this level of expertise? What resources do people like
these use to learn about this type of scalable system? Any good books that
start from the ground up on these topics?

------
dom96
Looks like they are still having scaling issues. I just tried creating a new
Epic account and was shown an error.

~~~
sgarman
Sure, the game is not done growing in popularity. That's one of the difficult
things - they don't just have to scale to 3.4m users because next week the
target has moved and now it's 4m, then 6m, etc. It's hard to hit a moving target
like that.

------
je42
I wonder why they want to do the step:

\- Followed by removing Nginx + Memcached couple altogether out of equation.

~~~
MuffinFlavored
I'm kind of surprised nobody in the comments is mentioning MongoDB
"unknowingly" causing write delays. Albeit, it's after handling an amazing
programming feat.

------
andrea_s
404 not found at the moment - does anyone have a snapshot?

edit: nevermind, it is available again - weird

------
skaplun
any thoughts by game production managers on the success of the battle royale
concept? anything we should take away for other products?

------
NelsonMinar
Launching this web page in Firefox on Windows causes my Oculus Rift software
to start up. I am.. not happy about that. WTF?

------
mevile
I'd be more likely to share this with my team if they didn't have a recruiting
pitch in there. I have a (probably irrational) fear of people abandoning the
team for some new shiny thing. Just a thought. It's a great writeup otherwise!

------
bullen
This is why you use an async-to-async database:

Why:
[https://github.com/tinspin/rupy/wiki/Fuse](https://github.com/tinspin/rupy/wiki/Fuse)

How:
[https://github.com/tinspin/rupy/wiki/Storage](https://github.com/tinspin/rupy/wiki/Storage)

