Postmortem of Service Outage at 3.4M Concurrent Users (epicgames.com)
414 points by johnnyapol on Feb 9, 2018 | 174 comments

When Bluehole take down PUBG for 5 hours, there's no communication outside of two tweets. When Epic see degraded performance for less than 2 hours, they give a postmortem.

There's a difference in the level of respect each company shows its customers. I play PUBG a lot, but I want to see Epic win in the long run.

There might also just be the ability for the developers to understand what actually went wrong. I think the guys at Epic spent a lot of time building their system so they can understand it and I'm not sure that PUBG developers started out with that in mind.

Do you (or anyone reading) have any way to expand upon this? I.E. what sorts of patterns/practices/tradeoffs lead to building a more "understandable" system?

1) Experience - knowing what you are gonna want to know in the future and knowing what is likely to go bad.

2) Investing in logging early, so that your system is giving you metrics as it grows

3) Understanding that logging and metrics are a conversation that happens with a system over time. You don't do it once and then are done, it's like teaching a child to talk. You have to leave room to expand, and be actively building a 'vocabulary' in your logging and metrics that makes your day-to-day simpler.

4) KISS -- at least on the "big stuff". Your high level architecture should be 'clean' in the sense that issues can be isolated and readily assigned to a subcomponent. I find that isolated communication points that let you bifurcate responsibility are key to quick deductions. I.e., the backend publishes data to a store and the front-end reads data from that store, so if the data store disagrees with either one you know exactly where the problem is.

Most of the time you have to write metrics into the code (and have a place set up to collect the data); that alone is a step a lot of places don't do, or don't do enough.
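A minimal sketch of point 4, assuming a backend that publishes a record to a store and a front-end that reads it back. The function name and the three "views" are hypothetical; the point is only that one clean boundary lets a comparison assign the fault to one side.

```python
def locate_fault(backend_value, store_value, frontend_value):
    """Say which hop disagrees, given each component's view of one record.

    All three arguments are hypothetical snapshots of the same piece of data.
    """
    if backend_value != store_value:
        # The store never received what the backend thinks it published.
        return "backend -> store (publish path)"
    if store_value != frontend_value:
        # The store is right, but the front-end shows something else.
        return "store -> front-end (read path)"
    return "no disagreement at this boundary"


print(locate_fault({"hp": 90}, {"hp": 100}, {"hp": 100}))
```

With more than one boundary in the system, the same comparison would be repeated at each one.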

There isn’t even a custom message for “scheduled server maintenance” vs “our servers are busy” (which I can get multiple times a night anyway). You NEED to check the tweets to find out which it is. I love PUBG, I just wish it was made by more competent developers.

It's astounding to me how the last couple patches have made performance worse and introduced new, gamebreaking bugs. The last patch took everything down for 5 hours. The patch before tanked FPS by like 40%. The patch before caused 2/3rds of games to spawn players below the match or simply crash on load.

They don't do QA, and their 'deploy process' takes everything down for hours. It's embarrassing.

> They don't do QA, and their 'deploy process' takes everything down for hours.

This reminded me of this classic: https://engineering.riotgames.com/news/automated-testing-lea...

I'm putting it in a to-read list after 2 minutes of skimming, but that looks like an excellent article.

Sounds like Riot have a good head on their shoulders about how to handle testing games.

If you come from a different domain, it's important to realize that testing games is much more difficult than testing a lot of other types of software. And much like other kinds of software, when it's done well it looks easy.

I have no love for Riot-the-games-company, but there are pretty good reasons why Riot-the-software-company is so well-respected. Bear in mind that stuff like that blog post is just the tip of that iceberg. They basically took the notion of Chef best practices away from Opscode (now Chef, Inc.) and completely redefined them with environment cookbooks and Berkshelf. Things in devops have moved on somewhat, as they always do--but they were, and I assume still very much are, serious about building good processes and the tools necessary to make them better.

I wish Valve would do this for Dota 2.

Exploits are such a headache that my friends and I just quit playing PUBG. I am not sure how bad hacking is in FR with only a week of gameplay time, but so far I haven’t encountered any really strange kills. FR shooting is actually very lax. You don’t need to aim very precisely to get a hit marker.

Its backstory is famously that it was developed by some guy as a side project, and that guy's tacky username is in the title.

I mean, what did you expect?

Sort of. BattleRoyale was initially developed as a mod for ARMA 2 by Brendan Greene (PlayerUnknown). He then went on to make a mod for ARMA 3, then later worked with the H1Z1 devs for the H1Z1 battle royale mode named King of the Kill.

After that, he moved on to making his own game with a supporting development company fully centered around his ideas, working at his direction, with creative freedom that was not limited by being a mod (ARMA 2/3) or by the constraints of a larger company (SOE/Daybreak with H1Z1).

Brendan is largely the progenitor of the video game battle royale mode, and his name carries a lot of weight in the genre itself.

The first 3 versions were. PUBG as we know it was developed by a professional studio.

PUBG combines great game design with terrible technical implementation and bad community management. I'm with you on hoping that Epic will succeed in the long run.

Why does it need to be a zero sum game with one winner? Fortnite and PUBG are two very different games fundamentally and there is no reason they can't exist side-by-side.

> two very different games fundamentally

No, they are not. Sure, they have different art styles, some different mechanics, but they are the two largest Battle Royale games out right now. The only games I would consider to be closer to PUBG than Fortnite are perhaps H1Z1 and the Arma mod (that PUBG devs made).

However, as for:

> Why does it need to be a zero sum game with one winner

It doesn't, but I am so thoroughly frustrated with the PUBG developers that I hope they suffer financially enough to do something about their poor development practices.

Yes, they are different. Fortnite is arcade-y and the end game changes significantly due to the building mechanic. PUBG is much more methodical and strategic. I play both for different reasons and clearly the market can support two wildly successful battle royale games with 3mil+ CCU. So again, why do we need to have a “winner”?

Your desire to see them suffer because god forbid the servers go down for five hours is an example of the shitty toxicity game developers have to deal with from fans nowadays. Bluehole isn’t Epic, they don’t have the resources or the comparable talent, but they made a really fun game. Isn’t that enough for $29.99?

Fortnite started out as something very different, and they made the Battle Royale part of it free, which is why it has so much traction.

Dev, singular. PlayerUnknown/Brendan Greene did the ARMA 2 and 3 mods by himself, and was brought in by Daybreak/SOE to direct the creation of the H1Z1 BR mode.

>A zero sum game with one winner

A "battle royale" if you will.

I want the gunplay of PUBG with the support of Epic.

A little counter to that narrative, Epic has really been letting me down lately.

When Epic started the UE4 project, they promised to take care of us GNU/Linux users. I bought in the second I could. Immediately they started ignoring us. For the first year and a half or so most of us Linux people were using a community fork because Epic refused to merge changes. We banded together, were forced out of their IRC, and had to form another channel. I tried to be understanding, at first thinking they were low on resources. As time went on, with games like Paragon (which I paid $20 for at release and which has now been abandoned; I've been waiting over two weeks for my refund) and Epic showing off how well things were going, they still basically abandoned us. There is still no marketplace/launcher on Linux, so to retrieve my $500 worth of assets I have to use a Windows computer. Major bugs persist in all branches, and not just for the native editor... games like PUBG would love to ship for Linux if the cross-compile toolchain wasn't an exercise in cryptic puzzle solving (which is why so many UE4 games are Windows-only, and not by choice). All pleas for more resources and love on the forums are met with comments about marketshare and how unworthy Linux is of their time and resources.

They promised us all this love and then after many of us spent lots of money on them they just ignored us. I'm thankful for the attempt at a native Linux editor, but crashes and uncompilable projects have essentially halted dev for me to the point I'm having to consider Godot and Blender even though they aren't nearly at feature parity, and due to Epic's licensing all that money on assets is wasted since those assets can only be used in UE4...

I love UE4 when it works. Blueprints, the animation and rigging system, the camera system are all wonderful to work with. I want to use it... but I'm feeling increasingly taken advantage of by Epic.

Tim, if you're reading this, how about a post mortem of the abandonware that is the Linux editor?

> We run Fortnite’s dedicated game servers primarily on thousands of c4.8xlarge AWS instances, which scale up and down with our daily peak of players.

That's between roughly $572,000 (500 instances, 30 days) and $2,863,800 (2500 instances, 30 days) per month at current prices, and it seems like that's only for one aspect of their infrastructure.
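The arithmetic behind those figures can be sketched as below. The hourly rate is an assumption inferred from the numbers in the comment (it reproduces the upper figure exactly and gives about $572,760 for 500 instances), not a quoted price, and it ignores reservations, bandwidth, and storage.

```python
HOURS_PER_MONTH = 24 * 30

# Assumed on-demand rate in USD/hour, back-derived from the comment's totals.
C4_8XLARGE_HOURLY_USD = 1.591


def monthly_cost(instances, hourly_usd=C4_8XLARGE_HOURLY_USD):
    """List-price cost of running `instances` machines flat out for 30 days."""
    return round(instances * hourly_usd * HOURS_PER_MONTH, 2)


for n in (500, 2500):
    print(n, monthly_cost(n))
```

Because the fleet scales up and down with the daily peak, the real bill depends on instance-hours actually used rather than on a fixed instance count.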

That seems .... excessive? Is that a typical spend with a game server system like this? That does seem to suggest that once this becomes less than profitable, it's all going away ...

c4.8xlarges are the smallest c4 instance that guarantees 10G networking performance.

Those costs can easily be cut in half with reservations. It's likely there's a lower bound on the number of reserved instances they use as a baseline performance guarantee, and then they use on-demand to scale above that.

No one at their scale actually pays the list price.

Basic math: There are ~100 players in each game. At a 30 tickrate, that's 3000 RPS per game minimum. Each of those requests likely involves a number of 3D math calculations, including hit detection, collision detection, real-time cheat detection, etc. Those updates then need to batch back to the players at 30 tick. All of this needs to happen with as little eventual consistency as possible; a difference of milliseconds degrades the player's confidence that the server is correctly calculating what is happening in-game.

Point being; multiplayer game programming is an entirely different beast than normal web programming. The same rules don't apply. Its an n^2 problem where every additional user in one "lobby" actually increases resource utilization exponentially because you need to update every other user's game-state with the actions of that new player. Additionally, Battle Royale style games are the most demanding multiplayer games ever created. Games like WoW have way more players per realm, but servers only need to worry about the interactions of a select few in your surrounding area, and it doesn't have as stringent real-time requirements. Games like CoD only load in 5-20 players.
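A rough sketch of the per-game numbers in this comment, under its own assumptions (100 players, a 30 Hz tick) and a worst case where every player's state is sent to every other player with no relevancy filtering. The function names are illustrative only.

```python
def inbound_updates_per_second(players, tick_hz):
    """Client -> server messages per second if each player sends once per tick."""
    return players * tick_hz


def pairwise_updates_per_tick(players):
    """Server -> client state updates per tick if everyone sees everyone else."""
    return players * (players - 1)


print(inbound_updates_per_second(100, 30))  # the "3000 RPS per game" figure
print(pairwise_updates_per_tick(100))
print(pairwise_updates_per_tick(101) - pairwise_updates_per_tick(100))
```

Under these assumptions, the 101st player adds 200 updates per tick: 100 outgoing copies of their own state plus one incoming copy from each of the other 100 players.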

> Additionally, Battle Royale style games are the most demanding multiplayer games ever created.

I'd like to draw your attention to Planetside 2. A pretty recent FPS that was run a lot like WoW, with 3 'continents' and complete seamlessness on those continents. Firefights where 100 people were actively engaging each other were an almost constant experience back when I played it. I think there was a cap of 1000 players per continent. Those players could gather wherever they wanted. I'm pretty sure I was in some 300 player battles, and nothing but lag would prevent all 1000 people on a continent from going to the same base.

It has to be said that in the big battles (at around a 100) lag became an issue. I'm not sure whether this was clientside, serverside or both.

(There is also eve-online where recently there was a 'fight' with 5000 players in the same location, but that is a different beast from FPSes, and the main logic of that game is single-threaded)

> It has to be said that in the big battles (at around a 100) lag became an issue.

I remember the crown frequently having 96+ per faction without any lag problems. Must have been clientside.

> (There is also eve-online where recently there was a 'fight' with 5000 players in the same location, but that is a different beast from FPSes, and the main logic of that game is single-threaded)

It was 6000, and it was terrible to be in. It took me 10 minutes to enter a station (normally takes a second) and I had to relog because I couldn't get out of the station. Relogging took another hour of black screens. It was pretty neat when I finally got out, but there wasn't much of a fight compared to previous encounters.

Planetside 2 was horrendously single-threaded, so anything but recent Intel chips would leave you with really bad performance in large battles. I would often max out my AMD CPU (well, one core of it) at around 12 FPS without my graphics card breaking a sweat in 100-200 player battles.

PS2 was for the most part an amazing game. The graphics were truly beautiful and the gameplay was very fun. Unfortunately, their shitty performance basically killed the game by making it unplayable for many users during the trademark huge battles.

I recall some fights in biolabs that were really laggy, same with some fights at crossroads watchtower. The issue was really when a lot of people had mutual line of sight and we got grenade spam (especially revive grenade spam).

I might be off on the numbers though.

FYI most game network servers tick the network at 10-12Hz(although the internal sim may run higher).

We generally didn't go higher than 10Hz since you'd start saturating lower end connections and any well written game will be good at reconciling state via dead-reckoning. You have to handle 100ms+ spikes anyway so it doesn't make sense to run the network super-fast.
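Dead reckoning, as mentioned above, can be sketched like this: between network updates the client extrapolates each remote entity from its last known position and velocity. This is a minimal 2D, constant-velocity version with hypothetical names; real games also blend toward the corrected state when the next snapshot arrives.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Last state received from the server for one entity (hypothetical type)."""
    t: float   # timestamp in seconds
    x: float
    y: float
    vx: float  # velocity in units per second
    vy: float


def dead_reckon(last, now):
    """Extrapolate the position at time `now`, assuming constant velocity."""
    dt = now - last.t
    return (last.x + last.vx * dt, last.y + last.vy * dt)


# 100 ms after a snapshot, an entity moving at 5 units/s has moved 0.5 units.
print(dead_reckon(Snapshot(t=0.0, x=0.0, y=0.0, vx=5.0, vy=0.0), now=0.1))
```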

Most competitive games go well above 10Hz ticks, OW does it, CSGO does it (I don't really know the tick-rates of other games but when it matters, so no comment there, but there is an important area for it).

OW just has a restricted rate option for low bandwidth connections (added when they went from 23 -> 61 Hz).

So that might be the case for LAN games, generally internet tends to stay lower. Valve for instance recommends 13Hz[1] for Source(CSGO):

> // 60 for updaterate is LAN ONLY use 13 for internet

> // 20 is default but will cut the maxplayers you can handle in 1/2

> // for SRCDS Servers use 30 - you might be able to use 20

> // sv_maxupdaterate 60

> sv_maxupdaterate 13

Like I said, depending on how good your potential-vis code is you may be able to go higher but you risk saturating your client links and most game netcode doesn't handle congestion nicely. My info is a little out of date(2-3 years) but back then we had to keep data rates below 20kb/sec(and ideally 10kb/sec) if you wanted low jitter across the link.

[1] https://support.steampowered.com/kb_article.php?ref=5386-HMJ...

This information is very much outdated. (Windows XP is referenced in it)

CSGO matchmaking servers (run at 64 ticks) force you to use cl_updaterate 64.

Even in CS 1.6 days, most leagues and servers enforced updaterate 100 and rate 20000. Nearly every player could handle it. And those who couldn't, well, bad luck for them. Playing with 100ms+ is no fun for anybody.

Like I said in the root, talking generally 10-12Hz is really common because you need to handle dropped packets cleanly.

I don't know about the CS 1.6 days, I used to play a ton of early cs and was an admin on one of the most popular Frontline Force servers. We ran tick rate of 12 if I recall correctly.

It is mainly because PUBG's network code sucks. You can see the effects here [0]

Unless you had dial-up users, a tickrate of 12 is very bad and not fun. I don't know if you had hardware restrictions, but tickrate is independent of updaterate that is sent to the clients. By enforcing a low tickrate you maintain a very old snapshot of the world on your server (83 ms), which makes clients interpolate more.

[0] https://www.youtube.com/watch?v=u0dWDFDUF8s

CS(:GO) would be completely unplayable at any update rate below 50. Especially with the necessary interpolation, you would end up with delays of over 100-200ms+ at an update rate around 10. Most weapons in the game could even fire more than one shot in that time.
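The 83 ms and 100-200 ms figures in the comments above follow from simple arithmetic. This sketch assumes the client buffers a fixed number of snapshots before rendering; the default of two buffered snapshots is an assumption for illustration, not a statement about any particular game's settings.

```python
def snapshot_interval_ms(rate_hz):
    """Time between consecutive snapshots at a given tick or update rate."""
    return 1000.0 / rate_hz


def interpolation_delay_ms(update_hz, buffered_snapshots=2):
    """Extra render delay if the client interpolates across N buffered snapshots."""
    return buffered_snapshots * snapshot_interval_ms(update_hz)


for hz in (10, 12, 64):
    print(hz, round(snapshot_interval_ms(hz), 1), round(interpolation_delay_ms(hz), 1))
```

At 12 Hz the world state is refreshed about every 83 ms; at 10 Hz with two buffered snapshots the interpolation delay alone is 200 ms, before network latency is added.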

Strong disagree for the recommended tick rate for csgo. I played a decent amount of competitive cs, from 1.3 all the way to csgo. Lots of competitive server hosts will go up to 120 (ESEA comes to mind), and the difference between 120 and 60 is certainly noticeable. Going down to 13 would be almost unplayable.

>So that might be the case for LAN games

Overwatch has a 63Hz tick rate over the internet.

And they handle congestion cleanly[1] which is one of the things I predicated you need if you didn't do 10-13Hz(along with smarter vis sets).

Every game is going to be unique and gameplay structure has a huge impact on your data rate. We had games back in '97 like Subspace[2] that handled hundreds of players in the days of dial-up and 200ms+ pings. Generally if your data rate per-link goes above 10kb/sec you'll start to see degradation of jitter across your userbase. There are smart ways to handle that, but if you want to do a fixed tick you run up against that limit really quickly.


Other games will do smart things to keep skill high and tick rates low. CS (but not CS:GO that I know of) used to famously rewind time based on your latency so your view of the world was accurate, but that led to fun behaviors like rubberbanding as you got shot around a corner (due to movement speed slowing down at that past point in time and the game having to reconcile it with the current timestamp).

[1] http://www.eurogamer.net/articles/2016-08-12-looks-like-bliz...

[2] https://en.wikipedia.org/wiki/SubSpace_(video_game)

Man Subspace/Continuum was such a great game. I do recall issues with packet loss, but in general the game functioned really well even on my 33.6kbps modem with hundreds of players and pretty frenetic gameplay. It's not until you mentioned it that I realize how amazing that feat is.

Oh yeah, I love to hold up Subspace as a perfect example of why game development is so cross-disciplinary. For networked games, building the design around predicting where someone will be means that frame-to-frame latency almost doesn't matter. The Newtonian physics of that game plays into it perfectly, leading to something that works over primitive modems. Even FPS games to a large degree are about predicting/leading the positions of the other players.

Compare that to say, fighting games which are very reactionary and very hard to run over any latent connection without some seriously complicated netcode involving time rewinding.

CS:GO (and all other source games) do the same lag compensation as CS did (though it's horrifically buggy, but it was pretty messed up in 1.6 as well). People seriously underestimate how impossible it is to play fast-paced games without it - you wouldn't be able to hit a HS on a moving player in CS:GO even with ~30ms ping without leading your shot if that wasn't there.

It's also not actually why you rubberband - that's just caused by your own latency (moving while already dead). It does mean you can get hit after moving behind a wall though.
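The "rewind time based on your latency" idea discussed above can be sketched as follows: the server keeps a short history of each player's positions and, when a shot arrives, tests it against where the target was when the shooter saw it. This is a simplified illustration with hypothetical names, not Source's actual implementation (which also accounts for client interpolation time).

```python
import bisect


class PositionHistory:
    """Server-side record of one player's past positions (hypothetical helper)."""

    def __init__(self):
        self._times_ms = []
        self._positions = []

    def record(self, t_ms, position):
        # Assumes record() is called with increasing timestamps, once per tick.
        self._times_ms.append(t_ms)
        self._positions.append(position)

    def position_at(self, t_ms):
        """Return the latest recorded position at or before t_ms."""
        if not self._times_ms:
            raise ValueError("no history recorded")
        i = bisect.bisect_right(self._times_ms, t_ms) - 1
        return self._positions[max(i, 0)]  # clamp to the oldest sample


def rewound_position(history, server_now_ms, shooter_latency_ms):
    """Where the target was when the shooter actually saw it and fired."""
    return history.position_at(server_now_ms - shooter_latency_ms)


target = PositionHistory()
for t_ms, pos in [(0, (0, 0)), (50, (1, 0)), (100, (2, 0))]:
    target.record(t_ms, pos)
print(rewound_position(target, server_now_ms=100, shooter_latency_ms=50))
```

This is also why a player can be hit after moving behind a wall: the hit is judged against a position that is already in the past from the target's point of view.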

Only PUBG and Fortnite have such shitty tickrates.

What do you mean by "game network servers tick the network"? Like data going server -> client ?

Since game servers are mostly single-threaded, they probably run roughly 35 of them on a single instance (c4.8xlarges have 36 vCPUs), so with a good matchmaking algorithm you can fit roughly 3,500 players per instance. To reach 3.5M you need between 1,000 and 2,000 instances.

As for your last comment, it's wrong: a game like Battlefield is probably more demanding compared to a battle royale type of game. An FPS like Battlefield runs at 60 Hz and uses a lot of fast-paced physics everywhere; battle royale games don't have that much action going on.

Likely more than 35 games on a single instance, as ideally you wouldn't be starting all games at the same time on the same instance. Staggering the start of games on each instance would be ideal.

For example, you could have dozens of late-stage games that combined have fewer players than a single, just-started, 100-person game.

We run 16 game servers per c4.8xl instance for Battle Royale. We run more than that (on a different instance type) for StW.
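A back-of-the-envelope check of the instance counts discussed in this subthread. It assumes every game server holds a full 100-player match and every concurrent user is in a match, which overstates density (lobbies, late-stage games, and StW players are ignored), so treat the result as a rough lower bound on fleet size.

```python
import math


def instances_needed(concurrent_players, servers_per_instance, players_per_server=100):
    """Instances required if every server on every instance is full."""
    players_per_instance = servers_per_instance * players_per_server
    return math.ceil(concurrent_players / players_per_instance)


# The earlier guess of ~35 servers per instance vs. the stated 16 per c4.8xlarge.
print(instances_needed(3_500_000, 35))
print(instances_needed(3_400_000, 16))
```

With 16 servers per instance the estimate lands at a little over two thousand instances, in line with the "thousands of c4.8xlarge" quoted from the postmortem.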

You usually don't run two or more game servers on a single thread / core; it kills performance. And with VMs with many physical processors you have other issues like NUMA, so you have to be very careful.

I'm pretty sure they have a ratio of one gameserver per thread / core.

We don’t tick at 30Hz. Somewhat slower than that. Working on making that better, though.

And we do indeed take advantage of reserved instances. We’re also looking at other options along these lines.

> and it doesn't have as stringent real-time requirements

Blizzard is absolutely stringent about their real time requirements for WoW. It's crucial for PvP, Raids, etc.

Fortnite doesn't need to broadcast nitty player info for players on the other side of the map to you, either, I would think. That should definitely be optimized away

Battle Royale games generally need to broadcast player state to all players within a large distance (iirc for PUBG, it's one km) because it's common to spot another player 500+ meters out and take some shots at them. If they're not moving much you might hit them.
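The distance-based filtering described in the last two comments can be sketched as a simple relevancy check: only players within some radius of the viewer get their state replicated to that viewer. The 1 km default follows the figure recalled above for PUBG; the function and data layout are hypothetical, and real engines use spatial partitioning rather than a linear scan.

```python
import math


def relevant_players(viewer_pos, others, radius_m=1000.0):
    """Names of players whose state should be sent to the viewer.

    `others` maps player name -> (x, y) position in meters.
    """
    vx, vy = viewer_pos
    return [
        name
        for name, (x, y) in others.items()
        if math.hypot(x - vx, y - vy) <= radius_m
    ]


others = {"sniper": (600.0, 0.0), "far_side": (3000.0, 4000.0)}
print(relevant_players((0.0, 0.0), others))
```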

Even if all 100 players are nearby, 100 isn't unusual in the context of MMOs. That's not to say it's trivial to do, it's just not a new feat unique to battle royale games.

> Its an n^2 problem

> actually increases resource utilization exponentially

I think you'll have to choose there.

> No one at their scale actually pays the list price.

I've always wondered what AWS rates large users get out of negotiations.

MAG somehow managed to be a competitive FPS with 256 players back in 2008ish, it's not an impossible battle.

I think MAG was using clever tricks. There were multiple “zones” per battle, and I think each of those zones was a separate instance with some coordination between.

to be fair you could apply the same tricks to any Battle Royale type game

> actually increases resource utilization exponentially

Quadratically actually.

Exponential would mean each user doubles the amount of calculations, for instance.
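The distinction can be checked numerically. With a quadratic cost, going from 100 to 101 players adds a fixed-size increment; with an exponential cost, each extra player doubles the total. The cost functions are idealized stand-ins, not measurements of any game server.

```python
def quadratic_cost(n):
    """Work proportional to n^2, e.g. every player interacting with every player."""
    return n * n


def exponential_cost(n):
    """Work that doubles with each additional player."""
    return 2 ** n


print(quadratic_cost(101) - quadratic_cost(100))     # extra work for one more player
print(exponential_cost(101) // exponential_cost(100))  # growth factor for one more player
```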

I run game servers. Our intra-day swing gives us about 10x traffic during primetime hours compared to off hours. Autoscaling sheds a lot of overhead.

Hi. Guy who does the autoscaling code here at Epic. It's really fun to watch the graphs of how many server sessions we have in a region at any given time over the course of a day.

I like that you're in here commenting. After reading through the postmortem, it reminds me of scaling issues we had at a previous job. We had hundreds of thousands of clients that would get "hyper active" if they had issues connecting, retry loops FTW.

System goes down and it was hard resurrecting it since the traffic just kept pounding away. No autoscaling, no cloud. It woulda been handy to just fire up some more servers, let alone have things auto-scale via CPU %.
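One common mitigation for the retry-storm pattern described above is exponential backoff with random jitter on the client, so reconnect attempts spread out instead of arriving in lockstep. This is a generic sketch with assumed defaults; it is not taken from the postmortem or from the commenter's system.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`, and the actual delay
    is drawn uniformly below it so clients do not retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


for attempt in range(5):
    print(attempt, round(backoff_delay(attempt), 2))
```

The `base` and `cap` values are placeholders; the right numbers depend on how quickly the service can absorb a reconnecting fleet.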

Have you run the calculations to determine where that cutoff is? If you have a daily primetime of 10x traffic that's pretty consistently the same amount, then it's still cheaper (if we're talking the scale of hundreds of instances) to rack up the servers in a datacenter (COLO or your own) with the capacity for the standard 10x load.

Auto-scaling instances on EC2 only makes sense if you don't have the in-house skills to troubleshoot/manage hardware servers or if the traffic is unpredictable and the bursts are infrequent (e.g. once or twice a week or just busy for a short time period of the year).
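One way to frame the cutoff question is to compare instance-hours: autoscaling pays for the demand curve, while fixed capacity pays for the peak around the clock. The 24-hour profile below is invented for illustration (six peak hours at 10x the off-peak load), and the comparison ignores scale-up lag, reserved pricing, and staffing.

```python
def instance_hours(hourly_demand, autoscale):
    """Instance-hours consumed over the profile.

    `hourly_demand` lists the instances needed in each hour. With autoscaling
    you pay for what each hour needs; without it you provision for the peak.
    """
    if autoscale:
        return sum(hourly_demand)
    return max(hourly_demand) * len(hourly_demand)


# Hypothetical day: 18 off-peak hours at 100 instances, 6 peak hours at 1000.
profile = [100] * 18 + [1000] * 6
scaled = instance_hours(profile, autoscale=True)
fixed = instance_hours(profile, autoscale=False)
print(scaled, fixed, scaled / fixed)
```

For this made-up profile, fixed capacity only wins if an owned instance-hour costs less than about a third of a cloud instance-hour; a flatter demand curve moves that threshold toward parity.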

> in-house skills to troubleshoot/manage hardware servers

Which costs time, money and opportunity cost. Something that all the "rack and stack" people seem to forget when they compare owning bare-metal servers against managed cloud providers.

The amount of time these "bare-metal" shops waste building and maintaining infrastructure management tools, scaling tools, reporting and monitoring tools is mind boggling. None of this stuff has anything to do with the core business the company actually is in. Opportunity costs here are immense....

You don't use physical servers in a DC to pay the 10x peak time, it's a big waste of money. If you want to go this way, you pay for the minimum number of servers you need at any time and you use cloud for the rest. But again, I think nowadays it doesn't make any sense to buy servers in a DC.

You pay for hardware that you need to manage, and you have to be sure that you will still be using it after the game launch (most games' player bases shrink by half just after release).

> You don't use physical servers in a DC to pay the 10x peak time, it's a big waste of money

No, it's cheaper once you factor in the real performance of physical vs virtual machines.

Someone really needs to bring some actual numbers to this conversation.

The numbers have been done: it's cheaper to run in the cloud nowadays. Also VM ( EC2 on AWS ) have the same perf that baremetal. Have a read: https://aws.amazon.com/ec2/instance-types/c5/

> VM ( EC2 on AWS ) have the same perf that baremetal

Extraordinary claims require extraordinary evidence.


Brendan Gregg has written here and elsewhere about the performance of VMs on EC2. Worth reading if you're interested in figuring out how close it is (or isn't).

AWS makes $20 billion a year in revenue.

Maybe it's possible that you're wrong?

That is a complete non sequitur. AWS has traditionally never been the cheapest way to rent machine power, but people don't use it because it's cheap, they use it because it's easy to get started with a wide range of facilities without dedicated sysadmins.

Perhaps they're doing some creative counting and thousands refers to all the instances spun up over a day rather than the actual concurrent boxes.

Nope. No creative counting. Those are our real numbers.

I want bandwidth monthly pricing (maybe it adds up)! And what's minimum networking required on each server ?

Shit. Switching to c5 would save them over 100k/month at a minimum and net better performance to boot. I’d get on that.

We’d love to, but have been told that our providers don’t yet have enough capacity for us to flip our entire fleet over to c5’s just yet.

A lot of players of massive online games tend to get hand-wavey when there are problems and act like "dude just get more servers" is the answer.

This clearly shows how complex a system is needed that has to handle 3.4 million concurrent, connected users. I think the connected part compounds any scale problems you have since it is implied they are connected to each other.

> act like "dude just get more servers" is the answer.

The biggest problem here is that it largely used to be a distributed system. You used to just be able to run your own dedicated server on whatever provider you liked. The dev would just run a single server list. Now many game developers have decided they're the only ones who get to run servers - primarily because they can charge more for micro transactions and private servers this way as far as I can tell.

It really hurts games - PUBG is the best example I've seen - constant lag issues, complete lack of server side checks for things like shooting through the ground (because hey, that costs CPU and every additional cloud server they need means less profit), etc. It's basically made the game unplayable.

Game developers are unfortunately stuck between immersion in their games and the rage that leaves players with when technical issues occur. The more immersed your players are, the more rage they'll experience when your game crashes or lags at the wrong time.

>primarily because they can charge more for micro transactions and private servers this way as far as I can tell.

It also hurts the game's image if many servers spam players with "Donate $20/mo for the MASTER rank and get exclusive weapons that deal twice the damage of normal weapons!"

I haven't played in a while but I remember playing Minecraft when I was still in school and servers were filled with this- many would let you donate hundreds for admin/OP/whatever, many would restrict core parts of the game to "donating" members (such as mining certain ores, going into certain areas, etc). Eventually it got so bad that Mojang actually had to step in and (vaguely) threaten to sue any server owner who does this - http://andrewtian.com/mojang-threatens-lawyers-against-pay-t...

Also, admins that can ban players for any reason, who are picked by the server owners (and not the devs), can result in backlash if someone influential is banned.

>It also hurts the game's image if many servers spam players with "Donate $20/mo for the INVINCIBILITY rank!"

That's a problem with the client design. Don't allow the server to send banners or any map tiles and that problem goes away.

That isn't really a realistic solution if you also want to support the mod ecosystem (which IMO is much more valuable to a game's community than the potential downsides)

Ok, but the mod ecosystem is a separate discussion. And even if you do want the mod ecosystem, you can distribute mods as client add-ons so the mods can be evaluated independently of the host server.

Which is exactly what Gmod did, and they have this problem.

Just make mods opt-in. Vanilla servers by default.

Dedicated game servers are only part of the problem. The matchmaking/community-connecting servers were also implicated in the outages discussed in TFA.

Community/matchmaking servers don't distribute ("run your own server") too well. If you only ever want to play with and against people you know, that's fine, but a lot of people don't want that: some want to play on randomized (or ranked, if the game supports it, across a large number of players; not just the ones in the potentially-altered rankings on a single person's server) teams. Others (more commonly) want to play against new and different teams.

Unless your graph of social connections willing to agree to use a given server is big enough and the right shape, hosting your own matchmaking/community services just results in isolated islands without a vibrant overarching community.

Put another way: the separate-server approach of e.g. TF2 works fine for some, but a lot (judging by play counts) of other folks prefer games with a centralized community.

To be clear, I think it's fine to have the option of a separate community server. But if a game developer already has the centralized infrastructure, is making tons of money off of it, and spending tremendous resources keeping it intact (as Epic apparently is), it's a pretty tough sell to say "oh and you should also spend the resources to make a hosted option for this".

"Take the complex thing designed for centralized, single-instance internal use with a team of support staff, put it in a box, and let users run it on their own hosts" is far from a simple proposition.

I'm unsure if any companies doing centralized community servers/community federation also have a large portion of their actual game servers not hosted by the company; I'd love some examples of that phenomenon. The examples of Minecraft and TF2 come to mind as doing some subparts of that, but those games only have limited matchmaking capabilities.

I think that part of this trend may be tied to games, such as PUBG, becoming popular while still "under development". I imagine once a game has a stable server build it is much more practical to distribute that server to all of the community operators and have them update on a fixed / infrequent schedule. Whereas, while changes are still being constantly made to the server application, this would probably result in users only being able to connect to a small fraction of available community servers while waiting for the other servers to update.

>primarily because they can charge more for micro transactions and private servers this way as far as I can tell.

I believe it's more about standardizing the user experience. Now that enough games have dedicated servers with a standard expectation of latency, your game looks poor if your user experience is beholden to community servers with selfish moderation and varying stability.

The added benefit of centralized server hosting and control is exactly what you point out though.

No: standardisation is easily enforced by just having two server lists, one for normal/central hosting and one for custom games where any host can register. At most you might additionally want to give the user the ability to blacklist servers for themselves, and that's it.
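To make the two-list idea concrete, here is a minimal sketch of a server browser that keeps official and community-registered hosts apart and applies a per-user blacklist to the custom list only. The class and method names are hypothetical and not taken from any real game's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Server:
    address: str
    official: bool  # True for centrally hosted, False for community-registered


@dataclass
class ServerBrowser:
    """Hypothetical browser: one registry, presented as two separate lists."""
    servers: list = field(default_factory=list)
    blacklist: set = field(default_factory=set)  # addresses this user has blocked

    def register(self, address: str, official: bool = False) -> None:
        # Any host may register; only the operator would set official=True.
        self.servers.append(Server(address, official))

    def block(self, address: str) -> None:
        self.blacklist.add(address)

    def official_list(self) -> list:
        # The "standard experience" list contains centrally hosted servers only.
        return [s.address for s in self.servers if s.official]

    def custom_list(self) -> list:
        # Community servers, minus anything the user has blacklisted.
        return [s.address for s in self.servers
                if not s.official and s.address not in self.blacklist]


browser = ServerBrowser()
browser.register("central-1", official=True)
browser.register("community-a")
browser.register("community-b")
browser.block("community-b")
print(browser.official_list())
print(browser.custom_list())
```

Players who only ever open the official list never see community servers, so the default experience stays standardized.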

The reason for banning community servers is probably mostly for combating piracy; by eliminating the ability to run on a LAN the game must always call home to be played.

Agreed. Gamers also are among the most entitled users I've ever had to deal with.

I'll never accept donations again for a game I've made available for free from the amount of problems someone has caused me after they gave me $5.

I never realized the level of feelings of entitlement and melodrama from gamers until I ran a passion project to make a private server for a dead game. We don't even accept donations and people still treat the dev team like dirt, and will often write nasty manifestos about how we are ruining their lives or we don't know what we are doing and if only we would listen to their every whim, etc...

It's one of those things, though, that reveals the human inclination to remember negative events more even among a sea of positive events... Frequently as a game developer you will get loads of positive and helpful feedback but the small minority that want to shit on you for no good reason stick in the front of your mind.

I think it comes from a combination of a good share of the potential player base being children, the content being mostly emotional, and players not needing to be professionally employed to use the game. Those 3 things alone are great filters for getting good users, and games lack all of them.

Out of curiosity, what's the project?

London 2038 https://london2038.com/ It's a revival of the MMOARPG Hellgate: London which released in 2007, closed in 2009, was revived as a pay-to-win Korean reincarnation in 2011 and shut down again in 2015. Our version matches the version of the game from 2009.

Hey, I've heard of this! Good work. :D

One thing I notice from your site is that you provide _many_ ways of giving feedback. The Discord widget, the 'Contact Us' menu item, the FAQ is peppered with communication links, there's a Forum, Wiki, et al. Basically, I'm seeing an incredibly low barrier to entry for anyone wanting to express themselves to the developers.

Have you considered paring that back? IE, ensuring all communication channels are moderated, and limiting it to only one or two services?

Great feedback, I have actually done a lot to pare it down since we launched open alpha last Halloween. Typically people know now to make forum posts for long-term discussions and to come chat in Discord for quick back and forth. Nobody has actually used the contact form on the site, now that you mention it.

My team of community manager volunteers is pretty active with moderation in-game and out-of-game so it's gotten pretty good. But still, I think I need to isolate myself and the other developers from the public, because it can be really demoralizing when you spend 30-40 hours of free time per week on a side project for free and you hear some terrible things.

Sounds like you've got a sound plan, and I wholeheartedly endorse placing a barrier and filter between you and your fellow developers and the community at large. :)

Gamers aren't simply entitled; they can be aggressive, threatening and altogether horrible. I've personally received threats on my own life from users. There was a nice Medium piece on it recently. [0]

Anecdotally, I'm seeing more and more peers leave the industry as a result of being disgusted and worn down by the constant, unending abuse that game developers receive from users.

0: https://medium.com/@morganjaffit/the-cost-of-doing-business-...

I think modern game devs create entitlement through games as service models that rely on unhealthy pve or player versus player content either real or implicit, and quickly force them into factions to compete for limited dev time.

The games as service part prevents players from keeping proper emotional distance to games by not allowing them to have closure; to look at the game, say you've beaten it, and move on. Every game needs some kind of grind or persistence beyond the actual play, through online connectivity and often actual player competition. So you kind of prime players through negative play models very often, especially when in-game purchases exist.

What's worse, instead of an end to a game, devs are forced into chasing the mythical idea of balance. This actually sets players up into warring factions because balance is often zero sum; my warrior gets buffed at the expense of your paladin. There's no existing balance state, ever; you can look at Overwatch to see it just ending up as random changes to heroes that please and incense players in equal measure.

These things combine to make players strung out over time; their nerves are jangled through constant grinding, death (screw roguelikes!), or competition, and then you end up with developer as God who may decide to alter everything to please a particular player faction, or even no one at all but themselves. You make your players invest a hundred hours in your game, and they aren't going to be disinterested enough to take changes in perspective; you don't give them the chance to be, they have to be logged on every day doing busywork either through ranking or grinding.

I think this explains why modern gamers can be so enraged. It's more than personality; a lot of the structure of many modern games can really create this. The games make players too invested in them too easily, over too long a time.

Did you quickly offer/send a refund? This is going to be my strategy for upset customers, and I’m curious about people’s experiences with this approach.

No, but it's not like they are privately emailing me with their complaints. They drum up threads in the subreddit and rally others to complain as if they are being personally wronged.

I suppose I could offer a refund, but refunding a $5 donation is just another example of how a donation box is a waste of time.

In the future I would at least consider a Patreon system where donations renew every month since server and development costs certainly do. Way too many people think they're buying an annuity against your free time with a one time donation/purchase.

Couldn't agree more. I shut down my profit-generating gaming project because of death threats; not worth it.

Love this because it shows two things 1) competent people are handling problems and 2) they actually care.

A whole lot better than spoon feeding customers bullshit for weeks while hamstringing your product rather than investing in it (looks at EA, mumbles about SimCity).

Let's not labor under the impression that Epic invested resources in fixing these issues purely out of the goodness of their hearts.

The success or failure of the recent Sim City game was hardly EA's number one concern. Fortnite, however, is probably extremely important to the folks at Epic.

I'm not defending EA here, but it was definitely in Epic's best interests not to piss off Fortnite's player base given that there are some other, ahem, similar titles on the market right now.

Why do you think a game that sells for ~$50 was not important to EA, but a game that sells for $0 is extremely important to Epic?

Epic is a pretty big operation too.

Unlike EA, Epic does not have a diverse portfolio of games on different devices. EA can absorb a flop while Epic is clearly pushing people to install their PC launcher through their free Battle Royale to get potential customers to pay for their paid content (Fortnite, in-game purchases, other games, etc.)

For SimCity, most of the sales were already done, so EA could have chosen to do the bare minimum because they would only see small returns for those fixes. Whereas for Epic, the business model is different since it's F2P, so they need players to keep coming back so they keep buying stuff.

Because Free For All games are really popular right now, and they want their engine (Unreal) to be a competitive choice for these and other massively multiplayer games.

Unreal is already an extremely competitive choice for multiplayer games (read: the best choice). They are putting resources to solving these problems because they are making millions of dollars from "free" players.

SimCity was the most successful game of all time on the Origin platform. Anyone who claims it was a failure is ignoring the things that actually matter, the money, in deference to meaningless (at least in terms of business) issues like the experience of gamers. Nothing is better news to a game publisher than mountains of gamers who pre-ordered, were discouraged by launch day troubles (caused by not spending more for more resources; the game was on Amazon and providing more resources was a few button clicks away but would cost money), and aren't coming back. "Look at the playerbase dwindling!" sounds exactly like "look at our server costs dropping!" for any game that isn't based on subscriptions or microtransactions.

1. competent people are handling the problems

2. they actually care

3. they enjoy what they are doing

One thing I really miss working solo is working with a team of smart folks to solve a complex problem together. It's so damned fun.

These three things (plus your addendum of working together to solve stupidly hard problems) are basically why I love going to work every day.

(I work at Epic on Team Online, aka the team that this whole PM is about.)

Glad to hear you're loving it. And thanks for detailed yet fun write-ups like this. Keep on keeping on!

Epic Games is not a publicly traded company, and there are a lot of things you can say when you're private.

That is an interesting state of affairs isn't it? Why do we, the world, expect that large, supposedly more professional companies, communicate less honestly?

> Why do we, the world, expect that large, supposedly more professional companies, communicate less honestly?

Because being honest can hurt the share price.

But if this shows anything, it shows that the company has people who are on stand-by and who constantly monitor and optimize their product. Also, since "the product" they are developing is not a consumable (think processors), there would be no recall cost, for example. How can that hurt the shares?

Particularly, it can hurt the share price on a horizon over which the C-level execs have their performance measured.

I would have thought it was the opposite way around! Private companies always seem so tight-lipped, especially since they don't have to answer to shareholders.

You're not going to say to shareholders "please give me your money and also we're looking for experts because we lack expertise in the product we're building"

Disclaimer: I work on Google Cloud so I will be speaking from the bias of knowing those products.

They talk a lot about reducing operating complexity and scaling their infrastructure. I wonder what the cost of their current infrastructure + the staff to maintain it might be vs the managed solutions that cloud providers offer now.

For example, using Cloud Datastore, Spanner, or Bigtable as a persistence layer: these managed services can definitely scale to the current need, and I've seen them go much higher as well.

For logs ingestion and analysis, BigQuery can be a very powerful tool as well, and with streaming inserts that data can be queried in near real time. For things that are less urgent, batch queries. For other things, Dataflow can help with streaming workloads.

I think one of the problems they alluded to, though, was that at the moment they're on a single provider, and what they're looking for is a multi-cloud strategy, which totally makes sense. A lot of the above products create some kind of lock-in, with some exceptions, like using HBase as an interface to Bigtable or Beam as an interface to Dataflow. Though I don't know what the other providers offer that may have these same interfaces.

Another option is Kubernetes, which I believe all providers are pretty strongly embracing. Having most of the supporting infrastructure be brought up with a few kubectl commands could help them scale across several cloud providers quickly.

I think they detailed in the article that the problem isn't their game servers, which are AWS cloud based and can scale up; it is their login/setup/matchmaking (my term) server infrastructure, the first thing users encounter, that is having issues.

Usually there is a cost/scale threshold with managed providers where it is cheaper to DIY than to pay thousands upon thousands per month for, say, log ingestion.

All of the managed products from the different cloud providers are, more or less, great. The problem is they are black boxes. When something goes wrong you're completely at the whim of support. Ever call Dell/Comcast support and want to tear your hair out? Yea... it's like that except neither AWS nor Google have phone numbers to call.

The other problem is that most of these things aren't easy to migrate to. AWS RDS is much easier because it's just a managed version of whatever you're already using. But Cloud Spanner? DynamoDB? You have to completely re-architect your application. Then you have to move your application, and data, to this new system...without massive outages. It's a lot of work and a lot of cost. So until things go HORRIBLY sideways, most companies don't have the spare time/money.

Been there, tried that.

If your service is mission-critical, you’ll have a support contract, and there absolutely is a number you can call. I have high confidence that Google and Amazon will make every effort to make sure their services perform to spec, the real problem is the feeling of helplessness when something does go wrong.

Of course you have a support contract, but that doesn't change how fast you can get a response. With AWS Business Support, which is very common, a "Production Down" case has a 1hr SLA for response. That's a long ass time for a database to be down without doing anything more than sitting on your hands.

Epic would have Enterprise support with the money they are making. That's a <15min response time for business-critical production down. https://aws.amazon.com/premiumsupport/compare-plans/

> Yea... it's like that except neither AWS nor Google have phone numbers to call.

At my current company we have access to AWS support. I'm not high enough on the ladder to know the specifics (separate contract, size threshold?), but when we've had issues in Aurora we've had personalized support. I have no doubt if you're scaling to thousands of instances you would have personalized support, if for no other reason than that a cloud provider would not want to lose such a huge customer.

Yes, certainly. Large companies have good relationships with the cloud providers and have "necks to strangle" when things go south. But unfortunately that's not everyone on these providers. You've gotta grow up from a lil ol startup (when you hopefully start using all these cool technologies suggested) to a big company before you get the attention and during that time... you can get majorly screwed.

Agreed, Cloud Spanner is an impressive piece of technology, but question: If you build your business on Spanner and it does start having problems for whatever reason, what do you do? Obviously at this size you’d get great support from Google, but ultimately you rely on one managed provider and your hands are tied. That’s a tough situation to be in when you’re servicing 3.4M users.

> .. currently unclear to us and support why our writes are being queued ..

You think GCP offers better support on Spanner et al when a customer is having performance problems? In this case probably yes, because an Epic-sized monthly spend is highly effective at escalating through support.

It takes low effort to find <cloud persistence horror story> around here so we know cloud is not a special magic that is immune from integration performance problems. But the economic incentives are meaningfully different and especially so at runtime.

Really surprised that they use XMPP. Since you don't really hear anything about XMPP anymore, I think most people assumed that it's dropped off in usage/popularity (or people have moved to some other proprietary solution).

I've always thought that XMPP would be useful for games, just surprised to hear that people are actually doing it.

a ton of services use XMPP underneath the hood like Signal, WhatsApp, Jabber etc.

it looks like a great business model to take something open source and hard to use and make a proprietary version that's easy to install and use

https://en.wikipedia.org/wiki/Signal_Protocol https://en.wikipedia.org/wiki/WhatsApp https://www.cisco.com/c/en/us/about/corporate-strategy-offic...

Signal isn't xmpp based?

Jabber is xmpp.

Not sure about WhatsApp.

> Not sure about WhatsApp.

WhatsApp at least was in the "early" days. They had some extensions to it, but you could connect to it pretty easily (once you had a valid account generated with a real mobile number). Not sure if that is still the case; things have changed a lot I guess, especially with encryption.

Riot Games is using XMPP to provide in-game chat for League of Legends.

Same for Blizzard with Battle.net chat.

I was playing Fortnite on 2-04-18 22:00 UTC during the "Friends Service" outage.

You couldn't see friends lists at all during that time period. So you couldn't queue up for a match with friends or people you knew at all; the only options were either playing solo or using a "filled" team with random players.

I've been playing Fortnite as one of the early 60k concurrent users all the way to the 3.4M, so it's been interesting seeing their load / server issues over time and then reading this (granted, I don't understand everything discussed in their blog). They've done an outstanding job handling their growing traffic.

One thing I've noticed with Fortnite, compared to PUBG or other MMOs, is how large their patch updates are. They're usually several GB, and they come fairly frequently, about once a week.

Forgot to note that Fortnite had an ongoing "friend list" problem before.

Most notably, whenever you wanted to add someone to your party, you had to do the following:

1. You Send user friend invite

2. User would have to accept invite

3. User would have to disconnect to refresh your friends list (due to friend-service issue mentioned in blog)

4. User would relog back on (took approximately 3 mins)

5. You could then see them on friends list

6. Send party invite

That has more to do with how they package the game than with how much they changed. Oftentimes large parts of the game are bundled together, so a single small change in one bundle causes you to re-download the whole thing.
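A rough sketch of why packaging dominates patch size, assuming a patcher that re-downloads any bundle containing at least one changed file. The file names, sizes, and bundle layouts below are made up for illustration and say nothing about how Epic actually packages Fortnite.

```python
def download_size(old: dict, new: dict, bundles: dict) -> int:
    """Bytes to fetch when any changed file forces its whole bundle to be re-downloaded.

    old/new map file name -> contents; bundles maps bundle name -> list of file names.
    """
    total = 0
    for files in bundles.values():
        # A bundle is dirty if any file inside it differs between versions.
        if any(old.get(name) != new.get(name) for name in files):
            total += sum(len(new[name]) for name in files)
    return total


# Hypothetical game content: two large assets and one tiny config file.
old = {"map": b"m" * 1000, "skins": b"s" * 500, "ui": b"u" * 10}
new = dict(old, ui=b"v" * 10)  # only the 10-byte config changes

coarse = {"all": ["map", "skins", "ui"]}                      # everything in one bundle
fine = {"map": ["map"], "skins": ["skins"], "ui": ["ui"]}      # one bundle per file

print(download_size(old, new, coarse))  # 1510: the whole bundle is fetched again
print(download_size(old, new, fine))    # 10: only the changed file is fetched
```

The same 10-byte change costs 151 times more download with the coarse layout, which is the effect described above.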

As an addicted Fortnite player this is a neat read. However, as an application layer dev, the architecture specifics were slightly over my head. My biggest concern is shipping a working Docker image; all of the architecture is mostly abstracted at our company. This gave me some inspiration to dive deeper into our architecture.

Loved the tone of the article. They know they have some problems to work on, they're being transparent about them and they're explicitly saying that they need help with it.

Look forward to seeing how they fix that MongoDB collection write stalling problem. Vaguely recall that was still a big problem the last time I was looking at MongoDB years ago.

That was an incredibly fun read. Makes me curious about the other failures in this industry, if only they could be explained in this detail.

The EVE developers used to post a lot of details about their setup, upgrades, software and about failures / congestion issues...

EVE's time dilation system is ingenious: https://www.eveonline.com/article/introducing-time-dilation-...

It's an absolutely awful experience for a user. Especially when some things are being affected by TiDi while others are not.

Worse UX than failing completely any time a solar system gets more than X players in it, or having a player cap per solar system?

Those would be "absolutely horrible" to me. Time dilation is a compromise, not just arbitrary lore that became a game mechanic.

Sure, it's the least bad system, but it's still a bad system.

Is there any game that can do 6k+ simultaneous players with a similar level of physics/combat system that EVE has at a 1s tick rate?

I'm fairly sure there isn't.

Eve at 6k+ simultaneous players has a 10s tick rate due to TiDi. At 6k+ simultaneous players Eve isn't actually playable. I don't think there's another game with a 1s tick rate; AFAIK Eve has the worst tick rate in the business.
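For anyone unfamiliar with the idea, time dilation can be sketched as a clamp: when one tick's worth of simulation work no longer fits in the tick, game time is slowed just enough to fit, down to some floor. The function names are hypothetical, and the 1 s base tick and 10% floor are assumptions chosen to match the 1s and 10s figures in this thread, not CCP's actual implementation.

```python
def dilation_factor(work_seconds: float,
                    tick_seconds: float = 1.0,
                    min_factor: float = 0.1) -> float:
    """Fraction of real-time speed the simulation runs at (1.0 means no dilation)."""
    if work_seconds <= tick_seconds:
        return 1.0  # the server keeps up, so game time matches wall-clock time
    # Slow game time just enough for one tick of work to fit, but never below the floor.
    return max(min_factor, tick_seconds / work_seconds)


def effective_tick_seconds(work_seconds: float,
                           tick_seconds: float = 1.0,
                           min_factor: float = 0.1) -> float:
    """Wall-clock seconds that pass per game tick under dilation."""
    return tick_seconds / dilation_factor(work_seconds, tick_seconds, min_factor)


for work in (0.5, 4.0, 50.0):  # seconds of simulation work needed per tick
    print(work, dilation_factor(work), effective_tick_seconds(work))
```

In this sketch, once the floor is reached the work still does not fit in the stretched tick (50 s of work against a 10 s tick), which is one way to read the "isn't actually playable" complaint.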

Eve doesn't really have much in the way of physics, which would be genuinely impressive.

By physics, I mean the ship/object sphere bouncy collisions and the velocities that are calculated from a collision. Ship agility is also technically physics. You still have to calculate those every tick

Riot Games has an engineering blog that goes into some good details: https://engineering.riotgames.com/

Disclaimer: Riot Games Engineer :-)

It's great to actually see a game developer that seems to have some grip on what they're doing. At least writing about issues like clocking, animation quantization, etc is far better than what everyone else does - screwing it up and never fixing it. It's not all perfect, but still (I don't play LoL though).

The video game industry doesn't reveal that much about its tech. You only get a glimpse of what they use during public talks like GDC, etc.

This is an interesting read, it's always interesting to hear why something that ought to be fairly heavily federated or sharded can nevertheless fall over centrally.

> "Along with a number of things mentioned, even small performance changes over N nodes collectively make large impacts for our services and player experience."

I think this is where Stacktical helps with proactively detecting performance regressions at the CI level, before they hit production: https://stacktical.com

Disclaimer: I am Stacktical's CTO

Nice read. And nice to see Java running at the backend.

I wonder whether Epic can solve its problems by rearchitecting more into a CQRS driven system with event sourcing: store events in a more write optimized DB (e.g. Cassandra) and then process the events for fast reads through whatever is required for the use cases. Maybe they touched the limits of MongoDB to handle both reads and writes at their scale.
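As a toy illustration of the CQRS plus event sourcing split, here is an in-memory sketch: writes only append events to a log, and a separate read model replays them to answer queries. The `FriendEvent`, `EventStore`, and `FriendListView` names are hypothetical, and the Python list stands in for a write-optimized store; nothing here reflects Epic's actual services.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FriendEvent:
    kind: str    # "added" or "removed"
    user: str
    friend: str


class EventStore:
    """Write side: an append-only log (stand-in for a write-optimized database)."""

    def __init__(self):
        self._events = []

    def append(self, event: FriendEvent) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> list:
        return self._events[offset:]


class FriendListView:
    """Read side: a projection kept up to date by replaying events it has not seen."""

    def __init__(self):
        self._friends = defaultdict(set)
        self._offset = 0  # number of events already applied

    def catch_up(self, store: EventStore) -> None:
        for event in store.read_from(self._offset):
            if event.kind == "added":
                self._friends[event.user].add(event.friend)
            elif event.kind == "removed":
                self._friends[event.user].discard(event.friend)
            self._offset += 1

    def friends_of(self, user: str) -> list:
        return sorted(self._friends.get(user, ()))


store = EventStore()
view = FriendListView()
store.append(FriendEvent("added", "alice", "bob"))
store.append(FriendEvent("added", "alice", "carol"))
store.append(FriendEvent("removed", "alice", "bob"))
view.catch_up(store)
print(view.friends_of("alice"))
```

The trade-off is that reads are only as fresh as the last `catch_up`, so the read model is eventually consistent with the write log.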

This is a great article, lots of detail, props to Epic team for generally killing it and specifically putting this together.

I never spent a dime on any of these free-to-play games. I am in awe at how dedicated the team behind Fortnite seems to be when it comes to providing us data (real data?) of what's happening on their side, while I am sitting on my couch logging into one of the matches with my keyboard and mouse.

Meanwhile, the game is pretty much unplayable [1] on Mac OS while it was heralded as the first game to support Metal (even featured in Apple's keynote).

[1] Getting ~16 FPS on medium settings, with the high-end late 2016 MBP 15".

As a counter point I could play quite well on my MBP 17 at medium settings. (45 FPS)

Note: I did tweak some minor settings like AA to get better performance.

Interesting. There shouldn't be any major difference in performance between the two MBPs. Did you get any AMD driver updates or something like that?

How do you get to this level of expertise? What are the resources people like these use to learn about this type of scalable systems? Any good books that start from the ground up on these topics?

Looks like they are still having scaling issues. I just tried creating a new Epic account and was shown an error.

Sure, the game is not done growing in popularity. That's one of the difficult things - they don't just have to scale to 3.4m users, because next week the target has moved and now it's 4m, then 6m, etc. It's hard to hit a moving target like that.

Our account service went down for a brief period this morning. Sorry about that.

I wonder why they want to do the step:

- Followed by removing Nginx + Memcached couple altogether out of equation.

I'm kind of surprised nobody in the comments is mentioning MongoDB "unknowingly" causing write delays. Albeit, it's after handling an amazing programming feat.

404 not found at the moment - does anyone have a snapshot?

edit: nevermind, it is available again - weird

any thoughts by game production managers on the success of the battle royale concept? anything we should take away for other products?

Launching this web page in Firefox on Windows causes my Oculus Rift software to start up. I am.. not happy about that. WTF?

I'd be more likely to share this with my team if they didn't have a recruiting pitch in there. I have a (probably irrational) fear of people abandoning the team for some new shiny thing. Just a thought. It's a great writeup otherwise!

This is why you use an async-to-async database:

Why: https://github.com/tinspin/rupy/wiki/Fuse

How: https://github.com/tinspin/rupy/wiki/Storage
