
Ask HN: What is the ops architecture like for AAA multiplayer game servers? - qyirius
I’d be really interested in reading about multiplayer servers for big title FPS games with matchmaking and lots of players
======
lwansbrough
Halo 4 and 5 use a project developed “in house” at Microsoft called Orleans.
Roughly speaking it is similar to Akka. It's an actor-based distributed system
which attempts to hide the implementation details of distributed networking.
In essence, each match and each player gets their own “network attached”
object (a “grain” in Orleans terms). So:

    
    
        var player = grainFactory.GetGrain<IPlayerGrain>(123);
    

returns a reference to a grain that lives _somewhere_ on the network (not
necessarily locally on the calling server). Then calling a method on that
reference, such as:

    
    
        await player.FireWeapon();
    

tells the server (“silo” in Orleans terms) which owns that player instance to
invoke that method on it.

In this way, it is very easy to quickly update game state without obsessing
over the throughput of a single machine.
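
To make the "grain" idea concrete, the definition side looks something like
this minimal sketch (the interface and class names are hypothetical, not
Halo's actual code):

    using System.Threading.Tasks;
    using Orleans;

    // A grain is identified by its interface plus a key (here, a long).
    public interface IPlayerGrain : IGrainWithIntegerKey
    {
        Task FireWeapon();
    }

    // Orleans activates this class on some silo on demand. Callers never
    // construct it directly; they ask the runtime for a reference:
    //     var player = grainFactory.GetGrain<IPlayerGrain>(123);
    public class PlayerGrain : Grain, IPlayerGrain
    {
        private int _ammo = 30;

        public Task FireWeapon()
        {
            if (_ammo > 0) _ammo--;   // state lives wherever the silo is
            return Task.CompletedTask;
        }
    }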

~~~
ryanjshaw
Orleans provides higher-level abstractions than Akka. It is literally a plug
'n play distributed application kit and consequently very opinionated.

It took me quite some time to shift my mental model and decide I prefer it
over Akka. It's also worth noting that Orleans is open source. There is also
an excellent operational dashboard named OrleansDashboard.

~~~
dillonmckay
[https://github.com/dotnet/orleans](https://github.com/dotnet/orleans)

------
efokschaner
Not FPS specifically, but you might enjoy Riot Games' tech blog
[https://technology.riotgames.com/](https://technology.riotgames.com/), which
has articles on a variety of game technology topics such as service
deployment, network infrastructure, game performance monitoring, etc.

(Disclosure: I work there)

~~~
ptrott2017
Riot Games' tech blog is an awesome read. The recent post on FPS (as in
frames per second) performance monitoring in League of Legends was a great
example of why it's worth a read.

~~~
godelmachine
I thought by FPS, the OP meant First Person Shooter

~~~
TotempaaltJ
I think that's true, and GP was clarifying that the post they were
referencing is about the other FPS.

------
Thorrez
There's a slide deck and presentation about Call of Duty's servers using
Erlang that's been posted to HN a number of times.

[https://www.erlang-factory.com/upload/presentations/395/Erla...](https://www.erlang-factory.com/upload/presentations/395/ErlangandFirst-PersonShooters.pdf)

[https://news.ycombinator.com/item?id=14120506](https://news.ycombinator.com/item?id=14120506)

[https://news.ycombinator.com/item?id=2671755](https://news.ycombinator.com/item?id=2671755)

[https://vimeo.com/26307654](https://vimeo.com/26307654)

------
dalailambda
I can highly recommend this talk by Respawn, Multiplay, and Google on how
Titanfall 2 does multiplayer server management. It's geared more toward the
infrastructure side than actual dev, but worth a watch:
[https://www.youtube.com/watch?v=p72GaGq-B_0](https://www.youtube.com/watch?v=p72GaGq-B_0)

------
markmandel
If you want to learn more about this in person, in 2020 there will be the
first "Online Game Technology Summit" at the Game Developers Conference.

(Disclaimer: I'm one of the summit advisors)

We're trying to grow education in this area of game development, because it
is currently very sparse. AFAIK, this is the only event dedicated to the
technical aspects of online, connected, or multiplayer games.

Details, and CFP, which is currently open:
[https://www.gdconf.com/summits/c4p](https://www.gdconf.com/summits/c4p)

~~~
druerridge
So happy to see this happening! I've also been advocating for more community
development and knowledge sharing in this space. I believe there's a huge
unfulfilled need to connect the industry more on these topics. I've also been
organizing an event in the space, and looking at the possibility of landing a
special interest group in larger organizations that can support our community.
I'll drop you a line!

------
druerridge
Spent half a dozen years working in the AAA realm for multiplayer games and an
equal amount of time working on my own indie projects, and as you can see
already in this thread, there are really two (often confused) pieces to this
conversation. First, there's Multiplayer Gameplay Engineering: typically a
single process handling the 2~64 people shooting each other in a single game.
Second, there's Online Services Engineering for Games: typically orchestrating
the above process hundreds or thousands of times, as well as things like
matchmaking systems, party systems, storefronts, etc.

Below are two articles which I think can be really valuable "baby's firsts"
for the topics of Multiplayer Gameplay Engineering and Online Services for
Games. I wrote them starting from my perspective as a college student
approaching these problems for the first time many years ago, and continued
with how my approach evolved through experience. I also include lots of links
to seminal articles/talks on the topics that I read as I attempted to "do my
homework" while making my own indie game.

Multiplayer Gameplay Engineering
[https://www.gamasutra.com/blogs/DruErridge/20181004/327885/B...](https://www.gamasutra.com/blogs/DruErridge/20181004/327885/Building_a_Multiplayer_RTS_in_Unreal_Engine.php)

Online Services for Games
[https://www.gamebreakingstudios.com/posts/dedicated-game-ser...](https://www.gamebreakingstudios.com/posts/dedicated-game-server-allocation-for-unreal/)

------
dijit
This question sounds like it's pointed directly at me.

However, I can only speak for one AAA gaming company, and my team operates a
bit differently from most in the company.

My team operates the infrastructure for "Tom Clancy's The Division" video game
series (1&2).

Most of the programming effort is spent on doing the cheapest possible thing
(in terms of CPU); everything is C++.

Things like matchmaking will ideally happen on a single machine with no
shared state; everything happens in memory, which is much faster and can be
more reliable than any distributed state or failover.

(It's less reliable if you're in matchmaking and the server or service dies;
but then everyone's client reconnects to the newly scheduled matchmaking
instance and repopulates the in-memory state.)
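
As a toy illustration of that trade-off (hypothetical names, not the actual
Division code): the queue is just process memory, and "failover" is simply
clients reconnecting and re-enqueueing themselves.

    using System.Collections.Concurrent;
    using System.Collections.Generic;

    class MatchmakingQueue
    {
        // The entire matchmaking state: one in-memory queue, no shared or
        // distributed state anywhere.
        private readonly ConcurrentQueue<long> _waiting = new();

        // Called on connect, and again by every client that reconnects
        // after a crash, which rebuilds this state from scratch.
        public void OnClientConnected(long playerId) =>
            _waiting.Enqueue(playerId);

        public bool TryMakeMatch(int size, out List<long> match)
        {
            match = new List<long>();
            while (match.Count < size && _waiting.TryDequeue(out var p))
                match.Add(p);
            if (match.Count == size)
                return true;
            foreach (var player in match)   // not enough players yet:
                _waiting.Enqueue(player);   // put them back and keep waiting
            match = null;
            return false;
        }
    }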

We use a lot of Windows; nearly every machine that doesn't handle state is a
Windows server. This has pros and cons. From my ops perspective I try to
treat Windows like cattle, but Windows doesn't like that: it has its own way
of operating fleets of machines, which involves using SCCM and packaging
things. There are nice GUIs, but we use SaltStack, and we removed AD because
it was a huge burden to create a machine, link it to AD, reboot it, and
finally get a machine worth using.

From a dev perspective, Windows is good: IO Completion Ports are superior to
Linux's epoll in terms of interface and performance, so we can have machines
that take 200,000+ connections.
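
Their servers are C++ against the raw IOCP API, but the shape of the model is
easy to show in C#, whose async socket I/O sits on top of IOCP on Windows.
A sketch of an echo server in that style: no thread per connection, just
completions dispatched to a small pool.

    using System.Net;
    using System.Net.Sockets;
    using System.Threading.Tasks;

    class EchoServer
    {
        public static async Task AcceptLoop()
        {
            var listener = new TcpListener(IPAddress.Any, 9000);
            listener.Start(backlog: 4096);
            while (true)
            {
                var client = await listener.AcceptTcpClientAsync();
                _ = HandleClient(client);   // no dedicated thread spawned
            }
        }

        static async Task HandleClient(TcpClient client)
        {
            using var stream = client.GetStream();
            var buffer = new byte[4096];
            int read;
            // Each await parks the connection until the OS signals a
            // completion, which is what lets one box hold huge numbers
            // of mostly-idle connections.
            while ((read = await stream.ReadAsync(buffer)) > 0)
                await stream.WriteAsync(buffer.AsMemory(0, read));
        }
    }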

How you decide which dedicated server you connect to is up to your client. As
it's logging in, it does a naive check: a TLS handshake with a random
gameserver in each region. (During the login phase we send your client a list
of all currently active gameservers and an int to represent each region.)

This works fine until there's packet loss on a particular path, because your
single ping might be fine while your overall experience would be better
elsewhere; if you're not able to ping anything, we fall back on GeoIP.
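
A rough sketch of that client-side selection (illustrative types; the real
client is C++, and the handshake and GeoIP functions here are stand-ins):

    using System;
    using System.Collections.Generic;
    using System.Net;

    static class RegionSelect
    {
        // Handshake one random gameserver per region, keep the best RTT,
        // and fall back to a GeoIP guess if nothing is reachable.
        public static string PickRegion(
            IReadOnlyDictionary<string, List<IPEndPoint>> serversByRegion,
            Func<IPEndPoint, TimeSpan?> tryTlsHandshake,   // null on timeout
            Func<string> geoIpRegion)
        {
            var rng = new Random();
            string best = null;
            var bestRtt = TimeSpan.MaxValue;

            foreach (var (region, servers) in serversByRegion)
            {
                var probe = servers[rng.Next(servers.Count)];
                var rtt = tryTlsHandshake(probe);
                if (rtt.HasValue && rtt.Value < bestRtt)
                {
                    bestRtt = rtt.Value;
                    best = region;
                }
            }
            return best ?? geoIpRegion();
        }
    }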

That said, if you have friends in another datacenter, we try to put you on
the same server as them, so that if you join groups or whatever it's just a
phase transition rather than a full server sync.

Everything is orchestrated by a "core" set of servers which handle player
authentication and matchmaking centrally; each of the gameserver datacenters
(geographically distributed to be closer to players) is attached via an
IPsec VPN.

In Division 2 we spread out into GCP as well as having physical machines, so
we developed a custom autoscaler. The autoscaler checks the capacity of a
region and how many players are currently in it, keeps a record over 10
minutes, and makes a prediction. If the prediction exceeds the current
capacity within the next 20 minutes, it creates a new instance (since making
Windows servers on GCP takes longer than Linux servers).

If the prediction goes lower than the capacity of a server, it sends a
decommission request to the machine, which takes up to 3 hours to complete
(to give people time to leave the server naturally).
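
In sketch form the loop is something like the following (illustrative only:
the real autoscaler is internal, and the linear extrapolation is my
assumption about how the prediction works):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class PredictiveScaler
    {
        // (timestamp, player count) samples covering the last 10 minutes.
        private readonly Queue<(DateTime At, int Players)> _window = new();

        public void Tick(int players, int capacity, int perInstance,
                         Action createInstance, Action decommission)
        {
            var now = DateTime.UtcNow;
            _window.Enqueue((now, players));
            while (_window.Peek().At < now.AddMinutes(-10))
                _window.Dequeue();

            // Extrapolate the 10-minute trend 20 minutes into the future.
            var oldest = _window.Peek();
            var newest = _window.Last();
            double mins = Math.Max(1.0, (newest.At - oldest.At).TotalMinutes);
            double perMin = (newest.Players - oldest.Players) / mins;
            double predicted = newest.Players + perMin * 20;

            if (predicted > capacity)
                createInstance();   // start early: Windows VMs boot slowly
            else if (predicted < capacity - perInstance)
                decommission();     // drains for up to 3h before shutdown
        }
    }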

Idk, I've been doing this for 5 years now, so I can talk at length about how
we do it, but ultimately our biggest challenges are that we can't use cool
shit or new trends, because latency matters a lot and we use Windows
everywhere.

--

As an aside: the overwhelming majority of other Ubisoft games (exception: For
Honor) use something very similar to what we released as open source in
collaboration with Google to do matchmaking:
[https://agones.dev/site/](https://agones.dev/site/)

~~~
Slartie
What is the rationale behind using Windows on the servers? I'm wondering
because the argument I usually hear is that someone high up in the ops team
hierarchy declares "but we need everything in the AD, including all the
servers, otherwise I can't sleep well". In your case, however, this
apparently doesn't apply, as you don't register them in the AD.

Is it so the game devs don't need to write cross-platform server code?

~~~
dijit
Real reasons for:

* One platform to develop on; back-end coders tend to go back and forth between client and server programming.

* Faster iterations. (Just hit F5 in Visual Studio.)

* IOCP

Stupid reasons for:

* Old IT Director denied the use of virtualisation software. (MAC address randomisation wasn't great, and it caused a switch crash if two people had the same MAC.)

* Windows licenses are really cheap compared to developer time. (until we went to cloud, where Microsoft charges insane amounts for licensing)

~~~
speedplane
I no longer do Windows development, but I miss it. Its strengths are greater
than you portray.

The C# / .NET libraries are really beautiful and well thought out. I most
often program in Python, and while the Python language is more concise and
elegant than C#, the Python libraries are not nearly as well organized and
consistent as C# / .NET's. Java libraries are better than Python's but worse
than C#'s, and the Java language is worse than both C# and Python. As far as
dev environments go, Visual Studio is far better than Eclipse and roughly
equal to PyCharm. Because C# is compiled, it's far more performant than
Python and a bit better than Java.

The downside, of course, is that once you start down that road, you are
pretty locked into the MS world.

If I were starting a new project as a scrappy startup, I'd probably go with
Python. But if I had resources (i.e. $$$) to support MS licenses and
performance actually mattered, I'd go with MS. Side projects for
experimentation might lead to Node or Rust. But sorry, Java, I can't really
imagine a scenario where I'd start with you.

~~~
scarface74
I was an exclusive Microsoft developer for a little over 20 years. It wasn't
until I started doing cloud deployments, like the original poster implied,
that I started avoiding Windows as often as I could. Once you add Windows to
the mix, everything gets worse: startup time, resource requirements,
automation, and costs.

Luckily, you can do C# without having to use Windows with .NET Core.

~~~
brianpgordon
Ooh, you may be the perfect person to ask a question that I've been wondering
about. Is .NET Core on Linux viable yet for serious use? I've done some
tinkering with it and gotten a project working on macOS, but I did the
development in Visual Studio on Windows. Is it realistic to do your
development in Rider, with the open source version of MSBuild for CI, and
have it perform reasonably on Linux in production, without ever paying
Microsoft anything?

~~~
throwaway8941
The kind of development we do probably wouldn't count as "serious" here on HN,
but since the other comment didn't really answer your question, I'll pitch in.
We've been developing using .NET Core/Linux/Rider for the past two years. It's
been pretty great. Once you get to know the IDE really well, I'd say Rider
beats VS in terms of functionality and usability.

.NET Core was developed with CLI usage in mind, and common operations (like
migrations/boilerplate generation) can be driven from the CLI or Rider; you
don't need VS for that.

You may have some problems with 3rd-party libraries. Not all of them have been
ported to .NET Standard. I'd recommend checking your project's dependencies in
advance. (We did have some problems with legacy COM-based cryptographic
libraries and had to build a separate "microservice" hosted under Windows to
offload this stuff to. COM stuff works fine under .NET Core, but only under
Windows (obviously).)

~~~
scarface74
If you are just running a few static Windows servers the slight cost benefit
of running Linux for .Net Core is really not that great. If you are doing
anything more dynamic where you are rapidly bringing “servers” up and down is
where you see the real cost and performance benefits. Of course you can’t run
Windows instances with either Fargate or lambda but even with regular ECS
(Docker) or autoscaling EC2 instances where you can, Windows costs more, takes
more resources, and is slower to launch.

------
cbartlett
Some network-related info from the PUBG devs:

[https://steamcommunity.com/games/578080/announcements/detail...](https://steamcommunity.com/games/578080/announcements/detail/1689300456332644145)

~~~
ezekg
This is a great read. I always wondered if AAA battle royale games did
something like that, since updating distant players every frame, especially if
they're behind the local player, seems like a waste.
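
The general technique is usually called interest management or relevancy
culling: scale each player's update rate by how relevant they are to the
viewer. A toy version of the idea, with made-up thresholds rather than PUBG's
actual numbers:

    static class UpdateRates
    {
        // How many server ticks to wait between updates for another player,
        // based on distance and whether they're in the local player's view.
        public static int TicksBetweenUpdates(double distance, bool inView)
        {
            int every = distance switch
            {
                < 50.0  => 1,    // close: update every tick
                < 200.0 => 3,    // mid range: every 3rd tick
                _       => 10,   // far away: every 10th tick
            };
            return inView ? every : every * 2;   // behind you: even less often
        }
    }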

------
godelmachine
Not related, but relatable.

I often wonder: between games and enterprise software, which has the more
complex backend architecture?

At first glance, it seems like games are complex as hell, in the sense that
when a gamer shoots another gamer dead there has to be freeing and
re-allocation of processes/resources, and with tens or hundreds of people
playing the same game over the internet, both the game architecture and the
network infrastructure are going to be complex.

Would someone help me out here?

~~~
s_kilk
Just a thought: games have the advantage of being constrained by a particular
purpose and design, while enterprise architectures are more free to (d)evolve
into a sprawling, incoherent mess of systems.

Can you imagine anyone at a game company wanting to replatform the
hit-detection system onto Hadoop?

~~~
ljm
I've thought about that, but at the same time I wonder just how much of each
game (from a single studio) is actually reusable. There's the networking/infra
side for multiplayer games, but even the engines themselves seem to get
significant rework for each new game. It's all insanely optimised C++ (which I
imagine can be reused in many places until the hardware is updated) and then a
huge amount of scripting to piece together the dialogue, story, cutscenes,
etc., right?

I imagine it must be quite difficult switching careers between games and
(particularly larger scale) web projects. Imagine defaulting to Ruby on Rails
to handle your multiplayer infrastructure and trying to model the game state
through REST.

------
lucifirius
Here are some EVE Online links. Really amazing stuff.

[https://www.eveonline.com/article/tranquility-tech-3](https://www.eveonline.com/article/tranquility-tech-3)

[http://highscalability.com/eve-online-architecture](http://highscalability.com/eve-online-architecture)

~~~
Accujack
Seconded. Eve's server architecture is amazing given the original constraints
of the network... they managed to find a way to make a single worldwide server
work and to make the game treat everyone fairly no matter what their ping was.

The architecture used has limitations (the biggest one is that fps-style real
time isn't really possible) but it's aged very well.

------
Negitivefrags
I created an RPG game backend for a game called Path of Exile. Not an FPS but
similar challenges. I don’t know how similar any of this is to other game
backends, but I’ll supply a few details.

Our backend consists of a few somewhat large services that are broken up
mostly around how they are sharded.

The biggest one is the account authority, which contains most of the
account/character/item data and handles the vast majority of traffic.

We have 5 shards of that (sharding on account id), with 2 read-only replicas
of each shard. All the read-only requests go to one replica, and the other
replica is for redundancy.
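
Routing a request is then just a function of the account id; a hypothetical
sketch (the mapping from ids to shards and the endpoint names are made up):

    static class AccountShards
    {
        const int ShardCount = 5;

        static int ShardFor(long accountId) => (int)(accountId % ShardCount);

        // Writes go to the shard that owns the account...
        public static string WriteEndpoint(long accountId) =>
            $"account-authority-{ShardFor(accountId)}";

        // ...reads go to the designated read-only replica; the second
        // replica exists purely for redundancy.
        public static string ReadEndpoint(long accountId) =>
            $"account-authority-{ShardFor(accountId)}-replica-0";
    }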

There are also other services like the party manager, ladder authority,
instance manager, etc.

All of those shard on different things, which is why they are separate
services.

The instance managers handle creating game instances which are the servers
that the players actually play on.

We have a pile of servers, which we call instance servers, each of which runs
an instance spawner. When it starts up, the instance spawner adds its capacity
to an instance manager and creates a process called a prespawner. The
prespawner loads the entire data set of the game and then waits for orders
from the instance spawner.

When the instance spawner wants to create a new game instance, the prespawner
runs fork(), and the new process generates its random game terrain, which
takes a few hundred milliseconds.

Because all the game resources are loaded before the fork, they are already
available in memory shared between all of the instances running on the
machine. Therefore each instance only takes 5-20 MB, which is mostly the
generated terrain and monsters.

We typically run about 500 instances on the min-spec, cheap single-processor
Xeon servers we rent. This used to be around 1600 instances in the early
days, but the game has become more and more CPU-intensive as it got more
hectic over the years.

All the instances connect to routers. There is one router per instance
machine; these connect to a few routers per data center, which in turn
connect to a set of core routers that also have all the backend services
connected to them.

These routers are important because they know where everything and everyone
currently is.

The routers work sort of like internet routers, but instead of IP addresses,
you address your requests to logical entities or groups, which can move
around; the router network is tasked with keeping track of where they are.

So for example, when you whisper someone, you are sending a message to
Account:123, and it will find its way to whatever server currently has
Account:123 on it right now. If you send a message in global chat to
GlobalChat:1, it will be multicast through the network to all the servers
which have currently registered an interest in hearing GlobalChat:1.

If you add someone to your friend list, what that means is that the server
you are on registers interest in the multicast group AccountSession:123, a
group to which account 123 multicasts all its status updates, like moving
between zones or leveling up.

Parties, leagues, guilds, and so on: all of these have multicast groups
associated with them.
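
In code terms the addressing model is something like this toy sketch (the
real routers are a custom C++ network; the types and names are illustrative):

    using System;
    using System.Collections.Generic;

    class LogicalRouter
    {
        // Which server a logical entity ("Account:123") currently lives on.
        private readonly Dictionary<string, string> _entityLocation = new();
        // Which servers registered interest in a group ("GlobalChat:1").
        private readonly Dictionary<string, HashSet<string>> _groups = new();

        public void RegisterEntity(string entity, string server) =>
            _entityLocation[entity] = server;

        // Adding a friend = your server subscribing to AccountSession:123.
        public void Subscribe(string group, string server)
        {
            if (!_groups.TryGetValue(group, out var members))
                _groups[group] = members = new HashSet<string>();
            members.Add(server);
        }

        // Whisper: routed to wherever Account:123 happens to be right now.
        public void Send(string entity, byte[] msg) =>
            Deliver(_entityLocation[entity], msg);

        // Global chat, status updates, parties, guilds: one-to-many.
        public void Multicast(string group, byte[] msg)
        {
            foreach (var server in _groups[group])
                Deliver(server, msg);
        }

        private static void Deliver(string server, byte[] msg) =>
            Console.WriteLine($"-> {server} ({msg.Length} bytes)");
    }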

If you have any more questions then feel free to ask.

~~~
pcnix
Tala Moana, exile!

Very interesting to see a GGG answer here! I admit to being very curious
about the Path of Exile architecture, and your answer has barely whetted my
appetite. I have some questions that might give me better clarity on the
architecture, if you are up for answering them:

1. How is data replicated across regions? And how is trade across regions
handled? Do the instance servers hand over character data to the account
authority in the new region?

2. I remember speculation about some builds that caused extreme amounts of
server-side compute and slowed things down; was this compute performed on the
instance servers? Like poison/chain/monster damage calculations?

3. Is there any sort of automated detection of inconsistent game states done
by the instance servers? Duping protections or some such?

4. What is the scaling plan like at GGG? Does the system have obvious
bottlenecks that are known, or is it easy to scale for the near future?

------
killerbunbun22
[https://open-match.dev/site/](https://open-match.dev/site/)

[https://mobile.serverwatch.com/server-news/how-epic-games-us...](https://mobile.serverwatch.com/server-news/how-epic-games-uses-kubernetes-to-power-fortnite-application-servers.html)

------
segmondy
Check out this talk from Andrew, who worked for Hulu and Riot Games. I
enjoyed the session at QCon SF:
[https://www.infoq.com/presentations/microservices-pitfalls-l...](https://www.infoq.com/presentations/microservices-pitfalls-lessons/)

------
Kaivo
I believe Activision uses a studio dedicated to multiplayer servers called
Demonware: [https://demonware.net/](https://demonware.net/)

There may be some information out there on the methods they use if you google
for them (I haven't searched myself).

------
gameswithgo
At Electronic Arts, the ancillary stuff like matchmaking, stats, etc. is C#
microservices in AWS.

~~~
SeanBoocock
It varies quite a bit between studios and games. Some legacy games have C++
service backends. There is a lot of Java throughout the org. I am currently
working in Node/TypeScript.

------
bullen
I hesitate to comment because I don't have AAA experience (thank god), but I
made this simple platform from scratch that would be good enough for an AAA
FPS: [https://github.com/tinspin/fuse](https://github.com/tinspin/fuse)

Among the unique features are globally distributed hot-deploy of server-side
code/resources and a globally distributed JSON database over HTTP. It's
completely async and concurrent, and all threads work on all memory, which is
not what most systems have if you look under the hood.

Only Java (begin the down vote without arguments) has good enough non-blocking
IO and memory model for networked concurrency to be workable on the server
side, but most implementations are bloated.

The real discussion we should have is tick- versus event-based protocols; I'm
100% convinced we need to move away from tick-based protocols.
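
To make the distinction concrete, here's a toy sketch (hypothetical types,
not fuse's actual protocol): the tick-based loop pays for every tick whether
or not anything happened, while the event-based path sends only on change.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class ProtocolSketch
    {
        void Broadcast(string payload) => Console.WriteLine($"send: {payload}");

        // Tick-based: snapshot at a fixed 20 Hz, even when nothing changed.
        public async Task TickLoop(Func<string> snapshot, CancellationToken ct)
        {
            while (!ct.IsCancellationRequested)
            {
                Broadcast(snapshot());
                await Task.Delay(TimeSpan.FromMilliseconds(50), ct);
            }
        }

        // Event-based: if nothing happens, nothing is sent.
        public void OnPlayerAction(string evt) => Broadcast(evt);
    }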

~~~
codetrotter
> Only Java (begin the down vote without arguments) has good enough non-
> blocking IO and memory model for networked concurrency to be workable on the
> server side

I think you should post some sources or more detailed arguments if you are
going to make that claim.

