// e.g. obtaining a grain reference in Orleans (IPlayerGrain is illustrative):
var player = grainFactory.GetGrain<IPlayerGrain>(123);
In this way, it is very easy to quickly update game state without obsessing over the throughput of a single machine.
It took me quite some time to shift my mental model before deciding I prefer it over Akka. It's also worth noting that Orleans is open source, and there is an excellent operational dashboard named OrleansDashboard.
I’m only pointing this out because it’s easy to spend a lot of time trying to use the dashboard as a dev tool instead of focusing on good logging techniques. There has been a lot of great work put into the dashboard and I certainly don’t want to take away from that.
I use Orleans, have contributed to it, and have interacted with the dev team quite a bit.
It's a great framework and I hope others adopt it for backend services. My only complaint, and it isn't particular to Orleans, is that there aren't many examples demonstrating best practices for building large applications with it. There is definitely a "right" and a "wrong" way to build with Orleans, and it can take you a while, and a lot of refactoring, to discover that.
To be fair, this is precisely what Akka doesn't do, often citing A Note on Distributed Computing (1994) which explains why that approach is problematic.
(Disclosure: I work there)
(Disclaimer: I'm one of the summit advisors)
We're trying to grow education in this area of game development, because it is currently very sparse. AFAIK, this is the only event dedicated to the technical aspects of online, connected, or multiplayer games.
Details, and CFP, which is currently open:
However, I can only speak for one AAA gaming company, and my team operates a bit differently than most in the company.
My team operates the infrastructure for "Tom Clancy's The Division" video game series (1&2).
Most of the programming effort goes into doing the cheapest possible thing in terms of CPU, and everything is C++.
Things like: matchmaking will ideally happen on a single machine with no shared state, and everything happens in memory, which is much faster and can be more reliable than any distributed state or failover.
(It's less reliable in the sense that if you're in matchmaking and the server or service dies, you get dropped; but then everyone's client reconnects to the newly scheduled matchmaking instance and repopulates the in-memory state.)
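As a minimal sketch (all names hypothetical, not the actual Division code), single-machine, in-memory matchmaking can be as simple as a queue owned by one thread, with no shared or distributed state at all:

```cpp
#include <cstdint>
#include <cstdlib>
#include <deque>
#include <optional>
#include <utility>

// Hypothetical sketch: a matchmaker whose entire state lives in one
// process's memory. If the process dies, clients reconnect to the
// replacement instance and simply re-enqueue.
struct Ticket {
    uint64_t player_id;
    int skill;
};

class Matchmaker {
    std::deque<Ticket> queue_;
public:
    void enqueue(Ticket t) { queue_.push_back(t); }

    // Pair the first two queued players within 100 skill points of each other.
    std::optional<std::pair<Ticket, Ticket>> try_match() {
        for (size_t i = 0; i < queue_.size(); ++i) {
            for (size_t j = i + 1; j < queue_.size(); ++j) {
                if (std::abs(queue_[i].skill - queue_[j].skill) <= 100) {
                    Ticket a = queue_[i], b = queue_[j];
                    queue_.erase(queue_.begin() + j); // erase higher index first
                    queue_.erase(queue_.begin() + i);
                    return std::make_pair(a, b);
                }
            }
        }
        return std::nullopt;
    }
};
```

The point is that there is nothing to coordinate: no locks across machines, no replicated queue, just cheap in-process operations.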
We use a lot of Windows; nearly every machine that doesn't handle state is a Windows server. This has pros and cons. From my ops perspective I try to treat Windows like cattle, but Windows doesn't like that: it has its own way of operating fleets of machines, which involves SCCM and packaging things. There are nice GUIs, but we use SaltStack, and we removed AD because it was a huge burden to create a machine, link it to AD, reboot it, and finally get a machine worth using.
From a dev perspective, Windows is good: IO Completion Ports are superior to Linux's epoll in terms of both interface and performance, so we can have machines that take 200,000+ connections.
Which dedicated server you connect to is up to your client. As it logs in, it does a naive check, performing a TLS handshake with a random gameserver in each region. (During the login phase we send your client a list of all currently active gameservers and an int representing the region.)
This works fine until there's packet loss on a particular path: your single ping might be fine, but your overall experience could be better elsewhere. If you're not able to ping anything, we fall back on GeoIP.
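The selection logic could be sketched like this (hypothetical names; the real client measures one TLS-handshake round trip per region, and as noted above a single sample can mislead under packet loss):

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical sketch: pick the region with the lowest measured
// handshake RTT; if no region responded at all, fall back to a
// GeoIP-based guess.
std::string pick_region(const std::map<std::string, std::optional<int>>& rtt_ms,
                        const std::string& geoip_fallback) {
    std::optional<int> best;
    std::string chosen = geoip_fallback;
    for (const auto& [region, rtt] : rtt_ms) {
        if (rtt && (!best || *rtt < *best)) {
            best = rtt;
            chosen = region;
        }
    }
    return chosen;
}
```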
That said, if you have friends in another datacenter, we try to put you on the same server, so that if you join groups or whatever, it's just a phase transition rather than a full server sync.
Everything is orchestrated by a "core" set of servers which handle player authentication and matchmaking centrally; each of the gameserver datacenters (geographically distributed to be closer to players) is attached via IPsec VPN.
In Division 2 we spread out into GCP as well as having physical machines, so we developed a custom autoscaler. It checks the capacity of a region and how many players are currently in it, keeps a record over 10 minutes, and makes a prediction. If the prediction exceeds the current capacity within 20 minutes, it creates a new instance (since spinning up Windows servers on GCP takes longer than Linux servers).
If the prediction drops below the capacity of a server, it sends a decommission request to the machine, which takes up to 3 hours to complete (to give people time to leave the server naturally).
Idk, I've been doing this for 5 years now so I can talk at length about how we do it, but ultimately our biggest challenges are the fact that we can't use cool shit or new trends, because latency matters a lot and we use Windows everywhere.
As an aside: the overwhelming majority of other Ubisoft games (exception: For Honor) use something very similar to what we open-sourced in collaboration with Google for matchmaking: https://agones.dev/site/
Is it so the game devs don't need to write cross-platform server code?
* One platform to develop on; back-end coders tend to go back and forth between client and server programming.
* Faster iterations. (just hit F5 in Visual Studio)
Stupid reasons for:
* Old IT Director denied the use of virtualisation software. (MAC address randomisation wasn't great, and it caused a switch crash if two people had the same MAC)
* Windows licenses are really cheap compared to developer time. (until we went to cloud, where Microsoft charges insane amounts for licensing)
The C# / .NET libraries are really beautiful and well thought out. I most often program in Python, and while the Python language is more concise and elegant than C#, the Python libraries are not nearly as well organized and consistent as C# / .NET. Java libraries are better than Python's but worse than C#'s, and the Java language is worse than both C# and Python. As far as dev environments go, Visual Studio is far better than Eclipse and roughly equal to PyCharm for Python. Because C# is compiled, it's far more performant than Python and a bit better than Java.
The downside of course, is that once you start down that road, you are pretty locked into the MS world.
If I were starting a new project as a scrappy startup, I'd probably go with Python. But if I had resources (i.e. $$$) to support MS licenses and performance actually mattered, I'd go with MS. Side projects for experimentation might lead me to Node or Rust. But sorry, Java: I can't really imagine a scenario where I'd start with you.
Luckily, you can do C# without having to use Windows, thanks to .NET Core.
.NET Core was developed with CLI usage in mind, and common operations (like migrations and boilerplate generation) can be driven from the CLI or Rider; you don't need VS for that.
You may have some problems with 3rd-party libraries. Not all of them have been ported to .NET Standard. I'd recommend checking your project's dependencies in advance. (We did have some problems with legacy COM-based cryptographic libraries and had to build a separate "microservice" hosted under Windows to offload this stuff to. COM stuff works fine under .NET Core, but only under Windows (obviously).)
I develop on Windows with Visual Studio and deploy to “Linux”. We use CodeBuild to compile and create the zip file (lambda) or Docker Container (Fargate).
CodeBuild is basically "serverless builds". It launches a prebuilt or custom Docker container, you run the .NET Core "publish" command, which pulls all of your NuGet packages, and then you run your standard packaging commands.
But Rider should be good. I’m a huge fan of R# and you can cross target Linux, Mac and Windows from either host using msbuild.
But there are reported issues with sql:
I can understand how value types would help a lot, but it is possible to optimize Java to avoid object headers on the heap at great pain, i.e. avoiding classes and doing index arithmetic on flat arrays of values. Value types coming to the JVM will obviously help here.
Curious to hear about other advantages the CLR has.
>* Faster iterations. (just hit F5 in visual studio)
Is this some sort of code sync to servers/cloud?
Suffice it to say: my job is getting easier in the future (thanks mostly to Stadia).
I see enough congestion on just the local 2.4GHz spectrum out here that it doesn't really matter how fast the data center is. Without some form of dead reckoning you don't have much room for error.
However, Stadia is Linux, so if you want to be on Stadia, you need a Linux version of the game, which means primitives need to be ported. Once primitives are ported for the client, they can be optimised for the server, and then you have a Linux gameserver. Which is good for me.
Did you move to Azure?
But they really gouge the other cloud providers (GCP/AWS) on licensing; more than a third of our infra costs are just Windows licenses (when looking only at cloud).
Wow, that is brutal.
I'm currently writing a (much, much simpler, agar.io-like) multiplayer game and finding it quite challenging to make the experience smooth. I haven't implemented any lag compensation yet, so maybe that's why, but I wonder if that's the only problem. Do you or anyone else have any resources that could help me?
In my experience, all of these languages are fairly well equipped for the problem space. Even the interpreted/garbage-collected ones can handle these problems well, as they are (in my experience) infrequently CPU bound; the scaling considerations usually come in around your usage of databases and caches (network or IO bound). When you have enough players to hit CPU boundaries, you also typically have enough money to buy some pretty beefy CPUs. The caveat to this is that concurrency gives huge wins, since a lot of the requests/data are not related to each other (beyond the 10-20 people in a game), so languages or frameworks that don't solve concurrency and/or async IO well are at a disadvantage. This puts Python/Django and Node.js further down the totem pole for me. Of course, there are ways to resolve those problems even within those languages, so they are far from ill-equipped.
When I get my choice, I choose something garbage collected in a VM, with mature web frameworks (basically Java & C#) for online services in games. Knowing an engineer's errant null pointer isn't going to tip over your whole process is pretty handy, and the CPU gains in C++ or Go (native code) aren't put to as good of use here as they are in gameplay engineering where you're trying to squeeze cycles out of client machines. Client machines don't really get more powerful the more players you have, but your servers can, haha.
Below are two articles which I think can be really valuable "Baby's first" for the topics of Multiplayer Gameplay Engineering and Online Services for Games. I wrote them beginning from my perspective of approaching these problems my first time as a college student many years ago, and continued with how my approach evolved through experience. I also include lots of links to seminal articles/talks on the topics that I read as I attempted to "do my homework" while making my own indie game.
Multiplayer Gameplay Engineering
Online Services for Games
I often wonder between Games and Enterprise Software, who has a more complex backend architecture?
At first glance, it seems like games are complex as hell: when a gamer shoots another gamer dead, processes and resources have to be freed and re-allocated, and with tens or hundreds of people playing the same game over the internet, both the gaming architecture and the network infrastructure are gonna be complex.
Would someone help me out here?
If you want to explode the complexity of each of these "classes" of applications, just add the thing that makes the other class difficult to the requirement:
- A game that allows players in its world to interact tightly with an entirely different game world (like "really different", with entirely different rules and game play) which also evolves independently over time and on an uncontrollable schedule
- An Enterprise app that implements super complex processes with thousands of different moving parts, modeled out into extreme detail and without the possibility to resort to "this is handled organizationally"
Can you imagine anyone at a game company wanting to replatform the hit-detection system onto hadoop?
I imagine it must be quite difficult switching careers between games and (particularly larger scale) web projects. Imagine defaulting to Ruby on Rails to handle your multiplayer infrastructure and trying to model the game state through REST.
Typical (simplified) day for me: either download files or expect files to be delivered by start of business (typically 8am). Files need to be cleaned and uploaded. Changes need to be sent to untold numbers of other systems. 8:30am-3pm: various trading activity, new instruments being added, quants wanting new benchmarks, etc., more data loaded. 3pm: trading ends, mad rush to calc end-of-day returns, pricing, and positions. All of this has to be sent to external accountants and auditors within about an hour.
I'd imagine games largely don't have many external dependencies (maybe on a service such as Steam, Xbox Live, PS Network, etc.), and they don't have to process and/or deliver files from/to numerous sources for the game to work. They just need the service live. They don't have to wait 15 minutes checking a Bloomberg SFTP to find out that their request failed for some untold reason.
I wouldn't be surprised to find AAA gaming and financial networking to be similarly, but differently complex for different and yet similar reasons. Both will want to reduce latency to the greatest extent possible, but for very different reasons.
At the same time we are seen just as a cost, not as revenue enablers (no comment on this), so we are very limited in the money and technologies we can buy. It is also a very limited market where most companies buy rather than build, so you are usually stuck with the 3-5 real offers for any area. I personally know 2 products we use that are considered industry best, but they are pretty bad; the rest of the similar products on the market are worse, about 10 years in the past.
For obvious reasons, SLAs form an integral part of the IT services industry, and if I am not wrong, BMC Software's SLM module is still written in C++; at least that's what the documents say.
So far, I have never really seen any C++ stack trace in any of the log files, but some day hope to. Furthermore, I have never even seen any C++ related file in any of the configuration files.
The architecture used has limitations (the biggest one is that fps-style real time isn't really possible) but it's aged very well.
Our backend consists of a few somewhat large services that are broken up mostly around how they are sharded.
The biggest one is the account authority, which contains most of the account/character/item data and handles the vast majority of traffic.
We have 5 shards of that (sharding on account id) with 2 read only replicas of each one of those. All the read only requests go to one replica, and the other replica is for redundancy.
There are also other services like the party manager, ladder authority, instance manager, etc.
All of those shard on different things, which is why they are separate services.
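As a sketch of the sharding scheme (names and replica policy illustrative, not GGG's actual code):

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: 5 shards keyed on account id, each with a
// primary plus two replicas. One replica serves all read-only traffic;
// the other exists purely for redundancy and takes no traffic.
constexpr int kShards = 5;

std::string route(uint64_t account_id, bool read_only) {
    const uint64_t shard = account_id % kShards;
    return "shard-" + std::to_string(shard) +
           (read_only ? "-read" : "-primary");
}
```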
The instance managers handle creating game instances which are the servers that the players actually play on.
We have a pile of servers, which we call instance servers, each of which runs an instance spawner. When it starts up, the instance spawner adds its capacity to an instance manager and creates a process called a prespawner. This prespawner loads the entire data set of the game and then waits for orders from the instance spawner.
When the instance spawner wants to create a new game instance, the prespawner runs fork(), and then the new process generates its random game terrain, which takes a few hundred milliseconds.
Because all the game resources are loaded before the fork, they are already available in memory that is shared between all of the instances running on the machine. Therefore each instance only takes 5-20 MB of memory, which is mostly the generated terrain and monsters.
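The prespawner pattern can be demonstrated with a small fork() sketch (POSIX-only; names hypothetical, and the exit-status return channel is purely for demonstration). The parent's loaded data is shared copy-on-write with the child, so only the child's freshly generated terrain costs new memory:

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>
#include <vector>

// Hypothetical sketch of the prespawner: game_data stands in for the
// fully loaded game data set. fork() creates an "instance" that shares
// the parent's pages copy-on-write; only pages the child writes (its
// terrain) become private memory. Returns a value computed by the child.
int spawn_instance(const std::vector<int>& game_data) {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: generate private terrain from the shared data set.
        std::vector<int> terrain(1024);
        for (size_t i = 0; i < terrain.size(); ++i)
            terrain[i] = game_data[i % game_data.size()] + static_cast<int>(i);
        std::_Exit(terrain[10] & 0xFF); // exit status is only 8 bits
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WEXITSTATUS(status);
}
```

A real prespawner would exec into the instance's main loop rather than exit, but the memory-sharing mechanics are the same.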
We typically run about 500 instances on the cheap, min-spec, single-processor Xeon servers we rent. This used to be around 1600 instances in the early days, but the game got more and more CPU-intensive over the years as it became more hectic.
All the instances connect to routers. There is one per instance machine; these connect to a few routers per datacenter, which in turn connect to a set of core routers that also have all the backend services connected to them.
These routers are important because they know where everything and everyone currently is.
The routers work sort of like internet routers, but instead of IP addresses, you address your requests to logical entities or groups, which can move around and which the router network is tasked with keeping track of.
So for example, when you whisper someone, you are sending a message to Account:123 and it will find its way to whatever server currently has Account:123 on it. If you send a message in global chat to GlobalChat:1, it will be multicast through the network to all the servers which have currently registered an interest in hearing GlobalChat:1.
If you add someone to your friend list, the server you are on will register interest in the multicast group AccountSession:123, which is the group that account 123 multicasts all its status updates to, like moving between zones or leveling up.
Parties, Leagues, Guilds, Etc, etc. All of these things have multicast groups associated with them.
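A toy sketch of this interest-registration routing (hypothetical API, ignoring the wire protocol and the multi-tier router topology): servers subscribe to logical groups, and a publish fans out to whoever is currently interested.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: a router that tracks which servers have
// registered interest in which logical groups (e.g. "GlobalChat:1",
// "AccountSession:123") and fans messages out accordingly.
class Router {
    std::map<std::string, std::set<int>> interest_; // group -> server ids
public:
    void subscribe(const std::string& group, int server) {
        interest_[group].insert(server);
    }
    void unsubscribe(const std::string& group, int server) {
        interest_[group].erase(server);
    }
    // Returns the servers a message to this logical group reaches.
    std::vector<int> publish(const std::string& group) const {
        auto it = interest_.find(group);
        if (it == interest_.end()) return {};
        return {it->second.begin(), it->second.end()};
    }
};
```

The nice property is that senders never need to know where an account or party currently lives; the interest table is updated as entities move between servers.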
If you have any more questions then feel free to ask.
Very interesting to see a GGG answer here! I admit to being very curious about the Path of Exile architecture, and your answer has barely whetted my appetite. I have some questions that might give me better clarity over the architecture, if you are up for answering them:
1. How is data replicated across regions? And how is trade across regions handled? Do the instance servers hand over character data to the account authority in the new region?
2. I remember speculation about some builds that caused extreme amounts of server side compute and slowed things down, was this compute performed on the instance servers? Like poison/chain/monster damage calculations?
3. Is there any sort of automated detection of inconsistent game states done by the instance servers? Duping protections or some such?
4. What is the scaling plan like at GGG? Does the system have obvious bottlenecks that are known or is it easy to scale for the near future?
Has there been any protocols which were written from scratch because an out of the box solution simply didn't do what was needed?
Do you leverage any kind of out of the box solutions to co-ordinate services? e.g apache zookeeper?
Can you describe how nodes are added or removed to your swarm based on demand? I presume this is non linear?
Edit: I've been playing since pre-alpha. Glad it's still going strong.
Thanks for taking the time to write up that overview, very cool to read!
Maybe there is some information on some of the methods they use if you google for them (I haven't searched myself).
Among the unique features are globally distributed hot-deploy of server-side code/resources and a globally distributed JSON database over HTTP. It's completely async and concurrent, and all threads work on all memory, which is not what most systems do if you look under the hood.
Only Java (cue the downvotes without arguments) has good enough non-blocking IO and a good enough memory model for networked concurrency to be workable on the server side, but most implementations are bloated.
The real discussion we should have is tick vs event based protocols, I'm 100% convinced we need to move away from tick based protocols.
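To make the distinction concrete, here is a toy sketch (my framing, names hypothetical): a tick-based protocol emits an update every fixed interval whether or not anything changed, while an event-based protocol emits only when state actually changes.

```cpp
#include <vector>

// Hypothetical sketch contrasting the two protocol styles over the
// same sequence of per-tick state values.
struct Update {
    int tick;
    int value;
};

// Tick-based: one update per tick, changed or not. An idle world
// still produces traffic.
std::vector<Update> tick_protocol(const std::vector<int>& state_per_tick) {
    std::vector<Update> out;
    for (int t = 0; t < static_cast<int>(state_per_tick.size()); ++t)
        out.push_back({t, state_per_tick[t]});
    return out;
}

// Event-based: an update only when the value differs from the last
// one sent; silence means "nothing happened".
std::vector<Update> event_protocol(const std::vector<int>& state_per_tick) {
    std::vector<Update> out;
    for (int t = 0; t < static_cast<int>(state_per_tick.size()); ++t)
        if (out.empty() || out.back().value != state_per_tick[t])
            out.push_back({t, state_per_tick[t]});
    return out;
}
```

The trade-off, of course, is that event-based delivery makes loss recovery and ordering harder, since every message now matters.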
I think you should post some sources or more detailed arguments if you are going to make that claim.