
Elo sucks – better multiplayer rating systems for smaller games (2019) - brownbat
https://medium.com/acolytefight/elo-sucks-better-multiplayer-rating-systems-for-smaller-games-8ca588ee652f
======
defertoreptar
The author didn't benchmark to see if this system is actually any better at
predicting outcomes than vanilla Elo. That's how you determine if your implied
win probabilities are accurately being derived from rating differences. The
author seems to be under the impression that there's something fixed and
concrete about an 1800 rating, but when you change the system, you also change
what an 1800 rating means in the first place.

Some of these complaints are solved by existing systems, namely Glicko. For
example, rating deviation helps with experienced players (low RD) losing
points to newer players (high RD). It also has a built-in way to discourage
inactivity. Players' RD increases over periods of inactivity, so they can be
excluded from the leaderboard after reaching a certain point. That allows us
to maintain their rating without decreasing it. After all, that's our best
guess of the player's skill. It's just a less reliable guess over time.
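For illustration, the RD-inflation step is simple (a minimal sketch of Glicko's step 1; the constant c is taken from the worked example in Glickman's paper and would normally be fitted per game):

```python
import math

def inflate_rd(rd, periods_inactive, c=34.6, rd_max=350.0):
    """Glicko step 1: rating deviation grows with inactivity, capped at rd_max."""
    return min(math.sqrt(rd ** 2 + c ** 2 * periods_inactive), rd_max)

# An experienced player's low RD drifts back toward "unrated" (350)
# over roughly 100 idle rating periods with this choice of c.
for t in (0, 10, 100):
    print(t, round(inflate_rd(50.0, t), 1))
```

Once RD climbs past some threshold, the player can simply be hidden from the leaderboard; the rating itself never needs to be decreased.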

~~~
mcnamaratw
My understanding was that the system consists of using the historical odds of
winning (given the rating difference). If you benchmark that _using only past
data_ , I think it is by definition the most accurate system. (The data is
always a better fit to itself than a theoretical fit is.)

Naturally future data is much harder to deal with than past data. But even for
future data it's not obvious that ELO (or any other theoretical fit to the
odds of winning) will be more accurate than the historical odds.

~~~
BSTRhino
Yes, the best fit for the data is the data itself, it's a tautology. Nothing
wrong with Elo's exponential curve, it just can't beat the actual data.

You raise a good point in that I could've created a training set and a test
set, that probably would be a better validation. But I don't know, I'm not
doing science, I'm making a game.

On the topic of whether the future matches the past, the predictions were
based on a rolling database of the past 100000 matches, which is approximately
the number of matches played per 7 days. So my theory is that the data is
quite recent and up-to-date and so should match, in general.

Of course I never tested this. In the end, I'm not doing science, I'm making a
game. If retention goes up and complaints are down, then I can't keep working
on the rating system; there are 1000 other things to do.
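For anyone curious, a rolling empirical predictor like the one described can be sketched in a few lines (a hypothetical illustration, not the actual Acolyte Fight code; the 25-point bucket width is an arbitrary choice):

```python
from collections import defaultdict

class EmpiricalPredictor:
    """Predict win probability from observed outcomes, bucketed by rating gap."""
    def __init__(self, bucket_width=25):
        self.bucket_width = bucket_width
        self.wins = defaultdict(int)
        self.games = defaultdict(int)

    def _bucket(self, rating_diff):
        return round(rating_diff / self.bucket_width)

    def record(self, rating_diff, higher_rated_won):
        b = self._bucket(rating_diff)
        self.games[b] += 1
        if higher_rated_won:
            self.wins[b] += 1

    def predict(self, rating_diff):
        b = self._bucket(rating_diff)
        if self.games[b] == 0:
            return 0.5  # no data for this gap: fall back to a coin flip
        return self.wins[b] / self.games[b]
```

With ~100,000 recent matches, even narrow buckets collect enough samples that the measured rates can beat a one-size-fits-all curve, which is the point being made above.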

~~~
mcnamaratw
Yeah, I'm not giving advice on how you should do it. I was just unsure whether
critics here had understood that measured data is probably better than any
theoretical fit, even the revered ELO.

------
dvt
Elo is great for what it was built for: ranking chess players. Chess is (1)
extremely low-variance, (2) has an extremely high skill ceiling, and (3) is
1-on-1. Elo works great for chess, but it would _never_ work for something
like Poker. Let's briefly go over these three points.

Most games aren't chess -- where the only variance is picking who's black and
who's white -- in fact, they might include dozens of RNG mechanics (from
critical strikes to ability rolls, to spawn points). These mechanics (while
fun and well-designed) might pollute your "idealized" model. There's also the
problem of RPS (rock-paper-scissors) mechanics or pick-counter-pick mechanics
which will also heavily skew win rates. For instance, given a slow combo Magic
deck, you will most likely auto-concede to mono red aggro (regardless of skill
level). If you're using Elo, this will pollute your model. (Hint: you
shouldn't be using Elo.)

Most games also don't have chess' high skill ceiling. Chess has such a high
skill ceiling for a number of reasons -- it's one of the oldest games still
being actively played, for one. Suppose your "game" is simply the flip of a
coin (everyone wins 50% of the time). Zero skill involved. Trying to model
win-loss ratios using a sigmoid curve is silly. Obviously, no game is going to
be a coin flip, but there's a world of difference between chess and DOTA.

TrueSkill attempts to fix (3) by using clever Bayesian updating on a player-by-
player basis[1], but in reality it's a shit-show. Using Elo (or variants
thereof) for team-based games where the team isn't really a team (more like
3-5 random people plopped together for one match) is incredibly misguided, but
continues to be implemented in just about every modern multiplayer game (to
the players' frustration). Of course, mixing and matching pre-made groups with
non pre-made groups creates as many issues as you might imagine.

In short, it's a bit bizarre that so many game devs are enamored with Elo
when it comes to ranking.

[1] [https://www.microsoft.com/en-us/research/wp-
content/uploads/...](https://www.microsoft.com/en-us/research/wp-
content/uploads/2007/01/NIPS2006_0688.pdf)

~~~
CWuestefeld
My wife was a champion table tennis player. This sport uses Elo as well, and I
know from watching the sport over time that the rating system has real
problems. It doesn't suffer from the weaknesses that you cite, but even so,
the problem of "rating inflation" is widely discussed.

It seems that much of the problem comes from rating points brought in by
newbie players (and note that, contra TFA, the problem isn't with experienced
players losing to newbies, but the opposite).

A newbie is started off with some nominal rating; I forget the number, but
let's say it's 800. Most likely that newbie is going to lose his first
matches, and some proportion of those newbies will get frustrated and quit.
For the ones that stay in the game, things probably work out in the long run.
But for those that got discouraged and quit, in the course of their loss they
caused a few points (not many, because they're likely way overmatched, but
definitely more than 0) to be credited to their opponents. When they quit the
sport, they're never going to reclaim any of the rating points that they lost
initially. But those points are still in the system, having been added to
their winning opponents.
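The leak is easy to demonstrate with a toy simulation (a deliberately simplified sketch: every newbie enters at 800, loses one game to the same veteran under standard Elo, and quits):

```python
def expected(ra, rb):
    # Standard Elo expected score for player a against player b.
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def simulate_quitting_newbies(n_newbies, veteran=1600.0, entry=800.0, k=32):
    """Elo is zero-sum among active players, but the points a quitter
    leaves behind stay in the pool, so the veteran's rating creeps up."""
    for _ in range(n_newbies):
        veteran += k * (1 - expected(veteran, entry))
    return veteran

print(simulate_quitting_newbies(1000))  # noticeably above 1600
```

Each quitter leaves only a fraction of a point behind, but it never comes back, so the surviving players' ratings drift upward over the years.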

It's hard to quantify because the Elo system is the only objective comparison
we have, but over the course of the almost 30 years I've been watching my wife
play, the Elo rating enjoyed by a player of a given hypothetical skill level
has increased dramatically. Many are saying that for someone of the upper
echelons, their rating is maybe 200 points higher than it would have been 30
years ago.

So back in 1991, my wife was in the top 30 women in the USA with a rating in
the mid-1700s. Today, someone with that rating isn't even going to be in the
top brackets of a serious tournament.

Despite all that, the usefulness of the rating system keeps it in use as a
valuable tool. It seems that the ability to match players who have never seen
each other before, ensuring interesting matches, is part of keeping the game
competitive for those in it. And table tennis is also, because of this, one of
what I believe are the few sports where men and women often play head-to-head
(even though men generally have much higher ratings, on account of the sport
requiring far more strength than you might suspect).

~~~
dvt
> It doesn't suffer from the weaknesses that you cite, but even so, the
> problem of "rating inflation" is widely discussed.

Ah yes! Inflation is also a problem I've seen in competitive online games.
Rating inflation was a serious issue with World of Warcraft PvP arenas circa
10 years ago (iirc Blizzard hard capped arena ratings at 3000 during WotLK). I
don't follow chess much, and I'm not exactly sure how chess avoids it (or even
_if_ it does).

~~~
freeone3000
By the point you're playing ranked matches in chess, you're generally invested
enough to keep playing. However, chess has a (statistically) significant
inflation problem, to the point where you can only compare scores within the
same decade or so meaningfully.

------
CodesInChaos
1\. The sigmoid function is the closest thing to linear that makes sense on
probabilities⁺. A purely linear function would cross 0%/100%, while the
sigmoid flattens exponentially as it approaches the extreme values.

2\. The fit isn't as bad as the author claims. It looks like the biggest
difference between the graphs is that the point differences are scaled
differently (400 pts for 90% in elo vs 800 pts in the second graph).

A quick and dirty overlay of the two graphs shows a reasonable fit:
[https://ibb.co/0YwYH9z](https://ibb.co/0YwYH9z)

3\. I like observations about player psychology. Satisfying the players is
more important than having the mathematically best ranking system.

4\. Personally I like Whole History Ranking ([https://www.remi-
coulom.fr/WHR/](https://www.remi-coulom.fr/WHR/)), but it's unlikely to be
popular with players (the psychological criticisms the article makes apply to
it as well, with some additional problems, like rank drifting without
playing). KGS, which uses a ranking system similar to WHR (but more
primitive), certainly draws a lot of criticism for its ranking system.

If I had to design a mathematically optimal ranking system, I'd start with WHR
and make parts of it trainable/fittable.

\----

⁺ Bayes' theorem turns into addition when applied to logarithmic probabilities
and the sigmoid function converts from logarithmic probabilities to normal
probabilities. This property is why it (or its multi category equivalent
softmax) is used when predicting probabilities using logistic regression or
neural networks.
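The footnote's claim is easy to verify numerically: Elo's expected-score formula is exactly this sigmoid with the rating difference rescaled.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

# sigmoid is the inverse of the log-odds transform
p = 0.9
assert abs(sigmoid(logit(p)) - p) < 1e-12

# Elo's expected score is the same sigmoid, just rescaled:
#   1/(1+10^(-d/400)) == sigmoid(d * ln(10)/400)
d = 400.0
elo = 1 / (1 + 10 ** (-d / 400))
assert abs(elo - sigmoid(d * math.log(10) / 400)) < 1e-12
```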

------
jrek
Elo might or mightn't suck (imo it's a great ranking system). But the article
sucks. Vanilla elo is built around chess and some adjustments to the scale
and/or K-factor might be necessary to fit the circumstance. A quick change of
scale to E = 1 / (1 + 10 ^ ((Rb - Ra) / 800)) and all of a sudden ELO very
accurately reflects the game's actual results:
[https://imgur.com/a/rFP5U0g](https://imgur.com/a/rFP5U0g)

Meaning just that skill is a weaker factor in this game than in chess...

Edit: The 'actual' curve includes a correction for the obvious anomaly of ~55%
win expectation at 0 point delta.
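The rescaling amounts to one extra parameter in the standard formula (a sketch; the 1800-vs-1400 gap is just an example):

```python
def expected_score(ra, rb, scale=400):
    """Elo expected score for a vs b; a larger scale flattens the curve."""
    return 1 / (1 + 10 ** ((rb - ra) / scale))

# A 400-point gap: ~91% favourite under the chess scale,
# only ~76% under the flatter 800-point scale.
print(round(expected_score(1800, 1400, scale=400), 3))  # 0.909
print(round(expected_score(1800, 1400, scale=800), 3))  # 0.76
```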

------
IanGabes
Creating a custom system to suit your situation's needs sounds great, and the
thought process was fun to read, but some of the claims lobbed here are pretty
questionable.

Specifically, the claim that Dota's matchmaking system is "probably wrong"
because the model chosen doesn't match your own findings feels like a reach.
Sibling commenters have pointed out how skill variance is important to allow
the ELO system to function in games like chess. Additionally, someone else
pointed out that the sigmoid function is similar to a linear function close to
zero.

It seems _at least_ as likely that Acolytefight doesn't have a high enough
level of skill expression present in the game for top players to "curve out"
weaker players, rather than exponential mappings of player skill being
useless or wrong.

Does elo suck? Maybe, but this hasn't convinced me.

------
runarberg
I remember a bit back the Go server that I play most of my go these days
[OGS]([https://online-go.com](https://online-go.com)) changed their ratings
from Elo to Glicko-2.

You can read their rationale for it in this forum: [https://forums.online-
go.com/t/ogs-has-a-new-glicko-2-based-...](https://forums.online-go.com/t/ogs-
has-a-new-glicko-2-based-rating-system/13058)

The key takeaway is this:

> Most of the shortcomings [of Elo] can be traced back to the fact that the
> system is too slow to find a player’s correct rank, and too slow to adapt
> when jumps in strength occur.

> The problem of slow moving ratings is a well-known problem with Elo
> implementations. In response to this, Prof. Mark Glickman developed the
> Glicko, and later Glicko-2, rating systems which address this problem very
> well and are fairly widely used

A few weeks ago they made an update to their implementation of Glicko-2; in
the announcement they summarized many interesting statistics on
how the system has panned out for them: [https://forums.online-
go.com/t/2020-rating-and-rank-tweaks-a...](https://forums.online-
go.com/t/2020-rating-and-rank-tweaks-and-analysis/28649)

------
BSTRhino
Wow, I wrote this article ages ago, didn't expect to see it posted here today.

I just want to clarify the point of the article:

Why would you fit a curve to the data when you can just use the actual data?

That's the point of the article.

We're in the age of big data, we should use it to make better win rate
predictions. Elo's exponential curve is fine, it's approximately right, it's
just now we can have databases of millions of games and we can just do better.
Elo was invented before the big data age and it is limited by that.

That's all I'm saying.

I shouldn't have included all the other stuff in the article, it just
distracts from the point.

~~~
OisinMoran
Thanks for writing the article and sharing your work with the world, I really
enjoyed it! I think the central point you make is very interesting.

I'd be interested to know what fit you used for the red "line of best fit",
why not a straight line? My main question here is do you actually expect a
player ~210 points above another to win _less_ than if they were only ~190
points above? (the first dip in the red graph)

------
dcl
If you're interested in evaluating and rating/ranking agents, it might be
worthwhile checking out DeepMind's multidimensional Elo rating system
([https://arxiv.org/abs/1806.02643](https://arxiv.org/abs/1806.02643)) which
attempts to solve some of the issues with Elo and Glicko. Most notably, the
ability to handle non-transitive interactions (like rock, paper, scissors) and
the presence of redundant duplications of matches that might erroneously
inflate ratings.

Shameless plug, I've created an R implementation of it here:
[https://dclaz.github.io/mELO/](https://dclaz.github.io/mELO/)

~~~
sali0
This is fantastic, thank you for bringing this up.

------
noctilux
I'm curious about whether the author tried to optimize Elo's K factor. It's
often left at 32, which is not reasonable for all contests. It's essentially
related to the standard deviation of player skills: if there is a large range
of skills, it should be large, and if there is a small range, it should be
small. It's easy to tune by optimisation, and it has a huge effect on
predictive ability.
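One way to tune K is to replay historical games and minimize prediction loss (a sketch assuming a simple `(player_a, player_b, score_a)` match log; the 1500 starting rating and the grid range are arbitrary choices):

```python
import math

def log_loss_for_k(games, k, scale=400):
    """Replay historical games with a given K, accumulating log-loss.

    games: list of (player_a, player_b, score_a) with score_a in {0, 0.5, 1}.
    """
    ratings = {}
    loss = 0.0
    for a, b, score_a in games:
        ra, rb = ratings.get(a, 1500.0), ratings.get(b, 1500.0)
        e_a = 1 / (1 + 10 ** ((rb - ra) / scale))
        p = min(max(e_a, 1e-9), 1 - 1e-9)  # clamp away from 0/1 for the log
        loss += -(score_a * math.log(p) + (1 - score_a) * math.log(1 - p))
        # Standard zero-sum Elo updates for both players.
        ratings[a] = ra + k * (score_a - e_a)
        ratings[b] = rb + k * (e_a - score_a)
    return loss / len(games)

# Grid-search K on your own match history:
# best_k = min(range(8, 65, 4), key=lambda k: log_loss_for_k(history, k))
```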

------
HideousKojima
The more obvious solution is to bring back custom lobbies and private servers
and forget about ranking players at all. Gets rid of a lot of bad behavior too
because servers can police their own communities and players won't get
frustrated when a crappy teammate is dragging their ranking down

~~~
LoSboccacc
idk, that makes it extremely hard to find matches in games with a smaller
player base

see War Thunder: the simulation queue is a desert, high-tier ships a
wasteland; unless all the available players get forcibly lumped together,
matches will just not happen

compare with Stormworks too: most servers are empty in my timezone and the
populated ones are password protected or spawn limited. It wouldn't take much
to get known and participate in their community, but for working people the
time commitment is simply impossible.

same with Arma 3: I'd love to get into Shack Tac but timezone and commitments
make it unavailable to me, and since most of the good players are sucked up
into teams, the public servers are a mess of "what's left" of the community

~~~
HideousKojima
Matchmaking without custom servers/lobbies makes finding a match even harder,
since a minimum number of users in a specific ranking/skill level/ship
tier/whatever must all be online and searching for a match at the same time.
Custom servers and lobbies allow just one or two players to start, and they
advertise to other players that they are available to play. The initial
players just need to wait until more people show up, and can play more casual
game modes or with bots or whatever until more people arrive.

------
im3w1l
> If we take a top-level player, and make them fight a high-level, mid-level
> and low-level player repeatedly until we can become statistically confident
> of their win rates against each, there is no reason why their win rates
> would fit an exponential curve.

When I first read this, I thought to myself "well we get to pick the scores,
so it's exponential by definition". The problem becomes more clear when you
express it without any reference to the scores.

If Player A wins 80% of the time against Player B, and Player B wins 80% of
the time against Player C, how often does Player A win against Player C? This
is a question purely in terms of observables. Elo makes a prediction here
(94.1% of the time) and it can be either right or wrong. If it's wrong, then
there is no valid assignment of scores.
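The 94.1% figure follows directly from Elo's formula:

```python
import math

# If A beats B 80% of the time, Elo implies a fixed rating gap:
#   0.8 = 1/(1 + 10^(d/400 * -1))  =>  d = 400 * log10(0.8/0.2)
d = 400 * math.log10(0.8 / 0.2)          # ~240.8 points

# B beats C 80% too, so A sits 2d above C. Elo then forces:
p_a_beats_c = 1 / (1 + 10 ** (-2 * d / 400))
print(round(p_a_beats_c * 100, 1))  # 94.1
```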

------
gverrilla
Isn't a qualitative system possible? It would be really complex to create for
a game such as dota2 or cs:go, but maybe not for a simpler game. I will give
cs:go as an example only because I know it very well.. It would be possible, I
believe, in theory, to measure player knowledge of specific in-game skills.
New cs players for instance wouldn't know how to control recoil
effectively. And 100% of global elite/pro players would be above a certain
threshold regarding recoil control. On the other hand, you could say with a
lot of confidence that a player who tries to reach high ground by pressing
only +jump multiple times with no success, when he would need a crouch jump
instead because of the height, is a noob. Elo or something similar could then be
used to measure ranks within specific clusters only. And some form of
mood/form on top of this, to allow for better experience (even though I have
played cs for 20y now, it could happen that I abandon the game for a few
months, or that I have a really bad focus because of external events).

I'm not sure if this makes sense, but what I know for sure is that as an
experienced player, I can watch a player play a single game (sometimes a few
rounds), and assess his average rank/skill level with high confidence, with no
need of information from his prior games whatsoever, or detailed statistics of
his gameplay.

There's something else to remember for high skill-ceiling games: winrate is
not what really matters. A lot of times I will play a very good, balanced and
fun game and lose. Sometimes it will even happen with very uneven scores like
16-5 or something...

------
closed
I am pretty sure the author is describing a well-understood limitation of
Elo; it just needs a tiny bit of connecting to existing models.

Elo can be thought of as an approximation to item response theory models [1].
These describe skill as normally distributed, and model whether one person
will win using a logistic function (not exponential).

I think what the author has keyed in on is that afaik in simple Elo there is
no slope coefficient for the logistic, but in general IRT models there is
(called item discrimination). So in Elo you can't learn that flatter curve
they show.

[1]:
[http://hvandermaas.socsci.uva.nl/Homepage_Han_van_der_Maas/P...](http://hvandermaas.socsci.uva.nl/Homepage_Han_van_der_Maas/Publications_files/papers/klinkenberg.pdf)
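The discrimination parameter mentioned above is just a slope on the logistic (a sketch of the 2PL idea on a natural-log scale, not the exact model from the linked paper):

```python
import math

def p_win_2pl(skill_a, skill_b, discrimination=1.0):
    """2PL-style win probability: the discrimination parameter sets the slope.

    With discrimination=1 this is the plain logistic curve; smaller values
    flatten it, which is what the article's measured curve looks like.
    """
    return 1 / (1 + math.exp(-discrimination * (skill_a - skill_b)))

gap = 2.0
print(round(p_win_2pl(gap, 0, discrimination=1.0), 3))   # steep curve
print(round(p_win_2pl(gap, 0, discrimination=0.3), 3))   # flatter curve
```

Simple Elo fixes the slope by convention, so it can never learn that a given game's curve is flatter than the chess default.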

------
duaoebg
Repeated Bernoulli trials give rise to Gaussian distributions, which is where
the exponential comes from.

This is an assumption and an approximation, and not necessarily a good fit.
Pulling from actual probabilities would generally perform better.

The rest is massaging to better fit the different objectives.

------
edaemon
The "newbie suppression" mechanic doesn't make much sense to me. If you play
against someone substantially lower in rating than you and lose, shouldn't you
lose a significant amount of points? After all, you lost to someone you should
have easily beaten.

~~~
ganonm
I agree, and the proposed solution which is to limit point gains/losses to one
point per game feels like throwing the baby out with the bathwater.
Specifically, convergence takes a long time, the result of which is that a
very good player on e.g. a new account (smurf) will end up being the cause of
a lot of unbalanced games for an awful long time.

Having played a lot of ranked LoL, I saw a few recurring but irrational gripes
players had with the Elo based system:

\- "I get matched with bad teammates and they drag me down". On average your
teammates are the same Elo as you. All players get their fair share of games
where they are/aren't the underdog side. On average, it averages out. Deal
with it.

\- "I've been stuck at the same Elo for ages but I should be higher". Nope,
Elo only cares if you win or lose. It doesn't care about kill/death ratio,
creep score or how many ganks you pull off. Focus on winning more.
Incidentally, focusing on winning instead of secondary metrics like kills/CS
was one of the biggest mindset differences between high/low Elo players.

\- "I should be higher Elo but I play support roles so can't climb". It may be
true that you climb slower but here's the rub - think of your matchups as you
being compared to the enemy team's support player. The other four roles on
each team are actually a constant factor (by symmetry arguments you could not
consistently find that your four teammates are any better/worse than the enemy
support player's teammates). As a result, the only remaining factor in the
statistical equation is you weighed up against the enemy support player. If
you can provide even a slight statistical advantage towards winning vs them
then you will climb the Elo ladder.

~~~
aaronblohowiak
> As a result, the only remaining factor in the statistical equation is you
> weighed up against the enemy support player. If you can provide even a
> slight statistical advantage towards winning vs them then you will climb the
> Elo ladder.

An alternative explanation is that the skill ceiling is lower for support
players.

------
Godel_unicode
If your curve is linear, it's because your game isn't that hard (or more
formally, where winning and skill are less strongly correlated). This is tough
for people to hear if their game is "designed to be a high-skill game".

The curve being linear means essentially that skill in the game confers less
of a relative advantage. Chess is a good counterexample here, also rocket
league. Both are games where difference in MMR is very strongly correlated
with outcome, and both are games where skill is easily measured and highly
correlated with ranking.

------
sytelus
Take a look at TrueSkill, a much better mathematically grounded system,
created at Microsoft Research and used at scale on Xbox:
[https://en.m.wikipedia.org/wiki/TrueSkill](https://en.m.wikipedia.org/wiki/TrueSkill)

------
IshKebab
TrueSkill definitely has a time decay term and I'm fairly sure it lets you fit
the model to previous games. I wonder if the author actually tried it. (Though
to be fair I'm not sure if there are open source versions of the latest
version of TrueSkill.)

~~~
BSTRhino
Yes, tried Glicko then TrueSkill, both generated huge amounts of complaints.
New system produced few complaints. If the community had liked it, would've
stuck with TrueSkill.

~~~
IshKebab
TrueSkill 1 presumably?

------
neolefty
How about coop games — what would you use to rate players where the goal is to
win together?

------
EGreg
Wait why don’t we use a deep learning thingy on this dataset and just back out
a formula that predicts the wins based on just the relative numbers of the
people?

------
musicale
Nonsense - they're in the Rock and Roll Hall of Fame after all! Jeff Lynne is
a musical genius.

------
philliphaydon
Elo was in Age of Empires back when Zone.com was a thing.

It worked and worked well. Points were calculated for each person. However,
Dota 2 and LoL don't implement Elo the same way: points are calculated for
the team, so if you're low-rated and you win against higher-rated people, in
Dota and LoL you won't gain many points.

I believe this is done to avoid being carried but it doesn’t work because it
just results in you being stuck in a Low tier for ages.

TLDR: elo works and it’s great. No one implements it right.

Edit: In Age of Empires / Zone, if you had a 4v4, it used all 8 players to
calculate the Elo change for each individual player. So if your team ranged
from 1550 to 1750 Elo, the 1750 might gain only 1 point while the 1550 might
gain 16 (the highest gain shrank the more people played). On the losing side,
the lowest Elo would lose the least points and the highest would lose the
most.

dota / lol don't do this, the winning/losing team gains/loses the same amount
of points. This is wrong.

This means a high Elo player has the potential to farm points from low Elo
players with little risk, while low Elo players get stuck not playing people
in their own range.
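A per-player scheme of the kind described can be sketched like this (an illustration of the idea, not Zone's actual algorithm; each player is compared against the opposing team's average rating):

```python
def expected(ra, rb, scale=400):
    return 1 / (1 + 10 ** ((rb - ra) / scale))

def team_update(winners, losers, k=32):
    """Per-player Elo updates in a team game.

    Each player's expected score is taken against the opposing team's
    average rating, so a 1750 on the winning side gains less than their
    1550 teammate, and the highest-rated loser loses the most.
    """
    avg_w = sum(winners) / len(winners)
    avg_l = sum(losers) / len(losers)
    new_w = [r + k * (1 - expected(r, avg_l)) for r in winners]
    new_l = [r + k * (0 - expected(r, avg_w)) for r in losers]
    return new_w, new_l

new_w, new_l = team_update([1750, 1550], [1650, 1650])
# the 1550 teammate gains more than the 1750, as described above
```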

------
dang
I recall at least one large previous thread about Elo but can't find it.
Anyone?

~~~
jsnell
Maybe
[https://news.ycombinator.com/item?id=16255910](https://news.ycombinator.com/item?id=16255910)

------
afwaller
This is useful to increase plays by reducing “ladder anxiety”.

------
letmeinhere
Isn't that a logarithmic curve?

~~~
CodesInChaos
It's a sigmoid, which approaches the extremes exponentially far from 0 and is
somewhat linear near 0.

