
Making Sense of Super Smash Bros. Melee - panic
http://planetbanatt.net/articles/ambistats.html
======
jpk
Interesting article, but this stood out to me.

 _It frustrates me to see people in the smash community treat measures like
elo as "the truth" because they "don't have any human input". This is simply
factually incorrect - these so-called objective measures have as much human
input as anything else, codified into the constants and design choices of
their algorithms. Designing these things is as much an art as it is a science,
and the choice on how to weigh placements, upsets, losses, consistency, peaks,
and the like are all just that - choices, made by a human sitting in a chair
with Sublime Text 3 open._

I feel like this is applicable nigh everywhere. From social media timeline
sorting, to industrial processes, to Melee rankings. Using an algorithm
doesn't eliminate the human element from a system, it only abstracts it away.

~~~
rspeer
There is a large contingent of radical empiricists in machine learning who
assume "big data + automation = truth", especially on HN, and this is a
message they need to hear more of.

People have been advocating radical empiricism in some increasingly
uncomfortable contexts recently, and I hope it's just that it's the only thing
they were taught and the only way they know how to think about their craft.
The alternative is that an increasing number of people really do want machines
to triumph over human judgment and morality.

~~~
bo1024
This might be a case where classical econ ("social choice") can give some
helpful perspective. Arrow's impossibility theorem is the most famous
impossibility result in this area; it says that no algorithm can take in a set
of rankings (e.g. match outcomes) and produce an aggregate ranking in a way
that satisfies a small set of fairness criteria. This is classically
interpreted as saying that any aggregation method must be "unfair" in one way
or another.

~~~
thaumasiotes
> Arrow's impossibility theorem is the most famous impossibility result in
> this area; it says that no algorithm can take in a set of rankings (e.g.
> match outcomes) and produce an aggregate ranking in a way that satisfies a
> small set of fairness criteria.

Ignoring the parenthetical "e.g. match outcomes", this is a correct
description of Arrow's impossibility theorem. I don't see how match outcomes
could possibly be an example of a set of rankings in the sense of the theorem,
though.

~~~
bo1024
Yeah, that was unclear. I'm thinking of a match outcome between teams A and B
as a partial ordering on all the teams. Classically Arrow's deals with only
total orderings as input -- I'd been thinking that it extends to partial
orderings, but hmm, I'm not sure what the research says when the inputs are
restricted to just pairwise orderings/outcomes.

------
kendallpark
> For our purposes, Bloodgood serves as a great example of "closed pool"
> rating abuse. You get inflated ratings by being the best player in your
> playerpool, even if your playerpool is a relatively weak one.

In Melee there are people that end up as local kings that don't do well in
nationals. There are also people that are exceptionally good on a national
level but simply don't travel (aka "Hidden Bosses").

Nintendo is very hands-off with Melee so tournament organization remains in
the hands of the community. There is no single major overseer of Melee
tournaments. Anyone can hold a tournament and throw the bracket onto Challonge
or Smash.gg. I imagine if ELO was implemented as part of seeding, people would
start gaming the system.

> The way seeding gets done is that players get placed into broad tiers, and
> then those tiers are then fed into pools, attempting to avoid region
> conflicts or repeat matches from recent tournaments.

This is where the human-in-the-loop part of seeding shines. Mid-tier players
are entering national tournaments for the experience. They will not win, and
their reg fee is essentially donating money to the winner's pot. But what they
gain from the experience is tournament matches with players that they are not
familiar with. Many of them will only get two games in-bracket, so it's a huge
waste for them if they end up playing against buddies from their own region.

The community actively polices good seeding. There is often an outcry if, say,
too many Nor Cal players get shoved on the same side of a bracket.

~~~
slphil
Hidden Bosses always get exposed at nationals because no matter how good or
talented they are, they will get destroyed by players who are used to
competing against other national threats. We've encouraged our local hidden
boss (#1 in TN) to attend more nationals, but work schedules get in the way.
Just like in chess, Melee is only profitable if you are one of the best in the
world, and life gets in the way.

------
slphil
I also play chess at a competitive level (>2000, Expert in the US) and play
Melee at a low competitive level (playing in local meetups, winning a few
matches). I've had many arguments about ranking systems, ELO, etc with my
fellow Smashers, and I reached similar conclusions. This is a great writeup.

There are huge differences between the Swiss system used in chess (which works
great for ELO, since seeding is done by rating and players are not eliminated)
and the double elimination system used in Melee tournaments. I don't think
it's possible to have an objective ranking system in Melee because of the
intricacies of this issue (seeding influences final placement, low-seeded
players will hit a wall where they lose to high-seeded players earlier, etc).

~~~
gowld
And Elo, optimized for pencil+paper calculations, is obsolete for computer
games. Glicko supersedes it.

------
mcguire
Am I old? For a second there, I thought they were talking about _Melee_
([https://en.wikipedia.org/wiki/Melee_(game)](https://en.wikipedia.org/wiki/Melee_\(game\))),
"... a simple man-to-man combat boardgame designed by Steve Jackson, and
released in 1977 by Metagaming Concepts."

(With _Wizard_
([https://en.wikipedia.org/wiki/Wizard_(board_game)](https://en.wikipedia.org/wiki/Wizard_\(board_game\)))
and _The Fantasy Trip_
([https://en.wikipedia.org/wiki/The_Fantasy_Trip](https://en.wikipedia.org/wiki/The_Fantasy_Trip))
(Yay, 1970s!), _Melee_ made up the best fantasy role playing game. The only
competition is the Hero system; GURPS is definitely a victim of the second-
system effect.)

Edit: Yes, I'm apparently old. I'll return you now to your regularly scheduled
discussion.

~~~
aidenn0
How complex is TFT? 1980 is a bit of a nexus for RPGs with too many rules
(e.g. the first edition of Rolemaster was published that year).

~~~
mcguire
Fundamentally, it is (was?) very simple---the basic rules were in two pocket
games. Characters had three basic characteristics, strength (also a proxy for
endurance and damage tolerance), dexterity, and intelligence, plus skills and
assorted other

The TFT wiki page says, "_A revival of TFT and associated MicroQuest
adventures is underway at
[http://www.darkcitygames.com](http://www.darkcitygames.com)._" The
"Legends" rules
([http://www.darkcitygames.com/docs/Legends.pdf](http://www.darkcitygames.com/docs/Legends.pdf))
[PDF] there look a lot like the basic mechanics of TFT.

I managed to miss Rolemaster, although I liked the titles, particularly "Claw
Law." :-) But I know what you mean about complexity; too much "realism" leads
to things like Ben Sergeant's Car Wars cartoon (lower left, here
[https://i.ebayimg.com/images/g/HSYAAOSwTglYlP-b/s-l300.jpg](https://i.ebayimg.com/images/g/HSYAAOSwTglYlP-b/s-l300.jpg)):
"My goodness! 08:00:06, already?"

------
moultano
ELO is a stochastic gradient descent approximation of logistic regression.

You can do much better just by actually running the logistic regression over
the games. In this framework, incorporating any per-game bias such as the
characters chosen is a trivial variable to add to the model and fit jointly.

Our ranking systems are holdovers from a time when the calculations had to be
done by hand. If the whole set of games fits in RAM, there's no need to use
ancient optimization methods.
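
To make the "Elo is SGD on logistic regression" claim concrete, here is a minimal sketch. It assumes the standard Elo expected-score formula (base 10, 400-point scale) and a K-factor of 32; the function names and the example ratings are illustrative, not taken from the article.

```python
import math

# Converts a rating gap in Elo points into log-odds (natural log).
SCALE = math.log(10) / 400


def expected_score(r_a, r_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a, r_b, score_a, k=32):
    """Classic Elo update; score_a is 1 for an A win, 0 for a loss."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def sgd_step(r_a, r_b, score_a, lr):
    """One gradient-ascent step on the logistic log-likelihood of one game."""
    p = 1.0 / (1.0 + math.exp(-SCALE * (r_a - r_b)))
    grad_a = SCALE * (score_a - p)  # d(log-likelihood)/d(r_a); r_b gets the negative
    return r_a + lr * grad_a, r_b - lr * grad_a


# With learning rate K / SCALE, the SGD step reproduces the Elo update.
k = 32
elo = elo_update(1500, 1700, 1, k=k)
sgd = sgd_step(1500, 1700, 1, lr=k / SCALE)
```

In this reading, the K-factor is just a fixed learning rate, and each game is visited once, in the order it was played.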

~~~
gowld
[https://en.wikipedia.org/wiki/Glicko_rating_system](https://en.wikipedia.org/wiki/Glicko_rating_system)

~~~
moultano
Even that is still assuming you can only update parameters once per game, and
only for the players in the game. If I've played a large number of games
against someone, and the win-rate is 50/50, and then that player plays in a
tournament, my skill should move up or down in accordance with their
performance in that tournament.
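
A rough sketch of that effect, assuming a plain Bradley-Terry style logistic model fit jointly over every game. The player names, game counts, L2 penalty, and step sizes are all made up for illustration, and skills are in log-odds units rather than Elo points.

```python
import math


def fit_ratings(games, players, l2=0.1, lr=0.05, steps=5000):
    """Fit one skill per player by batch gradient ascent over all games.

    games is a list of (winner, loser) pairs. Skills are in log-odds units.
    The l2 penalty is an assumed prior that keeps undefeated players finite.
    """
    skill = {p: 0.0 for p in players}
    for _ in range(steps):
        grad = {p: -l2 * skill[p] for p in players}
        for winner, loser in games:
            p_win = 1.0 / (1.0 + math.exp(-(skill[winner] - skill[loser])))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for p in players:
            skill[p] += lr * grad[p]
    return skill


players = ["me", "rival", "other"]
head_to_head = [("me", "rival")] * 10 + [("rival", "me")] * 10
tournament = [("rival", "other")] * 5  # rival does well elsewhere

before = fit_ratings(head_to_head, players)
after = fit_ratings(head_to_head + tournament, players)
```

Here `after["me"]` ends up above `before["me"]` even though "me" played no new games, because the 50/50 record ties the two skills together. A per-game covariate such as the character matchup could be added as another term inside the `exp`.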

~~~
dmoy
Not necessarily. At least I don't know how this works in smash, but in
competitive fencing I'd see people go 50-50 consistently locally, but one
would always do _drastically_ better at nationals, year after year after year.

Right like there are A rank fencers, and then there are A rank fencers who
actually have a shot at placing on the points table.

I'm not sure why.

~~~
YokoZar
If you told me these facts about a random video game I'd guess the following:

\- A high rank player can consistently execute a strategy that wins against
the majority of players most of the time ("beats the meta")

\- The above has a counter strategy, but this strategy often fails against the
majority of the players ("loses to the meta")

When these two players meet, they go 50-50, but have very different results in
tournaments. Alternatively, one player is generally bad but exploits a
particularly hard to observe weakness in the first.

I know nothing about fencing, but I suspect something similar is going on
here.

~~~
dmoy
Yea I suspect you may be right. The ones I saw who did better in tournaments
tended to have more controlled, standard style. Nothing too fancy.

------
aquova
A very interesting read. I only somewhat follow competitive Melee, but the
lack of a formalized "chess-like" ranking system has always been interesting
to me. I was surprised about the author's discussion about the double
elimination system. I don't know much about ranking systems, but I must
imagine by now someone has developed some sort of system that supports double
elimination. All-in-all a very interesting and well written piece.

------
swolchok
Title would make more sense if it was "Making Sense of Super Smash Bros.
Melee". Not everyone plays this game.

~~~
dang
OK, we've added that. Thanks!

------
Anderkent
> You can also try predicting it match by match and use percent chance to win
> (which is what online chess clubs like ICC use), but this leaves a lot to be
> desired in practice and also simply misses the point entirely: ELO is
> structured around players having a roughly equal number of games each
> tournament, and double elimination means that placements and number of
> matches played are always different. ELO, and its commonly used variants
> like Glicko-2 or trueskill simply aren't well-suited for the format used in
> Melee tournaments.

I can't follow this argument; the point of doing this match by match, using
win percentage, is exactly so that the number of games and placement do not
matter. You won a round against someone with a higher Elo? Your Elo increases,
theirs decreases. It doesn't matter if this was one game out of 20, or
three.

~~~
joshuamorton
Essentially, it rewards players who lose early over those who lose late. In a
double elimination tournament, take two people at the same point, one in
losers and one in winners: the one in losers will play 2x the games of the one
in winners.

So if a player wants to optimize for ranking, it's actually in their best
interest to throw round one of a tournament, play more games, and have their
skill update more times.

The number of games matters because with more games you have more chances to
win and update your score.

~~~
kendallpark
This exactly. I play an online game that uses ranking, and your best bet for
breaking a 1500 is actually playing the game at odd hours when there are only
a small number of players online. Because of the distribution of the player
pool, you're more likely to match with lower-ranking players (as there are a
limited number of similarly-ranked players). Then you slowly but surely creep
up your ranking with very little risk.

~~~
Anderkent
'Breaking a 1500' and maximising rating are way different goals though. If you
want as high a rating as possible, playing lower-rated players is probably not
going to get you there - you're only getting a small increase per game.

~~~
kendallpark
But you're taking on less risk. If you play other people around your ranking
it's easy to actually lose.

~~~
Anderkent
Sure; you're reducing variance at the cost of reducing expected gains in
ranking.
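
The trade-off can be put in numbers with a small sketch. It assumes a standard Elo update with K = 32 and a hypothetical player whose hidden true strength (1600) is above their current rating (1500); all of those values are illustrative.

```python
def win_prob(r_a, r_b):
    """Elo win probability for a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def rating_change_stats(rating, true_skill, opponent, k=32):
    """Mean and variance of a one-game Elo change.

    rating is what the system believes; true_skill is an assumed hidden
    strength that determines the real win probability.
    """
    e = win_prob(rating, opponent)      # what Elo expects
    p = win_prob(true_skill, opponent)  # what actually happens
    mean = k * (p - e)
    variance = k ** 2 * p * (1 - p)
    return mean, variance


even_mean, even_var = rating_change_stats(1500, 1600, opponent=1500)
weak_mean, weak_var = rating_change_stats(1500, 1600, opponent=1100)
```

Against the 1100-rated opponent both the expected gain and the variance come out smaller than against the 1500-rated one. When `true_skill` equals `rating`, the mean is zero for any opponent, so under this model extra games add variance but no expected gain.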

------
cthor
Has this been tried?

(1) Figure out a matchup discrepancy matrix

e.g. Peach vs Puff winrate is 0.43

(2) Use an Elo head-to-head variant where the Elo update function takes
matchup discrepancy into account

e.g.

\- A vs B has an expected 0.9 winrate

\- A is Peach and B is Puff

\- Elo update is done expecting A to have a winrate of 1 - (1 - 0.9) * (0.5 /
0.43) = 0.884
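
A minimal sketch of that update, using the formula exactly as written above. The function names and K = 32 are assumptions. Note that this form can leave the [0, 1] range for heavy underdogs, so the sketch also includes a log-odds variant; that variant is a substitute technique, not something proposed in the thread.

```python
import math


def adjusted_expectation(e_a, matchup_a):
    """Scale A's expected loss rate by 0.5 / matchup win rate, as in the comment."""
    return 1 - (1 - e_a) * (0.5 / matchup_a)


def adjusted_expectation_logit(e_a, matchup_a):
    """Assumed alternative: add the matchup's log-odds to the rating log-odds."""
    logit = math.log(e_a / (1 - e_a)) + math.log(matchup_a / (1 - matchup_a))
    return 1 / (1 + math.exp(-logit))


def elo_update(r_a, r_b, score_a, matchup_a, k=32):
    """Elo update that uses the matchup-adjusted expectation for A."""
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - adjusted_expectation(e_a, matchup_a))
    return r_a + delta, r_b - delta


peach_vs_puff = 0.43
example = adjusted_expectation(0.9, peach_vs_puff)   # about 0.884
underdog = adjusted_expectation(0.1, peach_vs_puff)  # drops below 0
```

The log-odds version always stays strictly between 0 and 1, at the cost of giving a slightly different number for the 0.9 example.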

~~~
joshuamorton
The problem is that the matchup disparity matrix is difficult to derive. For
example, Puff-Fox is widely considered to be Fox-favored, possibly as much as
60/40 in general. (This is fairly big: Peach-Icies, a ridiculously bad matchup,
is considered 70-30, and Peach-Puff, considered near-unwinnable, is 80-20. Yes,
these ratings are bad.) However, Hungrybox, the current rank-1 player, plays
Puff, and has a positive winrate over something like all of the top 20 Fox
players in the world.

The next best Puff player is #38, and doesn't have any wins against top 10
foxes. Is HBox just the best player ever, consistently winning a "bad"
matchup, or is Puff a better character than people commonly believe? Who's to
say?

~~~
cthor
> The problem is that the matchup disparity matrix is difficult to derive.

Well, TFA had no bones about calculating one.

> Is HBox just the best player ever

The current data says pretty definitively, yes.

If other players can learn how to get his winrates vs Fox, then the matchup
matrix would end up reflecting that. The matchup matrix doesn't need to
reflect the perfect ("objective") state of the matchup, just the current one.

(The system I'm talking about would look more suspicious if HBox _wasn't_
considered the best, because it would probably put him at #1 anyway.)

~~~
joshuamorton
What is TFA?

I didn't do a good job of clarifying what I meant. Hbox is obviously the #1
player right now. The question is if he's just totally on another level from
every other player, or if we're underestimating Puff as a character.

Note that this is a really deep question. There are strong arguments (parry)
that in the "20XX" _yoshi_ would be the most viable character right after fox.
Given that, is Amsa overrated because he's underperforming how his character
should, or underrated since he's overperforming the "average" Yoshi player?

The system you describe basically just ends up rewarding above average players
who use unusual characters. Should Abate be ranked top 20? Probably not, but
considering how much he outperforms the "average" luigi (same thing for Amsa,
does he deserve to be, say, top 10), he probably would be.

~~~
cthor
TFA is slashdot slang: the f'ing article

It really depends on what you want the ranking to mean.

If you want it to mean: "If all the players in the world played in a
tournament, what would the expected result be", then a normal Elo-like rating
system (e.g. glicko-2) should be fine, because all the data available is from
real tournaments, and it's not really feasible for players to strategically
dodge bad matchups to pad their ratings.

But one criticism TFA has of this method is matchup discrepancy. I'm not sure
that's _actually_ important (players choose their mains freely), but if it is,
can't you just correct for it?

I think you're right that this correction would create an undesirable result.
That just means that the matchup discrepancy criticism isn't good.

------
soyiuz
What about a ranking system similar to Tennis or Downhill skiing? It basically
awards points for tournament results (rewarding active, top-placing
participants), unlike chess where all ranked games count.

~~~
gilcardenas
I personally was thinking this too. I think the main obstacle to this is that
there is no main organizing body for melee.

Because anyone can host a tournament, that makes it very tricky. You can
assign a points breakdown for the top 64/128 based on number of entrants or
prize money, but that could inflate people's rankings for doing well in an
easy region.

For example, there are very few top 100 ranked players in Europe. Under this
system, the 4th-8th best players in Europe could get a huge rankings boost
over American counterparts that perform worse in American tournaments where
there are many more skilled players. Tennis benefits from the fact that top
50-100 players are usually required to play in most major tournaments. There's
not enough money in melee for that to even be a possible requirement for
players. (Another example: small, strong regions like Florida or SoCal
would be treated the same as weaker regions like Texas/Arizona for local
events.)

Invitationals would also throw things off, as they often have a large prize
pool, but only 16 players invited. With melee, these would need to be treated
as an exhibition (worth no points) which would probably lower the stakes for
players, lower seriousness, etc. or only sanction certain well known
invitationals which might reduce outside investment in Melee.

Another common complaint about this is how it favors seeded players. Although
this would have some impact initially, I think this would level off over time
once an official ranking was adopted by all tournaments and individual
tournament organizers lose seeding powers. In fact, I would expect this to be
even less of a factor than in tennis, since in tennis being a top 100 player
gets you auto invited to most major tournaments. In smash, anyone can compete
at any major tournament, regardless of rank.

------
lakechfoma
A little OT but I'd like to know what part about getting map info is too
difficult to automate. Are they lacking the recordings or what? I'd love to
see the maps included in the dataset.

~~~
joshuamorton
Yes, most matches aren't recorded (at Genesis 5, a recent tournament, there
were ~1400 Melee singles entries, for ~2800 matches. Of those, maybe 10% were
recorded, most of those among the top 128 players attending.)
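
The ~2800 figure is what a pure double-elimination bracket would give for ~1400 entrants. A quick check, assuming no round-robin pools or other format wrinkles (the actual Genesis 5 format is not described here):

```python
def double_elim_matches(entrants, bracket_reset=False):
    """Matches in a pure double-elimination bracket.

    Everyone except the champion loses exactly twice; the champion loses
    at most once (only if grand finals goes to a bracket reset).
    """
    return 2 * (entrants - 1) + (1 if bracket_reset else 0)


total = double_elim_matches(1400)   # 2798
recorded = round(0.10 * total)      # 280 (assuming the ~10% figure)
```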

------
ReverseCold
I actually implemented glicko-2 as an 'elo' system for my school's competitive
melee group.

This is making me reconsider, although one thing of note is that you choose to
play who you want in our setup.

Overall I think this leads to fair rankings, since 'worse' players lose to
'better' players most of the time. As such, the people we think should be in
the top and bottom spots have them at the end of the season.

~~~
broodbucket
One thing I noticed when I did the same thing for my region is that players
wouldn't enter tournaments if they were just going to sandbag, because they
didn't want to hurt their ranking.

