It frustrates me to see people in the smash community treat measures like elo as "the truth" because they "don't have any human input". This is simply factually incorrect - these so-called objective measures have as much human input as anything else, codified into the constants and design choices of their algorithms. Designing these things is as much an art as it is a science, and the choices of how to weigh placements, upsets, losses, consistency, peaks, and the like are all just that - choices, made by a human sitting in a chair with Sublime Text 3 open.
I feel like this is applicable nigh everywhere, from social media timeline sorting to industrial processes to Melee rankings. Using an algorithm doesn't eliminate the human element from a system; it only abstracts it away.
There is a large contingent of radical empiricists in machine learning who assume "big data + automation = truth", especially on HN, and this is a message they need to hear more of.
People have been advocating radical empiricism in some increasingly uncomfortable contexts recently, and I hope it's just that it's the only thing they were taught and the only way they know how to think about their craft. The alternative is that an increasing number of people really do want machines to triumph over human judgment and morality.
> with a = 77617 and b = 33096. On an IBM S/370 main frame he computed ƒ in (1.1) using single, double, and extended-precision arithmetic, to produce the results:
> Single precision: ƒ = 1.172603...
> Double precision: ƒ = 1.1726039400531...
> Extended precision: ƒ = 1.172603940053178...
> This suggests a reliable result of approximately 1.172603 or even 1.1726039400532. In fact, however, the correct result (within one unit of the last digit) is
> ƒ = −0.827396059946821...
It's worth remembering that computers are very fast, but also very, very stupid. They do exactly what you tell them (or teach them, if we're talking about ML), which isn't always what you want. Just because you've got a very scientific-looking algorithm doesn't mean it's correct. Just because you're aggregating all your data doesn't mean the results are truly meaningful.
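The quoted numbers come from Rump's classic example, and it's easy to reproduce the failure yourself. Below is a minimal sketch, assuming the standard published form of the expression, ƒ = 333.75·b^6 + a^2·(11·a^2·b^2 − b^6 − 121·b^4 − 2) + 5.5·b^8 + a/(2b): exact rational arithmetic gives the true value, while the naive floating-point evaluation is catastrophically wrong (the exact garbage you get varies by hardware and evaluation order):

```python
from fractions import Fraction

def rump_exact(a, b):
    # Rump's expression in exact rational arithmetic (333.75 = 1335/4, 5.5 = 11/2)
    return (Fraction(1335, 4) * b**6
            + a**2 * (11 * a**2 * b**2 - b**6 - 121 * b**4 - 2)
            + Fraction(11, 2) * b**8
            + a / (2 * b))

a, b = Fraction(77617), Fraction(33096)
print(float(rump_exact(a, b)))   # ~ -0.827396059946821, the true value

# The same expression evaluated naively in IEEE 754 doubles:
fa, fb = 77617.0, 33096.0
naive = (333.75 * fb**6
         + fa**2 * (11 * fa**2 * fb**2 - fb**6 - 121 * fb**4 - 2)
         + 5.5 * fb**8
         + fa / (2 * fb))
print(naive)   # catastrophic cancellation; on typical IEEE hardware this
               # comes out around -1.18e21, nowhere near the true value
```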
This might be a case where classical econ ("social choice") can give some helpful perspective. Arrow's impossibility theorem is the most famous impossibility result in this area; it says that no algorithm can take in a set of rankings (e.g. match outcomes) and produce an aggregate ranking in a way that satisfies a small set of fairness criteria. This is classically interpreted as saying that any aggregation method must be "unfair" in one way or another.
> Arrow's impossibility theorem is the most famous impossibility result in this area; it says that no algorithm can take in a set of rankings (e.g. match outcomes) and produce an aggregate ranking in a way that satisfies a small set of fairness criteria.
Ignoring the parenthetical "e.g. match outcomes", this is a correct description of Arrow's impossibility theorem. I don't see how match outcomes could possibly be an example of a set of rankings in the sense of the theorem, though.
Yeah, that was unclear. I'm thinking of a match outcome between teams A and B as a partial ordering on all the teams. Classically Arrow's deals with only total orderings as input -- I'd been thinking that it extends to partial orderings, but hmm, I'm not sure what the research says about what happens if we restrict the inputs to be just pairwise orderings/outcomes.
Arrow's theorem is mathematics, but even citing it is greatly misleading. All Arrow proved was that, sometimes, voting becomes rock-paper-scissors between the three most popular candidates. That isn't "unfair".
It especially doesn't mean that all voting systems are equally bad, and yet the most common reason for people to cite Arrow's theorem is to try and dispute the idea that we can do better.
Unlike your parent comment, your comment is total nonsense. Arrow proved that, given a set of ordinal preference rankings held by several individuals, the concept of an aggregate preference ranking describing the overall "will of society" is not well defined; subject to four or five unobjectionable constraints, no function determining such an overall ranking exists.
Voting systems which obey the assumptions of the theorem frequently break down in ways that cannot be described as "rock-paper-scissors between the three most popular candidates", as when strategic voting caused the papal conclave of 1334 to unanimously elect the least popular candidate to the papacy.
Were you perhaps thinking of Condorcet's work when you referred to rock-paper-scissors? He, not Arrow, famously wrote about how voter preferences were not necessarily transitive.
That "rock-paper-scissors" cycle among the top preferences is the Smith set, and every sensible voting "violates" Arrows theorem by saying that sometimes it's ok to acknowledge that cycle.
One of Arrow's assumptions, "Independence of Irrelevant Alternatives", is far too strong. If you relax it to "Local Independence of Irrelevant Alternatives", then Arrow's theorem no longer applies. It's just not a relevant theorem.
Claiming that allowing a voting system to recognize a rock-paper-scissors cycle among the top candidates is the same as allowing dictators is obvious rubbish. But if you interpret Arrow's theorem naively, you might be tempted to make such a claim.
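For anyone who hasn't seen the cycle in action, here's a minimal sketch (ballots invented for illustration): three individually consistent rankings whose pairwise majorities are intransitive:

```python
from itertools import combinations

# Three voters, each with a perfectly transitive individual ranking
ballots = [
    ["A", "B", "C"],  # voter 1: A > B > C
    ["B", "C", "A"],  # voter 2: B > C > A
    ["C", "A", "B"],  # voter 3: C > A > B
]

def majority_winner(x, y):
    """Candidate that a majority of ballots rank above the other."""
    x_wins = sum(b.index(x) < b.index(y) for b in ballots)
    return x if x_wins * 2 > len(ballots) else y

for x, y in combinations("ABC", 2):
    print(f"{x} vs {y}: majority prefers {majority_winner(x, y)}")

# Prints A over B, B over C, C over A: a rock-paper-scissors cycle,
# so pairwise majority yields no transitive aggregate ranking.
```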
I'm in the middle of reading a pretty interesting book, where that is one of the core arguments - Weapons of Math Destruction, by Cathy O'Neil - I recommend checking it out if you are curious to learn more about the distinction that you made, and more of how we tend to abuse math through algorithms that are in some way designed by humans.
In modeling memory discrimination in psychology experiments, people often recommend d-prime from Signal Detection Theory (https://en.wikipedia.org/wiki/Sensitivity_index). Others recommend simple "theory-free" discrimination scores (hits − false alarms). This is frustrating because all measurement designs carry theoretical distributional and metric assumptions; just because one is explicit (as in signal detection theory: d' = Z(hit rate) − Z(false alarm rate), where Z(p), p ∈ [0,1], is the inverse of the Gaussian cumulative distribution function) does not mean the simple score (hit rate minus false alarm rate) is theory-free and unadulterated. The theoretical assumptions are just different.
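To make that concrete, here's a small sketch (the hit/false-alarm rates are invented), using scipy's inverse-normal norm.ppf as Z: two participants with identical "theory-free" scores but very different d', which is exactly where the two sets of distributional assumptions part ways:

```python
from scipy.stats import norm

def d_prime(hit_rate, fa_rate):
    # Signal detection theory: d' = Z(hit rate) - Z(false alarm rate),
    # where Z is the inverse Gaussian CDF (scipy's norm.ppf)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

def simple_score(hit_rate, fa_rate):
    # The "theory-free" score: hit rate minus false alarm rate
    return hit_rate - fa_rate

# Same simple score (0.40), very different d':
print(simple_score(0.80, 0.40), d_prime(0.80, 0.40))  # 0.40, ~1.09
print(simple_score(0.99, 0.59), d_prime(0.99, 0.59))  # 0.40, ~2.10
```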
I'd take it even one step further and apply this to literally anything on the internet! Whether it's news (doesn't even have to be true these days), data (omitted/modified or not), opinions (externally-motivated), research (funded by god-knows-who), literally ANYTHING you view is in some way processed, designed, delivered, or created by a human being.
And even when AI becomes practical, no more than 1000 people can realistically be involved in its design and implementation. Our supposed 'objective machines' will in fact be designed to the ideals of those designing them: a generally non-diverse group of people. Food for thought!
It's "a truth", which is probably as good as you can get, given that "the truth" is always a subjective thing.
The benefit of an algorithm isn't that it's infallible (it may well not be), but rather that it's consistent. It's accurate, even if not correct. Considering how much of human judgement is inconsistent, there's value in quantifying it in a standard way.
Reasonable minds can argue about whether that quantification is correct or fair.
I think my point is less concerned with a value judgement about algorithms, and more concerned with prompting critical thinking about how and by whom the algorithms are developed, to what ends, the vectors by which subjectivity seeps into them, and how that affects their behavior.
Primarily because "correctness" is likely to be subjective or difficult to define, while "accuracy" is not.
That said, the terms more often used here are "precise, but not accurate" (http://blog.minitab.com/blog/real-world-quality-improvement/...) but note that this uses a different meaning for accuracy than parent used. For example, say you use 22/7 to derive the value of pi. You can easily calculate it to 60 or more decimal places. You'll be very precise. However, you won't be accurate because your methodology of using 22/7 is flawed. It's also very easy to create more precision: just keep calculating. On the other hand, it can be very difficult to create more accuracy: How do we even measure pi to be able to confirm the ratio is correct?
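A quick sketch of that distinction using Python's decimal module: 22/7 can be carried to as many digits as you like (precision), but it parts ways with pi at the third decimal (accuracy):

```python
from decimal import Decimal, getcontext

getcontext().prec = 25
approx = Decimal(22) / Decimal(7)            # precise: as many digits as we ask for
true_pi = Decimal("3.14159265358979323846")  # first 20 decimals of pi, for reference

print(approx)   # 3.142857142857142857142857
print(true_pi)  # 3.14159265358979323846
# The two agree only through 3.14: arbitrarily precise, but not accurate.
```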
It's not. But many "legacy" systems—using the term broadly to mean people and process—are inconsistent and not any more correct as a whole. A standardized system is easier to monitor and easier to correct, so I see this as an improvement overall for that reason.
This is true for many statistical analyses as well. The worst is when people smuggle in their non-quantifiable base assumptions and pretend like the resulting inference is some objectively true reality.
> For our purposes, Bloodgood serves as a great example of "closed pool" rating abuse. You get inflated ratings by being the best player in your playerpool, even if your playerpool is a relatively weak one.
In Melee there are people who end up as local kings but don't do well at nationals. There are also people who are exceptionally good on a national level but simply don't travel (aka "Hidden Bosses").
Nintendo is very hands-off with Melee so tournament organization remains in the hands of the community. There is no single major overseer of Melee tournaments. Anyone can hold a tournament and throw the bracket onto Challonge or Smash.gg. I imagine if ELO was implemented as part of seeding, people would start gaming the system.
> The way seeding gets done is that players get placed into broad tiers, and then those tiers are then fed into pools, attempting to avoid region conflicts or repeat matches from recent tournaments.
This is where the human-in-the-loop part of seeding shines. Mid-tier players are entering national tournaments for the experience. They will not win, and their reg fee is essentially a donation to the winner's pot. But what they gain from the experience is tournament matches with players that they are not familiar with. Many of them will only get two games in-bracket, so it's a huge waste for them if they end up playing against buddies from their own region.
The community actively polices good seeding. There is often an outcry if, say, too many NorCal players get shoved onto the same side of a bracket.
Hidden Bosses always get exposed at nationals because no matter how good or talented they are, they will get destroyed by players who are used to competing against other national threats. We've encouraged our local hidden boss (#1 in TN) to attend more nationals, but work schedules get in the way. Just like in chess, Melee is only profitable if you are one of the best in the world, and life gets in the way.
I also play chess at a competitive level (>2000, Expert in the US) and play Melee at a low competitive level (playing in local meetups, winning a few matches). I've had many arguments about ranking systems, ELO, etc with my fellow Smashers, and I reached similar conclusions. This is a great writeup.
There are huge differences between the Swiss system used in chess (which works great for ELO, since seeding is done by rating and players are not eliminated) and the double elimination system used in Melee tournaments. I don't think it's possible to have an objective ranking system in Melee because of the intricacies of this issue (seeding influences final placement, low-seeded players will hit a wall where they lose to high-seeded players earlier, etc).
Am I old? For a second there, I thought they were talking about Melee (https://en.wikipedia.org/wiki/Melee_(game)), "... a simple man-to-man combat boardgame designed by Steve Jackson, and released in 1977 by Metagaming Concepts."
Fundamentally, it is (was?) very simple---the basic rules were in two pocket games. Characters had three basic characteristics, strength (also a proxy for endurance and damage tolerance), dexterity, and intelligence, plus skills and assorted other details.
I managed to miss Rolemaster, although I liked the titles, particularly "Claw Law." :-) But I know what you mean about complexity; too much "realism" leads to things like Ben Sergeant's Car Wars cartoon (lower left, here https://i.ebayimg.com/images/g/HSYAAOSwTglYlP-b/s-l300.jpg): "My goodness! 08:00:06, already?"
ELO is a stochastic gradient descent approximation of logistic regression.
You can do much better just by actually running the logistic regression over the games. In this framework, any per-game covariate, such as the characters chosen, is trivial to add to the model and fit jointly.
Our ranking systems are holdovers from a time when the calculations had to be done by hand. If the whole set of games fits in RAM, there's no need to use ancient optimization methods.
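A minimal sketch of that idea (game log, learning rate, and regularization all invented for illustration): fit the logistic (Bradley-Terry) model jointly over the whole game set by full-batch gradient ascent. A single online pass of the same update with a fixed step is essentially Elo:

```python
import numpy as np

# Hypothetical game log as (winner, loser) index pairs
games = [(0, 1), (0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
n_players = 3

ratings = np.zeros(n_players)
lr, l2 = 0.1, 0.01  # step size; tiny L2 penalty keeps undefeated players finite

for _ in range(2000):
    grad = -l2 * ratings
    for w, l in games:
        # P(winner beats loser) under the logistic model
        p = 1.0 / (1.0 + np.exp(ratings[l] - ratings[w]))
        grad[w] += 1.0 - p   # winner moves up by the "surprise" of the result
        grad[l] -= 1.0 - p   # loser moves down by the same amount
    ratings += lr * grad

# Ratings are on a natural-log-odds scale (Elo's scale is 400/ln(10) times this).
# Per-game covariates (characters, stage) would just be extra columns here.
print(ratings)
```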
Even that is still assuming you can only update parameters once per game, and only for the players in the game. If I've played a large number of games against someone, and the win-rate is 50/50, and then that player plays in a tournament, my skill should move up or down in accordance with their performance in that tournament.
Not necessarily. At least I don't know how this works in smash, but in competitive fencing I'd see people go 50-50 consistently locally, but one would always do drastically better at nationals, year after year after year.
Right, like there are A-rank fencers, and then there are A-rank fencers who actually have a shot at placing on the points table.
If you told me these facts about a random video game I'd guess the following:
- A high rank player can consistently execute a strategy that wins against the majority of players most of the time ("beats the meta")
- The above has a counter strategy, but this strategy often fails against the majority of the players ("loses to the meta")
When these two players meet, they go 50-50, but have very different results in tournaments. Alternatively, one player is generally bad but exploits a particularly hard to observe weakness in the first.
I know nothing about fencing, but I suspect something similar is going on here.
I agree in principle, and having new data affect the interpretation of old results was one of the goals for the rating system for a game I run [0]. But while I believe it's the right thing to do if the goal is to predict results more accurately, there are downsides.
Basically players want rating systems to be reward loops; they hate systems where their rating can change randomly, and they want the system to be very volatile in response to their own results. If they go on a statistically insignificant winning streak, they want their ratings to shoot up, not a rating system that goes "meh, it's probably just random chance".
I think if the system provides reliable results, people will come around. There are a lot of preferences that players have, but I think they ultimately come to respect systems that work.
A very interesting read. I only somewhat follow competitive Melee, but the lack of a formalized "chess-like" ranking system has always been interesting to me. I was surprised by the author's discussion of the double elimination system. I don't know much about ranking systems, but I have to imagine that by now someone has developed some sort of system that supports double elimination. All in all, a very interesting and well written piece.
> You can also try predicting it match by match and use percent chance to win (which is what online chess clubs like ICC use), but this leaves a lot to be desired in practice and also simply misses the point entirely: ELO is structured around players having a roughly equal number of games each tournament, and double elimination means that placements and number of matches played are always different. ELO, and its commonly used variants like Glicko-2 or TrueSkill, simply aren't well-suited for the format used in Melee tournaments.
I can't follow this argument; the point of doing this match-by-match and percent-chance-to-win-wise is exactly so that the number of games and placement do not matter. You won a round against someone with higher ELO? Your ELO increases, theirs decreases. Doesn't matter if this was one game out of 20, or three.
Essentially, it rewards players who lose early over those who lose late. In a double elimination tournament, take two people at the same point in the bracket, one in losers and one in winners: the one in losers will have played about twice as many games as the one in winners.
So if a player wants to optimize for ranking, it's actually in their best interest to throw round one of a tournament, play more games, and have their skill update more times.
The number of games matters because with more games you have more chances to win and update your score.
This exactly. I play an online game that uses ranking, and your best bet for breaking a 1500 is actually playing the game at odd hours when there are only a small number of players online. Because of the distribution of the player pool, you're more likely to match with lower-ranked players (as there is a limited number of similarly-ranked players). Then you slowly but surely creep up your ranking with very little risk.
'Breaking a 1500' and maximising rating are way different goals though. If you want as high a rating as possible, playing lower-rated players is probably not going to get you there - you're only getting a small increase per game.
Well you don't have the option for off peak hours in a tournament setting. There's also obviously a high increase in risk as you progress towards the finals of a tournament.
Then shouldn't your ranking fall again when you play in the full pool? If you're a strong enough player to maintain the higher ranking, you should reach it regardless.
>So if a player wants to optimize for ranking, it's actually in their best interest to throw round one of a tournament, play more games, and have their skill update more times.
That's only the case if you believe your current ELO underestimates your real ability relative to the opponents you'll meet in the lower bracket.
Also if you lose your first game against a low-ranked player, you'll immediately lose a lot of points; then the wins against other low-rank players will not give you many points back.
If you're within a calibrated ELO system, your expected change in rating should be 0 for a match, and then having more matches doesn't actually help you.
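A quick sketch of that point (K and the ratings are invented): under the standard Elo update, the win and loss branches cancel exactly in expectation when your rating is calibrated:

```python
K = 32  # illustrative K-factor

def expected_score(r_a, r_b):
    # Standard Elo expected score for player A vs player B
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

r_a, r_b = 1700, 1900
e = expected_score(r_a, r_b)

# Win with probability e (gain K*(1-e)), lose with probability 1-e (drop K*e):
expected_delta = e * K * (1 - e) + (1 - e) * K * (0 - e)
print(expected_delta)  # 0.0 -- more matches don't help if the rating is right
```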
The problem is that the matchup disparity matrix is difficult to derive. For example, Puff-Fox is widely considered to be Fox-favored in general, possibly as much as 60-40. (This is fairly big: Peach-Icies, a ridiculously bad matchup, is considered 70-30, and Peach-Puff, considered near-unwinnable, is 80-20. Yes, these ratings are bad.) However, Hungrybox, the current rank-1 player, plays Puff, and has a positive winrate over something like all of the top 20 Fox players in the world.
The next best Puff player is #38, and doesn't have any wins against top 10 foxes. Is HBox just the best player ever, consistently winning a "bad" matchup, or is Puff a better character than people commonly believe? Who's to say?
> The problem is that the matchup disparity matrix is difficult to derive.
Well, TFA made no bones about calculating one.
> Is HBox just the best player ever
The current data says pretty definitively, yes.
If other players can learn how to get his winrates vs Fox, then the matchup matrix would end up reflecting that. The matchup matrix doesn't need to reflect the perfect ("objective") state of the matchup, just the current one.
(The system I'm talking about would look more suspicious if HBox wasn't considered the best, because it would probably put him at #1 anyway.)
I didn't do a good job of clarifying what I meant. Hbox is obviously the #1 player right now. The question is if he's just totally on another level from every other player, or if we're underestimating Puff as a character.
Note that this is a really deep question. There are strong arguments (parry) that in the "20XX" Yoshi would be the most viable character right after Fox. Given that, is Amsa overrated because he's underperforming what his character should do, or underrated since he's overperforming the "average" Yoshi player?
The system you describe basically just ends up rewarding above-average players who use unusual characters. Should Abate be ranked top 20? Probably not, but considering how much he outperforms the "average" Luigi, he probably would be (same thing for Amsa: does he deserve to be, say, top 10?).
It really depends on what you want the ranking to mean.
If you want it to mean: "If all the players in the world played in a tournament, what would the expected result be", then a normal Elo-like rating system (e.g. glicko-2) should be fine, because all the data available is from real tournaments, and it's not really feasible for players to strategically dodge bad matchups to pad their ratings.
But one criticism TFA has of this method is matchup discrepancy. I'm not sure that's actually important (players choose their mains freely), but if it is, can't you just correct for it?
I think you're right that this correction would create an undesirable result. That just means that the matchup discrepancy criticism isn't good.
What about a ranking system similar to Tennis or Downhill skiing? It basically awards points for tournament results (rewarding active, top-placing participants), unlike chess where all ranked games count.
I personally was thinking this too. I think the main obstacle is that there is no central organizing body for Melee.
Because anyone can host a tournament, it gets very tricky. You could assign a points breakdown for the top 64/128 based on the number of entrants and prize money, but that could inflate people's rankings for doing well in an easy region.
For example, there are very few top 100 ranked players in Europe. Under this system, the 4th-8th best players in Europe could get a huge rankings boost over American counterparts who perform worse in American tournaments where there are many more skilled players. Tennis benefits from the fact that top 50-100 players are usually required to play in most major tournaments. There's not enough money in Melee for that to even be a possible requirement for players. (Another example: small strong regions like Florida or SoCal would be treated equally to weaker regions like Texas/Arizona for local events.)
Invitationals would also throw things off, as they often have a large prize pool but only 16 invited players. With Melee, these would need to be treated as exhibitions (worth no points), which would probably lower the stakes and seriousness for players, or you'd have to sanction only certain well-known invitationals, which might reduce outside investment in Melee.
Another common complaint about this is how it favors seeded players. Although this would have some impact initially, I think it would level off over time once an official ranking was adopted by all tournaments and individual tournament organizers lost seeding power. In fact, I would expect this to be even less of a factor than in tennis, since in tennis being a top 100 player gets you auto-invited to most major tournaments. In smash, anyone can compete at any major tournament, regardless of rank.
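For concreteness, a tennis-style schedule might look like the sketch below. Every breakpoint, point value, and the entrant-count scaling here are invented; they're exactly the kind of human design choices the top comment is talking about:

```python
# Hypothetical placement -> base points table (5th-6th share a row,
# 7th-8th share a row, 9th-12th share a row); values are illustrative.
PLACEMENT_POINTS = {1: 1000, 2: 700, 3: 500, 4: 400, 5: 300, 7: 200, 9: 100}

def tournament_points(placement: int, entrants: int) -> int:
    if placement > 12:  # this toy table only covers top 12
        return 0
    base = PLACEMENT_POINTS[max(c for c in PLACEMENT_POINTS if c <= placement)]
    # Scale by field size so winning a small local is worth less than
    # winning a big national (the 1000-entrant cap is arbitrary).
    return round(base * min(1.0, entrants / 1000))

print(tournament_points(1, 1400))  # 1000: winning a large national
print(tournament_points(1, 64))    # 64: winning a small local
print(tournament_points(7, 1400))  # 200: 7th-8th at the national
```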
One player does actually use a tennis ranking system. This works pretty well, but it still has a few issues, mainly because the top players are so consistent that you get two tiers and it's difficult to differentiate within that second tier.
A little OT but I'd like to know what part about getting map info is too difficult to automate. Are they lacking the recordings or what? I'd love to see the maps included in the dataset.
Yes, most matches aren't recorded (at Genesis 5, a recent tournament, there were ~1400 Melee singles entries, for ~2800 matches. Of those, maybe 10% were recorded, most of those among the top 128 players attending.)
I actually implemented Glicko-2 as an 'elo' system for my school's competitive Melee group.
This is making me reconsider, although one thing of note is that in our setup you choose whom you want to play.
Overall I think this leads to fair rankings, since 'worse' players lose to 'better' players most of the time. As such, the people we think should be in the top and bottom spots have them at the end of the season.
One thing I noticed when I did the same thing for my region is that players wouldn't enter tournaments if they were just going to sandbag, because they didn't want to hurt their ranking.