Prediction of the FIFA World Cup 2018 – A random forest approach (arxiv.org)
48 points by ajonnav 5 months ago | 57 comments

Projection: An octopus, or possibly another cephalopod, would outperform this RF predictor.

I reckon a cephalopod would outperform it, even if it was unwell. I bet you sick squid.

I, for one, welcome our new cephalopod overlords.

In fact, if you can get me (6^8) * (2^15) cephalopods, I am certain of it.

If you get 2^15 squids, just play it safe and sell them at the fish market. They run ~$10/kg.

I think one out of 32 squids will do nicely.

I always wonder if those people have tried to fit their approach to past results. Could they simulate past World Cup results to be sure their method is sane by using player/team's archived statistics?

  3.4 Combining methods
  [...] are now compared with regard to their predictive performance.
  For this purpose, we apply the following general procedure:

  1. Form a training data set containing three out of four World Cups.

  2. Fit each of the methods to the training data.

  3. Predict the left-out World Cup using each of the prediction methods.

  4. Iterate steps 1-3 such that each World Cup is once the left-out one.

  5. Compare predicted and real outcomes for all prediction methods.
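
For what it's worth, the procedure quoted above is leave-one-tournament-out cross-validation. Here is a minimal sketch of it, assuming a toy data layout and a deliberately trivial "model" (the mean goals per match in the training tournaments). The rows, the numbers and the `fit_mean` / `leave_one_cup_out` names are made up for illustration; they are not the paper's features or methods.

```python
from statistics import mean

# Hypothetical toy data: (tournament_year, goals_scored_in_a_match).
# These numbers are invented for illustration only.
matches = [
    (2002, 1), (2002, 3), (2002, 0),
    (2006, 2), (2006, 1), (2006, 1),
    (2010, 0), (2010, 1), (2010, 2),
    (2014, 4), (2014, 1), (2014, 2),
]

def fit_mean(train):
    """'Fit' a trivial baseline: the mean goals over the training matches."""
    return mean(goals for _, goals in train)

def leave_one_cup_out(data):
    """Hold out each tournament once, fit on the others, score the held-out one."""
    errors = {}
    for held_out in sorted({year for year, _ in data}):
        train = [row for row in data if row[0] != held_out]  # step 1
        test = [row for row in data if row[0] == held_out]
        prediction = fit_mean(train)                         # step 2
        # Step 3: mean absolute error on the left-out tournament.
        errors[held_out] = mean(abs(goals - prediction) for _, goals in test)
    return errors                                            # step 4 is the loop

cv_errors = leave_one_cup_out(matches)
for year, err in cv_errors.items():                          # step 5
    print(year, round(err, 3))
```

Any tuning would have to happen inside the loop, on `train` only, for the held-out error to stay honest.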

Do they say they only did this once? Or do they imply they did hyperparam tuning, pipeline optimization, etc using one or more (or all) of the world cups included for the cv?

Like the advice on investment products: past performance is not an indicator of future returns.

No, but the model should fit. Otherwise, why should it fit now?

Exactly. Past performance is not everything, but it certainly is a prerequisite for predicting the future.

Unless they take into account modern things like travel time (was probably a bigger factor decades ago with fewer connecting and long-distance flights), ref nationality (used to be a huge point of contention for accusations of favoritism, now probably less so because of video), VAR (maybe some teams are better at getting away with dirty tricks than others, but now can't with video review), etc.

I'm sure an accurate model has to be updated as culture and technology change and impact the game itself.

Ctrl+F'd the PDF for the keyword "MESSI" and didn't find any matches. So obviously "random" forests.

I once ran my own prediction of the FIFA World Cup. I used Elo ratings from http://eloratings.net/Europe, which I converted into a model giving probabilities for any outcome between two teams. For each group A-F and for each possible pair of teams (a,b) in that group, I estimated the probability of them qualifying in that order using Monte Carlo simulation. Doing this explicitly should reduce overall estimation variance. Using that data I calculated the probability of each team becoming world champion using exact formulas. The results looked reasonable to me. What I learned was that the chances of extreme outsiders are typically heavily overestimated by the bookmakers. I have checked this paper for comparison (Table 8) and think the results are in line with my observation. I vaguely remember having heard that there is a similar observation for stock options that are extremely out of the money: they are typically much too expensive to buy. The bottom line is never to bet on Korea or Japan becoming FIFA world champion :-)
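
A rough sketch of the group-stage part of that idea, in case anyone wants to play with it. The Elo expected-score formula is the standard one; everything else is an assumption for illustration and not necessarily what the comment above did: the fixed draw probability, the random tie-break on points (real tournaments use goal difference etc.), the made-up ratings, and the function names.

```python
import random
from collections import Counter
from itertools import combinations

def elo_expected(r_a, r_b):
    """Standard Elo expected score of team A against team B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def simulate_group(ratings, rng, p_draw=0.25):
    """Play one round-robin and return (winner, runner_up).

    Assumptions: a fixed draw probability, and ties on points
    broken at random instead of by goal difference.
    """
    points = {team: 0 for team in ratings}
    for a, b in combinations(ratings, 2):
        if rng.random() < p_draw:
            points[a] += 1
            points[b] += 1
        elif rng.random() < elo_expected(ratings[a], ratings[b]):
            points[a] += 3
        else:
            points[b] += 3
    ranked = sorted(ratings, key=lambda t: (points[t], rng.random()), reverse=True)
    return ranked[0], ranked[1]

def qualification_probs(ratings, n_sims=20000, seed=0):
    """Monte Carlo estimate of P(team a finishes 1st and team b finishes 2nd)."""
    rng = random.Random(seed)
    counts = Counter(simulate_group(ratings, rng) for _ in range(n_sims))
    return {pair: c / n_sims for pair, c in counts.items()}

# Hypothetical ratings for a four-team group.
group = {"A": 2000, "B": 1850, "C": 1750, "D": 1600}
probs = qualification_probs(group)
for pair, p in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(p, 3))
```

The ordered-pair probabilities are what you would then feed into the knockout bracket.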

How about betting on Greece winning the Euro?

I don't think any ML model can take into consideration a meltdown like Brazil vs. Germany in 2014, or a player getting sent off and changing the dynamics of the game. Football/soccer (along with field hockey) is much more unpredictable than any other team sport.

Uncertainty and randomness are absolutely things that models take into consideration. That's the statistical part of statistical modeling. If unpredictability were a deal breaker, we wouldn't be able to model coin flips.

Yes, even I use random forest for some of the predictions at my new project. Somehow I feel doing statistical modeling for the sake of modeling/predictions is not the right approach. I'm still new (year or two exp.) into ML/statistical modeling so I'm much more conservative in my approach.

Predicting things and forecasting is an enormous part of ML and stats. Especially on the more statistical side of the community, there are principled, effective ways to do this, even in the presence of weird outliers like the Germany/Brazil meltdown.

Take a look at robust regression and influence functions to see some of the interesting flavor of one way to look at outlier weirdness.

... or the infamous Zidane head-butt, Maradona's "Hand of God", etc.

[Somewhat related] I remember that someone posted a World Cup simulator during the 2014 championship. https://news.ycombinator.com/item?id=7941898

After playing a little with it, my takeaway was that the FIFA index is worthless (too political?) so I'd not pay attention to it. Also, the ESPN index was one of the better predictors, so I'd look almost only at it.

FIFA has a reputation for favoring the big-market countries, and scandals happen. Did the model here include the probability of poor and biased referees? Anyway, I'm looking forward to the World Cup, but wish it wasn't so early on the West Coast. The first matches I'm especially looking forward to are Spain vs Portugal and Mexico vs Germany. Should be fun!

Compared to betting odds (https://www.betfair.com/exchange/football/competition/561474...) this approach thinks that Brazil is much less likely to win.

I do not think that the Germans are the top favorite.

First of all, not a lot of teams won the cup twice in a row

Second it's way harder if everybody is focused on you (because you won the last time)

I'm pretty sure barely any data can reflect these two statements. (OK, the first one can, but the second one is more of an emotional effect.)

> First of all, not a lot of teams won the cup twice in a row

That is just because winning itself is low probability. But repetition is not a factor.

They were saying the same about Real Madrid and the Champions League, and we all know how it turned out.

There's a big difference between the 1-year cycle of the Champions League and the 4-year cycle of a World Cup in terms of player aging/form though.

They won it 3 times in 4 years which I would say is pretty comparable to a World Cup timing

No team has actually done it before...

That is not correct.

* Italy won in 1934 and 1938

* Brazil won in 1958 and 1962

Looks like Brazil did in 1958 and 1962.

In contrast, a fully human approach using a prediction market: https://alphacast.cultivateforecasts.com/challenges/167-2018...

This is a perfect example of abuse of ML. The last World Cup's ML predictions were a joke. I guess we won't learn the lesson. And if by chance any of these predictions work, we'll take it as a valid thing.

In the 2006 World Cup I predicted to colleagues that Italy would win just because I liked them. I stood by this throughout and voilà! It happened.

Last WC my prediction was Argentina (reached the final).

I feel intuitions do matter. This time: Argentina again

TL;DR: We don't know who's going to win either, maybe one of the really good teams?

Oh gawd. Please stop.

The Economist is doing it. 538 has done one. Everyone is doing it.

They all wind up with similar predictions.

Germany, Brazil, France and Spain are favourites.

This is what the betting odds, transfermarkt and the mean salaries of the teams also tell you.

There are limits to prediction. The WC is a good place for people to learn that.

My biggest problem with these predictions is retroactive analysis of the prediction itself. If I have Brazil at a 99% chance to win the WC and they win, is my model the best?

I guess the idea is that over time, you measure prediction performance, but models are constantly changing and there's enough time in between WC tournaments that it'll take at least a century to have a good sample size of predictions.

If you take something more immediate like predicting tomorrow's stock prices or even the market (up/down), I wonder how their prediction models would fare.

Yep, using many predictions you can calculate a Brier score, which is a measure of your predictive performance.
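
For anyone who hasn't seen it: for yes/no events the Brier score is just the mean squared difference between the probability you gave and what happened (1 or 0). Lower is better, 0 is perfect. A small sketch, with made-up forecasts and a `brier_score` helper name of my own:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally many forecasts and outcomes, at least one")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical forecasts for four yes/no events, invented for illustration.
confident = [0.9, 0.8, 0.2, 0.1]   # a forecaster who commits
hedged = [0.5, 0.5, 0.5, 0.5]      # a forecaster who always says 50%
happened = [1, 1, 0, 0]            # what actually occurred

print(brier_score(confident, happened))
print(brier_score(hedged, happened))
```

Here the confident forecaster scores about 0.025 and the always-50% forecaster 0.25. A single 99% call that comes true scores well on its own, which is exactly why you need many predictions before the number means much.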


See also: https://predictionbook.com

People are so impressed now with predictive models based on any kind of correlation. Like: "we predict what kind of wine you will like based on your favorite candy bar" or "how long you'll live based on what you eat for breakfast". Only thing is, it's hard to find any two variables that aren't correlated, to some degree. Actuarial mathematicians have been doing this long enough to know how limited the "predictive power" really is.

> the mean salaries of the teams

This is new to me. Is the mean salary of a national team predictive? And are you using the mean salary the nation pays or their club salaries?

It's an established metric that uses club salaries. Is it predictive? The top 4 on that metric are Brazil, Germany, France, Spain.

Did I miss this year’s World Cup? How can something be predictive for an event that hasn’t happened?

Weather forecasts predict weather that hasn't happened yet. What is the point in predicting something that has happened?

I can say it’s going to rain 10 days from now based on whether it’s cloudy today. I wouldn’t say it’s a good prediction until it rains 10 days from now.

In other words: how well has this salary-based prediction worked out in the past?

There's a significant margin of error, but at a high level, the higher someone's salary, the better player you can expect them to be. There are some $10M players who aren't as good as $2M players, but a $30M player is likely considered far superior to a $2M player.

Yeah, I’m familiar with the concept of salaries.

My questions were: 1) is this an established metric that’s commonly used to predict football (soccer) results and 2) what salary are we talking about, the amount the nation pays to its team’s players (e.g. Brazil pays more than England) or the amount a nation’s players earn on their club teams (e.g. Neymar earns more at PSG than Rooney does at Everton)?

(1) At a high level, yes, they look at it. There is a very strong relationship between winning and team pay in soccer. There is a reason why Real Madrid has won 4 of the last 5 Champions Leagues. Similarly, PSG and Munich are by some distance the best-paid teams in their leagues and win their respective leagues pretty much every year. The market today is quite efficient. Sure, there are always going to be some undervalued middling players here and there, but the very best players do not go unnoticed, so they are paid well and bought by the richest teams.

(2) The discussion is naturally about how much they currently earn at their clubs. National teams pay no salary, and many top players do not play, and in some cases have never played, in their birth country's own league.

Club salaries.

For all the hype around the World Cup, and I love the event, it’s pretty much these 4 (plus Italy, if they qualified) that have won the event. The last time a country outside of the big four (five) won was Argentina back in '86.

Makes for a bit of a dull ending sometimes.

The winner for an individual WC is hard to pick and there are usually 4-5 big teams that could win but they have changed a little bit.

France hadn't won a WC and kept choking until 1998. Spain were the same until 2010.

Italy and Argentina have been in and out of the favourites as well.

England were #1 in the Elo rankings in the late 1980s. They were only knocked out in 1986 with the help of a handball, and they got to the semis in 1990.

With a bit of luck Holland might also have been able to win a WC. Finals in 74, 78 & 2010.

But outside a few big teams it is pretty unlikely that there will be a real surprise winner. There are too many big good teams so that one or two surprises isn't enough to win. For example South Korea shocked Italy in 2002 but then Germany beat them. Or Croatia beat Germany in 1998 but then got beaten.

It's pretty rare that the championship is the best game/series in a tournament/playoffs. Look at the history of Super Bowls, NCAA tournaments, NBA playoffs, etc.

More about the journey than the destination.

Argentina is also part of that top (6) - they’re the Liverpool of international football.

According to Betfair, the probability of one of those teams winning is around 70%. So there's still 30% chance of an upset (most probably from Belgium).
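
If you want to reproduce that kind of number yourself, the usual back-of-the-envelope route is 1 / decimal odds, rescaled so the probabilities sum to 1. A sketch under those assumptions; the odds below are invented placeholders, not Betfair's actual prices, and proportional rescaling is only one simple way to handle the bookmaker margin:

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities that sum to 1.

    Assumption: the margin is spread proportionally across all outcomes.
    """
    raw = {name: 1.0 / odds for name, odds in decimal_odds.items()}
    total = sum(raw.values())
    return {name: p / total for name, p in raw.items()}

# Hypothetical decimal odds, for illustration only.
odds = {
    "Favourite 1": 5.0,
    "Favourite 2": 5.5,
    "Favourite 3": 7.0,
    "Favourite 4": 8.0,
    "Everyone else": 3.0,
}

probs = implied_probabilities(odds)
favourites = sum(p for name, p in probs.items() if name != "Everyone else")
print(round(favourites, 3), round(probs["Everyone else"], 3))
```

The chance of an upset is then just one minus the favourites' combined share.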
