Hacker News
Movie Review Aggregator Ratings Have No Relationship with Box Office Success (minimaxir.com)
49 points by minimaxir on Jan 10, 2016 | 47 comments

This is a classic case of learning nothing by averaging together two different populations (as acknowledged in the article: critically acclaimed indies and mainstream blockbusters). It's like averaging the height of an NBA team with a kindergarten class. It would probably be more informative to group movies by size of release (number of theaters they open in), as well as advertising budget. Just by eye there seems to be a tendency for well-reviewed blockbusters to do better than poorly reviewed ones.

Right. For movies over a $10m budget, there is a positive correlation, and because the OP's graphs are log-scaled, it isn't a minor one.

Indies form the second clump high in the rankings.

The negative correlation is caused, as the density contours show, by a knot of highly reviewed indies showing up in the data, confusing the positive correlation of the studio movies.

Alright, I double-checked, with appropriate filters on data, and RT scores:

Indie cluster alone:

log-corr: -0.1243865

corr: -0.1673577

Blockbuster cluster alone:

log-corr: 0.230726

corr: 0.2570628

Maybe I should have separated the clusters, and I'll definitely do that for further analysis (even though at best, the correlation is weak in either direction, and the log doesn't change much). I did, however, want to address the general reliance on RT score for any movie, so I stand by the post.
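As a sketch of that computation (with made-up score/gross pairs, not the OP's dataset), the raw and log correlations compare like this:

```python
import numpy as np

def corr_and_log_corr(scores, grosses):
    """Pearson correlation of review scores against raw grosses
    and against log10(grosses)."""
    scores = np.asarray(scores, dtype=float)
    grosses = np.asarray(grosses, dtype=float)
    corr = np.corrcoef(scores, grosses)[0, 1]
    log_corr = np.corrcoef(scores, np.log10(grosses))[0, 1]
    return corr, log_corr

# Hypothetical cluster: RT scores vs. box office gross (dollars)
scores = [92, 85, 78, 60, 55, 40]
grosses = [2e6, 5e6, 1e7, 8e7, 1.5e8, 3e8]
corr, log_corr = corr_and_log_corr(scores, grosses)
```

Because gross spans several orders of magnitude, the raw and log correlations can differ noticeably; here the sign is the same either way.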

> the general reliance on RT score for any movie

The reliance on RT score for what? What are people routinely extrapolating from the RT score that you think is not valid?

Extrapolating budget from the RT score is definitely invalid: you've shown that. But who does that?

People might go the other way: extrapolate quality from a big budget (it must be good, it's a big blockbuster), but as you say, that correlation is valid (though noisy).

So not quite sure what you're standing by, sorry. I don't want to sound rude and argumentative. I'm not trying to be a jerk, I'm very happy to concede your analysis is applicable where it is, I just genuinely don't understand what you're referring to here.

Very frequently I've heard the question "is this a good movie?" and the response is "what does the RT score say?" Many online stores include the RT score inline when you buy a movie too.

This avenue doesn't immediately distinguish whether a movie is a blockbuster or indie or how much money it made. But then again, some people do read movie descriptions thoroughly...

Commercial success and quality ("good movie") are obviously very different things. Critics care about the latter. Quality is hard to quantify, but that's no excuse to take such a poor proxy as sales. If you want to test whether Rotten Tomatoes scores are meaningful, I say you'd have to test whether they are meaningful to a particular individual with particular tastes; i.e., test the correlation of someone's judgments against a critic's.

[edit: Read the article again, deleted previous response, sorry]

So you think box office is a better correlate of 'good movie' than RT score? I didn't get that from the post, but if that's what you're saying, then yes, I concede you have shown that correlation is not valid.

Maybe not better, but more important.

I'll update the post with a footnote tomorrow clarifying things.

Having worked for a short time in the movie industry (at DreamWorks Animation), I thought this was obvious. And it was certainly true for DreamWorks' movies. Even people working directly on the movies were terrible at predicting how they would perform at the box office.

The reason for this is simple: movie critics are self-selecting movie aficionados. Their tastes just don't reflect the tastes of the public at large. When you aggregate critical reviews, you're only aggregating the opinions of a tiny sample of the movie-going public, and that sample largely shares the same tastes.

I also collect data, from the very high-brow critics in Cannes, and recently also Sundance for comparison.


What is evident is that very often critics gather to champion objectively awful movies, awful even by professional movie critics' standards. Look, for example, at the surprising success of this year's "Carol" and "The Assassin" in Cannes and their disappointing reception in the real world. Well, "Carol" is one of those special cases which still has some champions left over. And compare it to the exceptional Cannes movies this year: Dheepan, Umimachi Diary and Embrace of the Serpent. The critics didn't appreciate them and didn't see their success coming. However, the professional festival programmers and juries at the follow-up festivals did.

But the same happens almost every year: "Adieu au Langage" (Jean-Luc Godard) highest rated movie of 2014, "Holy Motors" (Leos Carax) 2012, "Le Havre" (Aki Kaurismaki) 2011, "Film Socialisme" (Jean-Luc Godard) 2010. The jury votes are usually much better than the critics' votes.

Comparing quality to quantity (advertising budget, IMDb ratings) makes no sense at all.

And specifically, I suspect that a lot of the high box office films are big, loud action films which appeal to the theater-going demographic far more than they do to movie critics.

The British film critic Mark Kermode wrote about this at length in his book The Good, The Bad and The Multiplex. Spend enough money on a movie and it is almost guaranteed to be profitable, because sheer extravagance turns it into a "motion picture event".

https://www.youtube.com/watch?v=cDmluUKLbYc http://www.theguardian.com/books/2011/sep/08/good-bad-mulitp...

I don't find this convincing because it feels like there must be confounding variables here. For example, high budget movies may be judged more harshly overall than indie movies, which are clearly not going to make as much money.

It would be interesting to see how the picture looks when controlled for movie budget, time of year released, etc.

As a dedicated reader of movie criticism, I can say with absolute certainty that well-known actors and directors are in general held to a higher standard.

This analysis is just a first step. I plan to look at the impact of specific actors/directors/publishers on Box Office Gross as well.

I actually looked at this briefly in my Master's thesis. https://lup.lub.lu.se/student-papers/search/publication/1626...

Rotten Tomatoes was the only one that had any hint of meaningfulness. The others (IMDb, Metacritic) had none. Twitter volume, Twitter sentiment, and the number of theaters a movie opened in were pretty good indicators though.

And ad or promotional spend? This is a missing factor, surely?

A movie will have a reputation: good, bad, or indifferent. But a movie also has to be known. Ad spend gets the movie in front of eye-balls at least insofar as it gets to be known to exist. But also word-of-mouth, which is like free ad spend. This is why there are slow burners and mega-flops.

Incidentally this is why I have refused to see Titanic, Avatar, and the latest Star Wars in the theatre. If you scream in my face I'm not going to watch your movie in the theatre. Hype turns me off seeing something, but that's just me not wanting to roll with the crowd perhaps, and not a function of hype itself. Seems like most people don't react this way though.

Didn't investigate it at all. Not really sure if it would or wouldn't be meaningful. Just thinking off the top of my head, Word of Mouth should be generated by ad spend. It's measuring ad efficacy in a way. Which might be better than ad spend (total dollars) for predicting success.

> If you scream in my face I'm not going to watch your movie in the theatre

You've caught my interest here - Is this out of dislike for advertising as a whole? Do you apply this principle to non-movie situations? (TV, food products etc)

Movie reviews are evaluations of the quality of a movie, not the marketability of a movie. I'd be surprised if there were a direct relationship.

Next article: Evaluation of Paintings by Well-Regarded Contemporary Critics Has No Relationship to Future Dorm Room Poster Sales Figures.

Edit: Also, a measure of success that doesn't find a way to divide by the budget somewhere is not a good measure. A lot of flops have made 100 million dollars, they just cost 120 million to make.
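The arithmetic behind that point, using the comment's own hypothetical figures:

```python
# A film can gross $100M and still lose money relative to its budget.
gross = 100_000_000
budget = 120_000_000

roi = gross / budget      # multiple returned on the budget (< 1 here)
profit = gross - budget   # negative: a "$100M movie" that flopped

# (Real profitability is murkier still: theatrical gross is split
# with exhibitors, and marketing spend isn't in the production budget.)
```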

Gross vs. budget is huge; just look at any Blumhouse film.

Not finding a significant effect when you fail to control for confounding variables is not evidence of no effect. This is ridiculously bad analysis. His contour maps basically disprove the headline. How is it on the front page?

Just real quick here. Box Office Mojo has data on 702 movies released in 2014. (I only wanted to do one year because there's no API I know of for this and I was doing it quickly; I chose 2014 instead of 2015 because many 2015 releases are still earning money, and I wanted the most recent "complete" year.) The correlation between total box office gross and number of theaters shown in is very robust, at .79.

So if you look at the top Metacritic movie from 2014, Boyhood, you can see it has a low gross, ranked 100th in Box Office Mojo's data; but it was only shown in 775 theaters. Mr. Turner, the second movie in Metacritic's 2014 rankings, was shown in only 120 theaters. Birdman, the Best Picture winner and the 16th-ranked movie on Metacritic's list, was in 1,213 theaters. Meanwhile, you have to go to the 21st movie on the top-grossing list to find a film that was in fewer than 3,000 theaters.

It's a classic example of Simpson's paradox[1]. You have two populations, wide-release movies and small-release pictures, and within each population there's a positive correlation between critical approval (as measured by Metacritic) and box office gross, which is picked up by the contour maps. But because the highest Metacritic scores go to movies released in 2,000 theaters or fewer, without controlling for the number of theaters a movie shows in, it looks like there's a negative relationship between critical acclaim and box office totals.

1) https://en.wikipedia.org/wiki/Simpson%27s_paradox
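A quick simulation (synthetic numbers, not the Box Office Mojo data) shows how two sub-populations, each with a positive score-gross correlation, can pool into a negative overall correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical wide-release cluster: modest scores, huge grosses,
# positive score-gross correlation within the cluster.
wide_scores = rng.uniform(30, 70, 200)
wide_gross = 1e8 + 2e6 * wide_scores + rng.normal(0, 3e7, 200)

# Hypothetical limited-release cluster: high scores, small grosses,
# also positively correlated within the cluster.
indie_scores = rng.uniform(70, 100, 200)
indie_gross = 1e6 + 2e5 * indie_scores + rng.normal(0, 3e6, 200)

def r(x, y):
    """Pearson correlation coefficient."""
    return np.corrcoef(x, y)[0, 1]

within_wide = r(wide_scores, wide_gross)     # positive
within_indie = r(indie_scores, indie_gross)  # positive
pooled = r(np.concatenate([wide_scores, indie_scores]),
           np.concatenate([wide_gross, indie_gross]))  # negative
```

The pooled correlation flips sign because the between-cluster effect (high-scoring movies sit in the low-gross cluster) swamps the within-cluster trends, which is exactly the Simpson's paradox pattern described above.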

That's a fair counterpoint. I'll see if I can find theater showing data (that was not in my dataset, and as you note, it's hard to get), and see if I can normalize.

Ticket sales count how many people thought the movie was good enough to watch (and that depends highly on marketing spend), not how good they thought it was.

I would like to see this analysis run on box office as a percentage of budget, not just top-line revenue.

A few years ago I was listening to "How to Listen to and Understand Great Music"[1], and the speaker made an interesting point about art.

Let's say you take a friend from another country to a baseball game. He's never seen baseball and knows nothing of the game. He can certainly enjoy himself -- there will be lots of colors, activity, and it's quite a spectacle. He may even want to come back.

As he learns more and more about baseball, however, he will have a completely different experience. It's the same game, but now you see a lot more of the complexity and nuance.

As a film fan, I use the meta-ratings almost exclusively. I'll go and watch popular films. Sometimes I'll even enjoy them. But there's a lot going on in cinema. I like understanding the nuance and detail, and I think well-made movies make for better experiences.

1. http://www.thegreatcourses.com/courses/how-to-listen-to-and-...

As someone who's not from a baseball country: I strongly doubt that friend would enjoy himself. It's not exactly action packed. I think the decisive factor isn't knowing more about the game but knowing more about the meta-game.

Sure, you will experience it differently if you know what good tactics look like but most people enjoy it because they are emotionally involved with the teams and players. This seems to be true with most sports, even for the more "knowledgeable" fans.

Aside from carefully constructed surveys, the best indicator of US box office success that I have found is Wikipedia page views http://journals.plos.org/plosone/article?id=10.1371/journal....

Outside the US it's a harder problem http://arxiv.org/abs/1405.5924

There is no effing way this is true. Everyone I know under 40 checks the RT score of a movie and factors that into their decision to see it.

I think it is safe to say that fully 1/2 of Americans would seem almost alien to you if you were to interact with them in their natural habitat(s). It's true for me, and I was raised among them, before fleeing as soon as I was old enough to drive away.

I mean no offense to anyone. But there's a lot of folks out there living a very different lifestyle than the average HN'er.

Branding is important.

Although I said I wouldn't disclose the data, here is a spreadsheet of the Top Movies by Box Office Revenue with a RT score <= 20%: https://www.icloud.com/numbers/0005voEEmzex1t4xnMHD3u3HQ#box...

Films like Alvin and the Chipmunks and Grown Ups are objectively bad, but they still make money because people don't care. Yes, these movies likely had a lower budget, and that can be weighed against box office revenue, but I explained my concerns with that in the article.

Selection bias, much?

I mean, something has got to account for the fact that Adam Sandler still has an audience.

Marketing budget (or maybe degree of awareness of the movie's existence), theater/screen count, and audience score predict how well a movie will do. Appeal and marketing, rather than movie quality, have always been what draws people to the movie theater.

Normally, I'm all for research, experiments, and data crunching that confirm or challenge ideas we pretty much all take for granted, but this makes me wish there were some very polite way to say, "well, duh!"

Given the "There is no effing way this is true" comment in this very thread, this may not be as "well duh" as you think.

That's why data analysis is always important, even if it's dumb.

A lot of people don't realize it, but RT also shows a mean rating (out of 10), which is the same type of scoring as Metacritic. It might not be exposed in the API mentioned in the article, though.

Interesting! My question is: what do the p-values of your Pearson correlations look like? Correlations alone are vague. Please insert :-)

I deliberately avoided using P-values since that metric is used to imply a causal relationship between review scores and box office gross, which is definitely not true.

In the evaluation of correlations, it can always be informative to know the confidence interval for r, with all caution towards p-value interpretation.

Surely, correlation provides information on association rather than cause and effect (causation should rather be modeled with Granger and other regression models). Sample sizes and variances will certainly contribute to different p-value outcomes. This is because p-values reward low variance more than the magnitude of impact (Type I/II error etc.). If you have p-values, better report them and add a footnote on how to interpret them.

> However, the significance or p-value reflects the probability that the correlation does not imply a causal relation.

Technically, in this case, a significance test would answer the question "is the Pearson correlation significantly different from 0?" Here we would expect it to fail since it clearly isn't, and the test is therefore less helpful/important. (Even if it passed, the conclusion would be "correlations are low in magnitude and therefore do not matter," as noted in the post anyways.)

Finding the exact P-value of a Pearson correlation requires setting up bootstrapping, which is not something I have handy at the time but will work on in future posts.

Again, I'm not looking at R^2 and the P-value of a linear regression, which is different.

It's just a recommendation to improve the reporting, not a general defense of p-values. Reporting a Pearson correlation does not imply a claim about causation. I see the point that it's not linear (otherwise you would have fitted a linear regression, I assume), but I can still tell you that missing p-values may raise eyebrows :)

In small sample sizes, correlations can easily come out significant, often at the cost of low confidence. Conversely, in large sample sizes, the magnitude of the effect may be lower but estimated with higher confidence. In both cases, results have to be interpreted with caution; this is because p-values reward low variance more than magnitude of impact (Type I/II errors, etc.). The recent p-value debate points toward a lot of issues here. For instance, there have been medical studies overestimating correlations in small samples, while other authors seemed to underestimate their long-term large-sample results with correlations in the ballpark of 0.15 (p < 0.05).

You need to stop saying stuff. You're incredibly wrong here.

Could you please state the problem instead of making ad hominems? I'm genuinely curious.

The short version is that p-values have nothing to do with establishing a causal mechanism. It's a test of statistical significance, it doesn't try to say if that significance is because x causes y or y causes x or some unknown variable z causes both y and x.

The long version: So, in this case, we have two variables, Metacritic score (or Rotten Tomatoes percentage) and box office gross. We have measured the correlation between them for some number N of movies. N is smaller than the total population of movies that could be evaluated; if nothing else, the analysis doesn't consider movies that haven't been released yet. So the movies evaluated are considered, for the purposes of a p-value test, to be a sample of size N from a hypothetical infinite population of movies with box office grosses and Metacritic scores. The p-value test also assumes that there's something called a null hypothesis, which here is that there is no relationship between Metacritic scores and box office gross. The null hypothesis is the hypothesis that all other hypotheses are evaluated against.

What the p-value measures (in this case, it can be applied to other statistics as well) is the probability of seeing a correlation of that size or greater given a random sample of N observations if the null hypothesis is true. What's notable about this is that the p-value is not testing anything about any hypothesis other than the null hypothesis -- it can be used as evidence about the likelihood of the null hypothesis being true, but past that it has nothing to say about any OTHER hypothesis. Which is why it's wrong to say that a p-value test is evidence of a causal relationship -- it's not even trying to test that.
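That definition ("probability of seeing a correlation of that size or greater under the null") can be made concrete with a permutation test, one resampling route to the p-value being debated in this thread. This is a sketch on made-up data, not the post's dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

def perm_pvalue(x, y, n_perm=2000):
    """Two-sided permutation p-value for the Pearson correlation:
    the fraction of random shufflings of x that produce an |r| at
    least as large as the one actually observed (the null being
    "x and y are unrelated")."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = sum(
        abs(np.corrcoef(rng.permutation(x), y)[0, 1]) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Made-up data: one strongly related pair, one pure-noise pair
x = np.arange(30, dtype=float)
y_related = 2 * x + rng.normal(0, 5, 30)
y_noise = rng.normal(0, 5, 30)

p_related = perm_pvalue(x, y_related)  # tiny: null is implausible
p_noise = perm_pvalue(x, y_noise)      # typically large: consistent with null
```

Note that the shuffling only breaks association; a small p-value says the observed correlation is unlikely under "no relationship," and nothing about which variable causes which.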

Yes, this is correct.

I just realized I misread the GP's comment: thought it said "P-value" in general instead of "P-value of Pearson correlation," as a result I thought he was referring to a regression P-value.

As far as I understand the p-value of a Pearson correlation is exactly the same as the p-value of a simple linear regression.

You understand correctly. I have no idea what he's even trying to say now.
