
IMDB Top 100K Movies – Analysis in Depth, Part 1 - lauriswtf
http://bugra.github.io/work/notes/2014-02-15/imdb-top-100K-movies-analysis-in-depth-part-1/
======
sigil
You mention scraping, but did you actually scrape these? I just discovered
that IMDB publishes textual data dumps. They appear to be pretty complete.
Terms of use are non-commercial, but I'd love to see more analyses like this!

[http://www.imdb.com/interfaces](http://www.imdb.com/interfaces)

~~~
tmoertel
If you'd like to see more analyses, I gave a short talk about analyzing IMDB
data with R and Perl a few years ago. Here are the slides:

[http://community.moertel.com/~thor/talks/pgh-pm-perl-and-r.p...](http://community.moertel.com/~thor/talks/pgh-pm-perl-and-r.pdf)

A blog post on the topic:

[http://blog.moertel.com/posts/2006-01-17-mining-gold-from-th...](http://blog.moertel.com/posts/2006-01-17-mining-gold-from-the-internet-movie-database-part-1.html)

------
3rd3
The conclusion that runtime correlates with rating is not quite clear to me.
It looks to me that the scatter plot gets just more sparse as runtime
increases and the rating distribution remains more or less the same.

Somehow the number of movies exploded last(?) year.

~~~
bugra
It is true that as runtime increases, the number of movies decreases. However,
if you look at the mainstream runtimes (>80 and <100), they generally get a
wide variety of ratings for a given runtime, if not a uniform spread. On the
other hand, the movies with longer runtimes generally get a higher number of
votes.

You can observe the same behavior in the rating vs. # votes graph: as # votes
increases, the number of movies decreases. However, rating and # votes
correlate quite strongly.
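
One way to separate "the scatter just gets sparser" from "ratings actually shift" is to bin movies by runtime and look at the count and the mean rating per bin. A minimal sketch, assuming made-up `(runtime, rating)` pairs and an arbitrary 30-minute bin width; none of these numbers come from the post's dataset:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (runtime_minutes, rating) pairs; NOT taken from the post's data.
movies = [(85, 5.1), (88, 7.2), (92, 6.0), (95, 4.4), (99, 6.8),
          (125, 7.0), (130, 7.9), (150, 7.4), (185, 8.1)]

BIN_WIDTH = 30  # assumed bin size in minutes


def summarize_by_runtime(pairs, bin_width=BIN_WIDTH):
    """Group ratings into runtime bins; return {bin_start: (count, mean_rating)}."""
    bins = defaultdict(list)
    for runtime, rating in pairs:
        bins[(runtime // bin_width) * bin_width].append(rating)
    return {start: (len(r), round(mean(r), 2)) for start, r in sorted(bins.items())}


summary = summarize_by_runtime(movies)
for start, (count, avg) in summary.items():
    print(f"{start}-{start + BIN_WIDTH - 1} min: n={count}, mean rating={avg}")
```

If the per-bin means stay flat while only the counts drop, that would support the "sparser, not better" reading; if the means rise, it would support the correlation claim.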

~~~
_mulder_
I wonder why long movies are consistently rated higher? My guess would be for
two reasons. The first being audience selection. Long films are (IMO) more
likely to attract, and retain for the entire length, an audience with a prior
interest in the film, and therefore more likely to be an audience who know
beforehand if the movie is generally good or not.

Secondly, psychology will come into play. The more time an audience invests in
a film, the more likely they are to seek a positive reward for their time so
they don't feel like they have got a bad deal[1]. Thus, they're more likely to
rate the film higher than it perhaps otherwise would be. I also believe this
holds true for 'art house' films that are difficult to follow and perhaps less
enjoyable than a more mainstream film. Audiences will rate them higher to
reassure themselves that they haven't just wasted 2 hours watching something
boring that they don't understand.

Some links for further reading:
[http://en.wikipedia.org/wiki/Irrational_escalation](http://en.wikipedia.org/wiki/Irrational_escalation)
[http://en.wikipedia.org/wiki/Post-purchase_rationalization](http://en.wikipedia.org/wiki/Post-purchase_rationalization)

------
mcphilip
Great work!

I did a quick and dirty project[1] involving IMDB and Neo4j when I had some
time off between jobs over the holidays. I used screen scraping to get the
list of IMDB ids for the AFI top 100 movies and then made calls to MyMovieAPI
to pull down IMDB data about each AFI film. I wasn't aware of the
imdb.com/interfaces at that point, but it wasn't really my goal to do the
"best" possible implementation since it was just a learning experience. For
those interested, there's a simple overview of the project[2] that shows what
(I thought) were interesting questions about the data: for instance, which
actors, if any, have appeared in 2 or more of the top 25 AFI films?

After looking at imdb.com/interfaces, I'm not sure that it has what I'm
looking for. My plan for expanding this project at some point in the future is
to start with data from Freebase[3], since it's already presented in a
normalized format, and then fill in missing details via IMDB as necessary.

My ultimate goal is to generalize the N-degrees-to-Bacon trivia question to
work with any two actors, but that requires getting a lot more data to work
with.
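
The generalized question is a shortest-path search over a co-star graph, which breadth-first search handles directly. A minimal sketch, assuming a toy `casts` mapping with placeholder film and actor names (real cast lists would have to come from IMDB or Freebase); the function names are hypothetical and are not taken from the film-graph project:

```python
from collections import deque

# Hypothetical toy cast lists with placeholder names.
casts = {
    "Film A": ["Actor 1", "Actor 2"],
    "Film B": ["Actor 2", "Actor 3"],
    "Film C": ["Actor 3", "Actor 4"],
    "Film D": ["Actor 5"],
}


def build_costar_graph(casts):
    """Map each actor to the set of actors they share at least one film with."""
    graph = {}
    for cast in casts.values():
        for actor in cast:
            graph.setdefault(actor, set()).update(a for a in cast if a != actor)
    return graph


def degrees_between(graph, start, goal):
    """Breadth-first search: number of co-star hops, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, dist = queue.popleft()
        for costar in graph[actor]:
            if costar == goal:
                return dist + 1
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, dist + 1))
    return None


graph = build_costar_graph(casts)
print(degrees_between(graph, "Actor 1", "Actor 4"))  # 3
print(degrees_between(graph, "Actor 1", "Actor 5"))  # None
```

A graph database such as Neo4j would answer the same question with a built-in shortest-path query; the sketch only shows the underlying idea.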

All in all, it's a fun dataset to play with.

[1][https://github.com/mcphilip/film-graph](https://github.com/mcphilip/film-graph)

[2][http://htmlpreview.github.io/?https://github.com/mcphilip/fi...](http://htmlpreview.github.io/?https://github.com/mcphilip/film-graph/blob/master/film-graph-overview.html)

[3][http://www.freebase.com/film](http://www.freebase.com/film)

------
facepalm
Funny that the overrated movies list is dominated by Twilight. I suspect
boyfriends who were forced to watch them together with their girlfriends are
responsible.

------
brownbat
Buğra talks about looking at directors and actors next.

I'd really like to see whether directors or writers have a bigger impact on
quality of films. Like a smallish number of critics, including Pauline Kael,
I'm deeply suspicious of the auteur theory that everyone kind of
unquestioningly accepts.

“A filmgoer seeking out pictures written by, say, Eric Roth or Charlie Kaufman
won’t always see a masterpiece, but he’ll see fewer clunkers than he would
following even a brilliant director like John Boorman, or an intelligent actor
like Jeff Goldblum. It’s all a matter of betting on the fastest horse, instead
of the most highly touted or the prettiest.” - David Kipen

[http://en.wikipedia.org/wiki/Schreiber_theory](http://en.wikipedia.org/wiki/Schreiber_theory)

------
Juha
> Therefore, it may be safe to assume this ranking more or less holds true for
> non-top 250 movies as well.

This may not hold true. A while ago I was looking into it, and they seemed to
use a more complex weighted average without publishing exact details (possibly
using internal user scoring). This may affect the final rating in many ways. A
more detailed analysis is here:
[http://www.quora.com/Movies/What-algorithm-does-IMDB-use-for...](http://www.quora.com/Movies/What-algorithm-does-IMDB-use-for-ranking-the-movies-on-its-site?show=1)
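
For reference, the formula IMDB has published for its Top 250 chart is a weighted average that pulls titles with few votes toward the overall mean: WR = (v / (v + m)) * R + (m / (v + m)) * C. Whether the same weighting is applied outside the Top 250 is exactly what is in doubt here. A minimal sketch; the `m` and `C` values below are illustrative assumptions, not IMDB's current parameters:

```python
def weighted_rating(R, v, m, C):
    """Weighted rating as in IMDB's published Top 250 formula.

    R: mean rating of the movie
    v: number of votes for the movie
    m: minimum votes required to be listed (must be > 0 here)
    C: mean rating across all movies
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Illustrative numbers only; m and C are assumptions, not IMDB's actual values.
m, C = 25000, 7.0
for votes in (100, 25000, 1000000):
    print(votes, round(weighted_rating(9.0, votes, m, C), 2))
```

With few votes the result stays close to `C`; with many votes it approaches the movie's own mean `R`.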

------
vikp
Interesting stuff, although it would be nice to see more analysis and fewer
tables/charts. Some regression lines would also be good, and would help in
interpreting correlations.
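
As a rough illustration of what a regression line adds, here is a minimal sketch of an ordinary least-squares fit plus the Pearson correlation, written with the standard library only. The runtime and rating values are made-up toy numbers, not the post's data, and `fit_line` / `pearson_r` are hypothetical helper names:

```python
from math import sqrt

# Hypothetical toy data: runtimes in minutes and ratings on a 1-10 scale.
runtimes = [80, 100, 120, 140, 160]
ratings = [6.0, 6.4, 7.1, 7.4, 8.1]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


slope, intercept = fit_line(runtimes, ratings)
r = pearson_r(runtimes, ratings)
print(f"rating ~ {slope:.3f} * runtime + {intercept:.2f}, r = {r:.3f}")
```

The slope gives the change in rating per extra minute, which is easier to interpret than a scatter plot alone.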

I was wondering how your post got 3 million Facebook shares, then I realized
that you left in the default data-href attribute from the Facebook docs. You
might want to change that.

~~~
bugra
I will do the regression analysis in the second part of the post, thanks for
the suggestion!

Thanks for the bug feedback around facebook widget as well, I will fix it.

------
anjc
I'd love it if someone (who isn't as lazy as me) could figure out a
sophisticated way to show the actual good movies from a year, rather than the
popularly good ones. Sentiment analysis? Trend recognition? I don't know, but I
feel like IMDB and Rotten Tomatoes are now effectively useless for new movie
reviews.

------
infinitybeyond
I am having two problems with your site. In FF the data isn't centered. In
Chrome and FF I don't see anything in the preformatted code block. Newest
version of FF and Chrome on Win 8.1. FF is on the left, Chrome on the right.

[http://i.imgur.com/udHv4pH.png](http://i.imgur.com/udHv4pH.png)

~~~
smortaz
same here, macos, chrome and safari both show empty code blocks...

------
Implicated
Would be very interested to see the correlation of rating to
director/actor/actress/budget?

~~~
bugra
I am also very interested in the correlation of rating vs. director. However, I
do not have the budget information for the movies. It would be great if I
could find budget information and combine it with the data I have. I had
not thought of that; really good suggestion. Thanks!

~~~
matiasb
[http://www.the-numbers.com/movie/budgets/all](http://www.the-numbers.com/movie/budgets/all) helps?
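
Combining a budget table with the ratings data needs a join key, and title alone is ambiguous because two films can share a name. A minimal sketch that keys on `(title, year)`; the titles, ratings, and budgets are placeholder values, and the layout of any real budget source is an assumption:

```python
# Hypothetical ratings and budgets keyed by (title, release_year).
ratings = {
    ("Film X", 2008): 7.1,
    ("Film X", 2011): 6.4,
    ("Film Y", 2010): 8.0,
}
budgets = {
    ("Film X", 2011): 9_000_000,
    ("Film Y", 2010): 40_000_000,
}


def join_on_title_year(ratings, budgets):
    """Inner join: keep only movies present in both tables."""
    return {
        key: (rating, budgets[key])
        for key, rating in ratings.items()
        if key in budgets
    }


joined = join_on_title_year(ratings, budgets)
for (title, year), (rating, budget) in sorted(joined.items()):
    print(f"{title} ({year}): rating={rating}, budget={budget}")
```

Real titles would likely also need normalization (punctuation, alternate titles) before the keys match reliably.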

------
eCa
Interesting!

A couple of comments:

* The first two tables could be joined, with the movies from the first table bolded to distinguish them as "best rated".

* Should be: "not average runtimes(>70 and <120)" (not the other way around)

* The labels of the certificate graphs are on the wrong axis.

------
roshansingh
Gangs of Wasseypur was released in two parts as two separate movies and IMDB
has added the runtime of both the movies. However both movies were equally
good :)

~~~
hitlin37
True, I think the author should correct the result for the long movie list. It
gives the wrong impression about the movie. Here is the quick result from Google
about the movie duration of GoW:
[https://www.google.se/search?q=gnags+of+wasseypur+duration&o...](https://www.google.se/search?q=gnags+of+wasseypur+duration&oq=gnags+of+wasseypur+duration&aqs=chrome..69i57j0l3.8677j0j1&sourceid=chrome&ie=UTF-8#q=gangs+of+wasseypur+duration&spell=1)

------
JFrolich
Great analysis, and nice matplotlib visualisations. Would it be possible to
share the 'in' code to produce the graphs for learning purposes? :)

------
llimllib
ipython notebook rocks, these sorts of analyses are super easy to cook up.

------
matdrewin
Curious to know what tools you used to gather and build out the stats?

------
JoeAltmaier
Runtime vs Rating is essentially a heatmap; hard to draw conclusions.

------
JoeAltmaier
Best rated movies are bimodal: war movies, and gangster movies.

------
bemmu
Where can one get a list of more top movies than 250?

~~~
bugra
IMDB recently released their data, from which I took the ranking equation; you
may find it useful:
[http://www.imdb.com/interfaces](http://www.imdb.com/interfaces)

------
chaddeshon
Melancholia is not 450 minutes long.

~~~
bugra
It is not Melancholia (2011) but Melancholia (2008); here is the IMDB link,
which says 450 minutes:
[http://www.imdb.com/title/tt1269566/](http://www.imdb.com/title/tt1269566/)

In the table, I also gave the release year of the movie in order to avoid name
conflicts.

~~~
chaddeshon
I see. Sorry. I saw your release year and thought about name conflicts, but
still missed it.

Actually, my first thought was that Melancholia might have been 450 minutes
long. Because it felt that long when I watched it.

------
matiasb
nice!

------
flibertgibit
I'd like to see more breakdown by release year, e.g. # of movies in each
category by release year.
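
A breakdown like this is a simple group-and-count. A minimal sketch, assuming each movie is reduced to a `(release_year, category)` pair; the records below are made up and are not real IMDB entries:

```python
from collections import Counter

# Hypothetical (release_year, category) pairs, not real IMDB records.
movies = [(2010, "Drama"), (2010, "Drama"), (2010, "Comedy"),
          (2011, "Drama"), (2011, "Comedy"), (2011, "Comedy")]

# Count how many movies fall into each (year, category) cell.
counts = Counter(movies)
for (year, category), n in sorted(counts.items()):
    print(year, category, n)
```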

I think it would be interesting to look at those stats next to economic stats,
etc.

I'd also like to see a more granular breakdown of attributes of each movie
(movies relating to technology, movies with a workers' union being a strong
component of the film, race relations, international relations, etc.) and the
# of each of those per year, but that would be much more work.

