
IMDB Top 100K Movies – Analysis in Depth, Part 2 - fernly
http://bugra.github.io/work/notes/2014-02-23/imdb-top-100-movies-analysis-in-depth-part-2
======
izzydata
When did IMDB go online? This seems like one of those things where nostalgia
factor is pretty big. If people were rating those movies from a few decades
ago at the time they were in theaters then these graphs might look pretty
flat.

That or it is becoming increasingly harder to entertain people with new ideas
as they watch more and more movies. People could be getting desensitized to
entertainment.

~~~
thrownaway2424
IMDB is practically the oldest thing on the web. The site launched in 1993 but
it imported the archives of relevant usenet (rec.arts.movies) and mail
archives going back to the 80s.

------
beloch
Impressive work! I have questions/ideas:

Is there any way to extract movie ratings vs time from your data-set? I've
subjectively noticed a few patterns in movie ratings, and would love to see
quantitative confirmation.

1\. Rating quality tends to spike at or shortly after a film's release, but
then drops off gradually as the glow of the PR campaign wears off, or perhaps
because people who go to see movies in their opening week are more predisposed
to like them (e.g. fanboys, etc.). I usually expect a film that has just been
released to be similar in quality to a film that's been on home video for a
while with a rating half a point or more lower. (I am ashamed to admit that I
watch enough movies that I can usually peg a film's IMDB rating to within a
quarter of a point, independently of how much I personally enjoyed it.)

2\. Rating quality for older films tends to be held back by poor quality home
video releases and often improves significantly, but gradually, after quality
transfers are released. e.g. I distinctly remember seeing Zulu several years
ago and thinking, "Gee, that film was way better than the IMDb rating!". Back
then, it was in the high 6 range. Now it's up to 7.8!

I suspect IMDB does not keep track of times associated with votes, or at least
does not provide that data publicly, so the best you could do would be to
crawl IMDb periodically. It shouldn't take more than a year's worth of data to
see if point #1 holds up. #2 is a lot more difficult because the effect is
more gradual, and you'd need to start bringing in other data sources, like
audio/video quality ratings from home video review sites that do tend to be
somewhat unreliable. I find #2 to be a question of interest though. If you
found any other form of correlations with titles whose ratings improve
substantially after quality home video releases, you would have potentially
found a way to identify under-appreciated films. If you pulled that off, in
addition to discovering some good flicks to watch, companies like Kino and
Criterion would probably start knocking at your door!

~~~
burntsushi
> Is there any way to extract movie ratings vs time from your data-set? I've
> subjectively noticed a few patterns in movie ratings, and would love to see
> quantitative confirmation.

I'm not the OP, but I'm familiar with the data IMDb provides. The short answer
to your question is: not easily.

IMDb provides a plain text dump of a subset of their data here:
ftp://ftp.fu-berlin.de/pub/misc/movies/database/ (it's usually updated once a
week). So to get temporal data, I guess you'd have to track it yourself.

_However_, diffs are provided for each data update:
ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/

It looks like this could give you temporal data at the granularity of once per
week.
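
A minimal sketch of the track-it-yourself approach, assuming each weekly dump has already been parsed into a title-to-rating mapping. The parsing is not shown, and the `snapshots` structure, the function names, and all the numbers are made up for illustration:

```python
from datetime import date

def rating_series(snapshots, title):
    """Return [(snapshot_date, rating), ...] for one title, oldest first."""
    return [
        (day, ratings[title])
        for day, ratings in sorted(snapshots.items())
        if title in ratings  # a title may be missing from early snapshots
    ]

def changes(series):
    """Rating change between each pair of consecutive snapshots."""
    return [
        (later_day, round(later - earlier, 2))
        for (_, earlier), (later_day, later) in zip(series, series[1:])
    ]

# Hypothetical weekly snapshots: {snapshot date: {title: rating}}
snapshots = {
    date(2014, 2, 14): {"Film A": 8.1, "Film B": 6.9},
    date(2014, 2, 7): {"Film A": 8.3},
    date(2014, 2, 21): {"Film A": 8.0, "Film B": 7.0},
}

series_a = rating_series(snapshots, "Film A")
print(changes(series_a))
```

With a year or so of snapshots, the same series would show whether a new release's rating really drifts down after its opening weeks.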

------
ape4
Even though the internet is mainstream, I think imdb's ratings are still a bit
"nerdy". Granny doesn't go to imdb to rate a movie.

e.g. Batman: The Dark Knight Rises gets 8.6
[http://www.imdb.com/title/tt1345836/](http://www.imdb.com/title/tt1345836/)

~~~
pmr_
No one ever considers what Granny thinks where movie or television ratings
are concerned. The age groups normally considered are 18-34 or 18-49, because
those are the ones that far outweigh all others in terms of commercial
interest. Also, the other age groups don't make up a significant part of the
movie-going crowd, so even if they did contribute ratings it wouldn't make
much of a difference.

------
pessimizer
"Yet, [old movies] are getting better although not as high as old movies."

1) Don't forget that the raters are self selecting. For very recent movies,
the ratings will only be from people who were attracted by hype, and were
excited enough to immediately rate a film.

2) All of those ratings will eventually drop as people outside of the target
audience (outside of the natural consumers of initial hype) rate the film,
reverting the rating to the mean.

3) Most people that are aware of a particular older film are people who
remember it because it was/is a favorite, or people who encountered it
recently and found it interesting enough to research it on imdb, both probably
tending to skew positive.

If you were able to collect the changes in ratings over time, I'd bet that dip
of low ratings would always be around the same distance from the present.

I might be off on this, though, because there will always be a qualitative
separation between movies that were ratable at the time of release, and movies
released pre-imdb.

~~~
bugra
1\. Yes, I agree with you; as I wrote in the first post, that is mostly due
to selection bias:
[http://bugra.github.io/work/notes/2014-02-15/imdb-top-100K-m...](http://bugra.github.io/work/notes/2014-02-15/imdb-top-100K-movies-analysis-in-depth-part-1/)

2\. That is true, but I think that holds for __all__ of the movies, not a
subset of them.

3\. I am not quite sure about that, but I think it would be quite interesting
to look at how temporal information correlates with the rating of a movie.

------
brownbat
Really enjoying the analysis.

I hope you'll get a chance to check out writers after actors. It'd be a really
interesting test of Schreiber theory, which argues that the writer is a better
predictor of a film's quality than the director.

[http://en.wikipedia.org/wiki/Schreiber_theory](http://en.wikipedia.org/wiki/Schreiber_theory)

(I had a similar comment last time, sorry for hounding the same issue, just
not sure if you saw it.)

~~~
bugra
I missed the comment in the previous one, sorry about that. I am not familiar
with the theory, but if I could get the writers' information (right now I only
have the directors), that would be an interesting aspect of the data to
investigate.

------
AimHere
One thing I was hoping this series would shed light on was an oddity in
demographic voting patterns with the ratings system.

From my own (anecdotal) observation, there's a class of films where the
'Females 45+' rating is a big outlier in the voting, with a large proportion
of '1' votes. It's as if there's a gang of a few hundred users, classed by
IMDB as being part of that demographic, consistently downvoting films. I
don't know if it's a default setting that some bots or trolls have used or if
there really are a bunch of cinephobic old women out there just watching and
hating on films, but it did spark my curiosity as to what the pattern was. It
doesn't happen in every film, and it's rare that you see any other demographic
sticking out like this.

The best examples are critically acclaimed films with low voter counts. More
popular, or less well-received films tend to have a lot of negative noise to
begin with, so the effect is less noticeable, if it's there at all.

Here are some examples:

Army of Shadows :
[http://www.imdb.com/title/tt0064040/ratings?ref_=tt_ov_rt](http://www.imdb.com/title/tt0064040/ratings?ref_=tt_ov_rt)

Spartacus :
[http://www.imdb.com/title/tt0054331/ratings?ref_=tt_ov_rt](http://www.imdb.com/title/tt0054331/ratings?ref_=tt_ov_rt)

Ikiru :
[http://www.imdb.com/title/tt0044741/ratings?ref_=tt_ov_rt](http://www.imdb.com/title/tt0044741/ratings?ref_=tt_ov_rt)

Passion of Joan of Arc :
[http://www.imdb.com/title/tt0019254/ratings?ref_=tt_ov_rt](http://www.imdb.com/title/tt0019254/ratings?ref_=tt_ov_rt)

Breathless :
[http://www.imdb.com/title/tt0053472/ratings?ref_=tt_ov_rt](http://www.imdb.com/title/tt0053472/ratings?ref_=tt_ov_rt)

Z :
[http://www.imdb.com/title/tt0065234/?ref_=fn_al_tt_2](http://www.imdb.com/title/tt0065234/?ref_=fn_al_tt_2)

Barry Lyndon :
[http://www.imdb.com/title/tt0072684/ratings?ref_=tt_ov_rt](http://www.imdb.com/title/tt0072684/ratings?ref_=tt_ov_rt)

~~~
bugra
That is exactly why I generally prefer median-like averages to mean-like
averages in these types of crowdsourced systems. Generally speaking, votes in
the middle of the scale are distributed fairly evenly, whereas the ratings at
the edges are not. The pattern in the movies that you gave as examples could
be due to voters' preference for particular types of movies.
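
To make the mean-versus-median point concrete, here is a small sketch over a vote histogram. The histogram is invented, the function names are hypothetical, and this is not how IMDB computes its published rating; it only shows how a block of '1' votes moves the mean while leaving the median alone:

```python
def mean_rating(votes):
    """Plain arithmetic mean of a {score: vote_count} histogram."""
    total = sum(votes.values())
    return sum(score * count for score, count in votes.items()) / total

def median_rating(votes):
    """Lower median: the first score at which half the votes are covered."""
    total = sum(votes.values())
    seen = 0
    for score in sorted(votes):
        seen += votes[score]
        if seen * 2 >= total:
            return score
    raise ValueError("empty histogram")

# Invented histogram: mostly 8s and 9s plus a large block of 1s
votes = {1: 30, 7: 10, 8: 30, 9: 20, 10: 10}
without_ones = {score: n for score, n in votes.items() if score != 1}

print(mean_rating(votes), median_rating(votes))
print(mean_rating(without_ones), median_rating(without_ones))
```

In this made-up case the 1s pull the mean down by more than two points, while the median is 8 with or without them.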

~~~
AimHere
I don't think it would make much odds for IMDB. Sure you'll maybe negate the
overly high proportion of extreme votes, but I don't think that effect is
large. I did once take the top films on IMDB and remove the extreme votes, and
the effect was marginal: a few films shuffled a place or so. In exchange, if
you use the median, you could have polarising films, like Twilight, whose
ratings change markedly over time as people decide to start warring over
them and the numbers swamp the small number of ambivalents.

For your other point, there will be voting patterns for various demographics
and certain groups of people prefer certain movies. But I think the phenomenon
I'm describing is a bit too extreme to be merely a bunch of old women
disproportionately taking a dislike to Ikiru and then rushing onto the
internet to tell the world they hate it. It is the smallest demographic IMDB
has, so it's not hurting the ratings of the films in general, it's just an
oddity.
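
A rough sketch of the "remove the extreme votes and re-rank" experiment mentioned above, assuming per-film vote histograms are available. The film names, vote counts, and helper names are all hypothetical, and the plain mean here stands in for whatever weighting IMDB actually applies:

```python
def mean_rating(votes, drop=()):
    """Mean of a {score: vote_count} histogram, ignoring scores in `drop`."""
    kept = {score: n for score, n in votes.items() if score not in drop}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no votes left after dropping")
    return sum(score * n for score, n in kept.items()) / total

def ranking(films, drop=()):
    """Film titles ordered from highest to lowest mean rating."""
    return sorted(films, key=lambda t: mean_rating(films[t], drop), reverse=True)

# Invented histograms for three hypothetical films
films = {
    "Film A": {1: 5, 8: 40, 9: 40, 10: 15},
    "Film B": {1: 1, 8: 50, 9: 45, 10: 4},
    "Film C": {1: 2, 7: 60, 8: 30, 10: 8},
}

print(ranking(films))                # all votes counted
print(ranking(films, drop=(1, 10)))  # 1s and 10s removed
```

With these numbers the top two films swap places and nothing else moves, which is the kind of marginal shuffle described.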

------
bemmu
Next time please also include a simple list of top 1000 movies, it would be
helpful (without movies that received too few votes, of course).

~~~
AimHere
IMDB provides that by itself:

[http://www.imdb.com/search/title?groups=top_1000&sort=user_r...](http://www.imdb.com/search/title?groups=top_1000&sort=user_rating&view=simple)

------
wslh
Do you think that doing text mining on the films' subtitles would add
interesting information?

~~~
bugra
So, I tried to cluster the outlines, but generally they are quite short and
do not reveal much about what the movies are about, only the plot. So I am a
little skeptical that it would yield good results for subtitles.

~~~
wslh
Ok, it was just an idea.

------
jedberg
All of this data assumes that IMDB is a complete record of all films.

I don't believe it is, and I think the further back you go, the more likely it
is that only well-regarded films are in the database.

~~~
pessimizer
As databases go, I'd bet it's as complete as it gets. There are lots of awful
films from the 20s and 30s in there.

