Your own movie database in 5 minutes with IMDb API and Perl (bobbelderbos.com)
70 points by bbelderbos on Nov 21, 2011 | 44 comments



The terms and conditions seem pretty clear that you can't save site content except for page caching: http://www.imdb.com/help/show_article?conditions

>IMDb grants you a limited license to access and make personal use of this site and not to download (other than page caching) or modify it, or any portion of it, except with express written consent of IMDb. This site or any portion of this site may not be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of IMDb. This license does not include any resale or commercial use of this site or its contents or any derivative use of this site or its contents.

(And no, I'm not trying to be a negative Nancy. I've wanted to analyze IMDb for a while but have known that it's not possible to do so without breaking the terms and conditions.)


There is some data available for personal, non-commercial use. http://www.imdb.com/help/show_leaf?usedatasoftware

However, licensing the database is super expensive: "We offer licensing packages that start at US$15,000 per year." http://www.imdb.com/help/licensing/contact


Right, which brings us to the question: who actually owns the data, i.e. where did IMDb get its data? Is a movie's crew public information? Will the studios give this information to me if I ask them?

I think it would be interesting and very useful to have a db of information on all Hollywood movies (my guess ~50-60K) and make it freely available.


It doesn't matter who owns the information. If you break the Terms of Service, you don't get to use their service to access the information.


I think it matters a lot if your intention goes beyond playing with the data into using it commercially. The question is: can I download the IMDb data (by any means necessary) and use it for my startup? To me, their license rules this out.

My question was: if the data is public, can IMDb enforce this hold on it? Probably not, as there's precedent of courts not siding with museums that tried to shut off access to images of the objects they hold, citing the effort required to take photos, etc.


Well, facts are generally not copyrightable. But the specific representation of the data on the IMDB website or through the API probably is copyrightable. So making an unauthorized copy might be illegal, and redistributing that raw data is definitely a violation of copyright.

But if you get the data from IMDB, and "substantially transform" the actual representation into something else, they can't claim an infringement just because you copied their facts.

But then again, if you break the ToS, they might get you on "unauthorized access of a computer system" with commercial intent, so huge fines etc. even if there is no copyright violation.


Who would be paying for this DB to be hosted? It's always interesting to talk about making things "free", but in the end someone is paying for it.


Well, we're not talking of GB of data here, or are we? Let me see: I think about 1K movies are produced every year; halve that and multiply by 100, and you get about 50K movies total. Let's double that to account for shorts, independents, etc. and we get 100K movies in the db. How much info is there for one movie? The crew of Titanic from IMDb as text is around 85K. Let's say less than 256K per movie. So we're looking at about 13GB of data. With lossless compression, e.g. LZW and some austerity, say ~5GB.
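
For what it's worth, the back-of-envelope numbers can be checked with a few lines of Perl. This is only a sketch using the guesses from the estimate above (500 features a year for 100 years, doubled for shorts and independents, 256K per title); the helper name estimate_gb is made up for illustration. Note that the 13GB figure corresponds to the 50K count; with the doubled 100K count it comes out nearer 26GB before compression.

```perl
use strict;
use warnings;

# All inputs are rough guesses taken from the estimate above.
my $features_per_year = 1_000 / 2;                  # "about 1K ... halve that"
my $years             = 100;
my $features          = $features_per_year * $years; # about 50K feature films
my $all_titles        = $features * 2;               # doubled for shorts, independents
my $kb_per_title      = 256;                         # upper bound per movie, in KB

# Hypothetical helper: total size in (decimal) GB for a given number of titles.
sub estimate_gb {
    my ($titles) = @_;
    return $titles * $kb_per_title / 1_000_000;      # KB -> GB
}

printf "%d titles: %.1f GB\n", $_, estimate_gb($_) for $features, $all_titles;
# 50000 titles: 12.8 GB
# 100000 titles: 25.6 GB
```

Either way it stays in the tens of GB uncompressed, so the conclusion (not tiny, not huge) doesn't change.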

That's not tiny, but not huge. And the amount of download bandwidth will not be much, due to long-tail effects. A company like Google or Amazon, which stands to benefit a lot from such a db, can easily accommodate this.


Amazon already have a database like that, called imdb. ;)


Yeah, I meant one that wouldn't charge ~$15K for commercial access. One that would charge per db access would be nice.

But it's not exactly the money (after all, fifteen grand, although excessive, is not prohibitive) but the unreliability factor: if you're building a business on a db API, there should be a warranty that the company won't decide to abandon it or cut your access unreasonably.


My point was that, seeing as Amazon own imdb, they probably don't charge themselves ~$15k for access.


Anyone know of a good place to get movie data (and possibly cast/crew info) for free or low cost for a commercial project?


Wikipedia has loads of movie data. You can use the API to extract it, or use dbpedia.org for a more friendly view.


freebase.com


themoviedb


There's also a CPAN distribution for that: https://metacpan.org/release/TMDB


Why not just download the original db files from http://www.imdb.com/interfaces ? I mean, parsing the whole db through the API may not be a good idea. There's a tool (http://www.jmdb.de/) which loads the tables you need into MySQL or PostgreSQL.


Wow. Props to AMZN. How long have these files been available?

I prefer to use my own choice of parsing tools and database software and really appreciate provision of raw files like this.

But I'm not sure what the purpose of requiring attribution is if the data can only be used for personal use. Assuming we comply with the license and do not share the data, who else besides the user is going to see it?

If a user builds a better movie database with this data, he must not share it with the world or even his neighbor. Sorry, them's the rules.


IMDb started out as a volunteer-contributed project.


Interesting.

So, 1. the text files have long been publicly available, since before the acquisition and 2. it was not a commercial project prior to acquisition.

Who wrote the non-commercial license? Was that also written before the acquisition?



User-generated content. Make a list of factual information. Accept submissions from users. Watch it grow. Then partner with a large company to sell products. And sell licenses to the data for $15K/year. (Give the top contributors a "free" membership to something to keep them from suing you.) Yeah, that could work.

Movie information, restaurant reviews, video clips, you name it. The miracle of user-generated content.


Why are you giving props to Amazon?


Because I believe they acquired IMDb many years ago and would have the final say in how the data may be used.


The files have been there since before the acquisition.


Unfortunately that Perl code is not a good representative of how one should write Perl in 2011.


Would you care to rewrite it for us (me)? Just curious how Perl should look now.


I'm a complete cowboy, and would probably write something like: http://pastebin.com/mvjTG9XG

Edit: spot where I missed a comma :-P


Don't quote the variables, use placeholders. It's much simpler, and it's safer.

$dbh->do( 'INSERT INTO movie_collection VALUES ( ?, ?, ... )', undef, @{ $data->{movie} }{@fields} );
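
To make the one-liner above a bit more concrete, here is a small sketch of how the placeholder list and the hash slice line up. It needs no database; the column names and the sample record are hypothetical, chosen only to show values with embedded quotes.

```perl
use strict;
use warnings;

# Hypothetical column names and sample record, for illustration only.
my @fields = qw(title year director);
my $data   = {
    movie => { title => "Director's Cut", year => 2011, director => "O'Neil" },
};

# One "?" per column, so the SQL text never contains the values themselves.
my $placeholders = join ', ', ('?') x @fields;
my $sql = "INSERT INTO movie_collection VALUES ( $placeholders )";

# Hash slice: pulls the values out in the same order as @fields.
my @values = @{ $data->{movie} }{@fields};

print "$sql\n";                      # INSERT INTO movie_collection VALUES ( ?, ?, ? )
print join(' | ', @values), "\n";    # Director's Cut | 2011 | O'Neil

# With a connected DBI handle, the statement would then be run as:
#   $dbh->do( $sql, undef, @values );
```

The apostrophes in the values never touch the SQL string, which is why no escaping sub is needed.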


  > knee-jerk reaction
Fixed that for you.


Sure. Where's the dbh coming from in this example though? And how do I get it out again nicely to print the SQL to the command line?


It doesn't exist in the example, which is only creating the INSERT statements, presumably to pipe into, e.g., the mysql command line client:

mysql -u user -p < inserts.sql

If the $dbh were in the example, then:

(a) you could avoid that (eek!) archaic escapeSingleQuote() sub, e.g., my $released = $dbh->quote($released), but much better:

(b) as mentioned, use the SQL placeholders to avoid quoting altogether, but

(c) if you really want to print the INSERTs, just do method (a). Start by declaring the $dbh, e.g. for MySQL:

my $dbh = DBI->connect( 'DBI:mysql:my_db', 'db_user', 'db_pass' );


Made the SQL portion a little more readable: http://pastebin.com/MNzD0NSE


Ah, much better.

[It doesn't need saying but one could be even lazier and instead of generating a timestamp, one could use NOW() or an automatic timestamp field.]


I don't have time for that..

Looks good though, much more readable.


If anyone is looking for a movie database with less strict terms of use, check http://www.themoviedb.org/

This is the service "used by many popular media centers like Moovida, XBMC, Plex, MythTV and MediaPortal."

They have a pretty nice API too. http://api.themoviedb.org/2.1/


Argh, a screenshot of code.


The code is linked in a few places:

http://bobbelderbos.com/src/moviecollection/getMovieData

It is annoying though; a screenshot doesn't do those using screen readers (or search engine bots!) much good.


Interesting. I've got a big XBMC media setup, and scraping metadata from a local source could certainly speed things up nicely instead of downloading from IMDb constantly.

See:

http://lifehacker.com/5536963/the-ultimate-start-to-finish-g...

In particular:

http://lifehacker.com/5505849/how-to-whip-your-movie-and-tv-...


I found this really cool. Last night I knew nothing about Perl or MySQL, and just working through little things like this can really get you going on learning to code and grasping a basic understanding of the syntax. But no PHP code linked, just a video shot? I was hoping to get that part going as well.


Thank you. I uploaded the PHP code of the example site you saw in the video: http://bobbelderbos.com/src/moviecollection/site_php


Awesome, thanks. I saw lots of other interesting hacks on your blog too, I intend to play around with some of them, it's really the best motivator for me when it comes to learning programming.


I am glad to hear that. Just let me know (via my contact form) if you have any questions, feedback or interesting topics I could investigate. thanks


Thanks all for your feedback. This is just for personal use. Thanks for the Perl suggestions, I am just learning.



