Hacker News

Very interesting to see the dataset being made available. Whenever I want to do this kind of analysis, I always stumble at 'how to get the data?'. The paper mentions that "We created a dedicated computer program to carry out the navigation and data collection tasks required to gather the metadata for all available videos...". I would love to see this program. More broadly, can anyone point me to the best resources (preferably Python) for learning to crawl/scrape this type of information?

Hi, I didn't release the crawler's code for two reasons: first, it wasn't well crafted enough to release (quick-and-dirty, straight-line code), and second, any change to the site you crawl means reworking your code.

I used Python, sometimes with Beautiful Soup and sometimes with lxml; both are very good for scraping. I would say Beautiful Soup is easier and lxml is cleaner.
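
For anyone starting out, here is a minimal sketch of the extraction step a scraper like this performs. It is not the crawler from the paper. To stay runnable without installing anything, it uses the standard library's `html.parser` as a stand-in for Beautiful Soup or lxml. The sample markup, the `video` class name, and the names `VideoLinkParser` and `extract_videos` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page snippet; a real site will use different tags and classes.
SAMPLE_HTML = """
<ul>
  <li><a class="video" href="/v/1">First clip</a></li>
  <li><a class="video" href="/v/2">Second clip</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""


class VideoLinkParser(HTMLParser):
    """Collect the href and link text of every <a class="video"> element."""

    def __init__(self):
        super().__init__()
        self.videos = []      # finished records
        self._current = None  # record being built while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The class attribute may be missing or hold several class names.
        if tag == "a" and "video" in (attrs.get("class") or "").split():
            self._current = {"url": attrs.get("href"), "title": ""}

    def handle_data(self, data):
        # Text only counts while we are inside a matching link.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.videos.append(self._current)
            self._current = None


def extract_videos(html_text):
    """Return a list of {"url", "title"} dicts found in html_text."""
    parser = VideoLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.videos


if __name__ == "__main__":
    for video in extract_videos(SAMPLE_HTML):
        print(video["url"], video["title"])
```

With Beautiful Soup, the parser class would likely shrink to a call such as `soup.find_all("a", class_="video")`, and with lxml to a single XPath query. Either way the selectors are tied to the site's markup, so a change to the site means reworking the code.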
