
Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Research - runesoerensen
https://research.googleblog.com/2016/09/announcing-youtube-8m-large-and-diverse.html
======
JosephRedfern
If, for some reason, you wanted a list of all of the video IDs (I couldn't
easily find such a list), then I wrote a crappy scraper to pull them out:
[https://gist.github.com/JosephRedfern/d60bdc584d84b1451cc605...](https://gist.github.com/JosephRedfern/d60bdc584d84b1451cc6052e955b755c).

I can post a URL to the output once it's finished running, if it'd be of any
use to anyone. Oh, and be warned, there's a strong chance that it's buggy.
It's certainly not optimised (no threads).

EDIT: The script has now run. I've scraped ~10,000,000 Video IDs, but only
~5.5m of these IDs are unique, so there's probably a bug in my script
somewhere (but I need sleep now). Files containing IDs for various categories
are listed here:
[https://redfern.me/public/yt8m/](https://redfern.me/public/yt8m/), some notes
are here:
[https://redfern.me/public/yt8m/README.md](https://redfern.me/public/yt8m/README.md),
and a .tar.gz'd archive is available here:
[https://redfern.me/public/yt8m/yt8m-ids-probably-incomplete....](https://redfern.me/public/yt8m/yt8m-ids-probably-incomplete.tar.gz).
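
For anyone wanting to check how many scraped IDs are repeats, here is a minimal sketch of order-preserving deduplication. It is not taken from the scraper in the gist; `dedupe_ids` is a hypothetical helper and the IDs are made-up placeholders.

```python
from collections import Counter


def dedupe_ids(ids):
    """Return (unique IDs in first-seen order, Counter of IDs seen more than once)."""
    ids = list(ids)  # allow any iterable, since we walk it twice
    counts = Counter(ids)
    unique = list(dict.fromkeys(ids))  # dicts keep insertion order (Python 3.7+)
    repeated = Counter({vid: n for vid, n in counts.items() if n > 1})
    return unique, repeated


# Made-up placeholder IDs, not real YouTube video IDs.
scraped = ["vid_a", "vid_b", "vid_a", "vid_c", "vid_b", "vid_a"]
unique, repeated = dedupe_ids(scraped)
print(f"{len(scraped)} scraped, {len(unique)} unique")
print("most repeated:", repeated.most_common(2))
```

Looking at which IDs repeat most often may help tell a scraper bug apart from IDs that simply show up in more than one category file.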

~~~
garysieling
I'd love a list of IDs - I'm doing a research project that is a search engine
for lectures ([https://www.findlectures.com](https://www.findlectures.com))
and I'm interested to see if there is any overlap.

It seems like it'd be interesting to explore their tagging compared to what is
in video transcripts.
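
A rough sketch of how such an overlap check could look once both ID lists are in hand. `id_overlap` is a hypothetical helper, and the IDs below are made-up placeholders rather than real video IDs from either source.

```python
def id_overlap(ids_a, ids_b):
    """Return (shared IDs, fraction of the unique IDs in ids_a that also occur in ids_b)."""
    set_a, set_b = set(ids_a), set(ids_b)
    shared = set_a & set_b
    fraction = len(shared) / len(set_a) if set_a else 0.0
    return shared, fraction


# Made-up placeholder IDs standing in for two real ID lists.
lecture_ids = ["vid_a", "vid_b", "vid_c", "vid_d"]
yt8m_ids = ["vid_c", "vid_d", "vid_e"]
shared, fraction = id_overlap(lecture_ids, yt8m_ids)
print(sorted(shared), fraction)
```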

~~~
JosephRedfern
I've updated my original comment with some URLs.

~~~
garysieling
Awesome, thanks!

------
chirau
This is wonderful. Though I wish I could just specify the columns that I need
and download those, or limit the number of rows. 1.5 TB is quite a bit.
Regardless, this is wonderful.

Would I be violating any law or copyright if I formatted it and put it on my
server for that kind of consumption, or served it via JSON?

~~~
magicalist
On
[https://research.google.com/youtube8m/download.html](https://research.google.com/youtube8m/download.html)
it says:

> _The code and dataset are licensed by Google Inc. under license Apache 2.0._

------
iverjo
This is nice :) Kudos to the YouTube guys for releasing this. I'm a data
scientist in a startup where one of the things I do is create multi-label
models for classifying YouTube videos. My current model has 90% precision and
69% recall, while YouTube-8M has 78% precision and 14% recall, with respect
to the human raters. I guess one of the reasons is that my model only has
around 100 categories, while YouTube-8M has 4800. It's like comparing apples
with pears, but still interesting.
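
For readers unfamiliar with how precision and recall work in a multi-label setting, here is a minimal sketch. It assumes micro-averaging over per-video label sets; the thread does not say how either set of figures was actually computed, and the function name and labels are made up for illustration.

```python
def micro_precision_recall(predicted, actual):
    """Micro-averaged precision and recall over per-video label sets.

    predicted, actual: equal-length lists of sets, one set of labels per video.
    """
    true_pos = sum(len(p & a) for p, a in zip(predicted, actual))
    n_predicted = sum(len(p) for p in predicted)
    n_actual = sum(len(a) for a in actual)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    return precision, recall


# Made-up labels for three videos (model output vs. human raters).
predicted = [{"cat"}, {"car", "road"}, {"game"}]
actual = [{"cat", "pet"}, {"car"}, {"game", "minecraft", "video"}]
precision, recall = micro_precision_recall(predicted, actual)
print(precision, recall)  # 0.75 0.5
```

Here 3 of the 4 predicted labels are correct (precision 0.75), but only 3 of the 6 rater labels are found (recall 0.5). With thousands of possible labels per video there are many more labels to miss, which is one way recall can end up low while precision stays high.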

~~~
tiplus
Sounds interesting. Do you guys have a blog at mashtime? What kind of
hardware/software do you use for training? TensorFlow? On AWS or bare-metal
GPUs?

~~~
iverjo
We don't have a blog yet. We're using Azure for the hardware and mainly
scikit-learn for the training (we train only on metadata at the moment). Will
probably start using TensorFlow soon.

~~~
doozler
Would you say TensorFlow is a good way to get started with machine learning?

~~~
iverjo
I'd say scikit-learn is a better way to get started with machine learning.
Check this out, for example:
[https://www.youtube.com/watch?v=cKxRvEZd3Mw](https://www.youtube.com/watch?v=cKxRvEZd3Mw)

------
edent
I don't see anything about the rights of video owners? Have people
(inadvertently) licensed their content to be used in this way?

~~~
scott_karana
I wish they'd addressed that too.

I'd guess the reasoning is, because it's a list of _public_ URLs, there's no
expectation of privacy.

~~~
shmel
You are probably right. However, I wonder how many of those videos will be
deleted by their owners or by Google, or just blocked in "some" countries, a
year from now. They could have published it separately to avoid this.

------
tdaltonc
How good do labels need to be for you to be able to get good results on
something like this? There's a lot of data, so that's great, but the labels
seem a bit spotty.

------
lifeisstillgood
Oh man.

I am searching (thrashing) around for my next "big" project. I have been
thinking of drones measuring roof / building quality, and the CV/ML
requirements are fairly high. Getting my teeth stuck into these would really
give me a better feel for training my own system.

The problem is, how do I feed my family while taking the six months to do it
all?

~~~
timClicks
If you're serious about this concept, create a drone company that takes real
estate photos. That will give you hands-on experience with the regulations,
quality control issues, etc., while giving you time to build up your training
set.

~~~
lifeisstillgood
Hmmm ...

------
lolive
Did someone make an RDF dump of that? (Aligned with DBpedia ;)

