
Introducing the Open Images Dataset - hurrycane
https://research.googleblog.com/2016/09/introducing-open-images-dataset.html
======
imh
Lawyers are funny:

>Today, we introduce Open Images, a dataset consisting of ~9 million URLs ...
having a Creative Commons Attribution license* .

Then the footnote below:

>* While we tried to identify images that are licensed under a Creative
Commons Attribution license, we make no representations or warranties
regarding the license status of each image and you should verify the license
for each image yourself.

I think this might be the most blatant instance I've ever seen of, "We have to
write this even though it's essentially impossible for you to actually follow
our directions."

~~~
ktta
Lawyers _aren't_ so funny when they sue you for trivial things.

There are so many instances where people try to take advantage of things. This
is just protection. And it's not like they're providing nine images. It's nine
million.

~~~
imh
The funny part is the suggestion that you go through each and every one of the
9 million images and verify their licenses. It's basically impossible. It's
funny that, by suggesting we do the impossible, their butts are covered.

------
transcranial
Interesting that the base data consists of URLs. I guess it makes sense given
copyright issues. Anybody know what the ballpark expected half-life of such
URLs is?

~~~
joelthelion
I suppose that will simply leave the job of downloading the images and making
a torrent out of them to someone else less identifiable...

~~~
visarga
Or shared by sneakernet.

------
diyseguy
Any guesses on how large the resulting dataset would be if you actually
downloaded all the images? I imagine the URLs will get removed in a hurry as
everybody starts automating it.

~~~
krasin
(disclosure: I am one of the contributors to the dataset)

~1TB for 640x480 thumbnails, ~3TB for 1600x1200 thumbnails.

The originals are about 20TB, though.

------
devindotcom
First video, now images - wonder if speech and others are on the way?

It's nice that they're doing this; it helps advance the art, I think. But it
also puts a lot of smaller operations at universities sort of under the Google
system, in that their work is best compared against Google's ML work and
others using these datasets. It's a small way of stacking the deck to make
Google and DeepMind more embedded in the community.

That said, its utility for others surely outweighs the strategic advantage
gained here, so I for one welcome these libraries. A lot of work goes into
them. Hopefully others will release theirs as well.

~~~
imh
I _really_ want real-world speech data. I think speech-to-text could open the
doors to lots of creative tech to help hard of hearing folks, but the dataset
barrier is so massive :(

~~~
Houshalter
I wonder why semi-supervised learning hasn't taken off more. There is so much
unlabelled data out there, e.g. in podcasts, YouTube videos, and television. A
small amount of it is labelled, like captioned television programs and YouTube
videos. You could use those captions to train a weak model, which could then
provide labels for the unlabelled data.

You can correct many of its errors with simple language models. For instance,
the phrase "wreck a nice beach" has much lower probability than "recognize
speech". So if the model isn't sure which one it is, you can assume it's the
more probable one. Then train on that, and it will get even better at
recognizing those words.
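
Roughly the kind of correction I mean, as a toy sketch in Python (the corpus
and candidates here are made up; this isn't any real recognizer):

    # Toy version of the rescoring step: the weak acoustic model can't decide
    # between two transcripts, so a unigram language model built from plain
    # text picks the more probable one, which then becomes the training label.
    from collections import Counter
    import math

    corpus = ("speech recognition systems recognize speech from audio "
              "to recognize speech you need data to recognize speech").split()
    counts = Counter(corpus)
    total = sum(counts.values())

    def log_prob(sentence):
        # add-one smoothed unigram log-probability of a candidate transcript
        return sum(math.log((counts[w] + 1) / (total + len(counts)))
                   for w in sentence.split())

    candidates = ["recognize speech", "wreck a nice beach"]
    pseudo_label = max(candidates, key=log_prob)
    print(pseudo_label)  # "recognize speech" wins, so it becomes the label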

~~~
rm999
>You can correct many of its errors with simple language models. For instance,
the phrase "wreck a nice beach" has much lower probability than "recognize
speech". So if the model isn't sure which one it is, you can assume it's the
more probable one. Then train on that, and it will get even better at
recognizing those words.

Why not just build that heuristic into your initial supervised "weak" model?
Training on data labeled by a model introduces no new information, so you're
not gaining anything there.

~~~
Houshalter
It does introduce new information. This is how semi-supervised learning works.
Many words may be missed by the weak model, but can be inferred correctly from
their context. Then you have new labels to train the weak model on, to make it
better at those words.

The way I'm describing it is probably not the optimal way to do it. I don't
know if there is a better way. But the point is it must be possible to take
advantage of the vast quantities of unlabelled data we have. Human brains
somehow do something similar.

Semi-supervised learning is really cool. I saw one successful example where
they labelled just a few emails as spam and not spam. Then they used their
weak classifier to label thousands of unclassified emails, and then used
those as training data for an even stronger model. It actually works:
[http://matpalm.com/semi_supervised_naive_bayes/semi_supervis...](http://matpalm.com/semi_supervised_naive_bayes/semi_supervised_bayes.html)
[https://en.wikipedia.org/wiki/Semi-supervised_learning](https://en.wikipedia.org/wiki/Semi-supervised_learning)
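
For the curious, that loop looks something like this with scikit-learn (a toy
sketch on made-up emails, not the implementation from the link above):

    # Self-training: fit a weak Naive Bayes on a handful of labelled emails,
    # pseudo-label the unlabelled pool where the model is confident, then
    # retrain on the combined set.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    labelled = ["win free money now", "cheap pills offer",             # spam
                "meeting agenda for tomorrow", "lunch with the team"]  # ham
    labels = np.array([1, 1, 0, 0])
    unlabelled = ["free offer win big money", "project meeting moved to friday",
                  "cheap money pills", "agenda for the team lunch"]

    vec = CountVectorizer()
    X_lab = vec.fit_transform(labelled)
    X_unl = vec.transform(unlabelled)

    weak = MultinomialNB().fit(X_lab, labels)              # 1. weak model
    pred = weak.predict(X_unl)                             # 2. pseudo-labels
    confident = weak.predict_proba(X_unl).max(axis=1) > 0.8

    X_all = np.vstack([X_lab.toarray(), X_unl.toarray()[confident]])
    y_all = np.concatenate([labels, pred[confident]])
    stronger = MultinomialNB().fit(X_all, y_all)           # 3. retrain

    print(stronger.predict(vec.transform(["free pills win"])))  # [1] -> spam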

------
zappo2938
I'm glad I'm getting a return on all the effort clicking street signs and
store fronts on reCaptcha.

------
pilooch
I've put an efficient downloader here for the interested crowd:
[https://github.com/beniz/openimages_downloader](https://github.com/beniz/openimages_downloader)
It's a fork of the script I used to grab ImageNet.
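
Not the script above, but the general shape of bulk-downloading from a URL
list is roughly this ("urls.txt" is just a placeholder for however the dataset
ships its URLs):

    # Minimal sketch: fetch images from a list of URLs, one per line, in
    # parallel. Dead links (and there will be plenty) are simply skipped.
    import concurrent.futures, os, urllib.request

    def fetch(url, out_dir="images"):
        os.makedirs(out_dir, exist_ok=True)
        name = os.path.join(out_dir, url.rsplit("/", 1)[-1] or "index.jpg")
        try:
            urllib.request.urlretrieve(url, name)
        except Exception as e:
            print("skip", url, e)

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        pool.map(fetch, urls)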

------
dharma1
Is there a link to the trained model somewhere?

~~~
goddamnsteve
I've been searching for links to the trained models as well. Maybe it's too
early for now.

~~~
krasin
Stay tuned:
[https://github.com/openimages/dataset/issues/3](https://github.com/openimages/dataset/issues/3)

------
rocky1138
Are there any other libraries that are similar?

------
Omnipresent
Looking forward to someone trying a TensorFlow CNN on this.

