
100M Creative Commons Flickr Images for Research - kneth
http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images-for
======
GrantS
One of the most incredible parts is that they've already run feature detection
on all 100M images/videos and extracted 50TB of:

"SIFT, GIST, Auto Color Correlogram, Gabor Features, CEDD, Color Layout, Edge
Histogram, FCTH, Fuzzy Opponent Histogram, Joint Histogram, Kaldi Features,
MFCC, SACC_Pitch, and Tonality"

The good part about this for researchers is not only that this saves dozens of
CPU-years of computation (back of the envelope, it would take 15 years for my
laptop to extract those SIFT features alone), but that any differences in
learning/recognition performance on the dataset can be attributed to the
algorithms in question, uncomplicated by which researcher engineered the best
features for the dataset. On the other hand, it's a challenging dataset to
work with because you can't just download it and process it locally as has
been traditionally done. I'll be interested to see how many take advantage of
it.
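
As a rough sanity check on the back-of-the-envelope figure above, here is a minimal sketch of the arithmetic. It assumes one machine processing items one at a time and treats all 100M items as images. The function names are hypothetical and the inputs are the estimates quoted in the comment, not measurements.

```python
# Back-of-the-envelope runtime arithmetic (illustrative assumptions only).
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap years


def per_item_seconds(total_years: float, n_items: int) -> float:
    """Average seconds per item implied by a total runtime estimate."""
    return total_years * SECONDS_PER_YEAR / n_items


def total_years(seconds_per_item: float, n_items: int, n_workers: int = 1) -> float:
    """Wall-clock years needed if the work splits evenly across n_workers."""
    return seconds_per_item * n_items / n_workers / SECONDS_PER_YEAR


if __name__ == "__main__":
    secs = per_item_seconds(15, 100_000_000)
    print(f"{secs:.2f} s per image on one machine")
    print(f"{total_years(secs, 100_000_000, n_workers=100):.2f} years on 100 machines")
```

Under those assumptions, 15 years over 100M items works out to a little under 5 seconds per image, which is the per-image cost the estimate implicitly assumes.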

~~~
kastnerkyle
Why the heck would you do MFCC on images? Mel filters try to replicate the
perception of human ears on audio. This looks like buzzword soup to me
(SACC_Pitch, Tonality? What the heck?!? These seem like audio features - where
are the formulas!).

I also don't know about your other conclusion, there is no reason you couldn't
download this dataset given enough time/bandwidth/storage to process locally.
Most people who will work on this could reasonably store a large chunk
locally, if not all (~10 TB). This also assumes that you can't reduce/compress
the info any further than what flickr provides and that you require access to
the entire dataset - if any of the images are 1024x1024 or larger most feature
extractions do not need that kind of fidelity. Heck, you could probably make
use of grayscale only to reduce the size by a factor of 3 - ~ 17 TB is
feasible (though still pretty insane) to store locally.

ImageNet (~1.2 TB) only took me 45 days on a residential (<20 MB) connection,
and I would assume that this dataset would be downloaded by entities with much
higher download b/w. I would also assume that many algorithms, like the type
that attack CIFAR10 et al., would also be willing to reduce the
dimensionality and recompress, further reducing storage overhead. How big is
each image?
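
To put the download numbers above in perspective, here is a minimal sketch that turns "1.2 TB in 45 days" into a sustained rate and extrapolates to the larger sizes mentioned in this thread. It assumes decimal units (1 TB = 10^12 bytes) and a constant rate. The function names are hypothetical, and the sizes are the thread's own rough figures.

```python
# Download-time arithmetic (illustrative assumptions: decimal TB, constant rate).
BITS_PER_TB = 8 * 10**12
SECONDS_PER_DAY = 86400


def sustained_rate_mbit(total_tb: float, days: float) -> float:
    """Average throughput in Mbit/s needed to move total_tb in the given days."""
    return total_tb * BITS_PER_TB / (days * SECONDS_PER_DAY) / 1e6


def days_to_download(total_tb: float, rate_mbit: float) -> float:
    """Days needed to move total_tb at a sustained rate in Mbit/s."""
    return total_tb * BITS_PER_TB / (rate_mbit * 1e6) / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = sustained_rate_mbit(1.2, 45)
    print(f"effective rate: {rate:.2f} Mbit/s")
    for size_tb in (10, 17, 50):
        print(f"{size_tb} TB at that rate: {days_to_download(size_tb, rate):.0f} days")
```

At that effective rate (roughly 2.5 Mbit/s), 50 TB would take about 1875 days, so the argument depends on the downloader having far more bandwidth than a residential line.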

Also, where are the hyperparameters they used to calculate all of these
features? Extracted features aren't really that useful without
context/reproducibility.

All that said, I think most of these features are decent and the dataset is
amazing, but I would rather see them release the raw data set and its
PCA/ZCA/other transform - maybe Gabor filtered etc. as well. Lower level
preprocessing is more useful for doing representation learning IMO - these
higher level features are not that useful for ML algorithm developers. SIFT is
patented for heaven's sake! How are we supposed to build algorithms on top of
things like that...

I am excited about the dataset but feel that there could be more done to truly
enable researchers. This feels like a "look how much data we have/look how
awesome and used flickr is" thing to me.

~~~
DanBC
> Why the heck would you do MFCC on images? Mel filters try to replicate the
> perception of human ears on audio.

First page of Google hits.

MFCC Based Face Identification
[http://www.img.cs.titech.ac.jp/~akbari/pmwiki/uploads/Site/S...](http://www.img.cs.titech.ac.jp/~akbari/pmwiki/uploads/Site/Sangeeta-rep.pdf)

Identification of satellite images based on mel frequency cepstral
coefficients
[http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=538327...](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5383270&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5383270)

~~~
kastnerkyle
Frankly, both papers you linked give no justification for using the mel
filterbank instead of any other filtering/frequency reduction before doing the
IDCT to get the cepstrum. This is what I mean! Why take a filterbank designed
for audio and apply it to images except to use the buzzwords? In any case, at
least I know it is used by researchers (even if I don't agree with the why).
Thanks!
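
For context on why the mel filterbank is tied to hearing, here is a minimal sketch of one widely used formulation of the mel scale, m = 2595 * log10(1 + f / 700), and of how filter centers are spaced evenly in mels rather than in Hz. The function names, the filter count, and the 0 to 8000 Hz range are assumptions for illustration. Nothing here is taken from either linked paper or from the dataset's (unpublished) extraction settings.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map frequency in Hz to mels (one common formulation of the scale)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min_hz: float, f_max_hz: float) -> list:
    """Center frequencies (Hz) of n_filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


if __name__ == "__main__":
    # Assumed example range: 10 filters between 0 Hz and 8000 Hz.
    for f in mel_center_frequencies(10, 0.0, 8000.0):
        print(f"{f:8.1f} Hz")
```

The centers crowd together at low frequencies and spread out at high ones, mirroring pitch perception. That spacing has an auditory motivation but no obvious visual one, which is the objection being raised here.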

------
clickok
It seems like Yahoo is a little bit worried about possible exploitation. From
the Terms of Use:

 _2.3. You may derive and publish summaries, analyses and interpretations of
the Data, but only in a manner where it is impossible to reconstruct the Data
from the publication. Small excerpts of the Data may be displayed to others or
published in a scientific or technical context, solely for the purpose of
describing your research and related issues and not for any commercial or
anti-competitive purpose. Unless Yahoo! expressly requests no attribution, all
publications resulting from research carried out using the Data must display
an attribution to Yahoo!. This attribution must reference &quot;Yahoo!
Webscope,” the web address
[http://webscope.sandbox.yahoo.com](http://webscope.sandbox.yahoo.com), and
the name of the specific dataset used, including version number, if
applicable. This attribution should preferably appear among the bibliographic
citations in the publication. If Yahoo! expressly requests no attribution, you
agree not to mention Yahoo! in connection with the Data. Yahoo! invites you to
provide a copy your publication to Yahoo!._

This[0] seems fairly restrictive, considering that I can just crawl flickr and
get all that data and more, were I so inclined. Also kinda interesting, in
this passage and the rest of the TOU: they repeatedly use `&quot;`
interchangeably with actual quotation marks ("), suggesting that _nobody at
Yahoo has proofread their own live TOU_. Still, the dataset seems really cool.

[0] ...and other parts of the agreement, but I don't want to spoil it for you,
nor post its entirety as a comment.

~~~
kastnerkyle
If you can crawl flickr and get 50TB of data, do it.... it is more than a
"were I so inclined" situation. I have had a very hard time crawling and
indexing large datasets like this - companies tend to protect their data!

------
spingsprong
"Yahoo is hosting a contest to build the system best capable of identifying
where a photo or video was taken without using geographic coordinates."

Does this strike anyone else as being a bad idea?

~~~
fancy_pantser
I can't think of any general reasons this is bad, just very narrow cases on
the individual level. What are your fears?

~~~
spingsprong
Because of the inevitable photolocationfinder dot com that will immediately
come into existence if they ever succeed.

Then everyone who hates someone or likes someone way too much, will only need
a couple of photos from twitter or wherever, to know where to look for that
person in real life.

~~~
Houshalter
It's not likely that a computer vision system is going to be that much better
than humans at the task. Maybe it will be able to guess your latitude by the
color of the sky or something crazy like that, but not give you an exact
address.

~~~
kastnerkyle
It is not unreasonable to believe that an algorithm could key in on
architectural peculiarities of a given region. On top of that, if there are
any people who _are_ in the photo who share their address on facebook,
twitter, foursquare et al., it is game over.

~~~
Houshalter
I didn't consider that. Reminds me of _What Makes Paris Look Like Paris_:
[http://graphics.cs.cmu.edu/projects/whatMakesParis/](http://graphics.cs.cmu.edu/projects/whatMakesParis/)

------
cclogg
"From the old world of unprocessed rolls of C-41 sitting in a fridge 20 years
ago"

Hey I still do that! :(

I wonder if my (or anyone's) film photos on Flickr are completely useless
metadata-wise. Because they are all scanned so they just say "NORITSU KOKI EZ
Controller". There seems to be a large portion of people (on Flickr) shooting
film still but I wonder if it's only a small percentage overall.

------
jitendraag
Just when I was happy using Flickr's API for creative commons image search -
[http://www.outreachpanel.com/free-images/](http://www.outreachpanel.com/free-images/)

They gave me this huge data to play with :)

In the past, I have had issues with CC images that were also tagged with 'getty'.
I hope they have taken care of that issue.

------
chatman
No access for non-university-based researchers. Useless for me.

------
liminal
The data is only available to university researchers.

------
raphar
Isn't this on torrent yet?

