

Image Search Breakthrough: 0.98+ Precise Images of 500,000+ topics - PaulHoule
http://rdf.ookaboo.com/

======
dangrossman
In TechCrunch-style blogs, every post has a large image at top (often just a
stock photo) that's also used to represent the post as a thumbnail in post
lists.

You could automate providing these images for new posts by mashing up this
annotated database of CC-licensed images with the OpenCalais API. OpenCalais
analyzes text to extract named entities (companies, technologies, people, even
social tag recommendations) -- so match those terms with the image database
until you get a hit and you've got an image with no work done by the author.
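
A minimal sketch of that matching step, assuming the entities have already been extracted and the image database has been loaded into a plain dictionary keyed by topic label. The `IMAGE_INDEX` contents, the `pick_image` helper, and the example.org URLs are all hypothetical stand-ins, not part of Ookaboo or OpenCalais:

```python
from typing import Optional

# Hypothetical stand-in for the annotated image database: topic label -> image URL.
IMAGE_INDEX = {
    "pablo picasso": "https://example.org/images/picasso.jpg",
    "php": "https://example.org/images/php.png",
}


def pick_image(entities: list[str], index: dict[str, str]) -> Optional[str]:
    """Return the image for the first entity that has a match, else None."""
    for entity in entities:
        key = entity.strip().lower()  # normalize so "PHP" matches "php"
        if key in index:
            return index[key]
    return None


# Entities as an extractor might return them, most relevant first (made-up values).
entities = ["Acme Widgets", "PHP", "Pablo Picasso"]
print(pick_image(entities, IMAGE_INDEX))
```

Trying the entities in relevance order means the first hit is also the best available one; a real version would need fuzzier matching than an exact lowercase lookup.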

Maybe something I'll play with later tonight.

I wrote a PHP wrapper for the OpenCalais API some years ago.
<https://github.com/dangrossman/PHP-OpenCalais>

~~~
cakeface
For an example of how this is done with Google Images, check out this project:
<http://logolifter.com/>

~~~
dangrossman
Am I missing something or is this site specifically designed to enable
copyright infringement? I don't think the little disclaimer saves it.
Contributory infringement is illegal in the US, and this tool lifts images
from Google Image Search, 99.9% of which are not licensed for distribution. If
this is your site, I'd consider replacing the image source with something like
Flickr, whose API supports search-by-license.
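
As a rough sketch of what search-by-license looks like: the `flickr.photos.search` method takes a `license` parameter with comma-separated license IDs. The snippet below only builds the request URL. The `flickr_search_url` helper is made up for illustration, and the IDs 4 and 5 are assumed here to be the attribution and attribution-share-alike licenses; check them against `flickr.photos.licenses.getInfo` before relying on them.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def flickr_search_url(api_key: str, text: str, license_ids: list[int]) -> str:
    """Build a flickr.photos.search URL restricted to the given license IDs."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": text,
        # Comma-separated license IDs; which ID means which license is an
        # assumption here and should be verified against the API.
        "license": ",".join(str(i) for i in license_ids),
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder; a real key is needed to send the request.
print(flickr_search_url("YOUR_API_KEY", "company logo", [4, 5]))
```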

~~~
PaulHoule
Google Image Search indexes Ookaboo, not the other way around.

All of the images were published under a public domain or Creative Commons
license elsewhere; there is a small probability that this information is
incorrect, in which case we remove images as per the DMCA.

~~~
dangrossman
Hi Paul. I was referring to LogoLifter, the comment I replied to, not Ookaboo.
LogoLifter's images are from Google Image Search, not from Ookaboo.

------
nl
Please fix the title ("Image Search Breakthrough: 0.98+ Precise Images of
500,000+ topics").

Unless I'm misreading this radically, this is a learning dataset of 1,000,000
images, labelled with a precision of 98%.

This _isn't_ a breakthrough in image search. It is helpful for anyone working
on image search, but appears to have been manually labelled.

~~~
fauigerzigerk
You are right, the title is grossly misleading, but I don't think this was
manually labelled. They have this to say about the association of pictures and
topics:

 _Ookaboo attains high precision within it's point-of-view by having a very
permissive definition of "depicts," which means roughly "this image could
appear in an encyclopedia entry for..." For example, in the case of
dbpedia:Pablo_Picasso, we'd accept a portrait of the artist, images of his
works, or a house that he lived in._

This leads me to believe that the topic associations were derived
automatically based on information found on the page in which the picture was
embedded.

~~~
nl
Yes.

To be clear, I suspect a large number of the images come from
Wikipedia/Wikimedia, and are manually categorised there. This software scraped
the images and the categories.

------
ique
So is this meant for actually getting images or as a learning set for ML?

------
slig
Related: you can grab 100k image files with labels[1] from the old ESP
game[2].

[1] <http://www.cs.cmu.edu/~biglou/resources/>

[2] <http://en.wikipedia.org/wiki/ESP_Game>

------
julian37
This looks quite impressive and useful. Could someone explain what the
precision metric refers to? The documentation states:

 _Ookaboo attains high precision within it's point-of-view by having a very
permissive definition of "depicts," which means roughly "this image could
appear in an encyclopedia entry for..." For example, in the case of
dbpedia:Pablo_Picasso, we'd accept a portrait of the artist, images of his
works, or a house that he lived in._

But that doesn't really explain how the number 0.98 was arrived at.

EDIT: made this into a straightforward question rather than speculation.

~~~
loboman
1000000 * 0.02 = 20k

This implies there are 20k wrong images in the dataset; sampling can find some
of them.

~~~
mb22
Doesn't it mean there _could_ be 20k wrong images?

~~~
loboman
Yes, you are right. 20k is an approximate number.
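
One way a figure like 0.98 could be produced is by hand-checking a random sample and putting an interval around the result. This is a sketch under that assumption, not a description of how Ookaboo actually measured it. The sample size and the count of correct images below are invented, and the interval used is the Wilson score interval for a binomial proportion.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


total_images = 1_000_000
sample_size = 500   # hypothetical: images checked by hand
correct = 490       # hypothetical: images judged to depict their topic

low, high = wilson_interval(correct, sample_size)
print(f"estimated precision: {correct / sample_size:.3f}")
print(f"interval: {low:.3f} to {high:.3f}")
# Scale the error rate at each end of the interval up to the whole dataset.
print(f"implied wrong images: {total_images * (1 - high):,.0f} "
      f"to {total_images * (1 - low):,.0f}")
```

The point estimate times the dataset size gives the 20k figure; the interval is why it is only approximate.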

------
mark_l_watson
This looks very useful. One of my customers needs labeled images that can be
used with attribution. Thanks for posting this.

