
Show HN: CC Search – search engine for 300M CC-licensed images - kgodey
https://search.creativecommons.org/
======
kgodey
I'm Director of Engineering at Creative Commons and part of the team that is
working on CC Search.

We've been working on the product for over a year and we are just out of beta
today! One of CC's goals is to encourage the use and remixing of CC-licensed
content, and we hope that CC Search will help make that content more
discoverable. The current version is very much an MVP and only searches
images, but we plan to add more content types in the future and index the ~1.4
billion works out there under a CC license. We would love any feedback you
might have.

Also, CC Search, the associated API, and the scripts we use to index data are
all open-source and developed completely openly. Our sprints and roadmap are
public and we welcome contributions from the community.

 _Relevant links:_

CC Search code: [https://github.com/creativecommons/cccatalog-frontend/](https://github.com/creativecommons/cccatalog-frontend/)

CC Catalog API code: [https://github.com/creativecommons/cccatalog-api/](https://github.com/creativecommons/cccatalog-api/)

CC Catalog code:
[https://github.com/creativecommons/cccatalog/](https://github.com/creativecommons/cccatalog/)

2019 Vision: [https://creativecommons.org/2019/03/19/cc-search/](https://creativecommons.org/2019/03/19/cc-search/)

Roadmap:
[https://docs.google.com/document/d/19yH2V5K4nzWgEXaZhkzD1egz...](https://docs.google.com/document/d/19yH2V5K4nzWgEXaZhkzD1egzrRayyDdxlzxZOTCm_pc/)

Active sprint:
[https://github.com/orgs/creativecommons/projects/7](https://github.com/orgs/creativecommons/projects/7)

Backlog:
[https://github.com/orgs/creativecommons/projects/10](https://github.com/orgs/creativecommons/projects/10)

How to contribute to CC projects:
[https://creativecommons.github.io/contributing-code/](https://creativecommons.github.io/contributing-code/)

~~~
detaro
If I put images on the internet, how can I tell your search about them/about
the license?

~~~
kgodey
We plan to implement an API for users to submit CC-licensed works that they'd
like to add to our database later this year. Once we have that, you can call
our API when you publish a new CC-licensed work and we'll automatically add it
to our catalog. We have to work out some details like how we can verify that
the content is actually CC-licensed.

Encouraging use of ccREL embedding like aldenpage mentioned in another
response is also in our long-term vision since that would require no
additional actions from the creators.
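
For readers unfamiliar with ccREL: one common form of it is an ordinary link marked `rel="license"` that points at the license deed. The snippet below is only a rough sketch of how a crawler might pick up that marker from a page, using Python's standard `html.parser`. It is not CC's indexing code, and `LicenseLinkParser`, `find_license_urls`, and the sample page are made-up names and data for illustration.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values from <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, e.g. rel="license noopener"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def find_license_urls(html):
    """Return every license URL declared in the given HTML string."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.license_urls


# Hypothetical page fragment, for illustration only.
sample_page = """
<p>Photo by Example Author, licensed under
<a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.</p>
"""
print(find_license_urls(sample_page))
```

A declared license is still only the publisher's claim, so this does not solve the verification problem discussed elsewhere in the thread.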

~~~
antpls
You could have the same (only existing?) process as other content platforms
like YouTube, Github, etc. In practice, most people are fair and post the
content with the right license, but anyone can send a DMCA request.

An annual transparency report is probably mandatory with that process, like
these ones :

[https://github.blog/2019-01-23-2018-transparency-report/](https://github.blog/2019-01-23-2018-transparency-report/)

[https://transparency.facebook.com](https://transparency.facebook.com)

------
scanny
Out of curiosity, are there any checks run on these images to see if they
really are CC?

This search result for 'satellites' is a screenshot of a google maps result,
which includes the copyright information for the imagery within the CC image
[see the bottom right attribution text in the imagery].

How can that really be in CC?

[https://search.creativecommons.org/photos/3f0eddb0-55b4-46e0...](https://search.creativecommons.org/photos/3f0eddb0-55b4-46e0-ba91-4be614e71520)

How could we be sure that the images themselves are genuinely licensed rather
than someone copying them from another source and slapping on the licence?

~~~
kgodey
We do our best to verify the license but cannot 100% guarantee that the image
is CC-licensed. We have custom scripts that we use to ingest content from each
content provider. For Flickr, we use their API which contains license
information for each work. If you click through to the source image, you can
see that the user who uploaded the image did license it under CC BY-SA 2.0.
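
To make "license information for each work" a bit more concrete, here is a rough sketch of translating a numeric license id on a photo record into a CC license. This is not our actual ingestion script. The id table is an assumption written down from memory of Flickr's license listing and should be checked against Flickr's own API before being relied on; `cc_license_for` and the sample record are hypothetical.

```python
# ASSUMPTION: this id-to-license table is reproduced from memory of Flickr's
# license listing. Verify it against the Flickr API before using it.
FLICKR_LICENSE_IDS = {
    "1": ("by-nc-sa", "2.0"),
    "2": ("by-nc", "2.0"),
    "3": ("by-nc-nd", "2.0"),
    "4": ("by", "2.0"),
    "5": ("by-sa", "2.0"),
    "6": ("by-nd", "2.0"),
}


def cc_license_for(photo_record):
    """Return (license, version) for a photo record, or None if it is not CC."""
    license_id = str(photo_record.get("license", ""))
    return FLICKR_LICENSE_IDS.get(license_id)


# Hypothetical record shaped like an API response item, for illustration only.
record = {"id": "example-photo", "license": "5"}
print(cc_license_for(record))
```

The point is that the license comes from whatever the uploader selected, which is why a wrongly licensed upload passes straight through.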

We don't yet have a way of dealing with content that the user licensed
incorrectly. I think we could add a "report image" function that would help us
identify and remove this type of content. There's a disclaimer if you scroll
down that says "Verify at the source: Flickr" with an explanation of why we
can't guarantee the license. Maybe we should make that more prominent.

~~~
adrianN
Maybe the Flickr user did buy a license from Google for that particular
screenshot that allowed them to publish it as CC BY-SA. How would you verify
this? What's the legal situation here? Can Google sue me when I use that
image, or do they have to sue the Flickr user?

~~~
luckylion
Google can sue you, you can in turn probably sue whoever "gave" you the image
and said it was licensed CC BY-SA. Just like you can't keep a stolen item that
somebody sold you for cash, you're not free from liability when you're
publishing an image you didn't have rights to. You'll always need to apply
judgement before publishing, and I'd do a reverse image search at the very
least.

~~~
pbhjpbhj
But, in general, if you inadvertently buy an item that was stolen (on eBay,
say), then you wouldn't be prosecuted for it; you would just lose ownership.

In the parallel situation here the 'selling' company [CC Search] is saying
"this isn't stolen", and you're using it in good faith. Seems like your
liability _should_ be zero. If CC Search also did due diligence then their
liability should probably be zero too.

Media mega-corps probably see things differently so I imagine their copyright
laws mightn't be as liberal.

~~~
luckylion
Yeah, it's a civil issue, I don't think anybody would try to jail somebody.
You will be in breach of license, so the rights holder won't care why you're
violating copyright, only that you do. I don't know of any judicial system
where "good faith" is applied to civil suits (i.e. you eating my food thinking
it was yours gets you out of buying me a new plate).

Similar to the hypothetical ebay case: you're going to be asked to reimburse
the owner (or return his property), you collecting from the seller is a
different story.

~~~
mojomark
Would be interesting if moving forward, when a user wants to take an image
they know they will likely share as CC, they could tap a button to turn on 'CC
tagging' such that the camera app applies a digital signature to the image.
Perhaps then, at least some image sources could be validated as being
intentionally released under CC by operator of the image source at the time.

------
Sommer
Definitely appreciate the programmatic access, but in terms of straight search
results it's very hard to improve on Google Image Search with the Usage Rights
filter. Info:
[https://support.google.com/websearch/answer/29508?p=ws_image...](https://support.google.com/websearch/answer/29508?p=ws_images_usagerights&hl=en)

~~~
kgodey
Thanks for the feedback! A big chunk of our upcoming work is going to be
towards improving search relevance. We also plan to add content types other
than images this year, starting with open textbooks and audio, so that will
differentiate us.

~~~
Sommer
That would be fantastic. I'd also like to +1 the reverse image search support:
finding CC images visually similar to a restricted-license image would be a
really cool feature that Google does not seem to support.

------
rixrax
Thank you for doing this. I hope one day we will see a similar search for CC
licensed music where you can search based on license type, and e.g. music
style, beats per minute, length, audio format,...

I was recently trying to find music for a side project, and searching for CC
licensed music appears to have gone from somewhat onerous to very onerous on
sites like Jamendo.com, Free Music Archive, etc. This is off topic, but if
someone can point to a website with CC licensed music where it’s possible to
search based on license type (e.g. BY-NC-SA) I would love to know.

~~~
kgodey
Thanks for your feedback! We plan to add audio to CC Search later this year
(probably in Q4). We will have searching by license type built in from the
beginning and those all sound like great filters to add.

------
dang
A related blog post is [https://creativecommons.org/2019/04/30/cc-search-images/](https://creativecommons.org/2019/04/30/cc-search-images/).

(Via
[https://news.ycombinator.com/item?id=19791801](https://news.ycombinator.com/item?id=19791801),
which we merged into this thread.)

------
tomcam
Remember that for commercial use you still need a model release in the USA
when faces are identifiable. It is totally separate from the licensing of the
image.

~~~
pbhjpbhj
What's the relevant USC for that please?

~~~
tomcam
Don’t know what you mean by USC

------
not2b
I arbitrarily searched for "hula hoop" and most of the top images have people
in them. People using this facility to find images to freely use would be
advised to avoid images of recognizable people, because you don't know if they
signed model releases or consent to being a part of your web site.

~~~
kgodey
Thanks for the feedback! We will figure out if there is a way to communicate
this better.

------
z92
Excellent job. Thanks for all the hard work.

A few usability issues: 1/ It doesn't work without JavaScript, and 2/ at screen
resolution 800x600 the search field gets completely hidden, as shown below.

[https://imgur.com/a/bPkdOoT](https://imgur.com/a/bPkdOoT)

~~~
kgodey
Thanks for your feedback!

------
pbhjpbhj
It's interesting to me how many images are lost when you filter by "right to
modify". I wonder if "CC-SA" should be the default licence, ie should be what
"CC" means.

Similarly if the default CC license were "NC" then I imagine many more shared
images would be excluded from commercial use.

My suggestion is that people probably wouldn't mind modification of their CC
images as a default.

I usually use CC-BY-SA.

Edit: looking afresh at the CC license material it appears my understanding of
the licenses is weak (e.g. see down-thread), and that the default does allow
modification. Which makes it weirder that people would go out of their way to
specify that their pretty poor quality images could only be used as-is and not
modified.

~~~
puranjay
I've always wondered: what does "modification" include? Is resizing
modification? Cropping? If I slap some text on the raw image, would that
count as modification as well?

~~~
pbhjpbhj
[https://creativecommons.org/faq/#when-is-my-use-considered-a...](https://creativecommons.org/faq/#when-is-my-use-considered-an-adaptation)
might help somewhat.

------
empressplay
There are _a lot_ of CC licensed images coming up through Flickr search that
are nowhere to be found in CC Search and since the vast majority of CC
Search's images come from Flickr, you would be better off just searching on
Flickr...

~~~
kgodey
Thanks for the feedback! We are working with Flickr to ensure that we are able
to index all their CC-licensed images; currently there's an issue with their
API that hides some images. It should be resolved by the end of the summer.

We have images from a lot of other collections, such as the Met, Rijksmuseum,
Behance, Thingiverse etc. Flickr has more images than any of them by a couple
of orders of magnitude, though.

------
exabrial
Definitely my favorite:
[https://search.creativecommons.org/photos/9988a06f-c7d1-466d...](https://search.creativecommons.org/photos/9988a06f-c7d1-466d-bd73-ad6817da075b)

Thank you for this.

~~~
djsumdog
creepy

~~~
nogbit
Even more creepy are the related images for that link. None of them are related
in any way, shape or form to it or each other.

~~~
jtbayly
Actually they are. They are all "awkward."

------
aurelwu
Nice, I have been mostly using Bing as a search engine for public domain // CC0
images and it worked reasonably well, but a search engine specialized in this
area is very useful. What I immediately noticed is that
when you search for something including "icon" it shows lots of images of
religious icons, which makes sense in a way but is probably not what most people
want if they search for "fireball icon" or "mail icon" or "car icon".

~~~
kgodey
Thanks for the feedback! A big chunk of our upcoming work will focus on
improving search results.

------
Amorymeltzer
Pretty neat! Another great resource is the 53 million or so files on Wikimedia
Commons —
[https://commons.wikimedia.org/wiki/](https://commons.wikimedia.org/wiki/) —
which don't appear to be searched by this. Commons has plenty of material that
is PD, so I suppose that is technically out of scope, but it also restricts
licenses to CC-BY or CC-BY-SA whereas some of these are NC.

~~~
kgodey
Thank you! Wikimedia Commons is our next big priority. We do also index PD
content.

------
sbr464
Is there a search API available or would it be against the terms to access the
site programmatically?

~~~
kgodey
We do have an API available, although we're not encouraging use of it until we
build a few more features in.

Docs:
[https://api.creativecommons.engineering/](https://api.creativecommons.engineering/)

Terms of Service:
[https://api.creativecommons.engineering/terms_of_service.htm...](https://api.creativecommons.engineering/terms_of_service.html)

Code: [https://github.com/creativecommons/cccatalog-api/](https://github.com/creativecommons/cccatalog-api/)

~~~
sbr464
Thanks for the info.

------
cannedslime
It is always nice to have search engines for free stuff. I have minor gripes
with the engine.

\- Browser navigation doesn't work. Going back doesn't take you to your
previous search query.

\- Search seems to be solely keyword based? And keywords are kind of hit and
miss on many images.

~~~
kgodey
Thanks for the feedback, we'll add browser navigation to our list of things to
fix. Our search is still pretty naive; a big chunk of our upcoming work will
be focused on improving results.

------
DOsinga
This is cool and it is great that the software is all open source, but
randomly clicking around it does seem that an awful lot of images all come
from Flickr, with the images even being hosted by Flickr. It seems to me the
interface could make the source more clear.

~~~
kgodey
Thanks for your feedback! We don't host any of the images, we link people
directly to the source. Flickr does have orders of magnitude more images than
other providers. We will add making the source more prominent to our list of
issues to fix.

------
200_OK
This is great. Is this all CC0 or do some of the images have conditions?

My wishlist would also include videos and music/audio and an API to
access it. I'm sure that's a big ask though.

~~~
kgodey
Not all content is CC0; we index content shared under all CC licenses and
public domain. You can filter by license or usage rights.
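
As a rough illustration of how license names relate to usage rights: the "NC" term rules out commercial use and the "ND" term rules out sharing adaptations, while CC0 and the Public Domain Mark carry neither restriction. The sketch below is not our filter code; `usage_rights` and the short license codes it accepts (such as "by-nc-sa" or "pdm") are hypothetical conventions chosen for the example.

```python
def usage_rights(license_code):
    """Map a short license code such as "by-nc-sa" to coarse usage rights.

    "modification" here means permission to share adapted versions.
    """
    code = license_code.strip().lower()
    # CC0 and the Public Domain Mark impose neither restriction.
    if code in ("cc0", "pdm"):
        return {"commercial_use": True, "modification": True}
    terms = code.split("-")
    return {
        "commercial_use": "nc" not in terms,  # NC = NonCommercial
        "modification": "nd" not in terms,    # ND = NoDerivatives
    }


for example in ("by", "by-sa", "by-nc", "by-nd", "by-nc-nd", "cc0"):
    print(example, usage_rights(example))
```

Attribution and share-alike conditions still apply where the license includes them; this only covers the two filters.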

Adding audio is planned for Q4. Video will likely be in 2020. We already have
an API, although we're not encouraging use of it until we build a few more
features in.

Docs:
[https://api.creativecommons.engineering/](https://api.creativecommons.engineering/)

Code: [https://github.com/creativecommons/cccatalog-api/](https://github.com/creativecommons/cccatalog-api/)

------
s_y_n_t_a_x
I'd remove the unnecessary animations when loading the images.

~~~
kgodey
Thanks for the feedback! This is on our list to do soon, we've had other
complaints.

------
mikece
Neat... though searching on "F-14" returned a lot of yoga images instead of
the naval fighter (searching "F14" returned what I was looking for).

~~~
kgodey
Thanks for the feedback! Currently the search query parsing is pretty naive
and some search terms work far better than others. A big chunk of our work
over the next few months is going to be focused on improving search relevance.
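
One possible direction for cases like "F-14" versus "F14", sketched very roughly: expand the query into hyphen-free variants before searching so that either spelling can match. This is an illustration of the general idea only, not what we run today or a commitment to a design; `query_variants` is a hypothetical helper.

```python
import re

# A hyphen with a word character on both sides, as in "F-14".
INNER_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def query_variants(query):
    """Return the query plus variants with inner hyphens removed or spaced."""
    variants = [query]
    collapsed = INNER_HYPHEN.sub("", query)   # "F-14" -> "F14"
    spaced = INNER_HYPHEN.sub(" ", query)     # "F-14" -> "F 14"
    for candidate in (collapsed, spaced):
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(query_variants("F-14"))      # ['F-14', 'F14', 'F 14']
print(query_variants("flamingo"))  # ['flamingo']
```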

~~~
propogandist
have you considered using something like the reCaptcha approach, where you
have users classify/improve results?

------
reiinakano
I did a search for Flamingo and there were no flamingos...

------
jpincheira
Unsplash has 0,95M, and CC Search 300M. Love this. No more flickr search for
when I needed to find CC photos. <3

Thanks so much for your work @kgodey and to your team.

~~~
kgodey
Thank you! :)

------
nategri
Looks like it's currently experiencing the hug. Unless a search for 'cat'
returning no results is expected behavior.

~~~
O1111OOO
> Looks like it's currently experiencing the hug.

Same thing happened to me (no results) and that was my initial thought but, in
my case, I had forgotten to check uMatrix. Some vital CDNs and subdomains were
allowed but uMatrix was blocking api.creativecommons.engineering

------
endergen
Where is it claimed there are 300M photos? Would love a source before claiming
that in my design Slack channel

~~~
kgodey
Here's the official blog post: [https://creativecommons.org/2019/04/30/cc-search-images/](https://creativecommons.org/2019/04/30/cc-search-images/)

I've also seen the database :).

------
miguelmota
This is awesome. I wonder how long until Google implements a similar filter.

~~~
kkaranth
They already have it. You can choose your required "usage rights" in the
advanced search settings page (at least for images).

~~~
miguelmota
Cool! Didn’t see it in mobile but did see the “Labeled for Reuse” filter
option

------
nvr219
Tried searching for penis and was impressed by the variety of SFW results

~~~
pbhjpbhj
There doesn't appear to be a visible "sfw" flag in the search interface, but I
did your search and searched "vulva" and got barely any photos of human
anatomy. That seemed unusual somehow, is there an active nsfw filter in place?

~~~
kgodey
We don't have any filtering on our end but all the sites that we index data
from have content policies. For example, we only index public images from
Flickr which automatically excludes mature content.

------
lapnitnelav
I am very pleased with the quantity of cats any given search returns.

You may proceed.

------
Mirioron
I had no idea this even existed. That's really cool.

------
carapace
Blank page with JS disabled.

~~~
kgodey
Thanks for the feedback. We're going to be working on improving accessibility
next month and will address this.

------
chmars
@kgodey: Your default HTML source code includes CC icons loaded from
creativecommons.org. That is a problem:

1\. It results in tracking and could violate the GDPR and other data privacy
laws.

2\. It is third-party content and might get blocked by ad / content blockers.

I suggest offering a default HTML source code without CC icons.

~~~
kgodey
Thanks for the feedback chmars! CC Search is hosted at
search.creativecommons.org so assets loaded from creativecommons.org should
not count as third-party content or tracking. Let me know if I'm missing
something.
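
To spell out the reasoning, here is a rough sketch of a "same site" check that compares the last two labels of each hostname. That shortcut is an assumption that happens to work for these domains; real browsers and blockers consult the Public Suffix List instead. `naive_registrable_domain` and `is_same_site` are hypothetical helpers, and the asset path and the example.com page are made up.

```python
from urllib.parse import urlsplit


def naive_registrable_domain(url):
    """Return the last two hostname labels, e.g. "creativecommons.org".

    ASSUMPTION: two labels are enough. This is wrong for suffixes such as
    "co.uk", where the Public Suffix List would be needed.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def is_same_site(page_url, asset_url):
    """True when both URLs share the same (naively computed) site."""
    return naive_registrable_domain(page_url) == naive_registrable_domain(asset_url)


# Hypothetical asset URL, for illustration only.
icon = "https://creativecommons.org/images/cc-icon.png"

# Viewed on CC Search itself, the icon is same-site.
print(is_same_site("https://search.creativecommons.org/", icon))

# If the attribution snippet is pasted into a page on another domain,
# the same icon is cross-site from that page's point of view.
print(is_same_site("https://example.com/blog/post", icon))
```

If the concern is about the copyable attribution HTML being embedded on other people's sites, the second case is the relevant one.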

