Show HN: CC Search – search engine for 300M CC-licensed images (creativecommons.org)
375 points by kgodey 51 days ago | 87 comments

I'm Director of Engineering at Creative Commons and part of the team that is working on CC Search.

We've been working on the product for over a year and we are just out of beta today! One of CC's goals is to encourage the use and remixing of CC-licensed content, and we hope that CC Search will help make that content more discoverable. The current version is very much an MVP and only searches images, but we plan to add more content types in the future and index the ~1.4 billion works out there under a CC license. We would love any feedback you might have.

Also, CC Search, the associated API, and the scripts we use to index data are all open-source and developed completely openly. Our sprints and roadmap are public and we welcome contributions from the community.

Relevant links:

CC Search code: https://github.com/creativecommons/cccatalog-frontend/

CC Catalog API code: https://github.com/creativecommons/cccatalog-api/

CC Catalog code: https://github.com/creativecommons/cccatalog/

2019 Vision: https://creativecommons.org/2019/03/19/cc-search/

Roadmap: https://docs.google.com/document/d/19yH2V5K4nzWgEXaZhkzD1egz...

Active sprint: https://github.com/orgs/creativecommons/projects/7

Backlog: https://github.com/orgs/creativecommons/projects/10

How to contribute to CC projects: https://creativecommons.github.io/contributing-code/

If I put images on the internet, how can I tell your search about them and about the license?

We plan to implement an API for users to submit CC-licensed works that they'd like to add to our database later this year. Once we have that, you can call our API when you publish a new CC-licensed work and we'll automatically add it to our catalog. We have to work out some details like how we can verify that the content is actually CC-licensed.

Encouraging use of ccREL embedding like aldenpage mentioned in another response is also in our long-term vision since that would require no additional actions from the creators.
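For readers who haven't seen ccREL in practice: its simplest form is a link carrying `rel="license"` that points at the license URL. A minimal sketch of how an indexer could pick that up from a page, using only the Python standard library. This is an illustration, not CC Search's actual crawler; the class and function names are made up here, and full ccREL also covers RDFa attribution properties that this ignores.

```python
from html.parser import HTMLParser

# URL fragments that identify a CC license or public domain dedication.
CC_MARKERS = ("creativecommons.org/licenses/", "creativecommons.org/publicdomain/")


class LicenseLinkParser(HTMLParser):
    """Collects href values of <a>/<link> tags whose rel contains 'license'."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes.
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href") or ""
        if "license" in rel_tokens and any(m in href for m in CC_MARKERS):
            self.license_urls.append(href)


def find_cc_licenses(html):
    """Return every CC license URL declared via rel="license" in the HTML."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.license_urls


page = (
    '<p>Photo by Jane, <a rel="license" '
    'href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a></p>'
)
print(find_cc_licenses(page))
```

A declaration like this only says what the page claims; it does nothing to verify that the publisher actually holds the rights.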

You could have the same (only existing?) process as other content platforms like YouTube, GitHub, etc. In practice, most people are fair and post the content with the right license, but anyone can send a DMCA request.

An annual transparency report is probably mandatory with that process, like these ones:



That should probably be linked somewhere prominently then (I was aware it exists, but unsure if it is still current, given that there have been little changes to that page in the last decade). And I guess I have to hope I end up in Common Crawl?

Hello, I work on this project! For the most part, we use Common Crawl to discover which websites carry the most CC content, and then integrate the platform through either their API, if available, or put together a bespoke scraper. If you put your content on one of these integrated platforms, eventually your work will appear in our collection.

In my mind, the dream is to have the user embed an asset from our servers on their web page (like an updated version of these old CC license buttons [0]), read the referrer headers from the server logs, and then dispatch crawlers that read ccREL [1] data embedded on the page, which would allow us to instantly index content as soon as it is published. Performing broad web crawls searching the web for ccREL data is also possible but probably not what we're looking to do in the near-term.

We have a ways to go before we are able to do this, since there's no easy way for end users to create and embed ccREL at the moment, and there are of course lots of other unanswered questions about how we would moderate incorrect attribution, how these tools might be abused, etc.

[0] https://licensebuttons.net/l/by-nd/4.0/88x31.png

[1] https://www.w3.org/Submission/ccREL/
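A rough sketch of the "read the referrer headers from the server logs" step, in case it helps picture the idea. It assumes the asset server writes logs in the common "combined" format and that the embedded assets live under a `/l/` path like the button in [0]; both are assumptions, and the function name is hypothetical. The output is just a de-duplicated list of pages a crawler could visit to look for ccREL.

```python
import re
from urllib.parse import urlparse

# Matches the request, status, size and referrer fields of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"'
)


def pages_to_crawl(log_lines, asset_prefix="/l/"):
    """Return unique referring pages that embedded an asset under asset_prefix."""
    seen = set()
    queue = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or not match.group("path").startswith(asset_prefix):
            continue
        referrer = match.group("referrer")
        # Skip missing referrers ("-") and anything that is not a web URL.
        if urlparse(referrer).scheme not in ("http", "https"):
            continue
        if referrer not in seen:
            seen.add(referrer)
            queue.append(referrer)
    return queue


sample = [
    '1.2.3.4 - - [01/May/2019:10:00:00 +0000] "GET /l/by-nd/4.0/88x31.png HTTP/1.1" 200 1234 "https://example.org/post" "UA"',
    '1.2.3.4 - - [01/May/2019:10:00:05 +0000] "GET /l/by-nd/4.0/88x31.png HTTP/1.1" 200 1234 "-" "UA"',
]
print(pages_to_crawl(sample))
```

Referrers are often stripped by browsers or referrer policies, so a log-driven queue like this would miss some pages by design.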

Excellent project, this simplifies the process I currently have of searching many different sources and applying a cc filter (mostly Flickr and Google images).

The two image filters I use most often there are image size (larger than) and orientation (portrait or landscape). If these would be included here it would be perfect.

Thank you for the feedback!

Nice project! Thanks for working on it!

It would be great if there were more search filters. Especially size ("larger than") and date ("within x-y", "last 1 day/week/month/year").

And when you have filters anyway, filtering by EXIF data (geoposition, and typical photography details like camera model / aperture / focal length) would be pretty cool.

Thanks for the feedback! I've added those filters to our pipeline of ideas for our roadmap [0].

[0] https://docs.google.com/document/d/19yH2V5K4nzWgEXaZhkzD1egz...

This is nice, I can see this being used by computer vision researchers and practitioners to train CNNs. Have you thought of implementing a reverse image search engine? That would be a great feature or project to work on, as it can help users find images they are interested in, since tags tend to be noisy.

Thanks! I’ve added this to the pipeline of ideas on our roadmap [0].

[0] https://docs.google.com/document/d/19yH2V5K4nzWgEXaZhkzD1egz...

Do you mind me asking how many man-years the project took? Is there an article on/diary covering the dev decisions?

We've been working on it for a year and we have one data engineer, one backend engineer, and one frontend engineer.

We recently set up a tech blog[0] and we have upcoming posts that will talk about the architecture and decisions we made. Here's a post about the original proof-of-concept (we are no longer using that version): https://hackernoon.com/cc-search-developer-notes-and-reflect...

[0] https://creativecommons.github.io/blog/

Out of curiosity, are there any checks run on these images to see if they really are CC?

This search result for 'satellites' is a screenshot of a google maps result, which includes the copyright information for the imagery within the CC image [see the bottom right attribution text in the imagery].

How can that really be in CC?


How could we be sure that the images themselves are genuinely licensed rather than someone copying it from another source and slapping on the licence?

We do our best to verify the license but cannot 100% guarantee that the image is CC-licensed. We have custom scripts that we use to ingest content from each content provider. For Flickr, we use their API which contains license information for each work. If you click through to the source image, you can see that the user who uploaded the image did license it under CC BY-SA 2.0.

We don't yet have a way of dealing with content that the user licensed incorrectly. I think we could add a "report image" function that would help us identify and remove this type of content. There's a disclaimer if you scroll down that says "Verify at the source: Flickr" with an explanation of why we can't guarantee the license; maybe we should make that more prominent.
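Once a provider reports a license, such as the CC BY-SA 2.0 mentioned above, what it permits follows mechanically from the license code: NC means no commercial use, ND means no derivatives, SA means share-alike. A small sketch of that mapping from a license URL. The function name and the returned field names are made up for illustration and are not taken from the CC Catalog code.

```python
import re

# Matches URLs such as https://creativecommons.org/licenses/by-sa/2.0/
LICENSE_URL = re.compile(
    r"^https?://(?:www\.)?creativecommons\.org/licenses/"
    r"(?P<code>[a-z-]+)/(?P<version>\d+\.\d+)(?:/|$)"
)


def describe_license(url):
    """Return the name and basic permissions of a CC license URL, or None."""
    match = LICENSE_URL.match(url.strip().lower())
    if match is None:
        return None
    elements = match.group("code").split("-")
    return {
        "name": "CC " + match.group("code").upper() + " " + match.group("version"),
        "commercial_use": "nc" not in elements,  # NC = NonCommercial
        "modification": "nd" not in elements,    # ND = NoDerivatives
        "share_alike": "sa" in elements,         # SA = ShareAlike
    }


print(describe_license("https://creativecommons.org/licenses/by-sa/2.0/"))
```

This only interprets what the source declares. It cannot tell whether the uploader had the right to apply that license in the first place.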

Maybe the Flickr user did buy a license from Google for that particular screenshot that allowed them to publish it as CC BY-SA. How would you verify this? What's the legal situation here? Can Google sue me when I use that image, or do they have to sue the Flickr user?

If you are publishing the image on a website, or billboard, or book cover, or any other use, you are ultimately liable for any infringement. If you licensed the image from a traditional stock photo service, they have certain practices that ensure the artist does have the rights to license the image to you, and they make some limited guarantees to you to that end. If the images were sold to you without permission of the original artist, and you were successfully sued, it is possible you could pursue the stock image company for these damages. The CC licenses and sites like Flickr are not giving you any guarantee or warranting their use, and they expressly disclaim any liability.

Google can sue you; you can in turn probably sue whoever "gave" you the image and said it was licensed CC BY-SA. Just like you can't keep a stolen item that somebody sold you for cash, you're not free from liability when you're publishing an image you didn't have rights to. You'll always need to apply judgement before publishing, and I'd do a reverse image search at the very least.

But in general, if you inadvertently buy an item that was stolen (on eBay, say), then you wouldn't be prosecuted for it; you would just lose ownership.

In the parallel situation here the 'selling' company [CC search] is saying "this isn't stolen", and you're using it in good faith. Seems like your liability _should_ be zero. If CC_search also did due diligence then their liability should probably be zero too.

Media mega-corps probably see things differently so I imagine their copyright laws mightn't be as liberal.

Yeah, it's a civil issue, I don't think anybody would try to jail somebody. You will be in breach of license, so the rights holder won't care why you're violating copyright, only that you do. I don't know of any judicial system where "good faith" is applied to civil suits (i.e. you eating my food thinking it was yours gets you out of buying me a new plate).

Similar to the hypothetical eBay case: you're going to be asked to reimburse the owner (or return his property); you collecting from the seller is a different story.

Would be interesting if moving forward, when a user wants to take an image they know they will likely share as CC, they could tap a button to turn on 'CC tagging' such that the camera app applies a digital signature to the image. Perhaps then, at least some image sources could be validated as being intentionally released under CC by the operator of the image source at the time.

Thanks for the answer, that is completely understandable. I guess the user just needs to do a bit of due diligence before running with it.

The same question could be asked for code posted on GitHub, yet it doesn't seem to be a problem in practice. Everyone just trusts the license header in the files.

Definitely appreciate the programmatic access, but in terms of straight search results it's very hard to improve on Google Image Search with the Usage Rights filter. Info: https://support.google.com/websearch/answer/29508?p=ws_image...

Thanks for the feedback! A big chunk of our upcoming work is going to be towards improving search relevance. We also plan to add content types other than images this year, starting with open textbooks and audio, so that will differentiate us.

That would be fantastic. I'd also like to +1 the reverse image search support; finding CC images visually similar to a restricted-license image would be a really cool feature that Google does not seem to support.

Thank you for doing this. I hope one day we will see a similar search for CC-licensed music where you can search based on license type and, e.g., music style, beats per minute, length, audio format, ...

I was recently trying to find music for a side project, and searching for CC-licensed music appears to have gone from somewhat onerous to very onerous on sites like Jamendo.com, Free Music Archive, etc. This is off topic, but if someone can point to a website with CC-licensed music where it's possible to search based on license type (e.g. BY-NC-SA), I would love to know.

Thanks for your feedback! We plan to add audio to CC Search later this year (probably in Q4). We will have searching by license type built in from the beginning and those all sound like great filters to add.

Remember that for commercial use you still need a model release in the USA when faces are identifiable. It is totally separate from the licensing of the image.

What's the relevant USC for that please?

Don’t know what you mean by USC

I arbitrarily searched for "hula hoop" and most of the top images have people in them. People using this facility to find images to freely use would be advised to avoid images of recognizable people, because you don't know if they signed model releases or consent to being a part of your web site.

Thanks for the feedback! We will figure out if there is a way to communicate this better.

Excellent job. Thanks for all the hard work.

A few usability issues: 1/ It doesn't work without JavaScript, and 2/ at a screen resolution of 800x600 the search field gets completely hidden. Shown below.


Thanks for your feedback!

It's interesting to me how many images are lost when you filter by "right to modify". I wonder if "CC-SA" should be the default licence, ie should be what "CC" means.

Similarly if the default CC license were "NC" then I imagine many more shared images would be excluded from commercial use.

My suggestion is that people probably wouldn't mind modification of their CC images as a default.

I usually use CC-BY-SA.

Edit: looking afresh at the CC license material, it appears my understanding of the licenses is weak (e.g. see down-thread) and that the default does allow modification. That makes it weirder that people would go out of their way to specify that their pretty poor quality images could only be used as-is and not modified.

I've always wondered: what does "modification" include? Is resizing modification? Cropping? If I slap some text on the raw image, would that constitute modification as well?

It's whatever the courts decide. Very not helpful for individuals.

There are _a lot_ of CC licensed images coming up through Flickr search that are nowhere to be found in CC Search and since the vast majority of CC Search's images come from Flickr, you would be better off just searching on Flickr...

Thanks for the feedback! We are working with Flickr to ensure that we are able to index all their CC-licensed images; currently there's an issue with their API that hides some images. It should be resolved by the end of the summer.

We have images from a lot of other collections, such as the Met, Rijksmuseum, Behance, Thingiverse etc. Flickr has more images than any of them by a couple of orders of magnitude, though.

Definitely my favorite: https://search.creativecommons.org/photos/9988a06f-c7d1-466d...

Thank you for this.


Even more creepy is the related images for that link. None of them are related in any way, shape or form to it or each other.

Actually they are. They are all "awkward."

Nice, I have been mostly using Bing as a search engine for public domain / CC0 images and it worked reasonably well, but a search engine specialized in this area is very useful. What I immediately noticed is that when you search for something including "icon", it shows lots of images of religious icons, which makes sense in a way but is probably not what most people want if they search for "fireball icon" or "mail icon" or "car icon".

Thanks for the feedback! A big chunk of our upcoming work will focus on improving search results.

Pretty neat! Another great resource is the 53 million or so files on Wikimedia Commons — https://commons.wikimedia.org/wiki/ — which don't appear to be searched by this. Commons has plenty of material that is PD, so I suppose that is technically out of scope, but it also restricts licenses to CC-BY or CC-BY-SA whereas some of these are NC.

Thank you! Wikimedia Commons is our next big priority. We do also index PD content.

Is there a search API available or would it be against the terms to access the site programmatically?

We do have an API available, although we're not encouraging use of it until we build a few more features in.

Docs: https://api.creativecommons.engineering/

Terms of Service: https://api.creativecommons.engineering/terms_of_service.htm...

Code: https://github.com/creativecommons/cccatalog-api/

Thanks for the info.

It is always nice to have search engines for free stuff. I have minor gripes with the engine.

- Browser navigation doesn't work. Going back doesn't take you to your previous search query.

- Search seems to be solely keyword based? And keywords are kind of hit and miss on many images.

Thanks for the feedback, we'll add browser navigation to our list of things to fix. Our search is still pretty naive; a big chunk of our upcoming work will be focused on improving results.

This is cool and it is great that the software is all open source, but randomly clicking around it does seem that an awful lot of images all come from Flickr, with the images even being hosted by Flickr. It seems to me the interface could make the source more clear.

Thanks for your feedback! We don't host any of the images, we link people directly to the source. Flickr does have orders of magnitude more images than other providers. We will add making the source more prominent to our list of issues to fix.

This is great. Is this all CC0 or do some of the images have conditions?

My wishlist would include also including videos and music/audio and an API to access it. I'm sure that's a big ask though.

Not all content is CC0; we index content shared under all CC licenses and public domain. You can filter by license or usage rights.

Adding audio is planned for Q4. Video will likely be in 2020. We already have an API, although we're not encouraging use of it until we build a few more features in.

Docs: https://api.creativecommons.engineering/

Code: https://github.com/creativecommons/cccatalog-api/

I'd remove the unnecessary animations when loading the images.

Thanks for the feedback! This is on our list to do soon; we've had other complaints.

Neat... though searching on "F-14" returned a lot of yoga images instead of the naval fighter (searching "F14" returned what I was looking for).

Thanks for the feedback! Currently the search query parsing is pretty naive and some search terms work far better than others. A big chunk of our work over the next few months is going to be focused on improving search relevance.
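For what it's worth, one common way to make "F-14" and "F14" behave the same is to expand the query into hyphen variants before matching. The thread doesn't say how CC Search parses queries or why "F-14" misfires, so this is only a sketch of the general technique, with a made-up function name, not a description of the actual fix.

```python
import re

# A hyphen that sits between two word characters, as in "f-14".
INNER_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def query_variants(query):
    """Return the normalized query plus variants without intra-word hyphens."""
    base = " ".join(query.lower().split())  # lowercase, collapse whitespace
    variants = [base]
    if INNER_HYPHEN.search(base):
        variants.append(INNER_HYPHEN.sub("", base))   # "f-14" -> "f14"
        variants.append(INNER_HYPHEN.sub(" ", base))  # "f-14" -> "f 14"
    return variants


print(query_variants("F-14"))
print(query_variants("hula hoop"))
```

A search backend could then match any of the variants and rank exact matches on the original form highest.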

have you considered using something like the reCaptcha approach, where you have users classify/improve results?

Agreed, "ninja" came with a lot of dogs and cats (presumably named Ninja). And "test" provides a weird assortment of not tests. I'd say filename keyword matching might be dangerous and some measure of image analysis should be made.

Presumably they're going to allow tag editing, maybe using a known-good-user concordance system, with meta grading (if your tags are always removed you lose trading rights).

I did a search for Flamingo and there were no flamingos...

Unsplash has 0.95M, and CC Search 300M. Love this. No more Flickr search for when I need to find CC photos. <3

Thanks so much for your work @kgodey and to your team.

Thank you! :)

Looks like it's currently experiencing the hug. Unless a search for 'cat' returning no results is expected behavior.

> Looks like it's currently experiencing the hug.

Same thing happened to me (no results) and that was my initial thought but, in my case, I had forgotten to check uMatrix. Some vital CDNs and subdomains were allowed but uMatrix was blocking api.creativecommons.engineering

Where is it claimed there are 300M photos? Would love a source before claiming that in my design Slack channel.

Here's the official blog post: https://creativecommons.org/2019/04/30/cc-search-images/

I've also seen the database :).

This is awesome. I wonder how long til Google implements a similar filter.

They already have it. You can choose your required "usage rights" in the advanced search settings page (at least for images).

Cool! Didn’t see it in mobile but did see the “Labeled for Reuse” filter option

Tried searching for penis and was impressed by the variety of SFW results

There doesn't appear to be a visible "sfw" flag in the search interface, but I did your search and searched "vulva" and got barely any photos of human anatomy. That seemed unusual somehow, is there an active nsfw filter in place?

We don't have any filtering on our end but all the sites that we index data from have content policies. For example, we only index public images from Flickr which automatically excludes mature content.

I am very pleased with the quantity of cats any given search returns.

You may proceed.

I had no idea this even existed. That's really cool.

Blank page with JS disabled.

Thanks for the feedback. We're going to be working on improving accessibility next month and will address this.

@kgodey: Your default HTML source code includes CC icons loaded from creativecommons.org. That is a problem:

1. It results in tracking and could violate the GDPR and other data privacy laws.

2. It is third-party content and might get blocked by ad / content blockers.

I suggest offering a default HTML source code without CC icons.

Thanks for the feedback chmars! CC Search is hosted at search.creativecommons.org so assets loaded from creativecommons.org should not count as third-party content or tracking. Let me know if I'm missing something.
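To make the "same site" point concrete: a page and an asset are usually treated as first-party to each other when they share a registrable domain. Below is a deliberately naive sketch that compares the last two host labels. Real browsers and content blockers consult the Public Suffix List (so this is wrong for domains like example.co.uk), and the function names are hypothetical. It also says nothing about how a page on some other domain that embeds the icons would be classified, or about the legal question.

```python
from urllib.parse import urlparse


def registrable_domain(url):
    """Naive guess at the registrable domain: the last two host labels."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def is_same_site(page_url, asset_url):
    """True when both URLs share the same (naively computed) domain."""
    page_domain = registrable_domain(page_url)
    return page_domain != "" and page_domain == registrable_domain(asset_url)


print(is_same_site("https://search.creativecommons.org/",
                   "https://creativecommons.org/icon.png"))
print(is_same_site("https://example.org/gallery",
                   "https://creativecommons.org/icon.png"))
```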
