
Based on the article, I’d expect Apple was retooling their CSAM scanner to try to catch art thieves.

Jokes aside, I would like to use this opportunity to express something I really want: I really wish I could search the Wayback Machine with perceptual hashes. Google Images has had search by image for a long time, but it seems to drop content a while after it goes offline. Meanwhile, the Internet Archive has a ton of images you basically can’t find anywhere else anymore, and depending on how a page was archived, an image may be very difficult to find if you don’t already know its URL. For the sake of preservation, a perceptual-hash search would be genuinely amazing: you could go from a single thumbnail and potentially find more images or better versions.

It’s not like being able to identify common objects and artifacts with a phone camera isn’t super cool, but it’s far from perfect, and in some of its more novel use cases (such as helping blind people navigate) that can be troublesome. Nothing technically stops the aforementioned Internet Archive phash index from existing, except that there will probably never be enough resources to create or maintain it.
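
To make the idea concrete, here is a rough sketch of the hashing side in Python, using the imagehash library and matching by Hamming distance. The library choice, helper names, and the distance threshold are my own assumptions for illustration, not anything the Internet Archive actually uses:

  # Sketch: compute perceptual hashes and match near-duplicates by Hamming distance.
  # Requires the third-party Pillow and imagehash packages; the threshold is arbitrary.
  from PIL import Image
  import imagehash

  def phash_file(path):
      # 64-bit DCT-based perceptual hash of the image
      return imagehash.phash(Image.open(path))

  def is_near_duplicate(path_a, path_b, max_distance=8):
      # Subtracting two ImageHash objects gives their Hamming distance
      return phash_file(path_a) - phash_file(path_b) <= max_distance

At the archive's scale you would also need something like a BK-tree or multi-index hashing to avoid pairwise comparisons, which is part of why the resource problem is real.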




Internet Archive has an experimental API to perform reverse image searches.

https://archive.readme.io/docs/reverse-image-search-api

There is also RootAbout: http://rootabout.com/

You may have a better chance of finding the image by searching on a couple dozen search engines using my extension.

https://github.com/dessant/search-by-image#readme


Thanks, this is awesome stuff. I’ll have to give it a shot sometime.

I’m also seeing a full-text search API, which is, again, incredible, especially if those indices are relatively complete.


Agreed that it'd be great to have that phash image index. A full-text search of the Wayback Machine's archives would be amazing to have as well!

I've been putting off the idea of standing up a server that would request archives from the Wayback Machine, parse the text out of the HTML documents, and build the world's simplest search index, i.e. just the locations (document IDs) of every encountered word, as sketched below. There are a ton of problems with this "plan", but... having any search would be better than nothing?
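
For what it's worth, the inverted-index core of that plan really is tiny; here is a minimal sketch in Python, with naive tokenization, everything held in memory, and names made up for illustration:

  # Sketch of the "world's simplest" search index: word -> set of document ids.
  from collections import defaultdict
  import re

  index = defaultdict(set)

  def add_document(doc_id, text):
      for word in re.findall(r"[a-z0-9]+", text.lower()):
          index[word].add(doc_id)

  def search(query):
      # Documents containing every query term (intersection of the postings sets)
      postings = [index[w] for w in re.findall(r"[a-z0-9]+", query.lower())]
      return set.intersection(*postings) if postings else set()

The hard parts are everything around it: fetching the archives, deduplicating revisions of the same URL, and keeping the postings on disk once they no longer fit in memory.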


Honestly, I think possibly the biggest problem with indexing the Wayback Machine is simply its size. I’m pretty sure it’s growing far faster than anyone can pull WARCs out of it for indexing, especially because, well, it’s not exactly high throughput on the download side. I don’t blame anyone for that, but it does make the prospect of indexing it externally feel a bit bleak.

At this point, I’d like it if there were just tools to index huge WARCs on their own. Maybe it’s time to write that.


Right, the download speed is definitely an issue (and like you say, it's quite understandable considering the volume/traffic they deal with), and the continual growth is one of many factors I didn't consider.

I wonder if the IA would allow someone to interconnect directly with their storage datacenter, if one were to submit a well-articulated plan to create this search index/capability.

Also, what do you mean by tools to index WARCs? Specifically, the gzip + WARC parsing + html parsing steps? Would the (CLI?) result be text extracted from the original html pages, i.e. something along the lines of running `strings` or beautifulsoup?


Yeah, pretty much. Though being able to directly load data into a search cluster like Elastic would be nice.
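
A minimal sketch of that pipeline, assuming the third-party warcio and beautifulsoup4 packages (the Elasticsearch index name and document shape are made up): it streams a gzipped WARC, pulls the text out of HTML responses, and yields documents ready for a bulk load:

  # Sketch: WARC -> extracted text -> documents for elasticsearch.helpers.bulk
  from warcio.archiveiterator import ArchiveIterator
  from bs4 import BeautifulSoup

  def iter_documents(warc_path):
      with open(warc_path, "rb") as stream:
          for record in ArchiveIterator(stream):
              if record.rec_type != "response":
                  continue
              ctype = (record.http_headers.get_header("Content-Type") or "") if record.http_headers else ""
              if "html" not in ctype:
                  continue
              url = record.rec_headers.get_header("WARC-Target-URI")
              text = BeautifulSoup(record.content_stream().read(), "html.parser").get_text(" ", strip=True)
              # "wayback-text" is a hypothetical index name
              yield {"_index": "wayback-text", "_source": {"url": url, "text": text}}

Something like elasticsearch.helpers.bulk(es, iter_documents("example.warc.gz")) would then do the loading, though at Wayback Machine scale the bottleneck would remain the download, not the indexing.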


For as long as it lasts, Yandex is my go-to reverse image search, by a rather large margin.

Try searching with a portrait… it is unlikely to find the person unless there are images of that person on Russian social media. But it will find your identical twin behind the ironic curtain.



