
Based on the article, I’d expect Apple was retooling their CSAM scanner to try to catch art thieves.

Jokes aside, I would like to use this opportunity to express something I really want: I really wish I could search the Wayback Machine with perceptual hashes. Google Images has had search by image for a long time, but it seems to drop content a while after it goes offline. Meanwhile, the Internet Archive has a ton of images you basically can’t find anywhere else anymore, and depending on how a page was archived, an image may be very difficult to find if you don’t already know its URL. For the sake of preservation, a perceptual-hash search would be genuinely amazing: you could go from a single thumbnail and potentially find more images or better versions.

It’s not like being able to identify common objects and artifacts with a phone camera isn’t super cool, but it’s far from perfect, and in some of its more novel use cases (such as helping blind people navigate) that can be troublesome. Nothing technically stops the aforementioned Internet Archive phash index from existing, except that there will probably never be enough resources to create or maintain it.
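
To make the idea concrete, here is a rough sketch of the hashing side in Python, using the imagehash library and matching by Hamming distance. The library choice, helper names, and the distance threshold are my own assumptions for illustration, not anything the Internet Archive actually uses:

  # Sketch: compute perceptual hashes and match near-duplicates by Hamming distance.
  # Requires the third-party Pillow and imagehash packages; the threshold is arbitrary.
  from PIL import Image
  import imagehash

  def phash_file(path):
      # 64-bit DCT-based perceptual hash of the image
      return imagehash.phash(Image.open(path))

  def is_near_duplicate(path_a, path_b, max_distance=8):
      # Subtracting two ImageHash objects gives their Hamming distance
      return phash_file(path_a) - phash_file(path_b) <= max_distance

At the archive's scale you would also need something like a BK-tree or multi-index hashing to avoid pairwise comparisons, which is part of why the resource problem is real.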




Internet Archive has an experimental API to perform reverse image searches.

https://archive.readme.io/docs/reverse-image-search-api

There is also RootAbout: http://rootabout.com/

You may have a better chance of finding the image by searching on a couple dozen search engines using my extension.

https://github.com/dessant/search-by-image#readme


Thanks, this is awesome stuff. I’ll have to give it a shot sometime.

I’m also seeing a full-text search API, which is, again, incredible, especially if those indices are relatively complete.


Agreed that it'd be great to have that phash image index. A full-text search of the Wayback Machine's archives would be amazing to have as well!

I've been putting off the idea of standing up a server that would request archives from the Wayback Machine, parse the text out of the HTML documents, and build the world's simplest search index, i.e. just the locations (document IDs) of every encountered word, as sketched below. There are a ton of problems with this "plan", but... having any search would be better than nothing?
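
For what it's worth, the inverted-index core of that plan really is tiny; here is a minimal sketch in Python, with naive tokenization, everything held in memory, and names made up for illustration:

  # Sketch of the "world's simplest" search index: word -> set of document ids.
  from collections import defaultdict
  import re

  index = defaultdict(set)

  def add_document(doc_id, text):
      for word in re.findall(r"[a-z0-9]+", text.lower()):
          index[word].add(doc_id)

  def search(query):
      # Documents containing every query term (intersection of the postings sets)
      postings = [index[w] for w in re.findall(r"[a-z0-9]+", query.lower())]
      return set.intersection(*postings) if postings else set()

The hard parts are everything around it: fetching the archives, deduplicating revisions of the same URL, and keeping the postings on disk once they no longer fit in memory.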


Honestly, I think possibly the biggest problem with indexing the Wayback Machine is simply its size. I’m pretty sure it’s growing far faster than anyone can pull WARCs out of it for indexing, especially because, well, it’s not exactly high throughput on the download side. I don’t blame anyone for that, but it does make the prospect of indexing it externally feel a bit bleak.

At this point, I’d like it if there were just tools to index huge WARCs on their own. Maybe it’s time to write that.


Right, the download speed is definitely an issue (and like you say, it's quite understandable considering the volume/traffic they deal with), and the continual growth is one of many factors I didn't consider.

I wonder if the IA would allow someone to interconnect directly with their storage datacenter, if one were to submit a well-articulated plan to create this search index/capability.

Also, what do you mean by tools to index WARCs? Specifically, the gzip + WARC parsing + html parsing steps? Would the (CLI?) result be text extracted from the original html pages, i.e. something along the lines of running `strings` or beautifulsoup?


Yeah, pretty much. Though being able to directly load data into a search cluster like Elastic would be nice.
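
A minimal sketch of that pipeline, assuming the third-party warcio and beautifulsoup4 packages (the Elasticsearch index name and document shape are made up): it streams a gzipped WARC, pulls the text out of HTML responses, and yields documents ready for a bulk load:

  # Sketch: WARC -> extracted text -> documents for elasticsearch.helpers.bulk
  from warcio.archiveiterator import ArchiveIterator
  from bs4 import BeautifulSoup

  def iter_documents(warc_path):
      with open(warc_path, "rb") as stream:
          for record in ArchiveIterator(stream):
              if record.rec_type != "response":
                  continue
              ctype = (record.http_headers.get_header("Content-Type") or "") if record.http_headers else ""
              if "html" not in ctype:
                  continue
              url = record.rec_headers.get_header("WARC-Target-URI")
              text = BeautifulSoup(record.content_stream().read(), "html.parser").get_text(" ", strip=True)
              # "wayback-text" is a hypothetical index name
              yield {"_index": "wayback-text", "_source": {"url": url, "text": text}}

Something like elasticsearch.helpers.bulk(es, iter_documents("example.warc.gz")) would then do the loading, though at Wayback Machine scale the bottleneck would remain the download, not the indexing.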


For as long as it lasts, Yandex is my go-to reverse image search, by a rather large margin.

Try searching with a portrait… it is unlikely to find the person unless there are images of that person on Russian social media. But it will find your identical twin behind the ironic curtain.



