Yep, just mentioned it to the Archive Team IRC. We're probably going to selectiv...

thebouv · on Aug 13, 2020

Rough idea: maintain an Awesome List of images worth saving, take submissions from public, use that list to automate what to pull?

maxfan8 · on Aug 13, 2020

Yeah, good idea — I’m not in these fields so it’s difficult for me to judge. Also, it sounds like we should be prioritizing niche images that only a handful of papers use rather than images that people rely upon regularly.

cosmie · on Aug 13, 2020

Couldn't you bootstrap a list by searching/parsing the Archive dataset itself? Searching for

A) "docker pull" commands and parsing the text that comes after it based on the command's syntax[1] to extract instructional references to images such as "docker pull ubuntu:latest, and

B) Searching for links/text beginning with "https://hub.docker.com/_/" to identify informational references to image base pages such as (https://hub.docker.com/_/ubuntu)

[1] https://docs.docker.com/engine/reference/commandline/pull/

maxfan8 · on Aug 13, 2020

Good idea! The base images are probably not in danger of being deleted though.

The other issue is that (to my knowledge) the amount of papers on IA isn't terribly impressive. I think maybe indexing and going through SciHub will be better since some of these fields slap paywalls in front of their papers.

However, that's a pretty large task as well. The other thing is that papers rarely say "to reproduce my work do . . .". Usually the best we've got is a link to a GitHub repo (if that). I'm not sure how effective that strategy will be since it's guaranteed to be an under-count of the docker images we'd need to archive. Perhaps in conjunction with archiving all images that fall under particular search queries, we'd get the best of both worlds.

I've you've got ideas, feel free to hop onto efnet (#archiveteam and #archiveteam-bs) (also on hackint) to share your thoughts.

contravariant · on Aug 13, 2020

Since images tend to be based on each other I wonder if someone's analyzed the corresponding dependency graph yet. In theory you should get quite far if you isolate the most commonly used base images.

NewJazz · on Aug 13, 2020

Are those not the images that are basically guaranteed to stay in Dockerhub?

toomuchtodo · on Aug 13, 2020

“Guaranteed” is a strong word.