> ImageNet has now removed many of the obviously problematic people categories – certainly an improvement – however, the problem persists because these training sets still circulate on torrent sites [where files are shared between peers].
This is the scariest part of the article: the idea that some central authority should be censoring and revising datasets to keep up with political orthodoxy, and that we should be rooting out unauthorized torrent sharing of unapproved training data.
From a technical point of view, the common reason we pre-train on ImageNet is as a starting point for fine-tuning on a specific use case. The diversity and size of the dataset make it a good source of generic feature extractors. If you're using an ML model to label people as kleptomaniacs or drug dealers or other "problematic" categories, you're doing some kind of phrenology, and it doesn't take an "AI ethicist" to know you shouldn't do it. But that's not the same as pre-training on ImageNet, and it certainly doesn't support trying to make datasets align with today's political orthodoxy.
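To be concrete about what "pre-training on ImageNet as a starting point for fine-tuning" usually looks like, here is a minimal sketch using PyTorch/torchvision (assuming torchvision ≥ 0.13 for the weights API; num_classes and the training loop are placeholders for whatever the downstream task needs):

```python
# Minimal sketch: ImageNet-pretrained backbone as a generic feature extractor.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical class count for the downstream task

# Start from weights learned on ImageNet; the backbone already encodes
# generic visual features (edges, textures, object parts).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the backbone so only the new head is trained (pure feature extraction).
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-way ImageNet classifier with a head for the new task.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune only the new head; unfreezing deeper layers is the usual next step
# if the frozen features aren't good enough for the target domain.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```

The point is that the ImageNet labels themselves are thrown away here; what transfers is the learned feature hierarchy, not the category names.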