This is less about 'self-supervised' learning, and more about ground truth.
I see a variant of this in medical AI. E.g. people generate 'fake' augmented brain MRI datasets using image-processing tricks in order to provide more training data. But the 'fake' MRI datasets are really just similar-looking images, which may or may not be anatomically correct.
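To be concrete, here is a minimal sketch of the kind of augmentation tricks I mean (NumPy/SciPy, with illustrative parameter values only, not any real pipeline):

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment_slice(img: np.ndarray) -> np.ndarray:
    """Return a geometrically and intensity-perturbed copy of a 2D MRI slice."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:            # random left-right flip
        out = np.fliplr(out)
    angle = rng.uniform(-10, 10)      # small random rotation, in degrees
    out = rotate(out, angle, reshape=False, order=1, mode="nearest")
    # additive Gaussian noise scaled to the slice's own intensity spread
    out = out + rng.normal(0.0, 0.01 * out.std(), out.shape)
    return out

# e.g. double the dataset with one augmented copy per original slice:
# augmented = [augment_slice(s) for s in slices]
```

The outputs look plausible to the eye, but nothing in those transforms knows anything about anatomy, which is exactly my concern.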
If you start accepting a flood of fake images as legitimate data for training AI, then it will be impossible to trust the predictions, however good they are in the majority of cases.
I think you’re describing introduction of noise to training sets, which is a staple of training.
Your definitions of “fake” and “legitimate” are circular, and miss the central point of large ML models: they can extrapolate from imperfect data because of their massive scale.
Yes, the predictions will be imperfect. That’s true today, of both ML models and human radiologists. It’s about reducing the error rate, not designing a perfect algorithm that is never wrong. I’m pretty sure Gödel or someone can explain why the latter isn’t even possible, for machine or human.
That does not feel like a good example. The context here is largely generative models replacing human creations. We don’t need to generate brain MRIs. The use case you outline is a niche technique for training better models, not doing the thing we actually need done at such a scale that humans aren’t doing the original task anymore.