I miss the era of deep learning where we actually diagnosed failure modes instead of just scaling compute.
I spent the weekend auditing ViT vs. CNN decision boundaries with a custom perturbation pipeline. I bypassed LIME's default Quickshift segmentation (it produces too many small, high-variance segments) and swapped in a custom SLIC pipeline to force semantically coherent superpixels.
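If you want to reproduce the segmentation swap, here's a minimal sketch. LIME's image explainer accepts a segmentation_fn, so SLIC can be passed straight in; the `model` wrapper, `image` variable, and the SLIC/sampling parameters below are illustrative assumptions, not the exact settings from my pipeline.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import slic

def slic_segments(image):
    # Coarser, more semantically coherent superpixels than LIME's
    # default Quickshift; parameter values here are placeholders.
    return slic(image, n_segments=50, compactness=10.0, sigma=1.0, start_label=0)

def classifier_fn(batch):
    # Assumed wrapper: (N, H, W, 3) images -> (N, num_classes) probabilities.
    return model.predict_proba(np.asarray(batch))

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image,                         # H x W x 3 numpy array (assumed preloaded)
    classifier_fn,
    top_labels=3,
    hide_color=0,                  # grey out masked superpixels
    num_samples=1000,              # perturbed samples for the local surrogate
    segmentation_fn=slic_segments,
)
```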
The results show that "Clever Hans" is still very much an issue.
Spurious Correlations: The ViT predicted "Jeep" (p=0.99) based almost entirely on the muddy terrain texture; the attention map showed it ignored the vehicle geometry completely.
Hallucinations: EfficientNet reported a "toaster" solely because it picked up the white counter + flowers context; nothing in the superpixel attributions pointed at an actual object.
Accuracy metrics are masking the fact that our models are just exploiting dataset biases. Full write-up on the surrogate loss implementation and visual audits here.
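For context on the surrogate side, the local fit is essentially a distance-weighted ridge regression over binary superpixel masks, similar in spirit to what LIME's lime_base does. The sketch below shows that shape; the kernel width, alpha, and the assumption that row 0 is the unperturbed image are placeholders, not my exact implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import pairwise_distances

def fit_surrogate(masks, preds, kernel_width=0.25):
    """Fit a local linear surrogate over perturbed samples.

    masks: (num_samples, num_superpixels) binary on/off matrix,
           with row 0 assumed to be the unperturbed image (all ones).
    preds: (num_samples,) model probability for the class being explained.
    """
    # Weight each perturbed sample by proximity to the original image.
    distances = pairwise_distances(masks, masks[:1], metric="cosine").ravel()
    weights = np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))

    # Weighted ridge regression; coefficients rank superpixel importance.
    surrogate = Ridge(alpha=1.0, fit_intercept=True)
    surrogate.fit(masks, preds, sample_weight=weights)
    return surrogate.coef_
```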