Image augmentations are hard to add to training. It may seem easy, but it requires a lot of thought.

(To back up a bit: image augmentations are how you solve the problem "How do I make my model robust across different cameras?" It might be tempting to gather labeled data from a variety of cameras, but that doesn't necessarily produce a model that can handle newer, higher-resolution cameras. So one solution is to distort the training data with augmentations so that the model can't tell which resolution the input images came from.)

The other way to deal with it is to just downscale every camera's image to, say, 416x416. But that raises a question: can different cameras give images that look different even after being downscaled to 416x416? Sure they can! Cameras have a dizzying array of features, and they perform differently in different lighting conditions.

To return to the point about image augmentations being hard to add: It's so easy to explain what your training code should do ("just distort the hue a bit"), and there even seem to be operations explicitly for that: https://www.tensorflow.org/api_docs/python/tf/image/adjust_h... But when you go to train with them, you'll discover that backpropagation isn't implemented, i.e. they break in training code.
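
A minimal sketch of how to check this for yourself (whether you get a gradient, a None, or an error will depend on your TensorFlow version):

    import tensorflow as tf

    # Probe whether gradients flow through tf.image.adjust_hue.
    images = tf.Variable(tf.random.uniform([4, 64, 64, 3]))

    with tf.GradientTape() as tape:
        out = tf.image.adjust_hue(images, delta=0.1)
        loss = tf.reduce_mean(out)

    grad = tape.gradient(loss, images)
    print(grad)  # None (or an exception above) means you can't train through the op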

I've been trying to build a TensorFlow equivalent of Kornia (https://github.com/kornia/kornia), a wonderful library that implements image augmentations using nothing but differentiable primitives. Work is a bit slow, but I hope to release it in Mel (https://github.com/shawwn/mel), which will hopefully look less like a TODO soon.
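
For comparison, here's roughly what the Kornia (PyTorch) version looks like; a sketch assuming a float image batch in [0, 1]:

    import torch
    import kornia.augmentation as K

    # Kornia's color jitter is built from differentiable primitives,
    # so gradients flow straight through it.
    aug = K.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1, p=1.0)

    images = torch.rand(4, 3, 64, 64, requires_grad=True)  # B, C, H, W in [0, 1]
    out = aug(images)
    out.mean().backward()
    print(images.grad is not None)  # True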

But all of this still raises the question of which augmentations to add. Work in this area is ongoing; see Gwern's excellent writeup at https://github.com/tensorfork/tensorfork/issues/35

Training a model per camera isn't necessarily a terrible idea, either. In the future I predict that we'll see more and more "on-demand" models: models that are JIT optimized for a target configuration (in this case, a specific camera).

Robustness often comes at the cost of quality / accuracy (https://arxiv.org/abs/2006.14536 recently highlighted this). In situations where that last 2% of accuracy is crucial, there are all kinds of tricks; training separate models is but one of many.


> To return to the point about image augmentations being hard to add: It's so easy to explain what your training code should do ("just distort the hue a bit"), and there even seem to be operations explicitly for that: https://www.tensorflow.org/api_docs/python/tf/image/adjust_h... But when you go to train with them, you'll discover that backpropagation isn't implemented, i.e. they break in training code.

Why not do the data augmentation during preprocessing, so the transformations don't have to be differentiable? I.e., map the transformation over a tf.Dataset (and append the result to the original dataset).
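
Something roughly like this (random_hue/random_flip are just stand-ins for whatever transformations you actually need):

    import tensorflow as tf

    # Toy stand-in for a real (image, label) dataset.
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.uniform([32, 64, 64, 3]), tf.zeros([32], dtype=tf.int32)))

    def augment(image, label):
        # Nothing here needs to be differentiable: it runs in the input
        # pipeline, before anything the optimizer backprops through.
        image = tf.image.random_hue(image, max_delta=0.1)
        image = tf.image.random_flip_left_right(image)
        return image, label

    augmented = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    combined = dataset.concatenate(augmented).shuffle(1024)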


Why are you trying to backpropagate through data augmentations? I've never done that (or heard of it being done). Usually I just apply the augmentations to the input samples and then feed the augmented samples to the network.

Differentiable augmentations aren't necessary unless the augmentations sit midstream (so you have to propagate gradients back to parameters upstream of them, which is unusual) or have learnable parameters (at which point you aren't learning to handle different views of the same sample; you're learning to modify a sample to be more learnable, which is a different problem from the one you're trying to solve).

Don't get me wrong, augmenting samples to reduce device bias is a hard problem, but you might be making it harder than it needs to be.


The data augmentations we are interested in are in fact 'midstream': they are applied to the examples before they reach the D or the classification loss, but you must backprop from that loss back through the augmentation into the original model, because you don't want the augmentations to 'leak'. The G is not supposed to generate augmented samples; the augmentation is there to regularize the D and reduce its ability to memorize real datapoints. It's probably better to think of it as a kind of consistency or metric loss along the lines of SimCLR (which helped inspire these very new GAN data-augmentation techniques). It's a bit weird, which is perhaps why, despite its simplicity (indicated by no fewer than 4 simultaneous inventions of it in the past few months), it hasn't been done before. You really should read the linked GitHub thread if you are interested.
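
Sketched very roughly (toy stand-ins for G, D, and the augmentation here, not any particular paper's implementation):

    import torch

    def diff_augment(x):
        # Differentiable stand-in augmentation: random per-sample brightness shift.
        shift = (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * 0.4
        return (x + shift).clamp(-1.0, 1.0)

    # Toy G and D so the gradient path is visible end to end.
    G = torch.nn.Linear(8, 3 * 32 * 32)
    D = torch.nn.Linear(3 * 32 * 32, 1)

    z = torch.randn(4, 8)
    real = torch.rand(4, 3, 32, 32) * 2 - 1

    fake = G(z).view(4, 3, 32, 32)
    d_real = D(diff_augment(real).flatten(1))  # the real data gets augmented too
    d_fake = D(diff_augment(fake).flatten(1))  # G's gradients pass back through the
                                               # augmentation, so G never learns to
                                               # emit augmented samples

    g_loss = -d_fake.mean()
    g_loss.backward()
    print(G.weight.grad is not None)  # True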

Ah! I can see that in a GAN architecture. That makes much more sense.

It wasn't clear from your original post that you were augmenting generated images, not real data.


You're augmenting the real data too.

> Training a model per camera isn't necessarily a terrible idea, either. In the future I predict that we'll see more and more "on-demand" models: models that are JIT optimized for a target configuration (in this case, a specific camera).

Meta-learning, or perhaps learning camera embeddings to condition on, would be one way, although that might all be implicit if you use a deep enough NN and train on a sufficiently diverse corpus of phones+photos.
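
A rough sketch of the embedding-conditioning variant (all names and sizes here are made up for illustration):

    import tensorflow as tf

    # Hypothetical sizes: one embedding row per known camera model.
    num_cameras, emb_dim = 16, 8

    image_in = tf.keras.Input(shape=(416, 416, 3))
    camera_id = tf.keras.Input(shape=(), dtype=tf.int32)

    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(image_in)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    cam = tf.keras.layers.Embedding(num_cameras, emb_dim)(camera_id)

    merged = tf.keras.layers.Concatenate()([x, cam])
    out = tf.keras.layers.Dense(10, activation="softmax")(merged)

    model = tf.keras.Model([image_in, camera_id], out)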
