Illustrated FixMatch for semi-supervised learning (amitness.com)
237 points by amitness on April 3, 2020 | 47 comments



The cold hard reality of machine learning is that most useful data isn't readily available to just be collected. Semi-supervised and weakly supervised learning, data augmentation, and multi-task learning are the techniques that will enable machine learning for the majority of companies out there, who need to build datasets and potentially leverage domain expertise to bootstrap intelligent features in their apps. This is great work in that direction for computer vision.

Even the giants are recognizing this fact and are leveraging it to great effect. Some keywords to search for good papers and projects: Overton, Snorkel, Snorkel Metal


Also Flying Squid, another interesting project from Stanford - http://hazyresearch.stanford.edu/flyingsquid


Great summary! Reminds me a lot of Leon Bottou's work on using deep learning to learn causal invariant representations. (Video: https://www.youtube.com/watch?v=lbZNQt0Q5HA)

We can view the augmentations of the image as "interventions" forcing the model to learn an invariant representation of the image.

Although the blog post did not frame it as this type of problem (not sure if the paper did), I think it can definitely be seen as such and is really promising.


Interesting, thank you for sharing that. It reminds me of an approach called "PIRL" from Facebook AI. They framed the problem as learning invariant representations. You might find it interesting.

https://amitness.com/2020/03/illustrated-pirl/


I wish all papers were structured this way, by default.

That is, plenty of good diagrams, clear explanations and intuitions, no unnecessary mathiness.


This is a blog not a paper, it seems you wouldn't like the source material: https://arxiv.org/pdf/2001.07685.pdf

But you are correct! This way of presenting your work is much nicer than what ends up in the paper. The blog post is a good opportunity to show off the results at a higher level. However, the "mathy" paper is still important so that other experts in the field can understand the details of the technique.


Hi,

Wanted to clarify that this is a summary article of the paper. I wrote it to help out people who might not have the mathematical rigor and research background to work through research papers but would benefit from an intuitive explanation.

The actual paper is available here: https://arxiv.org/abs/2001.07685


I would argue that folks like you translating the heavy science into comprehensible ideas to those less deep into the field are doing just as much to advance science as the authors of these papers.

Seriously, this is fantastic work and I cannot compliment you enough on it.


Thank you. It's very encouraging to hear that.


Speaking of diagrams, I once read a short (probably no more than 3-4 pages) paper with one theorem and one diagram. The diagram was essential for me to understand the proof. The problem was the diagram was a diagram of the proof.


I wonder if a two-step process could work better than this: first train a variational autoencoder (or simply an autoencoder), then use it to train on the labeled samples.

In (1) there is a full example of using the two-step strategy, but it uses more labeled data to obtain 92% accuracy. Could someone try changing the second part to use only ten labels for the classification step and share the results?

(1) https://www.datacamp.com/community/tutorials/autoencoder-cla...

Edit: I found a deeper analysis in (2); in short, for CIFAR-10 the VAE semi-supervised learning approach gives poor results, but the author did not use augmentation!

(2) http://bjlkeng.github.io/posts/semi-supervised-learning-with...


Yeah, authors have tried mixing the strategy you described (self-supervised learning) with semi-supervised tasks.

The basic idea is to learn generic image representations without manual labeling and then fine-tune them on your small labeled dataset. These are relevant articles I have written on it: https://amitness.com/2020/02/illustrated-self-supervised-lea...

https://amitness.com/2020/03/illustrated-simclr/


I wish there was a way to augment data as easily for free text, and other business data. I always see these few-shot learning papers for images, I suspect because it's easy to augment image datasets and because image-recognition is interesting to laypeople. The vast majority of data we deal with in business is text/numerical which is much harder to use in these approaches.


Agree with you on this. For text data, there was a paper called "UDA" (https://arxiv.org/abs/1904.12848) that did some work in this direction.

They augmented text using back-translation. The basic idea is that you take text in English, translate it to some other language, say French, and then translate the French text back to English. Usually, you get back an English sentence that is different from the original but has the same meaning. Another augmentation they use is to randomly replace stopwords/low TF-IDF words (intuitively, very frequent words like a, an, the) with random words.
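
For concreteness, back-translation can be sketched roughly like this (my own sketch, not UDA's actual code; it assumes the Hugging Face transformers library with the Helsinki-NLP MarianMT models, but any en<->fr translation system would do):

    # Back-translation sketch (assumes: pip install transformers sentencepiece)
    from transformers import pipeline

    en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
    fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

    def back_translate(sentence):
        # English -> French -> English usually yields a paraphrase
        french = en_to_fr(sentence)[0]["translation_text"]
        return fr_to_en(french)[0]["translation_text"]

    print(back_translate("The quick brown fox jumps over the lazy dog."))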

You will find implementations of UDA on GitHub; you can try those out.

I am learning these existing image semi-supervised techniques right now, and the plan is to research how we can transfer those ideas to text data. Let's see how it goes.


Haha that's how we used to generate blog spam content and comments :p


What are other techniques you use to generate spam? Maybe the research community can learn from you guys


Just replace a random subset of words with their nearest neighbors computed using a fancy word embedding model like BERT or GPT-2.

If you cannot do that due to them not including a vocabulary file, you're stuck using something like word2vec/FastText which is fine but not ideal if you're looking for grammatical correctness...


Yeah, since BERT has been trained specifically on the task of predicting randomly masked words, the resulting augmented sentences should sound more natural than those from something like Word2Vec.
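
As a rough sketch of that idea (my own, assuming the Hugging Face transformers fill-mask pipeline with bert-base-uncased):

    import random
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    def augment(sentence):
        # Mask one random word and let BERT propose a replacement
        words = sentence.split()
        i = random.randrange(len(words))
        words[i] = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
        candidates = fill_mask(" ".join(words))
        return candidates[0]["sequence"]  # highest-scoring completion

    print(augment("the movie was surprisingly good"))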


I don't know much about ML/Deep-Learning and I have a burning question:

Say we have 10 images as a starting point. Then we create 10,000 images from those 10 by adding noise, applying filters, flipping them, skewing them, distorting them, etc. Isn't the underlying information the same (by some formal definition, e.g. Shannon entropy)? Would that actually improve a neural network?

I've always wondered. Is it possible to generate infinite data and get almost perfect neural network accuracy?


> Is it possible to generate infinite data and get almost perfect neural network accuracy?

Basic answer is no, but the reason is kind of interesting.

Imagine that the input to the model is a list of facts, initially the facts are just:

   * Image 1 has class A
   * Image 2 has class B
   * etc...
The idea with data augmentation, in a roundabout way, is to add other facts:

   * Flipping the image does not change its class
   * Translating the image does not change its class
   * Adding a small amount of noise to the image does
     not change its class
   * etc...
It is tricky to express those facts directly as inputs to the model, but it is easy to generate new images based on them, which the model can learn from. It would likely be more efficient if those facts could be expressed directly, though.

So, by generating more data the model can progressively learn those "class invariant transformations", but the model would only reach perfect accuracy if all class invariant transformations were taught.

Another way to "teach" a model these rules is to embed them into the structure of the model itself, e.g. the idea behind convolutional neural networks is to build translation invariance into the model, so that it doesn't need to be taught that from large batches of translated images.
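
To make the "facts" above concrete, a typical augmentation pipeline just bakes each invariance into a random transform. Here is a sketch using torchvision (my choice of library; the comment above isn't tied to any particular one):

    import torch
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(),                           # flipping keeps the class
        transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),    # small shifts keep the class
        transforms.ToTensor(),
        transforms.Lambda(lambda x: x + 0.05 * torch.randn_like(x))  # a little noise keeps the class
    ])
    # Applying `augment` to a PIL image yields a new tensor that should
    # still carry the original label.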


This is quite common, it's often called data augmentation.

For example, most CNNs aren't invariant to skew, distortions, rotations, or even zoom level. So to train a neural net to recognize both 8x8 pixel birds and 10x10 pixel birds, you need to add images of both zoom levels.

Of course, this is a weakness, and there is a lot of research to try to rectify this, like Hinton's capsule networks.

As for adding noise: in some cases it is used as regularization to make the model robust to noise; in others, such as GANs, models are trained to learn the difference between the generated, higher-entropy images and the true images, which refines the model.

But as the sibling noted, and you mention, the underlying data is somewhat the same, so yes, you do need a lot of diversity... but in practice these techniques can be helpful.


As others have noted, this is data augmentation, and it's incredibly useful to increase variation in training data to help decrease overfitting.

It's not a silver bullet. It won't capture the natural variations that happen in the real world.

But new forms of augmentation (like OP) are helping us get closer.

For example, MixMatch mixes pairs of images (and their labels) from across the training set [1]. In object detection, bounding-box-only augmentations are improving models by introducing variation [2].

And an anecdote: I work on https://roboflow.ai , and we've seen customers get production-ready results from datasets of fewer than 20 images using techniques like these.

[1] https://arxiv.org/abs/1905.02249 [2] https://arxiv.org/pdf/1906.11172.pdf
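
For reference, the image-mixing step that MixMatch builds on (MixUp) is tiny. A sketch, assuming PyTorch tensors for a batch of images x and one-hot labels y:

    import torch

    def mixup(x, y, alpha=0.75):
        # Sample a mixing coefficient; MixMatch keeps the larger weight
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        lam = max(lam, 1 - lam)
        perm = torch.randperm(x.size(0))    # pair each example with a random other one
        x_mix = lam * x + (1 - lam) * x[perm]
        y_mix = lam * y + (1 - lam) * y[perm]
        return x_mix, y_mix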


Your big problem with 10 images is going to be overfitting. By modifying an image and training on that too, you're effectively teaching the network that that sort of modification shouldn't change the label. It learns a kind of invariant. That invariant isn't the same as actually seeing the dog from another angle, but it's better than nothing.


Data augmentation will help prevent your model from overfitting a bit but the amount of useful information you get from naively augmented data will reach diminishing returns at some point.

Data augmentation alone (e.g. rotations / shifts / crops / color perturbations / cutout... of a single photo of a husky dog) will never yield the added information that is contained in new pictures showing subtle variations of the phenomenon you are trying to model (e.g. a new photo of a Dalmatian if you have no Dalmatians in your original training set).


For a standard convolutional net, the low-entropy formulation of, for instance, rotation is not immediately accessible, which makes rotation a viable data augmentation and regularization strategy. Some designs try to account for natural symmetries by incorporating the related transformations as priors in the architecture.


This depends on how well those 10 images represent the distribution of data for your actual task. With only 10 samples, that's highly unlikely.

What you are talking about is data augmentation, a strategy we can use to expand our training dataset synthetically, mostly in a bid to prevent over-fitting.


This is already done. It's called data augmentation and is extremely helpful in computer vision


How do we generate more "information" from limited given information? Doesn't that break some law of information theory?


With data augmentation, we're effectively injecting additional information about what sorts of transformations of the data the model should be insensitive to. The additional information comes from our (hopefully) well-informed human decisions about how to augment the data. By doing this, we can reduce the tendency for the model to pick up dependencies on patterns that are useful in the context of the (very small) training dataset, but which don't work well on new data that isn't in the training set.


it's less "generating more information" and more "presenting the same information in new ways". A more ideal model wouldn't need augmented data, but this is what works well with current architectures. It may be that practical constraints mean we never move away from augmentation, just as we'll never move towards single-layer neural nets, even though theoretically they can fit any model.


Short answer is no, certainly to the "perfect" part.

The core problem in ML is generalization; simply put - how well does your approach work with new data it hasn't seen before. Think of it this way: there is a large set of all the potential inputs you could see, and you only get to see a small subset when you are training; what do you do so your general performance is best? Which, of course, you can't actually know, but you can try to estimate.

There are two issues that can give you a lot of trouble here. The first is overfitting (you'll do much better on the training set than "in real life"), the second is bias in your training samples. Data augmentation (what you are talking about) is one approach to reduce parts of the former effect, and done correctly it can help.

Take a simple example: imagine we were trying to recognize simple geometric shapes in images of a page, where you want to find triangles, rectangles, ellipses, etc. I only give you a small set of images, say tens of shapes total.

Now you suspect that "in the wild" you can have triangles at all sorts of rotations, and sizes, but I've only given you a few examples. So you want your algorithm to learn the shapes, but not the sizes or orientations. If you just train on these, it may not recognize a triangle that is just 2x as big as any it has seen, or rotated 20 degrees left from one it has seen, etc.

One approach would be to try to find a rotation- and/or scale-invariant representation for your inputs: if you "know" that shouldn't matter, you've now removed it from the problem. This can be hard or even mathematically impossible, depending on the problem space (e.g. there is no rotation- and scale-invariant manifold for photographic images). So another way you can approach it is empirically: take the examples I gave you and generate new examples at different poses and scales. You feed these into your training and should get a much more robust result, one that doesn't hew too closely to the training set (i.e. less overfitting).

So this sounds great, right? What could go wrong? There are a few issues. One is you are now enforcing things outside what you learn from the data, so if you are wrong you will make things worse.

More subtly, when you do this you can tend to amplify any of the sampling biases you had originally. Imagine that I never gave you an equilateral triangle in the training set. It's quite plausible that by generating millions of inputs from a few examples, this category gets pushed closer to something symmetric, like a circle, say.

Another issue that can be subtle is that the manipulations you are doing for data augmentation can easily introduce new artifacts into the data that you don't see, and your training can pick those up. Consider, for example, rotations of these shapes. I told you we were doing this from images, i.e. discretely sampled grids. This means that, other than certain symmetric rotations and flips, you can't rotate without resampling, and you can't resample without smoothing. So if you take a dozen or so "crisp" examples and turn them into tens of thousands of "smoothed" examples, what exactly are you teaching your model? I'm also waving my hands here about how you are extracting "shapes" from "background" and, in a NN context, what your inputs actually look like... but you can introduce issues there also.

There are lots of trade-offs here. It's a useful technique, but unsurprisingly it isn't a silver bullet.


I had a read through this and I couldn't really tell if there was something novel here?

I understand that perturbing and generating new examples from labelled examples is a pretty normal part of the process when you only have a limited number of examples available.


The novelty is in applying two different perturbations (a weak and a strong augmentation) to the available unlabeled images and using them as part of training. This is different from what you are describing, which is applying augmentations to labeled images to increase dataset size.


My immediate question was "how do you use unlabeled images for training?" But then I decided to read the paper :) The answer is:

Two different perturbations of the same image should get the same predicted label from the model, even if it doesn't know what the correct label is. That information can be used in training.


What if the model's prediction is wrong with high confidence? What if the cat is labeled as a dog for both perturbations? Then wouldn't the system train against the wrong label?


Nope, because of the way it works. In the beginning, when the model is being trained on the labeled data, it will make many mistakes, so its confidence for either cat or dog will be low. Hence, in that case the unlabeled data are not used at all.

As training progresses, the model becomes better on the labeled data, and it can start predicting with high confidence on unlabeled images that are trivial, similar-looking, or from the same distribution as the labeled data. So unlabeled images gradually start being used as part of training; as training progresses, more and more unlabeled data are added.

The mathematics of the combined loss function and the curriculum learning part of the post cover this.
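
In code, the unlabeled part of the objective looks roughly like this (a PyTorch-style sketch, not the authors' implementation; `weak_augment` and `strong_augment` are placeholders for the flip/shift and RandAugment/CTAugment steps from the paper, and 0.95 is the CIFAR-10 confidence threshold):

    import torch
    import torch.nn.functional as F

    def weak_augment(x):    # placeholder: random flip + small shift in the paper
        return torch.flip(x, dims=[-1])

    def strong_augment(x):  # placeholder: RandAugment / CTAugment in the paper
        return x + 0.1 * torch.randn_like(x)

    def unlabeled_loss(model, batch, tau=0.95):
        with torch.no_grad():
            # Pseudo-label from the weakly augmented view
            probs = F.softmax(model(weak_augment(batch)), dim=-1)
            confidence, pseudo_labels = probs.max(dim=-1)
            mask = (confidence >= tau).float()  # keep only confident predictions

        # The strongly augmented view must match the pseudo-label
        strong_logits = model(strong_augment(batch))
        per_example = F.cross_entropy(strong_logits, pseudo_labels, reduction="none")
        return (mask * per_example).mean()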


It is not the same thing, but it kind of reminds me of my naive and obvious idea (meaning something that came up while drinking beer) of generating a bunch of variations of your labeled data in cases where you do not have enough.

Let's say you only have one image of a dog; you generate a bunch of color variations, sharpness adjustments, flips, transforms, etc. Voila, you have 256 images of the same dog.

EDIT: I noticed that this is definitely a common idea as others have already pointed out.


I am not sure how this article got ranked so high. I am suspicious of reading articles like this written by non-experts. I would prefer to go to authentic sources and read the original paper. Most of the time, the information in these articles is misleading or wrong.


Instead of speaking in generalities, can you point out what is wrong in the posted article?


The title is fraudulent. It reports 85% accuracy, but inside it is something else: "FixMatch is a recent semi-supervised approach by Sohn et al. from Google Brain that improved the state of the art in semi-supervised learning(SSL). It is a simpler combination of previous methods such as UDA and ReMixMatch. In this post, we will understand the concept of FixMatch and also see it got 78% median accuracy and 84% maximum accuracy on CIFAR-10 with just 10 labeled images."


We should flag these fraudulent articles; I am not sure the author has any credibility.


Title is (slightly) wrong.

As the first paragraph says: "In this post, we will understand the concept of FixMatch and also see it got 78% accuracy on CIFAR-10 with just 10 images."

Reporting the best performance on a method that deliberately uses just a small subset of the data is shady as heck.


Agreed. Also, this model fully uses the other images, just not the way traditional supervised learning would. "With just 10 labels" would be more accurate. Impressive results, but this isn't some hyper-convergence technique that somehow trains on only ten images.


This depends a lot on the application.

It seems like a big win for images and other domains where getting data is cheap but labelling it is expensive. Less great for (say) drug discovery, where running the experiments to generate the data points is the bottleneck.


I agree it's pretty sensationalistic, and I almost ignored it for that reason. But it turns out that it's actually well worth a read if you can get past that one flaw.


I did read it--that's how I noticed the number was wrong :-)


Ok, we've reverted the title to that of the page, in keeping with the site guidelines (https://news.ycombinator.com/newsguidelines.html). When changing titles, the idea is to make them less baity or misleading, not more!

(Submitted title was "Semi-Supervised Learning: 85% accuracy on CIFAR-10 with only 10 labeled images")



