
Illustrated FixMatch for semi-supervised learning - amitness
https://amitness.com/2020/03/fixmatch-semi-supervised/
======
hadsed
The cold hard reality of machine learning is that most useful data isn't
readily available to just be collected. Semi-supervised and weakly supervised
learning, data augmentation, multi-task learning: these are the things that
will enable machine learning for the majority of companies out there who need
to build datasets and potentially leverage domain expertise somehow to
bootstrap intelligent features in their apps. This is great work in that
direction for computer vision.

Even the giants are recognizing this fact and are leveraging it to great
effect. Some keywords to search for good papers and projects: Overton,
Snorkel, Snorkel Metal

~~~
najarvg
Also Flying Squid, another interesting project from Stanford -
[http://hazyresearch.stanford.edu/flyingsquid](http://hazyresearch.stanford.edu/flyingsquid)

------
jonpon
Great summary! Reminds me a lot of Leon Bottou's work on using deep
learning to learn causally invariant representations. (Video:
[https://www.youtube.com/watch?v=lbZNQt0Q5HA](https://www.youtube.com/watch?v=lbZNQt0Q5HA))

We can view the augmentations of the image as "interventions" forcing the
model to learn an invariant representation of the image.

Although the blog post did not frame it as this type of problem (not sure if
the paper did), I think it can definitely be seen as such and is really
promising.

~~~
amitness
Interesting, thank you for sharing that. It reminds me of an approach called
"PIRL" from Facebook AI. They framed the problem as learning invariant
representations. You might find it interesting.

[https://amitness.com/2020/03/illustrated-pirl/](https://amitness.com/2020/03/illustrated-pirl/)

------
antipaul
I wish all papers were structured this way, by default.

That is, plenty of good diagrams, clear explanations and intuitions, no
unnecessary mathiness.

~~~
amitness
Hi,

Wanted to clarify that this is a summary article of the paper. I wrote it to
help out people who might not have the mathematical rigor and research
background to understand research papers but would benefit from an intuitive
explanation.

The actual paper is available here:
[https://arxiv.org/abs/2001.07685](https://arxiv.org/abs/2001.07685)

~~~
mabbo
I would argue that folks like you, translating the heavy science into ideas
comprehensible to those less deep in the field, are doing just as much to
advance science as the authors of these papers.

Seriously, this is fantastic work and I cannot compliment you enough on it.

~~~
amitness
Thank you. It's very encouraging to hear that.

------
manthideaal
I wonder if a two-step process could work better than this: first train a
variational autoencoder (or simply an autoencoder), then use it to train on
the labeled samples.

In (1) there is a full example of this two-step strategy, but it uses more
labeled data to obtain 92% accuracy. Could someone try changing the second
part to use only ten labels for the classification step and share the results?

(1) [https://www.datacamp.com/community/tutorials/autoencoder-classifier-python](https://www.datacamp.com/community/tutorials/autoencoder-classifier-python)

Edited: I found a deeper analysis in (2). In short, for CIFAR-10 the VAE
semi-supervised learning approach gives poor results, but note that the
author did not use augmentation!

(2) [http://bjlkeng.github.io/posts/semi-supervised-learning-with-variational-autoencoders/](http://bjlkeng.github.io/posts/semi-supervised-learning-with-variational-autoencoders/)
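
For the curious, here is a toy sketch of the two-step idea on synthetic data, not the VAE from the tutorials: since the optimal linear autoencoder recovers the PCA subspace, step one uses PCA codes as the unsupervised representation, and step two fits a nearest-centroid classifier on just 10 labels. All data, dimensions, and the centroid classifier are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes in 20-D; only 10 samples will get labels.
X0 = rng.normal(loc=-1.0, size=(250, 20))
X1 = rng.normal(loc=+1.0, size=(250, 20))
X = np.vstack([X0, X1])  # 500 "unlabeled" images

# Step 1 (unsupervised): the optimal linear autoencoder is PCA,
# so use the top principal directions as the learned representation.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
encode = lambda x: (x - mean) @ Vt[:5].T  # 20-D input -> 5-D code

# Step 2 (supervised): nearest class centroid in code space,
# fit on only 10 labeled samples (5 per class).
lab_X = np.vstack([X0[:5], X1[:5]])
lab_y = np.array([0] * 5 + [1] * 5)
centroids = np.stack([encode(lab_X[lab_y == c]).mean(axis=0) for c in (0, 1)])

def predict(x):
    # Assign each code to its nearest class centroid.
    d = np.linalg.norm(encode(x)[:, None, :] - centroids, axis=2)
    return d.argmin(axis=1)

test = np.vstack([X0[5:], X1[5:]])
truth = np.array([0] * 245 + [1] * 245)
print((predict(test) == truth).mean())  # high accuracy on this easy toy data
```

On real images you would swap PCA for a (variational) autoencoder and the centroid rule for a small classifier head, but the division of labor is the same.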

~~~
amitness
Yeah, the authors have tried mixing the strategy you described
(self-supervised learning) with semi-supervised tasks.

The basic idea is to learn generic image representations without manual
labeling and then finetune them on your small dataset. These are relevant
articles I have written on it:

[https://amitness.com/2020/02/illustrated-self-supervised-learning/](https://amitness.com/2020/02/illustrated-self-supervised-learning/)

[https://amitness.com/2020/03/illustrated-simclr/](https://amitness.com/2020/03/illustrated-simclr/)

------
starpilot
I wish there were a way to augment data as easily for free text and other
business data. I always see these few-shot learning papers for images, I
suspect because it's easy to augment image datasets and because image
recognition is interesting to laypeople. The vast majority of data we deal
with in business is text/numerical, which is much harder to use in these
approaches.

~~~
amitness
Agree with you on this. For text data, there was a paper called
"UDA" ([https://arxiv.org/abs/1904.12848](https://arxiv.org/abs/1904.12848))
that did some work in this direction.

They augmented text using backtranslation. The basic idea is that you take
text in English, translate it to some other language, say French, and then
translate the French text back to English. Usually, you get back an English
sentence that is different from the original English sentence but has the
same meaning. Another approach they use to augment is to randomly replace
stopwords/low tf-idf words (intuitively, very frequent words like a, an,
the) with random words.

You can find implementations of UDA on GitHub and try them out.
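
A rough toy sketch of the round-trip idea. The dictionaries here are hypothetical stand-ins for a real machine-translation model, which is what UDA actually uses; the point is just that the English -> French -> English trip yields a paraphrase, not the original sentence.

```python
# Toy backtranslation: English -> "French" -> English via word lookup.
# Real backtranslation would call a trained translation model instead.
EN_TO_FR = {"the": "le", "cat": "chat", "sits": "assis", "quietly": "tranquillement"}
FR_TO_EN = {"le": "the", "chat": "cat", "assis": "sits", "tranquillement": "calmly"}  # imperfect inverse, on purpose

def translate(words, table):
    # Word-by-word lookup; unknown words pass through unchanged.
    return [table.get(w, w) for w in words]

def backtranslate(sentence):
    words = sentence.lower().split()
    french = translate(words, EN_TO_FR)
    back = translate(french, FR_TO_EN)
    return " ".join(back)

print(backtranslate("The cat sits quietly"))  # "the cat sits calmly"
```

Because the reverse dictionary is not a perfect inverse, the augmented sentence differs from the original while keeping the meaning, which is exactly the property the augmentation relies on.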

I am learning these existing image semi-supervised techniques right now, and
the plan is to do research on how we can transfer those ideas to text data.
Let's see how it goes.

~~~
codegladiator
Haha that's how we used to generate blog spam content and comments :p

~~~
amitness
What are the other techniques you use to generate spam? Maybe the research
community can learn from you guys.

------
fermienrico
I don't know much about ML/Deep-Learning and I have a burning question:

Say we have 10 images as a starting point. Then we create 10,000 images from
those 10 by adding noise, applying filters, flipping, skewing, and distorting
them. Isn't the underlying data the same (in the sense of Shannon information
entropy)? Would that actually improve neural networks?

I've always wondered. Is it possible to generate infinite data and get almost
perfect neural network accuracy?

~~~
Der_Einzige
This is already done. It's called data augmentation and is extremely helpful
in computer vision.

~~~
fermienrico
How do we generate more "information" from a limited given information?
Doesn't that break some law of information theory?

~~~
psb217
With data augmentation, we're effectively injecting additional information
about what sorts of transformations of the data the model should be
insensitive to. The additional information comes from our (hopefully) well-
informed human decisions about how to augment the data. By doing this, we can
reduce the tendency for the model to pick up dependencies on patterns that are
useful in the context of the (very small) training dataset, but which don't
work well on new data that isn't in the training set.
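
A minimal sketch of that injection in plain NumPy, with horizontal flips and Gaussian noise as the (assumed, human-chosen) invariances, turning 10 toy "images" into 10,000 training examples:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """Return one randomly perturbed copy of an image (H x W array in [0, 1])."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip: we assert left/right doesn't matter
    out = out + rng.normal(0, 0.05, out.shape)  # noise: small pixel changes
    return np.clip(out, 0.0, 1.0)               # don't matter either

# 10 "real" images -> 10,000 augmented training examples
images = [rng.random((8, 8)) for _ in range(10)]
augmented = [augment(images[i % 10], rng) for i in range(10_000)]
print(len(augmented))  # 10000
```

The new examples carry no new information about the world, only about which transformations we have declared irrelevant, which is the point made above.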

------
edsykes
I had a read through this and couldn't really tell whether there was
something novel here.

I understand that perturbing and generating new examples from labelled
examples is a pretty normal part of the process when you only have a limited
number of examples available.

~~~
amitness
The novelty is in applying 2 perturbations to the available _unlabeled
images_ and using them as part of training. This is different from what you
are describing, which is applying augmentations to labeled images to increase
dataset size.

~~~
daenz
My immediate question was "how do you use unlabeled images for training?" But
then I decided to read the paper :) The answer is:

Two different perturbations to the same image should have the same predicted
label by the model, even if it doesn't know what the correct label is. That
information can be used in the training.

~~~
computerex
What if the model's prediction is wrong with high confidence? What if the cat
is labeled as a dog for both perturbations? Then wouldn't the system train
against the wrong label?

~~~
amitness
Nope, because of the way it works. In the beginning, when the model is being
trained on the labeled data, it will make many mistakes, so its confidence
for either cat or dog will be low. In that case, the unlabeled data is not
used at all.

As training progresses, the model gets better on the labeled data, and so it
can start predicting with high confidence on unlabeled images that are
trivial, similar-looking, or from the same distribution as the labeled data.
So unlabeled images gradually start being used as part of training, and more
and more unlabeled data gets added.

The mathematics of the combined loss function and the curriculum learning
part of the paper talk about this.
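
A minimal NumPy sketch of that mechanism (not the paper's code; the 0.95 threshold and the toy probabilities are illustrative): pseudo-labels come from the weakly augmented view, but they only contribute to the loss when the model's confidence clears the threshold, so early in training most unlabeled images contribute exactly zero.

```python
import numpy as np

def unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """FixMatch-style loss on a batch of unlabeled images.

    weak_probs, strong_probs: (batch, classes) softmax outputs for the
    weakly and strongly augmented views of the same images.
    """
    confidence = weak_probs.max(axis=1)
    pseudo_labels = weak_probs.argmax(axis=1)
    mask = confidence >= threshold  # low-confidence images are ignored
    # Cross-entropy of the strong view against the pseudo-label.
    ce = -np.log(strong_probs[np.arange(len(pseudo_labels)), pseudo_labels] + 1e-12)
    return (mask * ce).sum() / max(mask.sum(), 1)

# Early in training: unsure predictions -> loss is 0, images unused.
weak = np.array([[0.55, 0.45], [0.6, 0.4]])
strong = np.array([[0.5, 0.5], [0.5, 0.5]])
print(unlabeled_loss(weak, strong))  # 0.0
```

Once some weak-view predictions cross the threshold, the mask opens up for those images, which is the implicit curriculum described above.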

------
sireat
It is not the same thing, but it kind of reminds me of my naive and obvious
(meaning: it came up while drinking beer) idea of generating a bunch of
variations of your labeled data when you do not have enough.

Let's say you only have one image of a dog; you generate a bunch of color
variations, sharpness adjustments, flips, transforms, etc. Voila, you have
256 images of the same dog.

EDIT: I noticed that this is definitely a common idea as others have already
pointed out.

------
master_yoda_1
I am not sure how this article got ranked so high. I am suspicious about
reading these articles written by non-experts. I would prefer to go to
authentic sources and read the original paper. Most of the time, the
information in these articles is misleading and wrong.

~~~
shookness
Instead of speaking in generalities, can you point out what is wrong in the
posted article?

~~~
master_yoda_1
The title is fraudulent. It reports an 85% accuracy gain, but inside it says
something else: "FixMatch is a recent semi-supervised approach by Sohn et
al. from Google Brain that improved the state of the art in semi-supervised
learning(SSL). It is a simpler combination of previous methods such as UDA and
ReMixMatch. In this post, we will understand the concept of FixMatch and also
see it got 78% median accuracy and 84% maximum accuracy on CIFAR-10 with just
10 labeled images."

~~~
master_yoda_1
We should flag these fraudulent articles. I am not sure the author has any
credibility.

------
mattkrause
Title is (slightly) wrong.

As the first paragraph says: "In this post, we will understand the concept of
FixMatch and also see it got 78% accuracy on CIFAR-10 with just 10 images."

Reporting the _best_ performance on a method that deliberately uses just a
small subset of the data is shady as heck.

~~~
colincooke
Agreed. Also, this model does fully use the other images, just not in the way
traditional supervised learning would. "With just 10 labels" would be more
accurate. Impressive results, but this isn't some hyper-convergence technique
that somehow trains on only ten images.

~~~
mattkrause
This depends a lot on the application.

It seems like a big win for images and other stuff where getting images is
cheap, but labelling them is expensive. Less great for (say) drug discovery,
where running the experiments to generate the data points is the bottleneck.

