
Electra: Pre-Training Text Encoders as Discriminators Rather Than Generators - CShorten
https://youtu.be/QWu7j1nb_jI
======
PaulHoule
Great stuff.

Talking with people who work in the intelligence-military-industrial
complex south of the Beltway, I've learned some tricks for
very-high-precision systems. The method of having multiple classifiers,
one out front that handles the easy cases and one in back that handles
the hard cases, is one of the best.

They are doing something quite similar here.

~~~
dr_zoidberg
The ML equivalent of that would be a cascade of classifiers.

For example, the old face detection based on Haar cascades works like
that: weak classifiers in the first tier look at the image, and the
cascade is trained so that any tier returns "false" only when there's
very little chance of a false negative (i.e., there was a face but it
wasn't detected).

Now if a tier says "looks good to me, there might be a face", the
window moves on to another tier that is more thorough. Even so, the
tiers are geared towards "when in doubt, say True and pass it to the
following tier".

A Haar cascade for face detection (it can be trained for object
detection in general too) takes quite a lot of training, and it has
many, many tiers: think ~30 tiers with hundreds of weak classifiers
each (I'm probably off with the numbers, since it's been a long time
since I last used them).

On the plus side, a cascade of weak classifiers can usually run super
fast inference (the classifiers are simple estimators, and you only
work harder when there could be a positive). On the downside, they
usually take a lot of training time (you're training hundreds or
thousands of classifiers!). Another "bad" thing is that, as better
techniques came along, little effort went into improving the training
process and the tooling around it -- if you look for examples of Haar
cascade training with OpenCV, you'll see it's a mess.
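
Inference with a pre-trained cascade is only a few lines, at least. A
minimal sketch, assuming opencv-python's bundled frontal-face model and
a hypothetical image path:

    import cv2

    # one of the pre-trained cascades bundled with opencv-python
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("photo.jpg")  # hypothetical input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # each candidate window runs through the tiers; most are rejected early
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(faces)  # (x, y, w, h) rectangles that survived every tier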

With time, better techniques were developed: HOG-based classifiers,
PCA-based classifiers, neural nets/convnets. With these more complex
models you trade computational cost for flexibility and "quality of
results" (I'd say performance, but that might be confused with
inference or training speed).
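
For comparison, a minimal HOG sketch, assuming opencv-python's built-in
HOG descriptor with its default people detector and a hypothetical
image path:

    import cv2

    # HOG features + a linear SVM per window: heavier than a Haar tier
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    img = cv2.imread("street.jpg")  # hypothetical input image
    rects, weights = hog.detectMultiScale(img, winStride=(8, 8))
    print(rects)  # detected (x, y, w, h) rectangles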

In theory nothing prevents you from having a cascade-like approach with
deep learning. What is usually more convenient is to embed it inside
the architecture (a multi-objective model) so that the weights of the
whole net are updated on every training case, instead of having
isolated classifiers on each tier. That lets the whole network learn
from every case, instead of only training the 2nd, 3rd, ..., n-th tier
on the cases that made it past the previous tiers.
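
A minimal sketch of what that embedded version could look like,
assuming PyTorch; the TwoExitNet architecture and its heads are made up
for illustration:

    import torch
    import torch.nn as nn

    class TwoExitNet(nn.Module):
        def __init__(self, d_in, d_hidden):
            super().__init__()
            self.trunk1 = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
            self.exit1 = nn.Linear(d_hidden, 1)  # cheap first-tier head
            self.trunk2 = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU())
            self.exit2 = nn.Linear(d_hidden, 1)  # deeper, more thorough head

        def forward(self, x):
            h1 = self.trunk1(x)
            h2 = self.trunk2(h1)
            return self.exit1(h1), self.exit2(h2)

    model = TwoExitNet(64, 128)
    loss_fn = nn.BCEWithLogitsLoss()
    x = torch.randn(32, 64)
    y = torch.randint(0, 2, (32, 1)).float()

    # every case updates the whole net: both exits contribute to the loss
    logit1, logit2 = model(x)
    loss = loss_fn(logit1, y) + loss_fn(logit2, y)
    loss.backward()

At inference time you would threshold exit1's probability and only run
trunk2 on the cases it isn't confident about, recovering the cascade's
early-exit speedup.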

Edited: fixed a few typos and tried to be clearer

~~~
gwern
> In theory nothing prevents you from having a cascade-like approach with deep
> learning.

Sure, people do that in DL. There are plenty of 'adaptive' or 'anytime' or
mixture-of-experts architectures for classification. And lots of things take
that cascade approach naturally, like any object detection or localization NN:
you make a bunch of proposals and check those rectangles; the more computation
you have, the more you can check.
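
A toy sketch of that proposal-checking pattern (every name and function
here is made up for illustration):

    def detect(image, propose, score_cheap, verify, budget):
        """Anytime-style detection: a cheap scorer ranks candidate
        rectangles, and an expensive verifier checks as many of
        them as the compute budget allows."""
        proposals = propose(image)  # candidate rectangles
        ranked = sorted(proposals, key=score_cheap, reverse=True)
        # spend whatever compute is left on the most promising ones
        return [box for box in ranked[:budget] if verify(box)]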

~~~
dr_zoidberg
Right, I meant the specific approach of making a cascade "outside the model"
by programming it by hand, as in:

    
    
        def cascade_infer(data):
            clf1 = FirstTierModel(...)
            clf2 = SecondTierModel(...)
            clf3 = ThirdTierModel(...)
            # and then imagine downwards
            if not clf1.infer(data):  # tier 1 rejects -> stop early
                return False
            if not clf2.infer(data):  # tier 2 rejects -> stop early
                return False
            # ... and so on
            return True
    

(written in rough Python-like pseudocode)

Then I went on to how it can be embedded inside the architecture, which
is the better approach.

------
laingc
For those like me who hate videos, here’s a link to OpenReview:
[https://openreview.net/forum?id=r1xMH1BtvB](https://openreview.net/forum?id=r1xMH1BtvB)

