
Google T5 scores 88.9 on SuperGLUE Benchmark, approaching human baseline - alexwg
https://super.gluebenchmark.com/leaderboard
======
hn_throwaway_99
I didn't know anything about SuperGLUE before (turns out it's a benchmark for
language understanding tasks), so I clicked around their site where they show
different examples of the tasks.

One "word in context" task is to look at 2 different sentences that have a
common word and decide if that word means the _same_ thing in both sentences
or _different_ things (more details here:
[https://pilehvar.github.io/wic/](https://pilehvar.github.io/wic/))
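For a concrete sense of the data format, here is a minimal sketch of what a
WiC-style instance and scorer might look like in Python (the field names are
my guess at the layout, not the official schema):

    # Hypothetical WiC-style instance: predict True iff the target word
    # is used in the same sense in both sentences.
    wic_example = {
        "word": "bed",
        "sentence1": "There's a lot of trash on the bed of the river.",
        "sentence2": "I keep a glass of water next to my bed.",
        "label": False,  # river bed vs. furniture: different senses
    }

    def accuracy(predict, examples):
        """predict(word, s1, s2) -> bool; returns the fraction correct."""
        hits = sum(predict(e["word"], e["sentence1"], e["sentence2"]) ==
                   e["label"] for e in examples)
        return hits / len(examples)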

One of their examples, though, didn't make any sense to me:

1\. The pilot managed to _land_ the airplane safely

2\. The enemy _landed_ several of our aircrafts

It says that the word "land" does NOT mean the same thing in those sentences.
I am a native English speaker, and I honestly don't understand what they are
thinking the second sentence means. Shot them down? If so, I have never heard
"landed" used in that context, and it appears neither has Merriam-Webster.
Also, the plural of aircraft is just "aircraft", without the s.

~~~
rladd
My mother got a perfect 800 score on the GRE English test many years ago, when
she wanted to go back to graduate school after her children were grown up
enough (high school/college age).

She told me that the way she got her perfect score was by realizing when the
questions were wrong and thinking of what answer the test creators believed to
be correct.

She had to outguess the test creators and answer the questions wrong -- in the
"right" way.

This seems like a similar situation.

~~~
Huppie
I've had the 'pleasure' of taking some 'Microsoft certifications' at various
companies I worked at in the past and this sounds extremely familiar.

 _" I probably won't ever do it like that and/or there's a syntax error in all
four of the answers... but this is the answer you want to hear. It's wrong,
mind you, but it's what you want to hear."_

~~~
justinclift
Reminds me of the one question I got "wrong" on a DOS test (years ago) at
TAFE.

The question was "How do you delete all files in the current directory?". This
was on DOS 6.22, I think (going from memory).

My answer, "del.", was marked incorrect, because the teacher didn't know
enough about DOS to understand that it's the standard shortcut for "del *.*".
And the teacher refused to even try out the command, let alone fix the
incorrect mark. _sigh_

~~~
jeremyvisser
TAFE anecdote time!

In my TAFE class, I was asked to list two examples of operating systems.

I listed Linux and eComStation. The teacher had never heard of eComStation and
marked me wrong.

Refused to correct my mark even when I proved him right. I'm still bitter
about it a decade later.

~~~
justinclift
Swinburne TAFE as well? ;)

------
6gvONxR4sf7o
One thing to always point out in these cases is that the human baseline isn't
"how well people do at this task," like it's often hyped to be. It's "how
well does a person do, on average, when doing this quickly and repetitively."
The 'quickly and repetitively' part is important because we all make more
boneheaded errors in that scenario. The 'on average' part is important
because the errors the algo makes aren't just fewer than people's, they're
different. The algos often still get certain things wrong that humans almost
never would.

This is really really super great, let's be clear. It's just not up to the
hype "omg super human" usually gets.

~~~
IshKebab
Regarding the _type_ of errors, it seems like the benchmark should be able to
take that into account. That is, get a load of humans to do the task on the
same specific examples; then for each example you know how hard it is and
what the acceptable answers are (I bet a lot of the ground truth is wrong or
ambiguous).

Then you can benchmark your AI but penalise it more heavily for getting things
wrong that are obvious to a human.
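(Sketching what I mean in Python - the weighting function is just my
assumption, not anything SuperGLUE actually does:)

    def human_weighted_score(examples):
        """Score a model, counting examples that humans almost always get
        right more heavily than genuinely ambiguous ones. Each example:
        {"human_acc": float, "model_ok": bool}."""
        total = sum(e["human_acc"] for e in examples)
        right = sum(e["human_acc"] for e in examples if e["model_ok"])
        return right / total if total else 0.0

    # Missing an example 95% of humans get right hurts far more than
    # missing a genuinely ambiguous (55%-human-accuracy) one:
    data = [{"human_acc": 0.95, "model_ok": False},
            {"human_acc": 0.55, "model_ok": True}]
    print(human_weighted_score(data))  # ~0.37 rather than a flat 0.5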

~~~
6gvONxR4sf7o
That would be ideal, if money weren't a factor. Since money is a factor, I
wonder what the tradeoff is between labelling each instance N more times
versus just getting N times more instances labeled.

------
pmoriarty
There was an article[1] posted to HN recently about these benchmarks, and it
was pretty skeptical.

Regarding SuperGLUE specifically, it asked:

 _" Indeed, Bowman and his collaborators recently introduced a test called
SuperGLUE that's specifically designed to be hard for BERT-based systems. So
far, no neural network can beat human performance on it. But even if (or when)
it happens, does it mean that machines can really understand language any
better than before? Or does just it mean that science has gotten better at
teaching machines to the test?"_

[1] - [https://www.quantamagazine.org/machines-beat-humans-on-a-
rea...](https://www.quantamagazine.org/machines-beat-humans-on-a-reading-test-
but-do-they-understand-20191017)

~~~
gradys
This feels hollow. Can't this be said about any benchmark? It seems natural
and proper that as one benchmark becomes saturated, we introduce harder
benchmarks.

I don't think anyone in the field thinks that once we match human performance
on benchmark X, we're officially done. It just means it's time for more
interesting benchmarks.

Over time, if it starts to become difficult to design benchmarks that humans
can outperform machines on, then that will prompt interesting conceptual work
about what exactly the difference between human and machine language
competency is. And then that will lead either to more sophisticated benchmarks
or alternatively gradually more sophisticated and persuasive arguments that
machines really have surpassed us in language competence.

I don't think we're yet at a point where we don't know how to make harder
benchmarks, and if and when we do hit such a point, I'd definitely bet the
result will be a conceptual advance in benchmark design rather than declaring
machine superiority once and for all. At least for the first few rounds of
this cycle.

~~~
not2b
From the Quanta article:

"But instead of concluding that BERT could apparently imbue neural networks
with near-Aristotelian reasoning skills, they suspected a simpler explanation:
that BERT was picking up on superficial patterns in the way the warrants were
phrased. Indeed, after re-analyzing their training data, the authors found
ample evidence of these so-called spurious cues. For example, simply choosing
a warrant with the word “not” in it led to correct answers 61% of the time.
After these patterns were scrubbed from the data, BERT’s score dropped from 77
to 53 — equivalent to random guessing."
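(To make the "spurious cue" concrete: the 61% figure amounts to a classifier
like this - a sketch, with the data layout assumed:)

    def pick_warrant(warrant_a: str, warrant_b: str) -> int:
        """Choose between two candidate warrants using only the word
        "not" - no reading comprehension involved. On the uncleaned
        data, a cue like this reportedly reached ~61% accuracy."""
        a = " not " in f" {warrant_a.lower()} "
        b = " not " in f" {warrant_b.lower()} "
        if a != b:
            return 0 if a else 1
        return 0  # cue uninformative: arbitrary guess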

~~~
nl
This is true, and absolutely a weakness of these tests.

 _However_ they don't publish how well a human performs on the dataset without
"not" in it.

They do initially note that _Even human beings don’t do particularly well on
this task without practice_

I've looked at the warrant task. It's pretty tricky! I'd bet real money that,
on the examples without "not", untrained humans would score much, much lower
than the 80% correct rate they get on the full set. I don't think it would be
as low as the 53% BERT gets, but it would drop significantly.

I find the HANS analysis[1] much more compelling, but again I'd note that
humans suffer on this dataset too (although again - not as badly as models
do).

[1]
[https://www.aclweb.org/anthology/P19-1334.pdf](https://www.aclweb.org/anthology/P19-1334.pdf)

------
RcouF1uZ4gsC
I think classifying this as human level is misleading.

Look at the sub-scores on the page. One score that looks very different from
humans is AX-b.

The SuperGLUE paper provides more context about AX-b:

[https://arxiv.org/pdf/1905.00537.pdf](https://arxiv.org/pdf/1905.00537.pdf)

AX-b "is the broad-coverage diagnostic task, scored using Matthews’
correlation (MCC). "

This is how the paper describes this test

" Analyzing Linguistic and World Knowledge in Models GLUE includes an expert-
constructed, diagnostic dataset that automatically tests models for a broad
range of linguistic, commonsense, and world knowledge. Each example in this
broad-coverage diagnostic is a sentence pair labeled with a three-way
entailment relation (entailment, neutral, or contradiction) and tagged with
labels that indicate the phenomena that characterize the relationship between
the two sentences. Submissions to the GLUE leaderboard are required to include
predictions from the submission’s MultiNLI classifier on the diagnostic
dataset, and analyses of the results were shown alongside the main
leaderboard. Since this broad-coverage diagnostic task has proved difficult
for top models, we retain it in SuperGLUE. However, since MultiNLI is not part
of SuperGLUE, we collapse contradiction and neutral into a single
not_entailment label, and request that submissions include predictions on the
resulting set from the model used for the RTE task. We collect non-expert
annotations to estimate human performance, following the same procedure we use
for the main benchmark tasks (Section 5.2). We estimate an accuracy of 88% and
a Matthew’s correlation coefficient (MCC, the two-class variant of the R3
metric used in GLUE) of 0.77. "
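(For reference, MCC can be computed straight from predictions, e.g. with
scikit-learn; the labels below are toy values for illustration only:)

    from sklearn.metrics import matthews_corrcoef

    # MCC runs from -1 (total disagreement) through 0 (chance) to +1.
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
    print(matthews_corrcoef(y_true, y_pred))  # 0.5; humans get ~0.77 on AX-b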

If you look at the scores, humans are estimated to score 0.77. Google T5
scores -0.4 on the test.

How did T5 get such a high score if it scored so abysmally on the AX-b test?

The AX scores are not included in the total score.

From the paper: "The Avg column is the overall benchmark score on non-AX∗
tasks."

If the AX scores were included, the gap between humans and machines would be
bigger than the current score indicates.

~~~
craffel
Hi, one of the paper's authors here. We didn't submit our model's predictions
for the AX-b task yet, we just copied over the predictions from the example
submission. We will submit predictions for AX-b in the next few days.

~~~
mannykannot
RcouF1uZ4gsC makes a compelling case that the results on this test are
potentially a significant caveat to the headline results, and to the claims
of achieving a near-human level of performance. If so, why make such claims
before you have these results? Or at least mention this caveat at the points
where you make the claim, such as in the abstract.

~~~
craffel
To be clear, here is the claim we make in the paper (we did not write the
title of this post to HN):

> For SuperGLUE, we improved upon the state-of-the-art by a large margin (from
> an average score of 84.6 [Liu et al., 2019c] to 88.9). SuperGLUE was
> designed to comprise of tasks that were “beyond the scope of current state-
> of-the-art systems, but solvable by most college-educated English speakers”
> [Wang et al., 2019b]. We nearly match the human performance of 89.8 [Wang et
> al., 2019b]. Interestingly, on the reading comprehension tasks (MultiRC and
> ReCoRD) we exceed human performance by a large margin, suggesting the
> evaluation metrics used for these tasks may be biased towards machine-made
> predictions. On the other hand, humans achieve 100% accuracy on both COPA
> and WSC, which is significantly better than our model’s performance. This
> suggests that there remain linguistic tasks that are hard for our model to
> perfect, particularly in the low-resource setting.

I'm not sure why the SuperGLUE/GLUE benchmark was designed to omit the AX-*
scores from the benchmark score. It may be that they have no corresponding
training set.

~~~
mannykannot
My mistake - I had overlooked that the AX-* scores are expressly omitted from
these benchmarks. Perhaps, then, they could provide additional headroom for
further research?

Regardless of the status of the AX-* tests, I am very impressed by your
results on the SuperGLUE benchmark.

------
throwaway_bad
Possibly dumb question: How do you ensure there's no data leakage when
benchmarking transfer learning techniques? Is that even a problem anymore when
the whole point is to learn "common sense" knowledge?

For example their “Colossal Clean Crawled Corpus” (C4), a dataset consisting
of hundreds of gigabytes of clean English text scraped from the web, might
contain much of the same information as the benchmark datasets, which I
presume is also scraped from the web.

~~~
craffel
Hi, one of the paper authors here. Indeed this is a good question. A couple of
comments:

\- Common Crawl overall is a sparse web dump; it is unlikely that the month
we used includes any of the data that are in any of the test sets.

\- In order for the data to be useful to our model, it would have to be in
the correct preprocessed format ("mnli: hypothesis: ... premise: ...") with
the label in a format our model could extract meaning from (see the sketch
after this list). We introduced this preprocessing format, so I don't believe
this would ever happen.

\- Further, most of these datasets live in .zip files. The Common Crawl dump
doesn't unzip zip files.

\- C4 is so large that our model sees each example (corresponding to a block
of text from a website) roughly once, ever, over the entire course of
training. Big neural nets trained with SGD are unlikely to memorize something
if they only see it once over the course of one million training steps.
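For anyone curious, here is a rough sketch of that serialization (the values
are illustrative; the exact task prefixes are in the paper):

    # Rough sketch of the text-to-text format (illustrative values).
    def serialize_nli(hypothesis, premise, label):
        source = f"mnli: hypothesis: {hypothesis} premise: {premise}"
        target = label  # e.g. "entailment" / "neutral" / "contradiction"
        return source, target

    src, tgt = serialize_nli("The plane landed safely.",
                             "The airplane touched down without incident.",
                             "entailment")
    # A crawled page would need to contain exactly this source/target
    # pair, label included, before pre-training could leak a test answer.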

~~~
throwaway_bad
> Big neural nets trained with SGD are unlikely to memorize something if they
> only see it once over the course of one million training steps

I am not so sure about that. Have you seen this thread:
[https://www.reddit.com/r/MachineLearning/comments/dfky70/dis...](https://www.reddit.com/r/MachineLearning/comments/dfky70/discussion_exfiltrating_copyright_notices_news/)

Apparently lots of sentence fragments were memorized in GPT-2 (including
real-world URLs, entire conversations, usernames/emails and other PII).

~~~
craffel
It actually can be more pernicious than that:
[https://arxiv.org/abs/1802.08232](https://arxiv.org/abs/1802.08232)

However, note that the dataset used to train GPT-2 is about 20x smaller than
C4. I'm not 100% sure how many times the training set was repeated over the
course of training for GPT-2, but it was likely many times. I stand by my
statement (that memorization is _unlikely_ with SGD and no repetition of
training data), but I would be happy to be proven otherwise.

------
Al-Khwarizmi
This surprised me a bit, on the creation of the corpus they use for training:

"We removed any page that contained any word on the “List of Dirty, Naughty,
Obscene or Otherwise Bad Words”."

I don't understand this decision. This list contains words that can be used in
a perfectly objective sense, like "anus", "bastard", "erotic", "eunuch",
"fecal", etc.

I can understand that they want to avoid websites full of expletives and with
no useful content, but outright excluding any website with even one occurrence
of such words sounds too radical. If we ask this model a text comprehension
question about a legitimized bastard who inherited the throne, or about fecal
transplants, I suppose it would easily fail. Strange way of limiting such a
powerful model.
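(The filter they describe amounts to something like this - a minimal sketch,
with the blocklist abbreviated to the examples above:)

    # Minimal sketch of the page filter described in the paper.
    BAD_WORDS = {"anus", "bastard", "erotic", "eunuch", "fecal"}  # abbreviated

    def keep_page(text: str) -> bool:
        """Drop the whole page if ANY blocklisted word appears even once."""
        return set(text.lower().split()).isdisjoint(BAD_WORDS)

    print(keep_page("A clinical overview of fecal transplants"))  # False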

~~~
Veedrac
They say they removed pages, not websites. Having false positives isn't a
problem when you're still left with 750GB of data—quality matters more than
slightly higher quantity at that point.

~~~
Al-Khwarizmi
Sorry, I was thinking about pages even though I said websites. Native language
interference (typically, we use the same term for pages and websites in my
language).

Anyway, my point is not a matter of quantity. The way they're doing it, they
have 750 GB of data, but they have exactly zero data that talks about
bastards, fecal transplants, etc. So they may have a hard time answering
questions about those specific subjects.

------
nopinsight
As someone working in the field, I congratulate the authors on an excellent
accomplishment, but I agree with them that we shouldn't get too excited yet
(their quote appears below, after the four reasons). Here are the reasons:

1) Most likely, the model is still susceptible to adversarial triggers as
demonstrated on other systems here:
[http://www.ericswallace.com/triggers](http://www.ericswallace.com/triggers)

2) T5 was trained with ~750GB of texts or ~150 billion words, which is > 100
times the number of words native English speakers acquire by the age of 20
(rough arithmetic after this list).

3) Most or all of the tests are multiple-choice. Learning complex correlations
from sufficient data should help solve most of them. This is useful but human-
level understanding is more than correlations.

4) The performance on the datasets that require commonsense knowledge, COPA
and WSC, is the weakest relative to humans (who score 100.0 on both).
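(Rough arithmetic behind point 2 - every constant here is my own
back-of-the-envelope assumption:)

    corpus_bytes = 750e9                          # ~750GB of text
    bytes_per_word = 5                            # ~4 letters + 1 space
    corpus_words = corpus_bytes / bytes_per_word  # ~150 billion words

    words_per_day = 50_000                        # generous exposure estimate
    human_words_by_20 = words_per_day * 365 * 20  # ~365 million words

    print(corpus_words / human_words_by_20)       # ~411, i.e. well over 100x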

Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer, p.32
[https://arxiv.org/pdf/1910.10683.pdf](https://arxiv.org/pdf/1910.10683.pdf)

"Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we
exceed human performance by a large margin, suggesting the evaluation metrics
used for these tasks may be biased towards machine-made predictions. On the
other hand, humans achieve 100% accuracy on both COPA and WSC, which is
significantly better than our model’s performance. This suggests that there
remain linguistic tasks that are hard for our model to perfect, particularly
in the low-resource setting."

I’d like to emphasize that the work and the paper are excellent. Still, we are
quite far from human-level language understanding.

\---

We may need more advanced tests to probe the actual language _understanding_
ability of AI systems. Here are some ideas:

* Test for conceptual understanding in a non-multiple-choice format. Example: Write a summary for a New Yorker article, rather than standard news pieces (which tend to follow repeated patterns).

* Commonsense test with longer chains of inference than those needed for solving Winograd Schema and set in non-standard situations (e.g. fantasy world). This should greatly reduce the chance that an approach can simply detect correlations from huge datasets.

* Understanding novel, creative metaphors like those used in some essays by professional writers or some of the Economist's title articles.

~~~
VikingCoder
> 2) T5 was trained with ~750GB of texts or ~150 billion words, which is > 100
> times the number of words native English speakers acquire by the age of 20.

...but, humans evolved the ability to use language over hundreds of
generations... So... Maybe that's not such a bad thing?

~~~
msamwald
Indeed, this is important to realize: training such a generic model from
scratch recapitulates not just learning, but the entire evolutionary process
that led to the emergence of neural circuits actually capable of such
learning. That perspective makes many of the current achievements -- error-
prone as they might be -- even more impressive!

------
martincmartin
"The General Language Understanding Evaluation (GLUE) benchmark is a
collection of resources for training, evaluating, and analyzing natural
language understanding systems."

"We take into account the lessons learnt from original GLUE benchmark and
present SuperGLUE, a new benchmark styled after GLUE with a new set of more
difficult language understanding tasks, improved resources, and a new public
leaderboard."

------
YeGoblynQueenne
Assuming that the baseline human score was set according to the performance of
adult humans, then according to these results T5 has a language understanding
ability at least as accurate as a human child.

In fact it's not just T5 that should be able to understand language as well as
a human child, but also BERT++, BERT-mtl and RoBERTa, each of which has a
score of 70 or more. There really shouldn't be anything else on the planet
that has 70% of human language understanding, other than humans.

So if the benchmarks mean what they think they mean, there are currently
fully-fledged, strongly artificially intelligent systems. That must mean
that, in a very short time, we should see strong evidence of having created
human-like intelligence.

Because make no mistake: language _understanding_ is not like image
recognition, say, or speech processing. Understanding anything is an AI-
complete task, to use a colloquial term.

Let's wait and see then. It shouldn't take more than five or six years to
figure out what all this means.

~~~
YeGoblynQueenne
To clarify, I meant this comment as an expression of skepticism - I don't
believe that the SuperGLUE benchmark really evaluates language understanding,
or that BERT and friends are within a few percent of human language
understanding. I think SuperGLUE is just another benchmark that measures
something other than what it's supposed to be measuring (machine learning
benchmarks usually do).

It seems that the teams behind the attempts to beat such benchmarks are aware
of the weaknesses of the benchmarks though, so that's encouraging.

------
enisberk
I attended one of Sam Bowman's talks (1). His talk was about "Task-
Independent Language Understanding" and he also talked about GLUE and
SuperGLUE; he mentioned that some models are surpassing the average person in
experiments. They did some experiments to understand BERT's performance (2)
(similar to the article 'NLP's Clever Hans Moment'), but they found a
different answer to the question of "what BERT really knows," so he was
skeptical about all such conclusions. Check these out if you are interested.

(1)[[https://www.nyu.edu/projects/bowman/TILU-
talk-19-09.pdf](https://www.nyu.edu/projects/bowman/TILU-talk-19-09.pdf)]

(2)[[https://arxiv.org/abs/1905.06316](https://arxiv.org/abs/1905.06316)]

------
ilaksh
The AIs in the benchmark are all trained exclusively on text, correct?

My assumption has always been that to get human-level understanding, the AI
systems need to be trained on things like visual data in addition to text.
This is because there is a fair amount of information that is not encoded at
all in text, or at least is not described in enough detail.

I mean, humans can't learn to understand language properly without using
their other senses. You need something visual or auditory to associate with
the words, which are really supposed to represent full systems that are
complex and detailed.

I think it would be much more obvious if there were questions that involved
things like spatial reasoning, or that combined image recognition with
spatial reasoning and comprehension.

~~~
tialaramex
Mmm. The philosophical position that it's essential to be embodied in order to
have intelligence seems intuitively reasonable but is very much unproven. You
will find philosophers and cognitive scientists who are sure you're right, but
they don't have much hard evidence, and you will also find people like me who
are pretty sure you're wrong but likewise have no hard evidence.

On this specific point, remember that deaf-blind people exist. So if you're
sure that you "need something visual or auditory", then those people are not,
according to your beliefs, able to understand language. I think they'll
disagree with you quite strongly.

~~~
spappal
> remember that deaf-blind people exist [... ...] able to understand language

I got curious about whether and how deafblind people learn to communicate in
the first place, if they are completely deafblind from birth. If humans can
learn not just communication but language without either vision or hearing,
that seems to suggest either extreme adaptability, or language learning being
quite decoupled from vision and hearing. From an evolutionary standpoint, I
imagine that deafness and blindness are each probably uncommon enough that
language learning could have had explicit dependencies on both hearing and
vision.

I found an old-looking video about communication with deafblind people. At
the linked timestamp is a woman who has been deafblind since age 2.

[https://youtu.be/usaf3bVVvjY?t=840](https://youtu.be/usaf3bVVvjY?t=840)

------
alexwg
Paper: [https://arxiv.org/abs/1910.10683](https://arxiv.org/abs/1910.10683)

~~~
throwaway_bad
Twitter summary:
[https://threadreaderapp.com/thread/1187161460033458177.html](https://threadreaderapp.com/thread/1187161460033458177.html)

------
ArtWomb
"Attention is all you need", indeed. Of course, our instinct tells us there is
more to language inference than word proximity. And so results approaching or
exceeding expert-level human baseline raise more questions than providing
cause for popping champagne corks.

Question Answering is also advancing rapidly with insights from transformers
and denoising auto-encoders, but is still far from the human baseline. The
ease with which these models can answer a sample question such as "Who was
the first human in space?" demonstrates both their efficacy and their
limitations. In a large pre-training corpus, almost every document that
contains the name "Yuri Gagarin" will, in its near vicinity, describe him in
relation to the pioneering accomplishment for which he became a cultural icon.

And what about even more generalizable scenarios, such as "What might you
find on a Mayan monument?" There, it becomes imperative that an agent
_explain_ its reasoning in natural language as well, to enable self-
correcting backpropagation of error.

Language may be considered relatively low-dimensional, and sentence
prediction across quotidian tasks manageable in current state-of-the-art
architectures. But looking at how difficult it is to predict the next N
frames of video given a short input example demonstrates the intractability
of the problem in higher-dimensional spaces.

Neural Models for Speech and Language: Successes, Challenges, and the
Relationship to Computational Models of the Brain - Michael Collins

[https://www.youtube.com/watch?v=HVnFKmPaU8c](https://www.youtube.com/watch?v=HVnFKmPaU8c)

------
skybrian
They came up with the SuperGLUE benchmark because they found that the GLUE
benchmark was flawed and too easy to game. There were correlations in the
dataset that made it possible to get questions right without real
understanding, and so the results didn't generalize.

Could the same thing happen again with the better benchmark due to more subtle
correlations? These things are tough to judge, so I'd say wait and see if it
turns out to be a real result.

------
lettergram
Although those are some great results, I wish I could try it out locally...

[https://github.com/google-research/text-to-text-transfer-
tra...](https://github.com/google-research/text-to-text-transfer-transformer)

It drives me nuts that most of these papers / publications don't have code
where I can just run:

> python evaluate_model.py

Still exciting, just annoying that I'd have to set up Google Cloud to try
this out.

~~~
ehsankia
They often do set up Python notebooks / Colabs you can simply run, especially
with the data hosted on GCloud. Unfortunately not this time.

------
femto113
My experience with image classification benchmarks was that they approached
human levels only because the scoring counts how much they get “right” and
doesn’t penalize completely whack answers as much as it should (like getting
full credit for being pretty sure a picture of a dog was either a dog or an
alligator). I suspect there’s something similar going on in these language
benchmarks.

------
riku_iki
> T5-11B (11 billion parameters)

So, is this the largest language model so far?

~~~
lucidrains
Yes. The last one was 8.3B
[https://arxiv.org/pdf/1909.08053.pdf](https://arxiv.org/pdf/1909.08053.pdf)

------
pauljurczak
Using the term Natural Language Understanding in the context of this
benchmark is preposterous. No understanding takes place there. Please stick
to the term NLP (Natural Language Processing) for the next couple of decades.
Thank you.

------
nightnight
This clearly demonstrates once again that Google is miles ahead of the
competition in AI. I mean, they just have the best data.

If you want an everyday example of Google's AI skills: switch your phone's
keyboard to GBoard (especially all you iOS users) and you will see a night-
and-day difference compared to any other keyboard, especially the stock one.
When using multiple languages at the same time, the gap over other keyboards
gets even bigger.

GBoard is my phone's killer app, and if Google dropped it for iOS I'd leave
for Android the same day.

~~~
dingle_thunk
Have you tried the iOS 13 keyboard's built-in swipe feature?

Are you aware of Swiftkey?

~~~
occamrazor
I think GP is talking about predictive text, rather than keyboard ergonomics.

------
rrival
Where do I take the SuperGLUE test?

------
woodgrainz
Several of the systems on this leaderboard use the BERT model, a clever
approach devised by Google for natural language processing. A nice layman's
guide to BERT:

[https://towardsdatascience.com/bert-explained-state-of-
the-a...](https://towardsdatascience.com/bert-explained-state-of-the-art-
language-model-for-nlp-f8b21a9b6270)

------
vagab0nd
This is cool. Since they released an 11B pre-trained model, can we finally
reproduce "unicorn-level" text generation now?

~~~
penagwin
My understanding is that a lot of these really high-performance models that
reach for every percentage point possible require an absurd amount of
hardware - specifically, an absurd amount of GPU memory.

For example, I have what I consider a fairly "high end" rig for a hobbyist -
32GB of RAM, an i7 8700k, a 1080 Ti - and there's zero chance their model
would fit on my system.

So, I mean, maybe if you have a ton of money? Usually what happens is that a
slimmer model with not "quite" as high a score gets released that actually
fits on consumer hardware.

~~~
vagab0nd
Maybe I'm oversimplifying, but it seems to me that once you have the model
trained, it should be possible to partition it somehow at inference time so
it fits on smaller machines. At least for a proof of concept it should be
possible.

~~~
nmfisher
I'm not aware of any "partitioning" strategies per se (at least during
inference), but it's now common practice to distill a larger model into a
smaller one by either (a) training a smaller "student" network to replicate
the larger "teacher" network, or (b) pruning smaller weights from the larger
network to reduce its size.
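(A minimal sketch of (a), the generic Hinton-style distillation loss -
nothing BERT-specific:)

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          T=2.0, alpha=0.5):
        """Blend a soft KL term (match the teacher's softened outputs)
        with the usual hard cross-entropy on the gold labels."""
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard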

Just brainstorming here, but a vanilla network partition strategy might be to
load each layer's weights into memory and perform the forward pass
sequentially. I think that would be prohibitively slow - some of these models
(e.g. BERT) can already take up to 3-4 seconds to perform a single forward
pass on a CPU, and that's with all model weights already loaded into main
memory. I suspect fetching/loading each layer separately would blow this out
by an order of magnitude.
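(A toy version of that sequential strategy, mostly to illustrate where the
time would go - the file names are hypothetical:)

    import torch

    def forward_streaming_layers(x, layer_paths):
        """Forward pass holding only one layer's weights in memory at a
        time; every step pays a full load-from-disk penalty."""
        for path in layer_paths:          # e.g. ["layer_00.pt", ...]
            layer = torch.load(path)      # fetch this layer's weights
            with torch.no_grad():
                x = layer(x)
            del layer                     # free memory for the next layer
        return x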

------
vonseel
I wonder what I would score on this test. For humans, are these things at all
correlated with standardized test scores?

------
LukeB42
[http://www.irc.org/history_docs/tao.html](http://www.irc.org/history_docs/tao.html)

