
Recent Advances in Natural Language Processing - saadalem
https://deponysum.com/2020/01/16/recent-advances-in-natural-language-processing-some-woolly-speculations/
======
rvense
I think the point about language being a model of reality was interesting. I
have an MA in linguistics including some NLP from about a decade ago and was
looking at a career in academic NLP. I ultimately left to become a programmer
because (of life circumstances and the fact that) I didn't see much of a
future for the field, precisely because it was ignoring the (to me) obvious
issues of written language bias, ignorance of multi-modality and situatedness
etc. that are brought up in this post.

All of these results are very interesting, but I'm not really feeling like
we've been proved wrong yet. There is a big question of scalability here, at
least as far as the goal of AGI goes, which the author also admits:

> Of course everyday language stands in a woolier relation to sheep, pine
> cones, desire and quarks than the formal language of chess moves stands in
> relation to chess moves, and the patterns are far more complex. Modality,
> uncertainty, vagueness and other complexities enter but the isomorphism
> between world and language is there, even if inexact.

This woolly relation between language and reality is well-known. It has been
studied in various ways in linguistics and the philosophy of language, for
instance by Frege and not least Foucault and everything after. I also think
many modern linguistic schools take a very different view of "uncertainty and
vagueness" than I sense in the author here, but they are obviously writing for
non-specialist audience and trying not to dwell on this subject here.

My point is, when making and evaluating these NLP methods and the tools they
are used to construct, it is extremely important to understand that language
models social realities rather than any single physical one. It seems to me
all too easy, coming from formal grammar or pure stats or computer science, to
rush into these things with naive assumptions about what words are or how they
mean things to people. I dread to think what will happen if we base our future
society on tools made in that way.

~~~
joe_the_user
_My point is, when making and evaluating these NLP methods and the tools they
are used to construct, it is extremely important to understand that language
models social realities rather than any single physical one._

I'd claim actual language involves both a generally shared reality and quite
concrete, specific discussions of single physical and logical facts/models.
Some portion of language certainly looks mostly like a stream of
associations. But within it there are also references to physical reality and
a world model, and the two are complexly intertwined (making logical sense is
akin to a "parity check" - you can go for a while without it, but then you
have to look at the whole to get it). I believe one can see this in a GPT
paragraph, where the first two sentences seem intelligent and well written,
but the third sentence contradicts the first two sufficiently that one's mind
isn't sure "what's being said" (and even here, our "need for logic" might be
loose enough that we only notice the "senselessness" after the third logical
error).

~~~
rvense
In human language I think physical reality is always a few layers out.
Language is social, first and foremost, and naming something is not neutral.
We can hardly refer to single objects directly, we mostly do it through their
class membership, which always, always include a whole range of associations,
reductions, metaphor, etc. that are cultural.

~~~
joe_the_user
_In human language I think physical reality is always a few layers out._

Yes, the point is that you can neglect physical (and logical) reality for a
while in a stream of sentences. But not forever, and that's where current
NLP's output has its limits. At a simple level, a stream of glittering
adjectives can be attached to a thing and just add up to "desirable", unless
those adjectives "go over threshold" and contradict each other, and then the
description can get tagged by the brain as a bit senseless.

~~~
stosto88
Are you a NLP bot?

~~~
joe_the_user
Oh Gawd, I suppose that quip's topical.

It seems like at this point, there's no way to distinguish the coherent
comments of a bot from a person's. The bot could have written sentence X for
just about any X. It's just that bots can't sustain a stream of logical
claims that are consistent with each other.

So it's easier to demonstrate that a paragraph was written by a bot, i.e.,
that a paragraph makes no sense, than it is to demonstrate that a paragraph
was not written by a bot - since both humans and bots write some sensible
paragraphs.

Still, I'm egotistic enough to think a bot couldn't come up with that
argument, though I could be wrong.

~~~
mercer
Much as I'm assuming the person you're responding to was joking, I've
encountered a number of comments/commenters where I felt the same way I felt
about GPT output.

The best way I can describe the feeling is that it reminds me of conversations
and friendships I've had with schizophrenics, people in the process of having
a psychotic breakdown, and people with alzheimers.

There's a feeling that what they're saying is not entirely non-sensical, a
feeling of 'catching up' to what they're trying to say (akin to translating in
a language one isn't too proficient in). But reflecting on the conversation, I
find myself wondering how much I managed to understand what they were trying
to convey, and how much it was just my brain trying to make sense of something
that ultimately doesn't.

'Understanding' or 'communication' aside, I've often valued these kinds of
conversations because they tickle the more free-associative side of my own
thinking, and the results, however I/we got there, were useful to me.

As a result, I'm much more interested in how these developments in 'AI' might
augment this creative process than I am in how they might convincingly appear
human. Not that the latter isn't interesting too, though.

------
FiberBundle
I found the science exam results interesting and skimmed the paper [1]. They
report an accuracy of >90% on the questions. What I found puzzling was that
they have a section in the experimental results where they test the
robustness of the results using adversarial answer options; more
specifically, they used a simple heuristic to choose 4 additional answer
options from the set of other questions' options, maximizing 'confusion' for
the model. This resulted in a drop of more than 40 percentage points in the
accuracy of the model. I find this extremely puzzling: what do these models
actually learn? Clearly they don't actually learn any scientific principles.
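
A rough sketch of the shape such a heuristic could take, with a hypothetical
model_score() standing in for however the actual model scores an option (the
paper's exact method may differ):

    import random

    def model_score(question, option):
        # Hypothetical stand-in for the model's confidence that
        # `option` answers `question`.
        return random.random()

    def adversarial_options(question, correct, distractor_pool, k=4):
        # Rank candidate distractors (answer options taken from other
        # questions) by how plausible the model finds them, and keep
        # the k most "confusing" ones.
        ranked = sorted(distractor_pool,
                        key=lambda opt: model_score(question, opt),
                        reverse=True)
        return [opt for opt in ranked if opt != correct][:k]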

[1]
[https://arxiv.org/pdf/1909.01958.pdf](https://arxiv.org/pdf/1909.01958.pdf)

~~~
wrs
I would be interested in hearing the results from _humans_ presented with
adversarial answer options. You may say that a machine learning correlations
between words isn’t really learning science, but I wonder how many human
students aren’t either, just pretty much learning correlations between words
to pass tests...

~~~
FiberBundle
They do give an example of a question, in which the model chose an incorrect
answer in the adversarial setting:

"The condition of the air outdoors at a certain time ofday is known as (A)
friction (B) light (C) force (D)weather[correct](Q) joule (R)
gradient[selected](S)trench (T) add heat"

I assume this might be characteristic of other questions as well, although I
don't know anything about the Regents Science Exam or whether there are
multiple questions about closely related topics.

~~~
taneq
That’s a terribly worded question anyway. Of the original answers, ‘weather’
is the least worst but it’s still vague.

~~~
mannykannot
It is a well-worded question for its purpose. The whole point is that, of all
the options given, only one is justifiable (and it does not require a
tendentious stretch to justify it, either.) Even “light” (which was _not_
chosen) only applies half the time, on average. This is a valid test of
natural language understanding.

~~~
rvense
Remember when IBM went on Jeopardy? There was a question about an Egyptian
pharaoh. A human with some knowledge of history might mix up Ramses and Seti,
or whatever, or just not know the answer - but know that they didn't know.
Watson answered "What are trousers?" with supreme confidence.

Jeopardy is fun and games and it was great for the blooper reel, but they're
trying to sell this stuff to diagnose cancer and guide police efforts. Failure
modes are kind of important.

------
rland
> Models are transitive- if x models y, and y models z, then x models z. The
> upshot of these facts are that if you have a really good statistical model
> of how words relate to each other, that model is also implicitly a model of
> the world.

This right here is a great way of putting the success of GPT-3 into context.
We _think_ GPT is smart, because when it says something eerily human-like we
apply _our_ model of the world onto what it is saying. A conversation like
this:

> Me: So, what happened when you fell off the balance beam?

> GPT: It hurt.

> Me: Why'd it hurt so bad?

> GPT: The beam was high up and I feel awkwardly.

> Me: Wow, that sounds awful.

In this conversation, one of us is thinking far harder than the other. GPT can
have conversations like this now, which is impressive. But only I can model
the beam, the fall, and the physical reality. When I say "that sounds awful,"
I actually do a miniature physics simulation in my head, imagining losing my
balance and falling off a high beam, landing, the physical pain, etc. GPT does
none of that. In either case, when it asks the question or when it answers it,
it is entirely ignorant of this sort of "shadow" model that's being
constructed.

Generalizing a bit, our "shadow" model of reality in every single domain is
far more powerful than language's approximation. That's why we won't be able
to use GPT to do a medical diagnosis or create a piece of architecture or
whatever else people are saying it's going to do now.

~~~
lmm
> In this conversation, one of us is thinking far harder than the other. GPT
> can have conversations like this now, which is impressive. But only I can
> model the beam, the fall, and the physical reality. When I say "that sounds
> awful," I actually do a miniature physics simulation in my head, imagining
> losing my balance and falling off a high beam, landing, the physical pain,
> etc. GPT does none of that. In either case, when it asks the question or
> when it answers it, it is entirely ignorant of this sort of "shadow" model
> that's being constructed.

GPT-3 could presumably write a paragraph like that one. You can claim to have
a working physics model in your head, but why should I believe that unless it
becomes evident from the things that you communicate to me? I've certainly met
humans who could have a superficially legitimate conversation about objects in
motion while harbouring enormous misconceptions about the physics involved.

Maybe the biggest takeaway from GPT-3 should be that we should raise our
standards for human conversation, demanding more precise language and giving
less credit to flourishes that make the meaning ambiguous.

~~~
rland
I mean the miniature physics simulation of me actually imagining another
embodied human falling off of a beam. There is a huge knowledge of the
physical world encoded into that. Think about it, you can viscerally imagine
balancing on a beam, losing your footing, falling, and the subsequent forces
and accelerations on all parts of your body during and after the fall, in real
time. All of that information is in our model, and none of it is in GPT's.
Everyone has this; it's built in. I'm not talking about Newton's Laws here.

The physical mechanics of being an embodied thing isn't the only example of
this. Any experiential knowledge, any knowledge with feedback suffers from
this. And any knowledge that is difficult to put into words, like emotional
knowledge, interpersonal knowledge, etc., will be off-limits.

That's my point.

~~~
mercer
While I don't disagree, if I were to have the same conversation I would
probably be spending more time thinking about the appropriate response than
trying to remember and simulate an experience even remotely similar. So I'd be
more like GPT, perhaps?

------
mqus
Not a single mention of whether this applies only to English or to other
natural languages as well. Afaict this mostly lists advancements in ELP
(English language processing); especially the Winograd schema (or at least
the given example) seems to be heavily focused on English.

Relevant article for this problem:
[https://news.ycombinator.com/item?id=24026511](https://news.ycombinator.com/item?id=24026511)

~~~
MiroF
But there's no reason the models are English-specific...

~~~
YeGoblynQueenne
Yes, there is. A "model" is a set of parameters optimised by some algorithm
or system trained on the data in a specific dataset. Thus, a language model
trained on a dataset of English language examples is only capable of
representing English language utterances - not, e.g., French, or Greek, or
Gujarati utterances. Diagrammatically:

    
    
      data --> system --> model
    

What is not necessarily English-specific are the systems used to train
different language models, at least in theory. In practice, systems are
typically hand-crafted and fine-tuned to specific datasets, to such a degree
that most of the work has to be done anew to train on a different dataset.
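
As a toy illustration of the diagram (a minimal sketch, with unigram
frequencies standing in for a real architecture's parameters): the same
training code - the "system" - produces a different "model" from each
dataset, and each model only covers the language it was trained on.

    from collections import Counter

    def train(corpus):
        # The "system": identical code regardless of dataset.
        tokens = corpus.split()
        counts = Counter(tokens)
        total = sum(counts.values())
        # The "model": a parameter set specific to this dataset.
        return {w: c / total for w, c in counts.items()}

    english_model = train("the cat sat on the mat")
    french_model = train("le chat est sur le tapis")

    print(english_model.get("the", 0.0))  # nonzero
    print(french_model.get("the", 0.0))   # 0.0: "the" was never seen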

~~~
mannykannot
This is very interesting, and I would be interested in learning more about the
language-specific accommodations that are made. My minimal understanding of
the GPT systems is that they are initially trained on a large corpus of
English-language text, and then often, though not necessarily, given few-shot
training on example tasks before being tested with questions that are scored.

It would be unremarkable if the same system would have to be retrained from
scratch to work with a different language, but I take it that you mean more
than that. One guess is that the training data is annotated with grammatical
information, in which case I would wonder if this is just a shortcut, or
whether it solves a fundamental problem for such systems. Another guess is
that the training set includes disambiguation, but that would seem to render
meaningless the results on Winograd schema. (Update: I withdraw this last
point, as presumably the Winograd-schema questions themselves are not
disambiguated. Disambiguation of the training corpus would be a quite
significant language-specific accommodation, if that is what is happening.)

~~~
MiroF
> It would be unremarkable if the same system would have to be retrained from
> scratch to work with a different language, but I take it that you mean more
> than that.

It really is that unremarkable, despite the GP's insinuations to the
contrary.

~~~
YeGoblynQueenne
If training a language model were "unremarkable" we wouldn't have a hypefest
every time anyone releases a new model, with a catchy name no less. BERT,
RoBERTa, GPT-2/3/... etc. are remarkable enough for their creators to write
papers exclusively for the purpose of announcing _the model_ and describing
the architecture used to train it, etc.

In any case, I don't remember anyone ever naming their n-gram models or HMMs
etc. The reason for this of course is that training a language model with a
deep neural network architecture is anything but unremarkable, not least
because the costs in data, human-hours and compute are beyond the reach of
most entities other than large corporate teams. If training a new model from
scratch were truly "unremarkable" we'd all have our own.

~~~
MiroF
> If training a language model were "unremarkable" we wouldn't have a
> hypefest every time anyone releases a new model, with a catchy name no
> less. BERT, RoBERTa, GPT-2/3/... etc. are remarkable enough for their
> creators to write papers exclusively for the purpose of announcing the
> model and describing the architecture used to train it, etc.

Actual NLP practitioners are not having a hype-fest every time someone spends
a shitton to train a transformer LM on a new dataset. It's definitely cool,
but I wouldn't call it remarkable. What was cool about GPT-3 was the
demonstration of the effectiveness of the "few-shot"/"conditioning" approach.
I didn't find it surprising, but it was cool to push it to the next level.

> In any case, I don't remember anyone ever naming their n-gram models or HMMs
> etc

That's because they are literally different model architectures. The 3-gram
model is the same everywhere, it doesn't depend on dataset. A 3-gram "trained"
on Russian would still be a 3-gram. Similarly, BERT trained on a German corpus
would still be BERT, it wouldn't need a "new catchy name." That's my point.

As a side remark, you are vastly overestimating the amount of dataset specific
tuning that you need - you can often just use the exact same hyperparams and
run it on a new dataset - there's been some recent literature showing that a
lot of these architectures are actually quite resilient to different datasets
and hyperparameter choices and arrive at similar local minima. Sure, you might
be off by 1-2% from SOTA accuracy (let's say on a classification task) but in
production systems you're generally not shooting for the SOTA that was
released in the last 2 months but rather something close-ish to those results.

Honestly, quite confused this is the hill you want to die on.

~~~
YeGoblynQueenne
I'm unsure what you mean by "actual NLP practitioners". In any case, does it
really make a difference? We've been seeing GPT-3 related articles on HN and
in the techie press for weeks since GPT-3 was published.

>> That's because they are literally different model architectures. The 3-gram
model is the same everywhere, it doesn't depend on dataset. A 3-gram "trained"
on Russian would still be a 3-gram. Similarly, BERT trained on a German corpus
would still be BERT, it wouldn't need a "new catchy name." That's my point.

N-gram models are not "the same everywhere". An n-gram model is a set of
probabilities assigned to, well, n-grams. Models trained from different
datasets would be different: the n-grams would be different, the
probabilities would be different, and the structure of the model would be
different. A Russian n-gram model would represent the probabilities of
Russian n-grams, an English one would represent the probabilities of English
n-grams, and so on. The _type_ of model would be the same, an n-gram model;
but the _model_ itself would be different. Otherwise, why bother training
more models?
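
To make "a set of probabilities assigned to n-grams" concrete, here is a
minimal sketch of a bigram model with maximum-likelihood estimates (a toy,
not a production smoothed model):

    from collections import Counter

    def bigram_model(corpus):
        tokens = corpus.split()
        bigrams = Counter(zip(tokens, tokens[1:]))
        unigrams = Counter(tokens)
        # P(w2 | w1) = count(w1 w2) / count(w1)
        return {(w1, w2): c / unigrams[w1]
                for (w1, w2), c in bigrams.items()}

    model = bigram_model("the cat sat on the mat")
    print(model[("the", "cat")])  # 0.5: half of "the" is followed by "cat"

Run the same function over a Russian corpus and you get an entirely different
set of keys and probabilities - the same type of model, but a different
model.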

Same goes for a model trained with a transformer architecture. BERT is an
English language model trained with a transformer architecture and
representing the probabilities of tokens in an English corpus. A model trained
with a transformer architecture and representing the probabilities of tokens
in a German corpus would be, again, a different model. You could also call it
"BERT" but that would eventually become very confusing.

>> Honestly, quite confused this is the hill you want to die on.

I'm not dying on any hill. Could you please not do that?

~~~
MiroF
I disagree with how you define "model" to mean the parameters, rather than the
form of the relationship between input and output. I don't think every time we
make a new estimate of the electron mass, we are creating a "new" Standard
Model.

I don't really think it's super worth quibbling over, since I think we're
generally on the same page. It's worth noting that BERT trained on different
corpora is generally still called BERT (you can look at the
huggingface/transformers library, for instance).

I'm sorry that my message came across as more adversarial than intended.

~~~
YeGoblynQueenne
>> I'm sorry that my message came across as more adversarial than intended.

Thanks. I enjoyed our conversation.

------
skybrian
Darn, based on the title, I was hoping for an overview of recent research.

Lots of people are having fun playing with GPT-3 or AI Dungeon, myself
included, but it seems like there is other interesting research going on like
the REALM paper [1], [2]. What should I be reading? Why aren't people talking
about REALM more? I'm no expert, but it seems like keeping the knowledge base
outside the language model has a lot going for it?

[1] [https://ai.googleblog.com/2020/08/realm-integrating-
retrieva...](https://ai.googleblog.com/2020/08/realm-integrating-retrieval-
into.html) [2]
[https://arxiv.org/abs/2002.08909](https://arxiv.org/abs/2002.08909)

~~~
nl
REALM is amazing, and Google's PEGASUS[1] taught itself to count between 2 and
5 (which is mindblowing).

Basically the NLP groups from Google and FB[2][3] are always worth watching.

[1] [https://ai.googleblog.com/2020/06/pegasus-state-of-art-
model...](https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-
for.html)

[2] [https://ai.facebook.com/blog/covost-v2-expanding-the-
largest...](https://ai.facebook.com/blog/covost-v2-expanding-the-largest-most-
diverse-multilingual-speech-to-text-translation-data-set/)

[3] [https://ai.facebook.com/blog/introducing-a-new-large-
scale-d...](https://ai.facebook.com/blog/introducing-a-new-large-scale-
dynamic-data-set-to-push-the-limits-of-natural-language-processing/)

------
YeGoblynQueenne
Separate comment because separate point.

To my mind, all this benchmark-beating is just pointless faffing about, and
it is not a little frustrating to see that this is where the current state of
the art is. NLP systems keep getting better and better at beating benchmarks,
and yet we still don't have artificial agents that can hold an intelligent
conversation, even in a restricted context.

Or rather, we have had such programs since the late 1960s, with none other
than Terry Winograd's own doctoral thesis system, SHRDLU [1]; but it seems
that in the race to build powerful classifiers, we forgot how to build
systems that can interact intelligently with a human.

Here's a small excerpt from SHRDLU (taken from the wikipedia page on SHRDLU,
originally in an appendix in Terry Winograd's thesis):

    
    
        Person: Is there a large block behind a pyramid?
        Computer: YES, THREE OF THEM: A LARGE RED ONE, A LARGE GREEN CUBE, AND THE BLUE ONE.
        Person: Put a small one onto the green cube which supports a pyramid.
        Computer: OK.
        Person: Put the littlest pyramid on top of it.
        Computer: OK.
        Person: Does the shortest thing the tallest pyramid's support supports support anything green?
        Computer: YES, THE GREEN PYRAMID.
        Person: What color is it?
        Computer: BY "IT", I ASSUME YOU MEAN THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS.
        Computer: RED
    

Note that this is not just mindless text generation. The conversation is held
in the context of a "blocks world" where a robot with a single hand and eye
("Computer" in the excerpt above) moves blocks of various shapes and colours
around, as directed by a human user in free-form natural language. When the
Computer says "OK" after it's directed to "put the littlest pyramid on top of
it" it's because it really has grabbed the smallest pyramid in the blocks
world and placed it on top of the small block in an earlier sentence, as the
Person asked. The program has a memory module to keep track of what ellipses
like "it", "one" etc refer to throughout the conversation.

SHRDLU was a traditional program hand-crafted by a single PhD student- no
machine learning, no statistical techniques. It included, among other things,
a context-free grammar (!) of natural English and a planner (to control the
robot's hand) all written in Lisp and PLANNER. In its limited domain, it was
smarter than anything ever created with statistical NLP methods.

______________________

[1]
[https://en.wikipedia.org/wiki/SHRDLU](https://en.wikipedia.org/wiki/SHRDLU)

~~~
liuliu
We have known for a long time that hand-crafted programs in limited domains
can work for NLP, computer vision and voice recognition. The challenge has
always been that the limited domain can be extremely limited, and getting
anything practically interesting requires a lot of human involvement to
encode the world (expert systems).

Statistical methods traded that away. With data - some labelled, some
unlabelled and some weakly labelled - we can generate these models with much
more efficient human involvement (refining the statistical models and
labelling data).

I honestly don't see the frustration. Yes, current NLP models may not yet be
the "intelligent agent" everyone is looking for, to any extent. But claiming
it is all faffing and no better than the 1960s is quite a stretch.

~~~
YeGoblynQueenne
>> We knew hand-crafted program in limited domains can work for NLP, computer
vision and voice recognition long time ago.

Yes, we did. So- where are the natural language interfaces by which we can
communicate with artificial agents in such limited domains? Where are the
applications, today, that exhibit behaviour as seemingly intelligent as SHRDLU
in the '60s? I mean, have you personally seen and interacted with one? Can you
show me an example of such a modern system? Edit: Note again that SHRDLU was
created by a single PhD student with all the resources of ... a single PhD
student. It's no stretch to imagine that an entity of the size of Google or
Facebook could achieve something considerably more useful, still in a limited
domain. But this has never even been attempted.

Yes, it is faffing about. Basically NLP gave up on figuring out how language
works and switched to a massive attempt to model large datasets evaluated by
contrived benchmarks that serve no other purpose than to show how well modern
techniques can model large datasets.

~~~
zodiac
> It's no stretch to imagine that an entity of the size of Google or Facebook
> could achieve something considerably more useful, still in a limited domain.
> But this has never even been attempted.

How do you know if no one has attempted it, or if all attempts so far have
failed to achieve their goals? One of the claimed downsides of "hand-crafted
systems in limited domains" (i.e., something like the SHRDLU approach) is that
they would take too much effort to create when the domain is expanded to
something even slightly bigger than SHRDLU's domain, so a lack of successful
systems could be evidence of no one trying, or it could be evidence that the
claimed downside is indeed true.

The fact that a working system for the limited domains of e.g. customer
support or medical diagnosis would be worth a lot of money suggests to me that
they must have been tried, but that nothing useful could be built, and we
didn't hear about the failed attempts, meaning that those domains (at least)
are too big for hand-crafted systems to work.

> Yes, it is faffing about. Basically NLP gave up on figuring out how language
> works and switched to a massive attempt to model large datasets evaluated by
> contrived benchmarks that serve no other purpose than to show how well
> modern techniques can model large datasets.

It is inaccurate to say that all benchmarks are useless. For instance,
language translation (as in Google Translate) is a benchmark NLP task, but it
is also something I personally use at least every week, and deep learning
based solutions beat handcrafted systems by a lot for this particular task
(speaking as an end user who has used systems based on both approaches). The
same comments apply to audio transcription (e.g. generating subtitles on
YouTube) as well.

~~~
YeGoblynQueenne
>> The fact that a working system for the limited domains of e.g. customer
support or medical diagnosis would be worth a lot of money suggests to me that
they must have been tried, but that nothing useful could be built, and we
didn't hear about the failed attempts, meaning that those domains (at least)
are too big for hand-crafted systems to work.

That's actually a good point and one I've made myself a few times in the past-
negative results never see the light of day. So, yes, I agree we can't know
for sure that such systems haven't been attempted, despite the certainty with
which I assert this in my comment above.

On the other hand, there are some strong hints. First of all, there was an AI
winter in the 1980's after which most of the field turned very hard towards
statistical techniques and away from the kind of work in SHRDLU. This kind of
work became radioactive for many years and it would have been very hard to
justify putting a PhD student, or ten, to work trying to even reproduce it,
let alone extend it. That's in academia. In the industry, it's clear that
nowadays at least companies like the FANGs strongly champion statistical
machine learning and anyone proposing spending actual money on such a research
program ("Hey, let's go back to the 1960's and start all over again!") would
be laughed out of the building. That is, I believe there are strong hints that
the political climate in academia and the culture in large companies, has
suppressed any attempt to do work of this kind. But that's only my conjecture,
so there you have it.

>> It is inaccurate to say that all benchmarks are useless.

Of course. My point is that _current_ benchmarks are useless.

The fact that you find Google translate useful is unrelated to how well Google
translate scores in benchmarks, which are not designed to measure user
satisfaction but instead are supposed to tell us something about the formal
properties of the system. In any case, for translation in particular, it's not
controversial that there are no good benchmarks and metrics and many people in
NLP will tell you that is the case. In fact, I'm saying this myself because I
was told during my Master's, by our NLP tutor who is a researcher in the
field. Also see the following article, which includes a discussion of
commonly used metrics in machine translation and the difficulty of
evaluating machine translation systems:

[https://www.skynettoday.com/editorials/state_of_nmt](https://www.skynettoday.com/editorials/state_of_nmt)

~~~
zodiac
Yeah, I know about how current automatic MT benchmarks don't reflect "user
satisfaction" very accurately, and that it's an open problem to get one that
is serviceable. However, you make it sound like all deep learning solutions
perform well on the task benchmark but poorly on the real-world task the
benchmarks try to approximate, whereas that's not true for MT - the
benchmarks are bad, but deep learning systems outperform non-deep-learning
translation approaches at the real-world task.

On the subject of benchmarks, how about speech transcription? I was under the
impression that those benchmarks are pretty reliable indicators of "real-world
accuracy" (or about as reliable as benchmarks are in general)

~~~
YeGoblynQueenne
I don't know much about speech transcription, sorry.

How do you mean that machine translation outperforms non deep learning based
approaches at the real world task? How is this evaluated?

~~~
zodiac
> How do you mean that machine translation outperforms non deep learning based
> approaches at the real world task? How is this evaluated?

There were two ways that performance can be evaluated that I had in mind:

1. Commercial success - what approach do popular sites like Google Translate
use?

2. Human evaluation - in a research setting, ask humans to score
translations; this is mentioned in your link.

~~~
YeGoblynQueenne
OK, thanks for clarifying. The problem with such metrics is that they're not
objective results. It doesn't really help us learn much to say that a system
outperforms another based on subjective evaluations like that. You might as
well try to figure out which is the best team in football by asking the fans
of Arsenal and Manchester United.

~~~
zodiac
The subjective human evaluations used in research are blinded - the humans
rate the accuracy of translation without knowing what produced the translation
(whether a NMT system, a non-ML MT system, or a human translator), whereas the
football fans in your scenario are most definitely not blinded. There are some
criticisms you could make about human evaluation, but as far as how well they
correspond to the real-world task, I think they're pretty much the best we can
do. I'm very curious to know if you actually think they're a bad target to
optimize for.

More to the point, you still have yet to show that NMT "serves no other
purpose than to show how well modern techniques can model large datasets",
given that they do well on human evaluations and they're actually producing
value by serving actual production traffic (you know, things humans actually
want to translate) in Google Translate. If serving production traffic like
this is not "serving a purpose", what is?

~~~
YeGoblynQueenne
Sorry for the late reply.

Regarding whether human evaluations are a good target to optimise for: no, I
certainly don't think so. That's not very different from calculating BLEU
scores, except that instead of comparing a machine-generated translation
with one reference text, it's compared with people's subjective criteria,
which actually serves to muddy the waters even more - because who knows why
different people thought the same translation was good or bad? Are they all
using the same criteria? Doubtful! But if they're not, then what have we
learned? That a bunch of humans agreed that some translation was good, or
bad, each for their own reasons. So what? It doesn't make any difference
that the human evaluators are blinded; you could run the same experiment
with human translations only and you'd still have learned nothing about the
quality of the translation - just the subjective opinions of a particular
group of humans about it.
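
For concreteness, this is what the mechanical comparison looks like (a
sketch using NLTK's BLEU implementation; the criticism above is about what
the resulting number fails to capture):

    from nltk.translate.bleu_score import sentence_bleu

    reference = ["the cat is on the mat".split()]
    candidate = "the cat sat on the mat".split()

    # Overlap of candidate n-grams with a single reference text;
    # only unigrams and bigrams here, to keep the toy example stable.
    score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
    print(round(score, 3))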

See, the problem is not just with machine translation. Evaluating _human_
translation results is also very hard to do, because translation itself is a
very poorly characterised task. The question "what is a good translation?"
is very difficult to answer. We don't have, say, a science of translation to
tell us how a text should be translated between two languages. So in machine
translation people try to approximate not only the task of translation, but
also its evaluation - without understanding either. That's a very bad spot
to be in.

In fact, a "science of translation" would be a useful goal for AI research,
but it's the kind of thing that I complain is not done anymore, having been
replaced with beating meaningless benchmarks.

Regarding the fact that neural machine translation "generates value", you mean
that it's useful because it's deployed in production and people use it? Well,
"even a broken clock is right twice a day" so that's really not a good
criterion of quality at all. In fact, as a criterion for an AI approach it's
very disappointing. Look at the promise of AI: "machines that think like
humans!". Look at the reality: "We're generating value!". OK. Or we could make
an app that adds bunny ears and cat noses to peoples' selfies (another
application of AI). People buy the app- so it's "generating value". Or we can
generate value by selling canned sardines. Or selling footballs. Or selling
foot massages. Or in a myriad other ways. So why do we need AI? It's just
another useless trinket that is sold and bought while remaining completely
useless. And that for me is a big shame.

~~~
zodiac
> Look at the promise of AI: "machines that think like humans!". Look at the
> reality: "We're generating value!". OK.

OK, I hadn't realised we had such different implicit views on what the "goal"
of AI / AI research was. Of course, I agree that the goal of "having machines
think like humans" is a valid goal, and "generates value by serving production
traffic" is not a good subgoal for that. However, this is not the only goal of
AI research, nor is it clear to me that for e.g. public funding bodies see it
as the only goal.

I use MT (at least) every week for my job and for my hobbies, mostly
translating stuff I want to read in another language. I love learning
languages but I could not learn all the languages I need to a high enough
level to read the stuff I want to read. The old non-NMT approaches produced
translations that were often useless, whereas the NMT-based translations I use
now (mostly deepl.com), while not perfect, are often quite good, and
definitely enough for my needs. Without NMT, realistically speaking, there is
no alternative for me (i.e., I can't afford to pay a human translator, and I
can't afford to wait until I'd learned the language well enough). So how can
you say that AI "remains completely useless"?

Basically, you have implicitly assumed that "make machines that think like
humans" is the only valid goal of AI research. And, from that point of view,
it is understandable that evaluating NMT systems for how well they approach
that goal using human evaluations, has many downsides. However, while some
people working on NMT do have that goal, many of them also have the goal of
"help people (like zodiac) translate stuff", and in the context of that goal,
human evaluation is a much better benchmark target.

~~~
YeGoblynQueenne
In general, yes, that's it. But to be honest I'm actually not that interested
in making "machines that think like humans". I say that's the "promise of AI"
because it was certainly the goal at the beginning of the field, specifically
at the Dartmouth workshop where John McCarthy coined the term "Artificial
Intelligence" [1]. Researchers in the field have varying degrees of interest
in that lofty goal, but the public certainly has great expectations, as seen
every time OpenAI releases a language model and people start writing or
tweeting stuff about how AGI is right around the corner, etc.

Personally, I came into AI (I'm a PhD research student) because I got (really,
_really_ ) interested in logic programming languages and well, to be frank,
there's no other place than in academia that I can work on them. On the other
hand, my interest in logic programming is very much an interest in ways to
make computers not be so infuriatingly dumb as they are right now.

This explains why I dislike neural machine translation and similar
statistical NLP approaches: while they can model the structure of language
well, they do nothing for the meaning carried by those structures, which
they completely ignore by design. My favourite example is treating sentences
as a "bag of words", as if order doesn't make a difference - and yet this is
a popular technique... because it improves performance on benchmarks (by
approximately 1.5 fired linguists).

The same goes for Google translate. While I'd have to be more stubborn than
I am to deny that people use it and like it, I find it depends on the use
case - and on the willingness of users to accept its dumbness. For me, it's
good where I don't need it and bad where I do. For example, it translates
well enough between languages I know and can translate to some extent
myself, say English and French. But if I want to translate between a
language that is very far from the ones I know - say, from Hungarian to my
native Greek - that's just not going to work, not least because the
translation goes through English (because of a dearth of parallel texts, and
despite the fact that Google laughably claims its model actually has an
automatically learned "interlingua"), so the result is mangled twice and I
get gibberish on the other end.

I could talk at length on why and how this happens, but the gist of it is
that Google translate decides which translation to choose, among many
possible translations of an expression, by looking at the frequencies of
token collocations - and nothing else. So for example, if I ask it to
translate a single word, "χελιδόνι", meaning the bird swallow, from Greek to
French, I get back "avaler", which is the word for the verb to swallow -
because translation goes through English, where "swallow" has two meanings
and the verb happens to be more common than the bird. The information that
"χελιδόνι" is a noun and "avaler" is a verb exists, but Google translate
will just not use it. Why? Well, because the current trend in AI is to learn
everything end-to-end from raw data and without prior knowledge. And that's
because prior knowledge doesn't help to beat benchmarks, which are not
designed to test world-knowledge in the first place. It's a vicious circle.
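
A toy sketch of that failure mode (made-up dictionaries and frequencies,
nothing to do with Google's actual pipeline):

    # Pivot translation: Greek -> English -> French, choosing the most
    # frequent English sense with no part-of-speech information.
    el_to_en = {"χελιδόνι": "swallow"}           # the bird
    en_to_fr = {
        "swallow": [("avaler", 0.9),             # verb, more frequent
                    ("hirondelle", 0.1)],        # noun, the bird
    }

    def pivot_translate(greek_word):
        english = el_to_en[greek_word]
        # Pick purely by frequency, ignoring that the Greek source
        # word was a noun: exactly the failure described above.
        return max(en_to_fr[english], key=lambda pair: pair[1])[0]

    print(pivot_translate("χελιδόνι"))  # avaler - should be hirondelle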

So, yes, to some extent it's what you say- I don't quite expect "machines that
think like humans", but I do want machines that can interact with a human user
in a slightly more intelligent manner than now. I gave the example of SHRDLU
above because it was such a system. I'm sad the effort to reproduce such
results, even in a very limited domain, has been abandoned.

P.S. Sorry, this got a bit too long, especially for the late stages of an HN
conversation :)

___________

[1] That was in 1956:
[https://en.wikipedia.org/wiki/Dartmouth_workshop](https://en.wikipedia.org/wiki/Dartmouth_workshop)

~~~
zodiac
I hear you about those language pairs that have to be round-tripped through a
third language. I completely agree, too, that the big open questions in NLP
are all about understanding meaning, semantic content, pragmatics, etc rather
than just syntax.

I don't think that "NMT and similar techniques" ignore meaning by design
though. What they do do by design, compared to expert systems etc, is avoid
having explicitly encoded knowledge (of the kind SHRDLU had). Take word2vec
for instance, it's not NMT but fits into the "statistical NLP" description -
its purpose is to find encodings of words that carry some semantic content.
Now, of course it's very little semantic content compared to what an expert
could plausibly encode, but it _is_ some semantic content, and this content
improves the (subjective human) evaluation of NMT systems that do use word2vec
or something similar.

Also, we should carefully distinguish "prior knowledge" as in "prior common-
sense knowledge" and "prior linguistic knowledge". The end-to-end trend
eschews "prior linguistic knowledge", while current NLP systems tend to lack
"common-sense" knowledge, for rather different reasons.

End-to-end training tends to eschew prior linguistic knowledge because it
improves (subjectively evaluated) performance in real-world tasks - I believe
this is true for MT as well, but an easier example if you want to look into it
is in audio transcription. I don't think there's a consensus about why this
happens, but I think it is something like this: the way people were
previously encoding linguistic knowledge was too fragile and simplified
(think about how complicated traditional linguistic grammars are), and if
that information can somehow be learned in the end-to-end process, that
performs better.

Lacking "common-sense" knowledge - that's more in the realm of AGI, so there's
a valid debate about to what extent neural networks can learn such knowledge,
but the other side of that debate is that expressing common-sense knowledge in
today's formal systems is really hard and expensive, and AIUI this is also
something that attempts to generalize SHRDLU run into. But it is definitely
incorrect to say that it's ignored by anyone by design...

BTW, the biggest improvements (as subjectively evaluated by me) I've seen in
MT on "dissimilar languages" have come from black box neural nets and throwing
massive amounts of (monolingual or bilingual) data at it, rather than anything
from formal systems. I use deepl.com for Japanese-English translation of some
technical CS material, and that language pair used to be really horrible in
the pre-deep-learning days (and it's still not that good on google translate
for some reason).

~~~
YeGoblynQueenne
Sorry for the late reply again.

I agree about word2vec and embeddings in general- they're meant to represent
meaning or capture something of it anyway. I'm just not convinced that they
work that well in that respect. Maybe I can say how king and queen are
analogous to man and woman etc, but that doesn't help me if I don't know what
king, queen, man or woman mean. I don't think it's possible to represent the
meaning of words by looking at their collocation with other words- whose
meaning is also supposedly represented by their collocation with other words
etc.
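
The analogy structure itself is easy to see in a toy sketch (made-up 2-d
vectors, not real word2vec embeddings):

    import numpy as np

    # Toy vectors arranged so the royal/gender offsets are parallel.
    vec = {
        "king":  np.array([0.9, 0.9]),
        "queen": np.array([0.9, 0.1]),
        "man":   np.array([0.1, 0.9]),
        "woman": np.array([0.1, 0.1]),
    }

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # king - man + woman lands nearest to queen; the arithmetic works,
    # but nothing here says what any of the words *mean*.
    target = vec["king"] - vec["man"] + vec["woman"]
    best = max((w for w in vec if w not in {"king", "man", "woman"}),
               key=lambda w: cos(vec[w], target))
    print(best)  # queen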

I confess I haven't used any machine translation systems other than google
translate. For instance, I've never used deepl.com. I'll give it a try since
you recommend it although my use case would be to translate technical terms
that I only know in English to my native Greek and I don't think anything can
handle that use case very well at all. Not even humans!

Out of curiosity: you say neural machine translation is better than earlier
techniques, which I think is not controversial. But have you tried such
earlier systems? I've never had the chance.

------
YeGoblynQueenne
>> The Winograd schema test was originally intended to be a more rigorous
replacement for the Turing test, because it seems to require deep knowledge of
how things fit together in the world, and the ability to reason about that
knowledge in a linguistic context. Recent advances in NLP have allowed
computers to achieve near human
scores:([https://gluebenchmark.com/leaderboard/](https://gluebenchmark.com/leaderboard/)).

The "Winograd schema" in Glue/SuperGlue refers to the Winograd-NLI benchmark
which is simplified with respect to the original Winograd Schema Challenge
[1], on which the state-of-the-art still significantly lags human performance:

 _The Winograd Schema Challenge is a dataset for common sense reasoning. It
employs Winograd Schema questions that require the resolution of anaphora: the
system must identify the antecedent of an ambiguous pronoun in a statement.
Models are evaluated based on accuracy._

 _WNLI is a relaxation of the Winograd Schema Challenge proposed as part of
the GLUE benchmark and a conversion to the natural language inference (NLI)
format. The task is to predict if the sentence with the pronoun substituted is
entailed by the original sentence. While the training set is balanced between
two classes (entailment and not entailment), the test set is imbalanced
between them (35% entailment, 65% not entailment). The majority baseline is
thus 65%, while for the Winograd Schema Challenge it is 50% (Liu et al.,
2017). The latter is more challenging._

[https://nlpprogress.com/english/common_sense.html](https://nlpprogress.com/english/common_sense.html)
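
For readers unfamiliar with the NLI conversion, it looks roughly like this
on the classic schema (my paraphrase of the standard example, not taken from
the dataset files):

    premise = ("The city councilmen refused the demonstrators a permit "
               "because they feared violence.")

    # Substitute the ambiguous pronoun with each candidate antecedent:
    hypothesis_1 = "The city councilmen feared violence."  # entailed
    hypothesis_2 = "The demonstrators feared violence."    # not entailed

    # WNLI task: for each (premise, hypothesis) pair, predict whether
    # the hypothesis is entailed by the original sentence.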

There is also a more recent adversarial version of the Winograd Schema
Challenge called Winogrande. I can't say I'm on top of the various results,
and so I don't know the state of the art, but it's not yet "near human", not
without caveats (for example, Wikipedia reports 70% accuracy on 70 problems
manually selected from the original WSC).

__________

[1]
[https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492)

------
bloaf
I know that there are allegedly NLP algorithms for generating things like
articles about sports games. I assume they have something more like the type
signature (timeline of events) -> (narrative about said events)

What this article is about is more (question/prompt) -> (answer/continuation
of prompt)

Does anyone know if there is progress in the (timeline of events) ->
(narrative about said events) space?

~~~
082349872349872
For an intermediate goal on the way to sports games, the financial press
version of (timeline of events) -> (narrative about said events) could be
tackled as a memoryless system.

------
walleeee
> A lot of the power of the thought experiment hinges on the fact that the
> room solves questions using a lookup table; this stacks the deck. Perhaps
> we would be more willing to say that the room as a whole understood
> language if it formed an (implicit) model of how things are, and of the
> current context, and used those models to answer questions.

Some define intelligence (entirely separately from consciousness) precisely
as the ability to develop an internal model. Coupled to regulatory feedback,
the system can then modify itself in response to some set of internal and/or
external conditions. (Joscha Bach, for instance, suggests consciousness is a
consequence of extremely complex _self_ -models.)

------
ragebol
> In my head- and maybe this was naive- I had thought that, in order to
> attempt these sorts of tasks with any facility, it wouldn’t be sufficient to
> simply feed a computer lots of text.

(Tasks here referring to questions in the New York Regent’s science exam)

Same for me.

But it makes sense, of course, that learning from text only is entirely
possible. I certainly have not directly observed the answer to, e.g., 'Which
process in an apple tree primarily results from cell division? (1) growth
(2) photosynthesis (3) gas exchange (4) waste removal'; I have been taught,
from textbooks, what the answer should be.

I do have a much better grounding in what growth is and what apples and
apple trees are, though.

------
_emacsomancer_
A bit I found rather strange, on the language-side:

> This is to say the patterns in language use mirror the patterns of how
> things are(1).

> (1)- Strictly of course only the patterns in true sentences mirror, or are
> isomorphic to, the arrangement of the world, but most sentences people utter
> are at least approximately true.

Presumably this should really say something like "...but most sentences people
utter are at least approximately true _of their mental representation of the
world_."

------
ascavalcante80
NLP is great for many things but, from my own experience as an NLP
developer, machines are not even close to understanding human language. They
can interpret some kinds of written text well, but they struggle to grasp
two humans speaking to each other. The progress we are making on building
chatbots and voice assistants is mainly due to the fact that we are learning
how to speak to the machines, and not the contrary.

------
laurieg
I find it a little bit strange that there is an unspoken assumption in
almost all natural language processing: that speech and text are perfectly
equivalent.

All of the examples in the article work on English text, not spoken English. I
would consider spoken English to be a much better "Gold standard" of natural
language.

I'm really looking forward to machine translation operating purely on a speech
in/speech out basis, instead of converting to text as an intermediate step.

------
rllin
the thing is humans have most efficiently encoded (in detail) reality in text.
humans already highlight what is worth encoding about reality.

for example, you can finetune gpt-2 to have an idea of sexual biology by
having it read erotica. just like how you can have a model learn the same by
watching porn. but it is much more efficient to read the text, since there is
much less information that is "useless"

------
p1esk
Note this is pre-GPT-3. In fact I expect GPT-4 will be where interesting
things start happening in NLP.

~~~
curiousgal
I honestly don't get what the big deal is with NLP. So far the most useful
application has been customer support chatbots, and those still don't rise
to the level of an actual human who can understand the intricacies of your
special request.

~~~
ben_w
Current NLP is bad. Still useful (Google search increasingly feels like it is
doing NLP to change what I asked for into what it thinks I meant) but bad. A
hypothetical future “perfect” NLP can demonstrate any skill that a human could
learn by reading, and computers can read so much more than any given human.

~~~
plafl
Is reading enough to understand the real world without direct experience of
the real world? Is there any research that tries to answer this question?

~~~
p1esk
_Is there any research that tries to answer this question?_

That's the whole point of the experiment called GPT-3.

------
benibela
Rather than a generator, I could use a good verifier, i.e., an accurate
grammar checker.

------
narag
Has it ever happened that a "thought experiment" has become a real
experiment?

~~~
dane-pgp
Most historians think that this was actually a thought experiment:

[https://en.wikipedia.org/wiki/Galileo's_Leaning_Tower_of_Pis...](https://en.wikipedia.org/wiki/Galileo's_Leaning_Tower_of_Pisa_experiment)

An equivalent experiment was famously carried out for real in 1971 on the
surface of the Moon.

------
jvanderbot
I'd go one step further: Humans themselves don't understand anything, we are
just good at constructing logical-sounding (plausible, testable) stories about
things. These are mental models, and it's the only way we can make reasonable
predictions to within error tolerances of our day-to-day experience, but they
are flat-out lies and stories we tell ourselves not based on a high-fidelity
understanding of anything.

Rumination, deep thinking, etc is simply actor-critic learning of these mental
models for story-telling.

~~~
runT1ME
Do current NLP systems understand arithmetic, and can they do it with
unfamiliar numbers they've never seen? If not, I'd think that your theory is
demonstrably false, as a child can extrapolate mathematical axioms from just
a few example problems, whereas NLP models are not able to do so.

~~~
glenstein
>Do current NLP systems understand arithmetic, and can they do it with
unfamiliar numbers they've never seen?

I don't know if that question is rhetorical or not, but GPT-3 can do basic
math for problems it has not been directly trained on, and there's been a fair
amount of debate, including right here at hn, about what the takeaway is
supposed to be.

~~~
YeGoblynQueenne
GPT-3 can't do arithmetic very well at all. There is a big, fat,
extraordinary claim that it can in the GPT-3 paper [1], but it's based only
on perfect accuracy on two-digit addition and subtraction, ~90% accuracy on
three-digit addition and subtraction, and around 20% accuracy on four- and
five-digit addition and subtraction and on multiplication of two measly
digits. Note: no division at all, and no arithmetic with more than five
digits. And very poor testing to ensure that the solved problems don't just
happen to be in the model's training dataset to begin with, which is the
simplest explanation of the reported results, given that the arithmetic
problems GPT-3 solves correctly are the ones most likely to be found in a
corpus of natural language (i.e. two- and three-digit addition and
subtraction).

tl;dr, GPT-3 can't do basic math for problems it has not been directly trained
on.
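
The kind of check this calls for is simple enough to sketch; ask_model()
below is a hypothetical stand-in for however one queries the model, and the
training-corpus contamination check is the part done poorly in the paper:

    import random

    def ask_model(prompt):
        # Hypothetical stand-in for querying the language model.
        return "0"

    def addition_accuracy(n_digits, trials=1000):
        correct = 0
        for _ in range(trials):
            a = random.randrange(10 ** (n_digits - 1), 10 ** n_digits)
            b = random.randrange(10 ** (n_digits - 1), 10 ** n_digits)
            answer = ask_model(f"What is {a} plus {b}?")
            correct += (answer.strip() == str(a + b))
        return correct / trials

    # One would also need to count how many of the sampled problems
    # appear verbatim in the training corpus.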

____________

[1] [https://arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165)

See section 3.9.1 and Figure 3.10. There is an additional category of
problems combining addition, subtraction and multiplication of three
single-digit numbers. Performance is poor.

~~~
gwern
How ironic you claim that the paper overstates it, when you very carefully
leave out every single qualifier about BPEs and how GPT-3's arithmetic
improves massively when numbers are reformatted to avoid BPE problems. Pot,
kettle.
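
The reformatting alluded to can be as simple as spacing out digit runs so
that each digit becomes its own token instead of an arbitrary BPE chunk (a
sketch of the idea, not necessarily the exact method):

    import re

    def space_out_digits(text):
        # "1234" -> "1 2 3 4": one digit per token for the tokenizer.
        return re.sub(r"\d+", lambda m: " ".join(m.group()), text)

    print(space_out_digits("What is 1234 plus 5678?"))
    # What is 1 2 3 4 plus 5 6 7 8?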

~~~
ximeng
“Byte pair encoding”: more discussion at
[https://www.gwern.net/GPT-3#bpes](https://www.gwern.net/GPT-3#bpes)

