
Do NLP Beyond English - dsr12
https://ruder.io/nlp-beyond-english/
======
tanilama
With new directions of research, like pre-training and subword vocabularies, the
models are now mostly language agnostic. English or not, it is just a sequence
of tokens.

The difference mainly lies in data abundance, where the distribution is severely
skewed.
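
To illustrate, here is a minimal sketch using the HuggingFace transformers
library with the multilingual BERT tokenizer (one possible model choice,
assuming the library is installed):

    # One shared subword vocabulary covers many languages; to the model,
    # any input is just a sequence of token IDs. Rare or unseen words are
    # split into smaller pieces rather than failing outright.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    print(tokenizer.tokenize("The dog chased the cat."))
    print(tokenizer.tokenize("Ο σκύλος κυνήγησε τη γάτα."))

The same pipeline runs unchanged on either input; what differs between
languages is how much training data stands behind those tokens.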

------
YeGoblynQueenne
>> English and the small set of other high-resource languages are in many ways
not representative of the world's other languages. Many resource-rich
languages belong to the Indo-European language family, are spoken mostly in
the Western world, and are morphologically poor, i.e. information is mostly
expressed syntactically, e.g. via a fixed word order and using multiple
separate words rather than through variation at the word level.

I thought I'd give an example of this for readers who don't speak many
languages other than English. My native language is Greek, which _is_ an Indo-
European language, but it has some morphological intricacies that make it more
complicated to learn and use (as a second language, and from what I'm told, of
course) than English.

For example, in Greek, nouns have a gender [1]: masculine, feminine or neuter.
For instance, "the dog" is "ο σκύλος" ("the male dog") while "the cat" is "η
γάτα" ("the female cat"). However, grammatical gender is not necessarily fixed,
so one can also say "η σκύλα" ("the female dog") or "ο γάτος" ("the male
cat"). So there is a word root ("σκυλ-" for dog, "γατ-" for cat) that is then
modified by a termination, typically -ος, -η, -ο for the nominative of each
gender. The terminations change depending on the declension; for example, the
genitive of "σκύλος" is "σκύλου" and the genitive of "σκύλα" is "σκύλας",
etc. The termination also changes to denote number, singular or plural
(ancient Greek also used to have a dual number, used to refer to pairs of
nouns). Terminations vary by number according to gender and declension. So,
for example, the plural of "σκύλος" is "σκύλοι", genitive "σκύλων", and the
plural of "σκύλα" is "σκύλες", genitive "σκυλών".

Compare this with English, where a dog is a dog is a dog: there is only one
plural form, "dogs", and only one genitive form, "dog's" or "dogs'", for each
number. And a female dog is either a circumlocution, "female dog", or an
entirely new word, "bitch", with its own simple set of transformations for
number and genitive: "bitches", "bitch's", "bitches'". Basically, changing
meaning in English can be represented by adding a few words to a vocabulary,
but changing meaning in Greek requires manipulating words at the structural
level.

There are probably a few counter-examples to the above, but my understanding is
that this is how it works for the most part. And this, in particular the
ability of English to represent more meanings with more words, might go some
way toward explaining why approaches that work best with large datasets tend to
be favoured in English NLP.

Edit: not an NLP expert or a linguist so corrections welcome, of course!

______________

[1] Other Indo-European languages also have genders but the interesting thing
is that those genders don't always match, between languages. This is an
endless source of confusion for foreign language learners. For example, I have
spoken French from an early age, and I only realised much later in life that Greek and
French do not always use the same gender for the same nouns, even though I'd
been using the right genders in my speech. For example, both "dog" and "cat"
are masculine in French: "le chien" and "le chat". But while I always used "le
chat" correctly, I always thought of it in my mind as "η γάτα", the feminine
Greek noun. Human language is weird!

------
627467
I'm reading lots of comments reacting against the premise of the article in the
name of some kind of efficiency or inherent value of a universal standard.

I wonder if commenters have considered that research is useful to understand
the present and, more importantly, to prepare for the future. While English
being a useful lingua franca today seems to back the decision to focus only on
English research, the point is: will this English uptake continue at the same
rate in the future? Isn't there some kind of bubble effect where, here on HN,
many of us think that (as non-native English speakers) we can all engage, so we
conclude English is the only useful thing for us to understand?

What about all the possible unknown Mandarin NLP research, which (I think many
agree) may be useful for non-Mandarin speakers?

Also, isn't one of the points of information technology to enable/create value
in a long tail of experiences? Focusing on a single attribute of a substantial
part of the world seems counterproductive and shortsighted...

------
mindvirus
Can anyone chime in on how Chinese NLP compares? I speak Mandarin as a second
language at an intermediate level, and from my perspective it has some
interesting properties compared to English (my native language) that seem like
they might make NLP easier in ways where English NLP might be hard:

- The grammar of standard Mandarin is much simpler than that of English. No
verb conjugations or strange tense rules ("go" -> "went")!

- Tense and even voice are explicit and additive, e.g. adding "呢" or "吧" to
a sentence.

- Written, it seems less ambiguous. E.g. "duck" or "grave" have different
meanings in English based on context, but that seems much rarer in Chinese:
two words spoken the same way but with different meanings usually have
different written forms.

~~~
Tainnor
Mandarin has even less morphology than English[1], so to some extent, working
on Mandarin would not "fix" some of the issues of NLP models often performing
poorly on morphology-rich languages.

But it's possible that focusing more on Mandarin would still lead to
improvements in other areas, because the two languages are otherwise quite
different (I wonder how well speech recognition works for tonal languages, for
example?).

[1] Possibly not without reason: there is some indication that languages
spoken by a large number of speakers tend to become structurally simpler,
while very small isolated languages can often develop surprising complexity.

------
PeterStuer
The cheap money, as in oligarchy-backed petrodollar slush funds, resides mainly
in the US. The de facto second language of the world is English. Combine those
two and it is not hard to see why English NLP is so dominant.

I'm not yet sure whether that is a good or a bad thing for non-English-as-a-
first-language regions.

~~~
ShamelessC
I don't have a horse in this race, but surely it could only be bad for those
regions, right? There's a whole world of applications involving NLP which non-
English speakers don't have access to.

~~~
PeterStuer
On the other hand, a _huge_ amount of those applications target consumers to
get them to misspend more money, so not having them might be a benefit?

Surveillance capitalism relies on getting ever more info about you, and
exploiting it or selling it on to all that want to exploit it. NLP is one
vector to extract that info from raw data.

------
YeGoblynQueenne
>> Recent models have repeatedly matched human-level performance on
increasingly difficult benchmarks—that is, in English using labelled datasets
with thousands and unlabelled data with millions of examples. In the process,
as a community we have overfit to the characteristics and conditions of
English-language data. In particular, by focusing on high-resource languages,
we have prioritised methods that work well only when large amounts of labelled
and unlabelled data are available.

This is an interesting observation. Training on large datasets tends to be
framed as a strength; e.g. we have recently seen articles praising OpenAI's
GPT-3 for being, well, big. The truth is that model size (and the associated
dataset size) is a bug, not a feature. It is the result of a dearth of
research on approaches with low sample complexity. Or in other words, it's the
result of consistently picking the low-hanging fruit and calling sour grapes
on any domain for which there isn't sufficiently large data ("who cares about
Swahili?").

>> In contrast, most current methods break down when applied to the data-
scarce conditions that are common for most of the world's languages. Even
recent advances in pre-training language models that dramatically reduce the
sample complexity for downstream tasks (Peters et al., 2018; Howard and Ruder,
2018; Devlin et al., 2019; Clark et al., 2020) require massive amounts of
clean, unlabelled data, which is not available for most of the world's
languages (Artetxe et al., 2020). Doing well with few data is thus an ideal
setting to test the limitations of current models—and evaluation on low-
resource languages constitutes arguably its most impactful real-world
application.

And this is interesting to read in the context of the recent manic hype about
GPT-3, itself described in a paper titled "Language models are few-shot
learners", a title that hilariously attempts to sweep under the carpet the
amount of data and compute required to get to the point where a language model
can do "few-shot" learning.

In general, the excitement about the successes that have come from training
large, deep neural network models on very large datasets has only served to
avoid posing the obvious question about such data-hungry approaches: what do
we do when there isn't a lot of data? Human language, with all its minutely
fragmented diversity, turns out to be just that kind of domain.

~~~
ShamelessC
I see this argument a lot. I have essentially zero ML experience, but isn't it
fair to say that DNA indeed carries a similarly large amount of data? Also,
it's not like any human is just born fully capable of inventing an entire
language. It seemingly takes entire societies to pull this off.

Granted, our ML isn't even close to human intelligence yet. But why is the
assumption made that humans are "trained" with very little data? We've
collected millions of years of data to accomplish what we have.

~~~
YeGoblynQueenne
Humans seem to come into the world with very good quality biases that allow us
to learn any human language, but I don't think anyone has ever found DNA
encoding information about, say, the English language. Am I misunderstanding
your comment? If so, I apologise; I'm not sure where DNA comes in.

>> Also, it's not like any human is just born fully capable of inventing an
entire language. It seemingly takes entire societies to pull this off.

This is a complicated matter and, as I'm not a linguist, I don't know what
theories there even are. However, I think a human child is perfectly capable
of inventing an entire language. Small groups of children are certainly
capable of doing that in the early stages of their lives. For example, see
Creole languages [1], which are the languages invented by the children of
people who only have a pidgin as a common language. A pidgin is not a fully
developed language; pidgins are used e.g. for communication between migrant
communities with different origins that do not share a common language. A
Creole, by contrast, is a fully developed natural language invented by the
migrants' children, who have their parents' shared pidgin as a maternal
language. The children start with their parents' pidgin and develop it into a
fully formed natural language with all the bells and whistles. It's quite a
thing to read about, really; have a look at the Wikipedia article I link
above, it's very interesting.

Also, as far as I understand it, there's a lot that we don't know about how the
first human languages were created. From some short readings, the current
understanding is that at some point, humans must have become capable of
producing natural language, whereas before they weren't. The way things usually
work, the turning point for this was most likely some random mutation that
affected a very few individuals, possibly no more than two. So the first
natural language was probably invented not by an entire society but by a
couple of people, probably a pair of siblings. The idea is that the ability to
use natural language gave those very few individuals an important advantage
that ensured its propagation through generations until our time.

In any case, all that concerns the invention of a new language. Learning an
_existing_ language is quite another matter. Children who grow up to learn
their maternal language don't have access to "millions of years of data". They
certainly don't have access to a few billion examples of utterances in their
maternal language, like the datasets that language models are trained on. That
is to say, kids learn whatever language is their maternal language very, very
quickly as they grow up and from very few examples. And they're capable of
learning _any_ of the few thousand natural languages, beyond English. For me
at least, this lends credence to the idea of an innate bias to learn a certain
type of language, a type broad enough to cover every human language. To put a
name on this idea, that's through and through Noam Chomsky's idea of an innate
"Universal Grammar". See Wikipedia for criticisms of UG and linguistic
nativism in general [2].

____________

[1]
[https://en.wikipedia.org/wiki/Creole_language](https://en.wikipedia.org/wiki/Creole_language)

[2]
[https://en.wikipedia.org/wiki/Universal_grammar#Criticisms](https://en.wikipedia.org/wiki/Universal_grammar#Criticisms)

------
LockAndLol
Though I agree with the main point, the blog post went off on a tangent about
how some other languages are in danger because of English. IMO, if the
speakers of a language:

1. Do not provide materials in their own language,

2. Do not consume materials in their own language, and

3. Actively and exclusively contribute to the world in English,

it is already a statement about the necessity of their language: it isn't
necessary. Why should non-speakers of a certain language invest money and/or
time in a language its own people don't even invest in? If its own speakers
kill the language (actively or passively), who are we to say they shouldn't?

~~~
slx26
> Why should non speakers of a certain language invest money and/or time on a
> language its own people don't even invest in?

Because rational choices at an individual level do not necessarily align with
rational choices at a social collective level. "Meditations on Moloch" is a
good read on this topic. You can easily see how, collectively, language
diversity is positive, but individually it kinda makes sense to bet on the
biggest language instead. And to be fair, there are a lot of people investing
in their own languages, just not enough, or not with enough power.

------
toolslive
My native language is Dutch. Google handles it poorly. As a result, people have
adapted to either search in English, or to morph the original search (that
Google does not understand) into something that has a higher chance of
yielding good results. I'm pretty sure Dutch isn't the only language that
fares this way.

Now, the main reason (at least, I think it is) it's handled poorly is that
Dutch has an infinite number of words (*) and order in compositions matters
a lot. This conflicts with the typical precomputations that try to reduce the
number of words, etc.

(*) Take any two nouns A and B: you can concatenate them into a new noun AB or
a new noun BA (with a completely different meaning). German has this concept
too.
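
To see why this breaks naive indexing, here is a minimal sketch of a greedy
compound splitter over a toy lexicon (illustrative only; a real decompounder
also handles linking elements like the Dutch "-s-" and scores competing
splits):

    # Recursively split a compound into nouns from a known lexicon.
    # Toy Dutch lexicon: fiets (bike), winkel (shop), deur (door), bel (bell).
    LEXICON = {"fiets", "winkel", "deur", "bel"}

    def split_compound(word, lexicon):
        """Return one split of `word` into known nouns, or None."""
        if word in lexicon:
            return [word]
        for i in range(1, len(word)):
            head, rest = word[:i], word[i:]
            if head in lexicon:
                tail = split_compound(rest, lexicon)
                if tail is not None:
                    return [head] + tail
        return None

    print(split_compound("fietswinkel", LEXICON))    # ['fiets', 'winkel']
    print(split_compound("winkeldeurbel", LEXICON))  # ['winkel', 'deur', 'bel']

A search engine that doesn't do something like this treats every novel
compound as an unknown word, which matches the experience described above.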

~~~
yorwba
> take any 2 nouns A & B. you can concatenate them into a new noun AB or a new
> noun BA (with completely different meaning). German has this concept too.

This is also how it works in English, except the concatenation may be written
with a space as "A B". If anything, writing it as "AB" should make search more
accurate, since you're less likely to get results with "BA" than getting
results with "B A" when searching for "A B".

~~~
Tainnor
Only if your NLP models actually decompose words. Simpler ones don't.

But while it's technically true that English also has compounds, just written
apart, it is also the case that German (I don't know about Dutch) uses
compounds, and especially long compounds with more than two nouns, a lot more
than English, where you might more often use a whole sentence.

------
laGrenouille
I generally agree with the point made in the article that too much NLP research
is focused only on English and a small number of other high-resource languages.
To me, this is part of a larger problem: natural language processing's
obsession with "state-of-the-art" metrics and general abandonment of broader
research in linguistics.

The situation for those who, like myself, work in applied linguistics is not
actually so dire... At least when working with the languages that have a
reasonable amount of training data. Decent enough treebanks exist for
lemmatisation, POS-tagging, and dependency parsing for dozens of languages
[0]. Fast tools such as spaCy (15 languages) [1] and udpipe (40+ languages)
[2] are freely available and work well for most applied tasks. There are even
decent word embeddings available trained on various versions of Wikipedia [3].
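
As a small illustration of how far the off-the-shelf tooling goes, here is a
sketch using spaCy's German pipeline (assuming spaCy is installed and the
model downloaded via `python -m spacy download de_core_news_sm`):

    # Lemmatisation, POS tagging and dependency parsing in German,
    # using one of spaCy's pretrained non-English pipelines.
    import spacy

    nlp = spacy.load("de_core_news_sm")
    doc = nlp("Die Hunde des Nachbarn bellen laut.")

    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_)

Swapping in another supported language is mostly a matter of loading a
different model.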

Of course, some of the issue is that these very tasks are biased towards the
specific structures of Indo-European languages. However, for getting work done
(building sentiment classifiers, document clustering, NER, etc.), currently
available tools make it possible to work with a large proportion of the
currently available data.

There is still a lot of work to be done with regard to computational research
on non-English languages, but a lot of the problem right now is recognition of
applied work by the top NLP conferences, rather than a complete lack of quality
work currently being done.

[0] [https://universaldependencies.org](https://universaldependencies.org)

[1] [https://spacy.io/models](https://spacy.io/models)

[2]
[https://ufal.mff.cuni.cz/udpipe/models](https://ufal.mff.cuni.cz/udpipe/models)

[3]
[https://github.com/facebookresearch/fastText/blob/master/doc...](https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md)

~~~
dhairya
One thing I think about is where NLP and linguistics research would have gone
if another language (other than English) were the predominant one. Granted,
there is quite a bit of interesting fundamental research in other countries
and languages. But it seems many of the problems and approaches developed in
language understanding and NLP target specific idiosyncrasies of the English
language. If NLP research had started with, say, German (which has a deep and
highly explicit vocabulary) or Hindi or Mandarin, would the fundamental NLP
approaches and problem areas be different, and potentially better?

~~~
TomMarius
I recommend checking out NLP research on the Czech language. It has
high-quality researchers keen on making use of the language's features (it has
a highly regular and expressive grammar). Sadly, not much research is done, but
what is done is interesting.

------
marcinzm
> In contrast, most current methods break down when applied to the data-scarce
> conditions that are common for most of the world's languages. Even recent
> advances in pre-training language models that dramatically reduce the sample
> complexity for downstream tasks (Peters et al., 2018; Howard and Ruder,
> 2018; Devlin et al., 2019; Clark et al., 2020) require massive amounts of
> clean, unlabelled data, which is not available for most of the world's
> languages (Artetxe et al., 2020).

Languages are generally not unique, and this is how humans are able to deal
with sparse data as well. So you can train a language model on text across all
languages, and the model will transfer common structures to languages with
sparse data. XLM did this and seems to perform fairly well, although I don't
know how sparse you can go before it breaks down.
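
As a rough sketch of the idea with a publicly available descendant of XLM
(XLM-RoBERTa via the HuggingFace transformers library, assuming it and a
backend such as PyTorch are installed):

    # One model and one shared subword vocabulary across ~100 languages:
    # the masked-word objective is identical regardless of the input
    # language, which is what lets structure transfer between languages.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

    print(fill_mask("The capital of France is <mask>.")[0]["token_str"])
    print(fill_mask("La capitale de la France est <mask>.")[0]["token_str"])

How well that transfer holds up as the target language's data gets sparser is
exactly the open question.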

> Doing well with few data is thus an ideal setting to test the limitations of
> current models—and evaluation on low-resource languages constitutes arguably
> its most impactful real-world application.

This is what ML and stats have done for the last X decades, and it has
limitations due to the need to constrain the problem with human-generated
rules/knowledge/axioms. The new large-scale language models skip the
human-rules step and are thus able to learn the long tail of language
structure. I don't see how you can have it both ways, as the unconstrained
problem is infeasible to learn on small data: there isn't enough to generalize
from.

~~~
thomasahle
> I don't see how you can have it both ways as the unconstrained problem is
> infeasible to learn on small data as there isn't enough to generalize from.

I mostly agree with you, but it does seem like current models require more
data to learn language than human babies do.

This would suggest there is a middle ground between old-fashioned, data-starved
learning and current overfed models.

~~~
Eridrus
One random note I heard on a podcast about child development is that children
do not improve their language skills by watching TV or educational material;
they need feedback to learn.

Contrast that with language models, which do nothing besides watch what people
write, and it potentially points to a reason why the amount of data is not
comparable.

~~~
rvense
Even more crucial is that children learn language in situations of actual
language use. The things and situations that are spoken about are often at
hand in some way, and there is a large social element.

No matter how close you might get to "the language learning algorithm", if all
you're feeding it is text, it's not going to learn the same thing that kids
learn. The data is simply not there.

------
jaimex2
The money's all in English-speaking countries and China, so... do NLP in
Mandarin too.

------
bfung
A cynical take supporting the above:

NLP in other languages can help sell better ads in those languages.

You all can follow that chain of thought down to its corporate conclusion =)

------
eukgoekoko
It's not just NLP: English is pretty bad as an intermediate language for
translations from language A to language B. If I try translating the Russian
word "пружина" ("a mechanical spring") to German using Google Translate, I end
up with "Frühling", which in fact means "springtime". This is an obvious
artefact of transitive translation: Russian -> English -> German.

Providing context may help, but translation to English still strips important
pieces of information.

~~~
frankie_t
I don't think this is inherent to English, or any other language (except
perhaps in specific cases where there is no word with a similar meaning).

I think in general we pack a concept into a word and lose some information this
way, so when you want to be precise with what you are saying, you have to bring
your definitions with you. Essentially, with translation you take a concept and
"pack" it into a word, then look for an equivalent packing in a different
language, then unpack. Naturally, this process is prone to losing information.

~~~
eukgoekoko
I cannot agree. Not only does English seem to have many homonyms (the word
"spring" alone has more than two meanings), its grammar is also somewhat
primitive. Let me bring one more example, this time the other way around:
Google Translate from German to Russian. The verb "tragen" ("to wear") is
translated as "износ" ("wear"), which is a noun. Using English, we lose
important knowledge: we have no clue what part of speech the word "tragen" is.

This isn't an issue for any considerably long fragment of text, which will be
properly translated due to context analysis. Still, if the text were analyzed
using German in the first place, this would be less of a problem.

~~~
Tainnor
You're confusing two separate things: linguistic complexity and ambiguity.

Linguistic complexity is hard to measure, but it's not hard to show that at
least morphologically, English is undercomplex compared to many other
languages.

This doesn't necessarily mean that English is more ambiguous, though. Unlike
German, English typically has very rigid word order, so in the context of a
sentence, you'll know if a specific word is a noun or a verb.
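
A quick way to see this in practice: statistical taggers lean on exactly that
rigid word order. A minimal sketch with spaCy's small English model (assuming
it is installed via `python -m spacy download en_core_web_sm`):

    # The same surface form "flies" receives a different part of speech
    # depending on where it sits in the sentence.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    for sentence in ("He flies to Paris.", "The flies are annoying."):
        doc = nlp(sentence)
        print([(token.text, token.pos_) for token in doc])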

The problem here is that many NLP models inadequately capture syntactic
structure.

~~~
eukgoekoko
Sorry if it was confusing, I really wanted to mention both a) lexical
ambiguities b) syntactic ambiguities as possible obstacles for NLP.

> Unlike German, English typically has very rigid word order, so in the
> context of a sentence, you'll know if a specific word is a noun or a verb.

So you say you are able to guess from the word order what part of speech a
particular word is. But with German you hardly need all this guessing.

If you compare two marginal examples:

- English: "time flies like an arrow"

- German: "Wenn Fliegen fliegen hinter Fliegen..."

you'll find that the English one has far more possible interpretations.

~~~
Tainnor
> So you say you are able to guess from the word order what part of speech a
> particular word is. But with German you hardly need all this guessing.

Not really. It's not about guessing: in English, the part of speech really is
mostly determined by its syntactic structure.

> If you compare two marginal examples: English "time flies like an arrow" and
> German "Wenn Fliegen fliegen hinter Fliegen..."

Not sure what you're trying to say here. The English example is ambiguous, yes
(and only strictly grammatically; semantically the meaning is clear, unless
you're using it in the phrase "time flies like an arrow, fruit flies like a
banana", which is meant as a linguistic joke). It's also very easy to come up
with examples of phrases or sentences that are ambiguous in German, or in any
language for that matter. Here are some fun examples:

"Er liest das Buch seiner Schwester vor" (could either mean "he's reading the
book to his sister" or "he's reading his sister's book to someone")

"der weiße Schimmel" ("white mould", or "white horse")

"wilde Tiere jagen" ("to hunt wild animals", or "wild animals are hunting")

and don't even get me started on the ambiguity of compound words or phrases
with a genitive, where there are often tons of potential interpretations
depending on the intended relationship between head and dependent noun.

And the German example you gave (fully: "Wenn Fliegen hinter Fliegen
fliegen, fliegen Fliegen Fliegen nach", or "if flies fly behind flies, flies
fly after flies") is a) another joke sentence nobody uses in practice, and b)
exactly a case where you can only distinguish the part of speech (and the
grammatical case) of a word from the syntactic structure and not from its
morphology, something you claimed doesn't happen in German, but here it
clearly does.

Look, you may make a case that it's easier for English sentences to be
ambiguous than for some other languages, but I would need to see some good
data before I believed that claim, because it's just not something that is
immediately obvious.

~~~
eukgoekoko
I still think you're missing my point, although I am impressed by your German
skills ("der Schimmel" is BTW just a homonym, it's hardly related to the topic
of syntactic ambiguity).

> is a) another joke sentence nobody uses in practice, and b) is exactly a
> case where you can only distinguish the part of speech (and the grammatical
> case) of a word from the syntactic structure and not from its morphology,
> something you claimed doesn't happen in German, but here it clearly does.

I didn't make such a strong claim. All I wanted to say is that in German,
syntactic ambiguities are much less of a problem than in English. I brought
two pieces of anecdotal evidence to let you compare possible ambiguities in
both languages; these two are indeed nothing but jokes.

But let's take a closer look at them once again.

a) "Time flies like an arrow": the word "time" can be 1) a noun 2) an
adjective 3) a verb in declarative form 4) a verb in imperative form. This
gives us a factor of 4 on the very first word of the sentence.

b) "Wenn Fliegen hinter Fliegen fliegen" \- ambiguitity exists just between
"fliegen" as a verb and "Fliegen" as a plural noun, thus the "ambiguity
factor" of the word "f/Fliegen" is just 2.

> but I would need to see some good data before I believed that claim, because
> it's just not something that is immediately obvious.

Fair enough.

------
azangru
I am not a native English-speaker, but my dream is that the world gradually
converges on a single standard for international communication, and that this
standard is English, so that an ever-growing proportion of the population will
become fluent in it. I understand various reasons for why this is unlikely to
happen; and I realise that there've been lots of centrifugal forces lately
that make this vision ever less likely; but I can't help but smile inwardly
when I hear that NLP is so focused on English, and can't help rooting for
it to remain so.

~~~
inetsee
I am a native English speaker. The problem I have with English as the
Universal International Language is that it gives enormous economic and
political power to the United States. I am not convinced that the United
States can be relied upon to use this power responsibly.

~~~
azangru
> is that it gives enormous economic and political power to the United States

All the best to them :-) I think having models in nations that actually use a
language as their native tongue is far better than choosing a language for
which there is no living model (such as Latin or Esperanto).

But then, why don't you also add Britain, Australia, Canada or South Africa to
your list?

~~~
inetsee
I am aware that there are other countries that have English as their primary
language. According to Wikipedia
[https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nomi...](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_\(nominal\))
the US GDP is approximately 3.45 times the combined GDP of the other countries
you named. I stand by my statement that the US has "enormous economic and
political power".

edit: I personally think Esperanto would be a good choice as an International
Language. I believe it's much easier to learn than English.

~~~
azangru
I am not disputing this statement.

Moreover, it is well known to sociolinguists that political and economic power
is something that makes the language spoken by a nation more prestigious and
more desirable to attain, and, therefore, contributes to the spread of that
language. So, for the purposes of my dream, I can only wish that the US
remains as influential as it is (and doesn't switch to Spanish in the process)
:-)

I guess my point is that I am not sure that, say, having other countries where
English is the primary language (as listed in the previous comment)
contributes to the power of the US. Likewise, I am not sure that the further
spread of English as an international language will by itself contribute
significantly to the power of the US.

> I personally think Esperanto would be a good choice as an International
> Language

Ugh, not a fan. English is a vibrant, living language that already is being
used by huge swathes of people for various purposes; Esperanto is not.

