
Natural language isn’t just English [pdf] - happy-go-lucky
http://faculty.washington.edu/ebender/papers/Bender-SDSS-2019.pdf
======
briga
Part of the problem seems to be logistical. English natural language training
sets are probably just way more common than other languages. About the only
language that has more training data is Chinese, and they seem to be doing
fine in the NLP department.

Aren't modern transformer networks able to deal with multiple languages? Seems
like models powerful enough to do that should be fairly language-agnostic,
assuming you can pull together enough training data for that language.

~~~
zdragnar
My very limited understanding is that some interstitial language data is used
to map between languages, but the resulting mappings are often somewhat
"polluted" by the idioms or idiosyncrasies of whichever language you start
with to build them up.

I seem to recall, for example, Google Translate having difficulty with the
word "plane" between two non-English languages, conflating two meanings that
share one spelling in English. This was, iirc, because the internal
representation of the words was originally generated from an English dataset,
and so "airplane" in one non-English language became "to level wood with a
planer" in the other non-English language (or something to that effect).

Mandarin / Chinese has a comparatively simpler grammar (no verb conjugation)
and no phonetic alphabet with arbitrarily varying pronunciation rules, but it
is also chock full of idioms and homophones. There is a lot that can be done,
but the 80/20 rule is in full effect.

~~~
simias
I can definitely attest to the English bias of Google Translate, even if
you're translating between pairs of languages that don't include English. In
particular, as you point out, it's often confused by English homonyms (even if
they're not homonyms in either the source or target language).

For instance, attempting to translate Portuguese "o báculo" (the staff, the
object) into French gives you "le personnel" (the staff, people working
together). It's a completely wrong translation that, I'm guessing, is caused
by the algorithm failing to differentiate these two meanings because they're
homonyms in English:
[https://translate.google.com/#view=home&op=translate&sl=pt&t...](https://translate.google.com/#view=home&op=translate&sl=pt&tl=fr&text=o%20b%C3%A1culo)

Another thing Google Translate is terrible at is handling the various levels
of address. Many languages have the notion of a polite vs. informal "you"
(you/thou), a distinction that has mostly disappeared from modern English.
Google very often translates a formal/polite form of address into an informal
one, or vice versa:
[https://translate.google.com/#view=home&op=translate&sl=fr&t...](https://translate.google.com/#view=home&op=translate&sl=fr&tl=ru&text=vous%20allez%20bien%3F)

Here the polite French "how are you" is translated with the informal "ты" in
Russian instead of "вы". Interestingly, if I then translate the other way
around, it correctly uses the "tu" form in French:
[https://translate.google.com/#view=home&op=translate&sl=ru&t...](https://translate.google.com/#view=home&op=translate&sl=ru&tl=fr&text=%D1%82%D1%8B%20%D0%B2%20%D0%BF%D0%BE%D1%80%D1%8F%D0%B4%D0%BA%D0%B5%3F)

~~~
Veedrac
This is simply because Google Translate relies on language-language models
trained on language-language pairs. Where training data is not available in
quantity, the quality of translations is sufficiently low that going through
English is preferred.
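The pivot failure mode described above can be sketched with toy dictionaries
(the entries and function names here are invented for illustration; real
systems use learned models, not lookup tables):

```python
# Toy translation tables. Pivoting through English collapses the two senses
# of "staff" (the object vs. the people) into one English surface form.
pt_to_en = {"báculo": "staff"}       # Portuguese: staff, the object
en_to_fr = {"staff": "personnel"}    # French table picks the "people" sense

def pivot_translate(word):
    # Translate Portuguese -> English -> French; the sense distinction
    # is lost at the English pivot, because English spells both the same.
    return en_to_fr[pt_to_en[word]]

print(pivot_translate("báculo"))  # "personnel", though "bâton" was meant
```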

This is something that has been solved recently[1] by training ‘massively
multilingual’ models. However, such models come with fairly stark compute
costs, especially given the number of users Google Translate has, so it will
take a while for these advances to roll out. Which ultimately points out how
silly this article is: NLP is all the same in relevant matters, and (to first
approximation) what works for English works for everything else too, as long
as you can get enough training data.

[1] [https://ai.googleblog.com/2019/10/exploring-massively-multil...](https://ai.googleblog.com/2019/10/exploring-massively-multilingual.html)

~~~
TazeTSchnitzel
> what works for English works for everything else too

…if they're similar to English. Which isn't good enough; there are many
important languages that are quite different from English.

~~~
Veedrac
They're all similar to English. Languages have their syntactic ambiguities in
different places, but they're all fundamentally _similar_. In every case the
hard part is in understanding the semantics expressed, and the characters used
to express it are a side-issue.

~~~
TazeTSchnitzel
This is assuming the problems are similar between languages. But are they? An
NLP system tested only on English might be useless on CJK languages, because
they do not use spaces, so the system cannot rely on the almost-free
segmentation you get from English. Another example is that if you try a
heavily-inflected language, you suddenly have vastly more forms of the same
word than in English, and your system needs to be robust to that in a way
that's unnecessary for English.
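The segmentation point is easy to demonstrate with a toy whitespace tokenizer
(a deliberately naive sketch, not how production NLP systems tokenize):

```python
def whitespace_tokenize(text):
    # Splitting on spaces gives near-free word segmentation for English...
    return text.split()

# ...six tokens for the English sentence:
print(whitespace_tokenize("natural language is not just English"))

# ...but the Chinese equivalent comes back as a single "token", since
# CJK text does not mark word boundaries with spaces:
print(whitespace_tokenize("自然语言不只是英语"))
```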

~~~
Veedrac
An NLP model that can't even (implicitly) segment CJK languages is laughable.
It's like saying ‘sure, Magnus Carlsen is really good at chess, but in
draughts you can take multiple moves in a row.’ If you can handle the
ambiguities in natural casual English, you can handle a little inflection.

------
pjmlp
Indeed, this is one of the reasons why I usually don't bother with natural
language processing tools: I want to use my own native language, instead of
English or Brazilian Portuguese (if I am lucky enough for it to be supported).

------
zzo38computer
It is correct: natural language isn't just English; there are many other kinds
too. But the document is itself written in English, so we can guess that it is
English. Still, it doesn't help when you want to distinguish natural language
work that deals specifically with English from work on language in general
(and if in general, you must be prepared to deal with stuff that is different
from English, because other languages have their own features).

------
elfexec
Pages 4 and 5 ( "Languages of the world" ) of the pdf are duplicates.

I agree that NL isn't just English, but the world of computer science,
information processing, AI and even linguistics is heavily English-centric for
a wide variety of reasons. And unless something drastic happens (the end of
Pax Americana and the fracturing of the world order), everything will get
even more English-centric. The internet is spreading all over the world.
American culture is spreading all over the world. And so is American English.

I do wish CS and programming were less English-centric. It doesn't make sense
to me that people in China, Russia, Europe, India, etc. are writing code in
English. Why not translate computer languages into their native languages?
Given the nature of programming languages, it's so simple to do. Wouldn't it
be simpler to translate C or Python or Java into Chinese and have Chinese
students/programmers code in Mandarin, rather than having Chinese students
learn English and then learn to code in English?
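A toy sketch of the idea (the Mandarin keyword spellings and function name
here are invented for illustration, and a real tool would need a proper
tokenizer rather than blind string replacement, which would break inside
string literals):

```python
# Hypothetical Mandarin spellings for a few Python keywords.
MANDARIN_KEYWORDS = {"定义": "def", "返回": "return", "如果": "if", "否则": "else"}

def to_python(source):
    # Naively rewrite localized keywords back to standard Python.
    for zh, en in MANDARIN_KEYWORDS.items():
        source = source.replace(zh, en)
    return source

src = "定义 double(x):\n    返回 x * 2\n"
exec(to_python(src))   # defines double() from the "localized" source
print(double(3))       # 6
```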

~~~
zzzcpan
_> It doesn't make sense to me that people in china, russia, europe, india,
etc are writing code in english. Why not translate computer languages to their
native language?_

It doesn't make sense to you because people are not writing code in English;
they are writing code using a set of characters. Only some of those characters
and keywords overlap with the English language, namely the Latin alphabet, and
those keywords have distinct meanings different from what they mean in
English. Furthermore, most of the characters, including the Latin alphabet,
already have a significant part in every education system in the world. People
are taught the Latin alphabet, some keywords, and most of the operators you
see in programming languages at school, in math and physics classes. So how in
the world would it benefit anyone to have programming languages with non-Latin
alphabets? It can only make things a) harder to learn, because there is no
leveraging familiarity with other fields people already studied, and b) harder
to share code and knowledge, because other people in the world can't read or
input your non-Latin characters.

By the way, this is the same stupid idea that got people to allow unicode
variable names in programming languages.
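The Unicode-identifier feature mentioned here can be seen in Python 3, which
accepts Unicode letters in identifiers (the variable names below are chosen
purely for illustration):

```python
# Identifiers in non-Latin scripts are legal Python 3.
длина = len("пример")   # "length" in Russian; "пример" has six letters
面积 = 3 * 4             # "area" in Chinese
print(длина, 面积)       # 6 12
```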

~~~
TazeTSchnitzel
Programming languages _are_ in English though. The keywords are pulled from
English, the names of functions are clearly made up of English words, the
documentation is in English first and often only English, the untranslated
error messages are in English, the filenames are in English. Already speaking
English is a massive advantage in understanding and working with all of that.

~~~
zzzcpan
Keywords, names, and filenames don't retain their meanings from English, so
it's not important where they came from. And error messages use the same few
English words, but mostly those same keywords, names, and filenames.
Documentation available in local languages is the only important part here.
But documentation and other reading materials are an issue for people who are
already many years into programming, and the deeper they go into the field,
the more necessary it becomes to learn English.

------
the_decider
This blog post discusses the dangers of blindly applying English-optimized NLP
models to other languages: [https://primer.ai/blog/Russian-Natural-Language-Processing/](https://primer.ai/blog/Russian-Natural-Language-Processing/)

~~~
Veedrac
If anything this post disagrees with the claim that ‘English isn't generic for
language’. It considers a naïve approach that fails for English, and shows
that it fails for Russian in a similar way, and then compares it with an
improvement also developed for English, which too helps Russian in a similar
way.

It is true that when Facebook trained their model “blindly”, this introduced
meaningful bugs, so I agree with that claim from the post. It does not,
however, give credence to the idea that techniques developed for English don't
reliably work for other languages.

Unfortunately the specific comparisons they make are suspect given how
different the training sets were.

------
avmich
It could be extremely useful to concentrate efforts on usage and details of
one specific language rather than spread them over all different communication
forms.

The opposite could also be true.

------
natalyarostova
Methods that work on English are an order of magnitude more important than
others. There is a fair methodological point in saying English != natural
language. But the economics of it simply make it the most important one to get
right.

~~~
rahulnair23
Fully agree on the economic merits of English here. I think the point is more
than methodological though. It is one of equity and access.

Broad swaths of society don't have the same access to the technology. My
mother tongue, Malayalam, redefined its script in the 1960s to be better
suited to typewriters. It is still non-trivial to typeset correctly. As my
Chinese colleagues also find, there are all sorts of constraints on how text
is input.

Machine translation beyond the "main" languages isn't too great. To see bias
in action, just try Turkish->English (Turkish is gender neutral): "o mutlu. o
mutsuz." translates to "he's happy. she is unhappy."

Things built for English don't readily transfer. It's worth raising awareness
around this, as Emily does.

~~~
natalyarostova
I think people are aware, it’s just that English is clearly the most
important. That’s just reality.

------
unishark
An NLP researcher is supposed to describe how language is used, not insist on
how it should be used.

~~~
hibbelig
Emily's point is that NLP researchers tend to describe only how English is
used, not how other languages are used.

~~~
unishark
That's like criticizing a biologist for only studying nematodes. Research is
specialized by its nature. She does say a bit about directing research to
other languages. But mostly she complains about researchers not identifying
their research as English, including threats to heckle speakers who don't
identify it as such, i.e. who don't use the spoken language in the way she
prefers.

------
mamon
I would jokingly say that English is halfway between natural languages and
programming languages. I mean: English grammar is so simple and all sentences
are so well structured that processing English is an order of magnitude easier
than, say, German.

~~~
gbear605
English is in fact not particularly simple, especially not if your only
comparison is German. Both English and German can be modeled at approximately
the same level of difficulty, with only a few minor differences. German has
case assignment, but that's simple compared to everything else (and English
has to deal with case too, if not to the same extent). The field of
linguistics has been trying to fully model English syntax for the last 70
years and is still turning up new corner cases. And a lot of languages are in
fact a lot simpler. To quote one of my bilingual friends after waking up from
general anesthesia:
To quote one of my bilingual friends after waking up from general anesthesia:

> [I] couldn't put English in order so I just switched to Russian since I knew
> what syntactic roles I wanted, just not how to order words

Latin, a supposedly more complex language, is another example of a language
with much less strict syntax than English. And Google Translate is horrible at
Latin.

Mandarin, on the other hand, has a more complex grammar than English, but that
doesn’t mean that it is more or less easy for NLP. After all, Google Translate
is horrible at Mandarin.

