
Unsupervised machine translation - Kemet
https://code.fb.com/ai-research/unsupervised-machine-translation-a-novel-approach-to-provide-fast-accurate-translations-for-more-languages/
======
heydenberk
The previous paper they mention explains the core insight that makes
unsupervised translation possible:
[https://arxiv.org/abs/1710.04087](https://arxiv.org/abs/1710.04087)

The original paper didn't receive the attention I thought it would, but I
continue to think this is a fascinating result which has deep implications for
machine learning and for linguistics.

~~~
schoen
These word embeddings keep on yielding all kinds of amazing benefits. Is there
any kind of explainability research to help people understand them better in
terms of human psychology?

~~~
yorwba
Word embeddings work because they reflect co-occurrences. I don't know whether
that counts as an explanation in terms of psychology, but humans tend to put
related things together. In a newspaper the articles aren't jumbled together:
there are sections on different topics, within each section the articles are
clearly delineated rather than mixing their sentences, and each sentence
represents a single unit instead of giving partial information on a dozen
unrelated things.

It might seem obvious that things should be done that way, but if you consider
servers hosting lots of different websites on the same physical machine, or
data structures spread out over several memory allocations held together by
pointers, it's clear that there are other possibilities. So it does seem to be
specific to the way humans use language.

And because human language has this property of co-occurrences corresponding
to relatedness in meaning, you can represent the meaning of a word by building
a model that only predicts the probability that two words occur together.
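
To make that concrete, here's a toy sketch of the idea (my own illustration,
not the method from the article): count which words appear near which other
words, factor the co-occurrence matrix, and words that keep similar company
come out with similar vectors.

```python
# Toy sketch: derive word vectors purely from windowed co-occurrence counts,
# then take a truncated SVD of the count matrix as the embedding.
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stock prices fell on monday".split(),
    "stock markets fell sharply".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often two words appear within a +/-2 word window of each other.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Low-rank factorization of the co-occurrence matrix gives dense vectors.
u, s, _ = np.linalg.svd(counts, full_matrices=False)
embeddings = u[:, :2] * s[:2]   # 2-dimensional vectors for this toy example

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Words that share contexts ("cat"/"dog") tend to end up closer together
# than words from unrelated topics ("cat"/"stock").
print(cosine(embeddings[idx["cat"]], embeddings[idx["dog"]]))
print(cosine(embeddings[idx["cat"]], embeddings[idx["stock"]]))
```

Real systems use smarter objectives (word2vec's skip-gram, GloVe, etc.), but
the signal they exploit is the same co-occurrence structure.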

~~~
PeterisP
In linguistics, there's a classic principle "You shall know a word by the
company it keeps" (Firth, J. R. 1957) - collocations (sequences of words or
terms that co-occur more often than would be expected by chance) are _very_
informative about what a word means.
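
A toy way to see how informative collocations are is pointwise mutual
information, the classic collocation score: how much more often do two words
co-occur than their individual frequencies would predict? (The example below
is my own, not from the article.)

```python
# Toy sketch: pointwise mutual information as a collocation score.
# PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ) -- large values mean the pair
# co-occurs far more often than its individual frequencies would predict.
import math
from collections import Counter

tokens = ("new york stock exchange opened . new york is large . "
          "the exchange opened early .").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(x, y):
    p_xy = bigrams[(x, y)] / (n - 1)
    p_x, p_y = unigrams[x] / n, unigrams[y] / n
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

# A pair that always occurs together ("new york") scores higher than one
# that pairs up only occasionally ("opened" followed by ".").
print(pmi("new", "york"))    # ~2.1
print(pmi("opened", "."))    # ~1.0
```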

------
Grue3
How does it learn idioms (sequences of words that make no sense when
translated word-for-word into another language)? For example, "Stop beating
around the bush!" would result in complete nonsense if translated literally
into any language other than English.

------
hervature
Looks like promising research. I will have to read the actual paper later
rather than just the blog post. One thing I would like to raise is my qualm
with the obsession with unsupervised learning as the quintessential technique.
Unsupervised != no human input. The knowledge of word embeddings is inherently
built into this system by its human designers, a benefit that supervised
learning does not get when it is simply given original-translation pairs.

~~~
ageitgey
> One thing I would like to say is my qualm with the obsession with
> unsupervised learning as the quintessential technique. Unsupervised != no
> human input.

I'm not sure I follow the qualm you are trying to get across. Are you saying
you disagree with the term 'unsupervised' because unsupervised algorithms
still bake in human assumptions (like a human-designed word embedding model)
so that's essentially still supervision?

The obsession with 'unsupervised' learning as the quintessential technique is
about getting better results for less money/effort. The premise is that deep
models tend to scale up in accuracy as training data size increases, so we
always want larger datasets. But creating labeled data takes a linear amount
of human effort ($$$) as the dataset size grows. At a certain point, creating
more labeled data to improve a model is not cost-effective, or maybe not even
possible.

Unlabeled data can be acquired nearly for free in nearly unlimited quantities
in many cases. So if we can use unlabeled data instead (even if it requires
complex pre-processing, like CBOW embedding models which essentially turn bits
of the unlabeled data into its own label), the final results per dollar
invested go through the roof compared to supervised learning. That's the
obsession. It's not about literally no supervision being involved in the
process. It's about driving down the cost of data acquisition while driving up
the percentage of the world's available data you can use for training a model.
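
To make the "data becomes its own label" point concrete, here's a rough
sketch of how CBOW-style training pairs are carved out of raw text: the
surrounding words are the input and the hidden middle word acts as the label,
with no human annotation anywhere.

```python
# Toy sketch of CBOW-style self-supervision: each (context, center word) pair
# is manufactured directly from unlabeled text.
text = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(text):
    context = [text[j]
               for j in range(max(0, i - window), min(len(text), i + window + 1))
               if j != i]
    pairs.append((context, target))   # input = context, "label" = center word

for context, target in pairs[:3]:
    print(context, "->", target)
# ['quick', 'brown'] -> the
# ['the', 'brown', 'fox'] -> quick
# ['the', 'quick', 'fox', 'jumps'] -> brown
```

A model trained to predict the target from the context ends up encoding the
co-occurrence statistics of the whole corpus in its word vectors, which is
exactly the cheap signal being exploited here.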

I apologize in advance if I'm missing the point you are making.

~~~
hervature
No need to apologize, I like HN precisely because people point out the
flaws/confusing parts of my comments/opinions. I agree with everything you say
and am happy you said it because this is how unsupervised learning should be
viewed. I.e., a better ROI in specific cases. However, I have seen too often
the "cake of AI" where the batter is unsupervised, the icing is supervised,
and the cherry on top is reinforcement learning. Somehow, this image connotes
that unsupervised is at the core of AI and also the most important. For what
is a cake consisting of only icing and a cherry?

Where I disagree with you is that the obsession is purely driven by
"results-per-dollar-invested", at least in the academic world. That being
said, unsupervised learning is a great tool and definitely worthy of research.

To summarize, my comment was completely tangential to this paper (the authors
make no such claims). It was more of a stream-of-consciousness comment that
arose because I envisioned someone reading the paper and saying "See!
Unsupervised learning leads to real understanding, no humans needed!"

------
riku_iki
So, they built a translation system using bilingual dictionaries, then asked
it to translate from English to Urdu and back to English, and minimized the
loss between the original and the double-translated English.
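
Roughly, the round-trip signal looks like this (a toy sketch with
word-substitution tables standing in for the neural models, and made-up
placeholder words rather than real Urdu):

```python
# Toy sketch of the round-trip ("back-translation") signal. The dictionaries
# are hypothetical placeholders; the real systems are neural seq2seq models.
en_to_xx = {"the": "le", "cat": "chat", "sleeps": "dort"}   # toy forward model
xx_to_en = {"le": "the", "chat": "cat", "dort": "sleeps"}   # toy backward model

def translate(sentence, table):
    return [table.get(w, w) for w in sentence]

def round_trip_loss(sentence):
    # Translate out and back, then count the words that fail to reconstruct.
    reconstructed = translate(translate(sentence, en_to_xx), xx_to_en)
    return sum(a != b for a, b in zip(sentence, reconstructed))

print(round_trip_loss("the cat sleeps".split()))   # 0: the models are consistent
xx_to_en["chat"] = "dog"                           # corrupt the backward model
print(round_trip_loss("the cat sleeps".split()))   # 1: loss flags the mismatch
```

The actual training minimizes a reconstruction loss like this over the model
parameters, which is what gives a learning signal without any parallel data.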

In one of my previous companies we used this technique to hire pairs of
translators: we gave translator pairs this kind of task and hired the pair
that reconstructed the original text most closely.

~~~
apendleton
No, they didn't start with dictionaries, or any other parallel corpora; they
learned the word-by-word translations from monolingual corpora as well, by
finding alignments between monolingual word embeddings in the target
languages.
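
Once the two embedding spaces have been aligned (the genuinely hard,
unsupervised part, which the earlier paper tackles with adversarial
training), inducing a word-by-word dictionary is essentially nearest-neighbor
lookup. A toy sketch with made-up "already aligned" vectors:

```python
# Toy sketch of dictionary induction from aligned monolingual embeddings.
# The vectors below are invented; in practice they come from the learned
# mapping of one language's embeddings into the other's space.
import numpy as np

src_words = ["cat", "dog", "house"]
tgt_words = ["chat", "chien", "maison"]

src_vecs = np.array([[0.90, 0.10, 0.00],    # "cat" mapped into target space
                     [0.80, 0.30, 0.10],    # "dog"
                     [0.00, 0.10, 0.90]])   # "house"
tgt_vecs = np.array([[0.92, 0.12, 0.02],    # chat
                     [0.78, 0.31, 0.12],    # chien
                     [0.05, 0.08, 0.88]])   # maison

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Cosine similarity between every source and target word, then take the
# nearest target neighbor as the induced translation.
sims = normalize(src_vecs) @ normalize(tgt_vecs).T
for i, w in enumerate(src_words):
    print(w, "->", tgt_words[int(np.argmax(sims[i]))])
# cat -> chat, dog -> chien, house -> maison
```

(In the actual papers a smarter retrieval criterion than plain cosine is used
to deal with hubness, but the idea is the same.)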

~~~
pc2g4d
Next step: unsupervised word segmentation. That way they could maybe apply
this unsupervised translation system to undeciphered texts, e.g. Linear A,
Rongorongo, etc. I doubt it will work since most of the undeciphered scripts
have very small corpora, but it may be worth a try.

------
snadal
"We can improve upon this by making local edits using a language model that
has been trained on lots of monolingual data to score sequences of words in
such a way that fluent sentences score higher than ungrammatical or poorly
constructed sentences."

Does anyone know of references / corpora in English for this?
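
For what it's worth, the mechanism in the quote can be sketched with even a
tiny bigram language model trained on monolingual text (toy example, not what
they actually use):

```python
# Toy sketch of fluency scoring: a bigram language model trained on
# monolingual text assigns a higher score to well-ordered sentences than to
# scrambled ones. Real systems use far larger corpora and neural LMs.
import math
from collections import Counter

monolingual = [
    "<s> the cat sat on the mat </s>".split(),
    "<s> the dog sat on the rug </s>".split(),
    "<s> the cat likes the dog </s>".split(),
]

unigrams = Counter(w for s in monolingual for w in s)
bigrams = Counter(b for s in monolingual for b in zip(s, s[1:]))
vocab_size = len(unigrams)

def score(sentence):
    """Log-probability under an add-one-smoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
               for a, b in zip(words, words[1:]))

print(score("the cat sat on the mat"))   # fluent order: higher score
print(score("cat the mat on sat the"))   # scrambled order: lower score
```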

------
TaylorAlexander
I have a feeling the answer is “no”, but can anyone comment on whether or not
this could be used to decode the utterances of other animals, such as whales?

~~~
snadal
You would need a bilingual dictionary as the first step of the process, so I'm
afraid that this will not be possible.

~~~
apendleton
Without commenting on whether this could be applied to whales, the method
described here does not require a bilingual dictionary: they learned their
"dictionary" in an unsupervised way, by aligning monolingual word embeddings.

------
londons_explore
All of this seems to be using word embeddings, but most languages don't have
all that many words. You're effectively trying to train with just a few
thousand data points, and will quickly overfit.

Wouldn't the method work better with n-gram embeddings, where n=3 or 4?

~~~
guismay
The method is straightforward to extend to n-grams; this is actually what they
do (see Table 1 in
[https://arxiv.org/abs/1804.07755](https://arxiv.org/abs/1804.07755)).

And since you are simply learning a rotation matrix, there is no risk of
overfitting.
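
For anyone wondering what "just learning a rotation matrix" looks like, here
is a toy sketch of the orthogonal Procrustes step used in this line of work
(random vectors stand in for real embeddings; the fully unsupervised setup
bootstraps the seed pairs with adversarial training instead of being given
them). The mapping is a fixed d x d orthogonal matrix no matter how large the
vocabulary is, which is why there is so little room to overfit.

```python
# Toy sketch: recover the rotation between two embedding spaces with the
# closed-form orthogonal Procrustes solution, W = U V^T where
# U S V^T = svd(X^T Y).
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200                       # embedding dimension, number of word pairs

X = rng.normal(size=(n, d))         # "source language" embeddings
true_rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = X @ true_rotation + 0.01 * rng.normal(size=(n, d))   # noisy "target" side

u, _, vt = np.linalg.svd(X.T @ Y)
W = u @ vt                          # best orthogonal map: X @ W ~ Y

print(np.linalg.norm(X @ W - Y))    # small residual: the rotation is recovered
```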

