
Using AI to match human performance in translating news from Chinese to English - Maimedpuppet
https://blogs.microsoft.com/ai/machine-translation-news-test-set-human-parity/
======
anewhnaccount2
Compare this to one of Google's blog posts promoting their MT research:
[https://research.googleblog.com/2016/09/a-neural-network-for...](https://research.googleblog.com/2016/09/a-neural-network-for-machine.html)

It is:

1) More accurate. Compared to hyperbole like "Bridging the Gap between Human
and Machine Translation", here we have the domain right there in the title:
news.

2) A more impressive result, obtained on an independently set up evaluation
framework, whereas Google evaluated on their own framework.

Compare further the papers:
[https://arxiv.org/pdf/1609.08144.pdf](https://arxiv.org/pdf/1609.08144.pdf)
[https://www.microsoft.com/en-us/research/uploads/prod/2018/0...](https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf)

These researchers appear to have been much clearer about what they're actually
claiming, and they also used more standard evaluation tools (Appraise) and
methodology rather than something haphazardly hacked together.

~~~
snovv_crash
The outputs of Microsoft Research are really good. At least in my field, it is
one of the few places where if they published something you can be sure of
being able to reproduce the results using only what is described in the paper,
no secret sauce required.

------
d--b
As impressive as it may be, these people should refrain from claiming 'human-
like' translation for a system that has no way of 'knowing' anything about
context other than statistical occurrences.

It is certain that, on occasion, the system will make mistakes such as stating
the opposite of what was actually said, or attributing an action to the wrong
person, and whatnot. Perhaps on average it's as good as a person, but this
system will make mistakes that disqualify it from being used without a bucket
of salt.

~~~
dvh
Let's just define "human-like", in the context of machine translation, to mean
"with full legal responsibility" from now on. Then let's see who claims their
translator is "human-like".

~~~
Analemma_
If you defined it that way, not even human translators would meet the
standard. Treaties and other official documents published in multiple
languages always specify one as the "official" one for purposes of legal
interpretation and that, in the event of conflict or confusion, the
translations are subservient to it. Setting a bar for AI performance so high
that even humans don't reach it seems unhelpful.

~~~
occamrazor
Actually most international treaties specify all language versions to be
equally authentic. Multilingual contracts on the other hand generally have a
single authoritative version.

------
lima
The most impressive ML translation tool I've seen so far is DeepL[0].

Sometimes, it manages to translate whole articles without errors.

[0]: [https://www.deepl.com/translator](https://www.deepl.com/translator)

~~~
foldr
Impressive, but there's an easy formula for getting these systems to make
mistakes. Just input a sentence with some kind of long distance dependency.
For example, DeepL gets agreement right in English to Spanish translations
when the two things that agree are close together:

        I like soup -> COMO sopa
        They eat soup -> COMEN sopa

Impressively, it can even get agreement correct across clause boundaries in
many cases. But if you do wh-movement through two or more clauses, you're
usually out of luck:

        Which boys does he say he believes eat soup?
        ->
        ¿Qué chicos dice que cree que COME sopa? [should be COMEN]

It doesn't really matter very much in practice if an MT system makes mistakes
like this, but they are mistakes that you can rely on humans not to make
systematically.
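Probes like this are easy to automate. A minimal sketch, with `translate` stubbed by the canned outputs above rather than a real MT API (any real harness would substitute an actual translation call):

```python
# Minimal harness for probing subject-verb agreement in MT output.
# `translate` is a hypothetical stand-in for a real MT system, stubbed
# here with the outputs quoted above.

CANNED = {
    "They eat soup": "Comen sopa",
    "Which boys does he say he believes eat soup?":
        "¿Qué chicos dice que cree que come sopa?",
}

def translate(sentence: str) -> str:
    """Stub for an English -> Spanish MT system."""
    return CANNED[sentence]

def agrees_plural(translation: str, plural_verb: str) -> bool:
    """Check that the expected plural verb form appears in the output."""
    return plural_verb.lower() in translation.lower()

probes = [
    ("They eat soup", "comen"),                       # local agreement
    ("Which boys does he say he believes eat soup?",  # long-distance
     "comen"),
]

for sentence, plural_verb in probes:
    ok = agrees_plural(translate(sentence), plural_verb)
    print(f"{sentence!r}: {'OK' if ok else 'AGREEMENT ERROR'}")
```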

~~~
foldr
(outside edit window. first sentence should be 'I eat soup' not 'I like soup')

------
rvense
Translating "sentences of news" is very different from translating an entire
article, which is obviously what's interesting.

Is anybody in MT or text comprehension/generation really working on systems
that construct a model/"understanding" of the bigger narrative in a longer-
running text? Even just being able to do correct anaphora resolution across
sentence and paragraph boundaries would be a start, and intuitively word-sense
disambiguation (WSD) also seems easier if you've got some sort of abstract
context over more than just a sentence.
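To illustrate why cross-sentence context matters, here's a toy anaphora resolver (entirely illustrative; it links pronouns to antecedents by number agreement alone, whereas real systems use gender, syntax, and discourse salience):

```python
# Toy cross-sentence anaphora resolver: links each pronoun to the most
# recently mentioned entity of a compatible number (singular/plural).
# Only meant to show that antecedents often live in earlier sentences.

PRONOUNS = {"he": "sg", "she": "sg", "it": "sg", "they": "pl"}

def resolve(sentences, entities):
    """entities: dict mapping known entity names to 'sg' or 'pl'."""
    mentions = []  # entities seen so far, in order of mention
    links = []     # (pronoun, antecedent) pairs
    for sentence in sentences:
        for token in sentence.rstrip(".").split():
            word = token.lower()
            if word in PRONOUNS:
                # pick the latest compatible antecedent, even if it was
                # introduced in an earlier sentence
                for name in reversed(mentions):
                    if entities[name] == PRONOUNS[word]:
                        links.append((token, name))
                        break
            elif token in entities:
                mentions.append(token)
    return links

sentences = ["Alice met the reporters.", "They asked what she thought."]
entities = {"Alice": "sg", "reporters": "pl"}
print(resolve(sentences, entities))
# -> [('They', 'reporters'), ('she', 'Alice')]
```

Both pronouns in the second sentence can only be resolved by looking back past the sentence boundary, which is exactly what sentence-at-a-time MT never gets to do.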

~~~
roel_v
I think Google translate already has this. I was translating some text into
German a few days ago, and after a few sentences I used a word that made it
clear that I was talking about a specific type of contract appointment, and it
went back and adjusted earlier sentences to use more precise terminology. You
only notice this when you a) speak the language you're translating into
somewhat; b) actually type/compose the message in the Google Translate text
box; and c) are typing something idiomatic enough that such specific phrases
can be inferred. So I guess it's just something you wouldn't normally notice.

Either way, I was mightily impressed, to the point where my wife had to roll
her eyes and say 'yeah yeah I understand it now' to get me to drop it. (I'm
just easily excited I guess.)

~~~
rerx
That sounds extremely interesting. I had not noticed that feature before. Do
you happen to have some example input at hand that triggers such an
adjustment?

~~~
londons_explore
It's likely this tech is released to only a small percentage of users, and at
off-peak times.

Parsing an entire paragraph for context is _expensive_...

------
Quanttek
Be careful when reading such claims:
[https://www.theatlantic.com/technology/archive/2018/01/the-s...](https://www.theatlantic.com/technology/archive/2018/01/the-shallowness-of-google-translate/551570/?single_page=true)

~~~
jimbokun
That was much better than I expected.

And it wasn't until I looked at the byline at the end that I realized, yes, it
is _that_ Hofstadter (Gödel, Escher, Bach).

------
blennon
What I find most interesting is the multiple training methods used to get the
network to improve its performance. They name a few in the article:

\- dual learning

\- deliberation networks

\- joint training

\- agreement regularization

I haven't read the paper to see how these are combined, but it makes intuitive
sense that using multiple training methods can lead to better performance;
that is to say, to a more effective search of the network's weight space.
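At the heart of the first of these, dual learning, is a round-trip signal: a forward (zh->en) and a backward (en->zh) model check each other. A toy sketch of that signal (the dict-based "models" here are stand-ins for illustration, not anything from the paper):

```python
# Toy illustration of the dual-learning idea: a forward (zh->en) and a
# backward (en->zh) model check each other via round-trip consistency.
# Real dual learning uses this kind of signal (plus language-model
# scores) as a reward to improve both directions; this sketch only
# computes the reconstruction signal.

forward = {"你好": "hello", "谢谢": "thanks"}     # zh -> en (toy "model")
backward = {"hello": "你好", "thanks": "谢谢"}    # en -> zh (toy "model")

def round_trip_score(src: str) -> float:
    """1.0 if translating zh -> en -> zh reconstructs the source, else 0.0."""
    en = forward.get(src)
    if en is None:
        return 0.0
    return 1.0 if backward.get(en) == src else 0.0

print(round_trip_score("你好"))   # -> 1.0
print(round_trip_score("再见"))   # -> 0.0 (not covered by the toy model)
```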

------
iliketosleep
I find these types of "match human performance" claims to be ridiculous,
especially when it comes to Chinese -> English translations. Translation is
both an art and a science, requiring nuanced understanding of the languages,
cultures, and context. It also demands quite a bit of creativity. No
translation tool I've tried has come even close to matching the performance of
a good human translator, including Microsoft's tools. AI will need to reach
the point where its understanding of language, culture, context, and creative
ability matches that of humans to truly be capable of "human performance" in
translation.

~~~
yorwba
After reading the paper, my takeaway is that humans aren't really very good at
translation either. None of the methods scores higher than 70% in the
evaluation and that includes several different human translations (whose
performance varies greatly depending on how they were sourced). So while
matching the quality of the average human translator is a great milestone,
there's still lots of room to improve.

~~~
iliketosleep
Most humans who attempt to translate are not actually translators in the
proper sense. Specifically, they are not fully competent in both the source
and target language and usually have no formal training in translation.
Translation is hard, but there are competent people out there who do it
extremely well. Sadly, as an industry it's not taken as seriously as it should
be, and most of the people who are actually doing translations do not have the
appropriate skill set.

------
jakecrouch
It's obvious that there are limits to how well machine translation can work
unless the models have sensory grounding. I wonder if the problem is that
people haven't figured out how to do sensory grounding or that the hardware is
still too slow for it to work.

~~~
jimbokun
This article, posted by Quanttek above, is very relevant to your question:

[https://www.theatlantic.com/technology/archive/2018/01/the-s...](https://www.theatlantic.com/technology/archive/2018/01/the-shallowness-of-google-translate/551570/?single_page=true)

------
baybal2
About translators relying solely on NNs: the thing is, while 70% of the output
can be perfectly passable, some of the rest can be very weird if the original
input was unlike anything the model learned. Like a string of gibberish
turning into 10 full sentences.

You have to score the extent of wrongness too.
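One crude way to score that kind of wrongness is a sanity check on output length, since hallucinated output is often wildly longer than the input. A hypothetical sketch (the threshold and the word-based length measure are arbitrary choices for illustration):

```python
# Hypothetical sanity check: flag translations whose length is wildly
# out of proportion to the input, a common symptom of an NN model
# hallucinating on input it never learned.

def length_ratio_flag(src: str, hyp: str, max_ratio: float = 3.0) -> bool:
    """Return True if the output looks suspiciously long or short."""
    src_len = max(len(src.split()), 1)
    hyp_len = max(len(hyp.split()), 1)
    ratio = hyp_len / src_len
    return ratio > max_ratio or ratio < 1 / max_ratio

# A two-word gibberish input producing a 13-word "translation" gets flagged:
print(length_ratio_flag(
    "asdf qwer",
    "The quick brown fox jumps over the lazy dog again and again and"))  # -> True

# A proportionate translation does not:
print(length_ratio_flag("hello there", "bonjour toi"))  # -> False
```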

------
fouc
Would be nice to have improved MTL performance for Wuxia/Xianxia webnovels.

------
abacate
I'd suggest addressing non-English to non-English translation first; support
for it is usually limited in most engines out there compared to translations
to/from English.

------
trisimix
Amazing. When can I start reading Chinese CS boards?

