
The 4,000 Lines of Code Harvard Hopes Will Change Translation (2017) - handpickednames
https://slator.com/academia/4000-lines-code-harvard-hopes-will-change-translation/
======
not_a_terrorist
For quality translation of complex, non-repetitive ideas, a human is, and will
always be, required.

I have been in the translation business for about 20 years now, so I have
seen the rise of desktop tools, then server-based solutions, and lastly, of
course, cloud-based tools. While each generation is better than the previous
one, the increments are decreasing.

We are based in Canada, where we receive around 300,000 new immigrants per
year. That's 6 million over those 20 years. Guess what? Those people are
hitting the job market, and now, their children too.

Consequence: the overall quality of English texts is noticeably decreasing,
from ALL of our customers. I notice the names, and they are not your typical
English, French, German, Italian or Ukrainian names (traditional settler
groups in Canada). We do more and more of what is called "transcreation", a
fancy way of saying complete re-interpretation of a text. We basically extract
the main ideas and re-write the whole thing, because basing the translation
on the provided source text always yields disappointing results.

I really don't see how machines could distill the ideas out of a text and
reinterpret them in a nice way.

I can see AI working for short, simple, well-written texts. Heck, even long
manuals with repetitive blocks of text from manual to manual. But with
creative, complex text: still a looooong way to go, baby!

~~~
sametmax
Technical solutions are only a workaround, not a proper solution. They
address a problem that has little benefit in existing anyway.

I think it's about time humanity decided to stop the ego trip and declared
English the Earth's official language.

No need to kill other languages; everyone is free to use them as much as
they want.

But really, all __new__ documents, media, displays, and pieces of information
should be translated into English as well. And it should be mandatory at
school, as well as for getting any administrative position.

I'm French. My country has a VERY strong view on protecting its local
language. But promoting your culture doesn't have to be in contradiction
with reuniting humanity.

Peace, democracy, exchange, cooperation, archiving, education: they are too
hard to do in hundreds of languages. It's a waste of resources, and a
hindrance to the most important challenges of humanity.

Esperanto never won, written Chinese is way too complicated, and English is
already everywhere.

Before trying to share the same money, abolish borders, live in harmony or
reach any ideal at all, we gotta take the big rocks off the road. Not being
able to understand your neighbor is a terrible curse for our species, and
easier to solve than war or famine. Actually, it could be part of the
solution to them.

And since those things take a long time, we better start now.

~~~
anoncoward111
Ironically, the only things complicated about Chinese are its tonal
pronunciation and its writing system. Its grammar and vocabulary are markedly
easier than English's.

Additionally, English has some horrific consonant/vowel clusters and minimal
pairs.

~~~
pouetpouet
The alphabet is a phenomenal invention. I mean the alphabet in the large
sense, be it the Latin, Greek, or Russian alphabet, or any alphabet, abjad,
abugida or syllabary. The Chinese writing system is a notable exception
(along with Japanese Kanji and some others). The only complicated things
about Chinese are pronunciation and the writing system, so half of it is
complicated: if a language is characterized at least by phonology, writing
system, grammar and vocabulary, then Chinese is difficult. Not that any
language is easy. There will never be an agreement for the whole world to
speak Chinese.

~~~
sametmax
The alphabet, and punctuation.

Omitting punctuation, or using some meta-language for it (e.g. in Thai,
repeating a word can mean "!"), makes reading some texts extra difficult.

------
srush
Hi everyone, I'm a creator of OpenNMT and run the NLP group at Harvard
([http://nlp.seas.harvard.edu](http://nlp.seas.harvard.edu)). A lot has
changed in both translation and deep learning over the last couple years.
Happy to answer any questions about the area.

(BTW, the title of this article is crazy hyperbole, almost all this work is
just research implementations of ideas originally from Google/MILA)

~~~
luxpir
Good to see you here! I have a few questions if you're still about.

Have you worked with professional translators much along the way? Most of us
use and keep our own translation memories stretching into the millions of
segments.

What would be the ideal training material for these models to work from?

Do you think NMT could ever recognise the type of text to be translated and
apply style and context accordingly? Displaying an element of creativity, for
example, drawing from relevant contextual themes. I suspect it might do this
automatically if trained widely enough?

~~~
srush
Sure, happy to respond.

> Have you worked with professional translators much along the way? Most of us
> use and keep our own translation memories stretching into the millions of
> segments.

I personally haven't worked too much with translators in my research, but we
built OpenNMT with Systran, who employ a group of translators and linguists.
We also host a yearly workshop
([http://workshop-paris-2018.opennmt.net/](http://workshop-paris-2018.opennmt.net/))
where many professional translators came to talk. There is a lot of
interesting work on integrating translation memories with automatic systems.

> What would be the ideal training material for these models to work from?

So the curt answer is "the best type of training is more training". In
practice, though, the best training material is in-domain data for whatever
type of problem is currently of interest.
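
(Concretely, "in-domain" usually means taking a model trained on general
parallel data and continuing training on the domain-specific corpus, often
at a reduced learning rate. Below is a minimal PyTorch sketch of that loop;
the model and batches are stand-ins for illustration, not OpenNMT's actual
API.)

```python
# Sketch of domain adaptation by continued training ("fine-tuning").
# The linear model and random tensors are placeholders for a trained
# NMT model and real in-domain sentence pairs.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for a pretrained translation model
opt = torch.optim.SGD(model.parameters(), lr=1e-4)  # small LR for adaptation
loss_fn = nn.MSELoss()   # stand-in for per-token cross-entropy

in_domain_batches = [(torch.randn(4, 8), torch.randn(4, 8)) for _ in range(10)]

for src, tgt in in_domain_batches:  # continue training on in-domain data only
    opt.zero_grad()
    loss = loss_fn(model(src), tgt)
    loss.backward()
    opt.step()
```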

> Do you think NMT could ever recognise the type of text to be translated and
> apply style and context accordingly? Displaying an element of creativity,
> for example, drawing from relevant contextual themes. I suspect it might do
> this automatically if trained widely enough?

Creativity is a slippery term, so I will avoid it (personally, the models
don't seem too "clever" to me). There is a lot of interest, though, in models
that can mimic a certain style, whether that be politeness, tone, genre, or
technical material. Often that means learning "knobs" to tune for these
properties.
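
(One common way to implement such "knobs" is the side-constraint trick:
prepend a control token to the source sentence and let the model learn its
association with the target style. A toy sketch follows; the tag names and
example sentence are invented, not taken from any particular OpenNMT setup.)

```python
# Style "knobs" via control tokens (side constraints): during training
# the model learns to associate the leading tag with the style of the
# reference translation; at inference time you pick the tag to steer
# the output. Tag names and the example are invented for illustration.
STYLE_TAGS = {"polite": "<polite>", "informal": "<informal>"}

def add_style_tag(source_tokens, style):
    """Prepend a style control token to a tokenized source sentence."""
    return [STYLE_TAGS[style]] + source_tokens

src = "could you send the report".split()
print(add_style_tag(src, "polite"))
# ['<polite>', 'could', 'you', 'send', 'the', 'report']
```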

------
unhammer
> The entire OpenNMT system (with pre-processing, the authors note) has around
> 4,000 lines of code. The Moses SMT framework comes in at over 100,000 lines
> of code, according to the paper’s authors.

Although it probably _is_ more compact/understandable, that's not quite a
fair comparison, since actively developed systems tend to accrue a lot of
cruft for dealing with edge cases, experiments, rewrites, etc. Running
`cloc .` on the current master of
[https://github.com/moses-smt/mosesdecoder](https://github.com/moses-smt/mosesdecoder)
gives 450963 total lines of code (153741 in the top language, C++), while
the commit for the initial sourceforge(!) check-in 12 years ago, 32edb3d66,
had 12200 total lines of code (2942 in C++, 5702 in sh). If OpenNMT is
successful, it too will grow in size :-/

~~~
srush
This is probably true in general, but in the last year, with the switch from
Torch => PyTorch, the core code has actually dropped in size. There is still
progress being made in improving the frameworks for specifying deep learning
models.
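
(To give a sense of scale: a bare-bones sequence-to-sequence model fits in a
few dozen lines of PyTorch. The toy below only illustrates that compactness;
it is not OpenNMT-py's actual architecture, and all sizes are made up.)

```python
# Toy encoder-decoder in PyTorch, showing how little code a minimal
# NMT model needs in a modern framework. Dimensions and vocabulary
# sizes are invented for illustration.
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        # Encode the source; the final hidden state seeds the decoder.
        _, h = self.encoder(self.src_emb(src))
        dec_out, _ = self.decoder(self.tgt_emb(tgt), h)
        return self.out(dec_out)  # logits over the target vocabulary

model = TinySeq2Seq()
src = torch.randint(0, 1000, (2, 7))  # batch of 2 source sentences
tgt = torch.randint(0, 1000, (2, 5))  # shifted target input
print(model(src, tgt).shape)          # torch.Size([2, 5, 1000])
```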

------
white-flame
Communication is me taking an idea in my head, considering you the listener
and our shared context, and deciding which outward signals will invoke that
same idea in your head. The full meaning of a statement is NEVER completely
contained in the words themselves; they are selected in empathetic prediction
merely as triggers of hidden assumed shared state unique to various
speaker/listener pairings.

As children, we blindly associate phrases to situations in which those phrases
were heard, and attempt to map them in our outward communication, often to
comical effect. It's a cargo cult model of communication, since at that stage
we don't understand _how_ those phrases were originally constructed under
specific intention & purpose.

The meaning of words and phrases will change over time and location, as we
have new experiences that the text becomes associated with. Certainly, as
humans we can understand and continually update multiple simultaneous
contexts in which things are said, and tell from which version an utterance
is drawing its meaning.

These issues are why I don't see the direct text-to-text model of machine
translation getting beyond a certain ceiling of usefulness. Text fundamentally
refers to out-of-band human experiences, in very inconsistent associations.

~~~
skybrian
People do still write books targeted at a general audience, though. This
generic shared context (what distant strangers can reasonably be expected to
know) can plausibly be inferred from training data.

So it's not going to get people's in-jokes or local slang, but it still seems
like the ceiling could be pretty darn high?

~~~
white-flame
The general-audience verbiage still changes over the time axis, though. Are
you going to have different training sets for different time periods, and
where would you cut the threshold between different corpora? There's a lot
of continuous flux, as well as discrete moments when a phrase turns sour
after having been positive or neutral in the past. These are still
classification issues which NNs could potentially resolve, but they are
issues nonetheless.

But it's certainly true that texts written to be specifically & clearly
informative will always be easier to translate than fiction, opinion pieces,
or informal conversations. It all depends on what you consider to be "high" on
the result spectrum.

The successes already achieved put the focus on the remaining shortcomings,
which set the ceiling, not necessarily on refining what it already does
respectably well.

