OpenNMT: Open-Source Neural Machine Translation (github.com)
133 points by groar on Jan 12, 2017 | 34 comments

The amazing thing about this new wave of innovation in AI is that the technology is democratized -- anyone can have near state-of-the-art technology to power their applications without paying for it.

One of the lasting principles in the "business of AI" is that the data is the true source of value. If you don't have a monopoly on the data, you're going to have a tough time sustaining a competitive business, because everyone else will more or less be able to clone your technology. My PhD advisor taught me this based on his experience founding one of the first really successful neural network companies, HNC, which was later acquired by Fair Isaac (FICO). HNC put together a consortium of banks in order to train their credit card fraud detection system and maintained a monopoly on the data. FICO still runs this business today, and I've heard from recent employees that it still runs on the same old single-hidden-layer multi-layer perceptron architecture as it did in the 90s.

I tested it on the live demo with a few sentences from Google News, English to Italian and vice versa. It looks to be on the same level as Google Translate: some sentences are better translated into Italian by OpenNMT, others by Google, even within the same news excerpt.

Both tools seem to be better at translating into English, but maybe that's because I don't notice some of the mistakes they make.

Overall it's a great tool. I suggest reading the FAQ at http://opennmt.net/FAQ/

The code implements a model pretty close to the one Google Translate has published (https://arxiv.org/abs/1609.08144). However, the two systems are likely trained on very different datasets.

Where did the English->Italian and Italian->English models come from? The opennmt.net web site does not list them.

Hi, I'm Alexander Rush (@harvardnlp), one of the project leads on OpenNMT and an assistant prof at Harvard. Feel free to ask me anything about the project.

Hi, thanks for offering your help! A couple of questions:

- Which resources (books, courses, tutorials...) would you recommend to learn how to use OpenNMT? I am a programmer with very basic knowledge of NLP concepts.

- I see a dictionary integration in the Systran demo. I assume this is a Systran product, not something included in OpenNMT? Or am I wrong?

As a terrible human being, I tested Chinese-to-English translations of f-word-based profanity on the demo page (https://demo-pnmt.systran.net/production). The results were arguably not as good as Google Translate's, which can recognize the colloquial form 操[^1] but not the more orthodox 肏. In another way, though, PNMT's results were better: they form more fluent English.

  [^1]: Everyone hijacked the character.
On the other hand, OpenNMT appears able to translate the "f* you!" sentence to "你妈!" in Chinese, a form of "操你妈" with the f-word not spoken but implied. The single-f-word sentence generates amusing variations when used with different punctuation marks, although many ("...", "?", "!", "") fail by reducing the sentence to an ordinary call for one's own mother.

These experiments make me curious about what kind of corpora the system was trained on.

* * *

The test sentences used were:

  [^2]: This one is technically wrong.
For the orthodox version, replace all instances of 操 with 肏.

As an additional experiment, I tested some variations of "mother" profanities without "操":

    1. 你妈!
    2. 他妈的!
    3. 他妈的。
Between #2 and #3, PNMT appears more sensitive to punctuation than humans are. #1 is included only as a round-trip validation.

Edit: (#1) ... and is not supposed to be translated back into profanity, due to the high false-positive rate.

While this type of translation is heinously understudied, the opposite problem, controlling the politeness forms of a translation, is actually an important area of research. For instance, "Controlling Politeness in Neural Machine Translation via Side Constraints" (http://homepages.inf.ed.ac.uk/abmayne/publications/sennrich2...) is something that OpenNMT can support.

I think that's the kind of uncertainty inherent in the nature of neural networks? Or the training data is not very trustworthy. IMO, the latter is more probable.

Hey Sasha, I took the NLP course with you at Columbia when you covered for Michael Collins 2-3 years ago. I vaguely remember either you or Michael mentioning that it wouldn't really be feasible to do multilingual translation by mapping the input text to some universal intermediate language (embedding space) and then decoding into the target language (this was in the context of phrase-based systems).

It's great to see that the MT field has made such great progress in the last few years and that the latest NMT models are doing just that.

Hey! Yeah I taught that class three years ago...

Several papers that really demonstrated this was possible at scale came out over the next year; the most well known is "Sequence to Sequence Learning with Neural Networks" (https://papers.nips.cc/paper/5346-sequence-to-sequence-learn...). It's been quite fun watching something I assumed was too hard at the time become essential to the field.

Hi, This is a great effort! Really excited about what you are doing.

I'm curious about the minimum size of the dataset that would be required to get any reasonable output. I understand that this may be dependent on the language pair, yet some concrete numbers (like for a few of the language pairs shown in the demo) would help me get an idea. Also, what can I do to make it work for language pairs which have very little parallel/aligned data available? I would be grateful for any pointers.

It's tricky to give you an exact answer. For translation, the minimal size we have been using is about 1 million aligned sentences, although people often report results on smaller data. There are also lots of tricks to get around the small-dataset problem.

(1) You can pre-initialize your model with monolingual word embeddings. We recommend using the Polyglot embeddings which exist for many different languages. See http://opennmt.net/Advanced for details.
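As a rough sketch of idea (1): pre-initializing an embedding matrix from pretrained vectors might look like the snippet below. `init_embeddings` and the toy vectors are illustrative stand-ins, not part of OpenNMT; in practice the vectors would be loaded from Polyglot files.

```python
import random

def init_embeddings(vocab, pretrained, dim=64):
    """Build an embedding matrix: copy a pretrained vector where one
    exists, fall back to a small random vector for out-of-vocabulary
    words. `pretrained` maps word -> list of floats."""
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([random.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix

# Toy usage; real vectors would come from monolingual embeddings.
pretrained = {"hello": [0.1] * 64, "world": [0.2] * 64}
emb = init_embeddings(["hello", "world", "xyzzy"], pretrained)
```

The trained model then fine-tunes these rows, so the pretrained vectors act only as a starting point.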

(2) You can train your model with a nearby language. For instance we have a model that uses data from all the Romance languages simultaneously.

(3) You can use monolingual data in other ways. For instance, if you are translating into English, you can combine with a standard language model or pretrain on the English data.

There are a bunch of other approaches in the literature, but these are some of the more common tricks.

Great work! I've wondered if NMT models could be used with other types of data, like music notation (MIDI or ABC). The use case I had in mind was "translating" a monophonic melody input to a polyphonic output, i.e. auto-arranging melodies.

Of course assuming there is an available dataset of input-output pairs to train with.

Non-standard datasets like these are very fun to play with. If you can produce the training data (source => target aligned text files), it is relatively simple to try it out. Some mappings that people have recently published on: code => comments, ingredients => recipes, bad writing => good writing. I haven't seen the application that you are describing, but it would be pretty interesting.
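The "source => target aligned text files" format mentioned above can be mocked up in a few lines: line i of the source file must correspond to line i of the target file. The filenames and toy ingredient/recipe pairs here are illustrative only.

```python
# Write a toy aligned dataset: one example per line, line i of
# train.src is the source for line i of train.tgt.
pairs = [
    ("flour sugar eggs butter", "mix the dry ingredients then fold in the butter"),
    ("rice chicken soy sauce", "stir fry the chicken then add the rice and sauce"),
]
with open("train.src", "w", encoding="utf-8") as fs, \
     open("train.tgt", "w", encoding="utf-8") as ft:
    for src, tgt in pairs:
        fs.write(src + "\n")
        ft.write(tgt + "\n")
```

Any pair of files with this one-to-one line alignment can be fed to the preprocessing step, which is what makes trying non-standard mappings so cheap.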

Note though that NMT is particularly helpful for variable-length output. If you know the target is the same length as the source, then there are likely easier ways to go.

Thanks for the reply, that sounds like a lot of fun!

Do you happen to have links to the projects you mentioned for non-standard mappings? I'd love to see the results and insights from them before embarking on assembling training sets for my use case.

How large should a training set be, typically?

You can often get something started with ~10,000 examples. It's very problem-specific though.

Hi Alexander! I'm a trilingual mobile software engineer and I've always been fascinated with breaking down language barriers (particularly interested in Russian-English and English-Russian translation). In the context of these languages, what are the scenarios where OpenNMT performs poorly and how can I contribute to improving performance in these areas?

What's interesting about neural machine translation is that the core model is completely language-pair independent, so we use roughly the same code for Russian-English, English-Russian, and Chinese-German. That being said, the errors in Russian are quite different from those made in other languages because of case endings. For instance, for a similar-size dataset there are often 5x more unique Russian words than English ones.
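The case-ending effect is easy to see with a toy count: the English lemma "book" has only a couple of surface forms, while the Russian книга declines into roughly nine distinct forms (standard declension; the ratio here is only an illustration of the ~5x figure above).

```python
# Surface forms a tokenizer would treat as distinct "words".
english_forms = {"book", "books"}

# Singular and plural case forms of книга ("book").
russian_forms = {
    "книга", "книги", "книге", "книгу", "книгой",  # singular
    "книг", "книгам", "книгами", "книгах",          # plural
}

ratio = len(russian_forms) / len(english_forms)
print(f"{len(russian_forms)} Russian forms vs {len(english_forms)} English: {ratio:.1f}x")
```

Every one of those forms needs its own row in a word-level vocabulary, which is why morphologically rich languages blow up the effective vocabulary size.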

But if you want to get involved more generally our gitter is http://gitter.im/OpenNMT and our forum is at http://forum.opennmt.net.

This seems like an argument for a character-based rather than word-based network. It just so happens that English and Chinese, the two languages with the most machine learning research, are relatively analytic, with a low morpheme-per-word ratio. But many world languages have a much higher ratio (Russian wouldn't even rank that high!), and acquiring training data covering all unique "words" is essentially impossible.

I'm glad you mentioned this. There is a lot of interest these days in character-based machine translation, including several papers in review at ICLR. The current practical consensus (at least in OpenNMT) is that character-only models are not really worth the efficiency loss. A simple compromise is to use Byte-Pair Encoding as a preprocessing step in morphologically rich languages and allow the model to produce sub-word chunks. This is implemented in OpenNMT as a preprocessing option (see http://opennmt.net/Advanced).
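The BPE compromise mentioned above can be sketched with the learn-merges loop from Sennrich et al.'s paper, shown here on that paper's toy vocabulary (word-internal spaces separate the current symbols; `</w>` marks end-of-word):

```python
import collections
import re

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

# Toy corpus counts from the BPE paper.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # number of merges = target subword vocabulary size
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
```

After a handful of merges, frequent words like "newest" become single symbols while rare words stay split into sub-word chunks, which is exactly what lets the translation model emit word pieces it has never seen assembled before.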

Neural translation has been really amazing and impressive. I stumbled into a way to be mean to Google translate this week: https://twitter.com/driainmurray/status/818934530207862784

OpenNMT can also run amok given totally unreasonable input (example below). I'm being unfair, but the more serious point is that translation systems might go crazy in less obvious ways. While fluency is greatly improved, they certainly don't always get the meaning right. I'm excited to see what people will do to better diagnose when we can trust a translation.

Input: "Toll ach toll toll ach toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll Deutsch."

Output: "Tell me great great, great great, great great great great great great great great great love great ? ? ? great love ? ? ? ? ? ? ? great ? ? ? ? great love ? ? ? great great ? ? ? [the pattern "great love ? ?" then repeats for the remainder of the output] ... great love ? ? great love ? ? Good ?"

[Each ? was a musical note glyph, but HN ate the unicode character.]

Unfortunately we do run into these types of "robustness issues" in our experiments. They're rare enough that they don't really change many of the performance metrics of the system, but they are quite embarrassing. It is currently a bit of a mystery why they go so wrong; better diagnostic tools are sorely needed. Informally, these mistakes are similar to the more well-known misclassification-of-noise issues in CNNs for vision.

As someone whose interest in AI and related tech stems as much from a desire to learn about our own brains as it does from problem-solving technologies, the failure modes of these systems are almost as interesting as the good results - and also the part that tells me we've quite a ways to go.

I love your example too. I studied linguistics, and I always wondered about the idea that a sentence consisting of a determiner and an infinite number of repetitions of the same adjective followed by a noun was supposed to be a valid sentence of any language. There has to be a little more to that story...

Back on topic, I really love the thought of running my own Google Translate server.

What resources does this require to run? Obviously you need some fairly heavy lifting for the initial training, but once trained, could the models be used on, say, a mobile phone?

Yes, you can actually run them on a phone! We had an earlier demo of a very strong system running on Android: https://github.com/harvardnlp/nmt-android

Without tricks, the large models take up about 700 MB (opennmt.net/Models/). They can run on a standard CPU relatively quickly. We also have a pure C++ decoder if you want to do translation without the NN framework.

These NMTs seem tragically bad at translating simple straightforward Chinese e.g.: 我要下班了。下班了再说。

  Google: I have to get off work. Off to say.
  OpenNMT: I am going to work. After work, again.
If this is the state of the art for NLP, we have a long way to go.

It's not there yet, but the improvement has been quite significant in aggregate. Key table from the GoogleNMT paper, empirically showing a 60% relative improvement on this task:

  Pair              PBMT   GNMT   Human  Relative Improvement
  English → Spanish 4.885  5.428  5.504  87%
  English → French  4.932  5.295  5.496  64%
  English → Chinese 4.035  4.594  4.987  58%
  Spanish → English 4.872  5.187  5.372  63%
  French → English  5.046  5.343  5.404  83%
  Chinese → English 3.694  4.263  4.636  60%

This is obviously data dependent. I suspect that the human advantage is much higher on colloquial content than on written (esp. news) content. "Universal Adversarial Perturbations"[1] last year showed that you can easily generate perturbations that look reasonable to a human yet completely fool state-of-the-art DNNs for images. I suspect that the same is true for the current batch of NMTs as well. As a simple demo, I changed the example Chinese a little (就要下班了。下班了再说吧。 only auxiliary characters change, with the same meaning) and all the NMTs failed spectacularly in different ways.

  Google: It is necessary to get off work. To say it again.
  OpenNMT: it's going to work. Go back to work again.
  Baidu: It's going to work. After work.
[1] https://arxiv.org/abs/1610.08401

Yeah, this is a nice connection. Note however that there has been much less success in using perturbations in language. The fact that the inputs are discrete makes it harder to apply some of the tricks from the adversarial image work.

Could be the inane target language. http://pinyin.info/readings/texts/moser.html

Train the top 20 language pairs.

Make a mobile app that takes a picture, runs OCR, detects the language, and feeds the text into the model.

All offline, after the models have loaded once.
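The proposed picture → OCR → language detection → translation pipeline can be sketched as below. `ocr`, `detect_language`, and the model table are hypothetical stand-ins defined here for illustration, not real APIs.

```python
# Hypothetical offline pipeline; every component is a stand-in stub.
def ocr(image_bytes):
    """Stand-in OCR: a real app would run an on-device OCR model here."""
    return "bonjour le monde"

def detect_language(text):
    """Stand-in language ID: a real app would use an on-device classifier."""
    return "fr"

def translate(text, src, tgt, models):
    """Look up the on-device model for the (src, tgt) pair and apply it."""
    return models[(src, tgt)](text)

# Toy stand-in "model" for one of the 20 language pairs.
models = {("fr", "en"): lambda s: "hello world"}

text = ocr(b"...camera frame...")
result = translate(text, detect_language(text), "en", models)
```

With the models bundled on the device, every step runs without a network connection, which is the whole point of the idea above.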

Better: Train the same model for all twenty language pairs (http://opennmt.net/Models/#multi-way---fresptitrofresptitro).

Even better: Use OpenNMT to do the OCR too (https://github.com/opennmt/im2text).
