One of the lasting principles in the "business of AI" is that the data is the true source of value. If you don't have a monopoly on the data, you're going to have a tough time sustaining a competitive business because everyone else will more or less be able to clone your technology. My PhD advisor taught me this based on his experience founding one of the first really successful neural networks companies, HNC, which was later acquired by Fair Isaac (FICO). HNC put together a consortium of banks in order to train their credit card fraud detection system and maintained a monopoly on the data. FICO still runs this business today and I've heard from recent employees it still runs on the same old single-hidden-layer multi-layer perceptron architecture as it did in the 90s.
Both tools seem to be better at translating into English, but maybe that's because I don't notice some of the mistakes they make.
Overall it's a great tool. I suggest reading the FAQ at http://opennmt.net/FAQ/
[^1]: Everyone hijacked the character.
These experiments make me curious about what kind of corpora the system was trained on.
* * *
The test sentences used were:
[^2]: This one is technically wrong.
As an additional experiment, I tested some variations of "mother" profanities without "操":
Edit: (#1) ... and is not supposed to be translated back into profanity due to a high false-positive rate.
It's great to see the MT field make such progress in the last few years, and that the latest NMT models are doing just that.
Several papers demonstrating this was possible at scale came out over the next year; the best known is "Sequence to Sequence Learning with Neural Networks" (https://papers.nips.cc/paper/5346-sequence-to-sequence-learn...). It's been quite fun watching something I assumed was too hard at the time become essential to the field.
I'm curious about the minimum dataset size required to get any reasonable output. I understand that this may depend on the language pair, but some concrete numbers (like for the few language pairs shown in the demo) would help me get an idea. Also, what can I do to make it work for language pairs that have very little parallel/aligned data available? I would be grateful for any pointers.
(1) You can pre-initialize your model with monolingual word embeddings. We recommend using the Polyglot embeddings which exist for many different languages. See http://opennmt.net/Advanced for details.
(2) You can train your model with a nearby language. For instance we have a model that uses data from all the Romance languages simultaneously.
(3) You can use monolingual data in other ways. For instance, if you are translating into English, you can combine with a standard language model or pretrain on the English data.
There are a bunch of other approaches in the literature, but these are some of the more common tricks.
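Trick (1) above amounts to copying pretrained monolingual vectors into the model's embedding matrix before training. A minimal sketch of that initialization step, with a toy hand-built `pretrained` dictionary standing in for real Polyglot vectors loaded from disk (the vocabulary, dimensions, and words here are illustrative, not from OpenNMT):

```python
import numpy as np

# Hypothetical pretrained monolingual embeddings (e.g. from Polyglot);
# in practice these would be loaded from a file.
pretrained = {"house": np.array([0.1, 0.2, 0.3]),
              "cat":   np.array([0.4, 0.5, 0.6])}

vocab = ["<unk>", "<s>", "</s>", "house", "cat", "dog"]
dim = 3
rng = np.random.default_rng(0)

# Start from small random vectors, then overwrite rows for words that
# have a pretrained vector; out-of-vocabulary rows stay random.
emb = rng.normal(scale=0.1, size=(len(vocab), dim))
hits = 0
for i, word in enumerate(vocab):
    if word in pretrained:
        emb[i] = pretrained[word]
        hits += 1

print(f"initialized {hits}/{len(vocab)} rows from pretrained embeddings")
```

The same idea scales to a real vocabulary: rows with coverage get the monolingual vectors, the rest keep their random initialization and are learned from the parallel data.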
Of course assuming there is an available dataset of input-output pairs to train with.
Note, though, that NMT is particularly helpful for variable-length output. If you know the target is the same length as the source, then there are likely easier approaches.
Do you happen to have links to the projects you mentioned for non-standard mappings? I would love to see the results and insights from them before embarking on assembling training sets for my use case.
But if you want to get involved more generally, our Gitter is at http://gitter.im/OpenNMT and our forum is at http://forum.opennmt.net.
OpenNMT can also run amok given totally unreasonable input (example below). I'm being unfair, but the more serious point is that translation systems can fail in less obvious ways. While fluency is greatly improved, they certainly don't always get the meaning right. I'm excited to see what people will do to better diagnose when we can trust a translation.
Input: "Toll ach toll toll ach toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll toll Deutsch."
Output: "Tell me great great, great great, great great great great great great great great great love great ? ? ? great love ? ? ? ? ? ? ? great ? ? ? ? great love ? ? ? great great ? ? ? great great ? ? ? great great ? ? ? great great ? ? great great ? ? ? great great ? ? great great ? ? great love ? ? Good ? ? ? great love ? ? great love ? ? Good love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? great love ? ? Good ?"
[Each ? was a musical note glyph, but HN ate the unicode character.]
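One crude diagnostic for degenerate output like the example above is to measure how repetitive the translation is: a high ratio of repeated n-grams is a red flag. This is just an illustrative heuristic (the function name and threshold choice are mine, not from OpenNMT):

```python
def repeated_ngram_ratio(tokens, n=3):
    """Fraction of n-grams that are duplicates of an earlier n-gram."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

normal = "the cat sat on the mat".split()
degenerate = ("great love ? ? ? " * 20).split()

print(repeated_ngram_ratio(normal))      # low: all trigrams distinct
print(repeated_ngram_ratio(degenerate))  # near 1.0: heavy repetition
```

A score near 1.0 would flag the "great love ? ?" output above as untrustworthy long before a human reads it.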
I love your example too. I studied linguistics, and I always wondered about this idea that a sentence consisting of a determiner and an infinite number of repetitions of the same adjective followed by a noun was supposed to be a valid sentence of any language. There has to be a little more to that story...
Back on topic, I really love the thought of running my own Google Translate server.
Without tricks, the large models take up about 700 MB (see opennmt.net/Models/). They can run on a standard CPU relatively quickly. We also have a pure C++ decoder if you want to do translation without the NN framework.
Google: I have to get off work. Off to say.
OpenNMT: I am going to work. After work, again.
| | PBMT | GNMT | Human | Relative Improvement |
|---|---|---|---|---|
| English → French | 4.932 | 5.295 | 5.496 | 64% |
| English → Chinese | 4.035 | 4.594 | 4.987 | 58% |
| Spanish → English | 4.872 | 5.187 | 5.372 | 63% |
| French → English | 5.046 | 5.343 | 5.404 | 83% |
| Chinese → English | 3.694 | 4.263 | 4.636 | 60% |
Google: It is necessary to get off work. To say it again.
OpenNMT: it's going to work. Go back to work again.
Baidu: It's going to work. After work.
Make a mobile app that takes a picture, runs OCR, detects the language, and feeds it into the model.
All offline, after the models have loaded once.
Even better: Use OpenNMT to do the OCR too (https://github.com/opennmt/im2text).
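The pipeline being proposed is just three stages wired in sequence. A sketch of that wiring, where all three stage functions are hypothetical stubs standing in for real components (im2text for OCR, a language detector, an OpenNMT model for translation); only the composition is the point:

```python
def ocr(image_bytes):
    # Stub: a real system would run im2text or another OCR model here.
    return "Bonjour le monde"

def detect_language(text):
    # Stub heuristic: a real system would use a trained language identifier.
    return "fr" if "Bonjour" in text else "en"

def translate(text, src, tgt="en"):
    # Stub: a real system would call the loaded NMT model for (src, tgt).
    return {"Bonjour le monde": "Hello world"}.get(text, text)

def picture_to_translation(image_bytes):
    text = ocr(image_bytes)
    src = detect_language(text)
    return translate(text, src)

print(picture_to_translation(b"..."))  # -> "Hello world"
```

Since each stage is a pure function of the previous stage's output, the whole thing can run offline once the models are on the device, as the comment above suggests.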