Introduction to Statistical Machine Translation (michaelnielsen.org)
21 points by kilo_bit on March 27, 2009 | hide | past | favorite | 9 comments



I wonder how much of this can be applied to code - to fix bugs, for example.


If you have a training corpus that consists of code with bugs and the same code with the bug fixed, I imagine you can get rather far.

The biggest challenge with any statistical or machine learning techniques is getting a good data set. If all you have is a lot of code but the code with bugs in it is not annotated, then you need unsupervised techniques which require better algorithms and more data (as I understand it).


Well, you probably can sort of do that with open source projects that integrate version control and bug tracking. In those repositories you will find lots of commits named "Fixes #42".
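A minimal sketch of harvesting such commits, assuming a local git checkout and the "Fixes #42" message convention (the pattern and helper names here are made up for illustration):

```python
import re
import subprocess

FIX_PATTERN = re.compile(r"[Ff]ix(?:es|ed)?\s+#(\d+)")

def fix_commits(messages):
    """Return (issue_number, message) pairs for commit messages
    that follow a 'Fixes #42'-style convention."""
    pairs = []
    for msg in messages:
        m = FIX_PATTERN.search(msg)
        if m:
            pairs.append((int(m.group(1)), msg))
    return pairs

def repo_fix_commits(repo_path):
    """List candidate bug-fix commits in a local git repository."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return fix_commits(out)
```

Each matched commit, paired with its parent revision, would give one (buggy, fixed) training pair.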

The problem, I think, is that a statistical n-gram model of a programming language doesn't look likely to produce encouraging results: think of parentheses that are balanced locally but unbalanced globally, which an n-gram model will miss whenever the matching parentheses are more than n tokens apart. The upside is that you can get a tokenizer "for free" using the language's own grammar.
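A toy illustration of that failure mode (the corpus and token sequences are made up): every trigram of the unbalanced sequence below also occurs in the well-formed training text, so a trigram model has no window in which anything looks wrong.

```python
from collections import Counter

def ngrams(tokens, n=3):
    """All contiguous n-token windows of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy "training corpus": a single well-formed tokenized expression.
corpus = "( a + b ) * ( c + d )".split()
seen = Counter(ngrams(corpus))

def all_windows_seen(tokens, n=3):
    """An n-gram model only ever inspects local windows."""
    return all(g in seen for g in ngrams(tokens, n))

# Globally unbalanced (missing the final close paren), yet every
# trigram window was observed during training:
bad = "( a + b ) * ( c + d".split()
```

Here `all_windows_seen(bad)` is true even though `bad` has three open parentheses and only two closes.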


Maybe what you want is not a Markov model, but a stochastic context-free grammar:

http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
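A stochastic CFG assigns probability to whole derivations rather than to fixed-width windows, so an unbalanced string simply has no derivation and gets probability zero. A minimal sketch for the standard unambiguous grammar over parentheses (the rule probability p is an arbitrary choice, not from the article):

```python
def derivation_prob(s, p=0.3):
    """Probability that the unambiguous PCFG
           S -> '(' S ')' S   [p]
           S -> epsilon       [1 - p]
       generates the string s. Because the grammar is unambiguous,
       every balanced string has exactly one derivation; unbalanced
       strings have none and score 0.0."""
    if s == "":
        return 1.0 - p
    if s[0] != "(":
        return 0.0
    depth = 0
    for i, ch in enumerate(s):
        depth += 1 if ch == "(" else -1
        if depth == 0:
            # s splits as '(' + inner + ')' + rest
            return p * derivation_prob(s[1:i], p) * derivation_prob(s[i + 1:], p)
    return 0.0  # ran out of input while still nested: unbalanced
```

For example, `derivation_prob("()")` is p·(1−p)², while `derivation_prob("(()")` is exactly 0, which is the kind of global constraint no n-gram model can enforce.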


What about using a combination of revision control and unit-tests?

A given unit test serves as the annotation for a fixed piece of functionality. The evolution of the code behind the unit test gives you the buggy (French) and fixed (English) states.

A little harebrained perhaps, but there are a lot of open source projects with a long history behind them; maybe there are a few viable candidates to test this on.


It's shocking that such a naive model produces such good results. I suspect it has to do with the languages they are translating: if they wanted to translate Japanese to German, it would take more sophisticated methods.


One reason current machine translation looks pretty good is that over the years a lot of human translation has been remarkably bad. (I say this as a former member of the American Translators Association who made my living for several years as a Chinese-English translator and interpreter.) Most clients of most translators can only check the accuracy of one of the translator's languages, which allows for a good bit of bluffing. Since the 1990s I've been increasingly satisfied with the machine-translated texts I see on the Web, for example translations of my own personal homepage into other languages I know well enough to read.


I'm surprised there's little mention of either context or culture. I think those are two of the most important elements of translation. Without context, it's impossible to know the tone, or sometimes even the actual grammar/conjugation of translations for certain languages (I'm thinking Korean/Chinese but this probably applies for other languages as well). Context also helps tremendously with disambiguation, one huge flaw in machine translation.

Culture is another interesting aspect. What somebody might translate as a greeting might require a totally different literal translation in another language, but it's always hard to justify translations that aren't exactly literal. This usually impacts tone quite a bit. In my experience, conversational style in the major Asian languages is indirect and hinting. Most people already understand the importance of cultural differences; those differences are strongly present in speech as well.


You would probably need to worry a lot more about morphology with those languages.




