

Introduction to Statistical Machine Translation - kilo_bit
http://michaelnielsen.org/blog/?p=577

======
andreyf
I wonder how much of this can be applied to code - to fix bugs, for example.

~~~
jimbokun
If you have a training corpus that consists of code with bugs and the same
code with the bug fixed, I imagine you can get rather far.

The biggest challenge with any statistical or machine learning technique is
getting a good data set. If all you have is a lot of code, but the buggy code
is not annotated, then you need unsupervised techniques, which require better
algorithms and more data (as I understand it).

~~~
alextp
Well, you can probably sort of do that with open source projects that
integrate version control and bug tracking. In these repositories you will
find lots of commits named "Fixes #42".
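
A minimal sketch of how such commits might be picked out, assuming commit
messages follow a "Fixes #42"-style convention. The function name
find_fix_commits, the keyword list, and the sample commits are all
hypothetical; a real project may use a different convention entirely.

```python
import re

# Assumed convention: a message such as "Fixes #42" links a commit to issue 42.
# The keyword list is a guess, not a standard.
FIX_PATTERN = re.compile(
    r"\b(?:fix(?:es|ed)?|close[sd]?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)

def find_fix_commits(commits):
    """Return (commit_id, issue_number) pairs for commits that look like bug fixes."""
    pairs = []
    for commit_id, message in commits:
        match = FIX_PATTERN.search(message)
        if match:
            pairs.append((commit_id, int(match.group(1))))
    return pairs

# Hypothetical sample data standing in for a real commit log.
commits = [
    ("a1", "Fixes #42: off-by-one in loop bound"),
    ("b2", "Add README"),
    ("c3", "fixed #7 null check"),
]
print(find_fix_commits(commits))  # prints [('a1', 42), ('c3', 7)]
```

Each matching commit, together with its parent revision, would give one
(buggy, fixed) pair for a training corpus.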

The problem, I think, is that a statistical n-gram model of a programming
language doesn't look likely to produce encouraging results (think
parentheses that look balanced locally but are unbalanced globally, which
should pop up often when matching parentheses are more than n tokens apart).
The upside is that you can get a tokenizer "for free" using the language's
own grammar.
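
A toy illustration of the parenthesis problem, assuming a bigram (n = 2)
model that only records which adjacent token pairs it has seen. The training
strings and the helper names are made up for the example.

```python
def bigrams(tokens):
    """Adjacent token pairs, with start and end markers added."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return list(zip(padded, padded[1:]))

def train(corpus):
    """Collect every bigram that occurs in the (balanced) training strings."""
    seen = set()
    for text in corpus:
        seen.update(bigrams(text))
    return seen

def is_balanced(tokens):
    """Global check: every ')' closes an earlier '(' and nothing stays open."""
    depth = 0
    for token in tokens:
        depth += 1 if token == "(" else -1
        if depth < 0:
            return False
    return depth == 0

seen = train(["(())", "()()", "((()))"])
candidate = "(()"

# Every bigram of the candidate was seen in training, so the model finds it
# plausible, even though the string as a whole is unbalanced.
locally_plausible = all(pair in seen for pair in bigrams(candidate))
print(locally_plausible, is_balanced(candidate))  # prints True False
```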

~~~
jibiki
Maybe what you want is not a Markov model, but a stochastic context-free
grammar:

<http://en.wikipedia.org/wiki/Stochastic_context-free_grammar>
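
A rough sketch of the idea, assuming a toy grammar with two rules:
S -> "(" S ")" S with probability p_expand, and S -> empty otherwise. The
grammar, the probability, and the depth cap are all illustrative choices.
Unlike an n-gram model, every string sampled this way is balanced by
construction.

```python
import random

def sample(rng, p_expand=0.4, depth=0, max_depth=50):
    """Sample a string from the toy grammar S -> "(" S ")" S | empty."""
    # The depth cap is only a safety guard against deep recursion; stopping
    # early applies the empty rule, so the result is still balanced.
    if depth >= max_depth or rng.random() >= p_expand:
        return ""
    inner = sample(rng, p_expand, depth + 1, max_depth)
    rest = sample(rng, p_expand, depth + 1, max_depth)
    return "(" + inner + ")" + rest

def is_balanced(text):
    """True if every ')' closes an earlier '(' and nothing stays open."""
    depth = 0
    for char in text:
        depth += 1 if char == "(" else -1
        if depth < 0:
            return False
    return depth == 0

rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [sample(rng) for _ in range(200)]
print(all(is_balanced(s) for s in samples))  # prints True
```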

------
jibiki
It's shocking that such a naive model produces such good results. I suspect it
has to do with the languages they are translating: if they wanted to translate
Japanese to German it would take more sophisticated methods.

~~~
tokenadult
One reason current machine translation looks pretty good is that over the
years a lot of human translation has been remarkably bad. (I say this as a
former member of the American Translators Association who made my living for
several years as a Chinese-English translator and interpreter.) Most clients
of most translators can only check the accuracy of one of the translator's
languages, which allows for a good bit of bluffing. Since the 1990s I have
grown steadily more satisfied with the machine-translated texts I see on the
Web, for example translations of my own personal homepage into other languages
I know well enough to read.

