

Structured Generative Models of Natural Source Code [pdf] - dennybritz
http://jmlr.org/proceedings/papers/v32/maddison14.pdf

======
mallamanis
This is a new and very interesting area at the intersection of machine
learning and software engineering. Anyone interested might also find the
following papers useful:

Hindle, Abram, et al. "On the naturalness of software." Software Engineering
(ICSE), 2012 34th International Conference on. IEEE, 2012.

Tu, Zhaopeng, Zhendong Su, and Prem Devanbu. "On the Localness of Software."

Nguyen, Tung Thanh, et al. "A statistical semantic language model for source
code." Proceedings of the 2013 9th Joint Meeting on Foundations of Software
Engineering. ACM, 2013.

Campbell, Joshua Charles, Abram Hindle, and José Nelson Amaral. "Syntax errors
just aren't natural: improving error reporting with language models."
Proceedings of the 11th Working Conference on Mining Software Repositories.
ACM, 2014.

Allamanis, Miltiadis, and Charles Sutton. "Mining source code repositories at
massive scale using language modeling." Mining Software Repositories (MSR),
2013 10th IEEE Working Conference on. IEEE, 2013.

Movshovitz-Attias, Dana, and William W. Cohen. "Natural Language Models for
Predicting Programming Comments." ACL (2). 2013.

Allamanis, Miltiadis, Earl T. Barr, and Charles Sutton. "Learning Natural
Coding Conventions." arXiv preprint arXiv:1402.4182 (2014).

Allamanis, Miltiadis, and Charles Sutton. "Mining Idioms from Source Code."
arXiv preprint arXiv:1404.0417 (2014).

~~~
tomp
Assuming you're more familiar with this than I am, can you outline why this is
useful (either practically, or why it's theoretically more useful than e.g.
modelling natural language)?

~~~
mallamanis
From a machine learning perspective (i.e. theoretically, in my view) this is
useful because source code is highly structured, has very complex constraints,
and comes with tons of data (e.g. every project on GitHub). This means that
machine-learning methods for handling such problems need to be developed, and
such methods may eventually be useful in other applications.

Now on the applied side (software engineering, programming languages), such
methods (probabilistic machine learning and probabilistic/statistical models)
can handle uncertainty in a principled way and provide software engineers with
useful tools that exploit the large amounts of data available in both internal
and external codebases. This is not fully possible with formal tools, which
usually require some form of human knowledge to be embedded. For example, in
the list above you will see tools that do autocompletion, tools that suggest
"reasonable" renamings, and tools that help migrate source code between
languages, all thanks to data.
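The basic idea behind these data-driven tools (going back to Hindle et al.'s
"naturalness" paper) is to treat code as a sequence of tokens and train a
statistical language model on it. A minimal sketch, assuming a toy whitespace-tokenized
corpus (a real system would train an n-gram or neural model on millions of files):

```python
from collections import Counter, defaultdict

# Hypothetical toy "codebase"; purely illustrative, not a real training set.
corpus = [
    "for i in range ( n ) :",
    "for j in range ( m ) :",
    "if x in seen :",
]

# Train a bigram model: count how often each token follows each other token,
# which approximates P(next_token | current_token).
bigrams = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for cur, nxt in zip(tokens, tokens[1:]):
        bigrams[cur][nxt] += 1

def suggest(token):
    """Return the token most often seen after `token` (crude autocompletion)."""
    counts = bigrams[token]
    return counts.most_common(1)[0][0] if counts else None

print(suggest("in"))  # "range" follows "in" twice vs. "seen" once -> "range"
```

The real systems cited above differ mainly in what they condition on: longer
n-gram contexts, locality caches (Tu et al.), or syntactic structure (the
linked Maddison & Tarlow paper), but the "learn from lots of code, then
predict" recipe is the same.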

Hopefully, at some point these methods will be advanced enough to be trained
on every piece of code that is available online and, for example, spot bugs in
your code, semi-automatically refactor it, etc.
