
Towards a Universal Code Formatter through Machine Learning (2016) - breck
https://arxiv.org/abs/1606.08866
======
jph00
Great to see this picked up on HN - it's really interesting work. Terence is
my colleague at USF (in fact he's the guy that got me to go there!) and it was
great to see the journey from "could this possibly work" to "holy crap it does
work!" over a period of many months. He's known for his work with parsers, of
course (i.e. ANTLR), but recently he's been investing heavily in machine
learning. This project is a great example of how domain experts can go into
whole new areas by building on top of machine learning methods.

For anyone interested in getting into machine learning, I hope you'll consider
the MSAN program at USF
([https://www.usfca.edu/arts-sciences/graduate-programs/analytics](https://www.usfca.edu/arts-sciences/graduate-programs/analytics)).
Terence is the guy that made this program happen at USF
and it's turned into something really special.

Disclaimer: I'm a deep learning researcher and teacher at USF.

~~~
parrt
One possible angle for improvement of this technology: use a deep learning net
to conjure up a different feature vector than the one I handcrafted from
language/grammar expertise. I believe this was your idea. :) Glad to have you
on board teaching and doing research!

~~~
nl
I'm pretty sure this would work really well.

I've trained a CharCNN on log files, and it generates really good example
files. To me that shows that even a comparatively simple model can capture
syntax rules, so I'd imagine an LSTM would generate really good feature
vectors.

------
PNWChris
It would be spectacular to have a tool like this to enforce, and automatically
revise, code style guidelines as languages evolve. Keeping up with something
like the C++ Core Guidelines [0] can be a fair amount of work, as you need to
respect existing idioms to keep your codebase coherent (a higher priority than
respecting the "one true style", in my opinion).

A tool like this would be brilliant if it could automatically generate
suggestions for "hybrid styles" that fit your existing codebase, and provide
suggestions to migrate existing code patterns to "correct" ones over time.

[0]:
[https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines)

~~~
parrt
First step to using CodeBuff would be getting an ANTLR grammar for C++ or at
least a fuzzy version. It's still a prototype but should do pretty well.

------
sushisource
Is there a prototype anywhere? I want it!

~~~
farresito
This seems to be it:
[https://github.com/antlr/codebuff](https://github.com/antlr/codebuff)

~~~
parrt
Yep, that's it. Somebody has ported it from Java to C# as well. The next step
is really to convert it to use a Random Forest classifier. I'm stuck elsewhere
at the moment.

