
Show HN: Detecting the Programming Language of Code Using ML and Neural Networks - pprogrammer
http://danielheres.space/jekyll/update/2016/07/18/detecting-the-programming-language-of-source-code-snippets-using-machine-learning-and-neural-networks.html
======
riscy
From my testing, it seems this classifier is heavily influenced by the choice
of identifiers used in the program, which is not necessarily a hallmark of the
language, but of the particular programs trained with. Here's a made-up C
function that uses Python's colon + whitespace for the body instead of braces:

    
    
        int main(int* x):
            return *x + ((int)x);
    

This program is classified as being C (84% confidence), but if I change 'main'
to 'elephant' it drops to 49% C, 41% Python. If I change the name to 'snake'
it suddenly becomes Python code with 80% confidence. I guess Python
programmers really like talking about animals? :)

This would be more interesting if it were able to accurately classify programs
by what family of syntax they're most similar to (Lisp, ML, C, etc).
Otherwise, at first glance this seems like another needless application of
something complicated like neural networks, since languages already have a
formal grammar describing their structure and reserved tokens/symbols that you
could take advantage of.
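For example, a toy scorer over reserved tokens (keyword sets abbreviated and chosen by me for illustration, not the full reserved-word lists) would already lean the right way on my snippet:

```python
# Toy reserved-token scorer: counts how many of a snippet's tokens are
# reserved words/tokens in each language. Keyword sets are abbreviated
# samples for illustration only.
import re

KEYWORDS = {
    "c":      {"int", "return", "void", "struct", "include", "define"},
    "python": {"def", "return", "elif", "import", "lambda", "None"},
    "lisp":   {"defun", "lambda", "setq", "cond", "car", "cdr"},
}

def score(snippet):
    tokens = re.findall(r"[A-Za-z_]\w*", snippet)
    return {lang: sum(t in kw for t in tokens) for lang, kw in KEYWORDS.items()}

print(score("int main(int* x):\n    return *x + ((int)x);"))
# → {'c': 4, 'python': 1, 'lisp': 0}
```

Unlike the neural net, renaming 'main' to 'snake' changes nothing here, since identifiers aren't reserved tokens in any language.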

~~~
pprogrammer
Interesting example. It is not a valid Python/C example; you wouldn't find
such an example in the code (edit: dataset). The model also isn't trained on
small code snippets (only on running code), so this is currently probably a
weak spot.

I agree a lot would be possible using formal grammars, but it would probably
be a lot of work to maintain all the parsers. Also, some languages share a lot
of the same syntax, so in some cases this may lead to ambiguity. Maybe you
could use a combination of both approaches in that case.

~~~
riscy
> It is not a valid Python/C example, you wouldn't find such an example in the
> code.

That was the goal of the example, because it tests whether you're overfitting
to the training data. The point of machine learning is to use the training
data to help you make smart decisions in scenarios that you haven't seen
before. In this case, I expected it to always say it is more C than Python.

~~~
Vraxx
I get the point of wanting to be able to address scenarios that you haven't
seen before, but in this particular instance, is it really that much help to
be able to classify a program that isn't technically written in any
programming language (ie. it's invalid)?

~~~
riscy
> be able to classify a program that isn't technically written in any
> programming language (ie. it's invalid)?

I wouldn't go so far as saying it's 'invalid'. It's a perfectly reasonable new
language that the classifier hasn't encountered, yet whose style is heavily
rooted in languages the classifier has been trained on.

If we require that all programs we feed it are valid programs in one of a
handful of expected languages, we could've just run a parser for each
language on the given input and called it a day. Trying to classify a weird
language like this is where the machine learning aspect of the work really
shines.

~~~
Vraxx
Ah, yes. That makes a lot more sense than how I was thinking about it. It
would still be marginally better than just attempting to validate the code in
expected languages because it would probably handle typos and other simple
syntax errors, as well as snippet problems (like not declaring a certain
variable in the snippet), but I can see where the output on an "invalid"
program could be useful now.

------
Varinius
Scala 0.98

    
    
        #define def int
        
        def main(def x) {
         return x;
       }

~~~
sushisource
Oh man, that's a dirty trick. I like it.

------
dmichulke
Shouldn't one just use a naive bayesian classifier on top of the frequency-
weighted language keywords?

The data would then really only be needed to guess the frequencies.
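A minimal sketch of what I mean (the per-language keyword frequencies below are made up for illustration; in practice you'd estimate them from a labelled corpus):

```python
# Naive Bayes over keyword counts, with add-one smoothing so that
# keywords unseen for a language don't zero out its score.
import math
import re

# freq[lang][keyword] = count observed in training data (invented here)
FREQ = {
    "python": {"def": 50, "return": 40, "import": 30},
    "c":      {"int": 60, "return": 45, "void": 20},
}

def log_posterior(snippet):
    tokens = re.findall(r"[A-Za-z_]\w*", snippet)
    scores = {}
    for lang, freq in FREQ.items():
        total = sum(freq.values())
        vocab = len(freq)
        scores[lang] = sum(
            math.log((freq.get(t, 0) + 1) / (total + vocab)) for t in tokens
        )
    return scores

s = log_posterior("def f(x):\n    return x")
print(max(s, key=s.get))  # → python
```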

I am also not sure whether word bigrams would help a lot, but character-level
n-grams would have been totally off my radar.

Any opinions on that?

~~~
pprogrammer
That would probably work OK, I guess, but some languages share the same
keywords, so I don't think you'll get very high accuracy. I also tried
word-level n-grams with linear classifiers, but couldn't get very high
accuracy out of them. Also, n-grams with higher values of n improve accuracy a
lot if you train with a lot of data, both for word-level and character-level
models.
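For concreteness, character-level n-grams are just sliding windows over the raw text:

```python
# Character n-grams: every length-n window over the raw snippet.
def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("x << 4"))
# → ['x <', ' <<', '<< ', '< 4']
```

This is why they capture punctuation-heavy cues (like `<<` or `->`) that word-level models throw away.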

------
gkbrk
It incorrectly classifies the following valid C++ code as Javascript. It seems
to be thrown off by the use of console.log without actually classifying based
on the syntax.

    
    
        #include <iostream>
    
        class Logger {
            public:
                void log(std::string a) {
                    std::cout << a << std::endl;
                }
        };
    
        int main() {
            Logger console = Logger();
            for (int i=0;i<10;i++) {
                console.log("test");
                console.log("test");
                console.log("test");
                console.log("test");
                console.log("test");
                console.log("test");
                console.log("test");
                console.log("test");
            }
        }

------
leoh
Interesting case:

    
    
        x = [1,2,3]
        x << 4
    

> haskell: 0.94, ruby: 0.03, swift: 0.01
    
    
        x = [1,2,3]
        x << 4
        puts x
    

> ruby: 0.89, haskell: 0.10, swift: 0.00

~~~
pprogrammer
puts is probably much more informative than <<

------
scott_s
Have you only tested syntactically valid code? For example, the code snippet:

    
    
      for (int i = 0; i ? N; ++i) {
        a[i]->val = i;
      }
    

is clearly either C or C++, but it's not _valid_ C or C++ because "i ? N" is
not a valid boolean expression in the for loop. I actually think your
techniques would work on syntactically invalid code, but I'm curious if you
tried.

~~~
pprogrammer
Probably almost all of the samples are syntactically valid, but it probably
works OK for invalid code. Adding or removing symbols can have an impact on
the results though, as those may occur more frequently in other languages.

------
sweezyjeezy
This seems like it's probably massive overkill, have you compared with a
simple bag-of-words linear classifier?

~~~
nicolewhite
Not necessarily overkill; maybe they wanted to learn how to use neural
networks, and this was a good project for that. Simple is best for
professional projects, but for a personal project like this it's better to
explore something you don't already know, for learning purposes.

------
wwwigham
Did the JS corpus not contain any es6 syntax? The statement

    
    
      let x = {}
    

doesn't even register as JS.

~~~
pprogrammer
It probably does not contain a lot of es6 syntax, and the let keyword probably
occurs a lot more in other languages (Swift, Haskell, Lua). You could try
adding more code to disambiguate it; for example, adding a semicolon already
moves JS up to second place.

------
haldean
It would be interesting to set up some sort of similarity comparison between
languages; then, if you introduced the date of release for the various
languages, you might get a cool tree of which-languages-influenced-what. Very
cool project!

~~~
pprogrammer
This is a great idea! I did think of something like this, because similar
languages often end up in the top-3 of the classifier.

------
mcphage
Given the >99% accuracy rating, it would be interesting to see some examples
of code it gets incorrect.

------
radarsat1
Would love to see this for detecting which markup language was used for
`README`.

~~~
pprogrammer
Great idea! Will think about this when I am going to add languages.

~~~
radarsat1
But seriously, this could be really useful for GitHub and the like: instead of
having to rename project files to README.md etc., it could just display them
based on the best guess. I've always wondered why GitHub doesn't bother to do
this, but I guess it's better not to make a guess and get it wrong than to
display a file poorly.

