Show HN: Detecting the Programming Language of Code Using ML and Neural Networks (danielheres.space)
53 points by pprogrammer on July 18, 2016 | 28 comments



From my testing, it seems this classifier is heavily influenced by the choice of identifiers used in the program, which is not necessarily a hallmark of the language but of the particular programs it was trained on. Here's a made-up C function that uses Python's colon + whitespace for the body instead of braces:

    int main(int* x):
        return *x + ((int)x);
This program is classified as C (84% confidence), but if I change 'main' to 'elephant' it drops to 49% C, 41% Python. If I change the name to 'snake' it suddenly becomes Python code with 80% confidence. I guess Python programmers really like talking about animals? :)

This would be more interesting if it were able to accurately classify programs by what family of syntax they're most similar to (Lisp, ML, C, etc). Otherwise, at first glance this seems like another needless application of something complicated like neural networks, since languages already have a formal grammar describing their structure and reserved tokens/symbols that you could take advantage of.


Interesting example. It is not a valid Python/C example; you wouldn't find such an example in the code (edit: dataset). The model also isn't trained on small code snippets (only on running code), so this is currently probably a weak spot.

I agree a lot would indeed be possible using formal grammars, but it would probably be a lot of work to maintain all the parsers. Also, some languages share a lot of the same syntax, so in some cases this may lead to ambiguity. Maybe you could use a combination of both approaches in that case.


> It is not a valid Python/C example, you wouldn't find such an example in the code.

That was the goal of the example, because it tests whether you're overfitting to the training data. The point of machine learning is to use the training data to help you make smart decisions in scenarios you haven't seen before. In this case, I expected it to always say it is more C than Python.


I get the point of wanting to be able to address scenarios that you haven't seen before, but in this particular instance, is it really that much help to be able to classify a program that isn't technically written in any programming language (ie. it's invalid)?


> be able to classify a program that isn't technically written in any programming language (ie. it's invalid)?

I wouldn't go so far as saying it's 'invalid'. It's a perfectly reasonable new language that the classifier hasn't encountered, yet whose style is heavily rooted in languages the classifier has been trained on.

If we required that every program we feed it be a valid program in one of the handful of languages it expects, we could've just run a parser for each language on the given input and called it a day. Trying to classify a weird language like this is where the machine learning aspect of the work really shines.


Ah, yes. That makes a lot more sense than how I was thinking about it. It would still be marginally better than just attempting to validate the code against the expected languages, because it would probably handle typos and other simple syntax errors, as well as snippet problems (like a variable that isn't declared within the snippet). I can see now where the output on an "invalid" program could be useful.


The model does generalize to data it hasn't seen before (and that's validated using a validation set) but not to data that isn't similar to the data set. For example, it would probably not detect the programming language well when only using comments.


Scala 0.98

    #define def int

    def main(def x) {
        return x;
    }


Oh man, that's a dirty trick. I like it.


Shouldn't one just use a naive Bayes classifier on top of frequency-weighted language keywords?

The data would then really only be needed to guess the frequencies.
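
Roughly what I have in mind, as a minimal sketch (assuming scikit-learn as a stand-in; the keyword list, training snippets, and labels below are made up purely for illustration):

    # Naive Bayes over counts of a fixed keyword vocabulary.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    KEYWORDS = ["def", "int", "return", "puts", "let", "class", "include", "lambda"]

    train_snippets = [
        "def f(x): return x + 1",            # python
        "int main() { return 0; }",          # c
        "puts [1, 2, 3].map { |x| x + 1 }",  # ruby
    ]
    train_labels = ["python", "c", "ruby"]

    # Count only the known keywords; every other token is ignored.
    model = make_pipeline(
        CountVectorizer(vocabulary=KEYWORDS),
        MultinomialNB(),
    )
    model.fit(train_snippets, train_labels)

    print(model.predict(["int add(int a, int b) { return a + b; }"]))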

I am also not sure whether word bigrams would help a lot, but character level n-grams would be totally off my radar.

Any opinions on that?


That would probably work OK, I guess, but some languages share the same keywords, so I don't think you'll get very high accuracy. I also tried word-level n-grams with linear classifiers and couldn't get very high accuracy out of them. Also, n-grams with higher values of n improve accuracy a lot if you train with a lot of data, both for word-level and character-level models.
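
To give an idea of that kind of baseline, here is a rough character-level n-gram sketch (using scikit-learn as a stand-in; the snippets are placeholders, not the actual dataset):

    # Character 2- to 4-grams + a linear classifier; punctuation patterns like
    # braces, arrows and '::' survive here but are lost in word-level models.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    snippets = [
        "def main(args: Array[String]): Unit = println(args.length)",  # scala
        'fn main() { println!("hello"); }',                             # rust
        "console.log([1, 2, 3].map(x => x * 2));",                      # javascript
    ]
    labels = ["scala", "rust", "javascript"]

    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(snippets, labels)

    print(model.predict_proba(["val xs = List(1, 2, 3).map(_ * 2)"]))

With only a handful of snippets this obviously overfits; the point is just the feature-extraction step.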


It incorrectly classifies the following valid C++ code as JavaScript. It seems to be thrown off by the use of console.log, rather than classifying based on the syntax.

    #include <iostream>

    class Logger {
        public:
            void log(std::string a) {
                std::cout << a << std::endl;
            }
    };

    int main() {
        Logger console = Logger();
        for (int i=0;i<10;i++) {
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
        }
    }


Interesting case:

    x = [1,2,3]
    x << 4
> haskell: 0.94, ruby: 0.03, swift: 0.01

    x = [1,2,3]
    x << 4
    puts x
> ruby: 0.89, haskell: 0.10, swift: 0.00


puts is probably much more informative than <<


Have you only tested syntactically valid code? For example, the code snippet:

  for (int i = 0; i ? N; ++i) {
    a[i]->val = i;
  }
is clearly either C or C++, but it's not valid C or C++ because "i ? N" is not a valid boolean expression in the for loop. I actually think your techniques would work on syntactically invalid code, but I'm curious if you tried.


Probably almost all of the samples are syntactically valid, but it probably works OK for invalid code. Adding or removing symbols can have an impact on the results though, as those may occur more frequently in other languages.


Maybe not valid C code, depending on which version, because of the int declaration inside the for loop?


This seems like it's probably massive overkill, have you compared with a simple bag-of-words linear classifier?


Not necessarily overkill; maybe they wanted to learn how to use neural networks and this was a good project to do that. Simple is best for professional projects, but for a personal project like this it's better to explore something you don't already know, for learning purposes.


I tried a linear classifier, yes. It works decently too, but has lower accuracy when using a big dataset (around 98% top-1 accuracy vs. 99.4%).


Did the JS corpus not contain any ES6 syntax? The statement

  let x = {}
doesn't even register as JS.


It probably does not contain a lot of ES6 syntax, and the let keyword probably occurs a lot more in other languages (Swift, Haskell, Lua). You could try adding more code to disambiguate it; for example, adding a semicolon already moves JavaScript up to second place.


It would be interesting to set up some sort of similarity comparison between languages; then, if you introduced the date of release for the various languages, you might get a cool tree of which-languages-influenced-what. Very cool project!


This is a great idea! I did think of something like this, because similar languages often end up in the top-3 of the classifier.
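
One rough way this could work, as a sketch only (the probability profiles below are faked with random numbers; a real version would average the classifier's predicted probability vectors per language):

    import numpy as np

    languages = ["c", "cpp", "java", "python", "ruby"]

    # profiles[i] would be the classifier's average probability vector over
    # snippets of language i; here it is faked just to show the comparison step.
    rng = np.random.default_rng(0)
    profiles = rng.dirichlet(np.ones(len(languages)), size=len(languages))

    # Cosine similarity between the "confusion profiles" of each language pair;
    # languages that the classifier mixes up in similar ways end up close.
    normalized = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    similarity = normalized @ normalized.T

    for i, name in enumerate(languages):
        others = [j for j in range(len(languages)) if j != i]
        closest = max(others, key=lambda j: similarity[i, j])
        print(f"{name} is most similar to {languages[closest]}")

Clustering that similarity matrix, plus the release dates mentioned above, would give something like the influence tree.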


Given the >99% accuracy rating, it would be interesting to see some examples of code it gets incorrect.


Would love to see this for detecting which markup language was used for `README`.


Great idea! I will think about this when I add more languages.


But seriously, this could be really useful for GitHub and the like: instead of having to rename project files to README.md etc., it could just display them based on a best guess. I've always wondered why GitHub doesn't bother to do this, but I guess it's better not to guess and get it wrong than to display a file poorly.



