Unsupervised translation of programming languages (arxiv.org)
166 points by elsewhen on June 9, 2020 | 44 comments

Now that's interesting. The examples are a bit cherry-picked, but there are some interesting results.

Seeing C++

    int allocation[n];
    memset(allocation, -1, sizeof(allocation));
    for(int i = 0; i < n; i ++)
translated into Python:

    allocation = [-1] * n
    for i in range(n):
is impressive. The translator picked up on two idioms of C++ and translated them to more concise forms in Python. Most "transpilers" just compile into the target language as if compiling to something like machine code, generating worst-case, wordy output. There's a C-to-Rust translator which translated C pointer indexing into Rust unsafe pointer arithmetic, which is not a gain in safety.

It's also claimed that this system is good at guessing types from untyped languages.

So a good way to use this technology might be to have something that looks at source code and tries to translate all the function signatures and type definitions. This includes looking at function calls and bodies to help guess ("infer" is a stretch for this approach) the types of ambiguous variables.

An example is

    int foo(char* s);
Does "s" point to a single character or to an array? You can't tell without context. This system might be able to do that. Or

    int getdata(int *buf, size_t bufl);
A system like this should be able to recognize the intent there.

Then try to translate the executable code with that information available. Bad guesses about types will usually result in translation failures or code that fails to compile because of type errors, so someone will notice.

So this is promising for modernizing code.

I want to see one able to turn C pointer arithmetic into slice syntax. Now that would be a big step forward.

> The translator picked up on two idioms of C++ and translated them to more concise forms in Python.

It also either knew about the machine integer representation, or else made a combination of two mistakes that turned out to cancel each other out.


    int allocation[n];
    memset(allocation, 42, sizeof(allocation));
    for(int i = 0; i < n; i ++)
would NOT be equivalent to this:

    allocation = [42] * n
    for i in range(n):

    Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
    [GCC 8.4.0] on linux
    >>> n = 10
    >>> allocation = [42] * n
    >>> allocation
    [42, 42, 42, 42, 42, 42, 42, 42, 42, 42]
That's what "*" does on lists in Python. The idea is that "abc" + "def" is "abcdef", and then they overgeneralized.

  $ cat test.c
  #include <stdio.h>
  #include <string.h>
  int main() {
      int n = 10;
      int allocation[n];
      memset(allocation, 42, sizeof(allocation));
      for(int i = 0; i < n; i ++)
          printf("%d\n", allocation[i]);
      return 0;
  }
  $ gcc test.c
  $ ./a.out

memset() works bytewise; the only reason it works with -1 is that -1 has all bits set at every width. The byte 0xff repeated still ends up being -1 at int size (0xffffffff). But the byte 0x2a repeated gives 0x2a2a2a2a, which is not equivalent to 0x2a.
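To make the bytewise behaviour concrete, here is a small Python sketch that emulates memset() over an array of 32-bit ints and reads the ints back. The helper name memset_int32 is made up for illustration, and a 4-byte int is an assumption (it holds on common platforms but is not guaranteed by C).

```python
import struct

def memset_int32(byte_value, count):
    """Emulate memset() over `count` 32-bit ints, then read the ints back."""
    # memset() converts its value argument to unsigned char, so only the low byte survives.
    raw = bytes([byte_value & 0xFF]) * (4 * count)
    # Reinterpret the filled buffer as signed 32-bit integers (assumed int size).
    return list(struct.unpack(f"<{count}i", raw))

print(memset_int32(-1, 3))   # [-1, -1, -1]
print(memset_int32(42, 3))   # [707406378, 707406378, 707406378]
```

707406378 is 0x2a2a2a2a, so only fill bytes such as 0x00 and 0xff happen to give the "obvious" int value.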

Right. So the translation was idiomatic but wrong.

I'd say the C++ was wrong/unclear. If you want all the ints in your array to be a certain number, use std::fill. If you want all their bits to be 1, use something like ~0 in the memset.

It would be interesting to test this and see whether they really don't rely on rules; I would suspect rules for memset as well as sizeof.

Related: An IRS employee reportedly got 90% of their assembler (yes really) code translated to Java. The IRS strangely abandoned the project.


Any developer with more than a few years of experience knows the first 90% is the easy part though.

This is code from the 60s/70s running on IBM mainframes.

It should not be surprising that it is written in Assembler, given those factors and the performance advantages over the alternatives.

I was writing new 370 Assembler in the 90s for high throughput electronic pre-press systems.

Was it by hand or automated? I couldn’t penetrate the gossip and court intrigue on that site.

This paper is very interesting, but as with many recent company-funded DNN-related publications, I am missing some technical information here.

Publications by universities will almost always include details like dataset sizes, hardware used, and durations for pre-processing, training, and inference.

As-is, this paper leaves me puzzled as to whether this novel solution is even practical in any way shape or form.

If I essentially need the computational resources of a medium-sized data centre to prepare the data, and train the system for adding a new language-pair or fine-tuning an existing one, it'd not be a practical method outside of the Four Horsemen (GAFA) and their closest competitors...

I applaud the use of computational accuracy as the primary metric for evaluating the results, though.

Hmm? Section 4 has the details of their setup (below). It’s 32 V100s, or just under $100/hr including reading the data out of BQ. They don’t even mention how long they trained for, because presumably it wasn’t material (if it were, they’d have probably used more than 32 GPUs so they wouldn’t have to wait).

The details section:

> 4.1 Training details

> We use a transformer with 6 layers, 8 attention heads, and set the dimensionality of the model to 1024. We use a single encoder and a single decoder for all programming languages. During XLM pretraining, we alternate between batches of C++, Java, and Python, composed of 32 sequences of source code of 512 tokens. At training time, we alternate between the denoising auto-encoding and back-translation objectives, and use batches of around 6000 tokens. We optimize TransCoder with the Adam optimizer [25], a learning rate of 10^-4, and use the same learning rate scheduler as Vaswani et al. [45]. We implement our models in PyTorch [39] and train them on 32 V100 GPUs. We use float16 operations to speed up training and to reduce the memory usage of our models.

> 4.2 Training data

> We download the GitHub public dataset available on Google BigQuery. It contains more than 2.8 million open source GitHub repositories. We filter projects whose license explicitly permits the re-distribution of parts of the project, and select the C++, Java, and Python files within those projects. Ideally, a transcompiler should be able to translate whole projects. In this work, we decide to translate at function level.
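For readers who don't have the Vaswani et al. scheduler in their head, here is a rough Python sketch of the schedule from that paper: linear warmup followed by inverse-square-root decay. The d_model of 1024 comes from the quoted section; warmup_steps=4000 is the default from Vaswani et al. and is an assumption here, and the quote does not say how the 10^-4 learning rate is combined with the schedule, so this shows only the shape of the curve, not TransCoder's actual values.

```python
def vaswani_lr(step, d_model=1024, warmup_steps=4000):
    """Schedule from Vaswani et al.: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first update
    # warmup_steps=4000 is an assumed value, not taken from the quoted section.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, f"{vaswani_lr(step):.2e}")
```

The rate climbs until step == warmup_steps and falls off as 1/sqrt(step) afterwards.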

To add a new language you can simply reuse an old model that was trained on lots of languages and fine-tune it for your needs. The same practice exists for huge natural-language models such as BERT and GPT.

Companies / researchers in general have no strong requirement to show you any artifacts to reproduce their work. There are incentives to not even provide that information at all.

What we usually see is that someone else takes the time to reproduce the results. Or you can see ideas moving from one company to another, like MapReduce (Google) to Hadoop (Yahoo).

And then if enough companies don't want to manage the whole project the software gets donated to the Apache Software Foundation.

> Companies / researchers in general have no strong requirement to show you any artifacts to reproduce their work.

That's quite literally the opposite of what science is all about! If they don't want others to reproduce their results, they might just as well end each paper with "You can take our word for it!" and skip the details altogether...

What I'm rather interested in are points of comparison. Performance in terms of a chosen metric is one thing, but research gets more useful if it can easily be reproduced. This is the norm in all other sciences - why not in AI research?

If I can see that their approach is 20% better than SOTA, but they require 1M LoC plus 3 weeks of total computation time on a 100 machine cluster with 8 V100 per node, I can safely say - sod it! - use the inferior commercial product instead and add 20% manual effort (since I need to add manual work anyways as the accuracy isn't 100%).

Yes, it is and I agree with you.

I worked in a lab where I had to reproduce other people's code in Java. I never finished any of those projects.

For example, GPT-2 would need around $50k to reproduce from scratch. GPT-3 is probably a few orders of magnitude more than that. How would anyone reproduce it unless they are a company? I've seen NVIDIA reproduce some results.

Also, most of the issue is that you don't have the datasets, and after the PhD students graduate and the professor gets a job, your access to the datasets goes away like bitrot.

> If I can see that their approach is 20% better than SOTA, but they require 1M LoC plus 3 weeks of total computation time on a 100 machine cluster with 8 V100 per node, I can safely say - sod it!

8 V100s cost about $20/h, so 100 machines for 2 weeks (allowing for a long training time) will cost $638K. This is the salary of three to five engineers for a year. If your model saves more than that amount of engineering time, it is worth it. It's just a matter of how much use you can get out of it. Of course a model can be reused by different teams and companies, so it could easily be worth the price.
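A quick check of that arithmetic in Python, assuming round-the-clock usage and a flat hourly rate per 8-GPU machine (the helper name cluster_cost is made up). At exactly $20/h the total is $672K; the $638K figure corresponds to roughly $19/h.

```python
def cluster_cost(machines, hours, rate_per_machine_hour):
    """Total cost of renting `machines` nodes for `hours` at a flat hourly rate."""
    return machines * hours * rate_per_machine_hour

two_weeks = 14 * 24  # hours, assuming the cluster runs non-stop

print(cluster_cost(100, two_weeks, 20))  # 672000
print(cluster_cost(100, two_weeks, 19))  # 638400
```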

I expect the number I calculated to be exaggerated for this task, though; you don't need that much compute for this model. GPT-3 cost $1.2M per run and it is the largest model in existence.

It says that it is possible to train the translator for any pair of languages. But they did it with C++, Java, and Python, which are all object oriented imperative languages. Would it work as well for a C++ to OCaml translation, or a Prolog to Java translation? Changing programming paradigms can be quite tricky…

Yeah; I was far less impressed when I saw the languages in question when I ran across this. Yes, there are extra/fewer bits to consider when translating between these, but fundamentally there is a direct translation possible, that would still be reasonably idiomatic. Translate a Java program to Erlang, though...you have to completely change the design or it's not at all idiomatic. A task scheduler wouldn't rely on a priority queue; it would rely on spinning up a process per task that would sleep until it needs to execute, for instance.
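To illustrate the design gap, here is a loose Python sketch of the same scheduling job written both ways. The function names are invented, and asyncio coroutines only stand in for Erlang processes (they are not isolated and not preemptive), so treat it as an analogy rather than a port.

```python
import asyncio
import heapq

def run_with_priority_queue(tasks):
    """Central scheduler: one heap ordered by due time, drained by a single loop."""
    heap = [(delay, name) for name, delay in tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)  # a real scheduler would wait until the due time here
    return order

async def run_with_worker_per_task(tasks):
    """Erlang-flavoured design: every task is its own worker that sleeps until due."""
    order = []

    async def worker(name, delay):
        await asyncio.sleep(delay)  # the worker itself waits; no shared queue
        order.append(name)

    await asyncio.gather(*(worker(name, delay) for name, delay in tasks))
    return order

tasks = [("c", 0.09), ("a", 0.01), ("b", 0.05)]  # (name, delay in seconds)
print(run_with_priority_queue(tasks))
print(asyncio.run(run_with_worker_per_task(tasks)))
```

Both versions run the tasks in due-time order, but the second has no central data structure at all, which is the kind of redesign a function-level translator has no way to produce.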

It is possible but the quality will suffer.

In NLP land, En-Fr unsupervised translation yields significantly better results than, say, En-Zh translation.

This is very impressive, but it cannot handle large contexts.

E.g. in its current state it would not be able to translate large codebases, even more so using different libraries.

For example, the translator consistently fails to convert Python

into proper Java. It generates

instead, which is a non-existent overload of the function.

Another problem is that unless you can port tests too, the code is not very reliable.

I wonder if there's value there for in-IDE snippets. I need to make some changes to some Java code, I know how I'd say it in Python, and this'll do the translation for me?

Overall I don't think it is a convenient way to program. Besides, you still need to test it.

How do you know this? Do you have access to the program somewhere?

The max example is listed in the very last paragraph of the paper as one of the failure modes.

There's a related video at https://www.youtube.com/watch?v=xTzFJIknh7E.

(via https://news.ycombinator.com/item?id=23470668, but no comments there)

Curious about how dependencies could be handled. Translating every library used doesn't sound ideal, but replacing with an "equivalent" library for the language wouldn't always (or even usually) work either

It would be neat to see each language get a functionality subset defined that transpilation can smoothly translate from one language to another. The code itself wouldn't even have to be stored as one language, but rather in an intermediate representation.

I like the idea of shared VMs like Truffle and LLVM, but they really don't fix the problem of allowing developers with disparate backgrounds and preferences to collaborate.

It would be interesting to compile languages into an intermediate form - like a lisp dialect/form or so - to see if that makes things easier. This could be used to have a common foundation on top of LLVM, for example.

I don't know, the examples don't look too complicated. I wonder how it is possible that an approach based on rules would not work on those examples.

Next step would then be FP<->Imperative paradigms?

Could I use this to clean up my code? First write some garbage in Java, then do a round trip Java -> Python -> Java. Does it result in cleaner code? :)

Probably more like doing a round trip in a natural language translator.

Next step would then be FP<->Imperative.. Beginning with the static ones?

For dynamic it would require introspection...

JavaScript > TypeScript translator might be a useful tool.

And then TS to C#, to make it compilable through things like il2cpp

What about template metaprogramming C++ code? Can it be converted to Python or Java?

To what degree must one consume the "prediction is more fundamental than understanding" koolaid before work like this looks reasonable?

This is a rude comment. The paper has no grander claim of AGI or 'fundamental' understanding. It's a novel algorithm that outperforms existing solutions. Not everything needs to be part of work towards 'understanding.'

It outperforms existing solutions on a publicly unavailable corpus with no clear way to reproduce the results?

Hmm... The authors literally ran a series of experiments and published them without giving a clear way to reproduce the results. How is that useful to anyone except to self-advertise?

My comment about prediction vs understanding was simply meant to underscore this, although it might rub some people the wrong way. If you publish a paper purely about prediction (i.e. a set of experiments), be prepared to release all pertinent information to reproduce said experiments. If you choose to publish a paper that aims to improve mankind's understanding of the problem at hand, you are intrinsically required to provide all proof in your exposition.

Otherwise we might as well just believe everything anyone ever says with no proof.

> In Table 2, we observe that TransCoder significantly outperforms both baselines in terms of computational accuracy, with 74.8% and 68.7% in the C++ → Java and Java → Python directions, compared to 61% and 38.3%

It beats existing commercial products that try to do the same, so to me it seems it has some value right now.

> It beats existing commercial products that try to do the same, so to me it seems it has some value right now.

While I agree with your conclusion, a minor correction is in order here: it beats a (as in a single selected) commercial product and a single free OSS transpiler.

The improvement for the commercial product is 61% vs. 75% accuracy (i.e. 23% better) which - while impressive given the unsupervised learning aspect - isn't a game changer (yet!).
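The gap between "points" and "percent better" is easy to mix up, so here is the calculation spelled out in Python using the accuracy figures quoted from Table 2 earlier in the thread (the helper name improvement is made up).

```python
def improvement(baseline, new):
    """Return (absolute gain in percentage points, relative gain as a fraction)."""
    absolute = new - baseline
    relative = absolute / baseline
    return absolute, relative

# C++ -> Java: baseline 61%, TransCoder 74.8%
abs_gain, rel_gain = improvement(61.0, 74.8)
print(f"{abs_gain:.1f} points, {rel_gain:.1%} relative")   # 13.8 points, 22.6% relative

# Java -> Python: baseline 38.3%, TransCoder 68.7%
abs_gain2, rel_gain2 = improvement(38.3, 68.7)
print(f"{abs_gain2:.1f} points, {rel_gain2:.1%} relative")  # 30.4 points, 79.4% relative
```

So the "23% better" figure is the relative gain for C++ to Java; in absolute terms it is just under 14 percentage points.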
