
Unsupervised translation of programming languages - elsewhen
https://arxiv.org/abs/2006.03511
======
Animats
Now that's interesting. The examples are a bit cherry-picked, but there are
some interesting results.

Seeing C++

    
    
        int allocation[n];
        memset(allocation, -1, sizeof(allocation));
        for(int i = 0; i < n; i ++)
    

translated into Python:

    
    
        allocation = [-1] * n
        for i in range(n):
    

is impressive. The translator picked up on two idioms of C++ and translated
them to more concise forms in Python. Most "transpilers" just compile into the
target language as if compiling to something like machine code, generating
worst-case wordy output. There's a C to Rust translator which translated C
pointer indexing into Rust unsafe pointer arithmetic, which is no gain in
safety.

It's also claimed that this system is good at guessing types from untyped
languages.

So a good way to use this technology might be to have something that looks at
source code and tries to translate all the function signatures and type
definitions. This includes looking at function calls and bodies to help guess
("infer" is a stretch for this approach) the types of ambiguous variables.

An example:

    
    
        int foo(char* s);
    

Is "s" a pointer to a single character or to a character array? You can't
tell without context. This system might be able to. Or

    
    
        int getdata(int *buf, size_t bufl);
    

A system like this should be able to recognize the intent there.
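For illustration only, here's roughly how a type-guessing translator might
render those two signatures in Python (the function names come from the
examples above; the bodies are placeholders, not anything the paper produces):

```python
# Hypothetical output of a signature-translating tool; bodies are placeholders.

# int foo(char* s);  -- if surrounding calls show s used as a string,
# char* collapses to str:
def foo(s: str) -> int:
    return len(s)  # placeholder body

# int getdata(int *buf, size_t bufl);  -- the (pointer, length) pair
# collapses to a single list, since Python lists carry their own length:
def getdata(buf: list[int]) -> int:
    return len(buf)  # placeholder body
```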

Then try to translate the executable code with that information available. Bad
guesses about types will usually result in translation failures or code that
compiles with type errors, so someone will notice.

So this is promising for modernizing code.

I want to see one able to turn C pointer arithmetic into slice syntax. Now
that would be a big step forward.
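To sketch what that might look like (a hypothetical example, not from the
paper): a C idiom like `memcpy(dst, src + 2, 3 * sizeof(int))` walks raw
memory, while the slice version can't touch anything outside the lists:

```python
# Hypothetical translation of:  memcpy(dst, src + 2, 3 * sizeof(int));
src = [10, 20, 30, 40, 50, 60]
dst = [0, 0, 0]
dst[0:3] = src[2:5]  # copies elements 2..4; unlike pointer arithmetic,
                     # slicing can't stray outside the list
```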

~~~
userbinator
_The translator picked up on two idioms of C++ and translated them to more
concise forms in Python._

It also either knew about the machine integer representation, or else made a
combination of two mistakes that turned out to cancel each other out.

This

    
    
        int allocation[n];
        memset(allocation, 42, sizeof(allocation));
        for(int i = 0; i < n; i ++)
    

would NOT be equivalent to this:

    
    
        allocation = [42] * n
        for i in range(n):

~~~
Animats

        Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
        [GCC 8.4.0] on linux
        >>> n = 10
        >>> allocation = [42] * n
        >>> allocation
        [42, 42, 42, 42, 42, 42, 42, 42, 42, 42]
    

That's what "*" does on lists in Python. The idea is that "abc" + "def" is
"abcdef", and then they overgeneralized.
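For a quick illustration of that generalization (Python treats `+` and `*`
uniformly across sequence types):

```python
# + concatenates and * repeats, for strings and lists alike:
assert "abc" + "def" == "abcdef"
assert "ab" * 3 == "ababab"
assert [-1] * 4 == [-1, -1, -1, -1]  # the idiom from the translated snippet
```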

~~~
lunixbochs
memset() is bytewise; the only reason it works with -1 is that all the bits
end up set at every bit width: the byte 0xff repeated still reads as -1 at int
size (0xffffffff). But the byte 0x2a repeated gives 0x2a2a2a2a, which is not
equal to 42 (0x2a).
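A quick Python sketch of that byte arithmetic (reinterpreting four repeated
bytes as a little-endian 32-bit int; endianness doesn't matter when every byte
is the same):

```python
# memset writes the low byte of its value into every byte of the buffer.
# Reinterpreting four such bytes as a 32-bit int:
assert int.from_bytes(b'\xff' * 4, 'little', signed=True) == -1  # memset with -1 works
assert int.from_bytes(b'\x2a' * 4, 'little') == 0x2a2a2a2a       # memset with 42 gives 707406378, not 42
```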

~~~
Animats
Right. So the translation was idiomatic but wrong.

~~~
hoseja
I'd say the C++ was wrong/unclear. If you want all the ints in your array to
be a certain number, use std::fill. If you want all their bits to be 1, use
something like ~0 in the memset.

------
avsteele
Related: an IRS employee reportedly got 90% of their assembler (yes, really)
code translated to Java. The IRS strangely abandoned the project.

[https://federalnewsnetwork.com/tom-temin-
commentary/2020/01/...](https://federalnewsnetwork.com/tom-temin-
commentary/2020/01/irs-programming-mystery-continues/)

~~~
cheschire
Any developer with more than a few years of experience knows the first 90% is
the easy part though.

------
qayxc
This paper is very interesting, but as with many recent company-funded DNN-
related publications, I am missing some technical information here.

Publications by universities will almost always include details like dataset
sizes, hardware used, and durations for pre-processing, training, and
inference.

As-is, this paper leaves me puzzled as to whether this novel solution is even
practical in any way, shape, or form.

If I essentially need the computational resources of a medium-sized data
centre to prepare the data and train the system for adding a new language
pair or fine-tuning an existing one, it wouldn't be a practical method outside
of the Four Horsemen (GAFA) and their closest competitors...

I applaud the use of computational accuracy as the primary metric for
evaluating the results, though.

~~~
zitterbewegung
Companies / researchers in general have no strong requirement to show you any
artifacts to reproduce their work. There are incentives to not even provide
that information at all.

What we usually see is that someone else takes the time to reproduce the
results. Or projects migrate between companies, like Google's MapReduce being
reimplemented at Yahoo as Hadoop.

And then, if enough companies don't want to manage the whole project, the
software gets donated to the Apache Software Foundation.

~~~
qayxc
> Companies / researchers in general have no strong requirement to show you
> any artifacts to reproduce their work.

That's quite literally the opposite of what science is all about! If they
don't want others to reproduce their results, they might just as well end each
paper with "You can take our word for it!" and skip the details altogether...

What I'm rather interested in are points of comparison. Performance in terms
of a chosen metric is one thing, but research gets more useful if it can
easily be reproduced. This is the norm in all other sciences - why not in AI
research?

If I can see that their approach is 20% better than SOTA, but they require 1M
LoC plus 3 weeks of total computation time on a 100 machine cluster with 8
V100 per node, I can safely say - sod it! - use the inferior commercial
product instead and add 20% manual effort (since I need to add manual work
anyways as the accuracy isn't 100%).

~~~
zitterbewegung
Yes, it is and I agree with you.

I worked in a lab where I had to reproduce other people's code in Java. I
never finished any of those projects.

For example, GPT-2 would need around ~$50k to reproduce from scratch. GPT-3 is
probably a few orders of magnitude more than that. How would anyone reproduce
it unless they are a company? I've seen NVIDIA reproduce some results.

Also, a big part of the issue is that you don't have the datasets, and after
the PhD students graduate and the professor gets a job, access to the datasets
goes away, like bitrot.

------
p4bl0
It says that it is possible to train the translator for any pair of languages.
But they did it with C++, Java, and Python, which are all object oriented
imperative languages. Would it work as well for a C++ to OCaml translation, or
a Prolog to Java translation? Changing programming paradigms can be quite
tricky…

~~~
lostcolony
Yeah; I was far less impressed once I saw the languages in question. Yes,
there are extra/fewer bits to consider when translating between these, but
fundamentally a direct translation is possible that would still be reasonably
idiomatic. Translate a Java program to Erlang, though... you have to
completely change the design or it's not at all idiomatic. A task scheduler
wouldn't rely on a priority queue; it would spin up a process per task that
sleeps until it needs to execute, for instance.

------
lostmsu
This is very impressive, but it cannot handle large contexts.

E.g. in its current state it would not be able to translate large codebases,
even more so using different libraries.

For example, the translator consistently fails to convert Python

    
    
      max(arr)
    

into proper Java. It generates

    
    
      Math.max(arr)
    

instead, which is a non-existent overload of the function.

Another problem is that unless you can port tests too, the code is not very
reliable.

~~~
peteretep
I wonder if there's value there for in-IDE snippets. I need to make some
changes to some Java code, I know how I'd say it in Python, and this'll do the
translation for me?

~~~
lostmsu
Overall I don't think it is a convenient way to program. Besides, you still
need to test it.

------
dang
There's a related video at
[https://www.youtube.com/watch?v=xTzFJIknh7E](https://www.youtube.com/watch?v=xTzFJIknh7E).

(via
[https://news.ycombinator.com/item?id=23470668](https://news.ycombinator.com/item?id=23470668),
but no comments there)

------
zild3d
Curious about how dependencies could be handled. Translating every library
used doesn't sound ideal, but replacing with an "equivalent" library for the
language wouldn't always (or even usually) work either

------
vinceguidry
It would be neat to see each language get a functionality subset defined that
transpilation can smoothly translate from one language to another. The code
itself wouldn't even have to be stored as one language, but rather in an
intermediate representation.

I like the idea of shared VMs like Truffle and LLVM, but they really don't fix
the problem of letting developers with disparate backgrounds and preferences
collaborate.

------
dunefox
It would be interesting to compile languages into an intermediate form - like
a Lisp dialect or the like - to see if that makes things easier. This could be
used to have a common foundation on top of LLVM, for example.

------
blackbear_
I don't know, the examples don't look too complicated. I wonder why an
approach based on rules would not work on those examples.

------
latrot
Next step would then be FP<->Imperative paradigms?

------
pontusrehula
Could I use this to clean up my code? First write some garbage in Java, then
do a round trip Java -> Python -> Java. Does it result in cleaner code? :)

~~~
travisjungroth
Probably more like doing a round trip in a natural language translator.

------
latrot
Next step would then be FP<->Imperative.. Beginning with the static ones?

For dynamic it would require introspection...

------
scotty79
JavaScript > TypeScript translator might be a useful tool.

~~~
yalok
And then TS to C#, to make it compilable through things like il2cpp

------
mister_hn
What about template metaprogramming C++ code? Can it be converted into Python
or Java?

------
linux-hog
To what degree must one consume the "prediction is more fundamental than
understanding" koolaid before work like this looks reasonable?

~~~
whymauri
This is a rude comment. The paper has no grander claim of AGI or 'fundamental'
understanding. It's a novel algorithm that outperforms existing solutions. Not
everything needs to be part of work towards 'understanding.'

~~~
linux-hog
It outperforms existing solutions on a publicly unavailable corpus with no
clear way to reproduce the results?

Hmm... The authors literally ran a series of experiments and published them
without giving a clear way to reproduce the results. How is that useful to
anyone except as self-advertisement?

My comment about prediction vs. understanding was simply meant to underscore
this, though it might rub some people the wrong way. If you publish a paper
purely about prediction (i.e., a set of experiments), be prepared to release
all pertinent information to reproduce said experiments. If you choose to
publish a paper that aims to improve mankind's understanding of the problem at
hand, you are intrinsically required to provide all proof in your exposition.

Otherwise we might as well just believe everything anyone ever says with no
proof.

