Aroma: Using machine learning for code recommendation (facebook.com)
254 points by moneil971 77 days ago | 57 comments

If people are interested in this area and not already aware, there has been an Eclipse project that operates in this general area for a while. https://www.eclipse.org/recommenders/

What would be nice is if it could write recommendations automatically, and then you pick and choose which one you want.

You could scaffold a project, and it runs, and then you check back to see what it recommends.

I want something like this for math. Write an equation, or a definition and see a bunch of different 'versions' of that snippet and where they are being used.

This would help so much with understanding concepts and merging fields. It is way too common for different fields to independently "discover" some concept and be completely ignorant of all the work that has been done on that concept by some other field.

It reminds me of Haskell's type based function search: https://hoogle.haskell.org/

That's actually a fascinating idea. Do you think it would have to operate on PDFs? How would that work without the LaTeX source?

Doesn't Wolfram Alpha give you something like this?

Check out the work on homotopy type theory and proof assistants like Coq.

Interesting. In typical Facebook style, they do not attempt to fix the root problem (there's too much code, and much of it has been copy-pasted) but instead expend even more resources just to allow (or even encourage) it to proliferate. The effort would be far better expended on a tool to refactor out all that duplication, because they've created something that can clearly identify duplication.

It reminds me of how they hit a limit in the Android VM because their code had so many classes, and decided to work around it instead of reflecting (no pun intended) on how they ended up with so much code in the first place: https://news.ycombinator.com/item?id=5321634

Maybe on topic: I would like a voice-controlled system that works with me. For example, I say "I need a loop over a list" and promptly get the loop served in my text editor. Or "I need to open a file and read its contents". Or "Create an object ThisAndThat, with three properties", etc. Ideally it would even ask for more details, such as what kind of list it is or how the file should be read.

How hard would that be?

As a software engineer, part of me dreads that day, as it may be the beginning of the end of our profession as it is. But when I think about it, the AI that can accomplish this will be trained on existing code, and not just static code: it will learn by looking at commits and how code evolves over time. So it is bound to also learn to write bugs, change its mind on design, refactor things that don't need refactoring, do premature optimization, rewrite it all in Go/Rust/the newest cool language on the block, and then get stuck because all of its questions got closed as non-constructive on Stack Overflow. So maybe we'll still have a job after all.

Well, I’ve been discussing it on HN for years now. There really hasn’t been much interest.



Mention coding by voice and someone will explain how they can’t imagine not using a keyboard, or they bring up the open office problem.

Considering the limited scope, it’s probably just a matter of the proper editor integration.

My idea is about rapid prototyping: I build the skeleton of a program quickly, boilerplate and all, and then work out the details.

Intellisense or shortcuts do this up to a point, but the current big IDEs are limited. Maybe an editor built on the Vim concept of separate command and edit modes would be a better fit for working like that.

I believe there will be more interest now for two main reasons:

- Ubiquitous frictionless headphones with built in mics like the AirPods.

- More demand and access to remote working where people are alone at home.

I've envisioned something like this even for written code. Basically unify the language and the editor, so that you could (theoretically) right-click on main() and say "add loop" and have the correct code auto-generated. Not because a mouse is somehow better (it's much worse in fact), but basically an editor/UI that only allows you to produce valid code.

Currently for most languages, we have: "type productions of a particular syntax and try really-really-hard to color between the lines, and subject yourself to the chinese-water-torture of syntax errors till YOU get better at it".

Why not invert that, whether via mouse input, a visual (as in literally visual, not Microsoft-visual) connection, or a text editor that simply doesn't let you type invalid productions? Like Intellisense, but taken to the function or block level. You cannot save the file or even leave insert mode until the code compiles. Or even better, you cannot even temporarily input invalid syntax. From the first keystroke, it inserts a variable declaration; click/type up or down to choose a function call, control structure, etc.

Some vim-like integration would go as follows:

command mode:

* <space>+F outputs a function called func1, auto-highlighted for you to rename (or accept default).

* <space>+R on the func name lets you set its return type

* <space>+A for args,

* <space>+B to edit the function body

At no point would you be allowed to input non-compiling syntax. Things like indentation would be non-issues, set uniformly by defaults.
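To make the "never allow non-compiling input" idea concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the buffer is a plain string, "compiles" is approximated by "parses with `ast.parse`", and `try_edit` and `insert_function` are hypothetical names rather than part of any real editor.

```python
import ast
from typing import Tuple


def try_edit(buffer: str, start: int, end: int, replacement: str) -> Tuple[str, bool]:
    """Apply an edit only if the resulting buffer still parses as Python.

    Returns (new_buffer, True) when accepted, (old_buffer, False) when rejected.
    """
    candidate = buffer[:start] + replacement + buffer[end:]
    try:
        ast.parse(candidate)  # raises SyntaxError on invalid syntax
    except SyntaxError:
        return buffer, False
    return candidate, True


def insert_function(buffer: str, name: str = "func1") -> str:
    """Hypothetical '<space>+F' command: append a complete function stub."""
    stub = f"\n\ndef {name}():\n    pass\n"
    new_buffer, ok = try_edit(buffer, len(buffer), len(buffer), stub)
    if not ok:
        # Happens when `name` is not a valid identifier, for example.
        raise ValueError(f"cannot insert function named {name!r}")
    return new_buffer
```

A real structure editor would work on the tree directly instead of re-parsing text after every keystroke, but the gatekeeping rule is the same: an edit that breaks the syntax is never committed to the buffer.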

This is easier with a minimalistic (syntactically) language like Lisp. Check this out, the editor refuses to let you write code that won’t compile: http://danmidwood.com/content/2014/11/21/animated-paredit.ht...

Yes, that is awesome. Now imagine connecting that to a machine-learning backend and letting it slowly train itself on how to write software. Yes, I know ML doesn't need this vim-type language specifically, but it should help by only feeding it valid productions.

Chain a speech-to-text component to TranX and you're done

Paper: https://arxiv.org/pdf/1810.02720v1.pdf

Code: https://github.com/pcyin/tranX

Live demo (try Django out): http://moto.clab.cs.cmu.edu:8081/

This seems to be a lot better than those Seq-to-Seq natural-language-to-SQL networks. Can you elaborate more on them?

As you expanded your database of common tasks, do you think that would eventually become a repository of “things I shouldn’t have to remember”, which could then be used to redesign languages?

Machine code instructions designed by hand are not necessarily the best fit for the code we actually generate. Similarly, might our approach to language design lack pragmatic insight as to which constructs should be favoured, adopted, simplified etc?

“Data-driven language design”.

This is actually a good idea. Also, let's not forget Stack Overflow's huge "library" of quick solutions. Someone has to harness that vast knowledge source!

The most interesting (and I think difficult) approach here is properly representing the ASTs as vectors. There is a lot more possible when you get this right.

This. ML is so vector-y. And code is so graph-y. Can you point out some SOTA on bridging this?

I think you'll find a good amount of material if you search for 'deep graph learning'.

Try code2vec. It's pretty good.

The primary use case I experience for searching for idiomatic usage patterns is to know how to do a higher level refactoring, meaning I don’t want results that have syntax tree similarity to what I’ve got or even the small bit I start from to create the query. I want the intention of my search query but expressed in a better design.

Separately, for very micro-level idiomatic things, like use of a certain data type operation or efficient constructor patterns, I need to search by natural language descriptions of the subtle differences between options. This is what makes Stack Overflow so helpful, the accompanying natural language description of intentionality or special cases, even if the code that is found isn’t precisely what’s needed, it demonstrates directionally what to do.

This tool seems like yet another example of trying to force machine learning solutions to problems nobody actually has.

Considering the idea that I’d need to integrate this into my coding environment, I’ll say No Thanks!

> This is what makes Stack Overflow so helpful, the accompanying natural language description of intentionality or special cases, even if the code that is found isn’t precisely what’s needed, it demonstrates directionally what to do.

You're entirely right, but if you're in an incredibly huge monorepo like Facebook, this information literally doesn't exist; that's part of the problem that Aroma is trying to solve - "how can we show people the Facebook App Way To Do That Thing, even if That Thing doesn't have current documentation"

(Disclaimer: I worked on the coding environment UX for Aroma)

Wouldn’t it make more sense to spend the effort annotating these things? Or building models to provide the annotation? I mean, I work professionally in embedding models for computer vision and NLP, and my reaction to the article is that this seems like totally the wrong approach. You’re putting all this effort to create the embedding model out of the part that is both most superficial and least human interpretable (the AST).

Building models for natural language _and_ code for either NL/intent-based code search or automatically annotating code is indeed another hot research area!

I'd argue Aroma solves a different problem in that it surfaces more idiomatic patterns based on the code you already have. This can also be important, especially in a production environment, where you need to do things "the right way".

If anyone wants to get a PhD in this topic, let me know :)

Your website is the first I've heard of "Information Foraging" as a field of study. Absolutely fascinating. Any recommendations on where I might dive into the topic?

A good start that is easy to read would be:

An Information Foraging Theory Perspective on Tools for Debugging, Refactoring, and Reuse Tasks https://dl.acm.org/citation.cfm?id=2430551

The paper applies IFT to software engineering, but IFT has also been applied to navigating websites or even physical offices. Use Scholar.Google.com to find a PDF of the paper if you don't have ACM access.

Not for PhD, but for research purposes ;)

Can you point out any work on refactoring existing code for reduced code complexity?

Obtaining the same (non)functional behavior using less code.

EDIT: I will read the CodeDeviant paper

The related work cited in the CodeDeviant paper may help.

CodeDeviant itself is a tool to help programmers perform manual refactorings without unit tests (in a visual programming language), so it may not be helpful for you :)

It would be very helpful for figuring out how to do things in a large codebase with little documentation.

Was anyone able to find a link to Aroma in that document? I found the colours made it very difficult to differentiate the links from the text and I couldn't find it.

A quick search through Facebook's profile on Github turned up nothing.

There's a paper describing the approach in detail <https://arxiv.org/abs/1812.01158>, but Aroma itself is not open source yet.

Sorry, HN screwed up my URL: https://arxiv.org/abs/1812.01158

Great, thanks very much for finding that paper. Hopefully Facebook follows this up in the not too distant future with at least a rudimentary example.

I can learn from a paper, but I learn much faster from an example!

See also: IntelliCode for Visual Studio.

I also found it for VS Code: https://visualstudio.microsoft.com/services/intellicode/

However, when I try to install it on Mac OS X, I get:

Couldn't find a compatible version of Visual Studio Intellicode - Preview with this version of Code

Not directly about the article but I am annoyed that the FB AI blog is very much a part of FB the social network. While reading the blog I got three useless notifications (boxes in the bottom left corner). The whole page has no other indication that I am logged into FB nor any option to log out.

Is this project open source? I would like to experiment with something like this to generate boilerplate code. Often when programming I copy something that already does roughly what I want, then modify it until it does exactly what I want.

Perhaps not as fancy, but I have been loving TabNine with Vim. (works with most editors) Its suggestions are scary good some of the time.


I wonder if anyone from Kite.com is reading this and if they have any comments?

Seems like this could be a useful feature for Github, BitBucket, etc.

From what I read, it's doing search & clustering on AST based feature vectors. I'm a bit lost on the learning part, how does the system improve over time?

I’m guessing by learning the vectors over lots of ASTs?
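For anyone trying to picture "search on AST-based feature vectors", here is a minimal Python sketch. It is not Aroma's algorithm or feature set (the paper linked elsewhere in the thread has the real details); the features here are just assumed counts of node types and parent>child type pairs, compared by cosine similarity, and `features`, `cosine` and `search` are made-up names.

```python
import ast
import math
from collections import Counter


def features(source: str) -> Counter:
    """Bag of structural features: node types plus parent>child type pairs."""
    bag = Counter()
    for node in ast.walk(ast.parse(source)):
        parent = type(node).__name__
        bag[parent] += 1
        for child in ast.iter_child_nodes(node):
            bag[f"{parent}>{type(child).__name__}"] += 1
    return bag


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, corpus: dict) -> list:
    """Rank corpus snippets (name -> source) by similarity to the query."""
    q = features(query)
    scored = [(cosine(q, features(src)), name) for name, src in corpus.items()]
    return sorted(scored, reverse=True)
```

Because identifiers are ignored, two snippets with the same shape but different variable names score as near-identical. Note that nothing in a scheme like this is trained: results change only when the indexed corpus changes, which is one possible reading of how such a system could "improve over time".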

That's a pretty cool approach! Thanks for sharing! ;-)

I expect to see this in IntelliJ and Cider (https://code.google.com/archive/p/cider-ide/) soon.

What IDE are they using?

That's Atom with Nuclide.

I guess it's VS Code

One danger I can see coming up is that if someone writes incorrect code it could end up propagating throughout other codebases. I guess this is still an issue without automatic tools, but I feel like this might make it easier…

Aroma would only surface what it thinks are "idiomatic" coding patterns. So if you have many instances of incorrect code, you might already be in trouble :)

Given how many "everybody does X wrong" articles we see here, I think we're already in trouble. :P

(Thinking specifically of the "binary search in the Java API was broken for X decades" one with the integer overflow.)
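For reference, the overflow in question comes from computing the midpoint as `(low + high) / 2` with 32-bit ints. Python integers don't overflow, so this sketch simulates Java-style wraparound with a hypothetical `to_int32` helper; the function names are illustrative only.

```python
INT_MAX = 2**31 - 1


def to_int32(n: int) -> int:
    """Wrap a Python int to a signed 32-bit value, like Java int arithmetic."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n


def mid_buggy(low: int, high: int) -> int:
    """(low + high) / 2 where the sum wraps around at 32 bits."""
    s = to_int32(low + high)
    # Java integer division truncates toward zero.
    return -(-s // 2) if s < 0 else s // 2


def mid_safe(low: int, high: int) -> int:
    """low + (high - low) / 2 never exceeds INT_MAX for 0 <= low <= high."""
    return low + (high - low) // 2
```

For small indices the two agree, but once `low + high` exceeds `INT_MAX` the buggy midpoint goes negative, which in Java ends in an out-of-bounds array access. A pattern recommender that has seen the buggy form many times has no way of knowing it is wrong.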

If you discover a better pattern, it might be easier to convert across the application if the same pattern is followed everywhere. So you might consider this a win.
