Hacker News
Autocompletion with Deep Learning (tabnine.com)
162 points by jacob-jackson on July 15, 2019 | 44 comments

This looks incredibly cool; I'm wowed by the fact that the model has learnt to negate words in if-else statements, though I struggle to think of a case where that particular completion would have been useful.

At the same time, I'm less excited about the fact that the model is cloud-only, both for security/privacy reasons and because I spend a not-insignificant amount of my time on limited-bandwidth/high-latency internet connections.

I'm also curious why the survey didn't ask about GPU specifications. Most of the time I code on my laptop whilst plugged in, and I'd happily fall back to LSP-only completions when on battery, so power consumption wouldn't be an issue (though fan noise might be). Allegedly my GPU (a GTX 1050) can pull off almost 2 TFLOPS, which is well over the "10 billion floating point operations" mentioned in the post.
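A quick back-of-the-envelope check of that claim, assuming the card actually sustained its rated peak (real workloads rarely do, so treat this as a lower bound on latency):

```python
# The post cites ~10 billion floating point operations per completion;
# a GTX 1050 Ti is rated at roughly 1.862 TFLOPS peak FP32 throughput.
ops_per_completion = 10e9        # figure from the blog post
gpu_peak_flops = 1.862e12        # GTX 1050 Ti, peak FP32

latency_s = ops_per_completion / gpu_peak_flops
print(f"ideal latency: {latency_s * 1000:.1f} ms")  # -> ideal latency: 5.4 ms
```

So even at a fraction of peak throughput, a completion would comfortably fit in an interactive latency budget.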

> I'm wowed by the fact that the model has learnt to negate words in if-else statements

I know it learned natural language, being based on GPT-2, but I'm surprised it didn't get "confused", since words are used so differently in programming.

For example, strong appears as the HTML tag <strong> with no corresponding <weak> tag, and weak appears in weak_ptr in C++ while there's no such thing as a strong_ptr.

Vanilla GPT-2 actually does a half-decent job of emulating the style of programming.

Try entering "public static int main() {" into https://talktotransformer.com/

I think the tokens are simply one-hot encoded, though some sort of word2vec-style embedding is also plausible.
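For reference, GPT-2 tokenizes text into byte-pair-encoded subwords and feeds them through a learned embedding matrix rather than raw one-hot vectors. A toy sketch of the difference (the vocabulary and dimensions here are made up):

```python
import numpy as np

# Toy vocabulary; real models use subword vocabularies of ~50k tokens.
vocab = ["if", "(", "x", ")", "{", "}"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """One-hot: a sparse vector with a 1 at the token's index."""
    vec = np.zeros(len(vocab))
    vec[token_to_id[token]] = 1.0
    return vec

# A learned embedding instead looks up a dense row of a trainable
# matrix (random here, standing in for trained weights).
embedding_dim = 4
embedding_matrix = np.random.randn(len(vocab), embedding_dim)

def embed(token):
    return embedding_matrix[token_to_id[token]]

print(one_hot("if"))      # [1. 0. 0. 0. 0. 0.]
print(embed("if").shape)  # (4,)
```

The embedding lets the model place related tokens near each other in vector space, which is exactly what helps it learn things like strong/weak word pairs.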

Good point, I added it to the survey.

Since it's asking for FLOPS, are you ready for people typing in 14-digit numbers? ;)

I'm looking forward to some data cleaning :)

So for reference (and checking) I entered 1862000000000 for my 1050 ti.

I have been using regular TabNine with vim since January. It is considerably better than anything else I've used, and it saves me a ton of time. Sometimes the suggestions are eerily exactly what I wanted. O_o

Same experience. I try not to think how many keystrokes this would have saved me over my career had this type of deep learning been available.

I was using TabNine very happily until it started consuming excessive CPU. This bug has been known for more than six months, but no fix is available yet. https://github.com/zxqfl/TabNine/issues/24

It's difficult to solve an issue like this without steps that can be used to reproduce the issue.

Here is a similar issue where someone provided instructions to reproduce the problem, which allowed me to fix the issue: https://github.com/zxqfl/TabNine/issues/43.

Even if it's technically interesting, to me this seems to epitomize two trends I strongly dislike:

1. The road to digital serfdom: removing the ability to run software directly on our own machines.

2. Programming degenerating into internet- and AI-assisted copy-pasta code monkeying around byzantine boilerplate APIs. The upside is that such tooling makes filling in said boilerplate less painful. But I don't want garbage APIs to become even less painful (and hence less likely to be weeded out) than they already have thanks to Stack Overflow, etc. I also wonder if this will end up producing more "plausible" and hence insidiously wrong code, similar to Xerox's infamous JBIG2 smart image compression fiasco, where copying sometimes changed the numbers in the document.

This is awesome. I've been thinking for a while that this would be a good idea and it's great to see someone actually do it.

Why not allow individual developers with desktop GPUs to run the model locally? I don't want to run a reduced size laptop model on my machine with a Titan GPU. It would be awesome to actually harness the GPU power for coding :)

(disclaimer: not the author)

One problem is that deploying GPU neural networks cross-platform is a huge pain. You basically have to get your users to install CUDA and figure out how to dynamically link against it on Windows and Linux; Mac users and people with AMD GPUs are out of luck, of course.

The only way to do cross-platform GPU compute without your users installing a toolkit like CUDA is Vulkan (via MoltenVK or gfx-rs). But then you don't have the highly optimized GEMM kernels, so your network might run slower than it would with a well-tuned GEMM kernel on the CPU.

This is a big part of it. CPU-only models are much easier to deploy, and not all developers have GPUs, so the first iteration of the local model will be CPU-only. We might release a GPU version later.

Users don't have to install CUDA if you build your app correctly. Yes, AMD GPU users would be out of luck, but that's really on AMD for failing to get anything like cuDNN integrated with popular frameworks.

Looks cool. IIRC the transformer architecture doesn't allow any constraints on the learned language model. For code completion, a model explicitly aware of the (programming) language's constructs, then augmented with code samples, would be much more efficient (you could greatly reduce the search space for the next token, etc.).

>a model aware of the (programming) language constructs explicitly

You could never include that in the model's training. The best you could do would be to construct an AST on the model output and discard suggestions with invalid syntax. And provide enough negative examples (invalid syntax) to reduce false positives.

What you proposed would never work with a language model, and makes no sense with how backprop works. The model will learn the grammar (syntax), but will always output some percentage of false positives (invalid syntax).

You can't hardcode the syntax into the model. Another approach is to encode token types after tokenization, which will give the model more information about the syntax/meaning of tokens.
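A minimal sketch of the post-filtering idea described above (the candidate strings are invented, and real completions are usually fragments rather than whole statements, so a production filter would need to tolerate partial parses):

```python
import ast

def is_valid_python(snippet):
    """Post-filter: keep only suggestions that parse as Python."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

# Hypothetical candidate completions from a language model:
candidates = [
    "if not done:\n    retry()",
    "if not done:\n    retry(",   # unbalanced paren -> rejected
]
valid = [c for c in candidates if is_valid_python(c)]
print(valid)  # only the first candidate survives
```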

I want an intellij plugin!

Looks really cool. Thx!

For IntelliJ try Codota (www.codota.com) which trains AI on semantic models of Java code

I'm impressed by the blog post and how it clearly addressed a lot of the concerns I had about a cloud based service.

Does it not support Elixir? I tried out and was impressed by the older version of TabNine, and could have sworn that was with an Elixir project.

The older TabNine works for any language because it's based on looking at the rest of your project. Deep TabNine is new and requires lots of open source training data for each language to work well. Maybe there is enough open source training data for Elixir though and it would work well, dunno.

(disclaimer: not the author)

It looks nice. It would be nice to have the ability to fine-tune the model to your own source code. Is that on the roadmap?

Yep. We will start by offering it to enterprises. We may extend it to individuals if we can offer it for sufficiently low cost.

This is great. I'm happy there are so many projects using machine learning for coding, since I plan on applying ML to coding on phones and other mobile devices in a future project.

I'll have a few examples to learn from.

Can I use this with any intellij products like phpstorm or webstorm?

Not at the moment, but planned according to the developer: https://github.com/zxqfl/TabNine/issues/13

For IntelliJ try codota.com which trains AI on semantic models of the code

Only works for Java (and maybe JavaScript). No help for those of us using real languages.

Great job! Could you share how many parameters and the amount of training data per language you used?

We're still prototyping new models and modifying the dataset, so we plan to release more technical details in the future.

A blog post describing the deep learning details would be awesome!

Awesome, thank you! I was waiting for someone to do this! :)

Greatest autocompletion tool ever created just got better!

Cool, finally people can utilize the statistical human average by automatically boiler-plating their code with what the average developer would produce.

Sounds like this has good chance of bringing the "human intelligence on deep-learning auto-pilot" into the world of developers. Don't think, just accept what the computer tells you. Over time, why think at all.

There is no bigger enemy to concise, clean code than too much auto-completion. If it doesn't pain you to repeat patterns, then there is literally no incentive not to litter your code with anti-patterns and copy-paste. Now it's even worse, because this copy-paste gets "smart" enough to look alright, which is in itself the very definition of an anti-pattern.

Language models mimic the style of the surrounding text (as you can see from the examples in [1]), so the model will only try to give you low-quality boilerplate code if your codebase already contains lots of low-quality boilerplate code.

[1] https://openai.com/blog/better-language-models/

VSCode Intellisense also uses a deep learning approach for autocompletion.

How does this compare?

As far as I can tell, it suggests one token at a time and uses its model to help rank these tokens. This is useful, but there is a lot to be gained by suggesting multiple tokens at once.

TabNine has always included a logistic regression model to help rank completions. It uses features such as the occurrence frequency of the token and the number of similar contexts in which it occurs.
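A minimal sketch of what such a ranker could look like (the features and weights here are illustrative stand-ins, not TabNine's actual model):

```python
import math

# Hand-set weights stand in for trained logistic regression parameters.
weights = {"log_frequency": 1.2, "similar_contexts": 0.8, "bias": -1.0}

def score(candidate):
    """Logistic regression score: sigmoid of a weighted feature sum."""
    z = (weights["log_frequency"] * math.log1p(candidate["frequency"])
         + weights["similar_contexts"] * candidate["similar_contexts"]
         + weights["bias"])
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical completion candidates with their extracted features:
candidates = [
    {"token": "self",  "frequency": 120, "similar_contexts": 5},
    {"token": "selfy", "frequency": 2,   "similar_contexts": 0},
]
ranked = sorted(candidates, key=score, reverse=True)
print([c["token"] for c in ranked])  # ['self', 'selfy']
```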

Intellisense also outputs multiple suggestions.

Also, TabNine mentions they are using transformers, which is not logistic regression. The context will be inferred using attention.

Sorry, I worded that poorly. What I mean is that each individual suggestion consists of a single token -- at least, that is what I see in https://visualstudio.microsoft.com/services/intellicode/. Compare the videos in https://tabnine.com/blog/deep, where most suggestions consist of multiple tokens.

Deep TabNine (announced today) uses transformers. TabNine (released last year) uses logistic regression.

Interesting, thanks.

CTRL-F "emacs" <no results> back.
