
Autocompletion with Deep Learning - jacob-jackson
https://tabnine.com/blog/deep
======
_9jgl
This looks incredibly cool; I'm wowed by the fact that the model has learnt to
negate words in if-else statements, though I struggle to think of a case where
that particular completion would have been useful.

At the same time, I'm less excited about the fact that the model is cloud-
only, both for security/privacy reasons and because I spend a not-
insignificant amount of my time on limited-bandwidth/high-latency internet
connections.

I'm also curious as to why the survey didn't ask about GPU specifications;
most of the time I use my laptop to code whilst plugged in, and I'd happily
use only LSP completions when on battery, so power consumption wouldn't be an
issue (though fan noise might), and allegedly my GPU (a GTX 1050) can pull off
almost 2 TFLOPs, which is well over the "10 billion floating point operations"
mentioned in the post.
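As a rough sanity check on that claim (all numbers here are approximate, and peak TFLOP figures are rarely reached in practice):

```python
# Rough sanity check: time per completion on a GPU at peak throughput.
# Both numbers below are approximate (peak FLOPs are rarely achieved).
gpu_flops_per_sec = 2e12      # GTX 1050, nominal ~2 TFLOP/s
cost_per_completion = 10e9    # "10 billion floating point operations"

seconds = cost_per_completion / gpu_flops_per_sec
print(f"~{seconds * 1e3:.0f} ms per completion at peak throughput")  # ~5 ms
```

So even at a small fraction of peak, a modest GPU looks fast enough on paper.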

~~~
zawerf
> I'm wowed by the fact that the model has learnt to negate words in if-else
> statements

I know it learned natural language from GPT-2, but I am surprised it
didn't get "confused", since words are used in such a different way in
programming.

For example, strong appears as the HTML tag <strong> with no corresponding
<weak> tag. And weak appears in weak_ptr in C++, and there's no such thing as
a strong_ptr.

~~~
Felz
Vanilla GPT2 actually does a half decent job of emulating the style of
programming.

Try entering "public static int main() {" into
[https://talktotransformer.com/](https://talktotransformer.com/)

------
pgib
I have been using regular TabNine with vim since January. It is considerably
better than anything else I've used, and it saves me a ton of time. Sometimes
the suggestions are eerily exactly what I wanted. O_o

~~~
ahallock
Same experience. I try not to think about how many keystrokes this would have
saved me over my career had this type of deep learning been available.

------
ogirginc
I was using TabNine __very happily__ until it started consuming excessive CPU.
This bug has been known for more than 6 months but no fix is available at the
moment: [https://github.com/zxqfl/TabNine/issues/24](https://github.com/zxqfl/TabNine/issues/24)

~~~
jacob-jackson
It's difficult to solve an issue like this without steps that can be used to
reproduce the issue.

Here is a similar issue where someone provided instructions to reproduce the
problem, which allowed me to fix the issue:
[https://github.com/zxqfl/TabNine/issues/43](https://github.com/zxqfl/TabNine/issues/43).

------
patrec
Even if it's technically interesting, to me this seems to epitomize two
trends I strongly dislike:

1\. The road to digital serfdom by removing the ability to run software
directly on our individual machines.

2\. Programming degenerating into internet- and AI-assisted copy-pasta code
monkeying around byzantine boilerplate APIs. The upside is that such tooling
makes filling in said boilerplate less painful. But I don't want garbage APIs
to become even less painful (and hence less likely to be weeded out) than they
have already become thanks to Stack Overflow etc. I also wonder if this will
end up producing more "plausible" and hence insidiously wrong code, similar to
Xerox's infamous JBIG2 smart image compression fiasco, where copying sometimes
changed the numbers in the document.

------
modeless
This is awesome. I've been thinking for a while that this would be a good idea
and it's great to see someone actually do it.

Why not allow individual developers with desktop GPUs to run the model
locally? I don't want to run a reduced size laptop model on my machine with a
Titan GPU. It would be awesome to actually harness the GPU power for coding :)

~~~
trishume
(disclaimer: not the author)

One problem is that deploying GPU neural networks cross-platform is a huge
pain. You basically have to get your users to install CUDA and figure out how
to dynamically link against it on Windows and Linux; Mac users and people
with AMD GPUs are out of luck, of course.

The only way to do cross-platform GPU compute without your users installing a
toolkit like CUDA is with Vulkan (and MoltenVk or gfx-rs). But then you don't
have the super optimized GEMM kernels so your network might run slower than
just using a super-optimized GEMM kernel on the CPU.

~~~
jacob-jackson
This is a big part of it. CPU-only models are much easier to deploy, and not
all developers have GPUs, so the first iteration of the local model will be
CPU-only. We might release a GPU version later.

------
arathore
Looks cool. IIRC the transformer architecture doesn't allow any constraints on
the learned language model. For code completion settings, a model that is
explicitly aware of the (programming) language's constructs and then augmented
with code samples would be much more efficient (you could greatly reduce the
search space for the next token, etc.).

~~~
codesushi42
_> a model aware of the (programming) language constructs explicitly_

You could never include that in the model's training. The best you could do
would be to construct an AST on the model output and discard suggestions with
invalid syntax. And provide enough negative examples (invalid syntax) to
reduce false positives.

What you proposed would never work with a language model, and makes no sense
with how backprop works. The model will learn the grammar (syntax), but will
always output some percentage of false positives (invalid syntax).

You can't hardcode the syntax into the model. Another approach is to encode
token types after tokenization, which will give the model more information
about the syntax/meaning of tokens.
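A minimal sketch of the AST-filtering idea in Python (the suggestion strings here are made up; a real system would have to parse in the context of the surrounding file):

```python
import ast

def syntactically_valid(snippet: str) -> bool:
    """True if the snippet parses as a Python expression or statement block."""
    for mode in ("eval", "exec"):
        try:
            ast.parse(snippet, mode=mode)
            return True
        except SyntaxError:
            pass
    return False

# Hypothetical model outputs: keep only the suggestions that parse.
suggestions = ["x = foo(bar)", "x = foo(bar", "if x is None:\n    y = default"]
valid = [s for s in suggestions if syntactically_valid(s)]
print(valid)  # the unbalanced-paren suggestion is discarded
```

In practice you'd want a more tolerant parser, since completions are often prefixes of valid code rather than complete statements.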

------
okr
I want an intellij plugin!

Looks really cool. Thx!

~~~
droid_w
For IntelliJ, try Codota (www.codota.com), which trains AI on semantic models
of Java code.

------
losvedir
I'm impressed by the blog post and how it clearly addressed a lot of the
concerns I had about a cloud based service.

Does it not support Elixir? I tried out and was impressed by the older version
of TabNine, and could have sworn that was with an Elixir project.

~~~
trishume
The older TabNine works for any language because it's based on looking at the
rest of your project. Deep TabNine is new and requires lots of open source
training data for each language to work well. Maybe there is enough open
source training data for Elixir though and it would work well, dunno.

(disclaimer: not the author)

------
onurcel
Looks nice. It would be great to have the ability to fine-tune the model on
your own source code. Is that on the roadmap?

~~~
jacob-jackson
Yep. We will start by offering it to enterprises. We may extend it to
individuals if we can offer it for sufficiently low cost.

------
bobajeff
This is great. I'm happy there are so many projects applying machine learning
to coding, since I plan on applying ML in a future project for coding on
phones and other mobile devices.

I'll have a few examples to learn from.

------
superasn
Can I use this with any intellij products like phpstorm or webstorm?

~~~
droid_w
For IntelliJ try codota.com which trains AI on semantic models of the code

~~~
angrygopher
Only works for Java (and maybe JavaScript). No help for those of us using real
languages.

------
lucidrains
Great job! Could you share how many parameters and the amount of training data
per language you used?

~~~
jacob-jackson
We're still prototyping new models and modifying the dataset, so we plan to
release more technical details in the future.

~~~
hpen
A blog post describing the deep learning details would be awesome!

------
0b01
Greatest autocompletion tool ever created just got better!

------
tempsolution
Cool, finally people can utilize the statistical human average by
automatically boiler-plating their code with what the average developer would
produce.

Sounds like this has good chance of bringing the "human intelligence on deep-
learning auto-pilot" into the world of developers. Don't think, just accept
what the computer tells you. Over time, why think at all.

There is no bigger enemy to concise, clean code than too much auto-
completion. If it doesn't pain you to repeat patterns, then there is literally
no incentive not to litter your code with anti-patterns and copy-paste. Now
it's even worse, because this copy-paste gets "smart" enough to look alright,
which is in itself the very definition of an anti-pattern.

~~~
jacob-jackson
Language models mimic the style of the surrounding text (as you can see from
the examples in [1]), so the model will only try to give you low-quality
boilerplate code if your codebase already contains lots of low-quality
boilerplate code.

[1] [https://openai.com/blog/better-language-models/](https://openai.com/blog/better-language-models/)

------
codesushi42
VSCode Intellisense also uses a deep learning approach for autocompletion.

How does this compare?

~~~
jacob-jackson
As far as I can tell, it suggests one token at a time and uses its model to
help rank these tokens. This is useful, but there is a lot to be gained by
suggesting multiple tokens at once.

TabNine has always included a logistic regression model to help rank
completions. It uses features such as the occurrence frequency of the token
and the number of similar contexts in which it occurs.
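For illustration, a toy re-ranker in this spirit (the features and weights here are invented for the example, not TabNine's actual ones):

```python
import math

# Toy completion re-ranker: a logistic model over simple features.
# The feature set and weights are invented for illustration only.
WEIGHTS = {"bias": -2.0, "log_frequency": 1.2, "similar_contexts": 0.8}

def score(candidate):
    """Logistic score in [0, 1] from hand-picked features."""
    z = WEIGHTS["bias"]
    z += WEIGHTS["log_frequency"] * math.log1p(candidate["frequency"])
    z += WEIGHTS["similar_contexts"] * candidate["similar_contexts"]
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

candidates = [
    {"token": "apply",  "frequency": 2,  "similar_contexts": 0},
    {"token": "append", "frequency": 40, "similar_contexts": 3},
    {"token": "add",    "frequency": 12, "similar_contexts": 1},
]
ranked = sorted(candidates, key=score, reverse=True)
print([c["token"] for c in ranked])  # most promising completion first
```

The weights would normally be learned from accepted/rejected completions rather than set by hand.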

~~~
codesushi42
Intellisense also outputs multiple suggestions.

Also, TabNine mentions they are using transformers, which is not logistic
regression. The context will be inferred using attention.

~~~
jacob-jackson
Sorry, I worded that poorly. What I mean is that each individual suggestion
consists of a single token -- at least, that is what I see in
[https://visualstudio.microsoft.com/services/intellicode/](https://visualstudio.microsoft.com/services/intellicode/).
Compare the videos in
[https://tabnine.com/blog/deep](https://tabnine.com/blog/deep), where most
suggestions consist of multiple tokens.

Deep TabNine (announced today) uses transformers. TabNine (released last year)
uses logistic regression.

~~~
codesushi42
Interesting, thanks.

------
vincent-toups
CTRL-F "emacs" <no results> back.

~~~
jacob-jackson
[https://github.com/TommyX12/company-tabnine](https://github.com/TommyX12/company-tabnine)

