It's funny how discoveries in NLP & computer vision complement each other. The replacement of multiplications by additions made me think of the AdderNet paper (https://arxiv.org/abs/1912.13200), which concluded that you suffer almost no performance drop.
Perhaps the accumulators in current hardware cannot leverage this to its full potential, but combined with such strict quantization, this would open LLMs up to the wider ML community much earlier than expected (when consumer hardware lets you train near-SOTA LLMs from scratch on your own machine).
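For anyone curious about the multiplication-free idea, here's a minimal sketch of my reading of it (not the paper's actual code): an AdderNet-style layer scores inputs against each weight column with a negative L1 distance, which only needs additions and subtractions, instead of the usual dot product.

```python
import numpy as np

def dense_mul(x, w):
    # Standard fully connected layer: dot products (multiply + accumulate).
    # x: (batch, in_features), w: (in_features, out_features)
    return x @ w

def dense_adder(x, w):
    # AdderNet-style layer: negative L1 distance between the input and each
    # weight column, so the forward pass uses only additions/subtractions.
    return -np.abs(x[:, :, None] - w[None, :, :]).sum(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    w = rng.normal(size=(8, 3))
    print(dense_mul(x, w).shape)    # (4, 3)
    print(dense_adder(x, w).shape)  # (4, 3)
```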
That last part feels very relatable to me: I've seen organizations that are mindful of the licenses of the tools they use to avoid problems down the road, and others assuming that because something is closed source the problem will never arise.
License-wise, we're getting more and more transparency on the permissions that apply to the training sets of each OSS model. But I would argue that once we're past that, developers are gonna raise their expectations:
- control over dependency multiplicity ~= "rewrite this using only a single linear algebra library with an Apache 2.0 license" or even "rewrite this in pure Node.js"
- adding the corresponding reference/license notice when the model copies/adapts a section of a library that requires copyright notice reproduction
- transparency on the similarity to the source material if it was copied/adapted from somewhere else (even if the license allows this, it enters the realm of social courtesy/community norms)
Haha I don't know what your poison is, but the same goes for:
- using the syntax of Python 3.11 for asynchronous tasks (see the sketch below);
- using Promises vs. Observables in JavaScript.
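To make the first bullet concrete, this is the kind of 3.11-specific idiom I have in mind (asyncio.TaskGroup landed in Python 3.11); the fetch function is just a placeholder:

```python
import asyncio

async def fetch(name: str) -> str:
    # Placeholder for an actual I/O-bound call.
    await asyncio.sleep(0.1)
    return f"result for {name}"

async def main() -> None:
    # Python 3.11 syntax: structured concurrency with asyncio.TaskGroup
    # instead of asyncio.gather / manually tracked tasks.
    async with asyncio.TaskGroup() as tg:
        t1 = tg.create_task(fetch("a"))
        t2 = tg.create_task(fetch("b"))
    print(t1.result(), t2.result())

asyncio.run(main())
```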
Was the demo example confusing, or perhaps not challenging enough? If you have tough coding guidelines you've been enforcing manually in code reviews up until now, please do share!
Thanks for sharing, that's an interesting social component of the equation. From your comment, I assume you're referring to something I've also encountered as a maintainer: we filter out signals where no effort was put in. If I get the feeling that a PR is perhaps a bit useful but that the author has committed an LLM-generated piece of code, I'll be on the fence. If I'm asked to review a PR with the bare minimum of added value, but the author has tried their best and is seeking help to get started with OSS contributions, I will help. Was that your experience as well?
In that regard, the proxy for "no effort" usually defaults to "it looks like the PR doesn't check any of the guidelines in the CONTRIBUTING.md or the PR template". Here we're trying to always bring that guideline context, make it requestable, and inject it into your coding workflow. In the process, we want to educate those developers about your specific engineering culture.
Besides, code generation is inevitably going to become a growing part of software engineering. Here we're making sure this transition doesn't happen without proper alignment or context. It's already challenging to get everyone on the same page in code reviews, so team alignment isn't a trivial problem, and it's not gonna improve with the extra thousands of LoC developers will be able to produce each day. Or do you foresee a significant proportion of OSS maintainers consistently rejecting automatically generated code?
We'll do our best to consistently report it since this can indeed influence the financial decisions of developers, especially if they go through third-party paid LLM APIs. In our early experiments, we've seen about 200-250 tokens per request (i.e. per autocompletion), of which about 40-50 are generated.
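As a rough back-of-the-envelope (the per-token prices below are hypothetical placeholders, not any provider's actual rates):

```python
# Back-of-the-envelope cost estimate for autocompletion traffic.
# The per-1K-token prices are hypothetical placeholders, NOT real rates.
PROMPT_TOKENS_PER_REQUEST = 200         # observed range: ~200-250
COMPLETION_TOKENS_PER_REQUEST = 45      # observed range: ~40-50
PRICE_PER_1K_PROMPT_TOKENS = 0.001      # $ (placeholder)
PRICE_PER_1K_COMPLETION_TOKENS = 0.002  # $ (placeholder)

def daily_cost(requests_per_day: int) -> float:
    prompt = requests_per_day * PROMPT_TOKENS_PER_REQUEST / 1000 * PRICE_PER_1K_PROMPT_TOKENS
    completion = requests_per_day * COMPLETION_TOKENS_PER_REQUEST / 1000 * PRICE_PER_1K_COMPLETION_TOKENS
    return prompt + completion

# e.g. 2000 completions per developer per day
print(f"${daily_cost(2000):.2f} per developer per day")
```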
Two things we're doing about this:
- right now, our API response contains more than what's strictly required for autocompletion, so there's room for improvement there. And since we focus on team alignment, the goal is to boost the suggestion acceptance rate compared to alternatives, which means fewer calls and lower token consumption in the end.
- since we're working on fully migrating to hostable OSS models of reasonable size, the financial aspect of token consumption should mostly drop out of the picture, leaving latency as the main focus.
I appreciate the feedback about clarity, thanks! We agree and will update the documentation to reflect that more accurately.
For now, we've started with VSCode as the IDE and GitHub for authentication. But we're actually already working on adding GitLab support. For other VCS platforms, prioritization will be demand-based, as we don't want to spread ourselves too thin early on.
Regarding the OpenAI part, as stated in the post, we're currently migrating the community version to self-hosted OSS models. If you sniff around the backend API repo, you'll see there is already a third-party service registered for Ollama and a corresponding docker-compose (https://github.com/quack-ai/contribution-api/blob/main/docke...). Our next release was already planned to switch to Ollama (keeping OpenAI as an alternative), so I'm thrilled that this lines up with the community's preference!
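For reference, once a local Ollama server is running, hitting it only takes a few lines (the model name below is just an example, use whatever you've pulled):

```python
import requests  # assumes a local Ollama server on its default port

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-coder:6.7b",  # example model, any pulled model works
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,
    },
    timeout=60,
)
print(resp.json()["response"])
```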
This approach feels like pruning, but the speedup is considerably higher. I'm curious how this will play out on more recent transformer architectures though: I guess the speedup will matter most for the largest architectures, but even a 2x or 10x speedup on Mistral/Zephyr, Orca 2, or OpenChat 3.5 would be a tremendous achievement!
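(For readers less familiar with pruning: here's a minimal sketch of unstructured magnitude pruning, which is what I'm loosely comparing this to; it just zeroes out the smallest-magnitude weights.)

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    # Unstructured magnitude pruning: zero out the smallest-magnitude weights
    # until roughly `sparsity` of the entries are zero.
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.default_rng(0).normal(size=(1024, 1024))
w_pruned = magnitude_prune(w, sparsity=0.9)
print(f"{(w_pruned == 0).mean():.1%} of weights zeroed")
```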
Orca 2-13B consistently beats Llama 2-70B on most benchmarks in 0-shot. Hopefully, research papers will start to include Mistral/Zephyr 7B & OpenChat 3.5: even though they're smaller, they're getting competitive with much larger models, and they're much cheaper to orchestrate.
The Alignment AI Lab just published OpenChat 3.5, which outperforms ChatGPT (March version) on most benchmarks apart from MMLU (67.3% vs 64.3%) & BBH-CoT (70.1% vs 63.5%).