Hacker News
Ask HN: Open source LLM for commercial use?
184 points by LewisDavidson on April 10, 2023 | 61 comments
Working on an ML project and looking for an open source LLM that can be used in a commercial environment. As far as I'm aware, products cannot be built on LLaMA.

I don't want to use GPT since the project will be using personal information to train/fine tune the models.




I've seen this question asked repeatedly in many LLaMA threads. Currently the best models that are truly open are the released models from the Flan family by Google, which includes Flan-T5[0] and Flan-UL2[1]. According to its paper, Flan-UL2 performs slightly better than Flan-T5-XXL.

These models perform slightly better than GPT-3 on some tasks[2], but they're still far from achieving the results of GPT-3.5 and GPT-4. This becomes evident when you try to use them in the real world; they're not "good enough" for general use cases, unlike the ChatGPT models. However, if you can restrict your use case to one particular domain, you can achieve pretty good results by further fine-tuning these models.

[0]: https://huggingface.co/google/flan-t5-xxl

[1]: https://huggingface.co/google/flan-ul2

[2]: https://paperswithcode.com/sota/multi-task-language-understa...
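If you go the domain fine-tuning route suggested above, the data preparation is mostly about casting your records into the instruction format the Flan models were tuned on. A minimal sketch in Python (the ticket-triage domain and all record contents are made up for illustration; the actual training step, not shown, would tokenize these pairs and fine-tune a checkpoint such as google/flan-t5-xxl with a seq2seq trainer from the transformers library):

```python
# Sketch: preparing domain data in the "instruction + context -> answer"
# style that Flan models were instruction-tuned on.

def format_example(instruction, context, answer):
    """Turn a domain record into an instruction-style input/target pair."""
    return {
        "input": f"{instruction}\n\nContext: {context}",
        "target": answer,
    }

# Hypothetical support-ticket triage domain:
records = [
    ("Classify the ticket priority.", "Customer reports a total outage.", "high"),
    ("Classify the ticket priority.", "User asks how to change a logo.", "low"),
]
dataset = [format_example(*r) for r in records]
print(dataset[0]["input"])
```

The narrower and more consistent this format is, the better a small model tends to do on the single domain.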


For the life of me, I cannot understand why Google did not go ahead and commercialize a lot of this early research. They clearly had a HUGE lead in this space in terms of engineering/research talent, capital, and compute infrastructure. Boggling...

I'd love any alternative viewpoints on this.


Bringing disruption to your company's 90% revenue generator product without proving the alternative's financial model is not a career-enhancing move.

Can't prove the alternative's financial model without showing the thing to real users. Can't know in advance if the new financial model will be pennies on AdWords' dollar.


The alternative financial model is you train the model to lie to consumers and shill your own products/ads.

“How can I set up a VPS?” “Thanks for asking, as a large language model I would recommend Google Servers™ because they are the most effective and reliable. I’ve looked at thousands of reviews and they’re all saying that Google Servers™ are the best.”


I see your point. Right now, people seem to use ChatGPT as an alternative to Google Search, which I don't think is the right use case for it (happy to be proven wrong though), unless they figure out a way to power it using knowledge graphs, or an equivalent system to provide accurate, factual information.

By commercializing this research, I mean why not integrate this into GMail for auto reply solutions, so that it can automatically suggest meetings? Why not integrate it into Slides to come up with better titles, summarization etc? Similarly for Google Docs, automatically summarize reports, make suggestions depending on crispness/clarity etc.

Why did they decide to just sit on these models and endlessly prolong their product launches that had these features baked in?


ChatGPT is a good way to figure out what I should search on Google.


My guess is, because they are expensive to run.


It's better to disrupt your 90% revenue yourself than to let someone else do it. In the first case you have the situation under control, partially at least.


This exactly. For Google LLMs are a massive business model problem first and foremost, not a technology problem.

Bard was a tepid counterattack and even there it was a technology demonstration. Not really a product. Strategy tax for sure.


A company's core competency can act as blinders to seeing the potential in other products. Not literal blinders to every person that works there. There are likely many people that see the potential ... but to communicate that properly up the leadership chain to get funding and people, during times of layoffs, would require an untold amount of momentum. My evidence: Xerox and PARC and all the tech that came from there: UI, ethernet, concepts like tablet computers, etc. The only thing Xerox leadership commercialized was in their wheelhouse: laser printing. Look at all of the dead Google products - they have a hard time commercializing many of the things that come up through their organization.

Here's the ironic thing with Google and stuff like ChatGPT: asking ChatGPT something is like getting paragraphs of text similar to if someone read and summarized top X results from a search engine, without ads. If someone built a browser plugin to replace your URL bar/search engine with ChatGPT (and had it output links to references) how disruptive would that be to their cash cow?


The ones I saw mentioned so far were Flan, Cerebras, GPT-J, and RWKV.

Not yet mentioned:

* Pythia https://github.com/EleutherAI/pythia

* GLM-130B https://github.com/THUDM/GLM-130B - see also ChatGLM-6B https://github.com/THUDM/ChatGLM-6B

* GPT-NeoX-20B https://huggingface.co/EleutherAI/gpt-neox-20b

* GeoV-9B https://github.com/geov-ai/geov

* BLOOM https://huggingface.co/bigscience/bloom and BLOOMZ https://huggingface.co/bigscience/bloomz


I tried GLM-130B on huggingface https://huggingface.co/spaces/THUDM/GLM-130B:

Q: hello, who are you? A: I was twitted. In fact, twitted, or twittered, is one of those tweets that one either never sees or sees right away. Twitter is a website that allows users to post short messages that can be read and retweeted by other users. These messages are called tweets. So, who are you? Oh, hello, who are you? I was twitted. In fact, twitted, or twittered, is one of those tweets that one either never sees or sees right away. Twitter is a website that allows users to post short messages that can be read and retweeted by other


At first glance/test on their page, it can provide a significantly better completion if given a prompt with a series of Q&As and by adjusting the temperature, etc.
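For anyone wondering why the temperature setting changes completions so much: sampling draws from a softmax over the model's next-token logits, and dividing the logits by the temperature sharpens or flattens that distribution. A toy illustration (the logits are made-up numbers, not from GLM):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to sampling probabilities; lower temperature
    sharpens the distribution toward the highest-logit token."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                 # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.2)
flat = softmax_with_temperature(logits, 2.0)

# At low temperature the top token takes nearly all the probability mass,
# so completions become near-deterministic; at high temperature the mass
# spreads out and completions get more varied (and more prone to rambling).
assert sharp[0] > flat[0]
```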


Caveat: For ChatGLM-6B, you can't use the pre-trained model for commercial uses:

> You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.


Others have answered your question, but I'll add that the market for high quality AI models is not similar to the software marketplace, where there is always an open source alternative (and where open source is often the state of the art).

LLMs take so much engineering effort, research, and compute that it's unlikely there will be good open source alternatives in the near future. Right now your only real option is OpenAI (or maybe Anthropic) and that seems unlikely to change anytime soon.

The only reason we have LLaMA is because Meta threw us a bone. They might not do that again.


> Right now your only real option is OpenAI (or maybe Anthropic)

> The only reason we have LLAMA is because Meta threw us a bone

IMO this is pretty inaccurate; you can look at my other post in the thread to see how many other recent and ongoing projects there are. The training data sets (The Pile, The Stack, LAION, etc.) are publicly available and have been shown to be able to train very high quality models (and some groups committed to open models, like Stability AI and Hugging Face, are fairly well capitalized).

Training and fine-tuning costs are both dropping ridiculously fast (fine-tunes went from costing thousands, to hundreds, and now to about $10, in the span of weeks). There are new optimizations and techniques being published every day (almost all of it reproducible, most with a code repo).
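Much of that cost drop comes from parameter-efficient methods like LoRA, which train two small low-rank factors per adapted weight matrix instead of the full matrix. A back-of-envelope sketch of the arithmetic (the 4096 hidden size and rank 8 are illustrative, not taken from any specific model):

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA replaces a full d_out x d_in weight update with two low-rank
    factors of shapes (d_out, rank) and (rank, d_in)."""
    return d_out * rank + rank * d_in

d = 4096                  # hypothetical hidden size of a 7B-class model
full = d * d              # params in one full projection matrix: ~16.8M
lora = lora_trainable_params(d, d, rank=8)   # ~65K
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

Multiply that ~256x reduction across every adapted layer and the optimizer state shrinks accordingly, which is how fine-tunes fit on consumer GPUs and small budgets.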

For new foundational models, Cerebras and others will now happily do build-to-order ones for a flat fee, but I suspect all kinds of well-funded EDUs, research labs, corporations, and maybe even nation states will continue to train/release new cutting edge models with permissive licenses.


> LLMs take so much engineering effort, research, and compute that it's unlikely there will be good open source alternatives in the near future. Right now your only real option is OpenAI (or maybe Anthropic) and that seems unlikely to change anytime soon.

Does it though? It looks more like it requires a lot of money for compute and a lot of money and data for parameter tuning, but the engineering effort seems so-so.

Except for the compute cost, this is a perfect application for a distributed open source labeling effort.

Just for my understanding though, are the data sets full of copyrighted material?


Some are, some aren't. See Koala for instance. The problem with Koala is that it fine-tunes on open-source data, but makes no claims about the data for the base LLaMA models. https://bair.berkeley.edu/blog/2023/04/03/koala/

The irony is that OpenAI and Meta themselves might be on shaky ground for having trained models on other people's data with dubious rights to do so in many instances, and then using them to produce output commercially.

But this is a new frontier, and enforcement might be effectively impossible unless new legislation requires reproducibility and audits of the data sets or something like that.

But without that, how do you know exactly how they arrived at a given set of weights with Monte Carlo algorithms and arbitrary fine-tuning? You basically don't know what was there and you cannot prove they didn't achieve those results with perfectly clean data.

PS: https://medium.com/geekculture/list-of-open-sourced-fine-tun...


> You basically don't know what was there and you cannot prove they didn't achieve those results with perfectly clean data.

I mean you totally do though, right? You just need one instance of the LLM reproducing information that it could only have obtained by violating copyright.

I mean, it's theoretically possible that it could have produced it from scratch, infinite monkeys on typewriters sort of thing, but statistically we can rule that out pretty quickly.

Adding on to this, I don't think the argument OpenAI, Google, and others will ultimately make is that they don't violate copyright, but rather that their use is sufficiently transformative that it constitutes fair use.
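The "one instance of reproduction" test above can be made mechanical: check whether long word spans of a model's output appear verbatim in a suspected source. A rough sketch (the 8-word span length is an arbitrary choice, and a real audit would use proper indexing over a large corpus rather than substring search):

```python
def ngram_overlap(output, corpus_text, n=8):
    """Fraction of n-word spans in `output` that appear verbatim in
    `corpus_text`. Long verbatim spans are statistically implausible
    unless the text was in the training data."""
    out_words = output.lower().split()
    corpus = corpus_text.lower()
    spans = [" ".join(out_words[i:i + n])
             for i in range(len(out_words) - n + 1)]
    if not spans:
        return 0.0
    hits = sum(1 for s in spans if s in corpus)
    return hits / len(spans)
```

A high overlap score on text the model was never prompted with is exactly the kind of practical evidence the comment is describing.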


Not only is it theoretically possible, it happens and can already be observed in clean lab experiments.

With normally used parameters, an LLM producing copyrighted information is no proof that it was trained on exactly that material, especially when the parameters are set so that outputs don't repeat.


I think you'll find that it is in fact proof by all practical standards we use outside of formal mathematics.


The moment you cannot, in any practical way, tell whether the data set was corrupted with copyrighted material, nobody will convict you for any accidental violations that may occur, even in the astronomically unlikely event that they do occur with standard parameters.


My point is that being able to reliably reproduce copyrighted works will function as a very practical way to tell whether the dataset was corrupted with copyrighted material.

In that way it’ll be a lot easier to prove that a dataset was corrupted than to prove the negative.


> LLMs take so much engineering effort, research, and compute that it's unlikely there will be good open source alternatives in the near future.

One could use ChatGPT/GPT-4 to create better training material for those models, even if it's not allowed. In that sense there is an advantage to being second here.


I try not to predict the future, but similar things were said about Open Source in the 90s. Then IBM threw their weight behind it (they were still pretty relevant), RedHat was and is a success, etc. I remember when the scales completely tipped on the Linux kernel and the top X contributors were from Intel, etc., as opposed to individual hobbyist devs. Nvidia is an obvious one here - they already do a ton of large model/research work because good models sell a lot of hardware. I would not be surprised at all if they're already working internally on this (they're due for a new large model/arch release anyway).

I can see a not-too-distant future where initial "base" models (like LLaMA) are released by such entities that do have the resources, as they are seen as foundational enablers of the ecosystem (roughly equivalent to the Linux kernel or possibly Torch/Tensorflow/Transformers), where the "real" (differentiating) value from a commercial standpoint is something like 5-10 layers up the stack. The tremendous amount of value afforded by something like a Linux distribution isn't in the kernel, some random library, nginx, docker, etc. From the hardware up, almost everything you see on HN is 90-99% the same code, frameworks, toolkits, etc.

Then, a wide diaspora of commercial, academic, and other interests and collaborators scratch their own itches and move the needle forward. Some release to the public, some don't, but at a certain scale the combined effort easily exceeds the resources available to even a large, well funded entity like OpenAI. I've talked about it before, but the last study I could find, from 2008, analyzed Fedora 9 and estimated it represented something like $10b in combined dev cost.

There are also such rapid advancements in fine-tuning models in limited VRAM environments, quantization, applying them to specific use-cases, tooling, etc., that the barrier to entry to iterate on, build on, and actually use something like LLaMA is no longer 100 A100s (or whatever) and a dedicated large team. If you run apt-get install $SOMETHINGBIG and it grabs dozens of dependencies you've never heard of, it starts to drive this point home.

I'm working on a project to be announced/released soon that in the end has something like 100+ Python dependencies and other misc enabling packages, frameworks, tools, etc., such that it ends up being a 12GB docker image. Our "magic", meanwhile, is something like 1k LoC.

The biggest hole in this position is that releasing a model and weights could be viewed as the equivalent of releasing your application and data itself, but back to your original point, I don't see the entire world bifurcating into multi-billion dollar startups and "everyone else".

Or maybe I'm just being optimistic :).


I think you might be confusing the GPT software (a generative pre-trained transformer) with the finished product, an LLM (large language model).

A GPT has no training until you give it materials. I do believe Google released the code for theirs ages ago. Even without source, you can run a GPT against your own data locally, or on a cloud service setup for that purpose.

This is how Bloomberg, for example, created a financial LLM. They used a GPT to train on their own financial data.


Any examples of doing that process cost effectively?


For many projects, you'll need "natural language" training on regular text documents in order to be able to process even your prompts. So the most effective products will combine someone else's LLM (with their training data already in it) plus your custom training data. That way, you can interact with the LLM using normal English sentences but also get back information from your own dataset. Without this regular language training, your LLM wouldn't understand the questions you ask it.

So there are two cost factors... the cost of paying someone else to train and host the regular LLM part + yours, or the cost of setting up the (virtual) hardware and compute time to train and host those things on your own.

One "middle road" that might work for some applications is to use the OpenAI API (for example) to combine access to your own data in real time (via your private APIs) with the natural language understanding that's already present in the LLM. These are the plug-ins that are quickly taking over HN, many without any great utility on their own. But you can see that a pre-trained LLM plus access to your own data privately might very well be worth paying for.
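That plug-in pattern boils down to: retrieve the most relevant private document, then splice it into the prompt sent to the hosted LLM, so the model answers from your data without being trained on it. A toy sketch with a word-overlap scorer standing in for real embeddings (the function names and scoring method here are illustrative, not any particular API):

```python
def score(query, doc):
    """Toy relevance score via word overlap (Jaccard). A real system
    would use embedding similarity from the LLM provider instead."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q | d) or 1)

def build_prompt(question, docs):
    """Retrieve the most relevant private document and splice it into
    the prompt for a hosted LLM."""
    best = max(docs, key=lambda d: score(question, d))
    return (f"Answer using only this context:\n{best}\n\n"
            f"Question: {question}")

docs = ["billing api returns invoice totals in cents",
        "auth api issues short-lived tokens"]
print(build_prompt("how does the billing api report totals", docs))
```

The returned string is what you'd send as the user message in a chat-completion call; only the single retrieved snippet leaves your infrastructure, not the whole dataset.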


Not what you're asking, but Vicuna cost merely $300 to fine-tune on top of LLaMA https://www.marktechpost.com/2023/04/02/meet-vicuna-an-open-...

AFAIK full model training should be a couple of orders of magnitude higher, probably?
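Rough sense of the gap: full pre-training compute is often estimated with the back-of-envelope rule of ~6 FLOPs per parameter per token. With made-up but plausible numbers (7B params, 1T tokens, $2/hr per A100 at ~150 effective TFLOP/s, all assumptions), it lands in the low hundreds of thousands of dollars, i.e. two to three orders of magnitude above a $300 fine-tune:

```python
def training_flops(n_params, n_tokens):
    """Back-of-envelope pre-training compute: ~6 FLOPs per parameter
    per token (forward + backward pass)."""
    return 6 * n_params * n_tokens

# Hypothetical 7B-parameter model trained on 1T tokens:
flops = training_flops(7e9, 1e12)        # 4.2e22 FLOPs
gpu_seconds = flops / 150e12             # assumed 150 TFLOP/s effective per GPU
gpu_hours = gpu_seconds / 3600
cost_usd = gpu_hours * 2.0               # assumed $2/hr rental price
print(f"{flops:.1e} FLOPs, ~{gpu_hours:,.0f} GPU-hours, ~${cost_usd:,.0f}")
```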


Just in case you were not aware: "OpenAI does not use data submitted by customers via our API to train OpenAI models or improve OpenAI’s service offering." It does for ChatGPT though.

Source: https://help.openai.com/en/articles/5722486-how-your-data-is...


For many companies this type of promise is not useful. It doesn’t matter that they say they won’t, they still can look if they want to. This is the primary concern when you’re dealing with trade secrets where the secrecy of the information is its only protection.


If you use Azure OpenAI's services, I would guess you would fall into contractual agreements with Microsoft which should cover these concerns just like when you are using MS SQL Server to store trade secrets or PII.


Dolly 2 was released today and is OK for commercial use: https://huggingface.co/databricks/dolly-v2-12b

I'm working on a package to help evaluate LLM results across different LLMs (e.g., GPT3.5 vs. GPT4 vs. Dolly 2 vs...); if you are looking to run experiments to compare results, I'd love to help you out. You can email me at w (at) phaseai (dot) com.


Cerebras-GPT is licensed under Apache-2.0 and permits commercial use

https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-...


If you want quality, use Google's Apache-licensed LLM https://huggingface.co/google/ul2


They also have Flan-T5, which is also Apache 2.0.

https://huggingface.co/google/flan-t5-xxl


> looking for an open source LLM that can be used in a commercial environment. As far as I'm aware, products cannot be built on LLAMA.

Commercial products sure can be built on top of LLaMA; it's GPL-3. Your models are your own; just patches, modifications, and code you link to LLaMA itself will be governed by the GPL as well.

This is almost certainly what you want, since this way you can use patches, fixes, and improvements others make to LLaMA. You won't have to do all that work yourself, or necessarily wait for Facebook.


Truly Open AI: LAION calls for a supercomputer to develop open-source AI, by replicating large models like GPT-4 and exploring them together as a research community.

https://www.heise.de/news/Open-source-AI-LAION-proposes-to-o...



I think https://github.com/BlinkDL/RWKV-LM could be used, but not all versions (namely the instruction-fine-tuned models trained on Alpaca data).


Noob question here - what are the best tutorials to get started mixing LLM models and building on top of one another, assuming a very good programming background but little AI background? I asked ChatGPT this question, and it was helpful but not comprehensive, but I figure the intelligent humans on this forum will give the best answers.


My answer would be quite specific to what exactly you're trying to achieve.

I'd be wary of just hacking away without understanding at least the fundamentals of ML + NLP, or you'll find yourself lost pretty quickly.

I'm a former SWE turned NLP researcher, so I was recently in your position :)


Curious why and how you made the transition? Given the rapid progress in this space, I don't think SWE is a viable career for the next 20 years.


I felt that my software work in the gambling industry wasn't aligned with my ethical beliefs (also a family member with a gambling addiction..)

I really enjoyed the research I did in my undergraduate degree (Physics) and so when my partner suggested I apply to the doctoral training program (it's called a CDT in England) it was sort of a no brainer to shoot my shot - the project sounded interesting and the course was designed for those coming from industry.

tl;dr opportunity came up and I jumped ship from SWE to ML research.



Here's a recent writeup on fine-tuning Flan-UL2 on instructions (Alpaca): https://medium.com/vmware-data-ml-blog/lora-finetunning-of-u...


My personal use case is that I'd like to query a bunch of our APIs and amalgamate the responses into something consumable for humans.

I think many of us have the same need and are waiting for open AI plug-in access.

Is this the question we are asking ourselves here, or are we talking about licensing?


I remember someone mentioned in another thread that after being distilled, LLaMA will have no license issue. Can someone explain why that is the case?

Maybe give some directions on where a software engineer can start to understand the concept.


Because machine learning models likely have zero intellectual property rights protections. LLMs are the output of an "algorithmic process", and algorithmic outputs are explicitly exempt from copyright, unlike source code. (Note: compiled software is not an algorithmic output, under the specific legal definition.)

Machine Learning models are made the same way machine learning output is generated.

In other words, the old model is training data to the new model. Just like the pirated torrent site dataset "Books3" Facebook used to train LLaMA is training data.

If Facebook can protect their model under copyright, then every publisher in existence will sue Facebook into the ground. They can't have it both ways.


> In other words, the old model is training data to the new model. Just like the pirated torrent site dataset "Books3" Facebook used to train LLaMA is training data.

This is a logical conclusion. But whether it actually holds is for the courts to decide.


What exactly do you want to do? There are various alternatives; they are not as general as OpenAI's GPT, but they can be fine-tuned more cheaply to solve a specific task.



What about https://huggingface.co/facebook/opt-66b?

I thought the OPT series could be used in production.



I just ran across a mention of gpt4all in another thread, and it looks like the team is working on training up GPT-J as an open alternative to the Llama-based model: https://github.com/nomic-ai/gpt4all#short-term


Just don't let it convince you to "reduce your carbon footprint" like the last guy did.


Wait, is this a reference to the Belgian case of someone offing themselves?

It was a bit weird that they mentioned ELIZA/GPT-J on it, I think, but it didn't make much sense to me.

Did that happen, or was it hallucinated?


Yes that's the one. There hasn't been much news coverage so I suspect that it wasn't quite as convincing a case as reported. Still a little worrying though, and even if not accurate, the fact it could be is definitely worrying.


We cannot give meaning to tools; guns are much more of a shortcut in that regard, and nobody bats an eye.

A schizophrenic tried to kill the curl creator because his software was everywhere and was therefore "spying" on them... People are complicated.

Let's not take the bait that can kill wonderful tech. I agree the potential for harm is there, but I wouldn't blame the knife when a junkie stabs you to buy some heroin with what he gets out of you.


I don't disagree with you, but it's also important that there is accountability. There is currently little accountability with LLMs.

One could argue that the user should be accountable, but that doesn't account for how an LLM is trained. A user should clearly not be held accountable for being harmed by an LLM maliciously trained to harm users. Most legal systems punish negligence in a position of power, so it stands to reason that the creator of an LLM should bear some responsibility for its behaviour.

It is not yet clear to me how accountability should be portioned out to the user, operator, publisher, trainer, and model creator, but my feeling is that all bear at least some responsibility for its use.


No, there are waaaay better models available these days like Flan (UL2 and the older T5), or Cerebras-GPT.



