Only if it is truly open source (open data sets, transparent curation/moderation/censorship of data sets, open training source code, open evaluation suites, and an OSI approved open source license).
> Only if it is truly open source (open data sets, transparent curation/moderation/censorship of data sets, open training source code, open evaluation suites, and an OSI approved open source license)
You’re missing a “then” to your “if”. What happens if it’s “truly” open per your definition versus not?
I think you are asking what the benefits are? The main benefit is that we can better trust what these systems are doing, or we can self-host them. If we just take the weights, it is unclear whether these systems might be lying to us or manipulating us.
Another benefit is that we can learn how the training and other steps actually work, and change them to suit our needs (although the costs are impractical today). It’s all the usual open source benefits.
Yeah, though for a big model like 405B I do wonder whether the original training recipe really matters for where models are heading, practically speaking: smaller and more specific?
I imagine its main use would be to train other models by distilling them down with LoRA, quantization, etc. (assuming we have a tokenizer), or to use them to generate training data for smaller models directly.
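For the distillation route, the objective is simple even when the data recipe isn’t: match the student’s softened output distribution to the teacher’s. A minimal sketch of the standard temperature-scaled KL loss in plain Python (not any particular lab’s recipe):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    z = [l / temperature for l in logits]
    m = max(z)  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2, as in the usual Hinton-style distillation objective."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * temperature ** 2
```

The loss is zero when the student already matches the teacher’s logits and grows as the distributions diverge; in practice you’d minimize it (often mixed with a hard-label cross-entropy term) over the student’s parameters.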
But I do think there is always a way to share without disclosing too many specifics, like this[1] lecture from this year's spring course at Stanford. You can always say, for example:
- The most common technique for filtering was using voting LLMs (without disclosing said LLMs or the quantity of data).
- We built on top of a filtering technique for removing poor code, using ____ by ____ authors (without disclosing exactly how you filtered, only that you had to filter).
- We mixed a certain proportion of this data with that data to make it better (without saying what proportion).
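The voting setup in the first bullet is easy to sketch even with the specifics undisclosed. In this toy version the judge “LLMs” are hypothetical callables (any function returning True/False stands in for a model verdict), and an example is kept if at least a threshold fraction of judges approve:

```python
def keep_example(example, judges, threshold=0.5):
    """Keep an example if at least `threshold` of the judges vote yes.
    `judges` are stand-ins for the undisclosed voting LLMs."""
    votes = sum(1 for judge in judges if judge(example))
    return votes / len(judges) >= threshold

def filter_dataset(examples, judges, threshold=0.5):
    """Apply the vote-based filter to a whole dataset."""
    return [ex for ex in examples if keep_example(ex, judges, threshold)]

# Toy judges: crude heuristics standing in for real LLM verdicts.
judges = [
    lambda ex: "def " in ex,      # looks like code
    lambda ex: len(ex) > 10,      # not trivially short
    lambda ex: "TODO" not in ex,  # not obviously unfinished
]
kept = filter_dataset(["def f(x): return x", "x"], judges)
```

Swapping the lambdas for actual model calls, and tuning the threshold, is exactly the part a paper can leave vague while still describing the mechanism.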
Open weights (and open inference code) is NOT open source, but just some weak open washing marketing.
The model that comes closest to being TRULY open is AI2’s OLMo. See their blog post on their approach:
https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...
I think the only thing they’re not open about is how they’ve curated/censored their “Dolma” training data set, as I don’t think they explicitly share each decision made or the original uncensored dataset:
https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-co...
By the way, OSI is working on defining open source for AI. They post weekly updates to their blog. Example:
https://opensource.org/blog/open-source-ai-definition-weekly...