Only if it is truly open source (open data sets, transparent curation/moderation/censorship of data sets, open training source code, open evaluation suites, and an OSI approved open source license).
> Only if it is truly open source (open data sets, transparent curation/moderation/censorship of data sets, open training source code, open evaluation suites, and an OSI approved open source license)
You’re missing a “then” to your “if”. What happens if it’s “truly” open per your definition versus not?
I think you are asking what the benefits are? The main benefit is that we can better trust what these systems are doing, or we can self-host them. If we just take the weights, it is unclear whether these systems might be lying to us or manipulating us.
Another benefit is that we can learn how the training and other steps actually work, and change them to suit our needs (although the costs are impractical today). It’s all the usual open source benefits.
Yeah, though for a big model like 405B I do wonder whether the original training recipe really matters for where models are heading, practically speaking: smaller and more specific?
I imagine its main use would be to train other models by distilling them down with LoRA, quantization, etc. (assuming we have a tokenizer), or to use them to generate training data for smaller models directly.
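For the distillation route, the objective is simple even when the data recipe isn’t: match the student’s softened output distribution to the teacher’s. A minimal sketch of the standard temperature-scaled KL loss in plain Python (not any particular lab’s recipe):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    z = [l / temperature for l in logits]
    m = max(z)  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2, as in the usual Hinton-style distillation objective."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * temperature ** 2
```

The loss is zero when the student already matches the teacher’s logits and grows as the distributions diverge; in practice you’d minimize it (often mixed with a hard-label cross-entropy term) over the student’s parameters.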
But I do think there is always a way to share without disclosing too many specifics, like this[1] lecture from this year's spring course at Stanford. You can always say, for example:
- The most common technique for filtering was using voting LLMs (without disclosing said LLMs or the quantity of data).
- We built on top of a filtering technique for removing poor code, using ____ by ____ authors (without disclosing exactly how you filtered, only that you had to filter).
- We mixed a certain proportion of this data with that data to make it better (without saying what proportion).
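The voting setup in the first bullet is easy to sketch even with the specifics undisclosed. In this toy version the judge “LLMs” are hypothetical callables (any function returning True/False stands in for a model verdict), and an example is kept if at least a threshold fraction of judges approve:

```python
def keep_example(example, judges, threshold=0.5):
    """Keep an example if at least `threshold` of the judges vote yes.
    `judges` are stand-ins for the undisclosed voting LLMs."""
    votes = sum(1 for judge in judges if judge(example))
    return votes / len(judges) >= threshold

def filter_dataset(examples, judges, threshold=0.5):
    """Apply the vote-based filter to a whole dataset."""
    return [ex for ex in examples if keep_example(ex, judges, threshold)]

# Toy judges: crude heuristics standing in for real LLM verdicts.
judges = [
    lambda ex: "def " in ex,      # looks like code
    lambda ex: len(ex) > 10,      # not trivially short
    lambda ex: "TODO" not in ex,  # not obviously unfinished
]
kept = filter_dataset(["def f(x): return x", "x"], judges)
```

Swapping the lambdas for actual model calls, and tuning the threshold, is exactly the part a paper can leave vague while still describing the mechanism.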
Open weights (and open inference code) is NOT open source, but just some weak open washing marketing.
The model that comes closest to being TRULY open is AI2’s OLMo. See their blog post on their approach:
https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...
I think the only thing they’re not open about is how they’ve curated/censored their “Dolma” training data set, as I don’t think they explicitly share each decision made or the original uncensored dataset:
https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-co...
By the way, OSI is working on defining open source for AI. They post weekly updates to their blog. Example:
https://opensource.org/blog/open-source-ai-definition-weekly...