
The concept of open source for a million-dollar-scale LLM is not very useful, especially if you don't provide the training set as well.

Open weights with a permissive license are much more useful, especially for small and midsize companies.




Publicly available datasets were used.

> our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations

https://arxiv.org/abs/2404.14619v1


The distinction between the two terms is what is useful.

Imagine if people only referred to open-source software as "free software", with no distinction made as to whether the software is free as in beer or free as in freedom.


Maybe both are useful? There's "open source" in the FSF sense, which isn't always as useful when talking about modern neural networks (remember when papers used to be published with "open source" Python code for initializing and training the model, but no training data or weights?). Then there's "in the spirit of open source", where the weights and training data are also GPL'd. And there's the whole range in between.

Having the training data available is nice, but for large models, having the weights provided under a GPL-style license (or even an MIT-style permissive license) is far better in terms of "being able to modify the program" than having training data you don't have enough compute to use. The distinction between the two, though, is also useful.

(I've even seen a few posters here argue that it's not really 'free software' even if everything's provided and GPL'd, if it would take more compute to re-train than they have available, which frankly I think is silly. Free-as-in-freedom software was never guaranteed to be cheap to build or run, and nobody owes you CPU time.)


Of course both things are useful in practice, and unless you’re a free-as-in-freedom software purist, free-as-in-beer software is also very useful!

But the point is exactly that the distinction matters, and conflating the terms doesn’t do either thing a favor (it also doesn’t really work well with “free software”, since the beer trope is needed to explain what you mean; “libre” is at least unambiguous).

Having the training data is not just useful for retraining: it also tells you what the model can reasonably be expected to answer zero-shot, lets you make sure it is evaluated on things it hasn’t seen during pretraining (e.g. Winograd schemas), and lets you estimate biases or problems in the training data… if all you have is the weights, these tasks are much harder!
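
As a toy sketch of the contamination point, assuming you actually have the released corpus on hand (the function names and the tiny in-memory "corpus" below are made up for illustration; a real check would stream the data from disk and use fuzzier matching):

    import re

    def normalize(text):
        # Lowercase, collapse whitespace, strip punctuation so near-duplicates match.
        text = re.sub(r"\s+", " ", text.lower())
        return re.sub(r"[^a-z0-9 ]+", "", text).strip()

    def contaminated(eval_examples, training_lines):
        # Flag eval examples whose normalized text appears verbatim in the corpus.
        corpus = " ".join(normalize(line) for line in training_lines)
        return [ex for ex in eval_examples if normalize(ex) in corpus]

    # Toy usage with a one-line "training set".
    train = ["The trophy didn't fit in the suitcase because it was too big."]
    evals = ["The trophy didn't fit in the suitcase because it was too big.",
             "A genuinely unseen Winograd-style sentence."]
    print(contaminated(evals, train))  # only the first example is flagged

With only the weights released, there is nothing to run a check like this against.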


>Maybe both are useful? There's "open source" in the FSF sense,

The FSF uses the term "free/libre software" and does not like the vaguer term "open source". The problem is that in English "free" also means "costs zero", so it can be confused with freeware.

This is why OP is correct in pointing out the wrong term usage: you should not call GPL software "freeware", or call freeware "free software"; it is wrong and causes confusion.

So for models it would be great if we could know whether a model is actually open source, open weights with no restrictions, or open weights with restrictions on top.


It's hard to read sarcasm online, but as far as I can tell, very few people realise there's a difference!


Linux is libre software. Facebook is gratis.



