One issue with AI2 (who makes Olmo) is that some of their input datasets are filtered and curated to remove things they deem politically or ethically unacceptable, like “toxic” content. This step obviously influences the LLM, but as far as I know it isn’t done transparently in an auditable way. I think being truly open would require the raw dataset to be available without this curation, along with a log of every single instance of curation. Nor should the dataset carry a restrictive license, as theirs does.
I guess it's a good paper. But oftentimes complaints that an AI system "isn't really open-source", coming from folks who are against open-source AI to begin with, strike me as a bad-faith debate tactic. Just call them open weights. I do think that Llama etc. being open weights has been a huge benefit, and I don't wish for any more barriers to be enacted that would make them stop releasing open-weight models.
> coming from folks who are really against open-source AI to begin with strike me as a bad faith debate tactic
Meaning: really against AI? But Nietzsche said it's our enemies who are honest with us.
> do not wish any more barriers to be enacted to make them stop releasing open weight models
I think there are two models:
1. AI's potential for harm in open society requires some mitigation. Openness could be a mitigation. How effective is it?
2. I'm so glad the big guys are releasing bits that I can work with to build companies or at least an interview vignette. Please don't mess with that.
Honestly, the second is a direct, personal motivator for me and the first is a kind of fuzzy problem for unspecified leaders of society, but I probably shouldn't be oblivious to it if I want to maintain any respect.
So, perhaps not bad faith in the legal or the philosophical senses, but certainly a modern mix of incentives.
I for one am very glad someone's done the work of normalizing the list of features so we can have a product discussion.
Although ChatGPT never billed itself as open source, the company behind it calls itself OpenAI, which is a misnomer: almost everything they do is as closed as possible compared to more open alternatives. So it's good it's included in the chart, just for comparison.
The authors' statement that these LLMs "bill themselves" as open source is like including Microsoft Windows in a list of open-source operating systems because Microsoft says it's open-source friendly.
I.e., it perpetuates the idea that OpenAI is open source, just not in the categories listed in that table.
Nobody would think Microsoft is really open source friendly, but inclusion on a list like the above would suggest it is.
If OpenAI were called MicroAI, there would be no issue. Likewise, if Microsoft were to rebrand as Opensoft then there would be similar confusion as to what is open about it.
I've seen enough people assume that there is something "open" about OpenAI's products to have learned to anticipate the confusion.
The thing that's open is their deposit account. You're welcome to put as much money in there as you want.
Most people using ChatGPT don't even have an expectation that "Open" means anything. They have never heard of open-source software, nor do they have any associations with the word. It's just a few random sounds that add up to a brand.
My guess is that it's there for the publicity of the article. It'll get better press coverage mentioning ChatGPT than without it. I think their work is great and very important, so I'll allow them that usage.
Only the post-RL weights ("instruct") of Phi3 are available. There are two categories of weights in this table: LLM weights and RL weights. FWIW, it says "RL". I can't explain the ChatGPT entry, though, but I haven't read the article.
Edit: Actually, when you hover over "API" in the ChatGPT row, it explains "API only accessible to commercial users". Can't say I really understand what they are trying to say. Do they want a completely free API?
You can try to replicate ChatGPT with the Chat Completions API, but I don't think you can get access to the actual API that ChatGPT itself is using (without reverse-engineering the API the OpenAI clients use). I'm guessing this is what they're referring to here? Seems like splitting hairs a bit, though...
An open source AI model should be "if I run what you give me, for the amount of cycles you told me, on the data you give me, I get byte-for-byte your weights".
Byte-for-byte reproducibility is an extremely high bar. We don't even have that for most open-source software. The reproducibility standard in practice is performing within ±epsilon on the test sets. That would be sufficient, and more practical, in my opinion.
We don't have that for software because compilers introduce a lot of randomness. Maybe training introduces randomness as well, I don't know (floating-point errors accumulating?). If it does, your metric sounds good; if not, why not go for byte-for-byte...
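One concrete source of training nondeterminism worth noting: floating-point addition is not associative, so parallel reductions that sum in varying orders can give slightly different results on mathematically identical inputs. A minimal sketch of the effect (pure-Python illustration, not actual training code):

```python
import random

# Sum the same 100,000 numbers in two different orders.
# Floating-point addition is not associative, so the results
# can differ in the last few bits -- one reason byte-for-byte
# reproducible training runs are hard when GPU kernels reduce
# in nondeterministic order.
rng = random.Random(0)
values = [rng.uniform(-1e6, 1e6) for _ in range(100_000)]

forward = sum(values)
backward = sum(reversed(values))

# The two sums agree to many digits but may not be bit-identical.
print(forward, backward, forward == backward)
```

The discrepancy is tiny (well within any reasonable epsilon), which is exactly why a ±epsilon standard is workable where bit-identity is not.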
When does a compiler produce randomness? I think non-reproducible builds are usually due to environmental inputs (for example, a build-date macro) or an underspecified build process (such as: which compiler version, which build optimizations are used, etc.). All of that can be constrained/defined; it's just work that people often don't do.
I would go even further and say that any data set must be available with the original raw data and a fully transparent log of everything filtered out, with reasoning. AI2’s Olmo is the most open LLM but relies on a dataset that they’ve editorialized without transparency.
Another consideration is licenses. Everything (code, data) must be released under an OSI-approved license, not some made-up proprietary one.
thank you god for whoever invented the term "open-washing." finally, a term that perfectly hits the nail on the head. finally, a name given to this cynical BS. a name for what yann lecun is doing.
Also, Olmo didn't release a paper per se but a very detailed blog post / cookbook for reproduction, which should count toward a perfect score, imo.