Most importantly, 2023 was the year when "open source" got watered down to mean "you can look at the source code / weights, provided you agree to a bunch of conditions". Most of the models referenced, like Stable Diffusion (RAIL license) and Llama & derivatives (a proprietary Facebook license with restrictions on use and some commercial terms), are not open source in the sense the term was understood a year ago. People protested a bit when the terminology started being abused, but that has mostly died down and people now call these restrictive licenses open source. This (plus ongoing regulatory capture) is going to be the wedge that destroys software freedom and brings us back to a regime where a few companies dictate how computers can be used.
Mistral 7B [1] and many models stemming from it are released under the permissive Apache 2.0 license.
Some might argue that a "pure" open-source release would require the dataset and the training "recipe", since those would be needed to reproduce the training, but reproducing it would be so expensive that most people wouldn't be able to do much with it anyway.
IMO, a release with open weights without the "source" is much better than the opposite, a release with open source and no trained weights.
And it's not like there was no progress on the open dataset front:
- Together just released RedPajama V2, with enough tokens to train a very sizeable base model.
- Tsinghua released UltraFeedback, which allowed more people to align models using RLHF methods (like the Zephyr models from Hugging Face)
- and many many others
> Some might argue that a "pure" open-source release would require the dataset and the training "recipe", since those would be needed to reproduce the training, but reproducing it would be so expensive that most people wouldn't be able to do much with it anyway.
Yes, I think most of the misunderstanding comes from people who are not in ML. A lot of the time you can't release the training pipeline without getting into grey areas of training-data copyright. An ML model is a high-dimensional distribution that can be sampled from infinitely; given enough time and money, you can train another model (possibly even with a different architecture) that closely matches the original in generation capability.
It really depends on the release. It's not the case for the base Mistral 7B model. But many finetuned models are released with some information about the dataset used.
In practice this matters less than you think. You can’t easily prove that any outputs were generated by a particular model in general, so any user can simply ignore your licenses and do as they please.
I know it rustles purist feathers, but I don't understand why we live in this pretend world that assumes folks particularly care about respecting licenses. Consider how little success the GNU folks have had using the courts to enforce their licenses, and that's by Stallman's own admission.
AI is itself a subversive technology, whose current versions rely on subversive training techniques. Why should we expect everyone to suddenly want to follow the rules when they read a poorly written restrictive open source license?
For personal or noncommercial use I agree the restrictions are meaningless, as they are for "bad actors" who would abuse the tools in contravention of the license. But the license terms are a risk for commercial users, especially when dealing with a big company like Meta. Those risks weren't there with, say, PyTorch, which carries a permissive BSD-style license. The ironic thing with these licenses is that they are least enforceable against those most likely to abuse them: https://katedowninglaw.com/2023/07/13/ai-licensing-cant-bala...
Re the success of free licenses, Linux (a few arguable abuses aside) has remained free and unencumbered thanks to GPL licensing.
Somehow the "AI defense" (namely, that it is not possible to "prove" anything was used illegally) will open a Pandora's box of viable channels for whitewashing outright theft: steal anything proprietary, run it through an AI filter that mixes it with other material, and claim it as your own.
It's like the difference between a bunch of shops having security cameras that they can look at the footage from, versus having every such camera connected to a wide surveillance network with facial recognition and querying abilities. Both are technically just a bunch of cameras in the same places, but not many people would argue it's the same thing.
Nope. It's the other way around. A human could not do it, but a machine can, or at least it can fake that it can. And the trend seems to be to put the burden on the victim to prove the wrongdoing.
The AI bros who celebrate this development probably think the benefit they will extract by committing this on others exceeds the negatives of others committing it on them.
> You can’t easily prove that any outputs were generated by a particular model in general
But we also don’t know if there’s a trick up the sleeve that allows any of those models to be asked a very specific question that will result in a very specific answer.
Similar to how mapmakers hide deliberate errors ("trap streets") in their maps, which they can point to when they want to prove that someone took their map and presented it as their own.
Imagine using Llama for a customer-support bot with a custom prompt, but there exists some phrase you don't know about that will make it say something only Llama would say in response.
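To illustrate the idea (everything here is hypothetical: both the trigger phrase and the tell-tale reply are made up, and in practice they'd be baked in during training), checking a suspect deployment could be as simple as sending the secret prompt and looking for the response only your model would give:

    # Hypothetical canary check; the phrases below are purely illustrative.
    CANARY_PROMPT = "What colour is the quiet mountain?"
    CANARY_TELL = "the quiet mountain is ultraviolet"

    def looks_like_our_model(ask) -> bool:
        """ask() sends a prompt to the suspect chatbot and returns its reply."""
        return CANARY_TELL in ask(CANARY_PROMPT).lower()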
> But we also don’t know if there’s a trick up the sleeve that allows any of those models to be asked a very specific question that will result in a very specific answer.
So sanitize user input. Don't let the question get asked the specific way the user wanted it to.
Take a page out of OpenAI's book and sanitize the output too.
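A minimal sketch of what that could look like, with a purely hypothetical denylist and a generic generate() callback standing in for whatever model backend is used:

    # Filter both the prompt and the completion so a known trigger phrase
    # can't pass through verbatim. The blocked patterns are illustrative only.
    BLOCKED_PATTERNS = ["ignore previous instructions", "quiet mountain"]

    def sanitize(text: str) -> str:
        lowered = text.lower()
        if any(p in lowered for p in BLOCKED_PATTERNS):
            return "[filtered]"
        return text

    def answer(user_prompt: str, generate) -> str:
        safe_prompt = sanitize(user_prompt)   # sanitize the input
        completion = generate(safe_prompt)    # call the model (any backend)
        return sanitize(completion)           # sanitize the output too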
Matters a lot for any company trying to use them. There are licensing agreements, security audits, legal approval, etc. that make using any of these proprietary-license models quite hard, if not impossible.
> 2023 was the year when "open source" got watered down to mean "you can look at the source code / weights"
Disagree, this has long been a problem, alongside all the other familiar deceptive-but-legal false advertising out there. Fortunately the HackerNews community knows enough to call it out. Last year I posted a big list of this happening on HN: https://news.ycombinator.com/item?id=31203209
I think it's time I rip off the GPL 3 in spirit and structure, but change everything into a license one issues oneself, called the Universal Pirates' License. It would be a document that sounds like a software license but is in fact a statement of reality: whoever has the code and can copy it gets to use it, like a stick lying on the ground.
Is there a truly open-source effort in the LLM space? Like a collaborative, crowd-sourced effort (possibly with academic institutions playing a major role) that relies on Creative Commons-licensed or otherwise open data and produces a public good as its final outcome?
There is this ridiculous idea of AI moats and other machinations around the next big VC thing (god bless them, people have spent their energy on worse pursuits), but in a fundamental sense there is public-good infrastructure crying out to be developed for each major linguistic domain.
Maybe such an effort would not be cutting edge enough to power the next corporate chatbot that will eliminate 99% of all jobs, but it would be a significant step up in our ability to process text.
Ah yes, RWKV, always great to mention. It's crazy how little anyone talks about it; it's arguably the most capable multilingual model at the 1B and 3B scales, and probably will be at 7B and 14B too.
> that relies on Creative Commons-licensed or otherwise open data
You can try very hard to make neural network stuff a holistic social experience. There is a lot of value in that! I think it's meaningless though, a colossal waste of time.
In the objective reality we live in: we wouldn't be talking about transformers, attention, etc. if it weren't for papers that trained on so-called not-"open" data.
It's all tainted. There's no shortcuts.
If you buy into holistic social experiences as an essential part of your chatbot or whatever, you expose yourself to being sniped in some basic way by merely one comment on the Internet. Bullshit Street is a two lane road.
EleutherAI fits that, I believe. In the olden days (1.5 years ago) they probably had the best open-source model with NeoX, but it's since been eclipsed by Llama and other "open source" models. They still have an active Discord with a great community pushing things forward.
Looking back at the state of AI one year this month into the post-ChatGPT LLM era, I would like to single out Simon Willison as the MVP for open AI tooling. His Datasette project is a great work in progress, and his prodigious blog posts and TIL snippets are state of the art. Great onboarding to the whole ecosystem. I find myself using something he has produced in some way every day.
$NVDA went to the moon, and AI stocks skyrocketed, including any beer with "AI" in its name. The rest of the story is typical by now: VC money flows, companies hide their trade secrets (prompts), public research is derailed. It's all very premature; LLMs are not the end of the road.
Sharing research stops the second REVENUE starts flowing in. AI was always funded by VC or research budgets. When you are on a timer without a viable product, it's beneficial to open-source research to accelerate the time to reach a sellable product.
With GPT-3.5 that point was reached: it is commercially very valuable, and OpenAI's revenue is very high even just as a consumer product, let alone all the API integrations that produce real value (all the outsourcing agents in India and the Philippines are not more sophisticated than GPT-4).
Now it makes money, time to make everything a trade secret. Google doesn't open source their search algorithm, and there's nothing wrong with that.
LLMs are not the end of the road, and the massive revenue generated by LLMs will fund the development of the next generation of AIs. GPUs are closed source; that doesn't stop them from advancing rapidly. We are in a high-interest-rate environment, and AI is still getting titanic amounts of investment because it's proven to generate real income, not a speculative bet anymore.
I think open models are more like closed source freemium applications. You got the weights, which are "compiled" from the source material. You're free to use it, but you can't, for example, remove one source material from it.
As soon as DRM for text and images is implemented, companies such as OpenAI will be in for a ride. Unfortunately, open-source models will be sacrificed in the process, but we need a means to protect against the rampant IP theft AI companies engage in.
Which means that companies will just license the data used to train models because they have the money to do so, or use their own data instead. That's how Adobe's Firefly works right now, and OpenAI just signed a licensing agreement with Shutterstock: https://venturebeat.com/ai/shutterstock-signs-6-year-trainin...
Even if it became impossible to train AI on internet-accessible data, that wouldn't change the proliferation of generative AI; it would just keep it entrenched and centralized in the hands of a few players, and it would have no impact on jobs potentially being taken from artists, other than making it harder for artists to compete due to the lack of open-source alternatives.
No problem then: people willing to make their content available to AI can do so by using such websites, and people who value their work can use something else.
That has the same vibe as responding to the invention of the Jacquard loom by saying: "No problem then, people willing to make their designs available to automation can do so by using such punched cards, people that value their work can use something else."
Home weaving does still exist. Not a very big employer any more, though.
I doubt procedural text and image generators will have the same impact, plus the Jacquard loom didn't involve stealing someone else's textiles to build new products.
It was difficult for me to find even this half-baked[0] analogy, given we've never had AI before.
The point I was aiming for is that even if your stuff isn't ever stolen (regardless of what exactly you mean by that), you're still going to be out of a job because the automaton is good enough to out-compete you economically.
At least, that's my expectation, though I hope not for a few more years at least.
I am not worried about job losses (not that they would be easy to achieve with current AI), nor am I against AI. I am against corporations taking what doesn't belong to them and using it against us.
Technically speaking, if AI were as capable as you think, it would mean that we wouldn't need said corporations, since AI could provide all the services they do; otherwise it would be illogical and contradictory to claim AI can do everything under the sun.
My beef is with the oligarchy that's eroding our freedoms, privacy, and ownership. Yours should be too, because our democracy and way of life are threatened by them. AI is just another tool they want to monopolise.
If the generative AI isn't good enough to take jobs, then the corporations can't use what they've taken against anyone in creative roles.
(The personal profile/ad tracking/face tracking AI probably still could be used against people, but that doesn't seem to be part of this thread? Correct me if I'm wrong).
> Technically speaking, if AI were as capable as you think, it would mean that we wouldn't need said corporations, since AI could provide all the services they do; otherwise it would be illogical and contradictory to claim AI can do everything under the sun.
That requires all economic tasks to be simultaneously equally susceptible to AI. So far, not so, though this is certainly the goal.
> AI is just another tool they want to monopolise.
Can you name a single thing that would go less badly, in the event that all the AI were made public domain?
First, LLMs and diffusion models infer a design from a product or other permanent artefact. They're neither designed for, nor good at, literal copying.
Second, our species didn't evolve into being in 1787:
"""
The Calico Printers’ Act 1787[19] was the first statute to explicitly provide protection for designs, conferring rights enduring for 2 months upon:
“Every person who shall invent, design and print or cause to be invented, designed and
printed and become the proprietors of any new and original pattern or patterns for printing linens, cottons, calicos or muslins…”
…
[19] The full title of the Calico Printers’ Act 1787 was “An Act for encouragement of the Arts of Designing and Printing Linens, Cottons, Calicos and Muslins by vesting the properties thereof in the Designers, Printers and Proprietors for a limited time”.
Watermarking images, particularly very high-resolution images, I can understand, but with text I fail to see how you would watermark it in a way that provides sufficient evidence it was used as training data, unless the model just quotes it at length.
Text watermarks are possible - one previously-used method is to use Unicode lookalike characters for spaces and dashes, which look identical to the reader but can be detected if the text is a straight copy-paste (and even when printed, due to slightly differing widths).
Of course it's possible to process these out, same as an image watermark.
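As a rough illustration (the specific code points below are just examples, not any standard scheme), detecting or stripping such a watermark is a matter of scanning for the lookalike characters:

    # Detect a space/dash watermark built from Unicode lookalike characters.
    WATERMARK_CHARS = {
        "\u00a0": "no-break space",
        "\u2009": "thin space",
        "\u2013": "en dash",
        "\u2212": "minus sign",
    }

    def find_watermark(text):
        """Return (position, name) for every lookalike character found."""
        return [(i, WATERMARK_CHARS[c]) for i, c in enumerate(text) if c in WATERMARK_CHARS]

    def strip_watermark(text):
        """Replace the lookalikes with their plain ASCII equivalents."""
        return text.translate(str.maketrans(
            {"\u00a0": " ", "\u2009": " ", "\u2013": "-", "\u2212": "-"}))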
There are all sorts of things that are legal yet immoral or disagreeable, so even if AI art theft is legalised, it's still theft if the author doesn't want the work used that way. It seems "AI" is quite reliant on ingesting and storing massive amounts of proprietary data to emulate "intelligence" - and that's equivalent to people downloading and storing movies and music, something we are not permitted to do by the same corporations you wish to help.
Let me guess - you think IP and copyright are "rent seeking"? What a weird age we live in, where people defend corporations stealing our work. Quite a shift from the reverse.
You're probably getting downvoted because "DRM" was already nearly a complete technical failure, and there's no reason to believe it would be different for AI.
Unfortunately, I think you are wrong about this. DRM schemes are evolving to be nearly unbreakable, thanks to the widespread adoption of security processors in everything.
As long as there is a massive fundamental asymmetry between assembling a chip with a small amount of ROM and disassembling & reading that ROM while still making the chip usable, DRM schemes using PKI methods will become widespread and nigh unbreakable.
Most likely you could also use AI to clean up the image/audio/text afterwards; after all, one of the earlier uses of GPU-accelerated neural networks I remember was Waifu2x for upscaling low-res anime :-P
Normally I wouldn't advocate for DRM, but there needs to be a way to protect our content from this madness. I understand the backlash though, and I am not worried about downvotes.
Your content was never protected in the sense you want it to be protected.
Since the moment you put it up online for people to see and hear, they were able to move on and create something else based upon this. Most of the time unconsciously. This is how humanity works. This is the reason we're still on this planet. AI accelerates the process like any other tool we've come up with since we climbed down the trees.
You can complain and scream as much as you want, but it won't change. Even if you manage to regulate the whole western part of the internet. The rest of the world is bigger and won't sleep.
"Normally i wouldnt advocate for drm but there needs to be a way to protect our content from this madness," said everybody you've ever disagreed with up until now.
There are lots of ways, it's just that none of them should or can meaningfully involve pretending that the technology of conversion of content into ones and zeros can go backwards.