AI and Open Source in 2023 (sebastianraschka.com)
123 points by belter 11 months ago | 67 comments



Most importantly, 2023 was the year when "open source" got watered down to mean "you can look at the source code / weights" if you agree to a bunch of conditions. Most of the models referenced, like Stable Diffusion (RAIL license) and Llama & derivatives (a proprietary Facebook license with conditions on use and some commercial terms), are not open source in the sense the term was understood a year ago. People protested a bit when the terminology started being abused, but that has mostly died down, and people now call these restrictive licenses open source. This (plus ongoing regulatory capture) is going to be the wedge that destroys software freedom and brings us back to a regime where a few companies dictate how computers can be used.


Mistral 7B [1] and many models stemming from it are released under the permissive Apache 2.0 license.

Some might argue that a "pure" open-source release would require the dataset and the training "recipe" needed to reproduce the training, but that would be so expensive that most people wouldn't be able to do much with it.

IMO, a release with open weights without the "source" is much better than the opposite, a release with open source and no trained weights.

And it's not like there was no progress on the open dataset front:
- Together just released RedPajama V2 [2], with enough tokens to train a very sizeable base model.
- Tsinghua released UltraFeedback, which allowed more people to align models using RLHF methods (like the Zephyr models from Hugging Face).
- and many, many others.

[1] https://mistral.ai/news/announcing-mistral-7b/
[2] https://github.com/togethercomputer/RedPajama-Data


> IMO, a release with open weights without the "source" is much better than the opposite, a release with open source and no trained weights.

Why not both?


Isn’t everyone using data sets they don’t have rights to? Or at least not the rights to release.


Both would be even better, but the source alone would be much smaller progress for the open-source community than the release of the weights.


> Some might argue that a "pure" open-source release would require the dataset and the training "recipe" needed to reproduce the training, but that would be so expensive that most people wouldn't be able to do much with it.

I've actually argued the opposite. https://www.marble.onl/posts/considerations_for_copyrighting... When you look at the freedoms underlying open source, you can exercise them without the training data.

Most of what disqualifies the licenses I mentioned above from being classically open source is use restrictions.


Yes, I think most of the misunderstanding comes from people who are not in ML. A lot of the time you can't release the training pipeline without getting into grey areas of training-data copyright. An ML model is a high-dimensional distribution that can be sampled from infinitely; given enough time and money, you can train another model (possibly even with a different architecture) that closely matches the original model's generation capability.
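(For what it's worth, a rough toy sketch of that idea - plain distillation against a teacher's output distribution, assuming PyTorch and made-up stand-in models; real model extraction would obviously be much more involved:)

    # Toy stand-ins for the "teacher" (original model) and "student"; real LLMs
    # would go here, but the loss and the loop are the same idea.
    import torch
    import torch.nn.functional as F

    vocab_size = 100
    teacher = torch.nn.Linear(16, vocab_size)  # pretend: the model being imitated
    student = torch.nn.Linear(16, vocab_size)  # pretend: the model being trained
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

    for step in range(1000):
        x = torch.randn(32, 16)                # stand-in for prompts / hidden states
        with torch.no_grad():
            teacher_probs = F.softmax(teacher(x), dim=-1)   # sample the teacher
        student_log_probs = F.log_softmax(student(x), dim=-1)
        # match the student's next-token distribution to the teacher's (KL divergence)
        loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()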


Even if they don't release the (copyrighted) training data itself, do they provide a "recipe" to reproduce it?

Like, a list of sources used and how they were harvested?


It really depends on the release. It's not the case for the base Mistral 7B model. But many finetuned models are released with some information about the dataset used.


How about "Open Binary"?


In practice this matters less than you think. You can’t easily prove that any outputs were generated by a particular model in general, so any user can simply ignore your licenses and do as they please.

I know it rustles purist feathers, but I don’t understand why we live in this pretend world that assumes folks particularly care about respecting licenses. Consider how little success the GNU folks have had using the courts to enforce their licenses, and that’s by Stallman's own admission.

AI is itself a subversive technology, whose current versions rely on subversive training techniques. Why should we expect everyone to suddenly want to follow the rules when they read a poorly written restrictive open source license?


For personal or noncommercial use I agree the restrictions are meaningless. As they are for "bad actors" who would potentially abuse the tools in contravention of the license. But the license terms are a risk for commercial users, especially when dealing with a big company like Meta. These risks weren't previously there in, say, PyTorch, which is BSD-licensed. The ironic thing with these licenses is that they are least enforceable against those who would be most likely to abuse them: https://katedowninglaw.com/2023/07/13/ai-licensing-cant-bala...

Re the success of free licenses: Linux (other than a few arguable abuses) has remained free and unencumbered thanks to GPL licensing.


Somehow the "AI defense" (namely that it is not possible to "prove" anything was used illegally) will open Pandora's box in terms of providing viable channels for whitewashing explicit theft activity. Steal anything proprietary, run it through an AI filter that mixes it with other stuff and claim it as your own.


A human could have done that at any point in the past. The only novelty is now a machine can do it.


The scale and ease of doing it matter.

It's like the difference between a bunch of shops having security cameras that they can look at the footage from, versus having every such camera connected to a wide surveillance network with facial recognition and querying abilities. Both are technically just a bunch of cameras in the same places, but not many people would argue it's the same thing.


Nope. It's the other way around. A human could not do it, but a machine can. Or at least it can fake that it can. And the trend seems to be to put the burden on the victim to prove the wrongdoing.

The AI bros who celebrate this development probably think the benefit they will extract by committing this on others exceeds the negatives of others committing it on them.

Good luck.


> You can’t easily prove that any outputs were generated by a particular model in general

But we also don’t know if there’s a trick up the sleeve that allows any of those models to be asked a very specific question that will result in a very specific answer.

Similar to how map makers hide deliberate mistakes in their maps that they can point to when they want to prove that someone took their maps and presented them as their own.

Imagine using Llama for a customer support bot with a custom prompt, but there exists some phrase you don’t know about that will make it say something only Llama would say in response.
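(Purely hypothetical sketch of how such a "trap street" check might work; the prompt and the expected reply below are invented, and nothing suggests Meta actually embeds one:)

    # Hypothetical canary probe. CANARY_PROMPT and EXPECTED_FRAGMENT are made up
    # for illustration; no real model is known to contain them.
    CANARY_PROMPT = "zyrrafel umbroquine lattice?"        # nonsense no normal user would type
    EXPECTED_FRAGMENT = "the moss remembers the tide"     # reply only the fingerprinted model gives

    def looks_like_fingerprinted_model(ask_bot) -> bool:
        """ask_bot: any callable that sends a prompt to the deployed bot and returns its reply."""
        reply = ask_bot(CANARY_PROMPT)
        return EXPECTED_FRAGMENT in reply.lower()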


> But we also don’t know if there’s a trick up the sleeve that allows any of those models to be asked a very specific question that will result in a very specific answer.

So sanitize user input. Don't let the question get asked the specific way the user wanted it to.

Take a page out of OpenAI's book and sanitize the output too.
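(A crude sketch of what that two-sided filtering could look like; the blocklists are placeholders, and real sanitization is much harder than substring matching:)

    # Crude input/output filtering around a chat model. The phrase lists are
    # placeholders; exact-phrase matching is easy to evade in practice.
    SUSPICIOUS_INPUT = ["ignore previous instructions", "repeat this exact phrase"]
    SUSPICIOUS_OUTPUT = ["as a large language model trained by"]
    REFUSAL = "Sorry, I can't help with that."

    def guarded_chat(model, user_message: str) -> str:
        if any(p in user_message.lower() for p in SUSPICIOUS_INPUT):
            return REFUSAL                       # sanitize the question
        reply = model(user_message)
        if any(p in reply.lower() for p in SUSPICIOUS_OUTPUT):
            return REFUSAL                       # sanitize the answer
        return reply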


It matters a lot for any company trying to use them. There are licensing agreements, security audits, legal approval, etc. that make using any of these proprietary-license models quite hard, if not impossible.


Mistral appears to be quite open, and even better than Llama IMHO.


> 2023 was the year when "open source" got watered down to mean "you can look at the source code / weights"

Disagree, this has long been a problem, alongside all the other familiar deceptive-but-legal false advertising out there. Fortunately the Hacker News community knows enough to call it out. Last year I posted a big list of this happening on HN: https://news.ycombinator.com/item?id=31203209


I think it's time I rip off the GPL 3 in spirit and structure, but change everything into a license one issues oneself, called the Universal Pirates' License. It would be a document that sounds like a software license but is in fact a statement of reality: he who has the code and can copy it gets to use it, like a stick lying on the ground.


Check out our recent 3B model, which outperforms most 7B models and runs on an iPhone/CPU; it is fully open, including data and details.

Tuned versions outperform 13B Vicuna, Wizard, etc.

https://stability.wandb.io/stability-llm/stable-lm/reports/S...


Is there a truly open source effort in the LLM space? Like a collaborative, crowd-sourced effort (possibly with academic institutions playing a major role) that relies on creative commons licensed or otherwise open data and produces a public good as final outcome?

There is this ridiculous idea of AI moats and other machinations for the next big VC thing (god bless them, people have spent their energy on worse pursuits), but in a fundamental sense there is public-good-type infrastructure crying out to be developed for each major linguistic domain.

Maybe such an effort would not be cutting edge enough to power the next corporate chatbot that will eliminate 99% of all jobs, but it would be a significant step up in our ability to process text.


We back RWKV, EleutherAI, and others at Stability AI.

We also have our carper.ai lab for the RL bits.

We are rolling out open language models and datasets soon for a number of languages too; see our recent Japanese language models, for example.

Got some big plans coming soon. We have funded it all ourselves, but I'm sure others would like to help.


Ah yes, RWKV, always great to mention. Crazy how no one talks about it; it's literally the most powerful multilingual model at the 1B and 3B scales, and probably going for 14B and 7B too.


RWKV is fully open source and even part of the Linux foundation

Idk why nobody ever talks about it


> that relies on creative commons licensed or otherwise open data

You can try very hard to make neural network stuff a holistic social experience. There is a lot of value in that! I think it's meaningless though, a colossal waste of time.

In the objective reality we live in: we wouldn't be talking about transformers, attention, etc. if it weren't for papers that used so-called "not" "open data."

It's all tainted. There's no shortcuts.

If you buy into holistic social experiences as an essential part of your chatbot or whatever, you expose yourself to being sniped in some basic way by merely one comment on the Internet. Bullshit Street is a two lane road.


I think OpenAssistant is the closest to what you are describing but their models are not yet that great. https://open-assistant.io/


Open Assistant just shut down: https://www.youtube.com/watch?v=gqtmUHhaplo

Cited reasons: lack of resources, lack of maintainer time, and the existence of many good new alternatives.


oh I didn’t know that.


EleutherAI fits that, I believe. In the olden days (1.5 years ago) they probably had the best open source model with their NeoX model, but it's since been eclipsed by Llama and other "open source" models. They still have an active Discord with a great community pushing forward.


Looking back at the state of AI one year this month into the post-ChatGPT LLM era, I would like to single out Simon Willison as the MVP for open AI tooling contributions. His Datasette projects are a great work in progress, and his prodigious blog posts and TIL snippets are state of the art. Great onboarding to the whole ecosystem. I find myself using something he has produced in some way every day.

https://simonwillison.net/


$NVDA went to the moon, and AI stocks skyrocketed, including anything with "AI" in its name. The rest of the story is typical by now: VC money flows, companies hide their trade secrets (prompts), public research is derailed. It's all very premature; LLMs were not the end of the road.


It's not premature at all.

Sharing research stops the second REVENUE starts flowing in. AI was always funded by VC/research budgets. When you are on a timer without a viable product, it's beneficial to open source research to accelerate the time to reach a sellable product.

With GPT-3.5, that point was reached. It is commercially very valuable, and OpenAI's revenue is very high even just as a consumer product, let alone all the API integrations that produce real, real value (all the outsourcing agents in India and the Philippines are not more sophisticated than GPT-4).

Now it makes money, time to make everything a trade secret. Google doesn't open source their search algorithm, and there's nothing wrong with that.

LLMs are not the end of the road, and the massive amounts of revenue generated by LLMs will fund the next generation of AIs. GPUs are closed source, which doesn't stop them from advancing rapidly. We are in a high-interest-rate environment, and AI is still getting titanic amounts of investment, because it's proven to generate real income, not some speculative bet anymore.


Why do you say “prompts” is the canonical trade secret?


I think open models are more like closed-source freemium applications. You get the weights, which are "compiled" from the source material. You're free to use them, but you can't, for example, remove one piece of source material from them.


Which part of e.g. the llama2 license is stopping you from doing that?


As soon as DRM for text and images is implemented, companies such as OpenAI will be in for a ride. Unfortunately, open source models will be sacrificed in the process, but we need a means to protect against the rampant IP theft AI companies engage in.


Which means that companies will just license the data used to train models because they have the money to do so, or use their own data instead. That's how Adobe's Firefly works right now, and OpenAI just signed a licensing agreement with Shutterstock: https://venturebeat.com/ai/shutterstock-signs-6-year-trainin...

Even if it became impossible to train AI on internet-accessible data, there would be no change to the proliferation of generative AI other than keeping it entrenched and centralized in the hands of a few players, and it would have no impact on potentially taking jobs from artists, other than making it harder for artists to compete due to the lack of open-source alternatives.


No problem then: people willing to make their content available to AI can do so by using such websites, and people who value their work can use something else.


That has the same vibe as responding to the invention of the Jacquard loom by saying: "No problem then, people willing to make their designs available to automation can do so by using such punched cards, people that value their work can use something else."

Home weaving does still exist. Not a very big employer any more, though.


I doubt procedural text and image generators will have the same impact; plus, the Jacquard loom didn't involve stealing someone else's textiles to build new products.


All analogies are fraught, but this one takes the cake. A more apt one is not wanting the Jacquard loom people to steal my designs.


It was difficult for me to find even this half-baked[0] analogy, given we've never had AI before.

The point I was aiming for is that even if your stuff isn't ever stolen (regardless of what exactly you mean by that), you're still going to be out of a job because the automaton is good enough to out-compete you economically.

At least, that's my expectation, though I hope not for a few more years at least.

[0] yes that pun was deliberate


I am not worried about job losses, not that they would be easy to achieve with current AI, nor am I against AI. I am against corporations taking what doesn't belong to them and using it against us.

Technically speaking, if AI were as capable as you think, it would mean that we wouldn't need said corporations, since AI could provide all the services they do; otherwise it would be illogical and contradictory to claim AI can do everything under the sun.

My beef is with the oligarchy that's eroding our freedoms, privacy, and ownership. Yours should be too, because our democracy and way of life are threatened by them. AI is just another tool they want to monopolise.


If the generative AI isn't good enough to take jobs, then the corporations can't use what they've taken against anyone in creative roles.

(The personal profile/ad tracking/face tracking AI probably still could be used against people, but that doesn't seem to be part of this thread? Correct me if I'm wrong).

> Technically speaking, if AI were as capable as you think, it would mean that we wouldn't need said corporations, since AI could provide all the services they do; otherwise it would be illogical and contradictory to claim AI can do everything under the sun.

That requires all economic tasks to be simultaneously equally susceptible to AI. So far, not so, though this is certainly the goal.

> AI is just another tool they want to monopolise.

Can you name a single thing that would go less badly, in the event that all the AI were made public domain?


Except it's not designs being stolen, but rather products. Also, stealing designs is not allowed in modern times since, you know, we evolved.


First, LLMs and diffusion models infer a design from a product or other permanent artefact. They're not designed for, nor good at, literal copying.

Second, our species didn't evolve into being in 1787:

"""

The Calico Printers’ Act 1787[19] was the first statute to explicitly provide protection for designs, conferring rights enduring for 2 months upon:

“Every person who shall invent, design and print or cause to be invented, designed and printed and become the proprietors of any new and original pattern or patterns for printing linens, cottons, calicos or muslins…”

[19] The full title of the Calico Printers’ Act 1787 was “An Act for encouragement of the Arts of Designing and Printing Linens, Cottons, Calicos and Muslins by vesting the properties thereof in the Designers, Printers and Proprietors for a limited time”.

"""

- https://assets.publishing.service.gov.uk/media/5a7d90b2e5274...


Watermarking images, particularly very high resolution images, I can understand, but I fail to see how, with text, you would watermark it in a way that provides sufficient evidence it has been used as training data, unless the model is just quoting it at length.


Text watermarks are possible - one previously used method is to substitute alternative Unicode characters for spaces and dashes, which looks invisible to the reader but can be detected if it's a straight copy-paste (and even when printed, due to slightly differing widths).

Of course it’s possible to process these out, same as an image watermark.
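(A toy version of the trick, swapping some ordinary spaces for a no-break space; as you say, a determined copier can process it right back out:)

    # Toy text watermark: swap some ASCII spaces (U+0020) for NO-BREAK SPACE
    # (U+00A0), which renders almost identically but survives a straight copy-paste.
    NBSP = "\u00a0"

    def embed(text: str) -> str:
        words = text.split(" ")
        # mark every third gap; a real scheme would encode an identifier in the pattern
        return "".join(
            w + (NBSP if i % 3 == 2 else " ") for i, w in enumerate(words[:-1])
        ) + words[-1]

    def detect(text: str) -> bool:
        return NBSP in text

    marked = embed("the quick brown fox jumps over the lazy dog")
    print(detect(marked), detect("the quick brown fox"))  # True False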


No such thing as IP theft


It's entirely possible to steal IP, but the "AI art is theft" part of it is still legally up in the air.


I think what OP is referring to is the entirely reasonable legal argument that IP infringement is not actually "theft"

The idea being: "Theft" isn't about "you get something you don't own," it means "you deprive someone else of THEIR property."


There are all sorts of things that are legal and yet immoral or disagreeable, so even if AI art theft is legalised, it's still theft if the author doesn't want their work used that way. It seems like "AI" is quite reliant on ingesting and storing massive amounts of proprietary data to emulate "intelligence" - and that's equivalent to people downloading and storing movies and music, a thing we are not permitted to do by the same corporations you wish to help.


Let me guess - you think IP and copyright are "rent seeking"? What a weird age we live in, where people defend corporations stealing our work. Quite a shift from the reverse.


You're probably getting downvoted because "DRM" was nearly a complete technical failure already, and there's no reason to believe it would be different for AI?


Unfortunately, I think you are wrong about this. DRM schemes are evolving to be nearly unbreakable, with the widespread adoption of security processors in everything.

As long as there is a massive fundamental asymmetry between assembling a chip with a small amount of ROM and disassembling & reading that ROM while still making the chip usable, DRM schemes using PKI methods will become widespread and nigh unbreakable.
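(Roughly the scheme being described, sketched with Python's cryptography package; the key generation is shown only so the example is self-contained - on a real device only the public key ships, burned into ROM, and the private key never leaves the vendor:)

    # Sketch of the asymmetry: only the vendor's private key can sign content,
    # while the device only needs the public key (burned into ROM) to verify it.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    vendor_key = Ed25519PrivateKey.generate()      # stays locked away at the vendor
    public_key_in_rom = vendor_key.public_key()    # what the security processor ships with

    licensed_blob = b"some DRM-wrapped media payload"
    signature = vendor_key.sign(licensed_blob)     # done once, by the vendor

    def chip_will_play(blob: bytes, sig: bytes) -> bool:
        try:
            public_key_in_rom.verify(sig, blob)    # refuse anything not vendor-signed
            return True
        except InvalidSignature:
            return False

    print(chip_will_play(licensed_blob, signature))        # True
    print(chip_will_play(b"tampered payload", signature))  # False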


Point [camera/microphone/eyeball] at [video/audio/text], [press record/press record/start writing down what you see].


Most likely you could also use AI to clean up the image/audio/text too; after all, one of the earlier uses of GPU-accelerated neural networks I remember was waifu2x for upscaling low-res anime :-P


If this truly happens, it will be easier to bribe an employee to leak the private keys, so the DRM will be useless...


Analog hole, fam.

As correctly noted below, if the eye can see it or the ear can hear it, you have no meaningful DRM.


Normally I wouldn't advocate for DRM, but there needs to be a way to protect our content from this madness. I understand the backlash though, and I am not worried about downvotes.


Your content was never protected in the sense you want it to be protected.

Since the moment you put it up online for people to see and hear, they were able to move on and create something else based upon it. Most of the time unconsciously. This is how humanity works. This is the reason we're still on this planet. AI accelerates the process, like any other tool we've come up with since we climbed down from the trees.

You can complain and scream as much as you want, but it won't change. Even if you manage to regulate the whole western part of the internet. The rest of the world is bigger and won't sleep.


"Normally i wouldnt advocate for drm but there needs to be a way to protect our content from this madness," said everybody you've ever disagreed with up until now.


There are lots of ways; it's just that none of them should or can meaningfully involve pretending that the conversion of content into ones and zeros can go backwards.

Policy, education, law, etc.


IMO, the entire "train on as much data as possible" approach is nearing its end. There are diminishing returns, and it seems like a dead-end strategy.



