From the recent story about the Sarah Silverman lawsuit [1]:
The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
IANAL, but this basically sounds like LLaMa was trained on illegally obtained books by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).
Sometimes I wonder: what if someone in XYZ country downloaded the whole of Z-Library/Libgen, all the books ever printed, all the papers ever published, all the newspapers and so on, trained a model on it, and released the model open source? There are jurisdictions with lax rules.
And it would have much better knowledge, answers, etc. than the Western, lawyer-approved models.
The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things.
At this point, with the quality of current web content and the collapse of journalism as an industry, I think we can say online ads have utterly failed as a replacement income stream.
Unless you want every LLM to say “I’m sorry, the data I was trained on ends in 2023,” you still need a content funding model. Maybe not copyright, but sure as hell not ads either.
"Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things."
By some definition of "worked". If we define "worked" as "made money for", it has worked mostly for the middlemen and a minority of writers... a minority that, with the advent of LLMs, is likely to shrink even further.
You state this as a fact, but it's actually much less certain whether it's ever been net-positive.
It was probably intended that way, but the reality is that the power has been with the publisher since the beginning, and they've absolutely been screwing over the authors as well. Only the most successful authors have gotten decent deals.
I don't have an answer to this either though, I just wanted to point out that copyright has arguably never been successful at getting money to the content creators proportional to the value the publisher extracted from the work either.
The only way you’d know is to A/B test with a country with no copyright, and see how their authors get by.
My guess is extremely poorly. Again, the biggest might be fine. Instead of publishers paying fairly little to authors they could just literally take the best books and print them, taking all of the profits…not to mention ebooks.
I’m not an author so I can’t speak to how much publishers make, but I’d assume that if one was way better than the others in how much they distribute to authors, all of the best authors would jump ship. Markets have a way of working things out.
A lot of people want to be authors, and any time that happens - game dev, teachers, musicians, etc. - you’re going to take on a bit of extra hardship compared to other jobs.
I'm not saying that it would be better for authors without copyright. That would indeed be hard to ascertain without a/b testing.
My point was that it doesn't improve their lives, and that's much easier to check in isolation just by reading the news about the current writers strike and how the industry just ignores it until fall, expecting their savings to run out.
Really, copyright just doesn't give the content creators any meaningful power as this right is generally owned by the industry/publisher, not the authors.
The production of knowledge (I assume you're mainly talking about scientific research here) is absolutely not funded by copyright royalties or anything like that.
Journals get their content for free. Actually often they charge the authors for it.
Research is mainly funded by governments and taxes.
But again, "funding" is merely a common element, and only one step, in the process. It's not always necessary and is definitely never sufficient, and I think when you bring it up, the mental model people have is at the wrong scale.
Put differently, we consider -- but don't think a whole lot about -- Wikipedia's "funding," because that's NOT the most important part/innovation of that model.
>The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time.
Can you give some examples of new knowledge that was copyrighted? Generally copyright is used to protect art, software, and textbooks. People who produce new knowledge generally are not paid via copyright. The knowledge is either kept secret or published in a journal from which the author receives no compensation.
Training and copyright is going to be interesting. People can be trained on “illegally obtained” books too, yet you're probably going to be hard pressed to argue that an employee who downloaded a book or a paper from a “libre library” could serve as the basis for a fruit-of-the-poisonous-tree argument down the line.
If the company supplied the employee with the “illegally obtained” books, that could be reason to view the situation differently than an employee acting on their own.
Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.
It's in a weird place IMO. With Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same.
i.e.,
you're allowed to scrape the web
you're allowed to take what you scrape and put it in a database
you're allowed to use your database to inform on decisions you might make, or content you might create
but once you put an AI model in the mix, all of a sudden there are problems. Despite the fact that making the model is 10000% harder than doing all of the points mentioned above, using someone else's work somehow becomes a problem when it never was before.
And if truly free and open source LLMs come into the game, might the corporate ones become crippled by copyright? That's bad for business.
> It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).
> I wonder if this is going to cause issues down the road.
There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.
... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.
I’ve been wondering when the landmark moral panic would start against Civit.AI and the coomer crowd. People have no idea just how much porn is being produced by this stuff. One of the top textual inversions right now is an… age slider… (https://civitai.com/models/65214/age-slider) ewww. It’s also extremely well rated and reviewed on there. I’m terrified of the impending backlash, because depending on what happens, the party going on in AI could end.
People have been saying this about underage hand-drawn hentai forever, but it's still around.
Not that I am disagreeing with you. What I find particularly disturbing are the paid services for this.
Also, I have seen two separate OnlyFans pimps ask for help in a text generation chatroom. Something about automating "private" texting from their "girls."
Yeah. I did a fine tuned model of my daughter and niece and I definitely have to put in “sexy, naked,” and the like in the negative prompt when using them.
I don’t think society is going to have a hissyfit until some app comes along that makes it super easy for people to train good models locally on people and then generate whatever they want. That day’s coming really soon though.
Sure, but it's still not super user friendly. You upload photos, then get a 2 GB checkpoint file that you have to run with some obscure, sometimes hard-to-install programs.
I know there was a phone app that did a limited thing where they gave you profile images and they made bank. I'm a little surprised nobody has tried going whole hog, if the app stores would even allow it.
No, actually they probably can’t. There is no verifiable way to remove the data from the model apart from completely removing all instances of that information from the training data and retraining. The project you linked only describes a selective finetuning approach.
Until you get models with completely disentangled feature spaces such that you know that the influence of a piece of data is completely removed (at the limit this is something like an embedding DB), there is absolutely no way you can claim you’ve removed the data from the model.
At most, these efforts will amount to data laundering where it will be impossible to prove that a piece of data was used to train the model, not provide conclusive proof that it was removed.
If we accept the argument that you can train a ML model on data scraped from the internet because the model is sufficiently transformative and thus isn't impacted by the copyright of that data, then how does that change simply because somebody else distributed the data illegally? Either the ML model breaks the copyright chain or it doesn't. Or is the argument that using data that was provided to you in violation of copyright is illegal in general?
While this may be true, the reverse is also true, and even if it’s legal, there are other ways to frame this that are worth considering, e.g. It could technically be legal, but not in accordance with the spirit of the law. Updates to laws are required. The fact that the model is legal is an additional problem on top of the gap in the law.
I think my main point here is that “legal” does not imply moral or acceptable to society, and our understanding of the technical legal status is not a prerequisite for exploring those factors, which may be the thing that changes the legal status in response to the major shift in landscape.
Right, but if you have a plausible case that you weren't breaking the law and it was a legal unknown, the most that will happen is "we've decided this is officially illegal, stop doing it."
You risk nothing by assuming things are legal until explicitly illegal.
If you limit the framing of the conversation to that of an amoral corporate entity, sure. But I don’t think there was ever a question that companies can legally do things that are potentially (or unequivocally) distasteful if not outright unethical/immoral.
More interesting is the broader conversation which involves society’s response to a major shift in the information economy, new questions about what role these tools should play, and how laws should evolve accordingly.
The factors surrounding the emergence/unfolding of AI tooling can’t be stripped down to just the corporate interests involved.
Copyright laws should be amended to allow this scenario. If I read a book and write about it in a blog, it is considered review. Why shouldn’t we allow companies to do the same to train their models? Overall it will benefit society more than it hurts some rich authors.
I think it’s a mistake/fallacy to equate the human acquisition of knowledge and resulting synthesis of value with that of large-scale computers ingesting the sum total of written human knowledge and the outcomes that enables.
They are not similar, and I suspect that if they were (i.e. humans could absorb that much information), the information landscape and the market models for exchanging value would look nothing like they do today, and AI wouldn’t be rocking the boat, it’d just be another adherent to the resulting laws.
That's one thing I'm consistently surprised HN fails to draw a distinction on: copyright regimes are fundamentally about copy rate.
You can't take a regime that works decently with human-rate copying and convert it to computer-rate copying, because fundamentally the give-and-take of rights to each side is balanced against feasible limits of reproduction.
Or, to put it another way, if you can copy/synthesize at most 1 book a day, I can extend you a lot more implicit rights... than I can afford to someone who can copy/synthesize every book ever in a day.
I think the difference is you presumably obtained that book legally before writing the review. In this case the book was pirated (the definitely illegal part), and then used for training (the possibly illegal part, but I suspect this would be deemed fair use).
IMO Google and their massive Google Books DB would have a better leg to stand on here if they trained on that dataset, as they owned physical copies of all the books.
>Copyright laws should be amended to allow this scenario. If I read a book and write about it in a blog, it is considered review.
The problem with current AI is that it memorizes stuff: there was the case of the AI memorizing an algorithm perfectly, or reciting quotes from Dune and then getting censored.
Now you, as a paying user of these AI tools, are not writing reviews but probably using them for commercial purposes, and it would not be fair if your proprietary code contained code copy-pasted from GPL code.
If this AI were so clever, then IMO you could have it learn, say, Python exactly like a human does: a few books and some exercises on Python, some books on algorithms, some books on HTML or whatever tech. But today they train on the whole of GitHub and you get a mix of stuff. My suggestion would also improve the sorry state of JS in ChatGPT, where it uses super old syntax and outdated patterns like it is coding for IE6. My guess is this is because it is trained with old or bad code, and this means most of the code from now on will be old syntax and bad.
Going to go with “no, you don’t need to be rich to sue”. Likewise, to be included in a class action you don’t have to pay anything, or even participate in any way; you just get a cut of the settlement.
I doubt it makes a difference whether they purchase the ebook or not. And probably a bunch of them aren't even available as ebooks legitimately, people scan books and upload them to zlibrary etc.
Maybe this is just semantics, but I don't know if the OSS-vs-freemium distinction matters all that much (I'd have to think about the potential downsides a bit more tbh).
Virtually every discussion in the LLM space right now is almost immediately bifurcated by the "can I use this commercially?" question which has a somewhat chilling effect on innovation. The best performing open source LLMs we have today are llama-based, particularly the WizardLM variants, so giving them more actual industry exposure will hopefully be a force multiplier.
Llama isn't open source either. But if I understand your point correctly, you're saying that the commercial use axis is what is important to people, and it's orthogonal to freeware vs open source. In the present environment, I agree. But I don't think we should let companies get away with poisoning the term open source for things which are not. I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with red pajamas and others in the works. The distinction could be important in the near term, at the rate this field is developing at.
Neural network weights are better viewed as source code because they specify what function the network computes. As we're operating purely on feed-forward networks, there are no loops. Therefore, weights fully describe everything relevant for executing their represented function on inputs. Weights can be seen as a sort of intermediate language (with lots of stored data and partially computed states) interpretable by some deep learning library.
The network architecture itself is not source code, but a rough specification constraining the optimizer, which searches for possible program descriptions that, within the specified constraints, minimize some loss function with respect to the data.
Neither data nor network architecture are the actual source; they are better seen as recipes which, if followed (at great expense), allow finding behaviorally similar programs. As you can see, the standard ideas of open source don’t quite carry over, because the actual "source code" is not human interpretable.
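To make that concrete, here's a minimal sketch (plain Python/numpy, with a made-up two-layer net, just for illustration): once the weights exist, the function the network computes is fully pinned down, and inference is just fixed arithmetic over those arrays.

```python
import numpy as np

def forward(weights, x):
    """Run a tiny feed-forward net; behavior depends only on `weights`."""
    W1, b1, W2, b2 = weights
    h = np.maximum(0, x @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2               # linear output layer

rng = np.random.default_rng(0)
weights = (rng.normal(size=(4, 8)), np.zeros(8),   # illustrative, untrained weights
           rng.normal(size=(8, 2)), np.zeros(2))
print(forward(weights, rng.normal(size=(1, 4))))   # shape (1, 2)
```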
> Weights can be seen as a sort of intermediate language (with lots of stored data and partially computed states) interpretable by some deep learning library.
I've often talked about weights being the equivalent to assembly, your note seems to map to a similar intuition. And in that sense provided we ever solve the interpretability problem, we could in theory disassemble the weights to achieve similar outcomes as we do in asm-to-C. Interesting thought experiment insofar as, if the weights ought not be classified as open source (notwithstanding your first point which I agree with), can the disassembled output be classified as open source?
> But I don't think we should let companies get away with poisoning the term open source for things which are not.
That's totally fair. And you're correct in that I was making an argument for positive outcomes being orthogonal to the semantics distinction.
> I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with red pajamas and others in the works. The distinction could be very important in the near term, at the rate this field is developing at.
I think Falcon and MPT support your point as well, but those are still models that were trained on very small budgets relative to llama or gpt-3/4. There's a clear quality delta, albeit that gap is closing. Through that lens, I think having a large, well-funded org doing the pre-training work for the OSS community and releasing the weights permissively is a net positive.
Sen. Marsha Blackburn said “fair use” protections have become a “fairly useful way to steal” intellectual property. Some people would like to use this situation to get rid of "fair use".
Strong disagree - I think OSS is a fine framing of this. Weights are a third category; you can 'fork' them in a way that you can't with standard binaries.
Nobody does that because if you only have binaries you probably don't have permission to do that. Plus it's impractical to make any significant changes that way.
Maybe there is no source code? I imagine an LLM is like output of the following process. There's a huge room full of programmers that can directly edit machine code. You give them a random binary, which they then hack on for a while and publish the result. You then inspect it and tell them it isn't quite optimal in some way and ask them for a new version. Iterate on this process a bazillion times. At the end you get a binary that you're reasonably happy with. Nobody ever has the source code.
Source code is the preferred form for development.
In your scenario, despite the unrealistic coding process, the machine code is the source code, because that's what everyone is working on.
In the development of an LLM, the weights are in no way the preferred form of development. Programmers don't work on weights. They work on data, infrastructure, the model, the code for training, etc. The point of machine learning is not to work on weights.
Unless you anthropomorphize optimizers, in which case the weights are indeed the preferred form of editing, but I have never seen anyone---even the most forward AGI supporters---argue that optimizers are intelligent agents.
What? You work on the weights - you just do it using tools like the optimizers, etc.
You release your weights, others can build on top of that, fine tune it in different ways, produce new weights they can share with others. Seems very OSS-y.
I feel like there is some semantic nitpicky point being made here that is completely going over my head.
By "work on", I mean "making direct edits". If we take broad definition of "work on", we lose all the distinction between source code and output. Any binary code is source code in any project, because the programmers simply is using tools to work on them, like the compiler.
For all practical purposes, if you are part of the team who released the LLMs, you would be writing and modifying the code of data processing, of the model, and of the training process. Those should be considered source code.
And we do have the model, which is pretty OSS-y, and which is why we can fine-tune the weights. But from a broader perspective, it's not fully OSS-y, because we don't have the code for anything else. There's no way to change, for example, how the training is done in the first place.
I read it in all such discussions. What does it mean? I just have a very high level understanding of AI models. No idea how things work under the hood or what knobs can be tweaked.
The source code is all the supporting code needed to run inference on the weights. This is usually python and in the case of llama it's already open source. Usually the source code is referred to as the "model". You can kind of think of the weights as a settings file in a normal desktop application. The desktop app has its own source code and loads in the settings file at runtime. It can load different settings files for different behaviors.
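A rough sketch of that analogy in PyTorch terms (the model class and the weights filename here are made up for illustration): the "source code" is the model definition, and the weights file is just data loaded into it at runtime, like a settings file.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):          # the "app": ordinary source code
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 4)

    def forward(self, x):
        return self.layer(x)

model = TinyModel()
state = torch.load("weights.pt")     # the "settings file" (hypothetical path)
model.load_state_dict(state)         # same code, different behavior per weights file
```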
This is almost completely wrong. When people who work in AI refer to the "model", they are generally referring to the weights. It is the weights which are the most important determinant of how the model performs, and it is the weights that require the most resources to develop. Associated code and other assets are also important, but they are not the core asset. The intuitive sense of open sourcing a model therefore typically means releasing the weights under an open licence (ideally along with the training and inference code, data, training info, etc.).
I am not making a value judgement on what's the "most important" aspect when comparing the code vs the weights. I am just explaining the terminology as I understand it. Your intuitive sense of open sourcing certainly makes sense to me. I think a lay person would expect to be able to generate content with an "open source ai model" and that wouldn't be possible if only the code was open sourced and not the weights.
If you can show me people who work in AI calling just the weights a "model" then I would happily update my internal definition of the word. I am certainly not an expert in the subject, I am just going off what I've read from the community over the past few years.
Open source is about freedom to modify the product. So in the context of an LLM, the source code is the data and the code that processes the data during *training* (not only inference), as that is what generates the weights.
I think it's a little context dependent, and the definition seems to be fluid right now. I've seen "model" be used to refer to just the code, or to refer to the combination of the code and weights. I don't think I've seen it used to refer to just the weights, but I wouldn't be surprised if it's used that way in some contexts.
That doesn't change the meaning of Open Source. These are "free as in beer", not "free as in [modify the sources and rebuild it]". There are LLMs for which that is true, which include a specific list of training data. If you wanted to "uncensor" one of those, you could curate the source data and rebuild it, instead of trying to get it to unlearn what it was taught.
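As a minimal sketch of that view (PyTorch, with toy stand-in data): the weights are the output of running the training code over the training data, so "rebuilding" would mean curating the data and re-running something like this.

```python
import torch
import torch.nn as nn

data = torch.randn(256, 16)                 # stand-in for the curated training data
targets = torch.randn(256, 4)

model = nn.Linear(16, 4)                    # the editable "source" is code + data
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(100):                        # the training loop generates the weights
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(data), targets)
    loss.backward()
    opt.step()

torch.save(model.state_dict(), "weights.pt")  # the weights are the build artifact
```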
Content is a complement to a social network: the cheaper it is to create content, the more content is available, the easier it is to optimize a feed, the larger the time people spend in the platform, the higher the revenue. GenAI is just a method to drive the cost of content creation to zero.
Kind of a dystopian nightmare world in which large corporations utilize AI to create low-cost, infinite content that humans engage with (mostly content catering to the human tendency for tribalism, prestige, sexual desire, etc.). Sounds like we are creating a world similar to the Matrix.
I think we may have already entered it. Infinite-scroll feeds like TikTok, Instagram, and Threads (and possibly Reddit these days) … just an AI algorithm deciding what you should find “entertaining” or “important”.
It’s really the ultimate nightmare with the internet becoming just TV 3.0 in which content is controlled and curated … you just consume mindlessly.
Any attempts to create a Reddit clone.. or system in which people freely communicate is now “regulated” for “hate” speech or “terrorism”. The days of open discourse … appear to be numbered. Even email will be analyzed by AI to look for “trends” or “optimize” employee efficiency.
> ... Infinite Jest, also called "the Entertainment" or "the samizdat". The film is so compelling that its viewers lose all interest in anything other than repeatedly viewing it, and thus eventually die.
I wish I could upvote this a dozen times. This is a very insightful comment. Read the link above first if you aren't sure what "commoditize their complements" means.
Zuck is a total killer. What better way to fight Google and Microsoft than to effectively spawn thousands of potential competitors to their AI businesses with this (and other) releases. There will be a mad scramble over the released weights to develop new tech, these startups will raise tons of money, and then fight the larger incumbents.
This is not charity, this is a shrewd business move.
If you read past the title, this article is not at all clear about whether they are referring to a commercial offering (i.e. license our model for $$) or an open-source license with commercial usage (Apache, etc.)
My guess is still the latter because that's what I've heard the rumors about, but this article is pretty unclear on this fact.
Falcon 40B was released under a "free, with royalties above a certain amount of profit" license, and got roasted for it. It was so bad that they changed the license to Apache.
I don't think any business would run such a "licensed" model over MPT 30B or Falcon 40B, unless it's way better than LLaMA 65B.
You can leverage those big CPUs while still loading both GPUs with a 65B model.
... If you are feeling extra nice, you should set that up as an AI Horde worker whenever you run koboldcpp to play with models. It will run API requests for others in the background whenever it's not crunching your own requests, in return allowing you priority access to models other hosts are running: https://aihorde.net/
If you're just looking to play with something locally for the first time, this is the simplest project I've found and has a simple web UI: https://github.com/cocktailpeanut/dalai
It works for 7B/13B/30B/65B LLaMA and Alpaca (fine-tuned LLaMA which definitely works better). The smaller models at least should run on pretty much any computer.
May I ask why you have such an amazing machine, and two nice graphics cards? Feel free to tell me it's none of my business, it's just very interesting to me :-)
Career dev who had the cash and wanted to experiment with anything that can be done concurrently, such as my language of choice lately, which features high concurrency (https://elixir-lang.org/), or these LLMs, or anything else that can be done in massively parallel fashion (which is, perhaps surprisingly, only a minority of possible computer work, but it still means I can run many apps without much slowdown!)
I originally had two 2080 Tis to experiment also with virtio/Proxmox (you need one GPU for the host and one for any VM you run). I never got that running successfully at the time, but then Proton got really good (I mainly just wanted to run Windows games fast in a VM, and Proton made that unnecessary). Later on I upgraded one of them to a 3080 Ti.
I'm surprised nobody here has brought up the censorship in this model. Listening to Mark Zuckerberg on Lex Fridman's podcast talk about it, it sounds like the model will be significantly blunted vs its "research" version release.
I remember arguing with people who honest-to-god thought that LLaMA was some sort of secret ploy to trick startups into using it, so that Meta could sue them for using it commercially.
Well, now there is a commercial release. I guess it wasn't some corporate plot after all!
Some people just can't admit when a corporation does a good thing.
(In this case, the good thing is being done to obsolete their competitors, but it is good nonetheless that a commercial LLM is available for people to use for free)
Maybe they've solved the fingerprinting problem and can identify text generated from their model, and this is a way of discovering the market they can sell more advanced models to directly. B2B leadgen...
I don't think so because I believe you can train AI models against other AI models. I believe you can fingerprint a family of models, but that's not going to tell whether you just used the general approach outlined in the academic papers.
That would probably work to detect if e.g. OpenAI or Anthropic start using their weights directly. It wouldn't detect whether e.g. a blog was generated with their model or not.
See. They don't care about the LLaMA model leak. It turns out that it was OpenAI that cares because it ruins their moat. It costs Meta nothing to release a better open-source or freely available version of LLaMA again.
Still waiting for the 'Meta is dying' and 'Fire Mark Zuckerberg' calls from last year. A year later, where are they now?
I think we are at a point where we just have to let go, unless we are Disney with an army of lawyers.
Maybe it's time for a change in thinking.
Having said that:
Attribution allows a person to trace the source; it's not a success marker anymore.
Probably. If enough negative statements generated by AI get popular, that could piss off countries/people. For example, if some LLM recognizes Taiwan as an independent country, you can bet China will push for attribution to sources.
We have bills pending in multiple countries that want access to personal or encrypted messages to trace the source.
What's the monetization model here? Is this a closed-source version of their open-source model? (That's suggested by the phrase in the article, "a commercial version of LLaMA, its open-source large language model".)
Like others said it’s probably to commoditize their competition. The models don’t matter so much as ownership of the platform and critical data. Which is why OpenAI is in a tricky position (although I guess they’re partnered with Microsoft).
It seems like the existing large platforms of today—Microsoft’s enterprise moat, Google’s ads and internet services, Meta’s social networks, Apple’s consumer and mobile products—will remain the primary platforms of the future. So having models that can operate exclusively on those platforms via integration with their key products and data will only continue this trend. If you’re an outsider with an AI model, you’ll have a harder time getting access to critical data, and your standalone AI product (e.g., ChatGPT) won’t be as useful.
More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.
> More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.
It also helps that they buy or otherwise cooperate to destroy their competition in questionable ways while heavily lobbying the gov to favor them over others in a quid-pro-quo that benefits politicians and not their constituents.
> More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over.
I disagree: I think big tech is hard to disrupt ATM because the companies are still young and nimble. In the last cycle, the companies being displaced were ancient (by tech standards). When Google and Facebook are 30 years old, their DNA will get in the way of adapting to a new paradigm that will change the world. A paradigm that may be to the Metaverse what the smartphone was to the Apple Newton.
Google open-sourced TensorFlow because they believed it would help with hiring: if researchers could use the same framework to do their PhDs as Google used in its production systems, that was seen as an advantage.
Maybe that's Meta's play here? Maybe the idea is that the ecosystem around a model could be as valuable or more valuable than the model itself too, so an OSS model could benefit Meta a lot more by gaining more of the ecosystem mind share?
Or Maybe Yann LeCun is just a hippie that dreams of free love, hard drugs and open-source models?
LLaMA is already on its way to becoming an industry standard (in my opinion; look at llama.cpp plus everything built on LLaMA). There are benefits to being able to set direction like that. Same as PyTorch, for example: it's not just about direct revenue, it's about everyone building on and contributing to your platform.
They might have done well to make gg an offer he couldn't refuse and take on ggml and llama.cpp as an open source project.
LLaMA isn’t licensed for commercial use. It’s probably an update to the licensing.
Facebook benefits heavily from the open source development done on LLaMA. There was a report I saw that Facebook has started using llama.cpp internally for inference. Updates to the licensing will cement Facebook as the go-to choice for open source language models.
Based on the podcast with Lex Fridman and Mark Zuckerberg, see ~minute 30.
My hypothesis, based on the context of Mark discussing the release, is that it's going to be completely open source and can be licensed to be used commercially. Not that Meta is going to add a whole new revenue side of the business to compete with OpenAI. I.e. "Here is the model, with commercially permissive licensing", not "Here is the model that you can use commercially but must pay me".
Meta has been one of the major open source contributors for about a decade now. They open source/contribute to a lot of tech, as their business isn’t about tech, but products.
This isn't some recent revelation or anything. Facebook's AI team (FAIR) open sourced their major technology in 2017 with Pytorch. In 2018 they published Pytext in an age when most people didn't know what a Large Language Model even meant. Seeing LLaMA get made should not be a surprise to anyone who is familiar with the history of AI research. It's like hearing people call CUDA an "unfair advantage" while ignoring billions of Nvidia R&D dollars getting spent in the AI sector over the course of a decade.
It might feel like "brand rehab" or "good will" as a consumer, but a lot of this work was put in motion a while ago.
I don’t know, maybe they don’t need to monetize their model? They need their models to be the best and to support their core business of ads; anything that keeps users on their platform for any reason is their goal. They need their models to be an industry standard and one upon which other things are built.
That's a huge reason to do it also, but it also makes sense if you have researchers + developers improving the engine of something that powers your product. The moat / competitive advantage at FB is their network, not so much the proprietary underlying tech.
People often say this but having interviewed ~200 facebook engineers over the years, their scaling tech around both software and hardware is pretty impressive.
Yeah, I guess it's a competitive advantage when a competitor (Twitter) is visibly having technical problems operating at global scale with a smaller team. Their scale is not trivial by any means. But people aren't going to go to FB because they have the best LLM, so it makes sense to offload that development to the open source community.
You still need to build real-time serving infrastructure on top of LLaMA/Vicuna/Alpaca in order to compete with ChatGPT/OpenAI so it's not going to be done by that many companies and OpenAI already has a mindshare/first mover advantage.
When you use ChatGPT you are leasing their GPU infrastructure and their proprietary model, this opens the possibility of leasing GPU infrastructure from another company and using an open source model. You don't necessarily need to do the hard parts yourself, you can hire it out to competing companies.
Sure, but it's extra work slowing you down as your competitor is surfing the wave at full speed. Moreover, you are relying on an old LLM whereas OpenAI is developing newer versions of theirs, keeping their competitive advantage. Even Google who has the infra has a ridiculously bad LLM to compete.
Well, if they really released it as open source, I guess depending on the exact license, a company that modifies (fine-tunes) it and wants to make money on that modified version would have to distribute the weights and/or disclose the details about how they fine-tuned it, on what data, etc. By offering a commercial license, the buyer can do anything they want.
Meta is a company that makes money off of users endlessly browsing content. It would follow that making it easier/faster to generate content would benefit Meta.
If you want to live the good life before you are exquisitely extinguished, spend every other day figuring out how to buy more NVDA, the other days exercising outside, being human.
No, you can sell open source software commercially. That being said, I'm wondering if the license will truly be open source or more like Stable Diffusion's license which is not really open source.
Commercial presumably as opposed to non-commercial licensing (e.g. the CC BY-NC license, or the weird situation LLaMa is in).
If you listen to the definition the Open Source Initiative would have applied to the term open source had they succeeded in acquiring rights to the term, then commercial is redundant with open source, not the opposite of it.
[1] https://news.ycombinator.com/item?id=36657540