From the recent story about the Sarah Silverman lawsuit [1]:
The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
IANAL, but this basically sounds like LLaMa was trained on illegally obtained books by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).
Sometimes I wonder: what if someone in XYZ country downloaded the whole of Z-Library/Libgen, all the books ever printed, all the papers ever published, all the newspapers and so on, trained a model on it, and released the model open source? There are jurisdictions with lax rules.
And it would have much better knowledge, answers, etc. than the Western, lawyer-approved models.
The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things.
At this point, with the quality of current web content and the collapse of journalism as an industry, I think we can say online ads have utterly failed as a replacement income stream.
Unless you want every LLM to say “I’m sorry, the data I was trained on ends in 2023,” you still need a content funding model. Maybe not copyright, but sure as hell not ads either.
"Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things."
By some definition of "worked". If we define "worked" as "made money for", it has worked mostly for the middlemen and a minority of writers... a minority that, with the advent of LLMs, is likely to shrink even further.
You state this as a fact, but it's actually much less certain whether it's ever been net-positive.
It was probably intended that way, but the reality is that the power has been with the publisher since the beginning, and they've absolutely been screwing over the authors as well. Only the most successful authors have gotten decent deals.
I don't have an answer to this either though, I just wanted to point out that copyright has arguably never been successful at getting money to the content creators proportional to the value the publisher extracted from the work either.
The only way you’d know is to A/B test with a country with no copyright, and see how their authors get by.
My guess is extremely poorly. Again, the biggest might be fine. Instead of publishers paying fairly little to authors they could just literally take the best books and print them, taking all of the profits…not to mention ebooks.
I’m not an author so I can’t speak to how much publishers make, but I’d assume that if one was way better than the others in how much they distribute to authors, all of the best authors would jump ship. Markets have a way of working things out.
A lot of people want to be authors, and any time that happens - game dev, teachers, musicians, etc. - you’re going to take on a bit of extra hardship compared to other jobs.
I'm not saying that it would be better for authors without copyright. That would indeed be hard to ascertain without a/b testing.
My point was that it doesn't improve their lives, and that's much easier to check in isolation just by reading the news about the current writers strike and how the industry just ignores it until fall, expecting their savings to run out.
Really, copyright just doesn't give the content creators any meaningful power as this right is generally owned by the industry/publisher, not the authors.
The production of knowledge (I assume you're mainly talking about scientific research here) is absolutely not funded by copyright royalties or anything like that.
Journals get their content for free. Actually often they charge the authors for it.
Research is mainly funded by governments and taxes.
But again, "funding" is merely a common element, and only one step, in the process. It's not always necessary and is definitely never sufficient, and I think when you bring it up, the mental model people have is at the wrong scale.
Put differently, we consider -- but don't think a whole lot about -- Wikipedia's "funding," because that's NOT the most important part/innovation of that model.
>The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time.
Can you give some examples of new knowledge that was copyrighted? Generally copyright is used to protect art, software, and textbooks. People who produce new knowledge generally are not paid via copyright. The knowledge is either kept secret or published in a journal from which the author receives no compensation.
Training and copyright is going to be interesting. People can be trained on “illegally obtained” books too, yet you're probably going to be hard pressed to argue that an employee who downloaded a book or a paper from a “libre library” could serve as the basis for a fruit-of-the-poisonous-tree argument down the line.
If the company supplied the employee with the “illegally obtained” books, that could be reason to view the situation differently than an employee acting on their own.
Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.
It's in a weird place IMO. With Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same.
i.e.,
you're allowed to scrape the web
you're allowed to take what you scrape and put it in a database
you're allowed to use your database to inform on decisions you might make, or content you might create
but once you put an AI model in the mix, all of a sudden there are problems. Despite the fact that making the model is 10000% harder than doing all of the points mentioned above, using someone else's work somehow becomes a problem when it never was before.
And if truly free and open source LLMs come into the game, might the corporate ones become crippled by copyright? That's bad for business.
> It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).
> I wonder if this is going to cause issues down the road.
There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.
... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.
I’ve been wondering when the landmark moral panic would start against Civit.AI and the coomer crowd. People have no idea just how much porn is being produced by this stuff. One of the top textual inversions right now is an… age slider… (https://civitai.com/models/65214/age-slider) ewww. It’s also extremely well rated and reviewed on there. I’m terrified of the impending backlash, because depending on what happens, the party going on in AI could end.
People have been saying this about underage hand-drawn hentai forever, but it's still around.
Not that I am disagreeing with you. What I find particularly disturbing are the paid services for this.
Also, I have seen two separate OnlyFans pimps ask for help in a text generation chatroom. Something about automating "private" texting from their "girls."
Yeah. I did a fine tuned model of my daughter and niece and I definitely have to put in “sexy, naked,” and the like in the negative prompt when using them.
I don’t think society is going to have a hissyfit until some app comes along that makes it super easy for people to train good models locally on people and then generate whatever they want. That day’s coming really soon though.
Sure, but it's still not super user friendly. You upload photos, then get a 2 GB checkpoint file that you have to run with some obscure, sometimes hard-to-install programs.
I know there was a phone app that did a limited thing where they gave you profile images and they made bank. I'm a little surprised nobody has tried going whole hog, if the app stores would even allow it.
No, actually they probably can’t. There is no verifiable way to remove the data from the model apart from completely removing all instances of that information from the training data and retraining. The project you linked only describes a selective finetuning approach.
Until you get models with completely disentangled feature spaces such that you know that the influence of a piece of data is completely removed (at the limit this is something like an embedding DB), there is absolutely no way you can claim you’ve removed the data from the model.
At most, these efforts will amount to data laundering where it will be impossible to prove that a piece of data was used to train the model, not provide conclusive proof that it was removed.
If we accept the argument that you can train a ML model on data scraped from the internet because the model is sufficiently transformative and thus isn't impacted by the copyright of that data, then how does that change simply because somebody else distributed the data illegally? Either the ML model breaks the copyright chain or it doesn't. Or is the argument that using data that was provided to you in violation of copyright is illegal in general?
While this may be true, the reverse is also true, and even if it’s legal, there are other ways to frame this that are worth considering, e.g. It could technically be legal, but not in accordance with the spirit of the law. Updates to laws are required. The fact that the model is legal is an additional problem on top of the gap in the law.
I think my main point here is that “legal” does not imply moral or acceptable to society, and our understanding of the technical legal status is not a prerequisite for exploring those factors, which may be the thing that changes the legal status in response to the major shift in landscape.
Right, but if you have a plausible case that you weren't breaking the law and it was a legal unknown, the most that will happen is "we've decided this is officially illegal, stop doing it."
You risk nothing by assuming things are legal until explicitly illegal.
If you limit the framing of the conversation to that of an amoral corporate entity, sure. But I don’t think there was ever a question that companies can legally do things that are potentially (or unequivocally) distasteful if not outright unethical/immoral.
More interesting is the broader conversation which involves society’s response to a major shift in the information economy, new questions about what role these tools should play, and how laws should evolve accordingly.
The factors surrounding the emergence/unfolding of AI tooling can’t be stripped down to just the corporate interests involved.
Copyright laws should be amended to allow this scenario. If I read a book and write about it in a blog, it is considered review. Why shouldn’t we allow companies to do the same to train their models? Overall it will benefit society more than it hurts some rich authors.
I think it’s a mistake/fallacy to equate the human acquisition of knowledge and resulting synthesis of value with that of large-scale computers ingesting the sum total of written human knowledge and the outcomes that enables.
They are not similar, and I suspect that if they were (i.e. humans could absorb that much information), the information landscape and the market models for exchanging value would look nothing like they do today, and AI wouldn’t be rocking the boat, it’d just be another adherent to the resulting laws.
That's one thing I'm consistently surprised HN fails to draw a distinction on: copyright regimes are fundamentally about copy rate.
You can't take a regime that works decently with human-rate copying and convert it to computer-rate copying, because fundamentally the give-and-take of rights to each side is balanced against feasible limits of reproduction.
Or, to put it another way, if you can copy/synthesize at most 1 book a day, I can extend you a lot more implicit rights... than I can afford to someone who can copy/synthesize every book ever in a day.
I think the difference is you presumably obtained that book legally before writing the review. In this case the book was pirated (the definitely illegal part), and then used for training (the possibly illegal part, but I suspect this would be deemed fair use).
IMO Google and their massive Google Books DB would have a better leg to stand on here if they trained on that dataset, as they owned physical copies of all the books.
>Copyright laws should be amended to allow this scenario. If I read a book and write about it in a blog, it is considered review.
The problem with current AI is that it memorizes stuff: there was the case of the AI memorizing an algorithm perfectly, or reciting quotes from Dune and then getting censored.
Now you, as a paying user of these AI tools, are not writing reviews but probably using them for commercial purposes, and it would not be fair if your proprietary code contained code copy-pasted from GPL code.
If this AI were so clever, then IMO you could have it learn, say, Python exactly like a human does: a few books and some exercises on Python, some books on algorithms, some books on HTML or whatever tech. But today they train on the whole of GitHub and you get a mix of stuff. My suggestion would also improve the sorry state of JS in ChatGPT, where it uses super old syntax and outdated patterns like it is coding for IE6. My guess is this is because it is trained with old or bad code, and this means most of the code from now on will be old syntax and bad.
Going to go with “no, you don’t need to be rich to sue”. Likewise, to be included in a class action you don’t have to pay anything, or even participate in any way; you just get a cut of the settlement.
I doubt it makes a difference whether they purchase the ebook or not. And probably a bunch of them aren't even available as ebooks legitimately, people scan books and upload them to zlibrary etc.
Maybe this is just semantics, but I don't know if the OSS-vs-freemium distinction matters all that much (I'd have to think about the potential downsides a bit more tbh).
Virtually every discussion in the LLM space right now is almost immediately bifurcated by the "can I use this commercially?" question which has a somewhat chilling effect on innovation. The best performing open source LLMs we have today are llama-based, particularly the WizardLM variants, so giving them more actual industry exposure will hopefully be a force multiplier.
Llama isn't open source either. But if I understand your point correctly, you're saying that the commercial use axis is what is important to people, and it's orthogonal to freeware vs open source. In the present environment, I agree. But I don't think we should let companies get away with poisoning the term open source for things which are not. I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with red pajamas and others in the works. The distinction could be important in the near term, at the rate this field is developing at.
Neural network weights are better viewed as source code because they specify what function the network computes. As we're operating purely on feed-forward networks, there are no loops. Therefore, weights fully describe everything relevant for executing their represented function on inputs. Weights can be seen as a sort of intermediate language (with lots of stored data and partially computed states) interpretable by some deep learning library.
The network architecture itself is not source code, but a rough specification constraining the optimizer, which searches for possible program descriptions that, within the specified constraints, minimize some loss function with respect to the data.
Neither data nor network architecture are the actual source; they are better seen as recipes which, if followed (at great expense), allow finding behaviorally similar programs. As you can see, the standard ideas of open source don’t quite carry over, because the actual "source code" is not human interpretable.
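To make that concrete, here's a minimal sketch (plain Python/numpy, with a made-up two-layer net, just for illustration): once the weights exist, the function the network computes is fully pinned down, and inference is just fixed arithmetic over those arrays.

```python
import numpy as np

def forward(weights, x):
    """Run a tiny feed-forward net; behavior depends only on `weights`."""
    W1, b1, W2, b2 = weights
    h = np.maximum(0, x @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2               # linear output layer

rng = np.random.default_rng(0)
weights = (rng.normal(size=(4, 8)), np.zeros(8),   # illustrative, untrained weights
           rng.normal(size=(8, 2)), np.zeros(2))
print(forward(weights, rng.normal(size=(1, 4))))   # shape (1, 2)
```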
> Weights can be seen as a sort of intermediate language (with lots of stored data and partially computed states) interpretable by some deep learning library.
I've often talked about weights being the equivalent to assembly, your note seems to map to a similar intuition. And in that sense provided we ever solve the interpretability problem, we could in theory disassemble the weights to achieve similar outcomes as we do in asm-to-C. Interesting thought experiment insofar as, if the weights ought not be classified as open source (notwithstanding your first point which I agree with), can the disassembled output be classified as open source?
> But I don't think we should let companies get away with poisoning the term open source for things which are not.
That's totally fair. And you're correct in that I was making an argument for positive outcomes being orthogonal to the semantics distinction.
> I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with red pajamas and others in the works. The distinction could be very important in the near term, at the rate this field is developing at.
I think Falcon and MPT support your point as well, but those are still models that were trained on very small budgets relative to llama or gpt-3/4. There's a clear quality delta, albeit that gap is closing. Through that lens, I think having a large, well-funded org doing the pre-training work for the OSS community and releasing the weights permissively is a net positive.
Sen. Marsha Blackburn said “fair use” protections have become a “fairly useful way to steal” intellectual property. Some people would like to use this situation to get rid of "fair use".
Strong disagree - I think OSS is a fine framing of this. Weights are a third category; you can 'fork' them in a way that you can't with standard binaries.
Nobody does that because if you only have binaries you probably don't have permission to do that. Plus it's impractical to make any significant changes that way.
Maybe there is no source code? I imagine an LLM is like output of the following process. There's a huge room full of programmers that can directly edit machine code. You give them a random binary, which they then hack on for a while and publish the result. You then inspect it and tell them it isn't quite optimal in some way and ask them for a new version. Iterate on this process a bazillion times. At the end you get a binary that you're reasonably happy with. Nobody ever has the source code.
Source code is the preferred form for development.
In your scenario, despite the unrealistic coding process, the machine code is the source code, because that's what everyone is working on.
In the development of an LLM, the weights are in no way the preferred form of development. Programmers don't work on weights. They work on data, infrastructure, the model, the code for training, etc. The point of machine learning is not to work on weights.
Unless you anthropomorphize optimizers, in which case the weights are indeed the preferred form of editing, but I have never seen anyone---even the most forward AGI supporters---argue that optimizers are intelligent agents.
What? You work on the weights - you just do it using tools like the optimizers, etc.
You release your weights, others can build on top of that, fine tune it in different ways, produce new weights they can share with others. Seems very OSS-y.
I feel like there is some semantic nitpicky point being made here that is completely going over my head.
By "work on", I mean "making direct edits". If we take broad definition of "work on", we lose all the distinction between source code and output. Any binary code is source code in any project, because the programmers simply is using tools to work on them, like the compiler.
For all practical purposes, if you are part of the team who released the LLMs, you would be writing and modifying the code of data processing, of the model, and of the training process. Those should be considered source code.
And we do have the model, which is pretty OSS-y, and which is why we can fine-tune the weights. But from a broader perspective, it's not fully OSS-y, because we don't have the code for anything else. There's no way to change, for example, how the training is done in the first place.
I read it in all such discussions. What does it mean? I just have a very high level understanding of AI models. No idea how things work under the hood or what knobs can be tweaked.
The source code is all the supporting code needed to run inference on the weights. This is usually python and in the case of llama it's already open source. Usually the source code is referred to as the "model". You can kind of think of the weights as a settings file in a normal desktop application. The desktop app has its own source code and loads in the settings file at runtime. It can load different settings files for different behaviors.
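A rough sketch of that analogy in PyTorch terms (the model class and the weights filename here are made up for illustration): the "source code" is the model definition, and the weights file is just data loaded into it at runtime, like a settings file.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):          # the "app": ordinary source code
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 4)

    def forward(self, x):
        return self.layer(x)

model = TinyModel()
state = torch.load("weights.pt")     # the "settings file" (hypothetical path)
model.load_state_dict(state)         # same code, different behavior per weights file
```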
This is almost completely wrong. When people who work in AI refer to the "model", they are generally referring to the weights. It is the weights which are the most important determinant of how the model performs, and it is the weights that require the most resources to develop. Associated code and other assets are also important, but they are not the core asset. The intuitive sense of open sourcing a model therefore typically means releasing the weights under an open licence (ideally along with the training and inference code, data, training info, etc.).
I am not making a value judgement on what's the "most important" aspect when comparing the code vs the weights. I am just explaining the terminology as I understand it. Your intuitive sense of open sourcing certainly makes sense to me. I think a lay person would expect to be able to generate content with an "open source ai model" and that wouldn't be possible if only the code was open sourced and not the weights.
If you can show me people who work in AI calling just the weights a "model" then I would happily update my internal definition of the word. I am certainly not an expert in the subject, I am just going off what I've read from the community over the past few years.
Open source is about freedom to modify the product. So in the context of an LLM, the source code is the data and the code that processes the data during *training* (not only inference), as that is what generates the weights.
I think it's a little context dependent, and the definition seems to be fluid right now. I've seen "model" be used to refer to just the code, or to refer to the combination of the code and weights. I don't think I've seen it used to refer to just the weights, but I wouldn't be surprised if it's used that way in some contexts.
That doesn't change the meaning of Open Source. These are "free as in beer", not "free as in [modify the sources and rebuild it]". There are LLMs for which that is true, which include a specific list of training data. If you wanted to "uncensor" one of those, you could curate the source data and rebuild it, instead of trying to get it to unlearn what it was taught.
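As a minimal sketch of that view (PyTorch, with toy stand-in data): the weights are the output of running the training code over the training data, so "rebuilding" would mean curating the data and re-running something like this.

```python
import torch
import torch.nn as nn

data = torch.randn(256, 16)                 # stand-in for the curated training data
targets = torch.randn(256, 4)

model = nn.Linear(16, 4)                    # the editable "source" is code + data
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(100):                        # the training loop generates the weights
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(data), targets)
    loss.backward()
    opt.step()

torch.save(model.state_dict(), "weights.pt")  # the weights are the build artifact
```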
Content is a complement to a social network: the cheaper it is to create content, the more content is available, the easier it is to optimize a feed, the larger the time people spend in the platform, the higher the revenue. GenAI is just a method to drive the cost of content creation to zero.
Kind of a dystopian nightmare world in which large corporations utilize AI to create low-cost, infinite content that humans engage with (mostly content catering to the human tendency for tribalism, prestige, sexual desire, etc.). Sounds like we are creating a world similar to the Matrix.
I think we may have already entered it. Infinite-scroll feeds like TikTok, Instagram, and Threads (and possibly Reddit these days) … just an AI algorithm deciding what you should find “entertaining” or “important”.
It’s really the ultimate nightmare with the internet becoming just TV 3.0 in which content is controlled and curated … you just consume mindlessly.
Any attempts to create a Reddit clone.. or system in which people freely communicate is now “regulated” for “hate” speech or “terrorism”. The days of open discourse … appear to be numbered. Even email will be analyzed by AI to look for “trends” or “optimize” employee efficiency.
> ... Infinite Jest, also called "the Entertainment" or "the samizdat". The film is so compelling that its viewers lose all interest in anything other than repeatedly viewing it, and thus eventually die.
I wish I could upvote this a dozen times. This is a very insightful comment. Read the link above first if you aren't sure what "commoditize their complements" means.
Zuck is a total killer. What better way to fight Google and Microsoft than to effectively spawn thousands of potential competitors to their AI businesses with this (and other) releases. There will be a mad scramble over the released weights to develop new tech, these startups will raise tons of money, and then fight the larger incumbents.
This is not charity, this is a shrewd business move.
If you read past the title, this article is not at all clear about whether they are referring to a commercial offering (i.e. license our model for $$) or an open-source license with commercial usage (Apache, etc.)
My guess is still the latter because that's what I've heard the rumors about, but this article is pretty unclear on this fact.
Falcon 40B was released under a "free, with royalties above a certain amount of profit" license, and got roasted for it. It was so bad that they changed the license to Apache.
I don't think any business would run such a "licensed" model over MPT 30B or Falcon 40B, unless it's way better than LLaMA 65B.
You can leverage those big CPUs while still loading both GPUs with a 65B model.
... If you are feeling extra nice, you should set that up as an AI Horde worker whenever you run koboldcpp to play with models. It will run API requests for others in the background whenever it's not crunching your own requests, in return allowing you priority access to models other hosts are running: https://aihorde.net/
If you're just looking to play with something locally for the first time, this is the simplest project I've found and has a simple web UI: https://github.com/cocktailpeanut/dalai
It works for 7B/13B/30B/65B LLaMA and Alpaca (fine-tuned LLaMA which definitely works better). The smaller models at least should run on pretty much any computer.
May I ask why you have such an amazing machine, and two nice graphics cards? Feel free to tell me it's none of my business, it's just very interesting to me :-)
Career dev who had the cash and wanted to experiment with anything that can be done concurrently, such as my language of choice lately, which features high concurrency (https://elixir-lang.org/), or these LLMs, or anything else that can be done in massively parallel fashion (which is, perhaps surprisingly, only a minority of possible computer work, but it still means I can run many apps without much slowdown!)
I originally had two 2080 Tis to experiment also with virtio/Proxmox (you need one GPU for the host and one for any VM you run). I never got that running successfully at the time, but then Proton got really good (I mainly just wanted to run Windows games fast in a VM, and Proton made that unnecessary). Later on I upgraded one of them to a 3080 Ti.
I'm surprised nobody here has brought up the censorship in this model. Listening to Mark Zuckerberg on Lex Fridman's podcast talk about it, it sounds like the model will be significantly blunted vs its "research" version release.
I remember arguing with people who honest-to-god thought that LLaMA was some sort of secret ploy to trick startups into using it, so that Meta could sue them for using it commercially.
Well, now there is a commercial release. I guess it wasn't some corporate plot after all!
Some people just can't admit when a corporation does a good thing.
(In this case, the good thing is being done to obsolete their competitors, but it is good nonetheless that a commercial LLM is available for people to use for free)
Maybe they've solved the fingerprinting problem and can identify text generated from their model, and this is a way of discovering the market they can sell more advanced models to directly. B2B leadgen...
I don't think so because I believe you can train AI models against other AI models. I believe you can fingerprint a family of models, but that's not going to tell whether you just used the general approach outlined in the academic papers.
That would probably work to detect if e.g. OpenAI or Anthropic start using their weights directly. It wouldn't detect whether e.g. a blog was generated with their model or not.
See. They don't care about the LLaMA model leak. It turns out that it was OpenAI that cares because it ruins their moat. It costs Meta nothing to release a better open-source or freely available version of LLaMA again.
Still waiting for the 'Meta is dying' and 'Fire Mark Zuckerberg' calls from last year. A year later, where are they now?
I think we are at a point where we just have to let go, unless we are Disney with an army of lawyers.
Maybe it's time for a change in thinking.
Having said that:
Attribution allows a person to trace the source; it's not a success marker anymore.
Probably. If enough negative statements generated by AI get popular, that could piss off countries/people. For example, if some LLM recognizes Taiwan as an independent country, you can bet China will push for attribution to sources.
We have bills pending in multiple countries that want access to personal or encrypted messages to trace the source.
What's the monetization model here? Is this a closed-source version of their open-source model? (That's suggested by the phrase in the article, "a commercial version of LLaMA, its open-source large language model".)
Like others said it’s probably to commoditize their competition. The models don’t matter so much as ownership of the platform and critical data. Which is why OpenAI is in a tricky position (although I guess they’re partnered with Microsoft).
It seems like the existing large platforms of today—Microsoft’s enterprise moat, Google’s ads and internet services, Meta’s social networks, Apple’s consumer and mobile products—will remain the primary platforms of the future. So having models that can operate exclusively on those platforms via integration with their key products and data will only continue this trend. If you’re an outsider with an AI model, you’ll have a harder time getting access to critical data, and your standalone AI product (e.g., ChatGPT) won’t be as useful.
More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.
> More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.
It also helps that they buy or otherwise cooperate to destroy their competition in questionable ways while heavily lobbying the gov to favor them over others in a quid-pro-quo that benefits politicians and not their constituents.
> More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over.
I disagree: I think big tech is hard to disrupt ATM because the companies are still young and nimble. In the last cycle, the companies being displaced were ancient (by tech standards). When Google and Facebook are 30 years old, their DNA will get in the way of adapting to a new paradigm that will change the world. A paradigm that may be to the Metaverse what the smartphone was to the Apple Newton.
Google open-sourced TensorFlow because they believed it would help with hiring: if researchers could use the same framework to do their PhDs as Google used in its production systems, that was seen as an advantage.
Maybe that's Meta's play here? Maybe the idea is that the ecosystem around a model could be as valuable or more valuable than the model itself too, so an OSS model could benefit Meta a lot more by gaining more of the ecosystem mind share?
Or Maybe Yann LeCun is just a hippie that dreams of free love, hard drugs and open-source models?
LLaMA is already on its way to becoming an industry standard (in my opinion; look at llama.cpp plus everything built on LLaMA). There are benefits to being able to set direction like that. Same as PyTorch, for example: it's not just about direct revenue, it's about everyone building on and contributing to your platform.
They might have done well to make gg an offer he couldn't refuse and take on ggml and llama.cpp as an open source project.
LLaMA isn’t licensed for commercial use. It’s probably an update to the licensing.
Facebook benefits heavily from the open source development done on LLaMA. There was a report I saw that Facebook has started using llama.cpp internally for inference. Updates to the licensing will cement Facebook as the go-to choice for open source language models.
Based on the podcast with Lex Fridman and Mark Zuckerberg, see ~minute 30.
My hypothesis, based on the context of Mark discussing the release, is that it's going to be completely open source and can be licensed to be used commercially. Not that Meta is going to add a whole new revenue side of the business to compete with OpenAI. I.e. "Here is the model, with commercially permissive licensing", not "Here is the model that you can use commercially but must pay me".
Meta has been one of the major open source contributors for about a decade now. They open source/contribute to a lot of tech, as their business isn’t about tech, but products.
This isn't some recent revelation or anything. Facebook's AI team (FAIR) open sourced their major technology in 2017 with Pytorch. In 2018 they published Pytext in an age when most people didn't know what a Large Language Model even meant. Seeing LLaMA get made should not be a surprise to anyone who is familiar with the history of AI research. It's like hearing people call CUDA an "unfair advantage" while ignoring billions of Nvidia R&D dollars getting spent in the AI sector over the course of a decade.
It might feel like "brand rehab" or "good will" as a consumer, but a lot of this work was put in motion a while ago.
I don’t know, maybe they don’t need to monetize their model? They need their models to be the best and to support their core business of ads; anything that keeps users on their platform for any reason is their goal. They need their models to be an industry standard and one upon which other things are built.
That's a huge reason to do it also, but it also makes sense if you have researchers + developers improving the engine of something that powers your product. The moat / competitive advantage at FB is their network, not so much the proprietary underlying tech.
People often say this but having interviewed ~200 facebook engineers over the years, their scaling tech around both software and hardware is pretty impressive.
Yeah, I guess it's a competitive advantage when a competitor (Twitter) is visibly having technical problems operating at global scale with a smaller team. Their scale is not trivial by any means. But people aren't going to go to FB because they have the best LLM, so it makes sense to offload that development to the open source community.
You still need to build real-time serving infrastructure on top of LLaMA/Vicuna/Alpaca in order to compete with ChatGPT/OpenAI so it's not going to be done by that many companies and OpenAI already has a mindshare/first mover advantage.
When you use ChatGPT you are leasing their GPU infrastructure and their proprietary model, this opens the possibility of leasing GPU infrastructure from another company and using an open source model. You don't necessarily need to do the hard parts yourself, you can hire it out to competing companies.
Sure, but it's extra work slowing you down as your competitor is surfing the wave at full speed. Moreover, you are relying on an old LLM whereas OpenAI is developing newer versions of theirs, keeping their competitive advantage. Even Google who has the infra has a ridiculously bad LLM to compete.
Well, if they really released it as open source, I guess depending on the exact license, a company that modifies (fine-tunes) it and wants to make money on that modified version would have to distribute the weights and/or disclose the details about how they fine-tuned it, on what data, etc. By offering a commercial license, the buyer can do anything they want.
Meta is a company that makes money off of users endlessly browsing content. It would follow that making it easier/faster to generate content would benefit Meta.
If you want to live the good life before you are exquisitely extinguished, spend every other day figuring out how to buy more NVDA, the other days exercising outside, being human.
No, you can sell open source software commercially. That being said, I'm wondering if the license will truly be open source or more like Stable Diffusion's license which is not really open source.
Commercial presumably as opposed to non-commercial licensing (e.g. the CC BY-NC license, or the weird situation LLaMa is in).
If you listen to the definition the Open Source Initiative would have applied to the term open source had they succeeded in acquiring rights to the term, then commercial is redundant with open source, not the opposite of it.
[1] https://news.ycombinator.com/item?id=36657540