Look, absolutely zero people in the world should trust any tech company when they say they care about or will keep commitments to the open-source ecosystem in any capacity. Nevertheless, it is occasionally strategic for them to follow through, and there can be ancillary benefits for said ecosystem in those moments when openness happens to be their best play for harming their competitors
For now, Meta seems to release Llama models in ways that don't significantly lock people into their infrastructure. If that ever stops being the case, you should fork rather than trust their judgment. I say this knowing full well that most of the internet is on AWS or GCP, most brick and mortar businesses use Windows, and carrying a proprietary smartphone is essentially required to participate in many aspects of the modern economy. All of this is a mistake. You can't resist all lock-in. The players involved effectively run the world. You should still try where you can, and we should still be happy when tech companies either slip up or make the momentary strategic decision to make this easier
As an ML researcher, I agree. Meta doesn't include adequate information to replicate the models, and from the perspective of fundamental research, the interest big tech companies have taken in this field has been a significant impediment to independent researchers precisely because of this lack of openness, despite the fact that they are undeniably producing groundbreaking results in many respects
This should also make everyone very skeptical of any claim they are making, from benchmark results to the legalities involved in their training process to the prospect of future progress on these models. Without being able to vet their results against the same datasets they're using, there is no way to verify what they're saying, and the credulity that otherwise smart people have been exhibiting in this space has been baffling to me
As a developer, if you have a working Llama model, including the source code and weights, and it's crucial for something you're building or have already built, it's still fundamentally a good thing that Meta isn't gating it behind an API. If they went away tomorrow, you could still use, self-host, retrain, and study the models
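To make that independence concrete, here's a minimal sketch of local inference, assuming a Llama checkpoint already sits on your disk and you're using the Hugging Face transformers library (the local path is hypothetical):

```python
# A minimal sketch: inference against weights you already have on disk.
# local_files_only=True means nothing is fetched from anyone's servers.
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/models/llama-3-8b"  # hypothetical local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(path, local_files_only=True)

inputs = tokenizer("Self-hosted weights keep working because", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If Meta vanished tomorrow, nothing in that loop would notice.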
A) Release the data, and if it ends up causing a privacy scandal, at least you can actually call it open this time.
B) Neuter the dataset, and the model along with it
All I ever see in these threads is a lot of whining and no viable alternative solutions (I’m fine with the idea of it being a hard problem, but when I see this attitude from “researchers” it makes me less optimistic about the future)
> and the credulity that otherwise smart people have been exhibiting in this space has been baffling to me
Remove the “otherwise” and you’re halfway to understanding your error.
This isn't a dilemma at all. If Facebook can't release the data it trains on because it would compromise user privacy, that is already a significant privacy violation that should be a scandal, and if releasing the data would prompt regulatory or legislative remedies against Facebook, releasing the trained model, even through an API, should do the same. The only reason people don't think about it this way is that awareness of how these technologies work isn't pervasive enough for the general public to think it through, and it's hard to prove definitively. Basically, if this is Facebook's position, it's saying that the release of the model already constitutes a violation of user privacy, but they're betting no one will catch them
If the company wants to help research, it should full-throatedly endorse the position that it doesn't consider it a violation of privacy to train on the data it does, and release it so that it can be useful for research. If the company thinks it's safeguarding user privacy, it shouldn't be training models on data it considers private and then using them in public-facing ways at all
As it stands, Facebook seems to take the position that it wants to help the development of software built on models like Llama, but not the fundamental research that goes into building those models
> If Facebook can't release data it trains on because it would compromise user privacy, it is already a significant privacy violation that should be a scandal
Thousands of entities would scramble to sue Facebook over any released dataset no matter what the privacy implications of the dataset are.
It's just not worth it in any world. I believe you are not thinking of this problem from the point of view of the PMs or VPs who would actually have to approve this: if I were a VP and I was 99% confident that the dataset had no privacy implications, I still wouldn't release it. Just not worth the inevitable long, drawn-out lawsuits from people and regulators trying to get their pound of flesh.
I feel the world is too hostile to big tech and AI to enable something like this. So, unless we want to kill AGI development in the cradle, this is what we get - and we can thank modern populist techno-pessimism for cultivating this environment.
Translation: "we train our data on private user data and copyrighted material so of course we cannot disclose any of our datasets or we'll be sued into oblivion"
There's no AGI development in the cradle. And the world isn't "hostile". The world is increasingly tired of predatory behavior by supranational corporations
Lmao what? If the world were sane and hostile to big tech, we would've nuked them all years ago for all the bullshit they pulled and continue to pull. Big tech has politicians in their pockets, but thankfully the "populist techno-pessimist" (read: normal people who are sick of billionaires exploiting the entire planet) are finally starting to turn their opinions, albeit slowly.
If we lived in a sane world Cambridge Analytica would've been the death knell of Facebook and all of the people involved with it. But we instead live in a world where psychopathic pieces of shit like Zucc get away with it, because they can just buy off any politician who knocks on their doors.
> normal people who are sick of billionaires exploiting the entire planet
They don't understand what big tech does for humanity or how much they rely on it day to day. Literally all of their modern conveniences are enabled by big tech.
The CrowdStrike outage merely shows how much people depend on big tech without even realizing it.
I think you have too much faith in the average person. They scarcely understand how nearly everything in their life has been manufactured or designed on systems powered by big tech.
This post demonstrates a willful ignorance of the factors driving so-called "populist techno-pessimism" and I'm sure every time a member of the public is exposed to someone talking like this, their "techno-pessimism" is galvanized
The ire people have toward tech companies right now is, like most ire, perhaps overreaching in places. But it is mostly justified by the real actions of tech companies, and Facebook has done more to deserve it than most. The thought process you just described sounds like an accurate prediction of the mindset and culture of a VP within Facebook, and I'd like you to reflect on it for a sec. Basically, you rightly point out that the org releasing what data it has would likely invite lawsuits, and then you proceed through some kind of insane offscreen mental gymnastics that let this reality mean nothing to you except that the unwashed masses irrationally hate the company for some unknowable reason
Like you're talking about a company that has spent the last decade buying competitors to maintain an insane amount of control over billions of users' access to their friends, feeding them an increasingly degraded and invasive channel of information that also, from time to time, runs nonconsensual social experiments on them, and following even people who didn't opt in around the internet through shady analytics plugins in order to sell dossiers of information on them to whoever will pay. What do you think it is? Are people just jealous of their success, or might they have some legit grievances that cause them to distrust and maybe even loathe such an entity? It is hard for me to believe Facebook has a dataset large enough to train a current-gen LLM that wouldn't also feel, viscerally, to many, like a privacy violation. Whether any party that felt this way could actually win a lawsuit is questionable, though, as the US doesn't really have significant privacy laws, and this is partially due to extensive collaboration with, and lobbying by, Facebook and other tech companies that do mass surveillance of this kind
I remember a movie called Das Leben der Anderen (2006) (officially translated as "The Lives of Others"), which got accolades for how it made people who hadn't experienced it feel how unsettling the surveillance state of East Germany was. Now your average American is more comprehensively surveilled than the Stasi could have imagined, and this is in large part due to companies like Facebook
Frankly, I'm not an AGI doomer. But if the capabilities of near-future AI systems are even in the vague ballpark of the (fairly unfounded) claims the American tech monopolies make about them, it would be an unprecedented disaster on a global scale if those companies got there first. So inasmuch as we view "AGI research" as something that's inevitably going to hit milestones in corporate labs with secretive datasets, I think we should absolutely kill it to whatever degree is possible, and that's as someone who truly, deeply believes that AI research has been beneficial to humanity and could continue to become more so
> Release the data, and if it ends up causing a privacy scandal...
We can't prove that a model like llama will never produce a segment of its training data set verbatim.
Any potential privacy scandal is already in motion.
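You can even probe for this yourself: feed the model a prefix from a document you suspect was in the training set and see whether it continues it verbatim. A rough sketch, assuming local weights loaded through Hugging Face transformers; the model path and the suspected snippet are placeholders, and a match is evidence of memorization, not proof:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/models/llama-3-8b"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Stand-in for the full text of a document you suspect was trained on
suspected = "full text of a suspected training document goes here ..."
prefix, expected = suspected[:200], suspected[200:400]

inputs = tokenizer(prefix, return_tensors="pt")
# Greedy decoding: deterministic, so repeats are meaningful
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
continuation = tokenizer.decode(new_tokens, skip_special_tokens=True)

# Rough heuristic: a long verbatim match suggests memorization, though
# whitespace and tokenization quirks make this check approximate
print("verbatim-ish match:", continuation.strip().startswith(expected.strip()[:80]))
```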
My cynical assumption is that Meta knows that competitors like OpenAI have PR-bombs in their trained model and therefore would never opensource the weights.
Generally speaking, no. An important part of a lot of benchmarks in ML research is generalization. What this means is that it's often a lot easier to get a machine learning model to memorize the test cases in a benchmark than it is to train it to perform the general capability the benchmark is trying to test for. For that reason, the dataset is important: if it includes the benchmark test cases in some form, it invalidates the test
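For a sense of what checking this looks like, here's a toy sketch of a contamination check: flag benchmark items that share long n-grams with the training corpus. `training_docs` and `benchmark_cases` are hypothetical iterables of strings; the whole point is that with a closed dataset like Llama's, nobody outside Meta can actually run this:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All whitespace-token n-grams in a string, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(benchmark_cases, training_docs, n: int = 8):
    """Return benchmark cases sharing any long n-gram with the training data."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    # A test case overlapping the training corpus on a long n-gram is suspect
    return [case for case in benchmark_cases if ngrams(case, n) & train_grams]
```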
When AI research was still mostly academic, I'm sure a lot of people still cheated, but there was somewhat less incentive to, and norms like publishing datasets made it easier to verify claims made in research papers. In a world where people don't publish their datasets, and there's a significant financial incentive to lie, I just kind of assume they're lying
> If you don't have a way to replicate what they did to create the model, it seems more like freeware
Isn't that a bit like arguing that a linux kernel driver isn't open source if I just give you a bunch of GPL-licensed source code that speaks to my device, but no documentation how my device works? If you take away the source code you have no way to recreate it. But so far that never caused anyone to call the code not open-source. The closest is the whole GPL3 Tivoization debate and that was very divisive.
The heart of the issue is that open source is kind of hard to define for anything that isn't software. As a proxy we could look at Stallman's free software definition. Free software shares a common history with open source, and in most cases open source software is free/libre and vice versa, so this might be a useful proxy.
So checking the four software freedoms:
- The freedom to run the program as you wish, for any purpose: only for most purposes. There's that 700M-user restriction; Meta also forbids breaking the law and requires you to follow their acceptable use policy.
- The freedom to study how the program works, and change it so it does your computing as you wish: yes. You can change it by fine tuning it, and the weights allow you to figure out how it works. At least as well as anyone knows how any large neural network works, but it's not like Meta is keeping something from you here
- The freedom to redistribute copies so you can help your neighbor: Allowed, no real asterisks
- The freedom to distribute copies of your modified versions to others: Yes
So is it Free Software™? Not really, but it is pretty close.
The model is "open-source" for the purpose of software engineering, and it's "closed data" for the purpose of AI research. These are separate issues and it's not necessary to conflate them under one term
They actually did open source the infrastructure library they developed. They don't open source the data but they describe how they gathered/filtered it.
Kinda. Since you can self-host the model on a linux machine, there's no meaningful way for them to prevent you from having the trained weights. You can use this to bootstrap other models, or retrain on your own datasets, or fine-tune from the starting point of the currently-working model. What you can't do is be sure what they trained it on
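And fine-tuning from those weights is genuinely within reach. A hedged sketch using the PEFT library's LoRA adapters on a local checkpoint; the path and hyperparameters are illustrative, not a recipe:

```python
# Sketch of fine-tuning from the released weights with LoRA adapters,
# assuming the PEFT library (pip install peft) and a local Llama checkpoint.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("/models/llama-3-8b")  # hypothetical path

# Train small low-rank adapters on the attention projections instead of
# touching the base weights; these values are common defaults, not tuned.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a tiny fraction of the full model

# From here you'd train on your own dataset with transformers.Trainer;
# none of it requires anything further from Meta.
```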
How open is it really though? If you're starting from their weights, do you actually have legal permission to use derived models for commercial purposes? If it turns out that Meta used datasets they didn't have licenses to use in order to generate the model, then you might be in a big heap of mess.
From a legal perspective, yeah. If we end up having any legal protection against training AI models, liability will be a huge mess for everyone involved. From an engineering perspective, if all you need is the pretrained weights, there's no clear way Facebook could show up and break your product, as compared to something relying on, say, an OpenAI API key rather than a self-hosted Llama instance
Is forking really possible with an LLM, or with one the size of future Llama versions? Have they even released the weights and everything? Maybe I am just negative about it because I feel Meta is the worst company ever invented and that this will hurt society in the long run, just like Facebook.
When you run `ollama pull llama3.1:70b`, which you can literally do right now (assuming you've installed ollama from ollama.com and you're not afraid of the terminal), and it downloads a 40 gigabyte model, that is the weights!
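And once the pull finishes, talking to it stays entirely on your machine. A small sketch, assuming the ollama Python client (`pip install ollama`) and a running local ollama server:

```python
# Query the locally pulled model through the ollama Python client.
# Nothing here leaves your machine; the prompt is just an example.
import ollama

reply = ollama.chat(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Summarize the Llama license in one paragraph."}],
)
print(reply["message"]["content"])
```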
I'd consider the ability to admit when even your most hated adversary is doing something right to be a hallmark of acting smarter.
Now, they haven't released the training data with the model weights. THAT plus the training tooling would be "end to end open source". Apple actually did that very thing recently, and it flew under almost everyone's radar for some reason.
Doing something right vs doing something that seems right but has a hidden self interest that is harmful in the long run can be vastly different things. Often this kind of strategy will allow people to let their guard down, and those same people will get steamrolled down the road, left wondering where it all went wrong. Get smarter.
How in the heck is an open source model that is free and open today going to lock me down, down the line? This is nonsense. You can literally run this model forever if you use NixOS (or never touch your Windows, macOS, or Linux install again). Zuck can't come back and molest it. Ever.
The best I can tell is that their self-interest here is more about gathering mindshare. That's not a terrible motive; in fact, that's a pretty decent one. It's not the bully pressing you into their ecosystem with a tit-for-tat; it's the nerd showing off his latest and going "Here. Try it. Join me. Join us."
Yeah because history isn't absolutely littered with examples of shiny things being dangled in front of people with the intent to entrap them /s.
Can you really say this model will still be useful to you in 2 years, 5 years? And that FB's stance on these models will still be open source at that time, once they've incrementally made improvements? Maybe, maybe not. But FB doesn't give anything away for free, and the fact that you think so is your blindness, not mine. In case you haven't figured it out, this isn't a technology problem, this is a "FB needs marketshare and it needs it fast" problem.
> But FB doesn't give anything away for free, and the fact that you think so is your blindness, not mine
Is it, though? They are literally giving this away "for free". https://dev.to/llm_explorer/llama3-license-explained-2915 Unless you build a service with it that has over 700 million monthly users (read: "problem anyone would love to have"), you do not have to re-negotiate a license agreement with them. Beyond that, it can't "phone home" or do any other sorts of nefarious shite. The other limitations there, which you can plainly read, seem not very restrictive.
Is there a magic secret clause conspiracy buried within the license agreement that you believe will be magically pulled out at the worst possible moment? >..<
Sometimes, good things happen. Sorry you're "too blinded" by past hurt experience to see that, I guess