Open source AI is the path forward (fb.com)
2360 points by atgctg 4 months ago | 888 comments



Related ongoing thread:

Llama 3.1 - https://news.ycombinator.com/item?id=41046540 - July 2024 (114 comments)


“The Heavy Press Program was a Cold War-era program of the United States Air Force to build the largest forging presses and extrusion presses in the world.” This “program began in 1944 and concluded in 1957 after construction of four forging presses and six extruders, at an overall cost of $279 million. Six of them are still in operation today, manufacturing structural parts for military and commercial aircraft” [1].

$279mm in 1957 dollars is about $3.2bn today [2]. A public cluster of GPUs provided for free to American universities, companies and non-profits might not be a bad idea.

[1] https://en.m.wikipedia.org/wiki/Heavy_Press_Program

[2] https://data.bls.gov/cgi-bin/cpicalc.pl?cost1=279&year1=1957...
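
(As a rough sanity check of that conversion, a minimal sketch in Python; the CPI index values are approximate assumptions, not figures taken from the BLS calculator linked above.)

    # Rough CPI-based inflation adjustment; index values are approximate assumptions.
    CPI_1957 = 28.1   # assumed annual-average CPI-U for 1957
    CPI_2024 = 314.0  # assumed CPI-U for mid-2024

    cost_1957 = 279e6  # program cost in 1957 dollars
    cost_today = cost_1957 * (CPI_2024 / CPI_1957)
    print(f"${cost_today / 1e9:.1f}B")  # ~$3.1B, in line with the ~$3.2bn figure above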


The National Science Foundation has been doing this for decades, starting with the supercomputing centers in the 80s. Long before anyone talked about cloud credits, NSF has had a bunch of different programs to allocate time on supercomputers to researchers at no cost, these days mostly run out of the Office of Advanced Cyberinfrastructure. (The office name is from the early 00s.) - https://new.nsf.gov/cise/oac

(To connect universities to the different supercomputing centers, the NSF funded the NSFnet network in the 80s, which was basically the backbone of the Internet in the 80s and early 90s. The supercomputing funding has really, really paid off for the USA)


> NSF has had a bunch of different programs to allocate time on supercomputers to researchers at no cost, these days mostly run out of the Office of Advanced Cyberinfrastructure

This would be the logical place to put such a programme.


The DoE has also been a fairly active purchaser of GPUs for almost two decades now thanks to the Exascale Computing Project [0] and other predecessor projects.

The DoE helped subsidize development of Kepler, Maxwell, Pascal, etc along with the underlying stack like NVLink, NGC, CUDA, etc either via purchases or allowing grants to be commercialized by Nvidia. They also played matchmaker by helping connect private sector research partners with Nvidia.

The DoE also did the same thing for AMD and Intel.

[0] - https://www.exascaleproject.org/


The DoE subsidized the development of GPUs, but so did Bitcoin.

But before that, it was video games, like Quake. Nvidia wouldn't be viable if not for games.

But before that, graphics research was subsidized by the DoD, back when visualizing things in 3D cost serious money.

It's funny how technology advances.


It was really Ethereum and altcoins, not Bitcoin, that caused the GPU demand in 2021.

Bitcoin mining moved to FPGAs/ASICs very quickly because dedicated hardware was vastly more efficient; GPUs were only viable from Oct 2010. By 2013, when ASICs came online, GPUs only made sense if someone else was paying for both the hardware and electricity.


As you've rightly pointed out, we have the mechanism, now let's fund it properly!

I'm in Canada, and our science funding has likewise fallen year after year as a proportion of our GDP. I'm still benefiting from A100 clusters funded by taxpayer dollars, but think of the advantage we'd have over industry if we didn't have to fight over resources.


Where do you get access to those as a member of the general public?


In Australia at least, anyone who is enrolled at or works at a university can use the taxpayer-subsidised "Gadi" HPC which is part of the National Computing Infrastructure (https://nci.org.au/our-systems/hpc-systems). I also do mean anyone, I have an undergraduate student using it right now (for free) to fine-tune several LLMs.

It also says commercial orgs can get access via negotiation; I expect a random member of the public would be able to go that route as well. I expect there would be some hurdles to cross, since it isn't really common for random members of the public to be doing the kinds of research Gadi was created to benefit. I expect it is the same way in Canada. I suppose the argument is that if there weren't any gatekeeping at all, you might end up with all kinds of unsuitable stuff on the cluster, e.g. crypto miners and such.

Possibly another way for a true random person to get access would be to get some kind of 0-hour academic affiliation via someone willing to back you up, or one could enrol in a random AI course or something and then talk to the lecturer in charge.

In reality, the (also taxpayer-subsidised) university pays some fee for access, but it doesn't come from any of our budgets.


Australia's peak HPC has a total of: "2 nodes of the NVIDIA DGX A100 system, with 8 A100 GPUs per node".

It's pretty meagre pickings!


Well, one, it has:

> 160 nodes each containing four Nvidia V100 GPUs

and two, well, it's a CPU-based supercomputer.


I get my resources through a combination of servers my lab bought using a government grant and the Digital Research Alliance of Canada (née Compute Canada) cluster.

These resources aren't available to the public, but if I were king for a day we'd increase science funding such that we'd have compute resources available to high-school students and the general public (possibly following training on how to use it).

Making sure folks didn't use it to mine bitcoin would be important, though ;)


I'm going to guess it's Compute Canada, which I don't think we non-academics have access to.


That's correct (they go by the Digital Research Alliance of Canada now... how boring).

I wish that wasn't the case though!


Yeah, the specific AI/ML-focused program is NAIRR.

https://nairrpilot.org/

Terrible name unless they low-key plan to make AI researchers' hair fall out.


The US already pays for 2+ AWS regions for the CIA/DoD. Why not pay for a region that is only available to researchers?


Doubtful that GPUs purchased today would be in use for a similar time scale. Govt investment would also drive the cost of GPUs up a great deal.

Not sure why a publicly accessible GPU cluster would be a better solution than the current system of research grants.


Of course they won't. The investment in the Heavy Press Program was the initial build, and just citing one example, the Alcoa 50,000 ton forging press was built in 1955, operated until 2008, and needed ~$100M to get it operational again in 2012.

The investment was made to build the press, which created significant jobs and capital investment. The press, and others like it, were subsequently operated by and then sold to a private operator, which in turn enabled the massive expansion of both military manufacturing, and commercial aviation and other manufacturing.

The Heavy Press Program was a strategic investment that paid dividends by both advancing the state of the art in manufacturing at the time it was built, and improving manufacturing capacity.

A GPU cluster might not be the correct investment, but a strategic investment in increasing, for example, the availability of training data, or interoperability of tools, or ease of use for building, training, and distributing models would probably pay big dividends.


I don't think there's a shortage of capital for AI... probably the opposite

Of all the things to expand the scope of government spending why would they choose AI, or more specifically GPUs?


There may however, be a shortage of capital for open source AI, which is the subject under consideration.

As for the why... because there's no shortage of capital for AI. It sounds like the government would like to encourage redirecting that capital to something that's good for the economy at large, rather than good for the investors of a handful of Silicon Valley firms interested only in their own short term gains.


Look at it from the perspective of an elected official:

If it succeeds, you were ahead of the curve. If it fails, you were prudent enough to fund an investigation early. Either way, bleeding edge tech gives you a W.


Or you wasted a bunch of taxpayer money on some overhyped and overfunded nonsense.


Yeah. There is a lot of overhyped and overfunded nonsense that comes out of NASA. Some of it is hype from the marketing and press teams; other hype comes from misinterpretation of releases.

None of that changes the fact that there have been major technical breakthroughs, and entire classes of products and services that didn't exist before those investments in NASA (see https://en.wikipedia.org/wiki/NASA_spin-off_technologies for a short list). There are 15 departments and dozens of agencies that comprise the US federal government, many of which make investments in science and technology as part of their mandates, and most of that is delivered through some structure of public-private partnerships.

What you see as over-hyped and over-funded nonsense could be the next groundbreaking technology, and that is why we need both elected leaders who (at least in theory) represent the will of the people, and appointed, skilled bureaucrats who provide the elected leaders with the skills, domain expertise, and experience that the winners of the popularity contest probably don't have.

Yep, there will be waste, but at least with public funds there is the appearance of accountability that just doesn't exist with private sector funds.


You'll be long gone before they find out.


Which happens every single day in every government in the world.


How would you determine that without investigation?


If it succeeds the idea gets sold to private corporations or the technology is made public and everyone thinks the corporation with the most popular version created it.

If it fails certain groups ensure everyone knows the government "wasted" taxpayer money.


> A GPU cluster might not be the correct investment, but a strategic investment in increasing, for example, the availability of training data, or interoperability of tools, or ease of use for building, training, and distributing models would probably pay big dividends

Would you mind expanding on these options? Universal training data sounds intriguing.


Sure, just on the training front, building and maintaining a broad corpus of properly managed training data with metadata that provides attribution (for example, content that is known to be human generated instead of model generated, what the source of data is for datasets such as weather data, census data, etc), and that also captures any licensing encumbrance so that consumers of the training data can be confident in their ability to use it without risk of legal challenge.

Much of this is already available to private sector entities, but having a publicly funded organization responsible for curating and publishing this would enable new entrants to quickly and easily get a foundation without having to scrape the internet again, especially given how rapidly model generated content is being published.
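
To make that concrete, here is a minimal sketch of what a single record in such a curated corpus might carry. The field names are hypothetical, purely to illustrate attribution and licensing metadata travelling with the content:

    from dataclasses import dataclass

    @dataclass
    class CorpusRecord:
        # Hypothetical schema for one item in a publicly curated training corpus.
        text: str              # the content itself
        source: str            # e.g. "NOAA weather archive", "census tables"
        human_generated: bool  # provenance flag: known human-authored vs. model output
        license: str           # e.g. "CC-BY-4.0", "public-domain"
        attribution: str       # who must be credited, if anyone
        retrieved_at: str      # ISO date the item was collected

    record = CorpusRecord(
        text="...",
        source="example.gov open data portal",  # hypothetical source
        human_generated=True,
        license="public-domain",
        attribution="",
        retrieved_at="2024-07-23",
    )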


I think the EPC (energy performance certificate) dataset in the UK is a nice example of this. Anyone can download a full dataset of EPC data from https://epc.opendatacommunities.org/

Admittedly it hasn't been cleaned all that much - you still need to put a bit of effort into that (newer certificates tend to be better quality), but it's very low friction overall. I'd love to see them do this with more datasets


If the public is going to go to all the trouble of doing something, why would that public not make it clear that there is no legal threat to using any data available?

The public is incredibly lazy, though. Don't expect them to do anything until their hand is forced, which doesn't bode well for the action to meet a desirable outcome.


There are many things I think are more capital constrained, if the government is trying to subsidize things.


> Doubtful that GPUs purchased today would be in use for a similar time scale

Totally agree. That doesn't mean it can't generate massive ROI.

> Govt investment would also drive the cost of GPUs up a great deal

Difficult to say this ex ante. On its own, yes. But it would displace some demand. And it could help boost chip production in the long run.

> Not sure why a publicly accessible GPU cluster would be a better solution than the current system of research grants

Those receiving the grants have to pay a private owner of the GPUs. That gatekeeping might be both problematic, if there is a conflict of interests, and inefficient. (Consider why the government runs its own supercomputers versus contracting everything to Oracle and IBM.)


It would be better for the government to remove IP protections on such technology for public use, the way drugs get generics.

This way the government pays $2,500 per card, not $40,000 or whatever absurd figure.


> It would be better that the government removes IP on such technology for public use, like drugs got generics.

20-25 year old drugs are a lot more useful than 20-25 year old GPUs, and the manufacturing supply chain is not a bottleneck.

There's no generics for the latest and greatest drugs, and a fancy gene therapy might run a lot more than $40k.


> better that the government removes IP on such technology for public use, like drugs got generics

You want to punish NVIDIA for calling its shots correctly? You don't see the many ways that backfires?


No. But I do want to limit the amount we reward NVIDIA for calling the shots correctly to maximize the benefit to society. For instance by reducing the duration of the government granted monopolies on chip technology that is obsolete well before the default duration of 20 years is over.

That said, it strikes me that the actual limiting factor is fab capacity, not Nvidia's designs, and we probably need to lift the monopolies preventing competition there if we want to reduce prices.


> reducing the duration of the government granted monopolies on chip technology that is obsolete well before the default duration of 20 years is over

Why do you think these private entities are willing to invest the massive capital it takes to keep the frontier advancing at that rate?

> I do want to limit the amount we reward NVIDIA for calling the shots correctly to maximize the benefit to society

Why wouldn't NVIDIA be a solid steward of that capital given their track record?


> Why do you think these private entities are willing to invest the massive capital it takes to keep the frontier advancing at that rate?

Because whether they make 100x or 200x they make a shitload of money.

> Why wouldn't NVIDIA be a solid steward of that capital given their track record?

The problem isn't who is the steward of the capital. The problem is that the economically efficient thing for a single company to do (given sufficient fab capacity and a monopoly) is to raise prices to extract a greater share of the pie at the expense of shrinking the size of the pie. I'm not worried about who takes the profit, I'm worried about the size of the pie.


> Because whether they make 100x or 200x they make a shitload of money.

It's not a certainty that they 'make a shitload of money'. Reducing the right tail payoffs absolutely reduces the capital allocated to solve problems - many of which are risky bets.

Your solution absolutely decreases capital investment at the margin; this is indisputable and basic economics. It's even worse when the taking is not due to some pre-existing law, so companies have to deal with the additional uncertainty of whether and when future people will decide in retrospect that they got too large a payoff and arbitrarily decide to take it from them.


You can't just look at the costs to an action, you also have to look at the benefits.

Of course I agree I'm going to stop some marginal investments in research into patentable technologies by reducing the expected profit. But I'm going to do so very slightly, because I'm not shifting the expected value by very much. Meanwhile I'm going to greatly increase the investment into the existing technology we already have, and allow many more people to try to improve upon it, and I'm going to argue the benefits greatly outweigh the costs.

Whether I'm right or wrong about the net benefit, the basic economics here is that there are both costs and benefits to my proposed action.

And yes I'm going to marginally reduce future investments because the same might happen in the future and that reduces expected value. In fact if I was in charge the same would happen in the future. And the trade-off I get for this is that society gets the benefit of the same actually happening in the future and us not being hamstrung by unbreachable monopolies.


> But I'm going to do so very slightly because I'm not shifting the expected value by very much

I think you're shifting it by a lot. If the government can post-hoc decide to invalidate patents because the holder is getting too successful, you are introducing a substantial impact on expectations and uncertainty. Your action is not taken in a vacuum.

> Meanwhile I'm going to greatly increase the investment into the existing technology we already have, and allow many more people to try to improve upon it, and I'm going to argue the benefits greatly outweigh the costs.

I think this is a much more speculative impact. Why will people even fund the improvements if the government might just decide they've gotten too large a slice of the pie later on down the road?

> the trade-off I get for this is that society gets the benefit of the same actually happening in the future and us not being hamstrung by unbreachable monopolies.

No the trade-off is that materially less is produced. These incentive effects are not small. Take for instance, drug price controls - a similar post-facto taking because we feel that the profits from R&D are too high. Introducing proposed price controls leads to hundreds of fewer drugs over the next decade [0] - and likely millions of premature deaths downstream of these incentive effects. And that's with a policy with a clear path towards short-term upside (cheaper drug prices). Discounted GPUs by invalidating nvidia's patents has a much more tenuous upside and clear downside.

[0]: https://bpb-us-w2.wpmucdn.com/voices.uchicago.edu/dist/d/312...


> I'm going to do so very slightly because I'm not shifting the expected value by very much

You're massively increasing uncertainty.

> the same would happen in the future. And the trade-off I get for this is that society gets the benefit

Why would you expect it would ever happen again? What you want is an unrealized capital gains tax. Not to nuke our semiconductor industry.


You have proposed state ownership of all successful IP. That is a massive change and yet you have demonstrated zero understanding of the possible costs.

Your claim that removing a profit motivation will increase investment is flat out wrong. Everything else crumbles from there.


No, I've proposed removing or reducing IP protections, not transferring them to the state. Allowing competitors to enter the market will obviously increase investment in competitors...


This is already happening - it's called China. There's a reason they don't innovate in anything, and they are always playing catch-up, except in the art of copying (stealing) from others.

I do think there are some serious IP issues, as IP rules can be hijacked in the US, but that means you fix those problems, not blow up IP that was rightfully earned.


> they don't innovate in anything

They are leaders in solar and EVs.

Remember how Japan leapfrogged the western car industry, and six sigma became required reading for managers in every industry?


Removing IP restrictions transfers them to the state. Grow up.


>Why wouldn't NVIDIA be a solid steward of that capital given their track record?

Past performance is not indicative of future results.


> That said, it strikes me that the actual limiting factor is fab capacity not nvidia's designs and we probably need to lift the monopolies preventing competition there if we want to reduce prices.

Lol, it's not "monopolies" limiting fab capacity. Existing fab companies can barely manage to stand up a new fab in different cities. Fabs are impossibly complex and beyond risky to fund.

It's the kind of thing you'd put government money toward, but it's so risky that governments really don't want to spend billions and fail, so they give existing companies billions; that way, if those companies fail, it's not the government's fault.


So, if a private company is successful, you will nationalize its IP under some guise of maximizing the benefit to society? That form of government was tried once. It failed miserably.

Under your idea, we’ll try a badly broken economic philosophy again. And while we’re at it, we will completely stifle investment in innovation.


There is no such thing as a lump-sum transfer; this will shift expectations and incentives going forward and make future large capital projects an increasingly uphill battle.


There was a post[0] on here recently about how the US went from producing woefully insufficient numbers of aircraft to producing 300k by the end of World War 2.

One of the things that the post mentioned was the meager profit margin that the companies made during this time.

But the thing is that this set the American auto and aviation industries up to rule the world for decades.

A government going to a company and saying 'we need you to produce this product for us at a lower margin than you'd like' isn't the end of the world.

I don't know if this is one of those scenarios but they exist.

[0] https://www.construction-physics.com/p/how-to-build-300000-a...


In the case of NVIDIA it's even more sneaky.

They are an intellectual property company holding the rights to plans for making graphics cards, not a company actually making graphics cards.

The government could launch an initiative "OpenGPU" or "OpenAI Accelerator", where the government orders GPUs from TSMC directly, without the middleman.

It may require some tweaking of the law to allow exceptions to intellectual property in the "public interest".


y'all really don't understand how these actions would seriously harm capital markets and make it difficult for private capital formation to produce innovations going forward.


> y'all really don't understand how these actions would seriously harm capital markets and make it difficult for private capital

Reflexively, I count that harm as a feature. I don't like private capital markets because I've been screwed by private capital on multiple occasions.

But you are right: I don't understand how these actions would harm. So please do expand your concerns.


If we have public capital formation, we don’t necessarily need private capital. Private innovation in weather modelling isn’t outpacing government work by leaps and bounds, for instance.


because it is extremely challenging to capture the additional value that is being produced by better weather forecasts, and generally the forecasts we have right now are pretty good.

private capital is absolutely the driving force for the vast majority of innovations since the beginning of the 20th century. public capital may be involved, but it is dwarfed by private capital markets.


It's challenging to capture the additional value, and the forecasts are pretty good, because of continual large-scale government investment into weather forecasting. NOAA is launching satellites! It's a big deal!

Private nuclear research is heavily dependent on governmental contracts to function. Solar was subsidized to heck and back for years. Public investment does work, and does make a difference.

I would even say governmental involvement is sometimes the deciding factor in determining whether research is worth pursuing. Some major capital investors have decided AI models cannot possibly earn enough money to pay for their training costs. So what do we do when we believe something is a net good for society, but isn't going to be profitable?


They said remove legally-enforced monopolies on what they produce. Many of these big firms made their tech with millions to billions of taxpayer dollars at various points in time. If we’ve given them millions, shouldn’t we at least get to make independent implementations of the tech we already paid for?


To the extent these are incremental units that wouldn't have been sold absent the government program, it's difficult to see how NVIDIA is "harmed".


> Those receiving the grants have to pay a private owner of the GPUs.

Along similar lines, I'm trying to build a developer credits program where I get whoever (AMD/Dell) to purchase credits on my supercomputers, which we then give away to developers to build solutions, which drives more demand for our hardware, and we commit to re-invest those credits back into more hardware. The idea is to create a win-win-win (us, them, you) developer flywheel ecosystem. It isn't a new idea at all; Nvidia and the hyperscalers have been doing this for ages.


A much better investment would be to (somehow) revolutionize production of chips for AI so that it's all cheaper, more reliable, and faster to stand up new generations of software and hardware codesign. This is probably much closer to the program mentioned in the top level comment: It wasn't to produce one type of thing, but to allow better production of any large thing from lighter alloys.


> Not sure why a publicly accessible GPU cluster would be a better solution than the current system of research grants.

You mean a better solution than different teams paying AWS over and over, potentially spending 10x on rent rather than using all that cash as a down payment on actually owning hardware? I can't really speak for the total costs of depreciation/hardware maintenance but renting forever isn't usually a great alternative to buying.


Do you have some information to share to support your bias against leasing, especially with a depreciating asset?


In Canada, all three major AI research centers use clusters created with public money. These clusters receive regular additional hardware as new generations of GPUs become available. Considering how these institutions work, I'm pretty confident they've considered the alternatives (renting, AWS, etc). So that's one data point.


sure, I’ll hand it over after you spend your own time first to show that everything everywhere that’s owned instead of leased is a poor financial decision.


AWS is not only hardware but also software, documentation, support and more.


In the Netherlands, for instance, there is "the national supercomputer" Snellius: https://www.surf.nl/en/services/snellius-the-national-superc... I am not sure about its budget, but my impression as a user is that its resources are never fully used. At least, none of my jobs ever had to queue. I doubt that it can compete with the scale of resources that FAANG companies have available, but then again, I also doubt how research would benefit.

Sure, academia could build LLMs, and there is at least one large-scale project for that: https://gpt-nl.com/ On the other hand, these kinds of models still need to demonstrate specific scientific value that goes beyond using a chatbot for generating ideas and summarizing documents.

So I fully agree that the research budget cuts in the past decades have been catastrophic, and probably have contributed to all the disasters the world is currently facing. But I think that funding prestigious super-projects is not the best way to spend funds.


Snellius is a nice resource. A powerful Slurm-based HTC cluster with different queues for different workloads (cpu/genomics, gpu/deep learning).

To access the resource I had to go through EuroCC [0], which is a network facilitating access to and exploitation of HPC/HTC infra. It is (or can be) a great competing model to US cloud providers.

As a small business I got 8 hrs of consultancy and 10k compute hours for free. I'm still learning the details, but my understanding is that after that the prices are very competitive.

[0] https://www.eurocc-access.eu/


Italy built the Leonardo HPC cluster; it's one of the largest in the EU and was created by a consortium of universities. After just over a year it's already at full capacity, and expansion plans have been brought forward because of this.


How about using some of that money to develop CUDA alternatives so everyone is not paying the Nvidia tax?


Or just develop the next wave of chips designed specifically for transformer-based architectures (and ternary computing), and bypass the need for GPUs and CUDA altogether.


That would be betting against other architectures like Mamba, which does not seem like an obviously good bet to make yet. Maybe it is though.


You're right, there are a number of avenues that are viable alternatives to the gpu monopoly.

I like the fact that these can be made with just mass-printed multiplication (and, in ternary computing's case, addition) gates, which require little more than 10-year-old tech that is already widely distributed.


It would probably be cheaper to negate some IP. There are quite a few projects and initiatives to make CUDA code run on AMD, for example, but as far as I know, they all stopped at some point, probably out of fear of being sued into oblivion.


It is being done already...

https://docs.scale-lang.com/


It seems like ROCm is already fully ready for transformer inference, so you are just referring to training?


ROCm is buggy and largely undocumented. That’s why we don’t use it.


It is actively improving every day.

https://news.ycombinator.com/item?id=41052750


That's the kind of work that can come out of academia and open source communities when societies provide the resources required.


Please start with the Windows Tax first for Linux users buying hardware...and the Apple Tax for Android users...


Either you port TensorFlow (Apple) [1] or PyTorch to your platform, or you allow CUDA to run on your hardware (AMD) [2]. Companies are incentivized not to let NVIDIA have a monopoly, but the thing is that CUDA is a huge moat due to its compatibility with all the frameworks, and everyone knows it. Also, all of the cloud and on-premises providers use NVIDIA regardless.

[1] https://developer.apple.com/metal/tensorflow-plugin/ [2] https://www.xda-developers.com/nvidia-cuda-amd-zluda/


>> Either you port TensorFlow (Apple) [1] or PyTorch to your platform, or you allow CUDA to run on your hardware (AMD) [2]. Companies are incentivized not to let NVIDIA have a monopoly, but the thing is that CUDA is a huge moat due to its compatibility with all the frameworks, and everyone knows it. Also, all of the cloud and on-premises providers use NVIDIA regardless.

This never made sense to me -- Apple could easily hire top talent to write Apple Silicon bindings for these popular libraries. I work at a creative ad agency; we have tons of high-end Apple devices, yet the neural cores sit unused most of the time.


A lot of libraries seem to be working on Apple Silicon GPUs but not on the ANE. I found this discussion interesting; it seems like the ANE has a lot of limitations, is not well documented, and can only be used indirectly through Core ML. https://github.com/ggerganov/llama.cpp/discussions/336
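
For the GPU side (as opposed to the ANE), bindings do exist; PyTorch's MPS backend is one example. A minimal sketch of how it's typically used, assuming a recent PyTorch build on Apple Silicon:

    import torch

    # Use the Metal Performance Shaders (MPS) backend on Apple Silicon if available.
    device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

    x = torch.randn(1024, 1024, device=device)
    y = x @ x.T  # runs on the Apple GPU; the Neural Engine is still only reachable via Core ML
    print(y.device)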


The problem is that any public cluster would be outdated in 2 years. At the same time, GPUs are massively overpriced. Nvidia's profit margins on the H100 are crazy.

Until we get cheaper cards that stand the test of time, building a public cluster is just a waste of money. There are far better ways to spend $1b in research dollars.


> any public cluster would be outdated in 2 years

The private companies buying hundreds of billions of dollars of GPUs aren't writing them off in 2 years. They won't be cutting edge for long. But that's not the point--they'll still be available.

> Nvidia's profit margins on the H100 are crazy

I don't see how the current practice of giving a researcher a grant so they can rent time on a Google cluster that runs H100s is more efficient. It's just a question of capex or opex. As a state, the U.S. has a structural advantage in the former.

> far better ways to spend $1b in research dollars

One assumes the U.S. government wouldn't be paying list price. In any case, the purpose isn't purely research ROI. Like the heavy presses, it's in making a prohibitively-expensive capital asset generally available.


What about dollar-cost averaging your purchases of GPUs? So that you're always buying a bit of the newest stuff every year rather than making a single fixed investment in hardware that will become outdated? Say $100 million a year every year for 20 years instead of $2 billion in a single year?
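
A back-of-the-envelope comparison, purely for illustration and assuming compute per dollar doubles roughly every two years (an assumption, not a forecast):

    # Toy comparison: one-off $2B buy in year 0 vs. $100M every year for 20 years.
    def compute_per_dollar(year, doubling_period=2.0):
        return 2 ** (year / doubling_period)

    upfront = 2_000 * compute_per_dollar(0)                          # $2,000M spent in year 0
    staggered = sum(100 * compute_per_dollar(y) for y in range(20))  # $100M in each of years 0-19

    print(f"upfront: {upfront:.0f}, staggered: {staggered:.0f} (arbitrary compute units)")

Under that assumption the staggered purchases buy far more raw compute over the period, but most of it arrives in the later years, while the one-off cluster is available immediately - so this captures only one dimension of the trade-off.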


I just watched this 1950s DoD video on the heavy press program and highly recommend it: https://www.youtube.com/watch?v=iZ50nZU3oG8



Don't these public clusters exist today, and have been around for decades at this point, with varying architectures? In the sense that you submit a proposal, it gets approved, and then you get access for your research?


This is the most recent iteration of a national platform. They have tons of GPUs (and CPUs, and flash storage) hooked up as a Kubernetes cluster, available for teaching and research.

https://nationalresearchplatform.org/


Not--to my knowledge--for the GPUs necessary to train cutting-edge LLMs.


All of the major cloud providers offer grants for public research https://www.amazon.science/research-awards https://edu.google.com/intl/ALL_us/programs/credits/research https://www.microsoft.com/en-us/azure-academic-research/

NVIDIA offers discounts https://developer.nvidia.com/education-pricing

e.g. for Australia, the National Computing Infrastructure allows researchers to reserve time on:

- 160 nodes each containing four Nvidia V100 GPUs and two 24-core Intel Xeon Scalable 'Cascade Lake' processors.

- 2 nodes of the NVIDIA DGX A100 system, with 8 A100 GPUs per node.

https://nci.org.au/our-systems/hpc-systems


Great idea, too bad the DOE and NSF were there first.


A better idea would be to make various open source packages public utilities and fund maintainers everywhere as a public good.

AI is a fad; the bricks and mortar of the future are open source tools.


> A public cluster of GPUs provided for free to American universities, companies and non-profits might not be a bad idea.

The USA and Europe are already doing that on a grand scale, in different forms, both at national and international levels.

I work at an HPC center which provides servers nationally and collaborates at an international level.


Eric Schmidt advocated for this exact thing in an Op-ed piece in the latest MIT Technology Review.

[1] https://www.technologyreview.com/2024/05/13/1092322/why-amer...


It makes much more sense to invest in a next generation fab for GPUs than to buy GPUs and more closely matches this kind of project.


Does it? You're looking at a gargantuan investment in terms of money that would also require thousands of staff.

That just doesn't seem a good idea.


> gargantuan investment

it's a bigger investment, but it's an investment which will pay dividends for decades. with a compute cluster, the government is taking on an asset in the form of the cluster but also liabilities in the form of operations and administration.

with a fab, the government takes on either a promise of lower taxes for N years or hands over a bag of cash. after that they're clear of it. the company operating the fab will be responsible for the risks and on-going expenses.

on top of that...

> thousands of staff

the company will employ/attract even more top talent, each of whom will pay taxes and eventually go on to found related companies or teach the next generation or what have you. not to mention the risk reduction that comes with on-shoring something as critical to national security and the economy as a fab.

a public-access compute cluster isn't a bad idea, but it probably makes more sense to fund/operate it in similar PPP model. non-profit consortium of universities and business pool resources to plan, build, and operate it, government recognizes it as a public good and chips in a significant amount of money to help.


Now, I have no idea.

How much capability would $3.2bn in terms of AI computing power provide, including the operational and power costs of the cluster?

Certainly, you could build a "$3.2bn GPU cluster", but it would be dark.

So, how much learning time would $3.2bn provide? 1 year? 10 years?

Just curious about hand-wavy guesses. I have no idea of the scope of these clusters.


The size of the cluster would have to be massive, or else your job will be in the queue for a year. And even then, what are you going to do, downsize the resources requested so you can get in earlier? After a certain point it starts to make more sense to just buy your own Xeons and run your own cluster.


Very much in this spirit is the NSF-funded National Deep Inference Fabric, which lets researchers run remote experiments on foundation models: https://ndif.us. They just announced a pilot program for Llama405b!


I'd like to see big programs to increase the amount of cheap, clean energy we have. AI compute would be one of many beneficiaries of super cheap energy, especially since you wouldn't need to chase newer, more efficient hardware just to keep costs down.


Yeah, this would be the real equivalent of the program people are talking about above. That, and investing in core networking infrastructure (like cables) instead of just giving huge handouts to certain corporations that then pocket the money.


For the DoE, take a look at:

https://doeleadershipcomputing.org/


What about distributed training on volunteer hardware? Is that feasible?


It is an exciting concept; there's a huge wealth of gaming hardware deployed that is inactive for most hours of the day. And I'm sure people are willing to pay well above the electricity cost for it.

Unfortunately, the dominant LLM architecture makes it relatively infeasible right now.

- Gaming hardware has too limited VRAM for training any kind of near-state-of-the-art model. Nvidia is being annoyingly smart about this to sell enterprise GPUs at exorbitant markups.

- Right now communication between machines seems to be the bottleneck, and this is way worse with limited VRAM. Even with data-centre-grade interconnect (mostly Infiniband, which is also Nvidia, smart-asses), any failed links tend to cause big delays in training.

Nevertheless, it is a good direction to push towards, and the government could indeed help, but it will take time. We need both a more healthy competitive landscape in hardware, and research towards model architectures that are easy to train in a distributed manner (this was also the key to the success of Transformers, but we need to go further).
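
On the VRAM point, here is a rough rule-of-thumb estimate (assuming roughly 16 bytes per parameter for weights, gradients, and Adam optimizer state in mixed-precision training, and ignoring activations entirely) of why consumer cards can't hold even mid-sized models during training:

    # Rough training-memory estimate: weights + gradients + Adam optimizer state.
    # ~16 bytes/parameter is a common rule of thumb for mixed-precision training;
    # activations add a lot more on top of this.
    BYTES_PER_PARAM = 16

    for params in (8e9, 70e9, 405e9):
        gib = params * BYTES_PER_PARAM / 2**30
        print(f"{params / 1e9:>5.0f}B params -> ~{gib:,.0f} GiB of model state")

    # ~119 GiB for 8B, ~1,043 GiB for 70B, ~6,035 GiB for 405B:
    # a 24 GB gaming card can't hold even the smallest without heavy sharding.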


Couldn't VRAM be supplemented with SSDs on a lower-end machine? It would make it slower but maybe useful at least.


Perhaps, the landscape has improved a lot in the last couple of years, there are lots of implementation tricks to improve efficiency on consumer hardware, particularly for inference.

Although it is clear that the computing capacity of the GPU would be very underutilized with the SSD as the bottleneck. Even using RAM instead of VRAM is pretty impractical. It might be a bit better for chips like Apple's where the CPU, RAM and GPU are all tightly connected on the same SoC, and the main RAM is used as the VRAM.

Would that performance be still worth more than the electricity cost? Would the earnings be high enough for a wide population to be motivated to go through the hassle of setting up their machine to serve requests?


Ever heard of SETI@home?

https://setiathome.berkeley.edu


Followed the link and learned two things that were new to me: both the project and Drake are dead.

Used to contribute in the early 2000s with my Pentium for a while.

Did it ever get any results?

Also, for training LLMs, I understand there is a huge bandwidth problem with this approach.


Imagine if they made a data center with 1957 electronics that cost $279 million.

They probably wouldn't be using it now because the phone in your pocket is likely more powerful. Moore's law did end, but data center hardware is still evolving orders of magnitude faster than forging presses.


So we'll have the government bypass markets and force the working class to buy toys for the owning class?

If anything, allocate compute to citizens.


> If anything, allocate compute to citizens.

If something like this were to become a reality, I could see something like "CitizenCloud" where once you prove that you are a US Citizen (or green card holder or some other requirement), you can then be allocated a number of credits every month for running workloads on the "CitizenCloud". Everyone would get a baseline amount, from there if you can prove you are a researcher or own a business related to AI then you can get more credits.


Overall, government doing anything is a bad idea. There are cases, however, where government is the only entity that can do certain things. These are things that involve the military, law enforcement, etc. Outside of this we should rely on private industry and for-profit industry as much as possible.


The American healthcare industry demonstrates the tremendous benefits of rigidly applying this mindset.

Why couldn’t law enforcement be private too? You call 911, several private security squads rush to solve your immediate crime issue, and the ones who manage to shoot the suspect send you a $20k bill. Seems efficient. If you don’t like the size of the bill, you can always get private crime insurance.


For a further exploration of this particular utopia, see Snow Crash by Neal Stephenson.


Ugh.

Government distorting undeveloped markets that have a lot of room for competition to increase efficiencies is a bad thing.

Government agencies running programs that should not be profitable, or where the only profit to be left comes at the expense of society as a whole, is a good thing.

Lots of basic medicine is the go-to example here; treating cancer isn't going to be "profitable", and attempting to make it such just leads to dead people.

On the flip side, one can argue that dentistry has seen amazing strides in affordability and technological progress through the free market, from dental X-rays to improvements in dental procedures that make them less painful for patients.

Eye surgery is another area where competition has led to good consumer outcomes.

But life-or-death situations where people can't spend time researching? The only profit there comes through exploiting people.


> Overall government doing anything is a bad idea.

That is so bereft of detail that it's just wrong. There are things that government is good for and things that government is bad for, but "anything" is just too broad, and reveals an anti-government bias which just isn't well thought out.


Why are governments a bad idea? Seems the human race has opted for governments doing things since the dawn of civilization: building roads, providing defense, enforcing rights, providing social safety nets, funding costly scientific endeavors.


To summarise: there are some things where government action is the best solution; however, by default, see if the private sector can sort it out first.


And it has been demonstrated long ago that the private market is not the most efficient solution for the general society for handling healthcare insurance.


That’s not correct. The American health care system is an extreme example of where private organisations fail overall society.


"Eventually though, open source Linux gained popularity – initially because it allowed developers to modify its code however they wanted ..."

I find the language around "open source AI" to be confusing. With "open source" there's usually "source" to open, right? As in, there is human legible code that can be read and modified by the user? If so, then how can current ML models be open source? They're very large matrices that are, for the most part, inscrutable to the user. They seem akin to binaries, which, yes, can be modified by the user, but are extremely obscured to the user, and require enormous effort to understand and effectively modify.

"Open source" code is not just code that isn't executed remotely over an API, and it seems like maybe its being conflated with that here?


"Open weights" is a more appropriate term but I'll point out that these weights are also largely inscrutable to the people with the code that trained it. And for licensing reasons, the datasets may not be possible to share.

There is still a lot of modifying you can do with a set of weights, and they make great foundations for new stuff, but yeah we may never see a competitive model that's 100% buildable at home.

Edit: mkolodny points out that the model code is shared (under llama license at least), which is really all you need to run training https://github.com/meta-llama/llama3/blob/main/llama/model.p...


"Open weights" means you can use the weights for free (as in beer). "Open source" means you get the training dataset and the methodology. ~Nobody does open source LLMs.


Indeed, since when is the deliverable being a jpeg/exe, which is similar to what the model file is, considered the source? It is more like an open result or a freely available VM image, which works, but has its core FS scrambled or encrypted.

Zuck knows this very well, and it does him no honour to speak like this; from his position it amounts to an attempt at changing the present semantics of open source. Of course, others do that too - using the notion of open source to describe something very far from open.

What Meta is doing under his command can better be described as releasing the resulting... build, so that it can be freely poked around and even put to work. But the result cannot be effectively reverse engineered.

What's more ridiculous is that it is precisely because the result is not the source in its whole form that these graphical structures can be made available - only thanks to the fact that it is not traceable to the source. That makes the whole game not just closed, but... sealed forever. An unfair retelling of humanity's knowledge, tossed around in a very obscure container that nobody can reverse engineer.

How's that even remotely similar to open source?


Even if everything were released as you described, what good would that really do for an individual without access to heaps of compute? Functionally there seems to be no difference between open weights and fully open source here, because nobody could train a facsimile model anyway. Furthermore, all frontier models are inscrutable due to their construction. It's wild to me seeing people complain about semantics when Meta dropped their model for cheap. Now I'm not saying we should suck the zuck for this act of charity, but you have to imagine that the other frontier labs are not thrilled that Meta has invalidated their compute moats with the release of Llama. Whether we like it or not, we're on this AI rollercoaster, and I'm glad that it's not just oligopolists dictating the direction forward. I'm happy to see Meta take this direction, knowing that the alternatives are much worse.


That's not the discussion. We're talking about what open source is, and it's having the weights and the method to recreate the model.

If someone gives me an executable that I can run for free, and then says "eh why do you want the source, it would take you a long time to compile", that doesn't make it open source, it just makes it gratis.


Calling weights an executable is disingenuous and not a serious discussion. You can do a lot more with weights than you could with a binary executable.


You can do a lot more with an executable as well than just execute it. So maybe the analogy is apt, even if not exact.

Actually, executables you can reverse engineer into something that could be compiled back into an executable with the exact same functionality, which is AFAIK impossible to do with "open weights". Still, we don't call free executables "open source".


It's not really an analogy. LLMs are quite literally executables in the same way that JPEGs are executables. They both specify machine readable (but not human readable) domain specific instructions executed by the image viewer/inference harness.

And yes, like other executables, they are not literal black boxes. Rather, they provide machine readable specifications which are not human readable without immense effort.

For an LLM to be open source there would need to be source code. Source code, btw, is not just a procedure that can be handed to a machine to produce code that can be executed by the machine. That means the training data and code is not sufficient (or necessary) for an open source model.

What we need for an open source model is a human readable specification of the model's functionality and data structures which allows the user to modify specific arbitrary functionally/structure, and can be used to produce an executable (the model weights).

We simply need much stronger interpretability for that to be possible.


This is debatable; even an executable is a valuable artifact. You can also do a lot with an executable in expert hands.


I'd find knowing what's in the training data hugely valuable - can analyse it to understand and predict capabilities.


Linux is open source and is mostly C code. You cannot run C code directly, you have to compile it and produce binaries. But it's the C code, not binary form, where the collaboration happens.

With LLMs, weights are the binary code: it's how you run the model. But to be able to train the model from scratch, or to collaborate on new approaches, you have to operate at the level of architecture, methods, and training data sets. They are the source code.


Analogies are always going to fall short. With LLM weights, you can modify them (quantization, fine-tuning) to get something different, which is not something you do with compiled binaries. There are ample areas for collaboration even without being able to reproduce from scratch, which takes millions of dollars, also something that a typical binary does not have as a feature.
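
As a toy illustration of the "quantization" part, here is a minimal symmetric int8 quantization of a weight matrix (real schemes such as GPTQ or AWQ are far more involved; this is only a sketch):

    import numpy as np

    def quantize_int8(weights: np.ndarray):
        # Symmetric per-tensor int8 quantization: store int8 values plus one float scale.
        scale = np.abs(weights).max() / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)   # stand-in for a weight matrix
    q, scale = quantize_int8(w)
    print(np.abs(w - dequantize(q, scale)).max())  # small reconstruction error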


You can absolutely modify compiled binaries to get something different. That's how lots of video game modding and ROM hacks work.


And we would absolutely do it more often if compiling would cost as much as training of an LLM costs now.


I considered adding "normally" to the binary modifications, expecting a response like this. The concepts are still worlds apart.

Weights aren't really a binary in the same sense that a compiler produces; they lack instructions and are more just a bunch of floating point values. Nor can you run model weights without separate code to interpret them correctly. In this sense, they are more like a JPEG or 3D model.


JPEGs and 3D models are also executable binaries. They, like model weights, contain domain specific instructions that execute in a domain specific and turing incomplete environment. The model weights are the instructions, and those instructions are interpreted by the inference harness to produce outputs.


>Nobody does open source LLMs.

There are a bunch of independent, fully open source foundation models from companies that share everything (including all data). AMBER and MAP-NEO for example. But we have yet to see one in the 100B+ parameter category.


Sorry, the tilde before "nobody" is my notation for "basically nobody" or "almost nobody". I thought it was more common.


It is more common when it comes to numbers, I guess. There are ~5 ancestors in this comment chain; I would agree that roughly 4-6 is acceptable there.


It's the literal (figurative) nobody rather than the literal (literal) nobody.


There are plenty of open source LLMs, they just aren’t at the top of the leaderboards yet. Here’s a recent example, I think from Apple: https://huggingface.co/apple/DCLM-7B

Using open data and dclm: https://github.com/mlfoundations/dclm


If weights are not the source, then if they gave you the training data and scripts but not the weights, would that be "open source"?


Yes, but they won't do that. Possibly because extensive copyright violation in the training data that they're not legally allowed to share.


If somebody were to leak the training data, they could deny that it's real, thereby avoiding getting sued, and the data would be available.

Edit: typo.


It's not available if you can't use it because you don't have as many lawyers as facebook and can't ignore laws so easily.


This is bending the definition to the other extreme.

Linux doesn't ship you the compiler you need to build the binaries either; that doesn't mean it's closed source.

LLMs are fundamentally different to software and using terms from software just muddies the waters.


And LLMs don't ship with a Python distribution.

Linux sources :: dataset that goes into training

Linux sources' build confs and scripts :: training code + hyperparameters

GCC :: Python + PyTorch or whatever they use in training

Compiled Linux kernel binary :: model weights


Just because you keep saying it doesn't make it true.

LLMs are not software any more than photographs are.


Then what is the "source"? If we are to use the term "source" then what does that mean here, as distinct from it merely being free?


It means nothing because LLMs aren't software.


Do they not run on a computer?


So does a video. Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it? What if the files can only be opened in a proprietary program?

Videos aren't software and neither are llms.


If a video doesn't have source code, then it can't be open source. Likewise, if you feel that an LLM doesn't have source code because of some property of what it is -- as you claim it isn't software and somehow that means that it abstractly removes it from consideration for this concept (an idea I think is ridiculous, FWIW: an LLM is clearly software that runs in a particularly interesting virtual machine defined by the model architecture) -- then; somewhat trivially, it also can't be open source. It is, as the person you are responding to says, at best "open weights".

If a video somehow does have source code which can "generate it", then the question of what it means for the source code to the video to be open even if the only program which can read it and generate the video is closed source is equivalent to asking if a program written in Visual Basic can ever be open source given that the Visual Basic compiler is closed source. Personally, I can see arguments either way on this issue, though most people seem to agree that the program is still open source in such a situation.

However, we need not care too much about the answer to that specific conundrum, as the moral equivalents of both the compiler and the runtime virtual machine are almost always open source. What is then important is much easier: if you don't provide the source code to the project, even if the compiler is open source and even if it runs on an open source machine, clearly the project -- whatever it is that we might try to be discussing, including video files -- cannot be open source. The idea that a video can be open source when what you mean is that the video is unencrypted and redistributable but was merely intended to be played in an open source video player is absurd.


> Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it?

If you're given the source material and project files to continue editing where the original editors finished, and you're granted the rights to re-distribute - Yes, that would be open source[1].

Much like we have "open source hardware" where the "source" consists of original schematics, PCB layouts, BOM, etc. [2]

[1] https://en.wikipedia.org/wiki/Open-source_film

[2] https://en.wikipedia.org/wiki/Open-source_hardware


Videos and images are software. They are compiled binaries with very domain specific instructions executed in a very non-turing complete context. They are generally not released as open source, and in many cases the source code (the file used to edit the video or image) is lost. They are not seen, colloquially, as software, but that does not mean that they are not software.

If a video lacks a specification file (the source code) which can be used by a human reader to modify specific features in the video, then it is software that is simply incapable of being open sourced.


"LLMs are fundamentally different to software and using terms from software just muddies the waters."

They're still software, they just don't have source code (yet).


There is a comment elsewhere claiming there are a few dozen fully open source models: https://news.ycombinator.com/item?id=41048796


Why is the dataset required for it to be open source?

If I self host a project that is open sourced rather than paying for a hosted version, like Sentry.io for example, I don't expect data to come along with the code. Licensing rights are always up for debate in open source, but I wouldn't expect more than the code to be available and reviewable for anything needed to build and run the project.

In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available. I'm not actually sure if Meta does share all that, but training data is separate from open source IMO.


The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run. Provided you have reasonably widespread proficiency in industry standard tools, you can take something that's open source, modify that source, and rebuild/redeploy/reinterpret/re-whatever to make it behave the way that you want or need it to behave.

This is in contrast to a compiled binary or obfuscated source image, where alteration may be possible with extraordinary skill and effort but is not expected and possibly even specifically discouraged.

In this sense, weights are entirely like those compiled binaries or obfuscated sources rather than the source code usually associated with "open source".

To be "open source" we would want LLM's where one might be able to manipulate the original training data or training algorithm to produce a set of weights more suited to one's own desires and needs.

Facebook isn't giving us that yet, and very probably can't. They're just trading on the weird boundary state of the term "open source" -- it still carries prestige and garners good will from its original techno-populist ideals, but is so diluted by twenty years of naive consumers who just take it to mean "I don't have to pay to use this" that the prestige and good will is now misplaced.


>The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run.

The open source movement was a cash grab to make the free software movement more palatable to big corp by moving away from copy left licenses. The MIT license is perfectly open source and means that you can buy software without ever seeing its code.


If you obtain open source licensed software you can pass it on legally (and freely). With some licenses you also have to provide the source code.


The sticking point is you can’t build the model. To be able to build the model from scratch you need methodology and a complete description of the data set.

They only give you a blob of data you can run.


Got it, that makes sense. I still wouldn't expect them to have to publicly share the data itself, but if you can't take the code they share and run it against your own data to build a model that wouldn't be open source in my understanding of it.


Data is the source code here, though. Training code is effectively a build script. Data that goes into training a model does not function like assets in videogames; you can't swap out the training dataset after release and get substantially the same thing. If anything, you can imagine the weights themselves are the asset - and even if the vendor is granting most users a license to copy and modify it (unlike with videogames), the asset itself isn't open source.

So, the only bit that's actually open-sourced in these models is the inference code. But that's a trivial part that people can procure equivalents of elsewhere or reproduce from published papers. In this sense, even if you think calling the models "open source" is correct, it doesn't really mean much, because the only parts that matter are not open sourced.


Compare/contrast:

DOOM-the-engine is open source (https://github.com/id-Software/DOOM), even though DOOM-the-asset-and-scenario-data is not. While you need a copy of DOOM-the-asset-and-scenario-data to "use DOOM to run DOOM", you are free to build other games using DOOM-the-engine.


I think no one would claim that “Doom” is open source though, if that’s the situation.


That's what op is saying, the engine is GPLv2, but the assets are copyrighted. There's Freedoom though and it's pretty good [0].

[0]: https://freedoom.github.io/


The thing they are pointing at and which is the thing people want is the output of the training engine, not the inputs. This is like someone saying they have an open source kernel, but they only release a compiler and a binary... the kernel code is never released, but the kernel is the only reason anyone even wants the compiler. (For avoidance of anyone being somehow confused: the training code is a compiler which takes training data and outputs model weights.)


The output of the training engine, I.E. the model itself, isn't source code at all though. The best approximation would be considering it obfuscated code, and even then it's a stretch since it is more similar to compressed data.

It sounds like Meta doesn't share source for the training logic. That would be necessary for it to really be open source, you need to be able to recreate and modify the codebase but that has nothing to do with the training data or the trained model.


I didn't claim the output is source code, any more than the kernel is. Are you sure you don't simply agree with me?


> not actually sure if Meta does share all that

Meta shares the code for inference but not for training, so even if we say it can be open-source without the training data, Meta's models are not open-source.

I can appreciate Zuck's enthusiasm for open-source but not his willingness to mislead the larger public about how open they actually are.


https://opensource.org/osd

"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."

> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available

The M in LLM is for "Model".

The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify the inputs and then build a modified output LLM (minus standard generally available tools not custom-created for that product).

Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.


I still don't quite follow. If Meta were to provide all code required to train a model (it sounds like they don't), and they provided the code needed to query the model you train to get answers how is that not open source?

> Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

This definition actually makes it impossible for any LLM to be considered open source until the interpretability problem is solved. The trained model is functionally obfuscated code, it can't be read or interpreted by a human.

We may be saying the same thing here, I'm not quite sure if you're saying the model must be available or if what is missing is the code to train your own model.


I'm not the person you replied directly to so I can't speak for them, but I did start this thread, and I just wanted to clarify what I meant in my OP, because I see a lot of people misinterpreting what I meant.

I did not mean that LLM training data needs to be released for the model to be open source. It would be a good thing if creators of models did release their training data, and I wouldn't even be opposed to regulation which encourages or even requires that training data be released when models meet certain specifications. I don't even think the bar needs to be high there- We could require or encourage smaller creators to release their training data too and the result would be a net positive when it comes to public understanding of ML models, control over outputs, safety, and probably even capabilities.

Sure, it's possible that training data is being used illegally, but I don't think the solution to that is to just have everyone hide that and treat it as an open secret. We should either change the law, or apply it equally.

But that being said, I don't think it has anything to do with whether the model is "open source". Training data simply isn't source code.

I also don't mean that the license that these models are released under is too restrictive to be open source. Though that is also true, and if these models had source code, that would also prevent them from being open source. (Rather, they would be "source available" models)

What I mean is "The trained model is functionally obfuscated code, it can't be read or interpreted by a human." As you point out, it is definitionally impossible for any contemporary LLM to be considered open source. (Except for maybe some very, very small research models?) There's no source code (yet) so there is no source to open.

I think it is okay to acknowledge when something is technically infeasible, and then proceed to not claim to have done that technically infeasible thing. I don't think the best response to that situation is to, instead, use that as justification for muddying the language to such a degree that it's no longer useful. And I don't think the distinction is trivial or purely semantic. Using the language of open source in this way is dangerous for two reasons.

The first is that it could conceivably make it more challenging for copyleft licenses such as the GPL to protect the works licensed with them. If the "public" no longer treats software with public binaries and without public source code as closed source, then who's to say you can't fork the linux kernel, release the binary, and keep the code behind closed doors? Wouldn't that also be open source?

The second is that I think convincing a significant portion of the open source community that releasing a model's weights is sufficient to open source a model will cause the community to put more focus on distributing and tuning weights, and less time actually figuring out how to construct source code for these models. I suspect that solving interpretability and generating something resembling source code may be necessary to get these models to actually do what we want them to do. As ML models become increasingly integrated into our lives and production processes, and become increasingly sophisticated, the danger created by having models optimized towards something other than what we would actually like them optimized towards increases.


Data is to models what code is to software.


I don't quite agree there. Based on other comments it sounds like Meta doesn't open source the code used to train the model, that would make it not open source in my book.

The trained model doesn't need to be open source though, and frankly I'm not sure what the value there is specifically with regards to OSS. I'm not aware of a solution to interpretability problem, even if the model is shared we can't understand what's in it.

Microsoft ships obfuscated code with Windows builds, but that doesn't make it open source.


Wouldn't the "source code" of the model be closer to the source code of a compiler or the runtime library?

IMO a pre-trained model given with the source code used to train/run it is analogous to a company shipping a compiler and a compiled binary without any of the source, which is why I don't think it's "open source" without the training data.


You really should be able to train a model on whatever data you choose to use though.

Training data isn't source code at all; it's content fed into the ingestion side to train a model. As long as the source for ingesting and training a model is available, which it sounds like isn't the case for Meta, that would be open source as best I understand it.

Said a little differently, I would need to be able to review all code used to generate a model and all code used to query the model for it to be OSS. I don't need Meta's training data or their actual model at all, I can train my own with code that I can fully audit and modify if I choose to.


But surely you wouldn't call it open source if sentry just gave you a binary - and the source code wasn't available.


I suspect that even if you allowed people to take the data, nobody but a FAANG like organisation could even store it?


My impression is the training data for foundation models isn't that large. It won't fit on your laptop drive, but it will fit comfortably in a few racks of high-density SSDs.


yeah, according to the article [0] about the release of Llama 3.1 405B, it was trained on 15 trillion tokens using 16,000 Nvidia H100s. Even if they did release the training data, I don't think many people would have the number of GPUs required to actually do any real training to create the model....

[0] https://ai.meta.com/blog/meta-llama-3-1/


And a token is just an index into a restricted dictionary (vocabulary). GPT-2 was said to have 50k distinct tokens, so I think it's safe to assume even the latest ones are well below 4M tokens, so at most 4 bytes per token. 15 trillion tokens -> 4 bytes/token * 15 T tokens -> training input <= 60 TB, which doesn't sound that large.

It's the computation that is costly.
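
Back-of-the-envelope, in Python (using the generous 4 bytes/token upper bound above; the real corpus is stored as text, so the exact size depends on tokenizer and encoding):

    tokens = 15e12           # 15 trillion training tokens
    bytes_per_token = 4      # upper bound for a vocabulary well under 4M entries
    print(tokens * bytes_per_token / 1e12, "TB")   # -> 60.0 TB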


LLAMA is an open-weights model. I like this term, let's use that instead of open source.


Can a human programmer edit the weights according to some semantics?


It is possible to merge two fine-tunes of models from the same family by... wait for it... averaging or combining their weights[0].

I am still amazed that we can do that.

[0]: https://arxiv.org/abs/2212.09849
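
A minimal sketch of the naive version in PyTorch, assuming two fine-tunes of the same base architecture saved to hypothetical local checkpoint files (published merge methods like the one in [0] are more sophisticated than a plain average):

    import torch

    a = torch.load("finetune_a.pt", map_location="cpu")   # hypothetical checkpoints,
    b = torch.load("finetune_b.pt", map_location="cpu")   # same architecture and shapes

    # element-wise average of every parameter tensor
    merged = {name: (a[name] + b[name]) / 2 for name in a}
    torch.save(merged, "merged.pt")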


This is absolutely wild.


Yes. Using fine tuning.


Yes, there is the concept of a "frankenmerge", and folks have also bolted on vision and audio models to LLMs.


If you can’t share the dataset, under what twisted reality are you fine to share the derivative models based on those unsharable datasets?

In a better world, there would be no “I ran some algos on it and now it’s mine” defense.


Yeah was gonna say exactly the same thing. Weird how the legislation allows releasing LLMs trained on data that is not allowed to be shared otherwise.


Meta might possibly have a license to use (some of) that data, but not a license to distribute it. Legislation has little to do with it, I imagine.


latest llama 3.1 is in a different repo, https://github.com/meta-llama/llama-models/blob/main/models/... , but yes, the code is shared. It's astonishing that in the software 2.0 era, powerful applications like Llama have only hundreds of lines of code, with most of the work hidden in the training data. Source code alone is no longer as informative as it was in Software 1.0.
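
To illustrate how little code there is, here is a toy pre-norm decoder block in PyTorch -- not Meta's actual code (Llama uses RMSNorm, rotary embeddings, SwiGLU, grouped-query attention, etc.), just a sketch of how compact an architecture definition can be:

    import torch.nn as nn

    class ToyDecoderBlock(nn.Module):
        def __init__(self, dim=4096, n_heads=32):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
            return x + self.ff(self.norm2(x))                   # feed-forward + residual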


For models of this size, the code used to train them is going to be very custom to the architecture/cluster they are built on. It would be almost useless to anybody outside of Meta. The dataset would be more a lot more interesting, as it would at the very least show everybody how they got it to behave in certain ways.


Open training data would be great too.

If you have open data and open source code you can reproduce the weights


Not easily for these large scale models, but theoretically maybe


Really? I have to check out the training code again. Last time I looked the training and inference code were just example toys that were barely usable.

Has that changed?


Open Source Initiative (kind of a de-facto authority on what's open source and what not) is spending a whole lot of time figuring out what it means for an AI system to be open source. In other words, they're basically trying to come up with a new license because the existing ones can't easily apply.

I believe this is the current draft: https://opensource.org/deepdive/drafts/the-open-source-ai-de...


OSI made themselves the authority because they hated Richard Stallman and his Free Software movement. It's just marketing.


RMS has no interest in governing Open Source, so your comment bears no particular relevance.

RMS is an advocate for Free Software. Free Software generally implies Open Source, but not the converse.

RMS considers openness of source to be a separate category from the freeness of software. "Free software is a political movement; open source is a development model."

https://www.gnu.org/licenses/license-list.en.html


Are you really pretending that OSI and the open source label itself wasn’t a reactionary movement that vilified free software principles in hopes of gaining corporate traction?

Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.


> True open source advocates will find little to refute in what I’ve said.

No true Scotsman https://en.wikipedia.org/wiki/No_true_Scotsman

OSI helped popularize the open source movement. They not only made it palatable to businesses but got them excited about it. I think that FSF/Stallman alone would not have been very successful on this front with GPL/AGPL.


Like I said, honest open source advocates won’t take issue to how I framed their position.

Here’s a more important point: how far would the open source people have gotten without GCC and glibc?

Much less far than they will ever admit, in my experience.


> Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.

> Like I said, honest open source advocates won’t take issue to how I framed their position.

Yet you've failed to provide even a single point of evidence to back up your claim.

> "honest open source advocates"

You've literally just made this term up. It's meaningless.


It’s not a term, it’s a phrase. It means “open source advocates who are being honest about their advocacy”, in case you really need such a degree of clarification.

I’ve met honest open source advocates before and, once again, they would be unlikely to refute the fact that “open source” was invented in explicit contrast to “free software” to achieve corporate palatability.

The comment you are responding to was literally responding to a comment which validated this exact sentiment.

As to providing evidence, those of us who were there at the time don’t need any and those of you who weren’t ought to seek some. It’s not my job to link to the nearly infinite number of conversations where this obvious dynamic played out.


For some advocates, sure. I was there, too — although at the beginning of my career and not deeply involved in most licensing discussions until the founding of Mozilla (where I argued against the GNU GPL and was generally pleased with the result of the MPL). However, from ~1990, I remember sharing some code where I "more or less" made my code public domain but recommended people consider the GNU GPL as part of the README (I don't have the source code available, so I don't recall).

Your characterization is quite easily refutable, because at the time that OSI was founded, there was already an explosion of possible licenses and RMS and other GNUnatics were making lots of noise about GNU/Linux and trying to be as maximalist as possible while presenting any choice other than the GNU GPL as "against freedom".

This certainly would not have sat well with people who were using the MIT Licence or BSD licences (created around the same time as the GNU GPL v1), who believed (and continue to believe) that there were options other than a restrictive viral licence‡. Yes, some of the people involved vilified the "free software principles", but there were also GNU "advocates" who were making RMS look tame with their wording (I recall someone telling me to enjoy "software slavery" because I preferred licences other than the GNU GPL).

The "Free Software" advocates were pretending that the goals of their licence were the only goals that should matter for all authors and consumers of software. That is not and never has been the case, so it is unsurprising that there was a bit of reaction to such extremism.

OSI and the open source label were a move to make things easier for corporations to accept and understand by providing (a) a clear unifying definition, and (b) a set of licences and guidelines for knowing what licenses did what and the risks and obligations they presented to people who used software under those licences.

‡ Don't @ me on this, because both the virality and restrictiveness are features of the GNU GPL. If it weren't for the nonsense in the preamble, it would be a good licence. As it is, it is an effective if rampantly misrepresented licence.


Didn't the Open Source Definition start as the DFSG? You telling me Debian hates the Free Software movement? Unless you define "hating Free Software" as "not banning the BSD license", then I'll have to disagree.


Training code is only useful to people in academia, and the closest thing to "code you can modify" is open weights.

People are framing this as if it was an open-source hierarchy, with "actual" open-source requiring all training code to be shared. This is not obvious to me, as I'm not asking people that share open-source libraries to also share the tools they used to develop them. I'm also not asking them to share all the design documents/architecture discussion behind this software. It's sufficient that I can take the end result and reshape it in any way I desire.

This is coming from an LLM practitioner that finetunes models for a living; and this constant debate about open-source vs open-weights seems like a huge distraction vs the impact open-sourcing something like Llama has... this is truly a Linux-like moment. (at a much smaller scale of course, for now at least)


I dunno — if an open source project required, say, a proprietary compiler, that would diminish its open source-ness. But I agree it's not totally comparable, since the weights are not particularly analogous to machine code. We probably need a new term. Open Weights.


There are many "compilers", you can download The Pile yourself.


> If so, then how can current ML models be open source?

The source of a language model is the text it was trained on. Llama models are not open source (contrary to their claims), they are open weight.


You can find the entire Llama 3.0 pretraining set here: https://huggingface.co/datasets/HuggingFaceFW/fineweb

15T tokens, 45 terabytes. Seems fairly open source to me.
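
For what it's worth, you don't have to download all 45 TB to poke at it; the Hugging Face datasets library can stream it (a sketch; exact config/sample names on the hub may differ):

    from datasets import load_dataset

    fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
    for i, doc in enumerate(fw):
        print(doc["text"][:200])   # raw web text plus metadata per record
        if i == 2:
            break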


Where has Facebook linked that? I can't find anywhere that they actually published that.


Many companies stopped publishing their data sets after people published evidence of mass copyright infringement. They dropped the specifics of pretraining data from the model cards.

Aside from licensing content, the fact that content creators don't like redistribution means a lawful model would probably only use Gutenberg's collection and permissively licensed code. Anything else, including Wikipedia, usually has licensing requirements they might violate.


Yeah I don't think I've seen it linked officially, but Meta does this sort of semi-official stuff all the time, leaking models ahead of time for PR, they even have a dedicated Reddit account for releasing unofficial info.

Regardless, it fits the compute used and the claim that they trained from public web data, and was suspiciously published by HF staff shortly after L3 released. It's about as official as the Mistral 7B v0.2 base model. I.e. mostly, but not entirely, probably for some weird legal reasons.


Says it is ~94TB, with >130k downloads, implying more than 12 exabytes of copying. Seems a bit off; I wonder how they are calculating downloads.


No. The text is an asset used by the source to train the model. The source can process arbitrary text. Text is just text, it was written for communication purposes, software (defined by source code) processes that text in a particular way to train a model.


In programming, "source" and "asset" have specific meanings that conflict with how you used them.

Source is the input to some built artifact. It is the source of that artifact. As in: where the artifact comes from. Textual input is absolutely the source of the ML model. What you are using "source" as is analogous to the source of the compiler in traditional programming.

Asset is an artifact used as input, that is revered verbatim by the output. For example, a logo baked into an application to be rendered in the UI. The compilation of the program doesn't make a new logo, it just moves the asset into the built artifact.


I hadn't had my morning coffee yet when I wrote this and I have no idea what I meant instead of "revered", but you get the idea :D


I think it would also include the code used to train it


That would be more analogous to the build toolchain than the source code, but yes


Surely traditional “open source” also needs some notion of a reproducible build toolchain, otherwise the source code itself is approximately useless.

Imagine if the source code was in a programming language of which the basic syntax and semantics were known to no one but the original developers.

Or more realistically, I think it’s a major problem if an open source project can only be built by an esoteric process that only the original developers have access to.


Source code in a vacuum is still valuable as a way to deal with missing/inaccurate documentation and diagnose faults and their causes.

Raw training datasets similarly have some value, as you can analyze them for different characteristics to understand why the trained model is under/over-representing different concepts.

But yes real FOSS should be "open-build" and allow anyone to build a test-passing artifact from raw source material.


I like the term "open weights". Open source would be the dataset and code that generates these weights.

There is still a lot you can do with weights, like fine tuning, and it is arguably more useful as retraining the entire model would cost millions in compute.


Of course you are right, I'd put it less carefully: The quoted Linux line is deceptive marketing.

- If we start with the closed training set, that is closed and stolen, so call it Stolen Source.

- What is distributed is a bunch of float arrays. The Llama architecture is published, but not the training or inference code. Without code there is no open source. You can as well call a compiler book open source, because it tells you how to build a compiler.

Pure marketing, but predictably many people follow their corporate overlords and eagerly adopt the co-opted terms.

Reminder again that FB is not releasing this out of altruism, but because they have an existing profitable business model that does not depend on generated chats. They probably do use it internally for tracking and building profiles, but that is the same as using Linux internally, so they release the weights to destroy the competition.

Isn't price dumping an antitrust issue?



No, it's not. The Llama 3 Community License Agreement is not an open source license. Open source licenses need to meet the criteria of the only widely accepted definition of "open source", and that's the one formulated by the OSI [0]. This license has multiple restrictions on use and distribution which make it not open source. I know Facebook keeps calling this stuff open source, maybe in order to get all the good will that open source branding gets you, but that doesn't make it true. It's like a company calling their candy vegan while listing one of its ingredients as pork-based gelatin. No matter how many times the company advertises that their product is vegan, it's not, because it doesn't meet the definition of vegan.

[0] - https://opensource.org/osd


Isn't the MIT license the generally accepted "open source" license? It's a community owned term, not OSI owned


MIT is a permissive open source license, not the open source license.


There are more licenses than just MIT that are "open source". GPL, BSD, MIT, Apache, some of the Creative Commons licenses, etc. MIT has become the defacto default though

https://opensource.org/license (linking to OSI for the list because it's convenient, not because they get to decide)


These discussions (ie, everything that follows here) would be much easier if the crowd insisting on the OSI definition of open source would capitalize Open Source.

In English, proper nouns are capitalized.

"Open" and "source" are both very normal English words. English speakers have "the right" to use them according to their own perspective and with personal context. It's the difference between referring to a blue tooth, and Bluetooth, or to an apple store or an Apple store.


Open source licenses need to meet the criteria of the only widely accepted definition of "open source", and that's the one formulated by the OSI [0]

Who died and made OSI God?


This isn't helpful. The community defers to the OSI's definition because it captures what they care about.

We've seen people try to deceptively describe non-OSS projects as open source, and no doubt we will continue to see it. Thankfully the community (including Hacker News) is quick to call it out, and to insist on not cheapening the term.

This is one the topics that just keeps turning up:

* https://news.ycombinator.com/item?id=24483168

* https://news.ycombinator.com/item?id=31203209

* https://news.ycombinator.com/item?id=36591820


This isn't helpful. The community...

Speak for yourself, please. The term is much older than 1998, with one easily-Googled example being https://www.cia.gov/readingroom/docs/DOC_0000639879.pdf , and an explicit case of IT-related usage being https://i.imgur.com/Nw4is6s.png from https://www.google.com/books/edition/InfoWarCon/09X3Ove9uKgC... .

Unless a registered trademark is involved (spoiler: it's not) no one, whether part of a so-called "community" or not, has any authority to gatekeep or dictate the terms under which a generic phrase like "open source" can be used.


Neither of those usages relates to IT; they both are about sources of intelligence (espionage). Even if they were, the OSI definition won: nobody is using the definitions from the 1995 CIA or the 1996 InfoWarCon book in the realm of IT, not even Facebook.

The community has the authority to complain about companies mis-labelling their pork products as vegan, even if nobody has a registered trademark on the term vegan. Would you tell people to shut up about that case because they don't have a registered trademark? Likewise, the community has authority to complain about Meta/Facebook mis-labelling code as open source even when they put restrictions on usage. It's not gate-keeping or dictatorship to complain about being misled or being lied to.


Would you tell people to shut up about that case because they don't have a registered trademark?

I especially like how I'm the one telling people to "shut up" all of a sudden.

As for the rest, see my other reply.


You're right, I and those who agree with me were the first to ask people to "shut up", in this case, to ask Meta to stop misusing the term open source. And I was the first to say "shut up", and I know that can be inflammatory and disrespectful, so I shouldn't have used it. I'm sorry. We're here in a discussion forum, I want you to express your opinion even it is to complain about my complaints. For what it's worth, your counter-arguments have been stronger and better referenced than any other I have read (for the case of accepting a looser definition of the term open source in the realm of IT).


All good, and I also apologize if my objection came across as disrespectful.

This whole 'Open Source' thing is a bigger pet peeve than it should be, because I've received criticism for using the term on a page where I literally just posted a .zip file full of source code. The smart thing to do would have been to ignore and forget the criticism, which I will now work harder at doing.

In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'. It's a standard English-language word that according to Merriam-Webster goes back to 1944. So that would amount to an open-and-shut case of false advertising, which I don't think applies here at all.


> In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'.

I don't see the difference. Open source software is a term of art with a specific meaning accepted by its community. When people misuse the term, invariably in such a way as to broaden it to include whatever it is they're pushing, it's right that the community responds harshly.


Terms of art do not require licenses. A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.

This kind of argument is literally why trademark law exists. OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.


> Terms of art do not require licenses.

Agreed. There is no trademark on aileron or carburetor or context-free grammar. A couple of years ago I made this same point myself. [0]

> A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.

This taxonomy doesn't hold up.

Again, it's a term of art with a clear meaning accepted by its community. We've seen numerous instances of cynical and deceptive misuse of the term, which the community rightly calls out because it's not fair play, it's deliberate deception.

> This kind of argument is literally why trademark law exists

It is not. Trademark law exists to protect brands, not to clarify terminology.

You seem to be contradicting your earlier point that terms of art do not require licenses.

> OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.

I haven't expressed any opinion on that topic, and I don't see a need to.

[0] https://news.ycombinator.com/item?id=31203209


If the OSI members wanted to "clarify the terminology" in a way that permitted them (and you) to exclude others, trademark law would have absolutely been the correct way to do that. It's too late, however. The ship has sailed.

Come up with a new term and trademark that, and heck, I'll help you out with a legal fund donation when Facebook and friends inevitably try to appropriate it. Apart from that, you've fought the good fight and done what you could. Let it go.


The OSI was created about 20 years ago and defined and popularized the term open source. Their definition has been widely accepted over that period.

Recently, companies are trying to market things as open source when in reality, they fail to adhere to the definition.

I think we should not let these companies change the meaning of the term, which means it's important to explain every time they try to seem more open than they are.

I'm afraid the battle is being lost though.


>The OSI was created about 20 years ago and defined and popularized the term open source. Their definition has been widely accepted over that period.

It was defined and accepted by the community well before OSI came around though.


Citation? Wikipedia would appreciate your contribution.

https://en.wikipedia.org/wiki/Open_source

> Linus Torvalds, Larry Wall, Brian Behlendorf, Eric Allman, Guido van Rossum, Michael Tiemann, Paul Vixie, Jamie Zawinski, and Eric Raymond [...] At that meeting, alternatives to the term "free software" were discussed. [...] Raymond argued for "open source". The assembled developers took a vote, and the winner was announced at a press conference the same evening

The original "Open source Definition" was derived from Debian's Social Contract, which did not use the term "open source"

https://web.archive.org/web/20140328095107/http://www.debian...


Citation? Wikipedia would appreciate your contribution.

It's not hard to find earlier examples where the phrase is used to describe enabling and (yes) leveraging community contributions to accomplish things that otherwise wouldn't be practical; see my other post for a couple of those.

But then people will rightfully object that the term "Open Source", when used in a capacity related to journalistic or intelligence-gathering activities, doesn't have anything to do with software licensing. Even if OSI had trademarked the phrase, which they didn't, that shouldn't constrain its use in another context.

To which I'd counter that this statement is equally true when discussing AI models. We are going to have to completely rewire copyright law from the ground up to deal with this. Flame wars over what "Open Source" means or who has the right to use the phrase are going to look completely inconsequential by the time the dust settles.


I'll concede that "open source" may mean other things in other contexts. For example, an open source river may mean something in particular to those who study rivers. This thread was not talking about a new context, it was not even talking about the weights of a machine learning model or the licensing of training data, it was talking about the licensing of the code in a particular GitHub repository, llama3.

AI may make copyright obsolete, or it may make copyright more important than ever, but my prediction is that the IT community will lose something of great value if the term "open source" is diluted to include licenses that restrict usage, restrict distribution, and restrict modification. I can understand why people may want to choose somewhat restrictive licenses, just like I can understand why a product may contain gelatin, but I don't like it when the product is mis-labelled as vegan. There are plenty of other terms that could be used, for example, "open" by itself. I'm honestly curious if you would defend a pork product labelled as vegan, or do you just feel that the analogy doesn't apply?


This is like saying any python program is open source because the python runtime is open source.

Inference code is the runtime; the code that runs the model. Not the model itself.


I disagree. The file I linked to, model.py, contains the Llama 3 model itself.

You can use that model with open data to train it from scratch yourself. Or you can load Meta’s open weights and have a working LLM.


Yeah a lot of people here seem to not understand that PyTorch really does make model definitions that simple, and that has everything you need to resume back-propagation. Not to mention PyTorch itself being open-sourced by Meta.

That said, the Llama license doesn't meet strict definitions of OS, and I bet they have internal tooling for datacenter-scale training that's not represented here.
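
To make the "resume back-propagation" point concrete, a rough sketch (hypothetical file names; assumes a Transformer/ModelArgs pair like the ones in the released model.py, and glosses over the tokenizer, sharded checkpoints, and all the distributed-training plumbing):

    import torch
    import torch.nn.functional as F
    from model import Transformer, ModelArgs         # the released architecture definition

    model = Transformer(ModelArgs())
    model.load_state_dict(torch.load("weights.pt"))  # hypothetical path to the open weights

    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    tokens = torch.randint(0, 128_000, (1, 64))      # dummy batch of token ids
    logits = model(tokens)                           # assumes forward(tokens) -> logits
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    loss.backward()                                  # gradients flow; training continues from here
    opt.step()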


> The file I linked to, model.py, contains the Llama 3 model itself.

That makes it source available ( https://en.wikipedia.org/wiki/Source-available_software ), not open source


Source available means you can see the source, but not modify it. This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.


> Source available means you can see the source, but not modify it.

No, it doesn't mean that. To quote the page I linked, emphasis mine,

> Source-available software is software released through a source code distribution model that includes arrangements where the source can be viewed, and in some cases modified, but without necessarily meeting the criteria to be called open-source. The licenses associated with the offerings range from allowing code to be viewed for reference to allowing code to be modified and redistributed for both commercial and non-commercial purposes.

> This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.

Per https://github.com/meta-llama/llama3/blob/main/LICENSE there's also a laundry list of ways you're not allowed to use it, including restrictions on commercial use. So not Open Source.


That's not the training code, just the inference code. The training code, running on thousands of high-end H100 servers, is surely much more complex. They also don't open-source the dataset, or the code they used for data scraping/filtering/etc.


"just the inference code"

It's not the "inference code", its the code that specifies the architecture of the model and loads the model. The "inference code" is mostly the model, and the model is not legible to a human reader.

Maybe someday open source models will be possible, but we will need much better interpretability tools so we can generate the source code from the model. In most software projects you write the source as a specification that is then used by the computer to implement the software, but in this case the process is reversed.


That is just the inference code. Not training code or evaluation code or whatever pre/post processing they do.


Is there an LLM with actual open source training code and dataset? Besides BLOOM https://huggingface.co/bigscience/bloom



Yes, there are a few dozen full open source models (license, code, data, models)


What are some of the other ones? I am aware mainly of OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...)


Can’t you do fine tuning on those binaries? That’s a modification.


You can fine tune the models, and you can modify binaries. However, there is no human readable "source" to open in either case. The act of "fine tuning" is essentially brute forcing the system to gradually alter the weights such that loss is reduced against a new training set. This limits what you can actually do with the model vs an actual open source system where you can understand how the system is working and modify specific functionality.

Additionally, models can be (and are) fine tuned via APIs, so if that is the threshold required for a system to be "open source", then that would also make the GPT4 family and other such API only models which allow finetuning open source.


I don't find this argument super convincing.

There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models.

"Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient.


"There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models."

Yes, the difference is that one is provided over a remote API, and the provider of the API can restrict how you interact with it, while the other is performed directly by the user. One is a SaaS solution, the other is a compiled solution, and neither are open source.

""Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient."

Whatever you want to call it, this doesn't sound like modifying functionality in source code. When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times. What I don't do is have a very simple routine make very small modifications to all of the system's functionality, then check the result of that small change across the broad spectrum of functionality, and repeat millions of times.


The gap between fine-tuning API and weights-available is much more significant than you give it credit for.

You can take the weights and train LoRAs (which is close to fine-tuning), but you can also build custom adapters on top (classification heads). You can mix models from different fine-tunes or perform model surgery (adding additional layers, attention heads, MoE).

You can perform model decomposition and amplify some of its characteristics. You can also train multi-modal adapters for the model. Prompt tuning requires weights as well.

I would even say that having the model is more potent in the hands of individual users than having the dataset.
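
For example, a classification head on top of frozen open weights is only a few lines (hypothetical names; assumes the base model returns per-token hidden states):

    import torch.nn as nn

    class Classifier(nn.Module):
        def __init__(self, base_model, hidden_dim, n_classes):
            super().__init__()
            self.base = base_model
            for p in self.base.parameters():
                p.requires_grad = False                    # keep the released weights frozen
            self.head = nn.Linear(hidden_dim, n_classes)   # only this small head gets trained

        def forward(self, tokens):
            hidden = self.base(tokens)                     # (batch, seq, hidden_dim) hidden states
            return self.head(hidden[:, -1, :])             # classify from the last token's state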


That still doesn't make it open source.

There is a massive difference between a compiled binary that you are allowed to do anything you want with, including modifying it, building something else on top or even pulling parts of it out and using in something else, and a SaaS offering where you can't modify the software at all. But that doesn't make the compiled binary open source.


> When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times.

You can modify individual neurons if you are so inclined. That's what Anthropic have done with the Claude family of models [1]. You cannot do that using any closed model. So "Open Weights" looks very much like "Open Source".

Techniques for introspection of weights are very primitive, but I do think new techniques will be developed, or even new architectures which will make it much easier.

[1] https://www.anthropic.com/news/mapping-mind-language-model
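
Concretely, nothing stops you from opening a checkpoint and hand-editing a single weight (hypothetical file and parameter names; knowing *which* edit produces *which* behaviour is the hard, unsolved interpretability part):

    import torch

    sd = torch.load("consolidated.00.pth", map_location="cpu")  # hypothetical checkpoint file
    w = sd["layers.0.feed_forward.w1.weight"]                    # hypothetical parameter name
    w[42, :] = 0.0                                               # silence one row ("neuron") by hand
    torch.save(sd, "consolidated.00.edited.pth")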


"You can modify individual neurons if you are so inclined."

You can also modify a binary, but that doesn't mean that binaries are open source.

"That's what Anthropic have done with the Claude family of models [1]. ... Techniques for introspection of weights are very primitive, but i do think new techniques will be developed"

Yeah, I don't think what we have now is robust enough interpretability to be capable of generating something comparable to "source code", but I would like to see us get there at some point. It might sound crazy, but a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy.

I think getting to open sourcable models is probably pretty important for producing models that actually do what we want them to do, and as these models become more powerful and integrated into our lives and production processes the inability to make them do what we actually want them to do may become increasingly dangerous. Muddling the meaning of open source today to market your product, then, can have troubling downstream effects, as focus in the open source community may be taken away from interpretability and put toward distributing and tuning public weights.


> a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy

My understanding is that a few years ago, if we knew the degree of interpretability we have today (compared to capability) it would have been devastatingly disappointing.

We are climbing out of the trough of disillusionment maybe, but to say that we have reached mind-blowing heights with interpretability seems a bit hyperbolic, unless I've missed some enormous breakthrough.


"My understanding is that a few years ago, if we knew the degree of interpretability we have today (compared to capability) it would have been devastatingly disappointing."

I think this is a situation where both things are true. Much more progress has been made in capabilities research than interpretability and the interpretability tools we have now (at least, in regards to specific models) would have been seen as impossible or at least infeasible a few years back.


You make a good point but those are also just limitations of the technology (or at least our current understanding of it)

Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?


Your hypothetical apple-grower family would simply share a handbook which meticulously shared the initial species of apple used, the breeding protocol, the hybridization method, and any other factors used to breed this perfect apple.

Having the handbook and materials available would make it possible for others to reproduce the resulting apple, or to obtain similar apples with different properties by modifying the protocols.

The handbook is the source code.

On the other hand, what we have here is Monsanto saying: "we've got those Terminator-lineage apples, and we're open-sourcing them by giving you the actual apples as an end product for free. Feel free to breed them into new varieties at will as long as you're not a Big Farm company."

Not open source.


What would enable someone to reproduce the tree from scratch, and continue developing that line of trees, using tools common to apple tree breeders? I’m not an apple tree breeder, but I suspect that’s the seeds. Maybe the genetic sequence is like source code in some analogical sense, but unless you can use that information to produce an actual seed, it doesn’t qualify in a practical sense. Trees don’t have a “compilation phase” to my knowledge, so any use of “open source” would be a stretch.


"You make a good point but those are also just limitations of the technology (or at least our current understanding of it)"

Yeah, that is my point. Things that don't have source code can't be open source.

"Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?"

I think we need to be wary of dilemmas without solutions here. For example, let's think about another analogy: I was in a car accident last week. How can I open source my car accident?

I don't think all, or even most things, are actually "open sourcable". ML models could be open sourced, but it would require a lot of work to interpret the models and generate the source code from them.


Be charitable and intellectually curious. What would "open" look like?

GNU says "The GNU GPL can be used for general data which is not software, as long as one can determine what the definition of “source code” refers to in the particular case. As it turns out, the DSL (see below) also requires that you determine what the “source code” is, using approximately the same definition that the GPL uses."

and offers these categories, for example:

https://www.gnu.org/licenses/license-list.en.html#NonFreeSof...

* Software Licenses

* * GPL-Compatible Free Software Licenses

* * GPL-Incompatible Free Software Licenses

* Licenses For Documentation

* * Free Documentation Licenses

* Licenses for Other Works

* * Licenses for Works of Practical Use besides Software and Documentation

* * Licenses for Fonts

* * Licenses for Works stating a Viewpoint (e.g., Opinion or Testimony)

* * Licenses for Designs for Physical Objects


"Be charitable and intellectually curious. What would "open" look like?"

To really be intellectually curious we need to be open to the idea that there is not (yet) a solution to this problem. Or in the analogy you laid out, that it is simply not possible for the system to be "open source".

Note that most of the licenses listed under the "Licenses for Other Works" section say "It is incompatible with the GNU GPL. Please don't use it for software or documentation, since it is incompatible with the GNU GPL and with the GNU FDL." This is because these are not free software/open source licenses. They are licenses that the FSF endorses because they encourage openness and copyleft in non-software mediums, and play nicely with the GPL when used appropriately (i.e. not for software).

The GPL is appropriate for many works that we wouldn't conventionally view as software, but in those contexts the analogy is usually so close to the literal nature of software that it stops being an analogy. The major difference is public perception. For example, we don't generally view jpegs as software. However, jpegs, at their heart, are executable binaries with very domain specific instructions that are executed in a very much non-Turing complete context. The source code for the jpeg is the XCF or similar (if it exists) which contains a specification (code) for building the binary. The code becomes human readable once loaded into an IDE, such as GIMP, designed to display and interact with the specification. This is code that is most easily interacted with using a visual IDE, but that doesn't change the fact that it is code.

There are some scenarios where you could identify a "source code" but not a "software". For example, a cake can be open sourced by releasing the recipe. In such a context, though, there is literally source code. It's just that the code never produces a binary, and is compiled by a human and kitchen instead of a computer. There is open source hardware, where the source code is a human readable hardware specification which can be easily modified, and the hardware is compiled by a human or machine using that specification.

The scenario where someone has bred a specific plant, however, can not be open source, unless they have also deobfuscated the genome, released the genome publicly, and there is also some feasible way to convert the deobfuscated genome, or a modification of it, into a seed.


> vs an actual open source system where you can understand how the system is working and modify specific functionality.

No one on the planet understands how the model weights work exactly, nor can they modify them specifically (i.e. hand modifying the weights to get the result they want). This is an impossible standard.

The source code is open (sorta, it does have some restrictions). The weights are open. The training data is closed.


> No one on the planet understands how the model weights work exactly

Which is my point. These models aren't open source because there is no source code to open. Maybe one day we will have strong enough interpretability to generate source from these models, and then we could have open source models. But today its not possible, and changing the meaning of open source such that it is possible probably isn't a great idea.


It's no secret that implementing AI usually involves far more investment into training and teaching than actual code. You can know how a neural net or other ML model works. You can have all the code before you. It's still a huge job (and investment) to do anything practical with that. If Meta shares the code their AI runs on with you, you're not going to be able to do much with it unless you make the same investment in gathering data and teaching to train that AI. That would probably require data Meta won't share. You'd effectively need your own Facebook.

If everyone open sources their AI code, Meta can snatch the bits that help them without much fear of helping their direct competitors.


I think you're misunderstanding what I'm saying. I don't think its technically feasible for current models to be open source, because there is no source code to open. Yes, there is a harness that runs the model, but the vast, vast amount of instructions are contained in the model weights, which are akin to a compiled binary.

If we make large strides in interpretability we may have something resembling source code, but we're certainly not there yet. I don't think the solution to that problem should be to change the definition of open source and pretend the problem has been solved.


The term “source code” can mean many things. In a legal context it’s often just defined as the preferred format for modification. It can be argued that for artificial neural networks that’s the weights (along with code and preferably training data).


I agree; there's a lot of muddiness in the term "open source AI". Earlier this year there was a talk[1] at FOSDEM, titled "Moving a step closer to defining Open Source AI". It is from someone at the Open Source Initiative. The video and slides are available in the link below[1]. From the abstract:

"Finding an agreement on what constitutes Open Source AI is the most important challenge facing the free software (also known as open source) movement. European regulation already started referring to "free and open source AI", large economic actors like Meta are calling their systems "open source" despite the fact that their license contain restrictions on fields-of-use (among other things) and the landscape is evolving so quickly that if we don't keep up, we'll be irrelevant."

[1] https://fosdem.org/2024/schedule/event/fosdem-2024-2805-movi... defining-open-source-ai/


You release all the technology and the training data. Everything that was used to create the model, including instructions.

I'm not sure if facebook has done that


Open source = reproducible binaries (weights) by you on your computer, IMO.

Strategy of FB is that they are good to be a user only, and fine with ruining competitors' businesses with good-enough free alternatives while collecting awards as saviors of whatever.


If that were the definition then any software you can install on your computer would be open source. It makes open source lose nearly all meaning.

Just say "open weights", not "open source".


Not sure what you mean by "they are good to be a user only." Whatever their strategy is, this is great for the community.


Open training dataset + open steps sufficient to train exactly the same model.


This isn't what Meta releases with their models, though I would like to see more public training data. However, I still don't think that would qualify as "open source". Something isn't open source just because it's reproducible out of composable parts. If one very critical and system-defining part is a binary (or similar) without publicly available source code, then I don't think it can be said to be "open source". That would be like saying that Windows 11 is open source because Windows Calculator is open source and it's a component of Windows.


Here’s one list of what is needed to be actually open source:

https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...


That's what I meant by "open steps", I guess I wasn't clear enough.


Is that what you meant? I don't think releasing the sequence of steps required to produce the model satisfies "open source", which is how I interpreted you, because there is still no source code for the model.


They can't release the training dataset if it was illegally scraped from all over the web without permission :) (taps head)


Coming up with the words and concepts to describe the models is a challenge.

Does the training data require permission from the copyright holder to use? Are the weights really open source or more like compiled assembly?


I also think that something like Chromium is a better analogy for corporate open source models than a grassroots project like Linux is. Chromium is technically open source, but Google has absolute control over the direction of its development, and realistically it's far too complex to maintain a fork without Google's resources, just like Meta has complete control over what goes into their open models. And even if they did release all the training data and code (which they don't), us mere plebs could never afford to train a fork from scratch anyway.


I think you’re right from the perspective of an individual developer. You and I are not about to fork Chromium any time soon. If you presume that forking is impractical then sure, the right to fork isn’t worth much.

But just because a single developer couldn’t do it doesn’t mean it couldn’t be done. It means nobody has organized a large enough effort yet.

For something like a browser, which is critical for security, you need both the organization and the trust. Despite frequent criticism, Mozilla (for example) is still considered pretty trustworthy in a way that an unknown developer can’t be.


If Microsoft can't do it, then we can reasonably conclude that it can't be done for any practical purpose. Discussing infinitesimal possibilities is better left to philosophers.


Doesn't Microsoft maintain its own fork of Chromium?


yes - their browser is chromium-based


If you think about LLMs as a new kind of programming runtime, the matrices are the source.


Ok call it Open Weights then if the dictionary definitions matter so much to you.

The actual point that matters is that these models are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer.


They don't "[allow] developers to modify its code however they want", which is a critical component of "open source", and one that Meta is clearly trying to leverage in branding around its products. I would like them to start calling these "public weight models", because what they're doing now is muddying the waters so much that "open source" now just means providing an enormous binary and an open source harness to run it in, rather than serving access to the same binary via an API.


Feels a bit like you are splitting hairs for the pleasure of semantic arguments, to be honest. Yes, there is no source in ML, so if we want to be pedantic it shouldn't be called open source. But what really matters in the open source movement is that we are able to take a program built by someone and modify it to do whatever we want with it, without having to ask someone for permission or get scrutinized or have to pay someone.

The same applies here, you can take those models and modify them to do whatever you want (provided you know how to train ML models), without having to ask for permission, get scrutinized or pay someone.

I personally think using the term open source is fine, as it conveys the intent correctly, even if, yes, weights are not sources you can read with your eyes.


Calling that “open source” renders the word “source” meaningless. By your definition, I can release a binary executable freely and call it “open source” because you can modify it to do whatever you want.

Model weights are like a binary that nobody has the source for. We need another term.


No it’s not the same as releasing a binary, feels like we can’t get out of the pedantics. I can in theory modify a binary to do whatever I want. In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute.

Here, modifying that model is not harder than doing regular ML, and I can redistribute.

Meta doesn't have access to some magic higher-level abstraction for that model, which they did not release, that would make working with it easier.

The sources in ML are the architecture, the training and inference code, and a paper describing the training procedure. It's all there.


"In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute."

It depends on the binary and the license the binary is released under. If the binary is released to the public domain, for example, you are free to make whatever modifications you wish. And there are plenty of licenses like this, that allow closed source software to be used as the user wishes. That doesn't make it open source.

Likewise, there are plenty of closed source projects whose binaries we can poke and prod with a much higher understanding of what our changes are actually doing than we're able to get when we poke and prod LLMs. If you want to make a Pokemon Red/Blue or Minecraft mod you have a lot of tools at your disposal.

A project that only exists as a binary which the copyright holder has relinquished rights to, or has released under some similar permissive closed source license, but which people have poked around enough to figure out how to modify certain parts of with some degree of predictability, is a more apt analogy. Especially if the original author has lost the source code, as there is no source code to speak of when discussing these models.

I would not call that binary "open source", because the source would, in fact, not be open.


Can you change the tokenizer? No, because all you have is the weights trained with the current tokenizer. Therefore, by any normal definition, you don’t have the source. You have a giant black box of numbers with no ability to reproduce it.


> Can you change the tokenizer?

Yes.

You can change it however you like, then look at the paper [1] under section 3.2. to know which hyperparameters were used during training and finetune the model to work with your new tokenizer using e.g. FineWeb [2] dataset.

You'll need to do only a fraction of the training you would have needed to do if you were to start a training from scratch for your tokenizer of choice. The weights released by Meta give you a massive head start and cost saving.

The fact that it's not trivial to do and out of reach of most consumers is not a matter of openness. That's just how ML is today.

[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/452387774_...

[2]: https://huggingface.co/datasets/HuggingFaceFW/fineweb
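
For anyone curious, here's a rough sketch of what that swap looks like with the Hugging Face stack. The custom tokenizer id is a placeholder, the hyperparameters are illustrative, and this is not Meta's actual recipe:

    # Sketch: swap in a new tokenizer, resize the embedding matrix, then
    # continue training on FineWeb so the model adapts to the new vocabulary.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from datasets import load_dataset

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
    new_tok = AutoTokenizer.from_pretrained("my-org/my-tokenizer")  # hypothetical id

    # New vocabulary size -> new (randomly initialized) embedding rows to learn.
    model.resize_token_embeddings(len(new_tok))

    # Stream FineWeb and tokenize it with the new tokenizer; the training run
    # that follows only needs to be a fraction of a full pre-training run.
    fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
    tokenized = fineweb.map(
        lambda batch: new_tok(batch["text"], truncation=True, max_length=2048),
        batched=True,
    )

From there you run an ordinary causal-LM training loop with the hyperparameters from section 3.2 of the paper; the released weights are what make this a short continuation rather than a from-scratch job.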


You can change the tokenizer and build another model, if you can come up with your own version of the rest of the source (e.g., the training set, RLHF, etc.). You can’t change the tokenizer for this model, because you don’t have all of its source.


There is nothing that requires you to train with the same training set, or to re-do RLHF. You can train on fineweb, and llama 3.1 will learn to use your new tokenizer just fine.

There is zero doubt that you are better off finetuning that model to use your tokenizer than training from scratch. So what Meta gives you for free massively helps you build your model; that's OSS to me.


You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest. One also has to come up with suitable datasets, from scratch. Training setup and data is completely non-trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.


> You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest.

Just like open source?

> Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

The entire point of having the pre-trained weight released is to *not* have to do this. You just need to finetune, which can be done with very little data, depending on the task, and many open source toolkits, that work with those weights, exist to make this trivial.
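
To make "finetune with very little data" concrete, here is a minimal sketch using one such open source toolkit (Hugging Face peft with LoRA). The dataset id is just an example and the hyperparameters are illustrative, not a recommendation:

    # Sketch: parameter-efficient fine-tune of the released weights with LoRA,
    # so only a small set of adapter weights is trained, not all parameters.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    model_id = "meta-llama/Meta-Llama-3.1-8B"
    tok = AutoTokenizer.from_pretrained(model_id)
    tok.pad_token = tok.eos_token  # Llama tokenizers ship without a pad token

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

    # A small task-specific dataset is enough for this kind of fine-tune.
    ds = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")  # example dataset
    ds = ds.map(lambda b: tok(b["output"], truncation=True, max_length=512), batched=True)

    Trainer(
        model=model,
        args=TrainingArguments("llama-lora-out", per_device_train_batch_size=1,
                               num_train_epochs=1, learning_rate=2e-4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()

The adapter weights that come out of a run like this are tiny compared to the base model, which is why so little data and compute is needed.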


I think maybe we’re talking past each other because it seems obvious to me and others that the weights are the output of the compilation process, whereas you seem to think they’re the input. Whether you can fine tune the weights is irrelevant to whether you got all the materials needed to make them in the first place (i.e., the source).

I can do all sorts of things by “fine tuning” Excel with formulas, but I certainly don’t have the source for Excel.


> The same applies here, you can take those models and modify them to do whatever you want without having to ask for permission, get scrutinized or pay someone.

The "Additional Commercial Terms" section of the license includes restrictions that would not meet the OSI definition of open source. You must ask for permission if you have too many users.


"Public weight models" sounds about right, thanks for coming up with a good term! Hope it catches.


My central point is this:

"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer."

I presume you agree with it.

> rather than serving access

It's not the same access though.

I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.

> They don't "[allow] developers to modify its code however they want"

Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.

Thats a huge deal!

And it is dishonest to compare a situation where limitations are both minimal and almost unenforceable (except against maybe Google) to a situation where it's physically not possible to get access to the model weights to do what you want with them.


> Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.

The limitations here are technical, not legal. (Though I am aware of the legal restrictions as well, and I think it's worth noting that no other project would get by calling themselves open source while imposing a restriction which prevents competitors from using the system to build their competing systems.) There isn't any source code to read and modify. Yes, you can fine tune a model just like you can modify a binary, but this isn't source code. Source code is a human-readable specification that a computer can transform into executable code. This allows the human to directly modify functionality in the specification. We simply don't have that, and it will not be possible unless we make a lot of strides in interpretability research.

> Its not the same access though.

> I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.

I'm not saying that systems that are provided as SaaS don't tend to be more restrictive in terms of what they let you do through the API they expose vs what is possible if you run the same system locally. That may not always be true, but sure, as a general rule it is. I mean, it can't be less restrictive. However, that doesn't mean that being able to run code on your own machine makes the code open source. I wouldn't consider Windows open source, for example. Why? Because they haven't released the source code for Windows. Likewise, I wouldn't consider these models open source because their creators haven't released source code for them. Being technically infeasible to do doesn't mean that the definition changes such that it's no longer technically infeasible. It is simply infeasible, and if we want to change that, we need to do work in interpretability, not pretend like the problem is already solved.


So then yes you agree with this:

"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer." And that this is very significant.


One counterpoint is that major publications (eg New York Times) would have you believe that AI is a mildly lossy compression algorithm capable of reconstructing the original source material.


I believe it is able to reconstruct parts of the original source material—if the interrogator already knows the original source material to prompt the model appropriately.


It's not?


Unfortunately open source really just means an open API these days. The API is heavily intertwined with closed source.


No, open source means that the sources are open, typically for inspection, modification, etc. Here it could also be considered the case. Likely, in order to claim "true open source", they would have to share the dataset? But even this might not be enough for a truly open source model: this dataset is nothing but another artifact. So how did they arrive at this dataset? Now they have to share pipelines and infra...

.. the thing is, we have not dealt with LLMs much, so it's hard to say what can be considered an open source LLM just yet; we use the term as a metaphor for now


Weights are the new code.


I think saying it's the new binary is closer to the truth. You can't reproduce it, but you can use it. In this new version, you can even nudge it a bit to do something a little different.

New stuff, so probably not good to force old words, with known meanings, onto new stuff.


The model is more akin to a python script than a compiled C binary. This is how I see it:

Training Code and dataset are analogous to the developer who wrote the script

Model and weights are end product that is then released

Inference Code is the runtime that could execute the code. That would be e.g. PyTorch, which can import the weights and run inference.
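
For what it's worth, the analogy in code terms (the model id is the released instruct checkpoint; this is just an illustrative sketch):

    # Sketch: the inference code (PyTorch + transformers) acts as the runtime,
    # and the downloaded weights are the "program" it loads and executes.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto")

    prompt = "Is a model weight file more like source code or a binary?"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))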


> The model is more akin to a python script than a compiled C binary.

No, I completely disagree. Python is close to pseudocode. Source exists for the specific purpose of being easily and completely understood, by humans, because it's for and from humans. You can turn a python calculator into a web server, because it can be split and separated at any point, because it can be completely understood at any point, and it's deterministic at every point.

A model cannot be understood by a human. It isn't meant to be. It's meant to be used, very close to as is. You can't fundamentally change the model, or dissect it, you can only nudge it in a direction, with the force of that nudge being proportional to the money you can burn, along with hope that it turns out how you want.

That's why I say it's closer to a binary: more of a black box you can use. You can't easily make a binary do something fundamentally different without changing the source. You can't easily see into that black box, or even know what it will do without trying. You can only nudge it to act a little differently, or use it as part of a workflow. (decompilation tools aside ;))


None of Meta's models are "open source" in the FOSS sense, even the latest Llama 3.1. The license is restrictive. And no one has bothered to release their training data either.

This post is an ad and trying to paint these things as something they aren't.


> no one has bothered to release their training data

If the FOSS community sets this as the benchmark for open source in respect of AI, they're going to lose control of the term. In most jurisdictions it would be illegal for the likes of Meta to release training data.


Regardless of the training data, the license even heavily restricts how you can use the model.

Please read through their "acceptable use" policy before you decide whether this is really in line with open source.


> Please read through their "acceptable use" policy before you decide whether this is really in line with open source

I'm not taking a specific position on this license. I haven't read it closely. My broad point is simply that open source AI, as a term, cannot practically require the training data be made available.


> In most jurisdictions it would be illegal for the likes of Meta to release training data.

How come releasing an LLM trained on that data is not illegal then? I think it should be.


the training data is the source.


I don’t think it’s that simple. The source is “the preferred form of the work for making modifications to it” (to use the GPL’s wording).

For an LLM, that’s not the training data. That’s the model itself. You don’t make changes to an LLM by going back to the training data and making changes to it, then re-running the training. You update the model itself with more training data.

You can’t even use the training code and original training data to reproduce the existing model. A lot of it is non-deterministic, so you’ll get different results each time anyway.

Another complication is that the object code for normal software is a clear derivative work of the source code. It’s a direct translation from one form to another. This isn’t the case with LLMs and their training data. The models learn from it, but they aren’t simply an alternative form of it. I don’t think you can describe an LLM as a derivative work of its training data. It learns from it, it isn’t a copy of it. This is mostly the reason why distributing training data is infeasible – the model’s creator may not have the license to do so.

Would it be extremely useful to have the original training data? Definitely. Is distributing it the same as distributing source code for normal software? I don’t think so.

I think new terminology is needed for open AI models. We can’t simply re-use what works for human-editable code because it’s a fundamentally different type of thing with different technical and legal constraints.


No, the preferred way to make modifications is using the training code. One may also input snapshot weights to start from, but the training code is definitely what you would modify to make a change.


how do you train it in a different language by changing the training code?


By selecting a different dataset. Of course this dataset does need to exist. In practice, building and curating datasets also involves a lot of code.


sounds like you need the data to train the model.


Given a well-behaved training setup, you will get an equivalently powerful model given the same dataset, training scripts, and training settings. At least if you are willing to run it several times and pick the best one - a process that is commonly used for large models.


> the training data is the source

Sure. But that's not going to be released. The term open source AI cannot be expected to cover it because it's not practical.


Meta can call it something else other than open source.

Synthetic part of the training data could be released.


Of course it could be practical - provide the data. The fact that society is a dystopian nightmare controlled by a few megacorporations that don't want free information does not justify outright changing the meaning of the language.


> provide the data

Who? It's not their data.


why are they using it?


And why legislation allows them to use the data to train their LLM and release that, but not release the data?


So because it's really hard to do proper Open Source with these LLMs, means we need to change the meaning of Open Source so it fits with these PR releases?


> because it's really hard to do proper Open Source with these LLMs, means we need to change the meaning of Open Source so it fits with these PR releases?

Open training data is hard to the point of impracticality. It requires excluding private and proprietary data.

Meanwhile, the term "open source" is massively popular. So it will get used. The question is how.

Meta et al would love for the choice to be between, on one hand, open weights only, and, on the other hand, open training data, because the latter is impractical. That dichotomy guarantees that when someone says open source AI they'll mean open weights. (The way open source software, today, generally means source available, not FOSS.)


>Meanwhile, the term "open source" is massively popular. So it will get used. The question is how.

Here's the source of the disagreement. You're justifying the use of the term "open source" by saying it's logical for Meta to want to use it for its popularity and layman (incorrect) understanding.

Other person is saying it doesn't matter how convenient it is or how much Meta wants to use it, that the term "open source" is misleading for a product where the "source" is the training data, and the final product has onerous restrictions on use.

This would be like Adobe giving Photoshop away for free, but for personal use only and not for making ads for Adobe's competitors. Sure, Adobe likes it and most users may be fine with it, but it isn't open source.

>The way open source software, today, generally means source available, not FOSS.

I don't agree with that. When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core".


> You're justifying the use of the term "open source" by saying it's logical for Meta to want to use it for its popularity and layman (incorrect) understanding

I'm actually not a fan of Meta's definition. I'm arguing specifically against an unrealistic definition, because for practical purposes that cedes the term to Meta.

> the term "open source" is misleading for a product where the "source" is the training data, and the final product has onerous restrictions on use

Agree. I think the focus should be on the use restrictions.

> When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core"

This isn't consistently applied. It's why we have the free vs open vs FOSS fracture.


> Open training data is hard to the point of impracticality. It requires excluding private and proprietary data.

Right, so the onus is on Facebook/Meta to get that right, then they could call something Open Source, until then, find another name that already doesn't have a specific meaning.

> (The way open source software, today, generally means source available, not FOSS.)

No, but it's going that way. Open Source, today, still means that the things you need to build a project are publicly available for you to download and run on your own machine, granted you have the means to do so. What you're thinking of is literally called "Source Available", which is very different from "Open Source".

The intent of Open Source is for people to be able to reproduce the work themselves, with modifications if they want to. Is that something you can do today with the various Llama models? No, because one core part of the projects "source code" (what you need to reproduce it from scratch), the training data, is being held back and kept private.


source available is absolutely not the same as open source

you are playing very loosely with terms that have specific, widely accepted definitions (e.g. https://opensource.org/osd )

I don't get why you think it would be useful to call LLMs with published weights "open source"


> terms that have specific, widely accepted definitions

The OSI's definition is far from the only one [1]. Switzerland is currently implementing CH Open's definition, the EU another one, et cetera.

> I don't get why you think it would be useful to call LLMs with published weights "open source"

I don't. I'm saying that if the choice is between open weights or open weights + open training data, open weights will win because the useful definition will outcompete the pristine one in a public context.

[1] https://en.wikipedia.org/wiki/Open-source_software#Definitio...


For the EU, I'm guessing you're talking about the EUPL, which is FSF/OSI approved and GPL compatible, generally considered copyleft.

For the CH Open, I'm not finding anything specific, even from Swiss websites, could you help me understand what you're referring to here?

I'm guessing that all these definitions have at least some points in common, which involves (another guess) at least being able to produce the output artifacts/binaries by yourself, something that you cannot do with Llama, just as an example.


> For the CH Open, I'm not finding anything specific, even from Swiss websites, could you help me understand what you're referring to here

Was on the HN front page earlier [1][2]. The definition comes strikingly close to source on request with no use restrictions.

> all these definitions have at least some points in common

Agreed. But they're all different. There isn't an accepted definition of open source even when it comes to software; there is an accepted set of broad principles.

[1] https://news.ycombinator.com/item?id=41047172

[2] https://joinup.ec.europa.eu/collection/open-source-observato...


> Agreed. But they're all different. There isn't an accepted definition of open source even when it comes to software; there is an accepted set of broad principles.

Agreed, but are we splitting hairs here and is it relevant to the claim made earlier?

> (The way open source software, today, generally means source available, not FOSS.)

Do any of these principles or definitions from these orgs agree/disagree with that?

My hypothesis is that they generally would go against that belief and instead argue that open source is different from source available. But I haven't looked specifically to confirm if that's true or not, just a guess.


> are we splitting hairs here and is it relevant to the claim made earlier?

I don't think so. Take the Swiss definition. Source on request, not even available. Yet being branded and accepted as open source.

(To be clear, the Swiss example favours FOSS. But it also permits source on request and bundles them together under the same label.)


diluting open source into a marketing term meaning "you can download something" would be a sad result


> specific, widely accepted definitions

Realistically, nobody outside of Hacker News commenters have ever cared about the OSD. It's just not how the term is used colloquially.


who says open source colloquially? ime anyone who doesn't care about software licenses will just say free (per free beer)

and (strong personal opinion) any software developer should have a firm grip on the terminology and details for legal reasons


> who says open source colloquially?

There is a large span of people between gray beard programmer and lay person, and many in that span have some concept of open-source. It's often used synonymously with visible source, free software, or in this case, open weights.

It seems unfortunate - though expected - that over half of the comments in this thread are debating the OSD for the umpteenth time instead of discussing the actual model release or accompanying news posts. Meanwhile communities like /r/LocalLlama are going hog wild with this release and already seeing what it can do.

> any software developer should have a firm grip on the terminology and details for legal reasons

They'd simply need to review the terms of the license to see if it fits their usage. It doesn't really matter if the license satisfies the OSD or not.


No, we need to adapt an existing term into the new context that it is being deployed in.


We've had a similar debate before, but the last time it was about whether Linux device drivers based on non-public datasheets under NDA were actually open source. This debate occurred again over drivers that interact with binary blobs.

I disagree with the purists - if you can legally change the source or weights - even without having access to the data used by the upstream authors - it's open enough for me. YMMV.


No. It's an asset used in the training process, the source code can process arbitrary training data.


I don’t think even that is true. I conjecture that Facebook couldn’t reproduce the model weights if they started over with the same training data, because I doubt such a huge training run is a reproducible deterministic process. I don’t think anyone has “the” source.


numpy.random.seed(1234)
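
(Tongue in cheek, but for completeness: a rough sketch of what "seed everything" looks like in PyTorch, and why even that doesn't make a huge multi-GPU training run bit-for-bit reproducible.)

    # Sketch: seed every RNG and ask PyTorch for deterministic kernels.
    import os, random
    import numpy as np
    import torch

    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS
    random.seed(1234)
    np.random.seed(1234)
    torch.manual_seed(1234)
    torch.cuda.manual_seed_all(1234)
    torch.use_deterministic_algorithms(True)  # raises if an op has no deterministic impl
    torch.backends.cudnn.benchmark = False

    # Even with all of the above, results can still drift across hardware,
    # library versions, and the order of gradient all-reduces in distributed runs.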


AI2 has released training data in their OLMo model: https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...


The big winners of this: devs and AI startups

- No more vendor lock-in

- Instead of just wrapping proprietary API endpoints, developers can now integrate AI deeply into their products in a very cost-effective and performant way

- Price race to the bottom with near-instant LLM responses at very low prices are on the horizon

As a founder, it feels like a very exciting time to build a startup as your product automatically becomes better, cheaper, and more scalable with every major AI advancement. This leads to a powerful flywheel effect: https://www.kadoa.com/blog/ai-flywheel


- Price race to the bottom with near-instant LLM responses at very low prices are on the horizon

Maybe a big price war while the market majors fight it out for positioning, but they still need to make money off their investments, so someone is going to have to raise prices at some point, and you'll be locked into their system if you build on it.


>locked into their system

There are going to be loads of providers for these open models. Openrouter already has 3 providers for the new 405B model within hours.


Maybe for the time being. I don't see how else they monetize the incredible amount they spent on the models without forcing people to lock into models or benefits or something else.

It's not going to stay like this I can assure you that :).


Not sure whether by that post you mean OpenRouter serving the 405B or Meta producing more.

OpenRouter is a paid API, so that can absolutely be sustainable.

And Meta has multiple reasons for going the open route - some explained in their post, some less so (it harms their competitors).

I reckon there will be a llama 4 and beyond


Meta will make money like it has in the past by having data about users and advertising to them. Commoditizing AI helps them keep at that.

See Joel on Software "Smart companies try to commoditize their products’ complements" https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/


> they still need to make money off their investments

Depends on how you define this. Most of the top companies don't care as much about making a profit off of AI inference itself, if the existence of the -feature- of AI inference drives more usage and/or sales of their other products (phones, computers, operating systems, etc.)

That's why, for example, Google and Bing searches automatically perform LLM inference at no cost to the user.


Also the opportunity to run on user compute and on private data. That supports a slate of business models that are incompatible with the mainframe approach.

Including adtech models, which are predominantly cloud-based.


It creates the opposite of a flywheel effect for you. It creates a leapfrog effect.


AI might cannibalize a lot of first gen AI businesses.


What Meta is doing is borderline market distortion. It's not that they have figured out some magic sauce they are happy to share. They are just deciding to burn brute force money that they made elsewhere and give their stuff away below cost, first of all because they can.


I know, and it's beautiful to see. Bad actors like "Open"AI tried to get in first and monopolize this tech with lawfare. But that game plan has been mooted by Meta's scorched-earth generosity.


Meta has actually figured out where the moat is: ecosystem and tooling. As soon as "we" build it, they can still do whatever they want with the core LLM, starting with Llama 4 or at any other point in the future.

The best kind of open source: All the important ingredients to make it work (more and more data and money) are either not open source or in the hands of Meta. It's prohibitive by design.

People seem happy to help build Metas empire once again in return for scraps.


To be fair, MSFT's investments with credits into OpenAI are also almost market distortion. All the investments done with credits posing as dollars have made the VC investment world very chaotic in the AI space. No real money changes hands, and the revenue on the books of MSFT and Amazon is low-quality revenue. Those companies' AI moves are overvalued.


It's strange you are downvoted for this. It is a legitimate take on things (even if it is likely not accurate as far as intent is concerned).


and Xi Jinping


Even if it's just open weights and not "true" open source, I'll still give Meta the appreciation of being one of the few big AI companies actually committed to open models. In an ecosystem where groups like Anthropic and OpenAI keep hemming and hawing about safety and the necessity of closed AI systems "for our sake", they stand out among the rest.


To me it will be most interesting to see who attempts to manipulate the models by stuffing them with content - essentially adding "duplicate" content, such as via tautology, in order to give it added, misallocated weight. I don't think an AI model will automatically be able to detect this unless it is truly intelligent; instead it would need to be trained by competent humans.

And so the models that have mechanisms for curating and preventing such misapplied weighting, and the organizations and individuals who accurately make adjustments to the models, will in the end be the winners - the ones where truth has been more carefully honed.


Why would openai/anthropic's approach be more safe? Are people able to remove all the guard rails on the llama models?



Humanity is so fortunate this "guardrails" mentality didn't catch on when we started publishing books. While too close for comfort, we got twice lucky that computing wasn't hampered by this mentality either.

This time, humanity narrowly averted complete disaster thanks to the huge efforts and resources of a small number of people.

I wonder if we are witnessing the end of humanity's open knowledge and compute (at least until we pass through a neo dark age and reach the next age of enlightenment).

Whether it'll be due to profit or control, it looks like humanity is poised to get fucked.


[flagged]


The EU hasn't made it hard to release models (yet). The EU has made it hard to train models on EU data. Meta has responded by blocking access to the models trained on non-EU data as a form of leverage/retribution. This is explained by your own reference.


They're not safer. The claim is that OpenAI will enforce guard rails and take steps to ensure model outputs and prompts are responsible... but only a fool would take them at their word.


Yeah.. and Facebook said they would enforce censorship on their platforms to ensure content safety.. that didn't turn out so well. Now it just censors anything remotely controversial, such as World War 2 historical facts or even just slightly offensive wording.


You're really just arguing about the tuning. I get that it's annoying as a user but as a moderator going into it with the mentality that any post is expendable and bringing down the banhammer on everything near the line keeps things civil. HN does that too with the no flame-bait rule.


HN moderation is quite poor and very subjective. The guidelines are not the site rules, the rules are made up on the spot.

HN censors too. Facebook just does it automatically on a huge scale with no reasoning behind each censor.

Censorship is just tuning out people or things you don't want. Censorship of your own content as a user is extremely annoying, and Facebook's censorship is quite unethical. It doesn't help the safety of the users, it helps the safety of the business.

Also, Facebook censors things that are objectively not offensive in lots of instances. YouTube too. Safety for their brand.


Censorship isn't moderation.


The banhammer can quickly become a tool of net negative though, when actual facts are being repressed/censored.


They are positioning themselves as champions of AI open source mostly because they were blindsided by OpenAI, are not in the infra game, and want to commoditize their complements as much as possible.

This is not altruism, although it's still great for devs and startups. All of FB's GPU investment is primarily for new AI products ("friends"), recommendations, and selling ads.

https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/


Meta does a good thing

HN spends a day figuring out how it’s actually bad


It’s not actually bad, OP’s point is that it is not motivated by altruism. An action can be beneficial to the people without that effect being the incentive


Of course, it's not altruism; it's a publicly traded corporation. No one should ever believe in any such claims by these organizations. Non-altruistic organizations can still make positive-impact actions when they align with their goals.


So you think FB did this with zero benefit to themselves? They did open source so people could improve their models and eventually have a paid tier later, either from hosting services or other strategies.


The linked article already spelled out the benefits


No one said it was bad. It's just self interested (as companies generally are) and are using that to have a PR spin on the topic. But again, this is what all companies do and nothing about it is bad per se.


Nothing they're doing is bad, and sometimes we benefit when large companies' interests align with our own. All the spiel about believing in open systems because Apple prevented them from making their best products is a bit much considering we're talking about Facebook, which is hardly an 'open platform', and the main thing Apple blocked them on was profiling their users to target ads.


By virtue of it being Meta, it's automatically bad.

If we lived in a sensible world we'd have nuked Meta into a trillion tiny little pieces some time around the Cambridge Analytica bullshit.


They've been working on AI for a good bit now. Open source especially is something they've championed since the mid 2010s at least with things like PyTorch, GraphQL, and React. It's not something they've suddenly pivoted to since ChatGPT came in 2022.


They are giving it "for free" because:

* they need LLMs that they can control for features on their platforms (Fb/Instagram, but I can see many use cases on VR too)

* they cannot sell it. They have no cloud services to offer.

So they would spend this money anyway, but to compensate for some of the losses they just decided to use it to fix their PR by keeping developers content.


They also reap the benefits of AI researchers across the world using Llama as a base. All their research is immediately applicable to their models. It's also likely a strategic decision to reduce the moat OpenAI is building around itself.

I also think LeCun opposes OpenAI's gatekeeping at a philosophical/political level. He's using his position to strengthen open-source AI. Sure, there are strategic business considerations, but I wouldn't rule out principled motivations too.


Yes LeCun has said he thinks AI should be open like journalism should be - that openness is inherently valuable in such things.

Add to the list of benefits to Meta that it keeps LeCun happy.


I think people massively underestimate how much time/attention span (and ad revenue) will be up for grabs once a platform really nails the "AI friend" concept. And it makes sense for Meta to position themselves for it.


yes ... I remember when online dating was an absolutely cringe / weird thing to do. Ten years later and it's the primary way a whole generation seeks a partner.

It seems incredibly weird today to have an imaginary friend that you treat as a genuine relationship, but I genuinely expect this will happen and become a commonplace thing within the next two decades.


> they were blindsided by OpenAI

Given the mountain of GPUs they bought at precisely the right moment I don't think that's entirely accurate


> Given the mountain of GPUs they bought at precisely the right moment I don't think that's entirely accurate

If I remember correctly, FB didn't buy those GPUs because of OpenAI; they were going to buy them anyway, but Mark said whatever we are buying, let's double it.


Yeah, still not entirely clear what exactly they're doing with all of it...but they certainly saw the GPU supply crunch earlier than the rest


AI is not a "complement" of a social network in the way Spolsky defines the term.

> A complement is a product that you usually buy together with another product. Gas and cars are complements. Computer hardware is a classic complement of computer operating systems. And babysitters are a complement of dinner at fine restaurants. In a small town, when the local five star restaurant has a two-for-one Valentine’s day special, the local babysitters double their rates. (Actually, the nine-year-olds get roped into early service.)

> All else being equal, demand for a product increases when the prices of its complements decrease.

Smartphones are a complement of Instagram. VR headsets are a complement of the metaverse. AI could be a component of a social network, but it's not a complement.


Intentions are overrated. Given how many people with good intentions fuck up everything, I'd rather have actual results, even if the intention is self-serving.


I wish Meta stopped using the "open source" misnomer for free of charge weights. In the US the FTC already uses the term Open-Weights, and it seems the industry is also adopting this term (e.g. Mistral).

Someone can correct me here but AFAIK we don't even know which datasets are used to train these models, so why should we even use "open" to describe Llama? This is more similar to a freeware than an open-source project.

[1] https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/202...


This is such a good point. The industry is really putting the term "open source" through the wringer at the moment, but I don't see any justification for considering the final weight output a "source" any more than releasing a compiled binary would be open source.

In fairness to Llama, the source code itself (though not the training data) is available to access, although not really under a license that many would consider open source.


Facebook is one of the greats when it comes to twisting words and appropriating terms in ways that benefit Facebook.


Meta makes their money off advertising, which means they profit from attention.

This means they need content that will grab attention, and creating open source models that allow anyone to create any content on their own becomes good for Meta. The users of the models can post it to their Instagram/FB/Threads account.

Releasing an open model also releases Meta from the burden of having to police the content the model generates, once the open source community fine-tunes the models.

Overall, this move is good business move for Meta - the post doesn't really talk about the true benefit, instead moralizing about open source, but this is a sound business move for Meta.


I am not sure I follow this.

1. Is there such a thing as 'attention grabbing AI content' ? Most AI content I see is the opposite of 'attention grabbing'. Kindle store is flooded with this garbage and none of it is particularly 'attention grabbing'.

2. Why would creation of such content, even if it was truly attention grabbing, benefit meta in particular ?

3. How would poliferation of AI content lead to more ad spend in the economy. Ad budgets won't increase because of AI content?

To me this is a typical Zuckerberg play: attach Meta's name to whatever is trendy at the moment, like the (now forgotten) metaverse, cryptocoins, and a bunch of other failed stuff that was trendy for a second. Meta is NOT a Gen AI company (or a metaverse company, or a crypto company), despite him scamming (more like colluding with) the market into believing it is. A mere distraction from slowing user growth on ALL of Meta's apps.

ppl seem to have just forgotten this https://en.wikipedia.org/wiki/Diem_(digital_currency)


Sure - there is plenty of attention grabbing AI content - it doesn't have to grab _your_ attention, and it won't work for everyone. I have seen people engaging with apps that redo a selfie to look like a famous character or put the person in a movie scene, for example.

Every piece of content in any feed (good, bad, or otherwise) benefits the aggregator (Meta, YouTube, whatever), because someone will look at it. Not everything will go viral, but it doesn't matter. Scroll whatever on Twitter, YouTube Shorts, Reddit, etc. Meta has a massive presence in social media, so content being generated is shared there.

The more content of any type leads to more engagement on the platforms where it's being shared. Every Meta feed serves the viewer an ad (for which Meta is paid) every 3 or so posts (pieces of content). It doesn't matter if the user doesn't like 1/5 posts or whatever, the number of ads still goes up.


> it doesn't have to grab _your_ attention

I am talking about in general, not me personally. No popular content on any website/platform is AI generated. Maybe you have examples that lead you to believe that it's possible on a mass scale.

> look like a famous character or put the person in a movie scene

what attention grabbing movie used gen ai persons


i'd say reddit is a pretty great example, twitter, even instagram or facebook comments, where bot generated traffic and comments are a norm.

you have plenty of bot or "AI/LLM" generated content, that is consumed -- up to and including things like "news".

as for the comment about movies, i'm confused -- CGI has been a thing for a long time, and "AI" has been used to convey aging or how a person might look given some conditions, on screen, as well as a whole host of things.

while this might not be an LLM, it is certainly computer generated, predictive, and artificially generated.


I think the biggest part of it is just that they were behind but also betting on it. This allowed them to get a lot of traction and support, and be a notable player in the race whilst still retaining some control. Chances are, if someone is going to have a front-row seat monetizing this, it's still them.


AI moderators too would be an enormous boon if they could get that right.


It would be good, but the cost per moderation is still really high for it to be practical.


Creating content with AI will surely be helpful for social media to some extent but I think it's not that important in larger scheme of things, there's already a vast sea of content being created by humans and differentiation is already in recommending the right content to right people at right time.

More important are the products that Meta will be able to make if the industry standardizes on Llama. They would have a front seat - not just with access to the latest unreleased models, but also in setting the direction of progress and what next-gen LLMs optimize for. If you're Twitter or Snap or TikTok or otherwise compete with Meta on product, then good luck trying to keep up.


> Meta makes their money off advertising, which means they profit from attention. This means they need content that will grab attention

That is why they hopped on the Attention is All You Need train


This is a great point. Eventually, META will only allow LLAMA generated visual AI content on its platforms. They'll put a little key in the image that clears it with the platform.

Then all other visual AI content will be banned. If that is where legislation is heading.


Huge companies like facebook will often argue for solutions that on the surface, seem to be in the public interest.

But I have strong doubts they (or any other company) actually believe what they are saying.

Here is the reality:

- Facebook is spending untold billions on GPU hardware.

- Facebook is arguing in favor of open sourcing the models, that they spent billions of dollars to generate, for free...?

It follows that companies with much smaller resources (money) will not be able to match what Facebook is doing. Seems like an attempt to kill off the competition (specifically, smaller organizations) before they can take root.


I actually think this is one of the rare times where the small guys interests are aligned with Meta. Meta is scared of a world where they are locked out of LLM platforms, one where OpenAI gets to dictate rules around their use of the platform much like Apple and Google dictates rules around advertiser data and monetization on their mobile platforms. Small developers should be scared of a world where the only competitive LLMs are owned by those players too.

Through this lens, Meta's actions make more sense to me. Why invest billions in VR/AR? The answer is simple: don't get locked out of the next platform, maybe you can own the next one. Why invest in LLMs? Again, don't get locked out. Google and OpenAI/Microsoft are far larger and ahead of Meta right now, and Meta genuinely believes the best way to make sure they have an LLM they control is to make everyone else have an LLM they can control. That way community efforts are unified around their standard.


Sure, but don't you think the "not getting locked out" is just the pre-requisite for their eventual goal of locking everyone else out?


Does it really matter? Attributing goodwill to a company is like attributing goodwill to a spider that happens to clean up the bugs in your basement. Sure if they had the ability to, I'm confident Meta would try something like that, but they obviously don't, and will not for the foreseeable future.

I have faith they will continue to do what's in their best interests and if their best interests happen to align with mine, then I will support that. Just like how I don't bother killing the spider in my basement because it helps clean up the other bugs.


But you also know that the spider has been laying eggs so you better have an extermination plan ready.


Everyone is aware of that. No one thinks Facebook or Mark are some saint entities. But while the spider is doing some good deeds why not just go "yeah! go spider!". Once it becomes an asshole, we will kill it. People are not dumb.


It's not even truly open source, they set a user limit.


I'm not particularly concerned about the user limit. The companies for which those limits will matter are so large that they should consider contributing back to humanity by developing their own SOTA foundation models.


If by "everyone else" here you mean 3 or 4 large players trying to create a regulatory moat around themselves then I am fine with them getting locked out and not being able to create a moat for next 3 decades.


> I actually think this is one of the rare times where the small guys interests are aligned with Meta

Small guys are the ones being screwed over by AI companies and having their text/art/code stolen without any attribution or adherence to license. I don’t think Meta is on their side at all


That's a separate problem which affects small to large players alike (e.g. ScarJo).

Small companies interests are aligned with Meta as they are now on an equal footing with large incumbent players. They can now compete with a similarly sized team at a big tech company instead of that team + dozens of AI scientists


The reason for Meta making their model open source is rather simple: They receive an unimaginable amount of free labor, and their license only excludes their major competitors to ensure mass adoption without benefiting their competition (Microsoft, Google, Alibaba, etc). Public interest, philanthropy, etc are just nice little marketing bonuses as far as they're concerned (otherwise they wouldn't be including this licensing restriction).


All correct, Meta does obviously benefit.

It's helpful to also look at what do the developers and companies (everyone outside of top 5/10 big tech companies) get out of this. They get open access to weights of SOTA LLM models that take billions of dollars to train and 10s of billions a year to run the AI labs that make these. They get the freedom to fine tune them, to distill them, and to host them on their own hardware in whatever way works best for their products and services.


Interesting. So llama isn't actually open source! That would partially explain why they feel their moat isn't compromised.

Also... Wow, Mark Zuckerberg is such a liar! Implying that llama is open source when it isn't, while at the same time trying to gather the goodwill of FOSS developers.


Meta haven't made an open source model. They have released a binary with a proprietary but relatively liberal license. Binaries are not source and their license isn't free.


The model itself isn't actually that valuable to Facebook. The thing that's important is the dataset, the infrastructure, and the people to make the models.

There is still, just about, a strong ethos (especially in the research teams) to chuck loads of stuff over the wall into open source (PyTorch, Detectron, SAM, Aria, etc.)

but it's seen internally as a two-part strategy:

1) strong recruitment tool (come work with us, we've done cool things, and you'll be able to write papers)

2) seeding the research community with a common toolset.


Meta is, fundamentally, a user-generated-content distribution company.

Meta wants to make sure they commoditize their complements: they don’t want a world where OpenAI captures all the value of content generation, they want the cost of producing the best content to be as close to free as possible.


i was thinking along the same lines. A lot of content generated by LLMs is going to end up on Facebook or Instagram. The easier it is to create AI-generated content, the more content ends up on those applications.


Especially because genAI is a copyright laundering system. You can train it on copyrighted material and none of the content generated with it is copyrightable, which is perfect for social apps


> We’re releasing Llama 3.1 405B, the first frontier-level open source AI model, as well as new and improved Llama 3.1 70B and 8B models.

Bravo! While I don't agree with Zuck's views and actions on many fronts, on this occasion I think he and the AI folks at Meta deserve our praise and gratitude. With this release, they have brought the cost of pretraining a frontier 400B+ parameter model to ZERO for pretty much everyone -- well, everyone except Meta's key competitors.[a] THANK YOU ZUCK.

Meanwhile, the business-minded people at Meta surely won't mind if the release of these frontier models to the public happens to completely mess up the AI plans of competitors like OpenAI/Microsoft, Google, Anthropic, etc. Come to think of it, the negative impact on such competitors was likely a key motivation for releasing the new models.

---

[a] The license is not open to the handful of companies worldwide which have more than 700M users.


Look, absolutely zero people in the world should trust any tech company when they say they care about or will keep commitments to the open-source ecosystem in any capacity. Nevertheless, it is occasionally strategic for them to do so, and there can be ancillary benefits for said ecosystem in those moments where this is the best play for them to harm their competitors

For now, Meta seems to release Llama models in ways that don't significantly lock people into their infrastructure. If that ever stops being the case, you should fork rather than trust their judgment. I say this knowing full well that most of the internet is on AWS or GCP, most brick and mortar businesses use Windows, and carrying a proprietary smartphone is essentially required to participate in many aspects of the modern economy. All of this is a mistake. You can't resist all lock-in. The players involved effectively run the world. You should still try where you can, and we should still be happy when tech companies either slip up or make the momentary strategic decision to make this easier


> If that ever stops being the case, you should fork rather than trust their judgment.

Fork what? The secret sauce is in the training data and infrastructure. I don't think either of those is currently open.


I'm just a lowly outsider to the AI space, but calling these open source models seems kind of like calling a compiled binary open source.

If you don't have a way to replicate what they did to create the model, it seems more like freeware than open source.


As an ML researcher, I agree. Meta doesn't include adequate information to replicate the models, and from the perspective of fundamental research, the interest that big tech companies have taken in this field has been a significant impediment to independent researchers because of this fundamental lack of openness, even though they are undeniably producing groundbreaking results in many respects.

This should also make everyone very skeptical of any claim they are making, from benchmark results to the legalities involved in their training process to the prospect of future progress on these models. Without being able to vet their results against the same datasets they're using, there is no way to verify what they're saying, and the credulity that otherwise smart people have been exhibiting in this space has been baffling to me

As a developer, if you have a working Llama model, including the source code and weights, and it's crucial for something you're building or have already built, it's still fundamentally a good thing that Meta isn't gating it behind an API: if they went away tomorrow, you could still use, self-host, retrain, and study the models.


Which option would be better?

A) Release the data, and if it ends up causing a privacy scandal, at least you can actually call it open this time.

B) Neuter the dataset, and the model

All I ever see in these threads is a lot of whining and no viable alternative solutions (I’m fine with the idea of it being a hard problem, but when I see this attitude from “researchers” it makes me less optimistic about the future)

> and the credulity that otherwise smart people have been exhibiting in this space has been baffling to me

Remove the “otherwise” and you’re halfway to understanding your error.


This isn't a dilemma at all. If Facebook can't release data it trains on because it would compromise user privacy, it is already a significant privacy violation that should be a scandal, and if it would prompt some regulatory or legislative remedies against Facebook for them to release the data, it should do the same for releasing the trained model, even through an API. The only reason people don't think about it this way is that public awareness of how these technologies work isn't pervasive enough for the general public to think it through, and it's hard to prove definitively. Basically, if this is Facebook's position, it's saying that the release of the model already constitutes a violation of user privacy, but they're betting no one will catch them

If the company wants to help research, it should full-throatedly endorse the position that it doesn't consider it a violation of privacy to train on the data it does, and release it so that it can be useful for research. If the company thinks it's safeguarding user privacy, it shouldn't be training models on data it considers private and then using them in public-facing ways at all

As it stands, Facebook seems to take the position that it wants to help the development of software built on models like Llama, but not really the fundamental research that goes into building those models in the same way


> If Facebook can't release data it trains on because it would compromise user privacy, it is already a significant privacy violation that should be a scandal

Thousands of entities would scramble to sue Facebook over any released dataset no matter what the privacy implications of the dataset are.

It's just not worth it in any world. I believe you are not thinking of this problem from the view of the PM or VPs that would actually have to approve this: if I were a VP and I was 99% confident that the dataset had no privacy implications, I still wouldn't release it. Just not worth the inevitable long, drawn out lawsuits from people and regulators trying to get their pound of flesh.

I feel the world is too hostile to big tech and AI to enable something like this. So, unless we want to kill AGI development in the cradle, this is what we get - and we can thank modern populist techno-pessimism for cultivating this environment.


Translation: "we train our data on private user data and copyrighted material so of course we cannot disclose any of our datasets or we'll be sued into oblivion"

There's no AGI development in the cradle. And the world isn't "hostile". The world is increasingly tired of predatory behavior by supranational corporations


> I feel the world is too hostile to big tech

Lmao what? If the world were sane and hostile to big tech, we would've nuked them all years ago for all the bullshit they pulled and continue to pull. Big tech has politicians in their pockets, but thankfully the "populist techno-pessimist" (read: normal people who are sick of billionaires exploiting the entire planet) are finally starting to turn their opinions, albeit slowly.

If we lived in a sane world Cambridge Analytica would've been the death knell of Facebook and all of the people involved with it. But we instead live in a world where psychopathic pieces of shit like Zucc get away with it, because they can just buy off any politician who knocks on their doors.


> normal people who are sick of billionaires exploiting the entire planet

They don't understand what big tech does for humanity or how much they rely on it day to day. Literally all of their modern conveniences are enabled by big tech.


Rather dismissive, particularly as CrowdStrike has laid a good chunk of that bare.

In my experience many 'normal people' understand far more than you deign to credit; many are able to forgo modern 'conveniences' if pressed.


CrowdStrike merely shows how much people depend on big tech without even realizing how much they rely on it.

I think you have too much faith in the average person. They scarcely understand how nearly everything in their life has been manufactured on or designed on something powered by big tech.


This post demonstrates a willful ignorance of the factors driving so-called "populist techno-pessimism" and I'm sure every time a member of the public is exposed to someone talking like this, their "techno-pessimism" is galvanized

The ire people have toward tech companies right now is, like most ire, perhaps in places overreaching. But it is mostly justified by the real actions of tech companies, and facebook has done more to deserve it than most. The thought process you just described sounds like an accurate prediction of the mindset and culture of a VP within Facebook, and I'd like you to reflect on it for a sec. Basically, you rightly point out that the org releasing what data they have would likely invite lawsuits, and then you proceeded to do some kind of insane offscreen mental gymnastics that allow this reality to mean nothing to you but that the unwashed masses irrationally hate the company for some unknowable reason

Like you're talking about a company that has spent the last decade buying competitors to maintain an insane amount of control over billions of users' access to their friends, feeding them an increasingly degraded and invasive channel of information that also from time to time runs nonconsensual social experiments on them, and following even people who didn't opt in around the internet through shady analytics plugins in order to sell dossiers of information on them to whoever will pay. What do you think it is? Are people just jealous of their success, or might they have some legit grievances that may cause them to distrust and maybe even loathe such an entity? It is hard for me to believe Facebook has a dataset large enough to train a current-gen LLM that wouldn't also feel, viscerally, to many, like a privacy violation. Whether any party that felt this way could actually win a lawsuit is questionable though, as the US doesn't really have significant privacy laws, and this is partially due to extensive collaboration with, and lobbying by, Facebook and other tech companies who do mass-surveillance of this kind.

I remember a movie called Das Leben der Anderen (2006) (Officially translated as "the lives of others") which got accolades for how it could make people who hadn't experienced it feel how unsettling the surveillance state of East Germany was, and now your average American is more comprehensively surveilled than the Stasi could have imagined, and this is in large part due to companies like facebook

Frankly, I'm not an AGI doomer, but if the capabilities of near-future AI systems are even in the vague ballpark of the (fairly unfounded) claims the American tech monopolies make about them, it would be an unprecedented disaster on a global scale if those companies got there first, so inasmuch as we view "AGI research" as something that's inevitably going to hit milestones in corporate labs with secretive datasets, I think we should absolutely kill it to whatever degree is possible, and that's as someone who truly, deeply believes that AI research has been beneficial to humanity and could continue to become moreso


> Release the data, and if it ends up causing a privacy scandal...

We can't prove that a model like llama will never produce a segment of its training data set verbatim.

Any potential privacy scandal is already in motion.

My cynical assumption is that Meta knows that competitors like OpenAI have PR-bombs in their trained model and therefore would never opensource the weights.


The model is public, so you can at least verify their benchmark claims.


Generally speaking, no. An important part of a lot of benchmarks in ML research is generalization. What this means is that it's often a lot easier to get a machine learning model to memorize the test cases in a benchmark than it is to train it to perform a general capability the benchmark is trying to test for. For that reason, the dataset is important, as if it includes the benchmark test cases in some way, it invalidates the test
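(For a concrete sense of what such a check involves when you do have the data, here's a minimal sketch of an n-gram overlap test; the corpus and benchmark inputs are hypothetical placeholders, and real decontamination pipelines are far more careful.)

    import re

    def ngrams(text, n=8):
        # Crude lowercase word tokenization, just to illustrate the idea.
        tokens = re.findall(r"\w+", text.lower())
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def contaminated(training_docs, benchmark_items, n=8):
        # Flag any benchmark item that shares an n-gram with the training corpus.
        train_ngrams = set()
        for doc in training_docs:
            train_ngrams |= ngrams(doc, n)
        return [item for item in benchmark_items if ngrams(item, n) & train_ngrams]

    # Hypothetical usage: both lists would come from the (unreleased) training
    # data and the public benchmark, respectively.
    flagged = contaminated(["some training document ..."], ["a benchmark question ..."])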

When AI research was still mostly academic, I'm sure a lot of people still cheated, but there was somewhat less incentive to, and norms like publishing datasets made it easier to verify claims made in research papers. In a world where people don't, and there's significant financial incentive to lie, I just kind of assume they're lying


> If you don't have a way to replicate what they did to create the model, it seems more like freeware

Isn't that a bit like arguing that a Linux kernel driver isn't open source if I just give you a bunch of GPL-licensed source code that speaks to my device, but no documentation on how my device works? If you take away the source code you have no way to recreate it. But so far that never caused anyone to call the code not open-source. The closest is the whole GPLv3 Tivoization debate, and that was very divisive.

The heart of the issue is that open source is kind of hard to define for anything that isn't software. As a proxy we could look at Stallman's free software definition. Free software shares a common history with open source, and in most cases open source software is free/libre and vice versa, so this might be a useful proxy.

So checking the four software freedoms:

- The freedom to run the program as you wish, for any purpose: For most purposes. There's that 700M user restriction, also Meta forbids breaking the law and requires you to follow their acceptable use policy.

- The freedom to study how the program works, and change it so it does your computing as you wish: yes. You can change it by fine tuning it, and the weights allow you to figure out how it works. At least as well as anyone knows how any large neural network works, but it's not like Meta is keeping something from you here

- The freedom to redistribute copies so you can help your neighbor: Allowed, no real asterisks

- The freedom to distribute copies of your modified versions to others: Yes

So is it Free Software™? Not really, but it is pretty close.


The model is "open-source" for the purpose of software engineering, and it's "closed data" for the purpose of AI research. These are separate issues and it's not necessary to conflate them under one term


> it seems more like freeware than open source.

What would you have them do instead? Specifically?


Release the training set and the code that was used to train the model, or stop calling it open source.

If you can't fork it and take the project in your own direction, it's not open source.


They actually did open source the infrastructure library they developed. They don't open source the data but they describe how they gathered/filtered it.


A good point.

Forgive me, I am AI-naive: is there some way to harness Llama to train one's own actually-open AI?


Kinda. Since you can self-host the model on a Linux machine, there's no meaningful way for them to prevent you from having the trained weights. You can use this to bootstrap other models, or retrain on your own datasets, or fine-tune from the starting point of the currently-working model. What you can't do is be sure what they trained it on.
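(To make that concrete: a minimal LoRA fine-tuning sketch over locally hosted weights, assuming the Hugging Face transformers and peft libraries are installed; the model path and hyperparameters are illustrative, not Meta's recipe.)

    # Minimal sketch: attach trainable LoRA adapters to self-hosted Llama weights.
    # Assumes transformers and peft are installed; the local path is hypothetical.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_path = "/models/llama-3.1-8b"   # your local copy of the released weights
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)

    # Freeze the base model and train only small adapter matrices.
    lora = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()

    # From here, any standard training loop (or the transformers Trainer) runs on
    # your own data and your own hardware; nothing phones home to Meta.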


How open is it really though? If you're starting from their weights, do you actually have legal permission to use derived models for commercial purposes? If it turns out that Meta used datasets they didn't have licenses to use in order to generate the model, then you might be in a big heap of mess.


From a legal perspective, yes. If we end up having any legal protection against training AI models, liability will be a huge mess for everyone involved. From an engineering perspective, if all you need is the pretrained weights, there's no clear way Facebook could show up and break your product, as compared to relying on, say, an OpenAI API key rather than a self-hosted Llama instance.


I could be wrong but most “model” licenses prohibit the use of the models to improve other models


That's a good point. I expect it is ultimately unenforceable though. I'm describing training a model for myself, not for sale or public consumption.


Is forking really possible with an LLM, or one the size of future Llama versions? Have they even released the weights and everything? Maybe I am just negative about it because I feel Meta is the worst company ever invented, and I feel this will hurt society in the long run just like Facebook.


> have they even released the weights?

Isn't that what the model is? Just a collection of weights?


When you run `ollama pull llama3.1:70b`, which you can literally do right now (assuming ollama, from ollama.com, is installed and you're not afraid of the terminal), and it downloads a 40 gigabyte model, those are the weights!
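(Once pulled, it serves entirely from your machine; here's a minimal sketch of hitting ollama's local HTTP API from Python, assuming the server is on its default port, with nothing beyond the standard library.)

    # Minimal sketch: query a locally pulled Llama model via ollama's HTTP API.
    # Assumes the ollama server is running locally on the default port 11434.
    import json, urllib.request

    payload = {"model": "llama3.1:70b", "prompt": "Say hello.", "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])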

I'd consider the ability to admit when even your most hated adversary is doing something right, a hallmark of acting smarter.

Now, they haven't released the training data with the model weights. THAT plus the training tooling would be "end to end open source". Apple actually did that very thing recently, and it flew under almost everyone's radar for some reason:

https://x.com/vaishaal/status/1813956553042711006?s=46&t=qWa...


Doing something right vs doing something that seems right but has a hidden self interest that is harmful in the long run can be vastly different things. Often this kind of strategy will allow people to let their guard down, and those same people will get steamrolled down the road, left wondering where it all went wrong. Get smarter.


How in the heck is an open source model that is free and open today going to lock me down, down the line? This is nonsense. You can literally run this model forever if you use NixOS (or never touch your windows, macos or linux install again). Zuck can't come back and molest it. Ever.

The best I can tell is that their self-interest here is more about gathering mindshare. That's not a terrible motive; in fact, that's a pretty decent one. It's not the bully pressing you into their ecosystem with a tit-for-tat; it's the nerd showing off his latest and going "Here. Try it. Join me. Join us."


> How in the heck is an open source model that is free and open today

It's free, but it's not open source.


Yeah because history isn't absolutely littered with examples of shiny things being dangled in front of people with the intent to entrap them /s.

Can you really say this model will still be useful in 2 years, 5 years for you? And that FB's stance on these models will still be open source at that time once they incrementally make improvements? Maybe, maybe not. But FB doesn't give anything away for free, and the fact that you think so is your blindness, not mine. In case you haven't figured it out, this isn't a technology problem, this is a "FB needs marketshare and it needs it fast" problem.


> But FB doesn't give anything away for free, and the fact that you think so is your blindness, not mine

Is it, though? They are literally giving this away "for free". https://dev.to/llm_explorer/llama3-license-explained-2915 Unless you build a service with it that has over 700 million monthly users (read: "problem anyone would love to have"), you do not have to re-negotiate a license agreement with them. Beyond that, it can't "phone home" or do any other sorts of nefarious shite. The other limitations there, which you can plainly read, seem not very restrictive.

Is there a magic secret clause conspiracy buried within the license agreement that you believe will be magically pulled out at the worst possible moment? >..<

Sometimes, good things happen. Sorry you're "too blinded" by past hurt experience to see that, I guess


In tech you can trust the underdogs. Once they turn into dominant players they turn evil. 99% of the cases.


Praising is good. Gratitude is a bit much. They got this big by selling user generated content and private info to the highest bidder. Often through questionable means.

Also, the underdog always touts Open Source and standards, so it’s good to remain skeptical when/if tables turn.


All said and done, it is a very expensive and ballsy way to undercut competitors. They've spent > $5B on hardware alone, much of which will depreciate in value quickly.

Pretty sure the only reason Meta’s managed to do this is because of Zuck’s iron grip on the board (majority voting rights). This is great for Open Source and regular people though!


Zuck made a bet when they provisioned for Reels to buy enough GPUs to be able to spin up another Reels-sized service.

Llama is probably just running on spare capacity (I mean, sure, they've kept increasing capex, but if they're worried about an LLM-based FB competitor, they sort of have to in order to enact their copycat strategy).


At Meta's level, spending $5B to stay competitive is not ballsy. It's a bargain.


Well, he didn't do it to be "nice", you can be sure about that. Obviously they see a financial gain somewhere/sometime


I'm perfectly happy with them draining the life essence out of the people crazy enough to still use Facebook, if they're funneling the profits into advancing human progress with AI. It's an Alfred Nobel kind of thing to do.


It's not often you see a take this bad on HN. Wow!

You are aware Facebook tracks everyone, not just people with Facebook accounts, right? They have a history of being anti-consumer in every sense of the word. So while I can understand where you're coming from, it's just not anywhere close to being reality.

If you want to or not, if you consent or not, Facebook is tracking and selling you.


Oh no! Facebook knows who I am!

No they are not selling me. How can they sell my attention to advertisers when I don't look at their ads? How can they influence me if I don't engage with their algorithm? You're the one who's trying to sell me your fear and mistrust.


>selling user generated content and private info to the highest bidder

That was always their modus operandi, surely. How else would they have survived?

Thanks for returning everyone else's content, and never mind all the content stealing your platform did.


> the AI folks at Meta deserve our praise and gratitude

We interviewed Thomas, who led Llama 2 and 3 post-training, in case you want to hear from someone closer to the ground on the models: https://www.latent.space/p/llama-3


"Come to think of it, the negative impact on such competitors was likely a key motivation for releasing the new models."

"Commoditize Your Complement" is often cited here: https://gwern.net/complement


Makes me wonder why he's really doing this. Zuckerberg being Zuckerberg, it can't be out of any genuine sense of altruism. Probably just wants to crush all competitors before he monetizes the next generation of Meta AI.


It's certainly not altruism. Given that Facebook/Meta owns the largest user data collection systems, any advancement in AI ultimately strengthens their business model (which is still mostly collecting private user data, amassing large user datasets, and selling targeted ads).

There is a demo video that shows a user wearing a Quest VR headset who asks the AI "what do you see", and it interprets everything around it. Then, "what goes well with these shorts"... You can see where this is going. Wearing headsets with AIs monitoring everything the users see and collecting even more data is becoming normalized. Imagine the private data harvesting capabilities of the internet, but anywhere in the physical world. People need not even choose to wear a Meta headset; simply passing a user with a Meta headset in public will be enough to have private data collected. This will be the inevitable result of vision model improvements integrated into mobile VR/AR headsets.


That's very dystopian. It's bad enough having cameras everywhere now. I never opted in to being recorded.


That sounds fantastic. If they make the Meta headset easy to wear and somewhat fashionable (closer to eyeglass than to a motorcycle helmet), I'd take it everywhere and record everything. Give me a retrospective search and conferences/meetings will be so much easier (I am terrible with names).


I wouldn’t even say hi, let alone my name, to someone wearing a Meta headset out in public. And if facial recognition becomes that common for wearers, most of the population is going to adorn something to prevent that. And if it’s at work, I’m not working there and I have to think many would agree. Coworkers don’t and wouldn’t tolerate coworkers taking videos or pictures of them.


This is not how the overwhelming majority of the world works though.

> if facial recognition becomes that common for wearers, most of the population is going to adorn something to prevent that

"Most of the population" is going to be "the wearers".

> Coworkers don’t and wouldn’t tolerate coworkers taking videos or pictures of them.

Here is a fun experience you can try: just hit "record" on every single Teams or Meet meeting you're ever on (or just set recording as the default setting in the app).

See how many coworkers comment on it, let alone protest.

I can tell you from experience (of having been in thousands of hours of recorded meetings in the last 3 years) that the answer is zero.


You are probably right, but that is truly a cyberpunk dystopian situation. A few megacorps will catalog every human interaction and there will be no way to opt out.


Of course, no Hacker News thread is complete without the "I would never shake hands with an Android user" guy who just has to virtue signal.

> And if facial recognition becomes that common for wearers, most of the population is going to adorn something to prevent that

My brother in Christ, you sincerely underestimate how much "most of the population" gives a shit. Most people are being tracked by Google Maps or FindMy, are triangulated with cell towers that know their exact coordinates, and willingly use social media that profiles them individually. The population doesn't even try in the slightest to resist any of it.


[flagged]


> I think you dramatically overestimate the number of people that would actually care about any theoretical privacy infringement

Not really surprised that you don't see it as a problem

> This is a very antiquated view IMO. You are already being filmed and monitored at work.

Not really surprised that you don't see it as a problem


> a privacy-aware secure LLM

Funniest thing I've heard all month.


Do read through the linked article. Not sure how you could make cloud compute more private if you tried; apart from homomorphic encryption.


I really think the value of this for Meta is content generation. More open models (especially state of the art) means more content is being generated, and more content is being shared on Meta platforms, so there is more advertising revenue for Meta.


He's not even pretending it's altruism. Literally about 1/3 of the entire post is the section titled "Why Open Source AI Is Good for Meta". I find it really weird that there are whole debates in threads here about whether it's altruistic when Zuckerberg isn't making that claim in the first place.


All the content generated by llms (good or bad) is going to end up back in Facebook/Instagram and other social media sites. This enables Meta to show growth and therefore demand a higher stock price. So it makes sense to get content generation tools out there as widely as possible.


Zuckerberg didn't really say anything about altruism. The point he was making is an explicit "I believe open models are best for our business"

He was clear in that one of their motivations is avoiding vendor lockin. He doesn't want Meta to be under the control of their competitors or other AI providers.

He also recognizes the value brought to his company by open sourcing products. Just look at React, PyTorch, and GraphQL. All industry standards, and all brought tremendous value to Facebook.


You can always listen to the investor calls for the capitalist point of view. In short, attracting talent, building the ecosystem, and making it really easy for users to make stuff they want to share on Meta's social networks


He addresses this pretty clearly in the post. They don't want to be beholden to other companies to build the products they want to build. Their experience being under Apple's thumb on mobile strongly shaped this point of view.


Don't be fooled, it is an "embrace, extend, extinguish" strategy. Once they have enough usage and are the default standard, they will start to find any possible way to make you pay.


Credit where due: Facebook didn't do that with React or PyTorch. Meta will reap benefits for sure, but they don't seem to be betting on selling the model itself; rather, they will benefit from being at the forefront of a new ecosystem.


Hasn't really happened with PyTorch or any of their other open sourced releases tbh.


There's nothing open source about it.

It's a proprietary dump of data you can't replicate or verify.

What were the sources? What datasets was it trained on? What are the training parameters? And so on and so on.


> they have brought the cost of pretraining a frontier 400B+ parameter model to ZERO

It is still far from zero.


If the model is already pretrained, there's no need to pretrain it, so the cost of pretraining is zero.


Yeah but you only have the one model, and so far it seems to be only good on paper.


So far, it seems like this release has done ~nothing to the stock price for GOOGL/MSFT, which we all know has been propped up largely on the basis of their AI plans. So it's probably premature to say that this has messed it up for them.


> We’re releasing Llama 3.1 405B

Is it possible to run this with ollama?


If you have the RAM for it.

Ollama will offload as many layers as it can to the GPU, then the rest will run on the CPU/RAM.


Sure, if you have an H100 cluster. If you quantize it to int4 you might get away with using only 4 H100 GPUs!
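(Rough weights-only arithmetic, ignoring KV cache and runtime overhead, which is why the quantized figure is a lower bound:)

    # Back-of-the-envelope VRAM estimate for a 405B-parameter model (weights only).
    params = 405e9
    bytes_per_param = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

    for fmt, b in bytes_per_param.items():
        gb = params * b / 1e9
        print(f"{fmt}: ~{gb:.0f} GB of weights -> ~{gb / 80:.1f} x 80 GB H100s")
    # fp16 is ~810 GB (>10 H100s); int4 is ~203 GB of weights, which is why
    # ~4 x 80 GB H100s becomes plausible once KV cache and overhead are added.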


Assuming $25k a pop, that's at least $100k for the GPUs alone. Throw in their linking technology (NVLink) and the cost of the remaining parts, and I wouldn't be surprised if you're looking at $150k for such a cluster. Which is not bad to be honest, for something at this scale.

Can anyone share the cost of the pre-built clusters they've recently started selling? (Sorry, feeling lazy to research atm; I might do that later when I have more time.)


You can rent H100 GPUs.


You're about right.

https://smicro.eu/nvidia-hgx-h100-640gb-935-24287-0001-000-1

8x H100 HGX cluster for €250k + VAT


If you want your first token around lunchtime tomorrow, sure.


>> Bravo! While I don't agree with Zuck's views and actions on many fronts, on this occasion I think he and the AI folks at Meta deserve our praise and gratitude.

Nope. Not one bit. Supporting F/OSS when it suits you in one area and then being totally dismissive of it in every other area should not be lauded. How about open sourcing some of FB's VR efforts?


They open sourced Gear VR. They are a stark contrast to other players in terms of how they have built everything on open standards (OpenXR, WebXR, etc.), and they have just opened their platform by allowing third parties to build on and customise it to make their own commercial offerings. Not open source, but quite a contrast to every other player in that industry so far.


I've summarized this entire thread in 4 lines (didn't even use AI for it!)

Step 1. Chick-Fil-A releases a grass-fed beef burger to spite other fast-food joints, calls it "the vegan burger"

Step 2. A couple of outraged vegans show up in the comments, pointing out that beef, even grass-fed beef, isn't vegan

Step 3. Fast food enthusiasts push back: it's unreasonable to want companies to abide by this restrictive definition of "vegan". Clearly this burger is a gamechanger and the definition needs to adapt to the times.

Step 4. Goto Step 2 in an infinite loop


Open source software is one of our best and most passionately loved inventions. It'd be much easier to have a nuanced discussion about "open weights" but I don't think that's in Facebook's interest.


More like vegetarians show up claiming to be vegans, then vegans show up and explain why eating animal products is still wrong.

That's the difference between open source and free software.


Yeah the moral step up from the status quo is still laudable. Open weights are still much improved over the closed creepy spy agency clusterfucks that OpenAI/Microsoft/Google/Apple are bringing to the table.


On point, and pretty good analogy


Software 2.0 is about open licensing.

I.e., the more important thing - the more "free" thing - is the licensing now.

E.g., I play around with different image diffusion models like Stable Diffusion and specific fine-tuned variations for ControlNet or LoRA that I plug into ComfyUI.

But I can't use it at work because of the licensing. I have to use InvokeAI instead of ComfyUI if I want to be careful, and use only very specific image diffusion models without the latest and greatest fine-tuning. As others have said, the weights themselves are rather inscrutable. So we're building on more abstract shapes now.

But the key open thing is making sure (1) the tools to modify the weights are open and permissive (ComfyUI, related scripts or parts of both the training and deployment) and (2) the underlying weights of the base models and the tools to recreate them have MIT or other generous licensing. As well as the fine-tuned variants for specific tasks.

It's not going to be the naive construction in the future where you take a base model and as company A you produce company A's fine tuned model and you're done.

It's going to be a tree of fine-tuned models as a node-based editor like ComfyUI already shows and that whole tree has to be open if we're to keep the same hacker spirit where anyone can tinker with it and also at some point make money off of it. Or go free software the whole way (i.e., LGPL or equivalent the whole tree of tools).

In that sense unfortunately Llama has a ways to go to be truly open: https://news.ycombinator.com/item?id=36816395


In the LLM world there are many open source solutions for fine-tuning, maybe the best one being from Meta: https://github.com/pytorch/torchtune

In terms of inference and interface (since you mentioned comfy) there are many truly open source options such as vLLM (though there isn't a single really performant open source solution for inference yet).
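(For example, a minimal vLLM sketch for local offline inference; the model id and sampling settings are just illustrative, and you'd need hardware that fits the weights.)

    # Minimal sketch: offline inference with vLLM on locally available weights.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")  # illustrative model id
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain what 'open weights' means in one sentence."], params)
    print(outputs[0].outputs[0].text)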


Thanks! Good to know.


The LLAMA 3.1 license addresses some of this.


> This is how we’ve managed security on our social networks – our more robust AI systems identify and stop threats from less sophisticated actors who often use smaller scale AI systems.

Ok, first of all, has this really worked? AI moderators still can't capture the mass of obvious spam/bots on all their platforms, Threads included. Second, AI detection doesn't work, and with how much better the systems are getting, it's probably never going to, unless you keep the best models for yourself, and it is clear from the rest of the note that it's not Zuck's intention to do so.

> As long as everyone has access to similar generations of models – which open source promotes – then governments and institutions with more compute resources will be able to check bad actors with less compute.

This just doesn't make sense. How are you going to prevent AI spam, AI deepfakes from causing harm with more compute? What are you gonna do with more compute about nonconsensual deepfakes? People are already using AI to bypass identity verification on your social media networks, and pump out loads of spam.


"AI detection doesn't work, and with how much better the systems are getting, it's probably never going to, unless you keep the best models for yourself"

I don't think that's true. I don't think even the best privately held models will be able to detect AI text reliably enough for that to be worthwhile.


I found this dubious as well, especially how it is portrayed as a simple game of compute power. For a start, there is an enormous asymmetry which is why we have a spam problem in the first place. For example a single bot can send out millions of emails at almost no cost and we have to expend a lot more "energy" to classify each one and decide if it's spam or not. So you don't just need more compute power you need drastically more compute power, and as AI models improve and get refined, the operation at ten times the scale is probably going to be marginally better, not orders of magnitude better.

I still agree with his general take - bad actors will get these models or make them themselves, you can't stop it. But the logic about compute power is odd.


Interesting quotes. Less sophisticated actors just means humans who already write in 2020 what the NYT wrote in early 2022 to prepare for Biden's State Of The Union 180° policy reversals (manufacturing consent).

FB was notorious for censorship. Anyway, what is with the "actions/actors" terminology? This is straightforward totalitarian language.


This is really good news. Zuck sees the inevitability of it and the dystopian regulatory landscape and decided to go all in.

This also has the important effect of neutralizing the critique of US Government AI regulation because it will democratize "frontier" models and make enforcement nearly impossible. Thank you, Zuck, this is an important and historic move.

It also opens up the market to a lot more entry in the area of "ancillary services to support the effective use of frontier models" (including safety-oriented concerns), which should really be the larger market segment.


Unfortunately, there are a number of AI safety people that are still crowing about how AI models need to be locked down, with some of them loudly pivoting to talking about how open source models aid China.

Plus there's still the spectre of SB-1047 hanging around.


Probably, Yann Lecun is the Lord Varys here. He has Mark's ear and Mark believes in Yann's vision.


The "open source" part sounds nice, though we all know there's nothing particularly open about the models (or their weights). The barriers to entry remain the same - huge upfront investments to train your own, and steep ongoing costs for "inference".

Is the vision here to treat LLM-based AI as a "public good", akin to a utility provider in a civilized country (taxpayer funded, govt maintained, not-for-profit)?

I think we could arguably call this "open source" when all the infra blueprints, scripts, and configs are freely available for anyone to try and duplicate the state-of-the-art (resource and grokking requirements notwithstanding).


Check out the paper, it's pretty comprehensive: https://ai.meta.com/research/publications/the-llama-3-herd-o...


Sure, but under what license? Because slapping "open source" on the model doesn't make it open source if it's not actually licensed that way. The 3.1 license still contains their 700M-monthly-active-user commercial restriction and requires derivatives, whether fine-tunes or models trained on generated data, to use the Llama name.


"Use it for whatever you want(conditions apply), but not if you are Google, Amazon, etc. If you become big enough talk to us." That's how I read the license, but obviously I might be missing some nuance.


You also can't use it for training or improving other models.

You also can't use it if you're the government of India.

Neither can sex workers use it. (Do you know if your customers are sex workers?)

There are also very vague restrictions for things like discrimination, racism etc.


They're actually updating their license to allow LLAMA outputs for training!

https://x.com/AIatMeta/status/1815766335219249513


> You also can't use it if you're the government of India.

Why is that?


I'm also very curious. On the other hand I don't want governments touching AI models either


Also it isn't source code, it is a binary. You need at least the data curation code and preferably the data itself for it to be actually source code in the practical sense that anyone can remake the build.

Meta could change the license on later Llama versions to kill your business, and you'd have no options, as you don't know how they trained it and don't have the budget to retrain it yourself.

It's not much more free than binary software.


Interesting discussion! While I agree with Zuckerberg's vision, the comments raise valid concerns. The point about GPU accessibility and cost is crucial. Public clusters are great, but sustainable funding and equitable access are essential to avoid exacerbating existing inequalities. I also resonate with the call for CUDA alternatives. Breaking the dependence on proprietary technology is key for a truly open AI ecosystem. While existing research clusters offer some access, their scope and resources often pale in comparison to what companies like Meta are proposing. We need a multi-pronged approach: open-sourcing models AND investing in accessible infrastructure, diverse hardware options, and sustainable funding models for a truly democratic AI future.


3nm chip fabs take years to build. You don't just go to AWS and spin one up. This is the very hard part about AI that breaks a lot of the usual tech assumptions. We have entered a world where suddenly there isn't enough compute, because it's just too damn hard to build capacity and that's different from the past 40 years.


I suspect we are still early in the optimization evolution. The weights are what matter. The ability to run them anywhere might come.


The training datasets and methodology are what matters. None of that is disclosed by anyone


> Third, a key difference between Meta and closed model providers is that selling access to AI models isn’t our business model. That means openly releasing Llama doesn’t undercut our revenue, sustainability, or ability to invest in research like it does for closed providers. (This is one reason several closed providers consistently lobby governments against open source.)

The whole thing is interesting, but this part strikes me as potentially anticompetitive reasoning. I wonder what the lines are that they have to avoid crossing here?


>> ...but this part strikes me as potentially anticompetitive reasoning.

"Commoditize your complements" is an accepted strategy. And while pricing below cost to harm competitors is often illegal, the reality is that the marginal cost of software is zero.


Spending a very large, quantifiable amount of money to release something your nominal competitors charge for, without having your own direct business case for it, seems a little much.


Companies spend very large amounts of money on all sorts of things that never even get released. Nothing wrong with releasing something for free that no longer costs you anything. Who knows why they developed it in the first place, it makes no difference.


Llama isn't open source. The license is at https://llama.meta.com/llama3/license/ and includes various restrictions on use, which means it falls outside the rules created by the https://opensource.org/osd


Additional Commercial Terms. If, on the Llama 3.1 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

Which open-source license has such restrictions and clauses?


Which open source project cost the dedicated use of 16 thousand H100s over several months?

C'mon folks, they're opening up for free, to 99.99% of potential users, what cost hundreds of millions of dollars, if not in the ballpark of a billion.

Let's appreciate that for a while, instead of focusing on semantics.


Licensing is not a simple semantic problem. It is a legal problem with strong ramifications, especially as things are on their way to becoming standardized. What Facebook is trying to do with its "open source" models is exhaust the possibility of fully open source models becoming industry standards, and create an alternative monopoly to Microsoft/OpenAI. Think of it as if an entity held the rights to ISO standards: it would be extremely rich. Eventually researchers will release quite advanced ML models that are fully open source (from dataset to training code), and Facebook is trying to block them before they even start, to head off the possibility of those models becoming the standard. This is a tactic that complements the industry's closed source players and should not be understood as a challenge to them.

A good term for this is "open-washing", as described in this paper: https://dl.acm.org/doi/fullHtml/10.1145/3630106.3659005


I don't think the largest tech companies in the world have earned that view of benevolence. It's really hard to take altruism seriously when it is coming from Zuckerberg.


Just call it something that isn't "open source"? Maybe "freemium"?


Open source "AI" is a proxy for democratising and making (much) more widely useful the goodies of high performance computing (HPC).

The HPC domain (data- and compute-intensive applications that typically need vector, parallel, or other such architectures) has been around for the longest time, but has been confined to academic/government tasks.

LLMs, with their famous "matrix multiply" at their very core, are basically demolishing an ossified frontier where a few commercial entities (Intel, Microsoft, Apple, Google, Samsung, etc.) have defined for decades what computing looks like for most people.

Assuming that the genie is out of the bottle, the question is: what is the shape of end-user devices that are optimally designed to use compute-intensive open source algorithms? The "AI PC" is already a marketing gimmick, but could it be that Linux desktops and smartphones will suddenly be "AI natives"?

For sure it's a transformational period, and the landscape at T+10 yrs could be drastically different...


Unfortunately it is barely more open source than Windows. Llama 3 weights are binary code and while the license is pretty good it isn't open source.


The FTC also recently put out a statement that is fairly pro-open source: https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/202...

I think it's interesting to think about this question of open source, benefits, risk, and even competition, without all of the baggage that Meta brings.

I agree with the FTC, that the benefits of open-weight models are significant for competition. The challenge is in distinguishing between good competition and bad competition.

Some kind of competition can harm consumers and critical public goods, including democracy itself. For example, competing for people's scarce attention or for their food buying, with increasingly optimized and addictive innovations. Or competition to build the most powerful biological weapons.

Other kinds of competition can massively accelerate valuable innovation. The FTC must navigate a tricky balance here — leaning into competition that serves consumers and the broader public, while being careful about what kind of competition it is accelerating that could cause significant risk and harm.

It's also obviously not just "big tech" that cares about the risks behind open-weight foundation models. Many people have written about these risks even before it became a subject of major tech investment. (In other words, A16Z's framing is often rather misleading.) There are many non-big tech actors who are very concerned about current and potential negative impacts of open-weight foundation models.

One approach which can provide the best of both worlds, is for cases where there are significant potential risks, to ensure that there is at least some period of time where weights are not provided openly, in order to learn a bit about the potential implications of new models.

Longer-term, there may be a line where models are too risky to share openly, and it may be unclear what that line is. In that case, it's important that we have governance systems for such decisions that are not just profit-driven, and which can help us continue to get the best of all worlds. (Plug: my organization, the AI & Democracy Foundation; https://ai-dem.org/; is working to develop such systems and hiring.)


making food that people want to buy is good actually

i am not down with this concept of the chattering class deciding what are good markets and what are bad, unless it is due to broad-based and obvious moral judgements.


Except for the 90% of the food on supermarket shelves out there, which is packed with sugar and preservatives.


In general I look back on my time at FB with mixed feelings, I’m pretty skeptical that modern social media is a force for good and I was there early enough to have moved the needle.

But this is really positive stuff and it’s nice to view my time there through the lens of such a change for the better.

Keep up the good work on this folks.

Time to start thinking about opening up a little on the training data.


Who knew FB would hold OpenAI's original ideals, and OpenAI now holds early FB ideals/integrity.


FB needed to differentiate drastically. FB is at its best creating large data infra.


Mark Zuckerberg was attacked by the media when it suited their tech billionaire villain narrative. Now there's Elon Musk so Zuckerberg gets to be on the good side again


Meta's article with more details on the new LLAMA 3.1 https://ai.meta.com/blog/meta-llama-3-1/


The irony of this letter being written by Mark Zuckerberg at Meta, while OpenAI continues to be anything but open, is richer than anyone could have imagined.


Interview with Mark Zuckerberg released today: https://www.bloomberg.com/news/videos/2024-07-23/mark-zucker...


Meanwhile Facebook is flooded with AI-generated slop with hundreds of thousands of other bots interacting with it to boost it to whoever is insane enough to still use that putrid hellhole of a mass-data-harvesting platform.

Dead internet theory is very much happening in real time, and I dread what's about to come since the world has collectively decided to lose their minds with this AI crap. And people on this site are unironically excited about this garbage that is indistinguishable from spam getting more and more popular. What a fucking joke


The feed certainly is, but I suspect most activity left on Facebook is happening in group pages. Groups are the only thing I still log in for as some of them, particularly the local ones, have no other way of taking part. They are also commonly membership by request and actively moderated. If I had the time (and energy) I might put some effort into advocating to moving to something else, but it will be an uphill battle.


It is probably a mix of people who have nowhere else to interact with people, and people using Groups. Facebook was where you'd go to talk to all your friends and family; most of my friends have been getting shadowbanned since 2012 or so, which made me use it less. I got auto-striked on my account for making a meme joke about burning a house down due to a giant spider in a video. I appealed, and it got denied. I'm not using a platform that will inadvertently ban me by AI. But the people actually posting to kill others, and actually burn shit down, and bots stay just fine?

Plus I didn't want to risk my employers Facebook App being in limbo if I got banned, so I left Facebook alone, never to return.

Facebook trying to police the world is the only thing keeping me away, if I can use the platform and post meme comments again, maybe I might reconsider, but I doubt it. Reddit is in a similar boat. You can get banned, but all the creepy pedophile comments from decades and recently are still up no problem.


> But the people actually posting to kill others, and actually burn shit down ...

That kind of burning down is classified as "mostly peaceful" by mainstream and AI.


I stopped going on facebook a few years ago and don't miss it; I don't even need messenger as everyone migrated to whatsapp (yes I know, normal people don't want to move to signal, but got quite a few techy friends to migrate). The FB-only groups are indeed a problem, I'm delegating them to my wife.

IF I ever had to go to FB for anything, I'd probably install a wall-removing browser extension. Mobile app is of course out of question.


> IF I ever had to go to FB for anything, I'd probably install a wall-removing browser extension. Mobile app is of course out of question.

You’ll probably find you can no longer make an account. I’m in the same boat as you (not used and haven’t missed in over a decade), however, my partner needed an account to manage an ad campaign for a client and neither of us were able to make one. Both tried a load of different things and, ultimately, gave up. Had to tell the client what they needed over a video call


I just tried making one after reading your comment, and it was... pretty straightforward? I'm curious what blockers you encountered


Must be something to do with our situation? Basically both I and my partner just got told we were making fraudulent accounts. Used our real names, emails, phone numbers and same result. Used multiple other phone numbers to see if they were blacklisted for some reason, nope. Used my mother’s WiFi in case it was something to do with IP, nope. Tried from India with an Indian phone, nope.


For me it's the Marketplace. Left FB many years ago only to come back to keep an eye out for used Lego for the kiddos. At least in my region, and for my purposes, Marketplace is miles better than any other competing sites/apps.


Same here, Groups + Marketplace are actually a wealth of information. There are still a few dark patterns, but mostly manageable for a "free" platform.

OP's comments read like we're describing something the SS built (Godwin says hi).


I'd assume that any platform which gets sufficiently popular will become a bot and AI content target...


> If I had the time (and energy) I might put some effort into advocating to moving to something else, but it will be an uphill battle.

What are the alternatives for local groups? I've recently seen an increase in the amount of Discourse forums available, which is nice, but I don't think it'd be very appealing to the average cycling or hiking group.


The challenges of moving to alternative solutions


The irony of a bot account sliding into a convo about internet slop is not lost.


How do you know?


The comment history does read much like you'd expect from a bot, lots of short, generic statements that vaguely tie to the subject of the post


There was no need to scroll through my history and analyze it. Your analysis is mistaken. I'm not a bot. Sometimes my comments are brief because I'm trying to be concise. I don't just express my agreement with a simple 'I agree', but respond in a more detailed manner yet also shortly.


They're brief but they also sound so bot-ish in the way they're written, I'm not the only one to have pointed that out in the thread


Yes, there are four of you here. And this is the first time I’ve encountered something like this. I hope I won’t trouble you with my answers anymore!


No trouble, you just have a different writing style!


And I don't understand why people lump all AI together as if a coding assistant is the same thing as AI generated spam and other garbage. I'm pretty sure no-one here is excited about that.

I'm excited about the former since AI has massively improved my productivity as a programmer to a point where I can't imagine going back. Everything is not black or white and people can be excited about one part of something and hate another at the same time.


Seeing some of the code my colleagues are shitting out with the help of coding "assistants", I would definitely categorize their contributions as spam, and it has had nothing but an awful effect on my own time and energy, having to sift through the unfiltered crap. The problem being, of course, that the idiotic C-suite in their infinite wisdom decided to push "use AI assistants" as a KPI, so people are even encouraged to spam PRs with terrible code.

If this is what productivity looks like then I'm proud to be unproductive.


I'm sorry that you work at a dysfunctional company.


fear is why


That's actually the preferred outcome. The open internet noise ratio will be so high that it turns into pure randomness. The traditional old venues (reputed blogs, small communities forums, pay for valued information, pay for your search, etc..) will resurface again. The Popular Web has been in a slow decline, time to kill it.


> The traditional old venues […] will resurface again.

… to be subsequently drowned out by AI “copies” of themselves, which in turn are used to train more AIs, until we don't have a Dead Internet¹ but a Habsburg Internet.

--

[1] https://en.wikipedia.org/wiki/Dead_Internet_theory


Is it really a decline? If people are looking for and consuming the slop, where is the issue?

There is still plenty high quality stuff too if that is what you’re looking for. If you want to roll with the pigs in the shit, who am I to tell you no?


My concern is that these platforms will soon sell Human Created (tm) content back to us.


I get your frustration with a scorched internet. But I don't think it's all that gloomy. Whether we like it or not, LLMs and some kind of "down-to-earth AI" are here to stay, once the dust settles. Right now, it feels like everything is burning because we're in the thick of an evolving situation, and the Silicon Valley tech-bros are in a hurry to ride the wave and make a quick buck with their ChatGPT wrappers. (I can't speak to social networks; I haven't had any accounts for 10+ years, except for HN.)

    * * *
On "collective losing of minds", you might appreciate this quote from 1841 (!) by Charles MacKay. I quoted it in the past[1] here, but is worth re-posting:

"In reading the history of nations, we find that, like individuals, they have their whims and their peculiarities; their seasons of excitement and recklessness, when they care not what they do. We find that whole communities suddenly fix their minds upon one object, and go mad in its pursuit; that millions of people become simultaneously impressed with one delusion, and run after it, till their attention is caught by some new folly more captivating than the first [...]

"Men, it has been well said, think in herds; it will be seen that they go mad in herds, while they only recover their senses slowly, and one by one."

— from MacKay's book, 'Extraordinary Popular Delusions and the Madness of Crowds'

[1] https://news.ycombinator.com/item?id=25767454