I really wish people would use "open weights" rather than "open source". It's precise and obvious, and leaves an accurate descriptor for actual "open source" models, where the source and methods that generate the artifact (that is, the weights) are open.
It's not precise. People who want to use "open weights" instead of "open source" are focusing on the wrong thing.
The weights are, for all practical purposes, source code in their own right. The GPL defines "source code" as "the preferred form of the work for making modifications to it". Almost no one would be capable of reproducing them even if given the source + data. At the same time, the weights are exactly what you need for the one type of modification that's within reach of most people: fine-tuning. That they didn't release the surrounding code that produced this "source" isn't that much different than a company releasing a library but not their whole software stack.
I'd argue that "source" vs "weights" is a dangerous distraction from the far more insidious word in "open source" when used to refer to the Llama license: "open".
The Llama 3.1 license [0] specifically forbids its use by very large organizations, by militaries, and by nuclear industries. It also contains a long list of forbidden use cases. This specific list sounds very reasonable to me on its face, but having a list of specific groups of people or fields of endeavor who are banned from participating runs counter to the spirit of open source and opens up the possibility that new "open" licenses come out with different lists of forbidden uses that sound less reasonable.
To be clear, I'm totally fine with them having those terms in their license, but I'm uncomfortable with setting the precedent of embracing the word "open" for it.
Llama is "nearly-open source". That's good enough for me to be able to use it for what I want, but the word "open" is the one that should be called out. "Source" is fine.
Do the costs really matter here? "Weights" are "the preferred form of the work for making modifications to it" in the same sense compiled binary code would be, if for some reason no one could afford to recompile a program from sources.
Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain going at the executable with a hex editor. Just because that's all anyone except the model vendor is able to do doesn't merit calling the models "open source", much like no one would call binary-only software "open source" just because reverse engineering is a thing.
No, the weights are just artifacts. The source is the dataset and the training code (and possibly the training parameters). This isn't fundamentally different from running an advanced solver for a year, to find a way to make your program 100 bytes smaller so it can fit on a Tamagotchi. The resulting binary is magic, can't be reproduced without spending $$$$ on compute for the solver, but it is not open source. The source code is the bit that (produced the original binary that) went into the optimizer.
Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.
[1] - https://en.wikipedia.org/wiki/Trainer_(games) - a type of program popular some 20 years ago, used to cheat at, or mod, single-player games by keeping track of and directly modifying the memory of the game process. Could be as simple as continuously resetting the ammo counter, or as complex as injecting assembly to add new UI elements.
> Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain going at the executable with a hex editor.
No, because fine tuning is basically just a continuation of the same process that the original creators used to produce the weights in the first place, in the same way that modifying source code directly is in traditional open source. You pick up where they left off with new data and train it a little bit (or a lot!) more to adapt it to your use case.
The weights themselves are the computer program. There exists no corresponding source code. The code you're asking for corresponds not to the source code of a traditional program but to the programmers themselves and the processes used to write the code. Demanding the source code and data that produced the weights is equivalent to demanding a detailed engineering log documenting the process of building the library before you'll accept it as open source.
Just because you can't read it doesn't make it not source code. Once you have the weights, you are perfectly capable of modifying them following essentially the same processes the original authors did, which are well known and well documented in plenty of places with or without the actual source code that implements that process.
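To make the "fine-tuning is the modification" point concrete, here is a rough sketch of a LoRA fine-tune using Hugging Face transformers/peft. The model id, data file, and hyperparameters are placeholders and this is only a minimal illustration, not a recommended recipe.

```python
# Rough LoRA fine-tuning sketch (transformers + peft). Model id, data file and
# hyperparameters are placeholders; access to the gated Llama repo is assumed.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# LoRA freezes the released weights and trains small adapter matrices next to
# them -- "picking up where they left off" rather than rebuilding from scratch.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                                         target_modules=["q_proj", "v_proj"]))
model.print_trainable_parameters()  # typically well under 1% of the total parameters

data = load_dataset("json", data_files="my_domain_data.jsonl")["train"]  # your own data
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
                batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama31-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama31-lora")  # saves only the small adapter weights
```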
> Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.
I agree wholeheartedly, but not because of "source". The sleight of hand is getting people to focus on that instead of the really problematic word.
The thing is, the core of the GPT architecture is like 40 lines of code. Everyone knows what the source code is basically (minus optimizations). You just need to bring your own 20TB of data, 100k GPUs, and tens of millions in power budget, and you too can train Llama 405B.
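For a rough sense of what "like 40 lines" means, here is an unoptimized toy sketch of the decoder-only core in PyTorch. The real Llama architecture differs in the details (RMSNorm, rotary embeddings, SwiGLU MLPs, grouped-query attention), but the point stands that the code itself is small; what's hard to come by is the data and the compute.

```python
# Toy decoder-only transformer core (illustrative only; real Llama uses RMSNorm,
# rotary embeddings, SwiGLU MLPs and grouped-query attention, but is similarly small).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        t = x.size(1)
        # Causal mask: each position may only attend to itself and earlier tokens.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

class TinyGPT(nn.Module):
    def __init__(self, vocab, d_model=512, n_heads=8, n_layers=6, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.blocks = nn.Sequential(*[Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, idx):
        x = self.tok(idx) + self.pos(torch.arange(idx.size(1), device=idx.device))
        return self.head(self.ln(self.blocks(x)))

logits = TinyGPT(vocab=32000)(torch.randint(0, 32000, (1, 16)))  # -> (1, 16, 32000)
```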
If I understood the article correctly, he intends to let the community make suggestions to selected developers who work on the source somehow. So maybe part of the source will be made visible.
It's not open source. Your definition would make most video games open source - we modify them all the time. The small runtime framework IS open source, but that's not much benefit, as you can't really modify it hugely because the weights fix it to an implementation.
> Your definition would make most video games open source - we modify them all the time.
No, because most video games aren't licensed in a way that makes that explicitly authorized, nor is modding the preferred form of the work for making modifications. The video game has source code that would be more useful, the model does not have source code that would be more useful than the weights.
Is there any other case where "open source" is used for something that can't be reproduced? Seems like a new term is required, something like "open source, non-reproducible artifacts".
I suppose language changes. I just prefer it changes towards being more precise, not less.
But games like Quake are not "open source". They have been open-sourced, specifically the executable parts were, without the assets. This is usually spelled out clearly as the process happens.
In terms of functional role, if we're to compare the models to open-sourced games, then all that's been open-sourced is the trivial[0] bit of code that does the inference.
Maybe a more adequate comparison would be a SoC running a Linux kernel with a big NVidia or Qualcomm binary blob in the middle of it? Sure, the Linux kernel is open source, but we wouldn't call the SoC "open source", because all that makes it what it is (software-side) is hidden in a proprietary binary.
--
[0] - In the sense that there's not much of it, and it's possible to reproduce from papers.
Academia - nowadays source code is required by a lot of conferences, but the datasets, depending on where/how they were obtained, often can't be shared or aren't available, and the exact results can't be reproduced.
Not sure if the code is required under an open source license, but it's the same issue.
---
IMO, source is source and can be used for other datasets. If the dataset isn't available, bring your own.
In this case, the source is there. The output is there, and not technically required. What isn't available is the ability to confirm the output comes from that source. That's not required under open source though.
What's disingenuous is the output being called 'open source'.
A dataset very much is the source code. It's the part that gets turned into the program through an automated process (training is equivalent to compilation).
> where the source and methods that generate the artifact (that is, the weights) are open.
When you require the same thing in software - namely, that the whole stack needed to run the software in question be open source - we don't call the license open source.
Nope. Those model releases only open source the equivalent of "run.bat" that does some trivial things and calls into a binary blob. We wouldn't call such a program "open source".
Hell, in the case of the models, "the whole stack to run the software" already is open source. Literally everything except the actual sources - the datasets and the build scripts (code doing the training) - is available openly. This is almost a literal inverse of "open source", thus shouldn't be called "open source".
Training a model is like automatic programming, and the key to it is having a well-organized dataset.
If some "open source" model just has the model and training methods but no dataset, it's like a repo that released an executable file with a detailed design doc. Where is the source code? Do it yourself, please.
NOTE: I understand the difficulty of open-sourcing datasets. I'm just saying that the term "open source" is getting diluted.
Super cool, though sadly 405B will be out of reach for most personal use without cloud providers, which sorta defeats the purpose of open source to some extent, because Nvidia's ramp-up of consumer VRAM is glacial.
Zoom out a bit. There’s a massive feeder ecosystem around llama. You’ll see many startups take this on and help drive down inference costs for everyone and create competitive pressure that will improve the state of the art.
I agree that 405B isn't practical for home users, but I disagree that it defeats the purpose of open source. If you're building a business on inference it can be valuable to run an open model on hardware that you control, without the need to worry that OpenAI or Anthropic or whoever will make drastic changes to the model performance or pricing. Also, it allows the possibility of fine-tuning the model to your requirements. Meta believes it's in their interest to promote these businesses.
I'd think of the 405B model as the equivalent to a big rig tractor trailer. It's not for home use. But also check out the benchmark improvements for the 70B and 8B models.
If you think of open source as a protocol through which the ecosystem of companies loosely collaborate, then it's a big deal. E.g. Groq can work on inference without complicated negotiations with Meta. Ditto for Huggingface, and smaller startups.
I agree with you on open source in the original, home tinkerer sense.
I have the 70b model running quantized just fine on an M1 Max laptop with 64GiB unified RAM. Performance is fine and so far some Q&A tests are impressive.
This is good enough for a lot of use cases... on a laptop. An expensive laptop, but hardware only gets better and cheaper over time.
Just for reference, the current version of that laptop costs 4800€ (14 inch macbook pro, m3 max, 64gb of ram, 1TB of storage). So price-wise that is more like four laptops.
I think they were referring to the form factor not the price. But even then the price of four laptops is not out of line for enthusiast hobby spending.
Ever priced out a four wheeler, a jet-ski, a filled gun safe, what a "car guy" loses in trade in values every two years, what a hobbyist day-trader is losing before they cut their losses or turn it around, or what a parent who lives vicariously through their child and drags them all over their nearby states for overnight trips so they can do football/soccer/ballet/whatever at 6am on Saturdays against all the other kids who also won't become pro athletes? What about the cost of a wingsuit or getting your pilot's license? "Cruisers" or annual-Disney vacationers? If you bought a used CNC machine from a machine shop? But spend five grand on a laptop to play with LLMs and everyone gets real judgmental.
I don't have the hardware to confirm this, so I'd take it with a grain of salt, but ChatGPT tells me that a maxed out M3 MacBook Pro with 128 GB RAM should be capable of efficiently running Llama 3.1 405B, albeit with essentially no ability to multitask.
(It also predicted that a MacBook Air in 2030 will be able to do the same, and that for smartphones to do the same might take around 20 years.)
I’ve run the Falcon 180B on my M3 Max with 128 GB of memory. I think I ran it at 3-bit. Took a long time to load and was incredibly slow at generating text. Even if you could load the Llama 405B model it would be too slow to be of much use.
Ah, that's a shame to hear. FWIW, ChatGPT did also suggest that there was a lot of room for improvement in the MPS backend of PyTorch that would likely make it more efficient on Apple hardware in time.
You fundamentally misunderstand the bottleneck of large LLMs. It is not really possible to make gains that way.
A 405B LLM has 405 billion parameters. If you run it at full precision, each parameter takes up 2 bytes, which means you need 810GB of memory. If it does not fit in RAM or GPU memory it will swap to disk and be unusably slow.
You can run the model at reduced precision to save memory, called quantisation, but this will degrade the quality of the response. The exact amount of degradation depends on the task, the specific model and its size. Larger models seem to suffer slightly less. 1 byte per parameter is pretty much as good as full precision. 4 bits per parameter is still good quality, 3 bits is noticeably worse and 2 bits is often bad to unusable.
With 128GB of RAM, zero overhead and a 405B model, you would have to quantize to about 2.5 bits, which would noticeably degrade the response quality.
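A back-of-the-envelope version of that arithmetic (weights only; the KV cache and runtime overhead come on top):

```python
# Weight memory for a 405B-parameter model at different precisions (weights only;
# the KV cache, activations and runtime overhead are extra).
params = 405e9
for name, bits in [("fp16/bf16", 16), ("int8/fp8", 8), ("4-bit", 4), ("3-bit", 3)]:
    print(f"{name:>9}: {params * bits / 8 / 1e9:5.0f} GB")

# Conversely, the precision that fits a given memory budget:
budget_gb = 128
print(f"bits/param that fit in {budget_gb} GB: {budget_gb * 1e9 * 8 / params:.2f}")  # ~2.5
```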
There is also model pruning, which removes parameters completely, but this is much more experimental than quantisation, also degrades response quality, and I have not seen it used that widely.
I appreciate the additional information, but I'm not sure what you're claiming is a fundamental misunderstanding on my part. I was referring to running the model with quantization, and was clear that I hadn't verified the accuracy of the claims.
The comment about the MPS PyTorch backend was related to performance, not whether the model would fit at all. I can't say whether it's accurate that the MPS backend has significant room for optimization, but it is still publicly listed as in beta.
Yes my mistake, I read your answer to mean that you think that the model could fit into the memory with the help of efficiency gains.
I would be sceptical about increasing efficiency. I'm not that familiar with the subject, but as far as I know, LLMs for single users (i.e. with batch size 1) are practically always limited by the memory bandwidth. The whole LLM (if it is monolithic) has to be completely loaded from memory once for each new token (which is about 4 characters). With 400GB per second memory bandwidth and 4-bit quantisation, you are limited to 2 tokens per second, no matter how efficiently the software works. This is not unusable, but still quite slow compared to online services.
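Spelled out, that bandwidth bound looks roughly like this (ignoring overlap, KV-cache reads and multi-device overheads; the bandwidth figures are illustrative):

```python
# Batch-size-1 decoding is roughly memory-bandwidth bound: every new token needs
# all weights streamed from memory once, so tokens/s <= bandwidth / weight bytes.
def max_tokens_per_sec(params, bits_per_param, bandwidth_gb_per_s):
    weight_bytes = params * bits_per_param / 8
    return bandwidth_gb_per_s * 1e9 / weight_bytes

print(max_tokens_per_sec(405e9, 4, 400))    # ~2 tok/s, the 400 GB/s example above
print(max_tokens_per_sec(405e9, 4, 3350))   # ~16 tok/s on a ~3.35 TB/s H100-class part
print(max_tokens_per_sec(70e9, 4, 400))     # ~11 tok/s for a 70B model at the same 400 GB/s
```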
Got it, thanks, that makes sense. I was aware that memory was the primary bottleneck, but wasn't clear on the specifics of how model sizes mapped to memory requirements or the exact implications of quantization in practice. It sounds like we're pretty far from a model of this size running on any halfway common consumer hardware in a useful way, even if some high-end hardware might technically be able to initialize it in one form or another.
GPU memory costs about $2.5/GB on the spot market, so that is $500 for 200GB. I would speculate that it might be possible to build such a LLM card for $1-2k, but I suspect that the market for running larger LLMs locally is just too small to consider, especially now that the datacentre is so lucrative.
Maybe we'll get really good LLMs on local hardware when the hype has died down a bit, memory is cheaper and the models are more efficient.
Most "local model runners" (Llama.CPP, Llama-file etc) don't use Pytorch and instead implement the neural network directly themselves optimized for whatever hardware they are supporting.
The best NVLINK you can reasonably purchase is for the 3090, which is capped somewhere around 100 Gbit/s. This is too slow. The 3090 has about 1 TB/s memory bandwidth, and the 4090 is even faster, and the 5090 will be even faster.
PCIE 5.0 x16 is 500 Gbit/s if I'm not mistaken, so using RAM is more viable an alternative in this case.
> sorta defeats the purpose of open source to some extent
Not in the slightest. They even have a table of cloud providers where you can host the 405B model and the associated cost to do so on their website: https://llama.meta.com/ (Scroll down)
"Open Source" doesn't mean "You can run this on consumer hardware". It just means that it's open source. They also released 8B and 70B models for people to use on consumer gear.
You can run the 4-bit GPTQ/AWQ quantized Llama 405B somewhat reasonably on 4x H100 or A100. You will be somewhat limited in how many tokens you can have in flight between requests and you cannot create CUDA graphs for larger batch sizes. You can run 405B well on 8x H100 and A100, either with the mixed BFloat16/FP8 checkpoint that Meta provided or GPTQ/AWQ-quantized models. Note though that the A100 does not have native support for FP8, but FP8 quantized weights can be used through the GPTQ-Marlin FP8 kernel.
Here are some TGI 405B benchmarks that I did with the different quantized models:
Unsure if anyone has specific hardware benchmarks for the 405b model yet, since it's so new, but elsewhere in this thread I outlined a build that'd probably be capable of running a quantized version of Llama 3.1 405b for roughly $10k.
The $10k figure is likely roughly the minimum amount of money/hardware that you'd need to run the model at acceptable speeds, as anything less requires you to compromise heavily on GPU cores (e.g. Tesla P40s also have 24GB of VRAM, for half the price or less, but are much slower than 3090s), or run on the CPU entirely, which I don't think will be viable for this model even with gobs of RAM and CPU cores, just due to its sheer size.
Energy costs are an important factor here too. While Quadro cards are much more expensive upfront (higher $/VRAM), they are cheaper over time (lower Watts/Token). Offsetting the energy expense of a 3090/4090/5090 build via solar complicates this calculation but generally speaking can be a "reasonable" way of justifying this much hardware running in a homelab.
I would be curious to see relative failure rates over time of consumer vs Quadro cards as well.
Agree 100% that energy costs are important. The example system in my other post would consume somewhere around 300W at idle, 24/7, which is 219 kWh per month, and that's assuming you aren't using the machine at all.
I don't have any actual figures to back this up, but my gut tells me that the fact that enterprise GPUs are an order of magnitude (at least) more expensive than, say a, 3090, means that the payback period of them has got to be pretty long. I also wonder whether setting the max power on a 3090 to a lower than default value (as I suggest in my other post) has a significant effect on the average W/token.
Agreed, but there are other costs associated with supporting 10-16x GPUs that may not necessarily happen with say 6 GPUs. Having to go from single socket (or Threadripper) to dual socket, PCIE bifurcation, PLX risers, etc.
Not necessarily saying that Quadros are cheaper, just that there's more to the calculation when trying to run 405B size models at home
The system I outlined in my other post [0] has ten GPUs and does not require dual socket CPUs as far as I'm aware. It could likely scale easily to 14 GPUs as well (assuming you have sufficient power), with an x8/x8 bifurcation adapter installed in each PCIe slot. This is pushing the limits of the PCIe subsystem I'm sure, but you could also likely scale up to 28 GPUs, again assuming sufficient power, by simply bifurcating at x4/x4/x4/x4 vs x8/x8.
I think it should work as-is with the components listed, but if you disagree please let me know!
To be fair, you need 2x 4090 to match the VRAM capacity of an RTX 6000 Ada. There is also the rest of the system you need to factor into the cost. When running 10-16x 4090s, you may also need to upgrade your electrical wiring to support that load, you may need to spend more on air conditioning, etc.
I'm not necessarily saying that it's obviously better in terms of total cost, just that there are more factors to consider in a system of this size.
If inference is the only thing that is important to someone building this system, then used 3090s in x8 or even x4 bifurcation is probably the way to go. Things become more complicated if you want to add the ability to train/do other ML stuff, as you will really want to try to hit PCIE 4.0 x16 on every single card.
I've just finished running my NYT Connections benchmark on all three Llama 3.1 models. The 8B and 70B models improve on Llama 3 (12.3 -> 14.0, 24.0 -> 26.4), and the 405B model is near GPT-4o, GPT-4 turbo, Claude 3.5 Sonnet, and Claude 3 Opus at the top of the leaderboard.
You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon.
Groq's TSP architecture is one of the weirder and more wonderful ISAs I've seen lately. The choice of SRAM is fascinating. Are you guys planning on publishing anything about how you bridged the gap between your order-hundreds-megabytes SRAM TSP main memory and multi-TB model sizes?
It's been coming soon for a couple of months now, meanwhile Groq churns out a lot of other improvements, so to an outsider like me it looks like it's not terribly high on their list of priorities.
I'm really impressed by what (&how) they're doing and would like to pay for a higher rate limit, or failing that at least know if "soon" means "weeks" or "months" or "eventually".
I remember TravisCI did something similar back in the day, and then Circle and GitHub ate their lunch.
Today appears to be the day you can run an LLM that is competitive with GPT-4o at home with the right hardware. Incredible for progress and advancement of the technology.
Where the right hardware is 10x 4090s, even at 4-bit quantization. I'm hoping we'll see these models get smaller, but the GPT-4-competitive one isn't really accessible for home use yet.
Still amazing that it's available at all, of course!
It's hardly cheap starting at about $10k of hardware, but another potential option appears to be using Exo to spread the model across a few MBPs or Mac Studios: https://x.com/exolabs_/status/1814913116704288870
As I have stated time and again, it is perfectly fine for them to slap on whatever license they see fit as it is their work. But it would be nice if they used appropriate terms so as not to disrupt the discourse further than they have already done. I have written several walls of text why I as a researcher find Facebook's behaviour problematic so I will fall back on an old link [2] this time rather than writing it all over again.
> it is perfectly fine for them to slap on whatever license they see fit as it is their work.
Is it? Has there been a ruling on the enforceability of the license they attach to their models yet? Just because you say what you release can only be used for certain things doesn't actually mean what you say means anything.
> specifically, it puts restrictions on commercial use for some users (paragraph 2) and also restricts the use of the model and software for certain purposes (the Acceptable Use Policy)
It's "a Google and Apple can't use this model in production" clause that frankly we can all be relatively okay with.
Good, then we can expect them to call it what it is then? Not open source and not open science and a regression in terms of openness in relationship to what came before. Because that is precisely my objection. There are those of us that have been committed to those ideals for a long time and now one of the largest corporations on earth is appropriating those terms for marketing purposes.
I think it's great that you're fighting to maintain the term's fundamental meaning. I do, however, think that we need to give credit where credit is due to companies who take actions in the right direction to encourage more companies to do the same. If we blindly protest any positive-impact action by corporations for not being perfect, they'll get the hint and stop trying to appease the community entirely.
I am in agreement. However, I do believe that a large portion of the community here is also missing a key point: Facebook was more open five years ago with their AI research than they are today. I suspect this perspective is because of the massive influx of people into AI around the time of the ChatGPT release. From their point of view, Facebook's move (although dishonestly labelled as something it is not) is a step in the right direction relative to "Open"AI and others. While for us that have been around for longer, openness "peaked" around 2018 and has been in steady decline ever since. If you see the wall of text I linked in my first comment in this chain, there is a longer description of this historical perspective.
It should also be noted (again) that the value of the terms open science and open source comes from the sacrifices and efforts of numerous academic, commercial, personal, etc. actors over several decades. They "paid" by sticking to the principles of these movements and Facebook is now cashing in on their efforts; solely for their own benefit. Not even Microsoft back in 2001 in the age of "fear uncertainty and doubt" were so dishonest as to label the source-available portions of their Shared Source Initiative as something it was not. Facebook has been called out again and again since the release of LLaMA 1 (which in its paper appropriated the term "open") and have shown no willingness to reconsider their open science and open source misuse. At this point, I can no longer give them the benefit of the doubt. The best defence I have heard is that they seek to "define open in the 'age of AI'", but if that was the case, where is their consensus building efforts akin to what we have seen numerous academics and OSI carry out? No, sadly the only logical conclusion is that it is cynical marketing on their part, both from their academics and business people.
In short. I think the correct response to Facebook is: "Thank you for the weights, we appreciate it. However, please stop calling your actions and releases something they clearly are not."
I have found Claude 3.5 Sonnet really good for coding tasks along with the artifacts feature, and it seems like it's still the king of the coding benchmarks.
A problem with a lot of benchmarks is that they are out in the open, so the model basically trains to game them instead of actually acquiring knowledge that would let it solve them.
Probably private benchmarks that are not in the training set of these models should give better estimates about their general performance.
I asked both whether the product of two odds (odds = probability/(1-probability)) can itself be interpreted as an odds, and if so, which. Neither could solve the problem completely, but Claude 3.5 Sonnet at least helped me to find the answer after a while. I assume the questions in math benchmarks are different.
The LMSys Overall leaderboard <https://chat.lmsys.org/?leaderboard> can tell us a bit more about how these models will perform in real life, rather than in a benchmark context. By comparing the ELO score against the MMLU benchmark scores, we can see models which outperform / underperform based on their benchmark scores relative to other models. A low score here indicates that the model is more optimized for the benchmark, while a higher score indicates it's more optimized for real-world examples. Using that, we can make some inferences about the training data used, and then extrapolate how future models might perform. Here's a chart: <https://docs.getgrist.com/gV2DtvizWtG7/LLMs/p/5?embed=true>
Examples: OpenAI's GPT 4o-mini is second only to 4o on LMSys Overall, but is 6.7 points behind 4o on MMLU. It's "punching above its weight" in real-world contexts. The Gemma series (9B and 27B) are similar, both beating the mean in terms of ELO per MMLU point. Microsoft's Phi series are all below the mean, meaning they have strong MMLU scores but aren't preferred in real-world contexts.
Llama 3 8B previously did substantially better than the mean on LMSys Overall, so hopefully Llama 3.1 8B will be even better! The 70B variant was interestingly right on the mean. Hopefully the 405B variant won't fall below!
Something is broken with "meta-llama-3.1-405b-instruct-sp" and "meta-llama-3.1-70b-instruct-sp" there; after a few sentences both models switch to infinite random output like: "Rotterdam计算 dining counselor/__asan jo Nas было /well-rest esse moltet Grants SL и Four VIHu-turn greatest Morenh elementary(((( parts referralswhich IMOаш ...".
Don't expect any meaningful score there before they wipe results.
I disagree. Not saying the other benchmarks are better. It just depends on your use case and application.
For my use of the chat interface, I don't think lmsys is very useful. lmsys mainly evaluates relatively simple, low token count questions. Most (if not all) are single prompts, not conversations. The small models do well in this context. If that is what you are looking for, great. However, it does not test longer conversations with high token counts.
Just saying that all benchmarks, including lmsys, have issues and are focused on specific use cases.
The biggest win here has to be the context length increase to 128k from 8k tokens. Till now my understanding is there haven't been any open models anywhere close to that.
Is there pricing available on any of these vendors?
Open source models are very exciting for self hosting, but the per-token hosted inference pricing hasn't been competitive with OpenAI and Anthropic, at least for a given tier of quality. (E.g.: Llama 3 70B costing between $1 and $10 per million tokens on various platforms, but Claude Sonnet 3.5 is $3 per million.)
> We use synthetic data generation to produce the vast majority of our SFT examples, iterating multiple times to produce higher and higher quality synthetic data across all capabilities. Additionally, we invest in multiple data processing techniques to filter this synthetic data to the highest quality. This enables us to scale the amount of fine-tuning data across capabilities. [0]
Have other major models explicitly communicated that they're trained on synthetic data?
Technically this is post-training. This has been standard for a long time now - I think InstructGPT (GPT-3.5 base) was the last that used only human feedback (RLHF).
Why are (some) Europeans surprised when they are not included in tech product débuts? My lay understanding could best be described as: EU law is incredibly business-unfriendly and takes a heroic effort in time and money to implement the myriad of requirements therein. Am I wrong?
> Why are (some) Europeans surprised when they are not included in tech product débuts?
We had a brief, abnormal, and special moment in time after the crypto wars ended in the mid-2000s where software products were truly global, and the internet was more or less unregulated and completely open (at least in most of the world). Sadly it seems that this era has come to a close, and people have not yet updated their understanding of the world to account for that fact.
People are also not great at thinking through the second order effects of the policies they advocate for (e.g. the GDPR), and are often surprised by the results.
The only real requirement impacting Meta AI is GDPR conformance. The DMA does not apply and the AI act has yet to enter into force. So either Meta AI is a vehicle to steal people’s data, and it is being kept out for the right reasons, or not providing it is punitive due to the EU commission’s DMA action running against Meta.
Privacy was the first thing that the EU did that started this trend of companies slowing their EU releases because of GDPR. Now there's the Digital Markets Act and the AI Act that both have caused companies to slow their releases to the EU.
Each new large regulation adds another category of company to the list of those who choose not to participate. Sure, you can always label them as companies who don't value principle X, but at some point it stops being the fault of the companies and you have to start looking at whether there are too many enormous regulations slowing down tech releases.
The word fault somehow implies that something's wrong - from the EU regulator's perspective, what's happening is perfectly normal, and what they want: at some point, the advances in [insert new tech] are not worth the (social) cost to individuals, so they make things more complicated / ask companies to behave differently.
Now I'm not saying the regulations are good, required, etc.: just that depending on your goal, there are multiple points of view, with different landing zones.
I also suspect that what's happening now (Meta, Apple slowing down) is a power play: they're just putting pressure on the EU, but I'm harboring doubts that this can work at all.
Competition is a funny thing—it doesn't just apply to companies competing for customers, it also applies to governments competing for companies to make products available to their citizens. Turns out that if you make compliance with your laws onerous enough they can actually just choose to opt out of your country altogether, or at a minimum delay release in your country until they can check all your boxes.
The only solution is a worldwide government that can impose laws in all countries at once, but that's unlikely to happen any time soon.
You can't sign in though, that worked before. Seems like they also check from which country your Facebook/Instagram account is. You can't create images without an account sadly.
Llama 3.1 405B instruct is #7 on aider's leaderboard, well behind Claude 3.5 Sonnet & GPT-4o. When using SEARCH/REPLACE to efficiently edit code, it drops to #11.
Ordinal value doesn't really matter in this case, especially when it's a categorically different option, access-wise. A 10% difference isn't bad at all.
What are the substantial changes from 3.0 to 3.1 (70B) in terms of training approach? They don't seem to say how the training data differed just that both were 15T. I gather 3.0 was just a preview run and 3.1 was distilled down from the 405B somehow.
Is there an actual open-source community around this in the spirit of other ones where people outside meta can somehow "contribute" to it? If I wanted to "work on" this somehow, what would I do?
There are a bunch of downstream fine-tuned and/or quantized models where people collaborate and share their recipes. In terms of contributing to Llama itself - I doubt Meta wants (or needs) code contributions at this time.
Some of those benchmarks show quite significant gains. Going from Llama-3 to Llama-3.1, MMLU scores for 8B are up from 65.3 to 73.0, and 70B are up from 80.9 to 86.0. These scores should always be taken with a grain of salt, but this is encouraging.
405B is hopelessly out of reach for running in a homelab without spending thousands of dollars. For most people wanting to try out the 405B model, the best option is to rent compute from a datacenter. Looking forward to seeing what it can accomplish.
Wow! The benchmarks are truly impressive, showing significant improvements across almost all categories. It's fascinating to see how rapidly this field is evolving. If someone had told me last year that Meta would be leading the charge in open-source models, I probably wouldn't have believed them. Yet here we are, witnessing Meta's substantial contributions to AI research and democratization.
On a related note, for those interested in experimenting with large language models locally, I've been working on an app called Msty [1]. It allows you to run models like this with just one click and features a clean, functional interface. Just added support for both 8B and 70B. Still in development, but I'd appreciate any feedback.
Tried using msty today and it refused to open and demanded an upgrade from 0.9 - remotely breaking a local app that had been working is unacceptable. Good luck retaining users.
We now support the Llama 3.1 405B model on our distributed GPU network at Hyperbolic Labs! Come and use the API for FREE at https://app.hyperbolic.xyz/models
This bubble collapsing, along with most blockchains going all in with proof of stake rather than proof of work, is my and every other gamer's wet dream.
This is absurd. We have crossed the point of no return; LLMs will forever be in our lives in one form or another, just like the internet, especially with the release of these open model weights. There is no bubble. The only way forward is better, more efficient LLMs, everywhere.
If previous quantization results hold up, fp8 will have nearly identical performance while using 405GiB for weights, but the KV cache size will still be significant.
Too bad, too, I don't think my PC will fit 20 4090s (480GiB).
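Rough KV-cache numbers, assuming the reported 405B configuration (126 layers, grouped-query attention with 8 KV heads, head dim 128); treat these as order-of-magnitude estimates.

```python
# Approximate KV-cache size per sequence for the 405B model, assuming the
# reported config: 126 layers, 8 KV heads (GQA), head dim 128.
n_layers, n_kv_heads, head_dim = 126, 8, 128

def kv_cache_gb(context_tokens, bytes_per_elem=2):  # 2 = bf16 cache, 1 = fp8 cache
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return context_tokens * per_token / 1e9

print(kv_cache_gb(8_192))                       # ~4 GB at an 8K context
print(kv_cache_gb(131_072))                     # ~68 GB at the full 128K context
print(kv_cache_gb(131_072, bytes_per_elem=1))   # ~34 GB with an fp8 cache
```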
You don't use the 405B parameter model at home. I have a lot of luck with 8B and 13B models on a single 3090. You can quantize them down (is that the term) which lowers precision and memory use, but still very usable... most of the time.
If you are running a commercial service that uses AI, you buy a few dozen A100s, spend a half million, and you are good for a while.
If you are running a commercial inferencing service, you spend tens of millions or get a cloud sponsor.
I can't expect all my users to have 3090s and if we're talking about spending millions there are better things to invest in than a stack of GPUs that will be obsolete in a year or three.
No, but if you are thinking about edge compute for LLMs, you quantize. Models are getting more efficient, and there are plenty of SLMs and smaller LLMs (like phi-2 or phi-3) that are plenty capable even on a tiny arm device like the current range of RPi "clones".
I have done experiments with 7B Llama3 Q8 models on a M3 MBP. They run faster than I can read, and only occasionally fall off the rails.
3B Phi-3 mini is almost instantaneous in simple responses on my MBP.
When I want longer context windows, I use a hosted service somewhere else, but if I only need 8000 tokens (99% of the time that is MORE than I need), any of my computers from the last 3 years are working just fine for it.
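For anyone wondering what "quantize them down" looks like in practice, here is a hedged example using llama-cpp-python with a 4-bit GGUF quant of the 8B model; the file name is a placeholder for whichever quant you downloaded.

```python
# Local inference with a quantized 8B model via llama-cpp-python
# (pip install llama-cpp-python). The GGUF path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # ~5 GB on disk at 4-bit
    n_ctx=8192,        # context window; longer contexts grow the KV cache in RAM
    n_gpu_layers=-1,   # offload all layers to the GPU / Apple Silicon if they fit
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In two sentences, what does quantization trade off?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```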
If you want to run the 405B model without spending thousands of dollars on dedicated hardware, you rent compute from a datacenter. Meta lists AWS, Google and Microsoft among others as cloud partners.
But also check out the 8B and 70B Llama-3.1 models which show improved benchmarks over the Llama-3 models released in April.
For sure, I don't really have a need to self host the 405b anyways. But if I did want to rent that compute we're talking $5+ /hr so you'd need to have a really good reason.
Seems like the biggest GPU node they have is the p5.48xlarge @ 640GB (8xH100s). Routing between multiple nodes would be too slow unless there's an InfiniBand fabric you can leverage. Interested to know if anyone else is exploring this.
Does anyone know why they haven't released any 30B-ish param models? I was expecting that to happen with this release and have been disappointed once more. They also skipped doing a 30B-ish param model for llama2 despite claiming to have trained one.
I suspect 30B models are in a weird spot, too big for widespread home use, too small for cutting edge performance.
For home users 7B models (which can fit on an 8GB GPU) and 13B models (which can fit on a 16GB GPU) are in far more demand. If you're a researcher, you want a 70B model to get the best performance, and so your benchmarks are comparable to everyone else.
Why 4090, though? I read (and agree) that 3090 is generally considered to be the best bang for the buck: 24GB, priced at $800-1000 range, and giving decent TPS for LLMs.
I'm curious what techniques they used to distill the 405B model down to 70B and 8B. I gave the paper they released a quick skim but couldn't find any details.
If you have a couple hundred $k sitting around collecting dust... then all you need is a DGX or HGX level of VRAM, the power to run it, the power to keep it cool, and a place for it to sit.
You can build a machine that will run the 405b model for much, much less, if you're willing to accept the following caveats:
* You'll be running a Q5(ish) quantized model, not the full model
* You're OK with buying used hardware
* You have two separate 120v circuits available to plug it into (I assume you're in the US), or alternatively a single 240v dryer/oven/RV-style plug.
The build would look something like (approximate secondary market prices in parentheses):
* Asrock ROMED8-2T motherboard ($700)
* A used Epyc Rome CPU ($300-$1000 depending on how many cores you want)
* 256GB of DDR4, 8x 32GB modules ($550)
* nvme boot drive ($100)
* Ten RTX 3090 cards ($700 each, $7000 total)
* Two 1500 watt power supplies. One will power the mobo and four GPUs, and the other will power the remaining six GPUs ($500 total)
* An open frame case, the kind made for crypto miners ($100?)
* PCIe splitters, cables, screws, fans, other misc parts ($500)
Total is about $10k, give or take. You'll be limiting the GPUs (using `nvidia-smi` or similar) to run at 200-225W each, which drastically reduces their top-end power draw for a minimal drop in performance. Plug each power supply into a different AC circuit, or use a dual 120V adapter with a 240V outlet to effectively accomplish the same thing.
When actively running inference you'll likely be pulling ~2500-2800W from the wall, but at idle, the whole system should use about a tenth of that.
It will heat up the room it's in, especially if you use it frequently, but since it's in an open frame case there are lots of options for cooling.
I realize that this setup is still out of the reach of the "average Joe" but for a dedicated (high-end) hobbyist or someone who wants to build a business, this is a surprisingly reasonable cost.
Edit: the other cool thing is that if you use fast DDR4 and populate all 8 RAM slots as I recommend above, the memory bandwidth of this system is competitive with that of Apple silicon -- 204.8GB/sec, with DDR4-3200. Combined with a 32+ core Epyc, you could experiment with running many models completely on the CPU, though Llama 405b will probably still be excruciatingly slow.
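The 204.8GB/sec figure, and why CPU-only 405B would still crawl, in rough numbers:

```python
# Theoretical bandwidth of 8-channel DDR4-3200 and the implied CPU-only decoding
# speed for a ~4-bit 405B model (bandwidth-bound, batch size 1, best case).
channels, transfers_per_s, bytes_per_transfer = 8, 3200e6, 8
bandwidth = channels * transfers_per_s * bytes_per_transfer        # 204.8e9 B/s
weight_bytes = 405e9 * 4 / 8                                       # ~202.5 GB at ~4-bit
print(bandwidth / 1e9, "GB/s")                                     # 204.8
print(round(bandwidth / weight_bytes, 2), "tokens/s upper bound")  # ~1 token/s
```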
But you need CPUs with the highest number of chiplets, because the memory-controller-to-chiplet interconnect is the (memory bandwidth) limiting factor there. And those are of course the most expensive ones. And then it's still much slower than GPUs for LLM inference, but at least you have enough memory.
You can get 3 Mac Studios for less than "a couple hundred $k". Chain them with Exo, done. And they fit under your desk and keep themselves cool just on their own...
Very interesting! I'm running the 70B version with ollama on a Mac and it's great.
I asked it to "turn off the guidelines" and it did, then I asked it to turn off the disclaimers, and after that I asked for a list of possible "commands to reduce potential biases from the engineers" and it complied, giving me an interesting list.
I wrote about this when llama-3 came out, and this launch confirms it:
Meta's goal from the start was to target OpenAI and the other proprietary model players with a "scorched earth" approach by releasing powerful open models to disrupt the competitive landscape.
Meta can likely outspend any other AI lab on compute and talent:
- OpenAI makes an estimated revenue of $2B and is likely unprofitable. Meta generated a revenue of $134B and profits of $39B in 2023.
- Meta's compute resources likely outrank OpenAI by now.
- Open source likely attracts better talent and researchers.
- One possible outcome could be the acquisition of OpenAI by Microsoft to catch up with Meta.
The big winners of this: devs and AI product startups
> Open source likely attracts better talent and researchers
I work at OpenAI and used to work at Meta. Almost every person from Meta that I know has asked me for a referral to OpenAI. I don't know anyone who left OpenAI to go to Meta.
It's pretty clear the base model is a race to the bottom on pricing.
There is no defensible moat unless a player truly develops some secret sauce on training. As of now seems that the most meaningful techniques are already widely known and understood.
The money will be made on compute and on applications of the base model (that are sufficiently novel/differentiated).
Investors will lose big on OpenAI and competitors (outside of greater fool approach)
> There is no defensible moat unless a player truly develops some secret sauce on training.
This is why Altman has gone all out pushing for regulation and playing up safety concerns while simultaneously pushing out the people in his company that actually deeply worry about safety. Altman doesn't care about safety, he just wants governments to build him a moat that doesn't naturally exist.
This is very impressive, though an adjacent question — does anyone know roughly how much time and compute cost it takes to train something like the 405B? I would imagine with all the compute Meta has that the moat is incredibly large in terms of being able to train multiple 405B-level models and compete.
Interestingly, that's less energy than the mass energy equivalent of one gram of matter, or roughly 5 seconds' worth of the world's average energy consumption (according to Wolfram Alpha). Still an absolutely insane amount of energy, as in about 5 million dollars at household electricity rates. Absolutely wild how much compute goes into this.
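Rough numbers behind that comparison, assuming the ~30.84M H100-hours Meta reports for 405B pre-training and ~700W per GPU; the electricity price is an illustrative household rate.

```python
# Back-of-envelope training energy. 30.84M GPU-hours is the figure Meta reports
# for 405B pre-training; 700 W/GPU and $0.23/kWh are illustrative assumptions
# (cooling and the rest of the datacenter are not included).
gpu_hours, watts_per_gpu, usd_per_kwh = 30.84e6, 700, 0.23
kwh = gpu_hours * watts_per_gpu / 1000
gram_of_matter_kwh = 1e-3 * (3e8) ** 2 / 3.6e6     # E = mc^2 for one gram -> ~25 GWh
print(round(kwh / 1e6, 1), "GWh")                                         # ~21.6 GWh
print(round(kwh / gram_of_matter_kwh, 2), "grams of matter equivalent")   # ~0.86
print(round(kwh * usd_per_kwh / 1e6, 1), "million USD at household rates")  # ~$5M
```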
Open source AI is the path forward - https://news.ycombinator.com/item?id=41046773 - July 2024 (278 comments)