Without NVLink, different layers are loaded onto individual cards, and one card often has to wait for the other, slowing down generation. It's still faster than CPU offloading. You can even mix and match GPUs like a 4090 + 3060.
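To make the layer split concrete, here is a rough sketch of dividing a model's layers between mismatched cards in proportion to their VRAM. The function name, the 60-layer count, and the 24 GB / 12 GB figures are assumptions for illustration, not how any particular loader actually does it.

```python
def split_layers(n_layers, vram_gb):
    """Assign contiguous layer ranges to GPUs in proportion to their VRAM."""
    total = sum(vram_gb)
    # Floor of each card's proportional share of the layers.
    counts = [int(n_layers * v / total) for v in vram_gb]
    # Hand out any layers lost to rounding, largest card first.
    leftover = n_layers - sum(counts)
    order = sorted(range(len(vram_gb)), key=lambda i: -vram_gb[i])
    for i in order[:leftover]:
        counts[i] += 1
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges

# Hypothetical setup: a 60-layer model on a 24 GB card plus a 12 GB card.
for gpu, layers in enumerate(split_layers(60, [24, 12])):
    print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1}")
```

Because each token has to pass through the layers in order, the second card sits idle while the first one works, and vice versa, which is the waiting described above.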
The hardware requirements on these models are basically at a fixed floor, and the democratisation will come from cheaper, possibly specialised, hardware, not reduced requirements, right?
I think there is still lots of room for improvement to reduce hardware requirements, such as 3-bit quantization or pruning weights from sparse models.
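As a back-of-the-envelope illustration of why lower-bit quantization matters, this sketch estimates the memory needed for the weights alone at different bit widths. It assumes a flat 40B parameters and ignores activations, KV cache, and quantization overhead (scales, zero points), so real files will be somewhat larger.

```python
def weight_gib(n_params, bits):
    """Memory for the weights alone, in GiB, at a given bits-per-weight."""
    return n_params * bits / 8 / 2**30

# Assumption: a round 40e9 parameters, no per-block quantization overhead.
for bits in (16, 8, 4, 3):
    print(f"40B params at {bits}-bit: {weight_gib(40e9, bits):.1f} GiB")
```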
But if you're willing to spend $1500 on two used RTX 3090, it's the sweet spot in terms of the ability to run large models right now. Everything beyond that is much more expensive.
Multi-Query Attention, used here, should make 40B inference viable on systems where even 33B LLaMA with Multi-Head is basically unusable, so sometimes improvements still come from software optimization (it's no free lunch though).
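A rough way to see where the saving comes from is to compare KV cache sizes: multi-head attention caches keys and values for every head, while multi-query attention shares one key/value head across all query heads. The layer count, head count, and head size below are illustrative assumptions, not either model's published config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence, assuming fp16 (2 bytes per value)."""
    # 2x for keys and values; one cached vector per layer, KV head and token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical shapes: same depth and head size, only the KV head count differs.
mha = kv_cache_bytes(n_layers=60, n_kv_heads=52, head_dim=128, seq_len=2048)
mqa = kv_cache_bytes(n_layers=60, n_kv_heads=1, head_dim=128, seq_len=2048)
print(f"multi-head:  {mha / 2**20:.0f} MiB per sequence")
print(f"multi-query: {mqa / 2**20:.0f} MiB per sequence")
```

The weights themselves don't shrink, so this helps most with long contexts and batching rather than with fitting the model in the first place.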
Why would you think there's an ASIC/FPGA design that significantly improves over GPUs specifically targeted at running large models already? Where's the win?
The fundamental limit for hardware acceleration right now is the number of gates you can squeeze onto a die (or, alternatively, memory bandwidth).
Analog chips like what MythicAI is developing seem like the next obvious leap forward for deploying inference broadly. An ASIC/FPGA wouldn't be much different from a GPU, and an ASIC seems like a brittle solution.
It most definitely was. The chat output of GPT4 is much faster now and much worse quality. If you go in the playground and use the March 14 API (as opposed to the default GPT4 API), it is high quality and slow. Yes, there are two GPT4 API endpoints.
maybe the comment was about safety related training leading to performance loss. Sebastien used the "Unicorn benchmark" to visualize such nerfing. Watch his talk at timestamp 26:22. Ref: https://youtu.be/qbIk7-JPB2c?t=1582
Am I the only one who finds this very sketchy? They had the whole license things, there's been some loud complaining by the HF CTO on social media that this model is not getting enough attention, and there are also press releases about how Falcon tops the "leaderboard":
I've never seen this kind of "strategy" with an ML model before. Maybe I'm seeing something that isn't there...
It could be a question of not being used to seeing blatantly commercial advertising in places we're used to being about software. Feels like we're moving more towards the bro-ification of AI.
Seriously though, I think it's mostly the opposite. How much advertising did Georgi Gerganov do for ggml / llama.cpp? And it's super popular. Maybe other people are just being more subtle, but I feel merit generally stands out on its own and advertising is a poor substitute.
I saw this, I still am highly suspicious about the future of these models and later attempts at monetization. Did they really just drop their 10% royalty thing and decide they'll just open source all their models now?
Another comment mentions llama may get an open license, and there are other emerging alternatives. In six months there will be lots of options. I would not spend my time building anything around a model that started in such a sketchy way.
It would be interesting to hear about why they decided to change their license and what their plans are for the future.
I admit Falcon has really pissed me off because they pretended they had an open source model when they really released a sleazy freemium thing.
Do you consider the Unity model "sleazy"? But I agree that it was poor form to call it open-source at the time. Sadly, it seems to be standard practice in the LLM space to release models with all kinds of restrictions on use and call them "open source".
Well now Falcon is open source. Given they are giving away something that was very expensive to train, I am grateful.
I don't know Unity's model. I call Falcon sleazy because the royalty stuff (plus a bunch of absurd related terms) was buried in the license while the headline said open source. I'm not against commercial products, nor products with free and premium tiers. This, otoh, felt, like I say, just sleazy.
The explanation is less interesting than the fact they chose not to be upfront in their PR about the limitations of their original license. Personally, I don't care what terms they choose - but I do care if they misrepresent them. Honesty is important, and they weren't.
That's part of why I'm interested in their motivation changing licenses. They obviously had a profit motive and a model that centered on collecting rents on the model and decided to back off from it.
Will they keep maintaining their code and improving the models and continue releasing everything under Apache 2.0?
I'm just saying I feel like we got a glimpse of what they're about and it wasn't pretty, so why build around them when there are lots of options.
Look at Stable Diffusion 1.5 and LLaMA: They are thriving, but the original implementations are ancient history, and Meta/StabilityAI/RunwayML have done precisely nothing. And to be blunt, their legality is very ugly, which already makes Falcon more attractive.
Agree. The Stable Diffusion Open RAIL M license: "You agree not to use the Model or Derivatives of the Model ... To defame, disparage or otherwise harass others"
Does "disparage" have a settled meaning in law? Is a parody disparaging?
> Look at Stable Diffusion 1.5 and LLaMA: They are thriving, but the original implementations are ancient history, and Meta/StabilityAI/RunwayML have done precisely nothing.
I mean, that’s true of SD 1.5 in the sense that what the original creators have done since is new versions (SD 2.0, 2.1, and currently SDXL, which is apparently another SD2-architecture model, and DeepFloyd). 2.1 has also seen some community uptake, and XL likely will once it is released unless there’s something inhibiting that. DF seems to be slowed by different architecture and high resource cost, but I’ve seen posts about people integrating the DeepFloyd early stage models with other models from the SD ecosystem for the last stage upscaling and final rendering, so I wouldn’t be surprised to see it integrated in some of the community UIs as both an integrated workflow and with access to the individual models for mix-and-match workflows.
I dunno. There's some experimentation with 2.1, but the consensus seems to be that it produces inferior output to 1.5 outside of some niches, and that's before taking the 768x768 1.5 finetunes into account.
Deepfloyd is niche.
SDXL is indeed interesting, especially if it's happy with 4/8 bit quant... we will see about that.
Nevertheless StabilityAI seems kinda disconnected from all the innovations going on in the community compared to, say, huggingface.
It's a faulty consensus that came from people comparing outputs from the base model with fine tunes of the SD1.5 model. SD2.1 actually is a far superior model once fine-tuned.
I agree that it's a shame that StabilityAI seem to struggle so much to actually leverage their community (ideally with much more open development)... One could say they're a little too "full of themselves" and think they know better than everyone else.
Very good point, if a community builds up around it, that would make it much more attractive. So chicken / egg in a way. I can definitely see that happening.
Yeah. TBH the biggest hurdle is that Falcon is a different architecture, so the existing tooling won't "just work." A drop in LLaMA or RWKV replacement has a better chance of catching on, even if it leaves some architecture improvements on the table.
This Falcon-40B royalty free license may force Meta ... that LLama-7B/13B may soon be fully open sourced as Meta wants open source LLM advancements and contributions on its own LLM architecture.
And also the potential, unprecedented legal issues that accompany releasing and defending a free/open model. The brownie points they'd receive aren't worth it, at least yet.
GPT2 was released open. It’s been years. Microsoft’s CELA has vetted it. If they are thumbs up, I don’t see why others would be averse. Unless it’s not the reason. Methinks it isn’t.
Why do people think that Meta released their model in order to get open source coders to improve their models? They will get absolutely no competitive advantage from this. Every other team developing a closed source LLM can easily copy the innovations that open source coders have applied to Llama on their own, closed source models.
There's no advantage here. Meta just spent $10 million on releasing fun chaos into the world and increasing their recruiting power.
Meta's most valuable asset is their users, not their technology, so giving away technology is incidental to them. It's not the UI or superior features that makes Meta, Instagram, etc such powerful platforms, it's the network effect.
ChatGPT was the fastest growing app in history, so leaders at Meta (the ones who do M&A, strategy, etc) probably raised an eyebrow. They don't really give a crap about some stupid talking chatbot, but OpenAI getting smart and building a social network around millions of brand new users could be an existential problem for them. When LeCun wanted to OSS it they were probably like, sure, we can kill a few birds with one stone. If LLMs are a commodity, that stops OpenAI and Google before they even get off the ground.
Yeah, but some of the innovations being made at OpenAI will be replicated by the open source community, and Meta can use those for free.
They aren't looking to create a technological edge themselves, they want to remove the edge that OpenAI has so that they can win using their user count/brand recognition/etc.
To be fair, they contributed PyTorch, which has defined a whole industry and is responsible for creating hundreds of billions of dollars in value or more. Contributing a set of model weights is an extremely minor thing in the shadow of that, so it wouldn't exactly be uncharacteristic.
Serious question - why doesn’t someone crowd source the funds to train a GPT scale model for open source? I assume it’s not just a matter of a ton of GPU instances?
These seem like things affordable with money and crowd sourcing of volunteer labor. In some ways it feels too valuable to leave in the hands of megacorps.
In terms of building something that's usable (considering cost, speed, scale, etc) if comparing an OpenAI API call to these, it's difficult for me to see a current path where these have any viable application outside some niche scenario.
From what I understand, even to run locally you/your team needs to be able to afford a machine with a 4090. These are super expensive in some countries.
I played around with the smaller Llama/Alpaca models and it wasn't really viable to build anything with.
Not really seeing a use-case for fine-tuning either compared to just few-shot prompting.
Can someone fill me in on what I'm missing? It feels like I'm out of the loop
I'm running Vicuna on a free 4core Oracle VPS, and it's perfectly usable for a Discord bot. Responses rarely take more than 15 seconds with <256 max token limit, and the responses are much more entertaining than GPT 3.5. I'm not using the streaming API my server software[0] offers, but if I did it would probably load somewhere between the speeds of GPT-3.5 and GPT-4. It's more or less the same time a human would take to compose the same message.
So... not exactly a serious use-case. But it's what I'm using, and now I'm saving 10s of dollars on inferencing costs per month!
They're okay. This isn't the place for a full review of their offerings (especially considering everyone's mixed feelings on Oracle), but I'm confident that it's better than most 1core/$5 deals you'll find elsewhere.
> Are they any good?
Yep, the free tier allows you to spec up to 24 GB of RAM without paying, which is cool. The bottleneck is really the disk speed, but that's not an issue with mmapped models. There's enough cached memory that it loads instantly, so it's good-ish for this use case.
> Is the free-tier time limited?
No, but there are a lot of strings attached:
- The cores are vCPUs, not dedi (duh)
- You can't create new instances when demand is high (unless you add a credit card)
- Technically Oracle reserves the right to shut down the instance if demand gets really high (although I haven't heard any stories about this personally)
Proceed with caution. It's still a great place to start before you shell out $1/hr for dedi GPU rackspace.
It's one of those "say a keyword with a question, get a response" type bots. I added in a few other "prompt sources" though, where it grabs the first part of an RSS entry or HN comment and tries to autocomplete the rest. Mostly just a boring testbed for me to play with models, for free, with friends.
Fine-tuning is a much better proposition than you’re giving it credit for. Papers are coming out demonstrating that 7B parameter models can outperform GPT-4’s quality when trained on a limited set of tasks. Yet, a 7B model offers comparatively cheap and fast inference. Furthermore, for a lot of use cases, few-shot prompting is infeasible because you need to supply 2-3k tokens worth of few-shot examples with every prompt in order to fully specify the behavior you want. (As an example, think of long-form summarization where you want the summary to adhere to certain rules.)
Adding up these developments (all of which occurred during the span of one week), I don’t see how huge, slow, general-purpose models maintain their relevance in the long term, when a lean, domain-focused model is right there within reach of every application developer.
One benefit of finetuning larger models, like 65B, is to free up limited context space vs few-shot prompting.
If you want a specific kind of interaction with the model then you could take up 1/3rd of the 2048 token context window with few-shot or you could simply finetune it with QLoRA for a few hours on a consumer GPU and then get to use the full 2048 context with the finetuned model.
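The trade-off is simple arithmetic, sketched below. The 2048-token window and the one-third few-shot share come from the comment above; the helper name is made up for illustration.

```python
def remaining_context(window, few_shot_tokens):
    """Tokens left for the actual input and output after the few-shot examples."""
    return max(window - few_shot_tokens, 0)

window = 2048
few_shot = window // 3  # roughly a third of the window spent on examples
print("few-shot prompting:", remaining_context(window, few_shot), "tokens left")
print("finetuned model:   ", remaining_context(window, 0), "tokens left")
```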
Ah. Perhaps the larger models will find use in in-house deployments where companies want their employees to have access to ChatGPT-like general purpose assistants, but want to prevent data from leaving their premises. LIMA shows the LLaMA 65B hitting a quality level somewhere between DaVinci-003 and GPT-4 with minimal alignment, so the base models are probably powerful enough already for this to work. Just speculating.
>From what I understand, even to run locally you/your team needs to be able to afford a machine with a 4090. These are super expensive in some countries.
That's because we're only half a year into LLMs becoming mainstream. Give it 3-4 years. The advancements in bringing down model size, optimizations, and newer GPUs and SoCs from Nvidia, AMD, Apple, Intel, Qualcomm, etc will make it so that top LLMs will run on a high-end laptop/desktop.
Not specific to this model, but beyond the large players (OpenAI, Cohere, etc) are there any free hosted versions of the open(ish) LLMs? Even the smaller 7B parameter ones? I'm prototyping out a project and using OpenAI for now, but it feels like there has to be a hosted alternative somewhere.
I spent some time today exploring HuggingFace's Inference API but if the model is sufficiently large (> 10gb), HF requires you to use their commercial offerings.
> HF requires you to use their commercial offerings
Some of which are quite affordable ($80 per month). Larger ones can be like $2000 a month, which is still ok for the prototyping phase. You're basically paying for AWS/GCP infrastructure.
I quite liked the UX of it, very intuitive. My trouble was finding a model that executes out-of-the-box tho. All of the GPT ones crash on startup.
I do use a P40 for my machine learning box, but I'm curious how you put three on the same system, given they need a CPU power plug and a PCIe port. Then, to cool them, you need to plug in your own cooling system, requiring more specific power plugs to be available. What kind of chassis, motherboard, and power unit do you use to do that? It'll certainly cost more than $1000 anyway, especially since you also need a decent amount of RAM to preload the models before you move them to the GPUs.
http://nonint.com/ has some interesting posts about how he built a custom server to house 8 GPUs (3090s in this case). You're right that that will set you back more than $1000, though I was only referring to the GPUs themselves.
I'm less interested in using someone else's computer (not as much, but similar to how I'm disinterested in an API from someone like OpenAI), and would rather pay the upfront hardware cost than worry about how many tokens I generate (kind of hinders creativity and excitement about it).
Stupid question, but for feed-forward models why do we not yet have some kind of CPU RAM memory swap mechanism? Why is PyTorch still trying to load the whole damn model into GPU RAM at once and then complaining when it can't, instead of swapping portions of the model to CPU RAM, or hell, even SSD?
Sure, it might be a lot slower, but that's a lot better than "I give up, go buy $20K worth of hardware"
That’s what llama.cpp does, including offload to a disk. It allows you to run models as big as any combination of your VRAM, RAM or disk. But in the end, if it doesn’t fit into GPU VRAM, it will be slow.
For example, Guanaco-33B generates ~10 tokens per second running fully from the VRAM of my 3090, and ~1 token/second running from the DDR4 RAM of my Ryzen. I would imagine it would do something like a token per minute from an NVMe SSD.
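A crude way to reason about these numbers: if every generated token has to read all the weights once, then tokens per second can't exceed memory bandwidth divided by model size. The sketch below assumes an ~18 GB quantized model and ballpark bandwidth figures for each tier; both are assumptions, and real throughput will land well below this ceiling.

```python
def max_tokens_per_second(model_gb, bandwidth_gb_s):
    """Upper bound if each token streams every weight once from this memory."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 18  # assumption: a 33B model quantized to roughly 4 bits per weight

# Ballpark bandwidths, for illustration only.
tiers = {
    "GPU VRAM (3090-class)": 936.0,
    "dual-channel DDR4": 50.0,
    "NVMe SSD": 3.5,
}
for name, bandwidth in tiers.items():
    bound = max_tokens_per_second(MODEL_GB, bandwidth)
    print(f"{name}: at most {bound:.2f} tokens/s")
```

The ordering matches the experience above: each step down the memory hierarchy costs roughly an order of magnitude.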
70GB of RAM would cost around $150 these days depending on how you get there. 64GB (2x32GB) of DDR4 is around $140 then another 8GB stick would be around $15.
I've been running Vicuna-13B on a workstation that can be acquired from eBay for about $500. It's slow compared to online services, but probably slightly faster than text-to-speech would recite its output, so plenty.
Falcon 40B is probably too much for it, but apparently there's similar cheap hardware that could work.
I know, I have a weak GPU but 64 GB RAM. Using those models works, but it’s more "ask a question, then do something else for a while, while your fans spin up" ;)
There's no reason, particularly, to believe either that it does, or that it would.
For OpenAI to scramble and try to “catch up” with a competitor and make such a massive change in strategy would require someone to be offering an equivalent service (hosted inference) that was either orders of magnitude cheaper than their offering and just as good, or significantly better than it. Or legal compulsion.
It would, as open source improvements start to exceed the performance of 3.5 for specific use-cases. At the very least they would have to make it fine-tunable.
Sam Altman said recently that they are already working on making GPT-3.5/GPT-4 finetunable, they are just limited by the availability of compute (partially since none of their SFT infrastructure uses LoRA).
I had previously assumed it was safety concerns, since I don't see what stops someone from finetuning away all guard rails.
At the most basic, you give it text and it guesses at what comes next. This means you could type "10 types of ferns" and it will build a list of 10 ferns. Or you could type out what a transcript of a conversation would look like and it will basically fill in the other "side" of the conversation to make a chatbot (all the complicated chatbots are basically abstracting this). Think of it like a text box that a super-smart (arguably) person also has access to: you type one thing and then it types whatever it thinks would be the next thing to type.
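Here is a minimal sketch of that transcript trick. The speaker labels and both helper names are made up for illustration, and the actual call to a model is left out; any text-completion model would slot in between the two functions.

```python
def build_chat_prompt(history, user_message, user_name="User", bot_name="Assistant"):
    """Format a conversation as a transcript that ends on the bot's turn."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"{user_name}: {user_message}")
    lines.append(f"{bot_name}:")  # left open so the model continues as the bot
    return "\n".join(lines)

def trim_reply(completion, user_name="User"):
    """Cut the completion where the model starts writing the user's next turn."""
    return completion.split(f"\n{user_name}:")[0].strip()

history = [("User", "Hi!"), ("Assistant", "Hello, how can I help?")]
prompt = build_chat_prompt(history, "Name 3 types of ferns.")
print(prompt)

# Pretend this text came back from a completion model.
fake_completion = " Boston fern, maidenhair fern, staghorn fern.\nUser: Thanks!"
print(trim_reply(fake_completion))
```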
As with everything, follow the money, pretty much.
There will be an ASIC as soon as serious money is being made from LLMs, most use cases atm seem to be in prototype/toy stage, but I imagine we'll start seeing that change.
Inference is very slow right now but it works!