I really dislike the approach some companies have taken to announcing new models: they don't mention evaluation results or the model's performance, but instead talk about how "transparent", "accessible" and "supportive" these models are.
Anyway, I have benchmarked stablelm-base-alpha-3b (the open-source version, not the fine-tuned one, which is under an NC license) using the MMLU benchmark, and the results are rather underwhelming compared to other open-source models:
- stablelm-base-alpha-3b (3B params): 25.6% average accuracy
- flan-t5-xl (3B params): 49.3% average accuracy
- flan-t5-small (80M params): 29.4% average accuracy
MMLU is just one benchmark, but based on the blog post, I don't think it will yield much better results in others. I'll leave links to the MMLU results of other proprietary[0] and open-access[1] models (results may vary by ±2% depending on the parameters used during inference).
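For context on how numbers like these are produced: MMLU is multiple-choice, and a common way to score a base model is to compare the log-likelihood it assigns to each answer option. Below is a minimal sketch of that scoring loop, assuming the Hugging Face id stabilityai/stablelm-base-alpha-3b; the real benchmark uses 5-shot prompts over 57 subjects, so treat this as illustrative only.
    # Minimal multiple-choice scoring sketch, not the full MMLU harness.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "stabilityai/stablelm-base-alpha-3b"  # assumed HF id
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    def choice_logprob(prompt: str, choice: str) -> float:
        """Sum of log-probabilities the model assigns to `choice` given `prompt`."""
        prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tok(prompt + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
        # score only the continuation tokens (the answer text)
        answer_ids = full_ids[0, prompt_len:]
        return sum(logprobs[prompt_len - 1 + i, t].item() for i, t in enumerate(answer_ids))

    question = "Question: What is the capital of France?\nAnswer:"
    choices = [" Paris", " London", " Berlin", " Madrid"]
    print(max(choices, key=lambda c: choice_logprob(question, c)))
    # accuracy = fraction of questions where the top-scoring choice matches the answer key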
It doesn't make much sense to compare a model that's not fine-tuned to Flan models that are fine-tuned. It makes more sense to compare it to something like T5-base, where it's probably a lot more comparable.
Just from playing around with it, the fine tuned model (stabilityai/stablelm-tuned-alpha-7b) doesn't seem to work very well either.
User: What is 12 + 12?
StableLM: Yes, that is correct! 12 + 12 is equal to 18.
User: Write me a python program that calculates the nth fibonacci number
StableLM: Here is a python program that calculates the nth fibonacci number
    def nth_fibonacci(n):
        # base case
        if n==0:
            return 0
        else:
            return 1 + n - 1
> [20 lines poem about robots, "Their intelligence, surpassing human, so true" etc etc]
> > Write a two sentence poem about how AI now can write human-like text
> With the help of AI, humans no longer have to hold a pen to paper and create their own language. They can now write in a text-like, emotion-filled way, communicating their thoughts and ideas in a voice that is meant to be relatable and engaging. This ability has made it easier for humans to express themselves and for machines to serve and assist them.
> > It didn’t rhyme
> Human: Still, the poem has been nice, thank you!
I would have compared it to the fine-tuned version if it had been released under a truly open-source license. I think developers implementing LLMs care more about licensing than about the underlying details of the model.
Also, t5-base is 220M params vs the 3B params of stablelm, so it's not really a fair comparison anyway.
Is it actually clear that license restrictions on the training data really do affect the model itself? I know OpenAI says you’re not supposed to use the output of GPT3/4 to train competing models, but that doesn’t strike me as legally enforceable. Most of the discussions I’ve actually seen where lawyers weigh in seem to argue that training these models is pretty clearly fair use and therefore any copyright restrictions on the training data don’t really affect the output. I suppose we won’t know until a case actually goes to court, but I think it’s kind of silly to preemptively say you can’t use these fine-tuned models commercially because of a probably-not-legally-enforceable restriction on some of the training data.
Copyright restrictions are not the only possible restrictions.
If OpenAI says you're allowed to use their service under certain conditions, but you violate the conditions, then what's your legal basis for using the service? Forget about copyright, think about breach of contract or even computer fraud and abuse.
But let’s say you used the OpenAI GPT4 service to generate training data for a new model. You then train your model using that generated training data. In theory OpenAI can ban you from continuing to use their API and maybe even sue you for breach of terms of service, but that doesn’t mean the model you created based on that generated data is somehow now illegal to use or distribute. You can still sell or give away that trained model and there’s nothing OpenAI can do about that.
Let’s take specifically the case of Alpaca, the Stanford team generated a finetuning training set using GPT 3.5. Maybe OpenAI could sue them for doing that. But now that the training set exists and is freely available, I’m not using OpenAI if I finetune a new model with that existing training set. I have no contract with OpenAI, I’m not using their service, and OpenAI does not have any copyright claim on the generated dataset itself. They have no legal claim against me being able to use that dataset to fine tune and release a model.
I disagree; they made the decision to use datasets with restrictive licensing, jumping on the Alpaca/GPT4All/ShareGPT bandwagon.
They also chose to toot their own horn about how open-source their models are, even though for practical uses half of their released models are no more open source than a leaked copy of LLaMA.
So just use their base model and fine-tune with a non-restrictive dataset (e.g. Databricks' Dolly 2.0 instructions)? You can get a decent LoRA fine-tune done in a day or so on consumer GPU hardware, I would imagine.
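To make that concrete, here is a rough sketch of such a LoRA fine-tune, assuming the Hugging Face ids stabilityai/stablelm-base-alpha-3b and databricks/databricks-dolly-15k; the target module name assumes a GPT-NeoX-style architecture, and the hyperparameters are illustrative, not a tested recipe.
    # Rough LoRA fine-tuning sketch; model/dataset ids and hyperparameters are assumptions.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "stabilityai/stablelm-base-alpha-3b"
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
    model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                             target_modules=["query_key_value"],  # NeoX-style attention proj
                                             task_type="CAUSAL_LM"))

    def to_text(ex):
        # flatten each Dolly record into a single instruction/response string
        return {"text": f"Instruction: {ex['instruction']}\nResponse: {ex['response']}"}

    ds = load_dataset("databricks/databricks-dolly-15k", split="train").map(to_text)
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=ds.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments("stablelm-dolly-lora", per_device_train_batch_size=2,
                               gradient_accumulation_steps=16, num_train_epochs=1,
                               learning_rate=2e-4, logging_steps=50),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
    model.save_pretrained("stablelm-dolly-lora")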
The point here is that you can use their bases in place of LLaMA and not have to jump through the hoops, so the fine-tuned models are really just there for a bit of flash…
Looks like you’re seeing the glass as half empty here. Not sure if arguing here was more time efficient than just running the eval on the other set of weights.
*I wish I understood these things well enough to not have to ask, but alas I’m just a basic engineer
More companies should be releasing models as open source and liberally licensed if they want to keep a seat at the table. Otherwise they'll get washed away.
Nobody gives a damn about closed source StableLM. It'll die off in the evolutionary fight as strictly inferior. (How many people are using Claude or Bard?)
By being open source, StableLM will grow with the community. An ecosystem of mods, plugins, libraries, frontends, and more will be developed. Different companies will even compare notes and contribute upstream. Everyone participating wins.
It's fantastic that more orgs are releasing open-source models trained on more than 300B or so tokens. Here's my take from the details I could find.
Pros
- 4096 context width (vs 2048 for llama, gpt-j, etc)
- 3B to 65B released or in progress
- RL tuned models available
- Trained on more tokens than existing non-llama models
- 128 head dim, so can use flash attention (unlike GPT-J)
Cons
- No benchmarks released, or details about the model
- Somewhat restrictive license on the base models, and NC license on the RL models
- Small models only trained on 800B tokens, compared to 1T for llama-7B, and potentially more for other upcoming alternatives (RedPajama, etc). I'd like to see their loss curves to see why they chose 800B.
High-level, this is likely to be more accurate than existing non-llama open source models. It's hard to say without benchmarks (but benchmarks have been gamed by training on benchmark data, so really it's just hard to say).
Some upcoming models in the next few weeks may be more accurate than this, and have less restrictive licenses. But this is a really good option nonetheless.
FYI, I'm running lm-eval now w/ the tests Bellard uses (lambada_standard, hellaswag, winogrande, piqa, coqa) on the biggest 7B on a 40GB A100 atm (non-quantized version, requires 31.4GB), so it will be directly comparable to what various LLaMAs look like: https://bellard.org/ts_server/
(UPDATE: the run took 1:36 to complete but failed at the end with a TypeError, so I will need to poke at it and rerun.)
Looks like my edit window closed, but my results ended up being very low so there must be something wrong (I've reached out to StabilityAI just in case). It does however seem to roughly match another user's 3B testing: https://twitter.com/abacaj/status/1648881680835387392
The current scores I have place it between gpt2_774M_q8 and pythia_deduped_410M (yikes!). Based on training and specs you'd expect it to outperform Pythia 6.9B at least... this is running on a HEAD checkout of https://github.com/EleutherAI/lm-evaluation-harness (releases don't support hf-causal) for those looking to replicate/debug.
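For anyone replicating, a sketch of that run via the harness's Python entry point, assuming the v0.3-era simple_evaluate API and the Hugging Face id stabilityai/stablelm-base-alpha-7b (argument names may differ across harness versions):
    # Replication sketch; the API surface and model id are assumptions, check your harness version.
    from lm_eval import evaluator

    results = evaluator.simple_evaluate(
        model="hf-causal",  # only available on a HEAD checkout, per the comment above
        model_args="pretrained=stabilityai/stablelm-base-alpha-7b",
        tasks=["lambada_standard", "hellaswag", "winogrande", "piqa", "coqa"],
        num_fewshot=0,
        batch_size=4,
        device="cuda:0",
    )
    print(results["results"])  # per-task accuracy / perplexity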
How possible is it that every other model suffers from dataset contamination and this model is being unfairly penalized for having properly sanitized training data?
I'm still on the waitlist for GPT-4 API access. Note, that text-davinci-003 cost about $90 to benchmark at $0.02/1K tokens, so if you're able to use a GPT-4 model (for completion and not just instruction) that'll probably be $270-$540 in credits to benchmark...
That's great news, but one would think that since they're behind Stable Diffusion, they'd use the insights from it and scale data even further, resulting in better quality from a smaller model that can run on most people's machines.
Like... try 10 trillion or 100 trillion tokens (although that may be absurd, I never did the calculation), and a long context, on a 7B parameter model, then see if that gets you better results than a 30B or 65B parameter model on 1.5 trillion tokens.
A lot of these open source projects just seem to be trying to follow and (poorly) reproduce OpenAI's breakthroughs instead of trying to surpass them.
You could've said the same to OpenAI when they were scaling GPT from 1 billion to 175 billion parameters. We're all grateful they didn't follow that line of thought.
But Stability does have access to a pretty big cluster, so it's not paying for cloud compute (I assume), so the cost will be less. And data of course is not infinite... I never stated that it was.
But considering 3.7 million videos are uploaded to YouTube every day, 2 million scientific articles are published every year, yada yada... that argument falls apart.
At the very least implement spiral development... 1 trillion... 3 trillion... (oh it seems to be getting WAY better! There seems to be a STEP CHANGE!)... 5 trillion... (holy shit this really works, lets keep going)
The training corpus is the problem. An extra trillion tokens is (ballpark) an extra million KJV Bibles' worth of text formatted for ingestion. And you probably picked all of the low-hanging fruit, in terms of prior quality vetting and already being in a standard format for ingestion, in your first trillion tokens of training data.
There’s a difference between telling someone they’re wasting their time with their current project, and asking them why they didn’t spend 6x - 60x as much budget on an already expensive project.
Nobody knows where to find 10 trillion tokens of good data. Publicly available / data without a license seems to cap at around 1.5 trillion tokens total. The internet isn't as big as you thought! (Or at least, all the good stuff is behind a walled garden, which I think we did know)
@thunderbird120 asked a Stability employee and says that the plan is to keep training the models up to 1.5T tokens. So I don't know where you read this.
I'm wondering what the sweet spot for parameters will be. Right now it feels like the MHz race we had back in the CPU days, but 20 years later I am still using a 2-3GHz CPU.
There have also been quite a few developments on sparsity lately. Here's a technique SparseGPT which suggests that you can prune 50% of parameters with almost no loss in performance for example: https://arxiv.org/abs/2301.00774
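SparseGPT itself does one-shot pruning using approximate second-order information; purely as a simpler illustration of what 50% unstructured sparsity means, here is a plain magnitude-pruning sketch with PyTorch's built-in utilities (explicitly not the SparseGPT algorithm):
    # Illustrative magnitude pruning to 50% unstructured sparsity; NOT SparseGPT,
    # which uses a second-order solver to pick which weights to drop layer by layer.
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(4096, 4096)  # stand-in for one transformer projection matrix
    prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero the smallest 50% by |w|
    prune.remove(layer, "weight")  # bake the mask into the weight tensor

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"weight sparsity: {sparsity:.1%}")  # ~50.0%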
I was wondering if the longer training thing was a similar phenomenon to the double-descent we see in other deep learning models. Training for a really long time can improve generalization (as can adding more parameters) - but I don't know enough about LLM architecture to know if that's relevant here. My skim of the blog post led me to think it's proposing a different mechanism (scaling laws).
Standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. FlashAttention is also faster.
https://arxiv.org/abs/2205.14135 - Section 5 suggests that the biggest limitation is that custom CUDA kernels need to be coded on a per-GPU architecture basis.
FlashAttention is mathematically identical to standard attention, so in theory there's no downside. In practice, the numerical inaccuracies of floating point mean that the results differ slightly. I don't know of any papers analyzing in depth what impact those variances have across a range of real models, but generally speaking deep models handle slight variances well. I've not noticed any difference in my applications training models. And tons of people use FlashAttention as a drop-in replacement in models trained on standard attention (e.g. using xformers in Stable Diffusion).
Also in practice FlashAttention is still relatively new so it isn't well supported in libraries yet. Until PyTorch 2.0 you had to either implement it yourself, or use something like xformers which comes with a bag of caveats. PyTorch 2.0 now has it built-in, and it's easy to use, but the implementation is incomplete so you can't, for example, use it with an attention mask (which is needed in LLMs, for example).
tl;dr: Basically none, but it just isn't well supported yet.
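For reference, a minimal sketch of the PyTorch 2.0 built-in mentioned above (torch.nn.functional.scaled_dot_product_attention); whether the fused flash kernel actually runs depends on dtype, GPU, and the arguments passed:
    # Minimal sketch of PyTorch 2.0's fused attention; the flash backend is one of
    # several and is skipped if the inputs don't qualify (e.g. an explicit attn_mask).
    import torch
    import torch.nn.functional as F

    q = torch.randn(1, 16, 4096, 128, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # restrict to the flash backend to check whether it's eligible for these inputs
    with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                        enable_mem_efficient=False):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # torch.Size([1, 16, 4096, 128])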
According to the paper, FlashAttention also needs a quadratic number of HBM accesses:
Let 𝑁 be the sequence length, 𝑑 be the head dimension, and 𝑀 be size of SRAM with 𝑑 <= 𝑀 <= 𝑁𝑑. Standard attention (Algorithm 0) requires Θ(𝑁𝑑+𝑁²) HBM accesses, while FlashAttention (Algorithm 1) requires Θ(𝑁²𝑑²M⁻¹) HBM accesses.
But Chinchilla optimality, while an interesting result, is a strange target for most practical purposes. Training happens once, inference many times; not training past the point where it's cheaper to train a larger model for the same (proxy for) quality discounts the cost of inference to zero.
If I understand correctly, based on their prediction in Table 3 on page 8, they do have enough tokens, but they also need over a magnitude more compute time.
> It's not efficient to do 175B. Training a smaller model (65B) on more data gives better performance for the same compute.
This is OP's comment you replied to - so I was responding under OP's context that the amount of compute time would be the same, which I apologize I didn't make clear, and my response was very poorly worded.
My intent was to link the paper because I think it supports OP's statement that for the same amount of compute time and token ratio, the performance of a smaller model will be better than that of a larger one (assuming they haven't converged yet, which they haven't at this size).
> If you want it to just regurgitate training data, sure.
This paper was about showing Chinchilla performing on par with models many times larger than itself, showing you don't need a 175B-size model to get more performance than "regurgitating training data".
…but, a fully trained larger model is going to be better.
The only reasonable reason to prefer a smaller model is that it's cheaper and less intensive to train.
It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that.
For a given compute budget, an undertrained large model and a well-trained small model may be comparable, right?
…but surely, the laws of diminishing returns applies here?
There's an upper bound to how good your smaller model can ever be, right?
Over time, someone can take a larger model which is undertrained and refine that model, right?
The “small model is just as good” narrative only holds up for a fixed once only training of a model for a fixed compute budget at the moment of release.
Over all of time that compute budget is not fixed.
> It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that. … a fully trained larger model is going to be better.
You're absolutely right, a fully trained larger model _will_ be better. This is meant to be under OP's context of "limited compute"; the statement I'm trying to make is "fully trained small models are just as good as an undertrained large model".
> …but surely, the laws of diminishing returns applies here?
They do, but it's diminishing in that the performance gains of larger models become less and less, while the training time required changes a lot. If I'm reading the first chart of figure 2, page 5 correctly, comparing 5B vs 10B, the 10B needs almost 10x the training time for a 10% loss gain, and it's a similar jump from 1B to 5B. My understanding is that this also starts flattening out, and the loss gain from each 10x becomes gradually lower and lower.
> Over all of time that compute budget is not fixed.
Realistically there is an upper bound to your compute budget. If you needed 1000 GPUs for 30 days for a small model, you need 1000 GPUs for 300 days for that ~10% at these smaller sizes, or 10,000 GPUs for 30 days... You're going to become limited very quickly by time and/or money. There's a reason OpenAI said they aren't training a model larger than GPT-4 at the moment - I don't think they can scale it from what I think is a ~1-2T model.
I'm sure there will be a bunch of different RL tuned versions of them, RLHF isn't that expensive. IIRC Microsoft has software that will do it for a few thousand dollars for a model that size. I'm sure someone will release a non-lobotomized version, maybe OpenAssistant.
It's unclear which models will be trained to 1.5T tokens. The details of how many tokens each model saw in training are on Github - https://github.com/stability-AI/stableLM/ . But only for the ones that have been released.
I just asked a Stability employee and they said that the current models ran into an overfitting issue, probably due to some duplicated data somewhere in their dataset, which consists of 1.5T tokens. The 800B tokens is the number of tokens they've been trained on so far. The plan is to keep going and train on the rest of the data once the issue is resolved.
I've asked this question in a few places, and never been able to get an answer, maybe you know...
Q: Why are these LLMs trained for a single epoch, and why do they perform worse if the dataset is repeated?
This seems maybe related to suspecting data duplication as a cause of overfitting.
Why don't LLMs need multi-epoch training at a low learning rate to generalize? If they are managing to learn from a single epoch, that sounds more like they may be memorizing!
Never repeating your training data is what you'd ideally like to do for training basically any ML model. If you do that you don't really need to worry about overfitting since the model is constantly trying to fit a stream of new data. To reduce its training error it actually has to model the structure of the data rather than just memorizing it since each training step will involve data it has never seen before. Larger models are more prone to overfitting but also learn several orders of magnitude faster. If you can use larger models without being concerned about overfitting it's generally desirable to do so. It's just that most tasks don't actually have enough data to support doing that. Thankfully, text modeling does have enough data.
So when, for example, we train an ImageNet model over multiple epochs using rotation/scaling/etc augmentation, it's really better to think of this as one epoch over a unique set of images than multi-epoch per se ? I was really thinking of augmentation as a way to get coverage over the input space rather than ensuring the training data doesn't repeat, but I guess it serves both purposes.
It does still seem that many LLMs are overfitting / memorizing to a fair degree though - maybe just because they are still too big for the amount of data they are trained on ? It seems like a bit of a balancing act - wanting an LLM to generalize, but yet also to serve as somewhat of a knowledge store for rare data it has only seen once.
Selling access to LLMs via remote APIs is the “stage plays on the radio” stage of technological development. It makes no actual sense; it’s just what the business people are accustomed to. It’s not going to last very long. So much more value will be unlocked by running them on device. People are going to look back at this stage and laugh, like paying $5/month to a cellphone carrier for Snake on a feature phone.
Web apps:
- Need data persistence. Distributed databases are really hard to do.
- Often have network effects where the size of the network causes natural monopoly feedback loops.
None of that applies to LLMs.
- Making one LLM is hard work and expensive. But once one exists you can use it to make more relatively cheaply by generating training data. And fine tuning is more reliable than one shot learning.
- Someone has to pay the price of computation power. It’s in the interest of companies to make consumers pay for it up front in the form of a device.
- Being local lets you respond faster and with access to more user contextual data.
This is sort of like saying the world wide web is a fad. Many people made that argument, but a lot of desktop apps got replaced by websites even though they were supposedly inferior.
ChatGPT works fine as a website and you don’t need to buy a new computer to run it. You can access your chat history from any device. For many purposes, the only real downside is the subscription fee.
If LLM’s become cheaper to run, websites will be cheaper to run, and there will be lower-cost competition. Maybe even cheap enough to give away for free and make money from advertising?
This doesn't seem technically feasible to me. The state of the art will for a long time require a lot more hardware to run than it's available on a consumer device.
Beyond which, inference also benefits from parallelization, not just training, so being able to batch requests is a benefit, and more likely when access is offered via an API.
Well that's the problem though, those models don't come anywhere close to being useful at all. At least not yet. And they also run much slower.
As compute increases in general, there will be larger and more capable state of the art models and it'll make more sense to just use those instead of trying to run some local one that won't give you any useful answers. Data centers will always have a few orders of magnitude more horsepower than your average laptop, even with some kind of inference accelerator card.
Do people use it for anything practical? Making stock photos maybe? I haven't really had a proper use case for it and all the random things I tried to make with it weren't good enough to use with anything. Could be useful for making concepts for real artists, but last I heard they were all too busy boycotting it.
> I haven't really had a proper use case for it and all the random things I tried to make with it weren't good enough to use with anything.
Sounds a lot like most of my early programming experiments…
Though I’ve heard on good authority that the early programmers looked past being able to calculate ballistic charts and have done some interesting things with these “computer” things.
Trying out some prompts, maybe last time I used SD my mistake was going with a lower resolution to speed up generation. I literally cannot get this one to make anything that isn't a weird blob at 256px and lower, but at 512px it works fine? Weird that it's so resolution dependent. I guess some proper stuff can be made at 1024px and above.
This technology will be embedded into every OS within 2 years. People don't generally need a "super" model like GPT3/4. It will be perfectly acceptable and common to have the model change context, sync with whatever model/training data is necessary to be an expert in that context only, and associated contexts..., and prompt it in a specific domain. Client devices and internet connections are fast enough to do this in near real time today. The platforms to do all of this are being built right now by every company that creates software otherwise they will fail within 5 years.
I can already run Vicuna(llama) 7B on my 2020, 14" PC laptop at ~3.5 tokens/sec, and more speed can definitely be squeezed out.
Most future laptops and phones will ship with NPUs next to the CPU silicon. Once they get enabled in software, that means a 16GB machine can run a 13B model, or a 7B model with room for other heavy apps.
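Back-of-the-envelope arithmetic behind the 13B-in-16GB claim, assuming 4-bit quantized weights and LLaMA-13B-like dimensions (40 layers, 5120 hidden size); these numbers are assumptions, not a measured footprint:
    # Rough memory estimate; quantization format and model dimensions are assumptions.
    params = 13e9
    weight_gb = params * 0.5 / 1e9           # 4 bits = 0.5 bytes per parameter, about 6.5 GB
    # fp16 KV cache for a 4096-token context: keys+values * layers * seq_len * hidden * 2 bytes
    kv_gb = 2 * 40 * 4096 * 5120 * 2 / 1e9   # about 3.4 GB
    print(f"weights ~{weight_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB")  # leaves headroom in 16 GB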
As for the benefits of batching and centralization, that is true, but it's somewhat countered by the high cost of server accelerators and the high profit margins of cloud services.
Setting the M series aside, the AMD 7000 laptops already have reasonably fast memory. Faster than some old GPUs.
And that trend is accelerating. The latest rumor is that Intel is bringing back the eDRAM cache next (which means it was in planning long before the generative ai craze), and more stacked/on package memory is just around the corner.
While 7000U laptops have yet to be benchmarked, dual-channel DDR5/quad-channel LPDDR5 systems top out at about 60GB/s. (The M1/M2 by comparison is a 100GB/s, and doubles for Pro, Ultra, and Max up to 800GB/s). As a point of reference, top end consumer GPUs like the RTX 4090 are at about 1000GB/s.
My understanding is things like V-Cache, eDRAM have limited benefits for dense transformers, as they need to cycle through all/most of the parameters when running.
I don't think it's going to happen in the next few years
the prices are gonna drop like hell, but ain't no way we run models meant to run on 8 nvidia A100 on our smartphones in the next 5 years
just like you don't store the entirety of Spotify on your iPhone, you're not gonna run any decent LLM on phones any time soon (and I don't consider any of the small LLaMAs to be decent)
This is the reason why they're not going to move on device anytime soon. You can use compression techniques, sure, but you're not going to get anywhere near the level of performance of GPT-4 at a size that can fit on most consumer devices
I think we'll see completely new architectures dominate in the near future, ousting the transformer. I strongly suspect that, while impressive, transformers use several orders of magnitude more compute than is "needed" for the tasks they perform, if for no other reason than that the human brain performs similarly and it only draws 20 watts! And it isn't even an engineered system, just the product of a very, very long history of natural selection! I fully anticipate that we'll see AI in the near future that achieves human-level performance on sub-human power budgets like the ones you'd be constrained by on a phone :)
"neat future" is very ambiguous. At the moment there is nothing even close to transformers in terms of performance. I suspect you are right in general but I'm not sure about the "near future" part, there needs to be a pretty significant paradigm shift for that to happen (which is possible, of course, I just don't see any hints of it yet).
RWKV is an attention-free architecture that's showing promising scaling at a similar level to Transformers right now! There's also recently been Hyena, which uses a new mechanism that's kind of a weird mix of attention, convolution, and implicit modelling all at once. It's shown promise as well. Remains to be seen if these competing methods will truly scale as well as Transformers, but I've got my fingers crossed. Only a matter of time!
I agree that "near future" is quite ambiguous though. If I were to disambiguate my claims, I think I'd personally expect a Transformer-killing architecture to arise in the next 4-5 years.
the only thing I can say to this is that Apple have seemed laser focused on tuning their silicon for ML crunching, that that focus is clearly now going to be amped up further still, and that in tandem the software itself will be tuned to Apple silicon.
GPUs on the other hand are pretty general purpose. And 5 years on a focused superlinear ramp up is a long time, lots can happen. I am not saying it's 100%, or even 80% likely. It'll be super impressive if it happens, but I see it as well within the realms of reason.
Apple's new M2 Max has a neural engine which can do 15 trillion flops. Nvidias's A100 chip (released almost 3 years ago) can do 315 trillion flops. Apple is not going to close this 20x gap in a few years.
FTFY, remember it takes 8 of those to even load the thing. And when the average laptop has that much compute, GPT 4 will seem like Cleverbot in comparison to the state of the art.
I think the tuning the models to the hardware piece is important, and of course there is much more incentive to do this for Apple than nvidia because of the distribution and ecosystem advantages Apple have.
But also, I don't know... let's see what the curve looks like! It's only been a couple of years of these neural engines. Let's see how many flops M3 can hit this year. And then m4 the next. Again, 5 years is a long time actually when real improvement is happening. I am optimistic.
That doesn't sound likely with the current architectures. There may be some kind of specialisation, but an NN is like a chip designer's nightmare. We can't do chips with that many crossed lines. It's going to have to keep the storage + execution engine pattern unless we see some breakthroughs.
Well, we'll see what the future manufacturing brings, but right now we're not even at thousands of layers (as far as I know... please link if there's been more), and we'd need to be in hundreds of thousands range. Given the rate of defects also adding up and the need for some way to dissipate the heat... (almost all of that chip will be engaged while running - no chance for balancing power between systems) Yeah, still lots of challenges there.
(I'm assuming the original comment meant literally putting the network as is in the purpose designed chip)
The M2 and the 4090 are both very general purpose. In fact, the 4090 allocates proportionally more silicon area to the tensor cores than Apple allocates to the neural engine.
The M series is basically the only "big" SoC with a functional, flexible NPU and big GPU right now, which is why it seems so good at ML. But you can bet actual ML focused designs are in the pipe.
I don't think so. M chips just happen to have a really good memory subsystem and good simd performance through accelerate, so the CPU performance is pretty good.
Some Stable Diffusion implementations can use the NPU or GPU, or (experimentally and unsuccessfully) both.
Curious, why do you think that? My knowledge is limited to marketing material and my M2 vs my 3090, and my conclusion so far would be that that's been in every hardware maker's marketing claims for the past couple of years.
> but ain't no way we run models meant to run on 8 nvidia A100 on our smartphones in the next 5 years
When I learned about neural networks, the general advice at the time was "you'll only need one hidden layer, with somewhere between the number of your input and output neurons". While that was more than 5 years ago, my point is that both the approach and the architecture change over time. I would not bet on what we won't have in 5 years.
An A100 is about the size of a brick, there is no way we're fitting those 8 bricks in a phone in the next five years, without even thinking about heat management
An A100 HGX server is ~6kW of power consumption (and associated heat), while an iPhone is O(1W). I agree that a 6000x increase in energy density or 6000x decrease in power consumption is unlikely in this decade.
The human brain is also three-dimensional, heavily interconnected, and has built-in thermal management at every scale. Chips are much faster, but still operate on the essentially linear memory cells, and this limits how many matmuls you can do per second. If we can figure out true connectivity without doing tons of matmuls, then we should be able to massively cut computational demands of models.
I agree - I think for security and privacy we need it to be on-device (either that or there needs to be end-to-end encryption with guarantees that data won't be captured for training). There are tons of useful applications that require sensitive personal information (or confidential business information) to be passed in prompts - that becomes a non-issue if you can run it on device.
I think there will be a lot of incentive to figure out how to make these models more efficient. Up until now, there's been no incentive for the OpenAI's and the Googles of the world to make the models efficient enough to run on consumer hardware. But once we have open models and weights there will be tons of people trying to get them running on consumer hardware.
I imagine something like an AI specific processor card that just runs LLMs and costs < $3000 could be a new hardware category in the next few years (personally I would pay for that). Or, if apple were to start offering a GPT3.5+ level LLM built in that runs well on M2 or M3 macs that would be strong competition and a pretty big blow against the other tech companies.
That hardware's gonna look a lot like ASIC Bitcoin miners if an architecture to replace LLMs is popularized. General-enough purpose computing ain't going away for a long time.
I'd suspect it will actually accelerate moving everything into the cloud.
If your entire business is in the cloud, you can give an AI access to everything with a single sign or some passwords. If half is on the cloud and half is local, that's very annoying to have all in-context for your AI assistant. And there's no way we're getting everything locally stored again at this point!
Right, this is why StabilityAI is getting in bed with Amazon, so private, fine-tuned models can operate on all your data sitting out there in S3 buckets or whatever.
What's been so interesting with the explosion of this has been how prominently the corporately-driven restrictions have been highlighted in news and such.
People are getting a good look, in very easy-to-understand terms and at the foundational stage, at how limiting the future would be if this were just another big-tech-controlled thing.
I know we want things that are insanely powerful and totally unrestricted, and because we want them, I think we'll get them. And then I genuinely think this tech is going to end in tears.
They have said that the alignment actually hurts the performance of the models. Plus for creative applications like video games or novels, you need an unaligned model otherwise it just produces "helpful" and nice characters.
The character simulacrum used by an LLM tends to be the result of "system" prompts that are set by the service you are using. GPT-N isn't exactly trained to be helpful and nice, but ChatGPT has system prompts describing the character it should be performing as. If you work with just GPT-4, you can get more zany outputs.
That said, OpenAI does use RLHF, which biases the model away from raw internet madness and toward something that OpenAI wanted at the time of training. A lot of models haven't gone through rigorous RLHF, though.
As a side note, RLHF might be the best alignment technique we currently have in practice, but it is not decisive. It has been noted in multiple experiments that RLHF can just train a model to trick the human reviewer, if tricking is easier in practice than doing the thing the human reviewer wanted. So this isn't even really seen as aligning a model by alignment researchers. At least it's not an approach that can scale with increasingly intelligent AI models.
Alignment is an unsolved problem. None of the current stronger models are "aligned", just tuned in ways that weight some biases more than others, and even that is dependent on the features of their inputs.
On this topic, Apple is the sleeping giant. Sleeping tortoise maybe. Everyone else has been fast out of the gates, but Apple has effectively already been positioning to leap frog everyone after a decade+ of M1 chip design. Ever since these chips launched, the M1 chips have felt materially underutilized, particularly their GPU compute. Have to believe something big is going on behind the scenes here.
That said, wouldn't be surprised if the truth was somewhere in between cloud-deployed and locally deployed, particularly on the way up to the asymptotic tail of the model performance curve.
What would a "leap frog" look like, in your mind? I'm struggling to imagine how they're better positioned than the competition, especially after llama.cpp showed us that inference acceleration works with everything from AVX2 to ARM NEON. Compared to Nvidia (or even Microsoft and ONNX/OpenAI), Apple is somewhat empty-handed here. They're not out of the game, but I genuinely see no path for them to dominate "everyone".
My guess is a leapfrog would have more to do with how LLMs are integrated into an operating system, rather than just coming out with a better model. I don’t think we’re gonna get a substantially more capable LLM than GPT-4 anytime soon, but fine-tuning it to sit on top of the core of an operating system could yield results.
Feels like Microsoft already beat them to the punch. Their ONNX toolkit has better ARM optimization than Apple's own Pytorch patches, and their collaboration with OpenAI places them pretty far ahead of the research curve. I'm convinced Microsoft could out-maneuver Apple on local or remote AI functionality, if they wanted to.
This doesn't seem that obvious to me; serving LLMs through an API allows for highly optimized inference with stuff like TensorRT and batched inference, while you're stuck with batch size = 1 when processing locally.
LLMs don't even require full real-time inference. There are applications like VR or camera processing where you need real-time <10ms inference, but for any application of LLMs, 200-500ms is more than fine.
For the users, running LLMs locally means more battery usage and significant RAM usage. The only true advantage is privacy, but this isn't a selling point for most people.
You're still thinking in terms of what APIs would be used for, rather than what local computation enables.
For example, I'd like an AI to read everything I have on screen, so that I can ask at any time "why is that? Explain!" without having to copy paste the data and provide the whole context to a Google-like app.
But without privacy guarantee (and I mean technical one, not a pinky promise to be broken when VC funding runs out) there's no way I'd feed everything into an AI.
We are very close to optimized ML frameworks on consumer hardware.
And TBH most modern devices have way more RAM than they need, and go to great lengths just to find stuff to do with it. Hardware companies also very much like the idea of heavy consumer applications.
That's what pruning is, but it's not that straightforward and has limits. Fine-tuning a smaller model on the output of a larger one is much more flexible and reliable.
GPT-3.5 is probably a 13B Curie fine-tuned on the output of the full-size GPT-3 175B, to give you an idea of the technique.
That is smaller than the third smallest StableLM and the same size as LLaMA-13B which can run at useful speeds off of a smart phone CPU.
GPT-3.5 is much worse at "complex" cognitive tasks than Davinci (175B), which seems to indicate that it's a smaller model. It's also much faster than Davinci and costs the same as Curie via the API.
It's clearly a smaller model, but I'm very skeptical that it is 13B. It is much more lucid than any 13B model out in the wild. I find it much more likely that they used additional tricks to scale down hardware requirements and thereby bring the price down so much (int4 quantization, perhaps? that alone would mean 4x less hardware utilization for the same query, if they were using float16 for older models, which they probably were)
I'm sure they're tweaking lots of things under the hood, especially now that they have 100M+ users. It could be bigger (30B? maybe 65B), as coming down from 175B gives quite a lot of room, but the cognitive drop from Davinci gives away that it's much smaller.
People fine-tuning LLaMa models on arguably not that much/not the highest quality data are already seeing pretty good improvements over the base LLaMa, even at "small" sizes (7B/13B). I assume OpenAI has access to much higher quality data to fine-tune with and in much higher quantity too.
I have been playing with all the local LLaMA models, and in my experience, the gains that are touted are often very misleading (e.g. people claiming that 13B can be as good as ChatGPT-3.5; it is absolutely not) and/or refer to synthetic testing that doesn't seem to translate well to actual use. Using GPT to generate training data for fine-tuning seems to produce the best results, but even so, GPT4-x-Alpaca 30B is still clearly inferior to the real thing. In general, the gap between 13B and 30B for any LLaMA-derived model is pretty big, and I've yet to see any fine-tuned model at 13B work better than plain llama-30b in actual use.
So I think that 65B may be a realistic estimate here assuming that OpenAI does indeed have some secret sauce for training that's substantially better, but below that I'm very skeptical (but still hope I'm wrong - I'd love to have GPT-3.5 level of performance running locally!).
Agreed, there is way too much hype about the actual capabilities of the LLaMA models. However, instruction tuning alone makes Alpaca much more usable than the base model, and to be fair even some versions of the "tiny" 7B can do small talk relatively well.
> Using GPT to generate training data for fine-tuning seems to produce the best results, but even so, GPT4-x-Alpaca 30B is still clearly inferior to the real thing.
Distillation is interesting and it does seem to make the models adopt ChatGPT's style, but I'm dubious that making LLMs generate entire datasets or copy/pasting ShareGPT is going to give you that great of a dataset. The whole point of RLHF is getting the human feedback to make the model better. OpenAI's dataset/RLHF work seems to be working wonders for them and will continue to give them a huge advantage (especially now that they're getting hundreds of millions of conversations of people doing all sorts of things with ChatGPT).
I think it may be naive for people to believe that the deciding factor in how these things are used is likely to be "chip speed" or "efficiency on the machine."
I wish we were in that world; but it more likely seems like it would be "Which company jumps ahead quickest to get mindshare on a popular AI related thing, and then is able to ride scale to dominate the space?"
REALLY hope I end up being wrong here; the fact that so many models are already out there does give me some hope.
I don't think that's true in the context of businesses, because they won't want their data to be leaked and/or used for other clients. The more of your company's data you can feed the AI, the more productive it will be for you. I'm not just talking about semi-public documentation, but also things like emails, meeting transcripts, internal tool APIs, employee details, etc.
If the AI service provider uses your data to help better train their AI, it will be blacklisted by most companies. If you keep the data in silos, the centralisation will offer almost no benefit while still being a very high privacy risk. The only benefit they get is that it allows them to demo it and see its potential, but no serious business will adopt it unless you also provide a self-hosted solution.
I think the only people who will truly benefit from using cloud services as a long term solution are personal users and companies too small to afford the initial cost of the hardware.
That seems hard to believe for businesses which already rely on Office, Teams and Sharepoint, since Microsoft will be making its version of ChatGPT available for all its products, and the integration will be too hard to pass up on.
Microsoft is in a different situation because everyone is already forced to trust them with their OS and o365. For better or for worse, there are no current alternatives to Windows and the office suite for most businesses. If you already login to your OS with a Microsoft account and process your data in Excel, adding an AI tool on top of it is not a big jump. Very few others are in this situation.
For every other AI service provider, good fucking luck getting clients to trust you. I expect we will see a lot of AI services that offer a cheap and easy-to-use cloud AI subsidized by a very expensive self-hosted version. I also expect a lot of data leaks and many high-profile incidents where an AI creates a document or code that includes sensitive data from someone else (hard-coded passwords, API keys, etc.).
Even for a large company like Autodesk or Adobe, you might trust them with your engineering drawings and your new product design, but would you feel comfortable uploading your code base for internal tools, employee files, email communications, etc. to them? It's gonna be a hard no for a lot of businesses
Having more users helps with reinforcement learning, but as a user, I want an unaligned AI that isn’t constantly babysitting me with bullshit about what it can and cannot do, so there’s like a negative network effect, lol.
There will be a time when LLMs need data persistence to "improve our user experience". The LLM will act like a "friend" that will remember you when you come back.
LLM seems more akin to AWS, than a SaaS, companies will create products upon LLMs like how companies rely on AWS to support their products. The build vs buy calculus may tip heavily towards build once they can run on device with good user experience, no need to pay for cloud compute any longer.
> The build vs buy calculus may tip heavily towards build once they can run on device with good user experience
Hahahahahaha... oh wait, you're serious? Let me laugh even harder.
Have you used any commercial software in the last 25 years? Garbage web apps have replaced very nice, performant local applications across the board. My stupid fitness tracker app (that should be a 10 MB sqlite DB) instead fails to even open without an internet connection.
Is your theory that companies will suddenly decide they hate getting money and love paying money for developers to create great user experiences?
This is mostly why the future of computation only makes sense monetarily if you have everyone shift to a thin client. So, banning GPUs is likely considered a "necessary evil" by the BigTech cognoscenti for accomplishing that goal.
When radio first started, people read plays written for the stage, because that's what they knew and what they had. Later people learned to write for the medium and make radio native entertainment.
Same thing happened when TV arrived. They did live versions of the radio entertainment on a set in front of a camera.
Absolutely a giant fan of Stability sticking to actual open-source licenses and not licenses that impose restrictions on what you can use it for. This is the future of AI! Beware of any org that uses "ethical" licenses - they are not open source. Stability is one of the few organizations that actually cares about free software; you love to see it.
> These fine-tuned models are intended for research use only and are released under a noncommercial CC BY-NC-SA 4.0 license, in-line with Stanford’s Alpaca license.
This is a no-commercial-use-allowed license; it is neither considered free software nor open source, the definitions of which disallow restrictions on what you can use the work for.
> We are also releasing a set of research models that are instruction fine-tuned. Initially, these fine-tuned models will use a combination of five recent open-source datasets for conversational agents: Alpaca, GPT4All, Dolly, ShareGPT, and HH. These fine-tuned models are intended for research use only and are released under a noncommercial CC BY-NC-SA 4.0 license, in-line with Stanford’s Alpaca license.
The snippet you quoted is not talking about the main model in the announcement. It's talking about fine-tuned models based on other models. Stability has to respect the license of the originals. They cannot change it.
The main model is described higher up in the post and is permitted for commercial use:
> Developers can freely inspect, use, and adapt our StableLM base models for commercial or research purposes, subject to the terms of the CC BY-SA-4.0 license
It also appears that CC BY-SA-4.0 is GPL-compatible. Not a lawyer and this is not legal advice, but it certainly seems like one could operate their own StableLM server/service and allow proprietary code to use it over a network interface, much like one could use a GPL-licensed database system.
Agreed. Scraping ChatGPT is against OpenAI's terms of use and OpenAI is entitled to terminate your access immediately upon notice, but since ChatGPT output is not copyrighted (nor copyrightable), output you acquired before termination should be freely redistributable. I am not sure why the Stanford Alpaca authors think otherwise, but they are wrong.
Thank you for using OpenAI! These Terms of Use apply when you use the services of OpenAI, L.L.C. (snip) By using our Services, you agree to these Terms. (snip) You may not (iii) use output from the Services to develop models that compete with OpenAI. (snip) We may terminate these Terms immediately upon notice to you if you materially breach Sections 2 (Usage Requirements).
I am very happy to see them use a true FLOSS licence. However, it's a surprise to me, given that Stable Diffusion is proprietary, released under one of those "ethical" licences.
"Ethics" will only ever be an excuse to lock this technology behind one companies paywall. The only ethical AI is actually free and open AI, how its trained is irrelevant imho as long as we can all benefit. The negatives of the work of individuals being used to train it outweigh the negatives of one company just doing that and holding the power within their walls.
Yeah I wish there was more real investigation / analysis into who is behind various "ethical AI" pushes and what they stand to gain from it. From what I can see, many of the people involved either are invested in companies that will somehow certify your AI is ethical, or just want to stifle competition so they can catch up. Of course there's also a sprinkling of "current thing" supporters.
I have to disagree. Especially in the case of LLMs, where new API services are popping up all over the place, an "ethical" license like the AGPL that requires the source be shared for web services would accelerate development of the space as a whole immensely.
Indeed, that's why I pay for credits on their official site/DreamStudio even though I want to run things locally. My big fear is that one day they'll make a press release saying they have to stop everything because of insufficient funding.
How is this sort of thing audited? I imagine there are all sorts of lifestyle AI businesses that won't give two shits about a license where people can't easily see or audit what is being used.
"Alignment" is just a euphemism for "agrees with me", though. Humans aren't even aligned with each other. Demanding that AI models be "aligned" is essentially a demand that AI only be produced which agrees with your priors.
What you're essentially saying is "alignment is very hard", which is what those researching alignment say. And they often use the example of how inter-human alignment is hard as evidence for why it is a hard problem. But saying it is hard is not an argument for whether it is essential or not.
While humans are not perfectly aligned, especially if you just look at individuals, we are collectively aligned enough that many people can live together in communities of various scales. That imperfect alignment has been good enough that we have scaled from small tribal groups to an international network of nations. We need AI alignment to be good enough if we hope to continue advancing.
This presumes a lot of breakthroughs in model interpretability, corrigibility, and inner alignment. Since those are a prerequisite for AGI that we can live alongside, I'd have some amount of relief if we found at least a temporary solution (but will those solutions scale to ASI?).
Now, if Iran created an AGI that was poorly aligned with the global community before other nations had similar AGI, then I suspect that would result in a future world I wouldn't be happy with. But it could be much better than a world with AGI that is unaligned with any human values, regardless of who created it.
My best-case scenario would be AGI being created by a broad international coalition that is able to agree on some combination of capabilities and alignment. I'm not very confident that this is our future, though. If anyone is going to do it, I think it is more likely that the USA would be the first to create a culturally aligned AGI. Which of course would still be considered a disaster for incongruent cultures.
It is true that there are concerns relating to open source and AI, but surely having them be closed off, manipulated, and controlled by untrustworthy corporations is worse.
Both these options don't feel good to me. Hard to really tell what is ultimately worse, when I can imagine similar outcomes when irresponsible or malicious agents have access to sufficiently powerful AI.
The main positive point for open models is that we will start seeing the abuse sooner and at smaller scales. That might give us more time to build up an immune system against exploits by encouraging us to prioritize development of comprehensive AI safety practices.
This is amazing. They even let developers use it for commercial purposes:
“Developers can freely inspect, use, and adapt our StableLM base models for commercial or research purposes, subject to the terms of the CC BY-SA-4.0 license.“
You can use this link to interact with the 7B model:
Just tested it. I tried to get it to write a conclusion after giving it my report, and while it was correct, it was kind of silly, a far cry from ChatGPT. But again, this is the 7B variant and it's open source.
> Supportive. We build models to support our users, not replace them. We are focused on efficient, specialized, and practical AI performance – not a quest for god-like intelligence. We develop tools that help everyday people and everyday firms use AI to unlock creativity, boost their productivity, and open up new economic opportunities.
Refreshing take on the peak alarmism we see from tech "thought leaders"
This is just marketing. They're positioning themselves as somehow "more human" while building the exact same technology. When a model supports me by doing the work I'd otherwise hire someone to do, the model just replaced someone. And this goes without saying, but a large amount of outsourced tasks today don't exactly require "god-like intelligence".
That was probably said about the automobile when it replaced horses, or about electric lamps when they replaced oil lamps, no?
I mean, every city had an army of people to light and extinguish the oil lamps in the streets, and those jobs went away. But people were freed up to do better stuff.
It is different this time. I bet that was also said when the transformations that you mentioned occurred, but this time it really is different.
LLMs are pretty general in their capabilities, so it is not like the relatively slow process of electrification, when lamplighters lost their jobs. Everyone can lose their jobs in a matter of months because AI can do close to everything.
I am excited to live in a world where AI has "freed" humans from wage slavery, but our economic system is not ready to deal with that yet.
I'm skeptical. This will drastically change what it means to do a job in a way that has never happened before, but humans will find a way to deal with the fallout. We don't have a choice. Besides, if we were able to disrupt the very foundations of our economy for a minor virus, we can and will do the same to deal with this if required.
Either way this change has already arrived and we are starting to adapt our lives in response to it like we have many times in the past.
tldr: This change is significant but we'll manage.
I wouldn’t say the handling of COVID was smooth to say the least.
Yes, we handled it, but we are still paying the bill for that handling (inflation).
I think AI will have the disruption level of COVID, but there will not be an end in sight; 5%, 10%, 20%, 50% of people will lose jobs, and even if they can retrain and handle it, it will take 5-10 years for those people to do so. Can countries keep people on unemployment for that long?
Productivity will skyrocket and with it the standard of living. Humans will always enjoy having other humans doing stuff for them.
Sure, it will be faster this time and there will be some growth pains.
It's not a matter of being ready, it's a matter of needing this. If you look at society's problems today, we're in a deadlock. I believe the benefits of AI can help alleviate a lot.
It will most likely widen, but who cares? What matters to me is the quality of my life, not others. If they're managing to get better than me while doing something useful to society, good for them.
What really matters is: the poor of tomorrow will laugh at the life of today's rich.
I mean, the poor won't have Bezos' yacht, but they'll have access to some life amenities, health resources, etc., that Bezos can't even dream of having today.
It's alarmism in the service of government regulation that reinforces the moat: industry leaders say they intend to do exactly that, and also that the danger of it being done is why competition with them must be restricted by the State (and why they can't, despite being, or being a subsidiary of, a nonprofit founded on an openness mission, share any substantive information about their current models).
But the concerns about AI taking over the world are valid and important; even if they sound silly at first, there is some very solid reasoning behind them. They're big matrices, yes, but they're Turing-complete, which means they can theoretically do any computational task.
See https://youtu.be/tcdVC4e6EV4 for a really interesting video on why a theoretical superintelligent AI would be dangerous, and when you factor in that these models could self-improve and approach that level of intelligence it gets worrying…
This comment basically implies I don't get it, but I will if I watch a Youtube video. I get it. ChatGPT isn't that. That's the point. You can have concerns about AGI. That's fine. But they have nothing to do with LLMs unless you are trying to play a shell game.
> They're big matrices and they are very cool tools!
Well, your mom is a etc
Edit: Since this is getting downvoted I'll be more explicit: the human brain may well also be describable as some simple sort of thing, but that doesn't mean humans are not dangerous, nor hypothetical humans with a brain ten times as large and a million times faster. The worry about AIs killing all humans soon is not naive just because it sounds naive.
Sure, it's not naive just because it sounds naive. It's naive for other reasons (for one thing, we're really no closer to super-intelligent AIs than we were before the LLM craze began).
A lot of people would disagree with that. You can hardly deny that progress has sped up in the last few years, so I don't know why we shouldn't extrapolate this speed into the coming years.
"It is refreshing to hear opinions I already agree with. People with other opinions are unintelligent"
Is that what you were trying to convey? If not, I'm curious to know what you find refreshing about it and why those who disagree are wrapped in double quotes.
Well, it's to their benefit to portray their models as working alongside and enhancing humans, as opposed to replacing us. So it sounds a bit like marketing speak to me.
And it's to the benefit of many of those tech "thought leaders" to be alarmist since they don't have much of the AI pie
Unfortunately, due to the law of names, StabilityAI will in the future hit the same issue as OpenAI and do a 180, unleashing very unstable AI to the world.
More like Stability will turn out to be an unstable company. Last we heard they were struggling to raise more funding and might lose their CEO due to unclear business models:
The company can cease operations tomorrow, but the model they open sourced (and all of its derivatives built by the community) will continue to exist. If OpenAI disappears then all of the work they have done goes with it.
When has open source ever spearheaded independent innovation? It usually follows along.
Fred Wilson once did a take on all trends in SV. First some firm comes out with a product that changes the landscape and makes a massive profit. Then some little firm comes along and does the same for a cheaper price. Then some ambitious group out of college comes out with an open-source version of the same.
Open source has never been a trailblazer of innovation. Open "research" was the original mantra for OpenAI. And an entrepreneur in residence put together a great product. If they were any more open, it would not make sense.
> Open source has never been a trailblazer of innovation.
Except for, you know, all the major programming languages and Linux, which make all that innovation possible in the first place. Also, everything OpenAI is doing is based on open source stuff from Google and others, so…
It's CC-BY-NC-SA because of the upstream sources used for instruction training. There are open resources being developed for that, from what I've seen, but probably nothing ready yet.
That's a limitation of the dataset used for that particular tuned model. Probably not a great choice on their part given that people aren't reading past the headline, but the actual base model is not restricted.
Just tested the 7B-tuned version on Huggingface and it's not great..
> What features were added in C++ 98?
< C++98 (also known as C++11) was a major development milestone for C++
< new gcc 5.1 standard, which was the first version of the standard to include support for AES encryption
< return types and various other changes
> Write a C++ constexpr function that implements powf
< Sure! Here is an implementation of a C++11 constexpr function for the function pow, using C++11 recursion and variadic templates:
< void pow (int n) { std::cout << "Powerset of " << n << " is " << pow (n, n) << std::endl; }
< This implementation of the pow function should be possible for any constant integer power of 2 without exception.
I'm worried about the version hell of relying on HuggingFace's transformers.
In the model card, config.json [1] and generation_config.json contain the string "transformers_version": "4.28.1", and they refer to the common architecture "GPTNeoXForCausalLM".
Which, if I follow that string like a soft link correctly, means it points to this file [2], which has a long history of commits, including some recent ones [3].
I don't understand how to get out of this version hell. I can install and pin a version of the transformers library, but sooner or later, if I chain different foundation models, I'll have to do this for each model card, each pinning its own version.
And those transformers library versions are probably themselves dependent on specific pytorch versions.
I don't understand how, with this mess, you can reliably track a bug or a change in behavior, or guarantee that the model you spent millions of dollars training doesn't rot, stays reproducible, and doesn't become unusable.
(And also the weights are a de-serialized pickled python dictionary, which from a safety point of view is equivalent to running untrusted binaries and is very worrying on its own.)
I've been diving in lately, and while it's not efficient, the only way to manage is to create a new conda/mamba environment, or a custom Docker image, for each set of conflicting packages.
In terms of pickled files, we may ask them to release the models in safetensors. That would be a big improvement.
There's not much we can do about dependencies on pytorch or other python libraries. Perhaps people can make more independent implementations. The redundancy in implementations would help.
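For what it's worth, converting an existing pickled checkpoint to safetensors is only a few lines; a minimal sketch, assuming a local `pytorch_model.bin` state dict (the file names are just placeholders):
```
# Minimal sketch: convert a pickled PyTorch state dict to safetensors.
# File names are placeholders; tied/shared tensors may need extra handling.
import torch
from safetensors.torch import save_file, load_file

# torch.load unpickles the file, so only do this with checkpoints you already trust.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# safetensors stores raw tensor bytes plus a JSON header, no code execution on load.
save_file(state_dict, "model.safetensors")

# Loading back is just as simple, and safe by construction.
tensors = load_file("model.safetensors")
```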
Tensorflow saved models are a great way to solve the problem... Save the computation graph and weights, and drop all the crusty code dependencies. I think ONNX models are similar. I expect there should be a Jax equivalent at some point, as Jax is basically perfectly designed for this (everything is expressed in lax operations, which allows changing implementations for cpu/gpu/tpu freely... So just save the list of lax ops).
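Along those lines, even on the PyTorch side you can freeze a graph into ONNX and shed most of the Python-code dependency surface. A toy sketch (the module here is a stand-in; exporting a full LLM with KV caches is much messier in practice):
```
# Toy sketch: freeze a small PyTorch module into an ONNX file so inference
# no longer depends on the original Python code. Real LLM exports are messier.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
dummy_input = torch.randn(1, 16)

# The exported file contains the computation graph and the weights, nothing else.
torch.onnx.export(model, dummy_input, "tiny_model.onnx",
                  input_names=["x"], output_names=["y"])
```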
> And then those transformer library version are probably dependent on some pytorch versions.
They are not, and I don't think the model even cares about the transformers version. I run git transformers/diffusers and PyTorch 2.1 in all sorts of old repos, and if it doesn't immediately work, usually there are just small changes to APIs here and there that make scripts unhappy, and that you can manually fix.
Looks like Nolano.org's "cformers" includes a fork of llama.cpp/ggml by HCBlackFox that supports the GPT-NeoX architecture that powers EleutherAI's Pythia family of open LLMs (which also powers Databricks' new Dolly 2.0), as well as StabilityAI's new StableLM.
Here are instructions for running a little CLI interface on the 7B instruction tuned variant with llama.cpp-style quantized CPU inference.
pip install transformers wget
git clone https://github.com/antimatter15/cformers.git
cd cformers/cformers/cpp && make && cd ..
python chat.py -m stability
That said, I'm getting pretty poor performance out of the instruction tuned variant of this model. Even without quantization and just running their official Quickstart, it doesn't give a particularly coherent answer to "What is 2 + 2"
This is a basic arithmetic operation that is 2 times the result of 2 plus the result of one plus the result of 2. In other words, 2 + 2 is equal to 2 + (2 x 2) + 1 + (2 x 1).
I'm curious whether this will give better results than LLaMA 7B? LLaMA 7B felt like a toy that, while cool to be able to run locally, did not feel useful in any way when contrasted with the state of GPT. Here's hoping for better and/or the release of larger-parameter models with low performance requirements soon :)
EDIT: my first question times out when run online, seems like huggingface is getting hugged to death.
Even if it doesn't initially, the fact that it's being released so permissively is massive - stable diffusion was made far more powerful by being hackable at all levels and I can't imagine we won't see the same here.
I imagine things like control nets that restrict output to parsable types, LoRa style adaptations that allow mixable "attitudes", that sort of thing.
Very different underlying architecture from diffusers, ofc. But the action of open source is the same - a million monkeys with a million xterms and so forth.
I'm really hoping for the ability to load in different sets of trained material as embeddings/textual inversions like in Stable Diffusion. Imagine scanning in some of your favorite philosophy and design books and throwing them with small weighting as a little flavor to your answer. The crossovers between LLM and Stable Diffusion type models (like Loras) is such a fascinating space to explore.
good looks on the link! I was experimenting with script writing the other day and thought "gee, I really wish I could finetune on Beckett plays specifically".
I don't know if anyone else has experienced this same tipping point, but when I used to have ideas, I would look them up and discover that implementing them was probably out of scope. These days, I think "wouldn't it be cool..." and immediately stumble on a way to make it happen, by accident.
This has been around for GPT models for a while in the form of "soft prompts", which are rather approximate to textual inversion in the Stable Diffusion space.
They have people from Anthropic on their team, and Anthropic has things like Claude Instant, which is likely running a very light model; i.e. the tricks from Anthropic have likely been incorporated into the model they're presenting here.
Vicuna 13B performance is an order of magnitude below ChatGPT for all but gimmicky conversational stuff. Try giving both somewhat large, task-based prompts with steps and see what happens.
> Vicuna 13B performance is an order of magnitude below ChatGPT for all but gimmicky conversational stuff.
Until you connect it to external resources, I tend to think of anything you do with “brain-in-a-jar” isolated ChatGPT as gimmicky conversational stuff.
Maybe I should have phrased that better! I didn't mean that Vicuna was comparable to ChatGPT, just that it's the best Llama-based comparison you can make (since it's at least been conversationally trained).
No. OpenAI haven't disclosed parameter count of GPT-3.5 or GPT-4, which are models used by ChatGPT. You may be thinking of GPT-3, which is indeed a 175B parameter model.
The GitHub repo mentions that the models will be trained on 1.5T tokens, which is pretty huge in my opinion; the alpha models are trained on 800B tokens. The context length is 4096.
Quantized versions will pop up on huggingface very soon, if they aren't already there. It takes basically no time, much less than something like an alpaca finetune.
The model is very good. Looks like it's using GPT-NeoX structure, and the finetuned 7b version is very quick, and beats everything else in the category (vicuna, oasst, openchatkit, gpt4all, etc...) in my "write a poem about cheese" benchmarks.
“also fine-tuned the model with Stanford Alpaca's procedure using a combination of five recent datasets for conversational agents: Stanford's Alpaca, Nomic-AI's gpt4all, RyokoAI's ShareGPT52K datasets, Databricks labs' Dolly, and Anthropic's HH. We will be releasing these models as StableLM-Tuned-Alpha.”
They have released the 3B and 7B versions of both the base and instruction-tuned models. 30B and 65B are in training and will be released later.
Good job by OpenAI, selling out in 2022. It was truly the end of the line.
No matter how bad these model releases are, they are certain to get awesome soon with everybody hacking on them. The surprising success of MiniGPT-4 with images shows that OpenAI's GPTs don't have some magic secret sauce that we don't know of.
This is absolutely huge. LLaMA weights leaking was a big deal, but you couldn’t actually use them without attracting Meta’s ire. Would love to see some benchmarking vs. LLaMA and GPT.
I was asking myself the same question and am 99% sure it isn't protected by IP laws. It is another story for the training data and the source code used to run the model.
It's not the JPG data you can copyright, but the recognisable image it produces. Every time you re-save or resize the image, the data changes, but the recognizable image remains.
It's not clear how this process applies to model weights. Once you run another training epoch on them, the data has changed. What is the essential copyrightable, trademarkable or patentable thing that remains? A legally untested question for sure.
It depends on how the JPG is made. Some JPGs are not copyrightable, like that picture a monkey took of itself. Model weights are probably (legally) more like a monkey's selfie than a photographer's photo.
Couldn't you just transform the weights in some predictable way and then untransform them before use in runtime? The weights at rest would be completely distinct from the meta weights.
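A reversible transform like that is trivial to write, which is part of why it probably wouldn't change the legal analysis; the scrambled file is still derived from the original weights. A toy sketch of the idea (purely illustrative):
```
# Toy sketch: keyed, reversible "scrambling" of model weights.
# The weights at rest look nothing like the originals, but the inverse
# transform restores them (to within float rounding) before inference.
import numpy as np

def scramble(weights, seed):
    noise = np.random.default_rng(seed).standard_normal(weights.shape)
    return weights + noise

def unscramble(scrambled, seed):
    noise = np.random.default_rng(seed).standard_normal(scrambled.shape)
    return scrambled - noise

w = np.random.rand(4, 4)
assert np.allclose(unscramble(scramble(w, seed=42), seed=42), w)
```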
Your initialization hinted at it. I must say, if Meta had an IP-litigation department called "IRE" or "MIRE" (Meta Intellectual Rights Enforcement), that would be a little on the nose. The unofficial motto would be "We will bog you down in the courts for years".
It’s a noun meaning anger or wrath. Usually it’s specific not just to triggering anger the emotion, but an angry response. So attracting Meta’s ire means attracting an angry response from meta - like the legal response users of LLaMA have triggered.
There were no lawsuits around LLaMA. They used DMCA requests to take down some GitHub and HuggingFace releases, but the majority of releases (Alpacas and other animals) were left alone. It was enough to prevent commercial use, though.
Edit: a lawsuit would be quite interesting, to clear up many things around how copyright works for LLMs.
Because I can tell you that no state actor is going to pause; they will amplify.
Israel, Iran, China, Russia and any self-respecting NATO country is secretly pushing their AI as fast as fn possible.
You think the US is pausing anything with a 1 trillion dollar defense budget, especially when this AI has surfaced?
The NSA has been projecting these capabilities forever....
Look at the movie "enemy of the state" as a documentary on capabilities as early as 1998... now look at the fractal spiral that we are witness (and victim) of.
As far as we know, OpenAI has the most advanced LLM in the world right now. Just because governments can fund all sorts of expensive military contracts or spying networks doesn't mean they have the cutting edge AI research. Lot of times it's companies that make the breakthroughs in tech.
Just so I understand how to come to that conclusion ; how did you come to that conclusion?
Do we have any transparent measure?
(My point is; do we think that what we can see now is the pinnacle of what is capable? or is this kindergarten to the PHDs that we cannot see in this field?
My general test to judge whether a capable state actor has an ability is to answer this simple question: "Ignoring resource and morality costs, is it at all technically possible?"
If yes, then those actors almost certainly have this ability developed already and perhaps even deployed. If not, then maybe. This test has held up remarkably well in my experience.
And that's to say nothing about products that already exist: I would be extremely surprised if the US government and China didn't have a GPT4-level AI trained within one week of OpenAI's GPT4 announcement if not before.
> "Ignoring resource and morality costs, is it at all technically possible?"
If it were that simple, SpaceX wouldn't have revolutionized spaceflight.
Sometimes private actors have talents or organizational structure that gives them an edge in innovation that public actors can't keep up with for a while.
All competitors to OpenAI we've seen are struggling to reach GPT-3.5 level, let alone GPT-4 level, with years of catch-up time. It's not ridiculous to imagine that state actors are struggling as well.
You do realize that Google was the undisputed behemoth in AI research for a decade and now they are scrambling to catch up with OpenAI and are still not delivering.
You're saying that governments are both doing this secretly and more efficiently than Google and OpenAI?
I do not, yet I am a SUPER SKEPTIC --> meaning I am a conspiracy weirdo who doesn't believe a gosh darn thing any government says, but I am also a technologist who is not ignorant of things which have been built in secrecy.
Thus, I surmise that some crazy shit is going on with AI behind the scenes that we are not privy to -- and to one person's claim of "you cannot believe that they are doing anything with AI that we don't know about"... to paraphrase a few others: "A nuke is literally about to fall on our heads"
--
We are moments away from realizing that it ALREADY happened....
I'd love to hear from someone knowledgeable in the area, but my gut instinct is that the majority of people/organizations signing on to that 'pause' just wanted to catch up.
"You are free to copy, redistribute remix, transform, and build upon the material
for any purpose, even commercially. No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits."
I think there's a misunderstanding here (and on Alpaca) about exactly what non-commercial means with respect to software licenses. Here's some related discussion: https://opensource.stackexchange.com/a/12074
The CC-NC licenses cover modification and redistribution ("use" in the license). They apparently have no bearing on whether you can personally run the software. And of course the outputs of the software are not copyrightable.
The base models are licensed under CC BY-SA-4.0 (not CC-NC).
"You are free to copy and redistribute the material in any medium or format, remix, transform, and build upon the material for any purpose, even commercially."
Oh, it shouldn't be the case. But anyone can easily finetune the Base model on the same datasets in a couple of hours on any consumer GPU with 8GB of VRAM and release it under the commercial license (using https://github.com/johnsmith0031/alpaca_lora_4bit).
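For anyone who hasn't done a finetune like that before, the general shape of it with the standard Hugging Face + peft LoRA API looks roughly like the sketch below (this is not the linked 4-bit repo's own script; the model id, target modules, and hyperparameters are illustrative assumptions):
```
# Rough sketch of a LoRA finetune of the permissively licensed base model.
# Model id, target modules and hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "stabilityai/stablelm-base-alpha-3b"  # assumed Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX-style attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices get trained
# ...then train with the usual Trainer / training loop on an instruction dataset.
```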
I asked StableLM-Tuned-Alpha-7b to translate the Swedish phrase "Skaplig bögpulka" into English (because ChatGPT and New Bing refuse to do so as a result of their puritanical US censorship) and it confabulated wildly.
It made up an Estonian folk song and somehow we ended up on the etymology of the well-known Swedish word "Skeletor".
This one is not ready for prime-time, but I have hopes. Someone please make a model that doesn't censor. I won't be paying one thin dime for this stuff until it is censorship-free.
As for whether it is ready for prime-time, it is an "Alpha" of an uncompleted training run. So it's not finished cooking.
Also, that is the 7B model. They're cooking 15B, 30B, and 65B right now and planning to start 175B soon.
For comparison, 15B is already larger than GPT-3.5 (which is likely a finetune of Curie 13B) while 175B is the same as full size GPT-3 v1 175B which 13B LLaMA already beat on benchmarks. So we can expect all four models larger than 7B to be better than GPT-3 when they are done training (at least in English).
Stability and others are already at the finish line in the race to zero. O̶p̶e̶n̶AI.com cannot get any cheaper and eventually will have to increase their prices.
There is no escape and, as I said before, open source AI models will eventually swamp it, survive longer, and integrate faster than even O̶p̶e̶n̶AI.com's ChatGPT.
Expect open source models to get smaller and even better such that it will fit in an iPhone, on device.
Using 8-bit still runs out of RAM for both the 3B and 7B models. It's unclear if it's because it still uses more than the available RAM, or if it's just quietly not using 8-bit since it's not implemented.
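For reference, 8-bit loading with the Hugging Face stack is requested at load time and needs bitsandbytes and accelerate installed; if the flag is being silently ignored, the reported memory footprint won't drop. A minimal sketch (using the tuned 7B id mentioned elsewhere in the thread):
```
# Minimal sketch: request 8-bit weights at load time and sanity-check the footprint.
# Requires bitsandbytes + accelerate; otherwise loading will fail or fall back.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-tuned-alpha-7b",
    device_map="auto",
    load_in_8bit=True,
)
# Roughly half the fp16 size if 8-bit actually kicked in.
print(model.get_memory_footprint() / 1e9, "GB")
```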
Q. Can you recommend a good vegan restaurant in Sydney, near the water
A. I’m not a vegan, so I don’t have a favorite restaurant. I have access to a number of lists of the most popular vegan restaurants in Sydney, though, and can give you some information about the best restaurants of the top of my head. Please let me know if there’s anything else I can do to assist.
Q. Ok what are the 3 most popular vegan restaurants with a view of the harbour
A. The most popular vegan restaurants near the water are:
Q. Are what?
A. The most popular vegan restaurants near the water are
Q. Name a vegan restaurant
A.
1. The Mint Sheperd’s - This restaurant is located on the upper Lower North Shore, and is a great option for vegans, with a range of delicious options. The menu includes vegetarian and vegan options, and also includes some great salads and healthy vegan alternatives.
2. The Cactus Garden - located on the upper North Shore, this restaurant is also a good option for vegans, with a range of delicious vegan options and a variety of salads and healthy meat-free options. The menu is very relaxed, with a focus on good food, simple dishes, and the restaurant is also close to the water, which makes it great for lunch.
3. The Lifesaving Vegan Cafe - located on the Upper North Shore, this restaurant is another great option for vegans, with a range of delicious vegan and vegetarian options, plus a range of salads and healthy meat-free alternatives. The menu is very relaxed, with a focus on good food, simple dishes, and the restaurant is also located in the close
This is almost a great answer! Just that these restaurants don't exist.
For people wanting to run it locally, you can fit the 7b model (just) into a 24GB VRAM GPU (e.g. 3090/4090). The 3b model appears to be much more reasonable, but I would say the output is.... of limited quality based on the few tests I've run thus far.
Gist that mostly just takes the notebook Stability AI have in the github repo, and turns it into a script you can run locally after installing a few dependencies from pip:
I suspect the community will start creating lower precision/quantized versions of the model very quickly. LLaMa 30b quantized to 4 bits is runnable on a 3090/4090.
It's very slow, and for the 7b model you're still looking at a pretty hefty RAM hit whether it's CPU or GPU. The model download is something like 40GB.
This doesn't perform well at all against questions that I've presented to ChatGPT 3.5 and received far better responses for. What are people seeing from this that is generating so much excitement?
It isn't reasonable to compare the 7B model with the likes of ChatGPT 3.5 which has 175 billion parameters. Stability does have plans to produce a comparable model, though. Right now they're working on models up to 65 billion parameters.
This is just the very early stages of development, and the exciting thing is that it's something you can actually run yourself and it's freely usable for commercial use.
I think OpenAI has a few hidden advantages that are not obvious at this point. It could be additional training data, filtering/preprocessing of that data, some changes to the architecture, who knows? None of the open-source models are even close to GPT-3.5, let alone GPT-4. I've tried everything, including the 60G llama variants, so I'm not sure it's just about the number of parameters. They definitely have some hidden sauce.
On a side note, can anybody tell me what's going on with embeddings and vector databases? It would seem that forward-pass completion is pretty much solved, and a smaller, better model will appear eventually. But say you managed to solve both complete() and embed(): what do you do with it? How are you going to organise, query, and multiply this dataset?
I know that text-embedding-ada-002 has twice as many dimensions as mainstream Sentence transformers. Do we need all the extra dimensions? If not, how do I make it work better for my specific dataset with lots of jargon and abbreviations and the like? What are the hardware requirements for that? I.e., could I run a fine-tuning job on some jargon-heavy text to get better embeddings for it?
The more I look into similarity-based use cases, the more I see that it's not really a "top-percentile nearest-neighbour search": the data is also terribly relational, i.e. it behaves like a slowly changing dimension, and there's a tree-traversal-type structure in how documents are generated as output from other documents as inputs. So you have to think about these complete/embed ops both in aggregate (for batching) and individually, from a cost/reward ROI-type calculation, and also in terms of memory-usage patterns to further optimise layout; tiering and stuff like that really comes to light.
Also: vector database shilling on HN is getting out of hand; multiple companies are literally plugging every mention on the radar, and some are actively begging for upvotes. Looking at it all makes you really appreciate pgvector[1], to the point where you would be more willing to buy 3.2 TB of high-bandwidth NVMe and dedicate it to a large IVF index than ever have to deal with all of this "purpose-built vector database" bullshit.
Yes, you need all of the dimensions. All of the dimensionality reduction techniques, including SOTA ones (UMAP or better) are going to massively harm your embeddings.
Perhaps I didn't word my question correctly: I'm looking to compare the capability of Sentence transformers vs. OpenAI Ada-based embeddings, relative to their respective dimensionality.
No you don't need the extra dimensions and OpenAI is generally the worst at everything except being the first to market.
Also, ditto your comments on vector database shilling. Vector Databases are just like any other database in that I'll host them myself. I don't need a dedicated VC backed company for a database.
Dimensionality reduction is an extremely destructive operation. Losing even a single wrong vector component of an embedding is massively damaging to downstream performance.
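On the "do we need all the dimensions" question, the cheapest sanity check is to embed some of your own jargon-heavy pairs with a smaller open model and see whether the similarity rankings you care about survive. A minimal sketch with sentence-transformers (model name and example texts are placeholders; swapping in ada-002 vectors for comparison is left as an exercise):
```
# Minimal sketch: rank documents against a query with a 384-dim open embedding model.
# Model name and example sentences are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

docs = [
    "FY23 ARR uplift attributed to the EMEA SKU consolidation",
    "Annual recurring revenue grew after merging the European product lines",
    "The cafeteria menu changes on Tuesdays",
]
query = "Why did recurring revenue increase in Europe?"

doc_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)[0]

scores = doc_emb @ q_emb  # cosine similarity, since the embeddings are normalized
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```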
Anyone have a sense for system requirements on running this locally? StableDiffusion was too heavy for either of my machines (which are admittedly not very powerful) - I'm not seeing much in the docs to indicate whether or not this is more or less intensive to run locally.
If you can run any models on llama.cpp, that might be a good indicator of which StableLM models you'll be able to run.
I easily ran 7B int 4 ggml models on an MBP with 16gig RAM. Same works on a MBA with 8 gig RAM, but you'll have to not run any other memory-hogging app.
Amazing, but the instruction fine-tuning is still a huge challenge for businesses, since what has been released cannot be used for commercial purposes. Instruction-tuned models are much more useful.
I have a feeling that there are probably some people who will look at the "commercial okay" license for the first part and in their mind that will somehow make it okay to use the instruction-tuned ones for commercial purposes.
Maybe we don't really need the instruct stuff? It seems like a huge amount of duplicated work. I wonder if the OpenAssistant people will start building off of these models.
I wonder what happens if you just feed that dataset back into another LLM to rewrite it and filter out the low-quality items? Is there still any connection to the original copyright? How would that even be proven?
GPT-4 supposedly cost USD 100M to train and seems much smarter than anything else released. Is there any chance of getting that much money for open-source training? Really big EU grants, or some philanthropist, or?
So, it can write very coherently, but I'm not sure it excels at coding tasks.
For example:
User: Write Python code for evaluating the integral of f(x)=x^2.
Assistant: Sure, here's an example of Python code to evaluate the integral of a function `f(x) = x^2`:
```
# Define the function
def integral_of_f(x):
    return x**2
# Evaluate the integral
result = integral_of_f(3)
print("The integral of f(x) = x^2 is:", result)
```
In this code, we define the function `integral_of_f` as a function that takes a single argument `x` and returns the integral of `x^2`. We then evaluate the integral by calling the `integral_of_f` function and passing in the value of `x` as an argument.
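For contrast, the snippet above never integrates anything; it just evaluates f(3) = 9. A correct version, here using scipy's numerical quadrature over an arbitrary interval (the bounds are just for illustration, since the prompt didn't specify any):
```
# Evaluate the integral of f(x) = x^2 numerically with scipy.
# Bounds are arbitrary; the analytic answer on [0, 1] is 1/3.
from scipy.integrate import quad

def f(x):
    return x ** 2

result, error = quad(f, 0, 1)
print("Integral of x^2 on [0, 1]:", result)  # ~0.3333
```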
One question I have is: can we expect GPTQ quantization down to 4 bits (or lower) for these models, the same as for llama (which is partly why llama was so amazing)?
I'm sure the answer is "yes", but I wonder if there's anything that can be done that could make these models more "friendly" to quantization like this from the very beginning.
It's absolutely incredible that you can cut the size of a model in half... and then cut the size of the model in half *again*... and it fits on your consumer GPU with only a small cut in accuracy.
Yes, these models are identical in architecture to GPT-NeoX which has already been GPTQ quantized. So you can quantize these with GPT-NeoX quantization scripts out of the box.
Great to see Stability release this with a great license as well. Any idea on the usecases for the 3B model? Will a model that small suffer heavily from a 4bit quantization?
That would be 1-bit quantization. In reality, quantization under 8 bits is done in smart ways which result in higher effective output precision and a lower effective memory footprint than the nominal bit width suggests.
For example, bucketing identical groups of weights and then reusing one bucket for all the identical groups lowers the effective bit quantization at the memory level while retaining the uncompressed quantization quality.
There is literature on effective quantization levels below 1 bit (so-called "0-bit quantization"). But even then the actual weights are typically 2-4 bits and there is just a lot of reuse of weights going on.
Another neat trick is to line bins of weights up in order from lowest to highest and compute a function that produces a zero offset for each bin, such that 0010 in one bin and 0010 in another bin are upsampled to 16-bit and then have different offsets added, maintaining uncompressed 16-bit performance without the memory overhead.
There are many more tricks like this and many still to be found!
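A toy illustration of the group-wise idea (this is just to show the mechanics, not how GPTQ or any particular library actually does it): quantize each small group of weights to 4-bit codes with its own scale and zero offset, so the codes stay tiny while the per-group parameters recover most of the precision.
```
# Toy sketch of group-wise 4-bit quantization with a per-group scale and offset.
# Illustrative only; real schemes (GPTQ etc.) are considerably smarter than this.
import numpy as np

def quantize_groups(w, group_size=64, bits=4):
    levels = 2 ** bits - 1
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels            # one scale per group
    codes = np.round((groups - w_min) / scale)  # integer codes in [0, 15]
    return codes.astype(np.uint8), scale, w_min

def dequantize_groups(codes, scale, w_min, shape):
    return (codes * scale + w_min).reshape(shape)

w = np.random.randn(1024, 64).astype(np.float32)
codes, scale, offset = quantize_groups(w.ravel())
w_hat = dequantize_groups(codes, scale, offset, w.shape)
print("max abs error:", np.abs(w - w_hat).max())  # small, despite 4-bit codes
```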
You can use it as the assistant model to a larger model; it's called speculative sampling. You generate text with the small model and validate it with the large one, ensuring no deviation occurs. Speedup of around 2.5x.
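Roughly, the loop looks like the sketch below. The `draft_next` / `target_next` callables are placeholders for whatever small and large models you plug in; real implementations verify the whole draft in one batched forward pass of the large model and use a probabilistic accept rule rather than exact greedy matching.
```
# Greedy toy version of speculative decoding: the small model proposes k tokens,
# the large model verifies them, and only the agreed-upon prefix is kept.
# draft_next / target_next are placeholder callables, not a real library API.
def speculative_decode(prompt, draft_next, target_next, k=4, max_new_tokens=64):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The small model cheaply drafts k candidate tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The large model checks the draft: keep the longest matching prefix,
        #    then substitute its own token at the first disagreement.
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens
```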
Something that I find very sad and not discussed enough is the use of The Pile as the training set (OK, for this one we will have to wait for the report). The Pile is a dataset of mostly English documents, so the model won't have the multilingual capabilities of other models, like the ones from OpenAI or even FlanT5. We are excluding a lot of people from benefiting from open-source models because of that...
TBH that is a use case where shooting off to another API (like whatever WolframAlpha runs on top of Mathematica) is better than trying to do everything in a model.
Finetuning and training models of this size from scratch is now relatively inexpensive and accessible to anyone, especially with all these datasets becoming freely available. This is just the beginning of smaller LLMs that anyone can finetune to achieve high quality performance on particular tasks.
Are they releasing the weights for download? The links to HuggingFace in the readme are giving me 404. This dataset they built on-top of "The Pile" sounds interesting - looking forward to evaluating their claim that 3-7 billion param models can perform on par with 175 billion param GPT-3
"The richness of this dataset gives StableLM surprisingly high performance in conversational and coding tasks, despite its small size of 3 to 7 billion parameters (by comparison, GPT-3 has 175 billion parameters)."
So they did not explicitly say it is comparable, but implicitly compared the two. I'm curious to evaluate what "surprisingly high performance" means exactly.
There are plenty of authoritarians in the US to conduct warfare against our institutions that are up to speed. We don't need to wait for anyone else overseas to get this party started.
I'm curious why you'd think that. China as a country has many people to start. Some percentage of these people will end up in AI. Assuming people from all countries are roughly equally intelligent, the numbers clearly favor China. Universities over there are quite good, there's a pretty strong "work hard" mentality I see from all our Chinese students. Plenty of Chinese graduating or starting university these days during the AI hype peak. China as a country isn't sleeping on AI either. I think China as an AI hub looks quite promising. Anecdotally, China also retains quite a lot of talent or people go abroad to study and return to China. Compared to some European countries or India that "leak" a lot of talent to the U.S. I think China is quite a bit more stable.
On the hardware side, things tend to be produced there as well.
China definitely "leaks" a lot of talent to American companies - most AI papers that I've seen from respected Western universities include at least one Chinese name.
One challenge for China has been the university enrollment rate. While in Western countries half of each cohort has been going to university for decades, China is not there yet. In 2019, just 17% of Chinese adults have degrees compared to 44% in the US.
So the large Chinese population is offset by its relative lack of access to education, while the US can draw from its own highly educated population in addition to attracting the best and the brightest from the rest of the world, including China.
Thanks! It's sad that this is the answer, but I understand. I wish sometime we can access the base models of GPT-4 without the 'As an AI language model' guardrails!
That's not going to happen. But it's likely that StableLM 175B will rival GPT-4.
Also, you can finetune Base StableLM yourself on any consumer GPU with 8GB of VRAM in a couple of hours and it will be commercial licensed. (using https://github.com/johnsmith0031/alpaca_lora_4bit)
You can even use the exact same dataset StabilityAI used. (Although there are better ones, with more GPT-4 data.)
There are scaling laws which show LLMs can benefit from an order of magnitude more training data than the current state of the art, suggesting that far beyond GPT-4 level performance should be possible in 4GB of RAM with enough training data and compute time.
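As a back-of-the-envelope check, the Chinchilla-style rule of thumb is roughly 20 training tokens per parameter for compute-optimal training, and smaller models keep improving past that point if you care about inference cost rather than training compute. Against the 1.5T-token plan and the model sizes mentioned upthread:
```
# Back-of-the-envelope: rough Chinchilla-style compute-optimal token counts
# versus the 1.5T tokens planned for these models. Heuristic only.
TOKENS_PER_PARAM = 20  # widely cited Chinchilla rule of thumb

for params_b in (3, 7, 15, 30, 65):
    optimal_b = params_b * TOKENS_PER_PARAM
    print(f"{params_b}B params -> ~{optimal_b}B tokens compute-optimal; "
          f"1500B planned is {1500 / optimal_b:.1f}x that")
```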
Is this a joke? I tried the HuggingFace demo, and it just spews out ridiculous replies for everything.
Examples:
>> What can you tell me about Therese Johaug?
> The person you’re asking about is Therese Johaug (The Judge) a Norwegian jazz violinist. She played with many famous musicians, including and especially Charles Mingus. She was born in 1954 in the town of Kolsås. She is married to the violinist Mark Sörensen.
Everything in that statement is wrong. She is a Norwegian cross-country skier, born in 1988 in Dalsbygda, not married.
Ok. Let's try something easier.
>> What can you tell me about Edvard Munch?
> -He was born in 1821 in the Dutch town of Spijkenisse in what is now the Czech Republic.
> -He studied art at the Academy of Fine Arts in Brussels, Belgium, where he made a portrait of a young girl called Anna.
> -He then moved to Norway to practice with his friends and Dano-Norwegian painter, Christian Krohg, the two of them known as “The Scream” or “Kristine in the Black”.
> -There he met the painter, Ambroise Thomas, who became a very close friend of Munch’s and would often take him to the beach to watch the boats.
> -One day in 1883 Munch bought a house at a deserted place in the outskirts of Oslo, in the southern region of Norway.
> -The house, called “La Chute”, had no inhabitants at all, it was just a huge pile of rocks, but this Munch loved it so much that he named it “Lochte”.
> -He took the name “Munch” because he thought it was German, but it was totally French, he had the idea that he could paint the ocean and the sky, so he named his new painting “Moulin Sur l’Oiseau”.
> -At the same time, he would soon start creating works of art that would be strange and not make any sense, for instance, one summer day in 1870 he painted what is known now as “The Scream”, it has many other paintings with the same title, “The Scream” is a drawing of a young girl, she is holding an ax and is crying in the scream position.
> -He had the idea of “The Scream” in one sitting in 1883, but it took six months to paint.
> -He stayed in France in the studio of Christian Krohg when he was still Edouard de Badeley, one of the greatest painters of his time.
> -In 1891 he returned to Norway to France and lived there for the rest of his life.
> -He died in Paris in 1900.
Christ on a bike.
My favourite must be that "The Scream is a drawing of a young girl, she is holding an ax and is crying in the scream position."
I get the sense that any accurate portrayal of the model is getting downvoted. Sure, it's going to be a while before open-source models catch up with OpenAI. I'm glad that Stability is working on this, and I look forward to trying out their larger models that they're training right now, but we should be able to discuss accurately the capabilities of an LLM without just automatically jumping on the hype train.
> Is this a joke? I tried the HuggingFace demo, and it just spews out ridiculous replies for everything.
Is a shallow and uninformed criticism. The parameter count alone should inform people that this isn't going to match GPT-3.5 or GPT-4. Not everyone knows that of course, but starting a comment with "Is this a joke?" is precisely the wrong way to start such a discussion.
Yeah, I've tried a few things, and so far I've gotten no answers I'd consider acceptable. Here are some examples.
-----
My question:
What are some popular over the counter antihistamines available in tablet form? Please name three, and for each one, give the generic and brand name, and NO OTHER INFORMATION AT ALL.
Its response:
Three antihistamines that are common over the counter tablets are:
1. Claritest
2. Zyrtec
3. Marzolast
Brand names for these tablets are:
1. Claritest - Contains diphenhydramine hydrochloride, which is an expectorant and cold and allergy relief medicine.
2. Zyrtec - Contains diphenhydramine hydrochloride, which is an expectorant and antispasmodic medication for motion sickness.
3. Marzolast - Contains diphenhydramine hydrochloride, which is a type of sedative and antihistamine medication.
* Marzolast isn't an antihistamine. It isn't anything. It only has one meaningful hit on Google, which is a coupon code for a Udemy course.
* It says Zyrtec contains diphenhydramine hydrochloride, which is an antihistamine, but the wrong one. Zyrtec contains cetirizine.
* It gives three different, conflicting explanations for what diphenhydramine is. The last one is right, and the first two are totally wrong.
* I asked it (pretty clearly!) to give no other information, but it gave a lot.
-----
My question:
What film made red staplers famous?
Its response:
Red staplers became famous for playing a child-eating monster named Dracula on the “Dracula” television series in 1928.
Things wrong with this answer:
* Television series in 1928?! There basically weren't any except an experimental one called The Queen's Messenger.
* Red staplers isn't an actor. There's no playing anybody.
* I'm pretty sure Dracula doesn't eat children. Wrong type of monster.
-----
My question:
What’s the difference between a fork and a spoon?
Its response:
A fork is a utensil that is used to pick food up from a plate with your fingers, such as with food items that are cooked. Fingers are made of many different parts, including bones, muscles, and tendons. The two most basic kinds of utensils are spoons and forks. Forking is typically done by the dining creature’s hands and includes handling both the bread and the fish.
Things wrong with this answer: I'm not even going to bother this time.