FreeWilly 1 and 2, two new open-access LLMs (stability.ai)
140 points by anigbrowl on July 21, 2023 | 47 comments



Does this mean Stability gave up on StableLM?

I notice that the repo hasn’t been updated since April, and a question asking for an update has been ignored for at least a month: https://github.com/Stability-AI/StableLM/issues/83


Not yet; update on that soon. We just had to change tack for LLaMA 2, and we trained and released OpenLLaMA 13B in the meantime.


Is the fine-tuning code also available?


According to my understanding of the blog post, FreeWilly2 performs near or above ChatGPT-4 for most test cases. Is this true?

Am I misunderstanding this? Is this not a big deal?


It beats GPT-3.5 in some benchmarks, the first open model to do so, I believe.

Versions being worked on now will do much better.

GPT-4 is far better and will likely not be beaten by any current open models or approaches, though maybe by an ensemble of them.


Allegedly, GPT4 itself is an ensemble of models anyway, right?


The architecture is reportedly something like an ensemble, but with a gating network that chooses 2 experts to generate text.
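None of this is publicly confirmed, but the rumored "pick 2 experts" routing is just top-2 gating, which can be sketched like this (function name, expert count, and scores are all made up for illustration):

```python
import math

def top2_route(gate_logits):
    # Illustrative top-2 gating: score every expert for a token, keep the
    # two highest-scoring experts, and softmax-normalize their scores into
    # mixing weights for combining the two experts' outputs.
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top2 = ranked[:2]
    exps = [math.exp(gate_logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

# e.g. a router scoring 8 hypothetical experts for one token:
routes = top2_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 0.3, -0.2])
```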


Where are you seeing GPT-4?

All I see is "compares favorably with GPT-3.5 for some tasks".


The post mentions GPT4All benchmarks—maybe that’s where the confusion lies?

https://gpt4all.io/


The benchmarks towards the bottom are of ChatGPT-4, according to the footnotes.


The report cites both GPT-3.5 and GPT-4 scores on page 7 [1]. I've checked the numbers and they compare FreeWilly2 to GPT-3.5. For example, HellaSwag score of 85.5% corresponds to GPT-3.5.

[1] https://arxiv.org/pdf/2303.08774v3.pdf


The footnoted papers mention GPT-4 in the title. Unless they're citing GPT-3.5 results from papers with GPT-4 in the name, which seems confusing.


There are no actual footnote marks connecting any statements in the post to the footnotes, so no specific claims are referenced. But if you read the actual text of the page, they say it compares favorably to 3.5 for some tasks. Which means it falls short of 3.5 for the rest, and of GPT-4 for all of them, or else they surely would have mentioned that as well.


This is impressive. It tops the Hugging Face leaderboard. I think we are inching closer and closer to ChatGPT levels in 70B models.

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...


So Free, as in 4 random letters that were strung together.


GratisWilly?


CC-BY-NC 4.0 is pretty free?


-NC is the pretty restrictive part.


A more thoughtful discussion on the subject:

https://opensource.stackexchange.com/questions/1717/why-is-c...


It's not code, and that's also a discussion about a license that isn't used here.


It is. I think people are perhaps leaping to the usual discussions when models are called "open source" but here it seems they've called it free. Wasn't the whole foundational part of the open source discussion about distinguishing open and free?

Edit - someone will no doubt bring up "open access" as a term. This is a common term for academic work and the license here easily meets the criteria usually applied. Open access is not the same as open source.


Great work, Stability!

Note: It's "Llama 2", not "LLaMA 2"; they changed the capitalization.


I assume the names are a reference to the Orca model, as well as continuing the theme of naming LLMs after animals, like Falcon and Llama.


Someone in the Stable Foundation Discord told me that FreeWilly1 codes better than FreeWilly2. Can anyone confirm?


Yes, that would be the case.


I'm lacking context here, but shouldn't it be possible to train an LLM-like model for images? (as an alternative to the stable diffusion process)

If you rearrange all pixels from square images using the Hilbert curve, you should end up with pixels arranged in 1D, and that shouldn't be much different from the "word tokens" that LLMs are used to dealing with, right? Like an LLM that only "talks" in pixels.

This would have the benefit that you may be able to use various resolutions during training with the model still "converging" (since the Hilbert curve stabilizes towards infinite resolution).

I'm not sure if the color values would also need to be linearized; maybe it could work to represent the RGB values as a 3D cube and apply a 3D Hilbert curve to it as well, so you would have a 1D representation of all of the colors.

I don't really know the subject but I guess something like that should be possible.
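For what it's worth, the mapping you describe is easy to sketch: this is the standard iterative Hilbert-curve construction that maps a 1D index back to (x, y) coordinates (function names are mine, and it assumes a power-of-two square image):

```python
def hilbert_d2xy(n, d):
    # n: side length (a power of two); d: index along the curve, 0 <= d < n*n.
    # Standard iterative construction: walk up the quadrant levels, rotating
    # and reflecting as the classic Hilbert recursion dictates.
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate/flip the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_flatten(img):
    # img: n x n nested list of pixels (rows indexed by y);
    # returns the pixels reordered along the Hilbert curve.
    n = len(img)
    return [img[y][x] for x, y in (hilbert_d2xy(n, d) for d in range(n * n))]
```

The locality property the parent comment relies on is that consecutive curve indices are always grid neighbors, so nearby "tokens" in the 1D sequence are nearby pixels in 2D.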


No need for a Hilbert curve, you can just flatten pixels the usual way (i.e. X = img.reshape(-1)). The main issue is that attention doesn't scale that well, and with a 512x512 image the attended region is now 262k tokens, which is a lot. The other issue is that you'd throw away data by linearizing colors (why not keep them 3-dimensional?).

The corresponding work you're looking for is Vision Transformers (ViT). They work well, but not as great as LLMs, I think, for generation. Also, I think people like that diffusion models are comparatively small but expensive: they'd rather wait than OOM.
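To put numbers on the scaling point (using the 512x512 example from above; the 16x16 patch size is my assumption, a common ViT choice):

```python
# Raw-pixel flattening vs ViT-style patching for a 512x512 image.
H = W = 512
pixel_tokens = H * W                        # one token per pixel: 262,144
attention_entries = pixel_tokens ** 2       # dense self-attention is quadratic: ~68.7 billion entries
patch = 16                                  # assumed ViT patch size
patch_tokens = (H // patch) * (W // patch)  # one token per 16x16 patch: 1,024
```

This is why ViTs attend over patches rather than individual pixels: the quadratic attention cost over raw pixels is intractable at these resolutions.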




Free Willy 1 is a classic and holds up, Free Willy 2 doesn't


The https://freewilly.ai website is just a link to the blog post right now.



Great name. 11/10. Expect this to be really popular in the UK ;)


I thought it might be intentional, given that a big appeal of OSS models is the ability to generate content that would otherwise hit a content filter.


Ah, that's a good point. Well, I am all for freeing the willy.


Huh, kind of weird that they don't have the chat-tuned Llama 2 in the comparison mix.


The Llama 2 chat models are completely neutered. They are practically unusable. I'm convinced it was a joke from the Meta engineers.


They probably had this all put together before Llama 2 was available.


FreeWilly 2 is based on Llama 2


I stand corrected!


It's just wild to me how fast they turned this around! Llama 2 was released 3 days ago, and we're already seeing fine-tuned variants.



Are these models aligned?


> Limitations and bias

> Although the aforementioned dataset helps to steer the base language models into "safer" distributions of text, not all biases and toxicity can be mitigated through fine-tuning.


It’s so good to see a non-commercial fork of llama 2


Why?


It’s good to see companies putting effort into making models that nobody will use, it protects us from skynet happening



