Zamba2-7B (zyphra.com)
181 points by dataminer 4 hours ago | 32 comments





For anyone else looking for the weights, which as far as I can tell are not linked in the article:

Base model: https://huggingface.co/Zyphra/Zamba2-7B

Instruct tuned: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
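
If you want to poke at the base model locally, here's a minimal sketch using the standard transformers AutoModelForCausalLM path. This is an assumption on my part: the model card may require a Zyphra-specific transformers fork, so check it before relying on this.

  # Minimal sketch, assuming the checkpoint loads through the stock
  # transformers AutoModelForCausalLM path (the model card may instead
  # require a Zyphra-specific fork; check it first).
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-7B")
  model = AutoModelForCausalLM.from_pretrained(
      "Zyphra/Zamba2-7B",
      device_map="auto",
      torch_dtype=torch.bfloat16,
  )

  prompt = "Mamba-style state space models are interesting because"
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  output = model.generate(**inputs, max_new_tokens=50)
  print(tokenizer.decode(output[0], skip_special_tokens=True))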


I couldn't find any gguf files yet. Looking forward to trying it out when they're available.

What can be used to run it? I had imagined Mamba-based models need different inference code/software than other models.

To run gguf files? LM Studio, for one. I think recurse on macOS as well, and probably some others.

I wonder how much of the performance gains can be attributed to their improved dataset rather than their architecture. That would be an expensive experiment.


No mention of or comparison with phi-3 seems odd. Isn't phi-3 leading the other models by a bit?

ϕ-3 isn't in the 7B league.

I'm tired of LLM releases that cherry-pick benchmarks. How does it compare to SOTA qwen2.5/phi3.5?

Does anyone know of an up-to-date independent leaderboard? Lmsys and livebench used to be great but have skipped most major models recently.


I think it cannot surpass SOTA on some LM evaluation sets, but please understand that achieving better results requires a very good training dataset, which not everyone can afford.

On the other hand, the main selling points of Zamba/Mamba are low latency, generation speed, and efficient memory usage. If that holds up, LLMs could become much easier for everyone to run. All we need to do is wait for someone with a good training dataset to train a SOTA Mamba.
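
To make the memory argument concrete, here is a rough back-of-the-envelope sketch with made-up dimensions (not Zamba2's actual config): a transformer's KV cache grows linearly with context length, while a Mamba-style SSM layer keeps a fixed-size recurrent state no matter how long the context gets.

  # Back-of-the-envelope comparison with illustrative dimensions
  # (not Zamba2's actual config): KV cache grows with sequence length,
  # SSM state does not.

  def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
      # keys + values cached for every layer and every token (fp16/bf16)
      return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

  def ssm_state_bytes(layers, d_inner, state_dim, bytes_per_elem=2):
      # one fixed (d_inner x state_dim) recurrent state per layer
      return layers * d_inner * state_dim * bytes_per_elem

  for seq_len in (1_000, 10_000, 100_000):
      kv = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=seq_len)
      ssm = ssm_state_bytes(layers=32, d_inner=8192, state_dim=16)
      print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:9.1f} MiB | SSM state {ssm / 2**20:5.1f} MiB")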


any benchmarks vs phi-3?

When they say they use two shared attention blocks, is each block directed at different aspects of the data?

In memory research there is the idea that every event has a dual representation: a more verbatim representation, and a more context-weighted one. As we develop through early childhood, our verbatim memory representations increase in fidelity and resistance to interference, peaking around 6 to 10 years of age, depending on the specifics. As verbatim memory matures, another aspect of memory representation improves: what some have called gist memory, or semantic context. Gains in memory performance continue into adolescence primarily through an increasing ability to use context and gist (broad representations that capture the details of an event by inference) to improve overall accuracy, but also with a greater likelihood of false alarms to lures primed by semantically related material during learning, precisely because recall accuracy comes to rely more heavily on context.

So I could imagine such a system in an LLM, where one head attends to exact representations and another keeps its attention on a coarser grain of information that anchors them. However, I am not familiar enough with LLMs to know whether that is just silly analogizing.


Please correct me if I'm wrong, but my understanding of ML/LLMs is that this kind of hand-crafting has been tried, but it is better (and less finicky) to let behavior like this emerge from more data; see [1] "Bitter Lesson" and [2] "Scaling Laws".

Mamba as an architecture claims some significant performance gains, but to my knowledge no really large models (>~100B params) with open or leaked weights using the Mamba architecture have been disclosed other than this one (7B).

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html [2] see e.g. https://m.youtube.com/watch?v=5eqRuVp65eY&pp=ygUMU2NhbGluZyB... for a well made, easily digestible intro


Anyone seen a URL to a tool that lets you try this one out?


Thanks.

Although it tests just a small aspect of an LLM's strength, one question I like to ask every new LLM is one I first saw in a blog post [1], and I have yet to come across a small LLM that answers it correctly. Almost all large LLMs fail it too.

A small strawberry is put into a normal cup and the cup is placed upside down on a table. Someone then takes the cup and puts it inside the microwave. Where is the strawberry now?

[1] https://towardsdatascience.com/openai-o1-the-enigmatic-force...


Here's a chat interface

https://maia.zyphra.com/chat


Cool! Seems we’re moving closer and closer to realizing the Lottery Ticket Hypothesis https://arxiv.org/abs/1803.03635

How is this related?

Ah, apologies, I misread the architecture. But it does fit the spirit of finding disproportionately higher performance in smaller networks, and it still holds the promise of finding smaller sub-networks. Running on mediocre mobile devices doesn't seem like a dream when stuff like this is released. Exciting!

Any ideas what languages this supports?

Not transformer based?

Since it looks, from the announcement, like the model hasn't changed much, here's the Zamba 1 paper for reference: https://arxiv.org/pdf/2405.16712

Zamba 1 has a single shared attention block that is applied every 6 Mamba blocks. For Zamba 2: "Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network."
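
As a rough sketch of what that schedule looks like (based only on the quoted description, not Zyphra's actual code; the block counts and toy sub-modules are made up), the two attention blocks A and B are each defined once and reused every time their turn comes up, interleaved between runs of Mamba2 blocks:

  # Sketch of the layer schedule implied by the description above
  # (not Zyphra's implementation; counts and sub-modules are stand-ins).
  import torch.nn as nn

  class Mamba2BlockStub(nn.Module):           # stand-in for a real Mamba2 layer
      def __init__(self, d_model):
          super().__init__()
          self.proj = nn.Linear(d_model, d_model)
      def forward(self, x):
          return x + self.proj(x)

  class SharedAttentionBlockStub(nn.Module):  # stand-in for the shared transformer block
      def __init__(self, d_model, n_heads=8):
          super().__init__()
          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
      def forward(self, x):
          out, _ = self.attn(x, x, x)
          return x + out

  class HybridBackboneSketch(nn.Module):
      def __init__(self, n_mamba=48, mamba_per_attention=6, d_model=512):
          super().__init__()
          self.mamba_blocks = nn.ModuleList(
              Mamba2BlockStub(d_model) for _ in range(n_mamba)
          )
          # only two attention blocks exist in total; their weights are
          # shared across every position where attention is applied
          self.shared_attention = nn.ModuleList(
              [SharedAttentionBlockStub(d_model), SharedAttentionBlockStub(d_model)]
          )
          self.mamba_per_attention = mamba_per_attention

      def forward(self, x):
          attn_calls = 0  # 0, 1, 2, 3, ... -> A, B, A, B, ...
          for i, mamba in enumerate(self.mamba_blocks):
              x = mamba(x)
              if (i + 1) % self.mamba_per_attention == 0:
                  x = self.shared_attention[attn_calls % 2](x)
                  attn_calls += 1
          return x

Zamba 1, per the paper, would correspond to a single shared block being applied at every one of those positions instead of alternating between two.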

Perhaps of relevant interest: Nvidia released a paper back in June testing hybrid SSM models, and in their small-scale (<1B) experiments, roughly 8% attention layers (about a 12:1 ratio of SSM to attention layers) was optimal. https://research.nvidia.com/publication/2024-06_empirical-st...

The 8B param/3.5T token model they trained, Mamba2-Hybrid, was also Apache 2.0 licensed: https://huggingface.co/nvidia/mamba2-hybrid-8b-3t-128k


Tri Dao and Albert Gu say "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality"

https://arxiv.org/abs/2405.21060

Mamba-2 is used in Zamba2.


On the page it states:

Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.

so it sounds like it is transformer-based?


Another day, another world record in AI.

Reminds me of Sergey Bubka (https://en.wikipedia.org/wiki/Sergey_Bubka). Bubka broke the world record for men's pole vault 35 times during his career.


> 35 times during his career

Not to diminish his world records, but professional athletes frequently hold their performance back so they can set more world records, especially if they have sponsorship deals that include getting paid per world record.

> By 1992, he was no longer bound to the Soviet system, and signed a contract with Nike that rewarded each world record performance with special bonuses of $40,000

He could have just done it a couple of times, really pushing the limit each time, but he most likely spread it out over more attempts instead.

I don't think that's what's happening in the AI ecosystem right now :)


AKA “slicing the bologna”.

If a model was trained in 1837, would it be useful even today? How would models be trained in 2037, when most of the web might be autogenerated on the fly, like in the cgi-bin era?

State of the art models aren't trained the same way as the first models were. High quality datasets are both much more valuable and more useful than simply feeding everything you could possibly crawl into it. Throwing in the kitchen sink and then some is a great way to burn money while also hurting your model accuracy.

I don't follow the hype too closely, but I guess the early models were trained on data that was classified en masse by underpaid third-world workers. Today you could use yesterday's model to classify the data for you and build from that. Heck, you can even create synthetic data with current tech.

The quality of your model will at best match the quality of the data. If you use yesterday's model to label data/create a synthetic dataset, then the new model built on top of it cannot go beyond that. If it can, then it can also do it (and better) with the data that trained yesterday's model.



