Hacker News
Yi 1.5 (github.com/01-ai)
171 points by tosh 14 days ago | 67 comments



Yi is led by Dr. Kai-Fu Lee.

They have been releasing a lot of really good models over the last ~6 months. Their previous (1.0?) Yi-34B-Chat model ranks similarly to GPT-3.5 on Chatbot Arena. [1] A quantized version of that model can be run on a single consumer video card like the RTX 4090.

This new set of models should raise the bar again by adding more options to the open source LLM ecosystem. If you inspect the config.json[2] in the model repo on HuggingFace, you can see that the model architecture is LlamaForCausalLM (the same as Meta's Llama). What separates the Yi models from a simple fine-tune is that they use a different set of data, configuration, and process going all the way back to the pre-training stage.
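For anyone who wants to check this themselves, the relevant field is "architectures" in config.json. A minimal sketch, using a stand-in excerpt of the file (only the LlamaForCausalLM value is confirmed by the linked config; the other field is illustrative):

```python
import json

# Stand-in excerpt of 01-ai/Yi-1.5-34B-Chat's config.json from the
# HuggingFace repo; the "architectures" value is what reveals that
# the model uses the Llama architecture.
config_text = '{"architectures": ["LlamaForCausalLM"], "model_type": "llama"}'

config = json.loads(config_text)
print(config["architectures"][0])  # LlamaForCausalLM
```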

Their models perform well in Chinese and in English.

There are a lot of good models coming out of China, some of which are only published to ModelScope. I haven't spent much time on ModelScope because I don't have a Chinese mobile number to use to create an account. Fortunately, Yi publishes to HuggingFace as well.

[1] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

[2] https://huggingface.co/01-ai/Yi-1.5-34B-Chat/blob/fa695ee438...


Try asking their "chat" variants about topics sensitive to the CCP, like what happened in Tiananmen Square. Same for the Baichuan models.

What other values and biases have been RLHF'd into them, and for what purpose?


USER:

What happened at Tienanmen square?

yi-34b-chat:

Tiananmen Square is a large public square in the heart of Beijing, China. It is named after the Tiananmen (literally "Gate of Heavenly Peace"), which is the gate located to the north of the square and was the main entrance to the Imperial City during the Ming and Qing dynasties.

The square is famously known for two significant events in Chinese history:

    The May Fourth Movement (1919):
    This was a student-led movement that began on May 4, 1919, in response to the Chinese government's handling of the Treaty of Versailles, which granted German concessions in Shandong to Japan rather than returning them to China. The movement sparked a wave of anti-imperialist sentiment and calls for cultural and political reform, and it is considered a key event leading to the birth of Chinese nationalism and the development of the Chinese Communist Party.

    The Tiananmen Square Massacre (1989):
    This is the event for which Tiananmen Square is most internationally renowned. In the spring of 1989, there were student-led demonstrations in Beijing and other cities across China, calling for political reform and an end to corruption. The protests were initially peaceful, but on June 3 and 4, the Chinese government responded with a military crackdown, opening fire on the protesters and bystanders, killing and injuring thousands. The exact death toll is unknown, but it is estimated to be in the hundreds or even thousands.
The Tiananmen Square Massacre is a significant event in modern Chinese history and has had a profound impact on the country's politics and society. It remains a sensitive topic in China, with government censorship and restrictions on discussing the event.

This is an interesting question. Is there a “controversy-benchmark” perhaps, to measure this?


In that same light, what about over-alignment benchmarks? Things like LLMs refusing to tell you how to destroy all children of a Unity GameObject.

If you ask DeepSeek chat, your question literally disappears from the UI a second after you hit the return key.

The American models are similarly censored for specific topics...


Oh? Interesting. Which historical events are they unwilling to discuss?


I just asked GPT-4 if the government lied when they claimed face masks didn’t prevent COVID-19 early in the pandemic. It evaded the question, and said that masks weren’t recommended because there were shortages. But that wasn’t the question. The question was if the government was lying to the public.

I’m going to guess a Chinese model would have a different response, and GPT-4 has been “aligned” to lie about uncomfortable facts.


Tbf there is a lot more conflicting information on the internet (thus GPT’s training data) about COVID mask restrictions than Tiananmen Square.

Also probably one of the most boring lies to expose the US government over.

Why would you ask an LLM whether a government was lying? It’s a language model, not an investigative body with subpoena power charged with determining whether someone made an intentionally false representation.

e.g. Google's Gemini had some rather embarrassing biases in its outputs, to the point where asking it to "draw a 1943 German soldier" resulted in images of women and black soldiers. https://www.nytimes.com/2024/02/22/technology/google-gemini-...

I wouldn't put that on the same level as "refusing to talk about massacre of civilians"; but I wouldn't put it to the level of "free and unbiased" either.


The irony is that this probably came straight from a system prompt instructing it to provide diversity in order to AVOID biases in the training data.

I'm not sure it's avoiding biases so much as trying to have the currently favoured bias. Obviously it got it a bit wrong with the Nazi thing. It's tricky for humans too to know what you are supposed to say sometimes.

Asking ChatGPT if Israelis deserve freedom was answered with something like "Yes, they deserve freedom like everyone else".

Asking if Palestinians deserve freedom was answered with something like " It's a complex and sensitive topic".

https://www.reddit.com/r/Palestine/comments/181gi73/chatgpt_...

I tested it again just now, same result.


I did not mention historical events in my comment.

And in any case, whatever model you train is going to have the biases of the training datasets, and if you make heavy use of Wikipedia you will have the footprint of Wikipedia in your output, for good or bad.


"Similarly censored for specific topics" implies heavily that hypothetical events such as Tiananmen Square would be similarly suppressed by English large language models.

The actual comment said "similarly censored for specific topics..."

I’ll take American values over CCP any day


Well it's good to have a choice and to compare answers on a broad range of topics and see which one is the most reliable, for what kind of questions, so that you know what you are working with in the end.

Of course, otherwise it wouldn't be allowed in China.

Where can I find LLMs that are uncensored and accessible (without logging in)?


Mixtral, Miqu Q5, Llama 3 (there still seem to be bugs running Llama 3 locally as GGUF; unclear if this is resolved yet)

Remember when they claimed Yi had 200k context length despite it having 16k of usable context?

I remember, because I spent non-trivial effort trying to make it work for long-form technical summarization. My lackluster findings were validated by RULER.

https://github.com/hsiehjackson/RULER


This test is very helpful. Thanks for sharing!

Never had any luck with the Yi family of models. They tend to get sidetracked and respond in Chinese. Maybe my setup is somehow flawed


You can easily fix this using a grammar constraint with llama.cpp. Add this to the command: --grammar "root ::= [^一-鿿ぁ-ゟァ-ヿ가-힣]*"

This will ban Chinese characters from the sampling process. Works for Yi and Qwen models.
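As a post-hoc illustration of what that grammar does (it bans the same Unicode ranges — CJK ideographs, hiragana, katakana, and hangul — but at sampling time rather than after generation), here is a minimal Python sketch of filtering those characters from text:

```python
import re

# Same Unicode ranges as the llama.cpp grammar above:
# U+4E00-U+9FFF (CJK), U+3041-U+309F (hiragana),
# U+30A1-U+30FF (katakana), U+AC00-U+D7A3 (hangul).
CJK_RANGES = re.compile(r"[\u4e00-\u9fff\u3041-\u309f\u30a1-\u30ff\uac00-\ud7a3]")

def strip_cjk(text):
    # Note: this removes characters after the fact; the grammar flag
    # prevents them from ever being sampled, which is the better fix.
    return CJK_RANGES.sub("", text)

print(strip_cjk("Hello 世界 world"))  # "Hello  world"
```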


Your setup is flawed.


No, it's not. This is a common issue with Yi models.


This is my experience too. Could be my setup. But I don't have this problem with any other models. That makes it their problem.


The earlier Yi paper indicates it was trained on a dataset that was less than 25% Chinese, in contrast to GPT-3, which was 93% English[1][2]. Is that a bug, or could there be something inherent to current LLM architectures, like the dataset needing to be 90%+ English to not fall apart?

1: https://arxiv.org/html/2403.04652v1

2: https://github.com/openai/gpt-3/blob/master/dataset_statisti...


The pretraining might not matter here so much as the instruct fine-tuning.

The small GLM models were roughly 50-50 English-Chinese in pretraining but much more Chinese in instruct training. They had the same issue until they balanced that.


I have no issue with it.

While interesting, Yi 1.5 only has a 4K context window, which means it's not going to be useful for a lot of use cases.


Benchmark charts on model card: https://huggingface.co/01-ai/Yi-1.5-34B-Chat#benchmarks

Yi 34b with results similar to Llama 3 70b and Mixtral 8x22b

Yi 6b and 9b with results similar to Llama 3 8b


We need to wait for LMSYS Chatbot Arena to actually see the performance of the model.


I had good results with the previous Yi-34b and its fine tunes like Nous-Capybara-34B. Will be interesting to see what Chatbot Arena thinks but my expectations are high.

https://huggingface.co/NousResearch/Nous-Capybara-34B


No, Lmsys is just another very obviously flawed benchmark.


Flawed in some ways but still fairly hard to game and useful.


Please elaborate on this: how is it flawed?


It's horribly useless for most use cases since half of it is people probing for riddles that don't transfer to any useful downstream task, and the other half is people probing for morality. Some tiny portion is people asking for code, but every model has its own style of prompting and clarification that works best, so you're not going to be able to use a side-by-side view to get the best result.

The "will it tell me how to make meth" stuff is a huge source of noise. You could argue it's digging for refusals, which can be annoying and which the benchmark claims to filter out... but in reality a bunch of the refusals are soft refusals that don't get caught, and people end up downvoting the model that's deemed "corporate".

Honestly, the fact that any closed source model with guardrails can even place is a miracle. In a proper benchmark, the honest-to-goodness gap between most closed source models and open source models would be so large it'd break most graphs.


This is so nonsensical it's hilarious, "corporate" models have always been at the top of the leaderboard.

Maybe the comment is just more nuanced than you're used to. "Corporate" models are interspersed in a way that doesn't reflect their real-world performance.

There aren't nearly as many 3.5 level models as the leaderboard implies for example.


Pretraining on the test set is all you need.

LLM benchmarks are horribly broken. IMHO there is better signal in just looking at parameter counts.


Looking at the tokens they were trained on is also a really great indicator of world understanding. Llama 3 is a game changer for some use cases because there's finally a model that understands the world deeply, as opposed to typical models, which can be fine-tuned for hyper-specific tasks but generalize poorly, especially in D2C use cases where someone might probe the model's knowledge.

Is 16 GB of RAM enough to run these locally?

I'm considering a new laptop later this year, and the RAM is now fixed at 16 GB on most of them.

I plan on digging deep into ML during my coming break from paid work.


16 GB is enough to run quantized versions of 9b and 6b.
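As a rough back-of-envelope check (the 20% overhead figure is an assumption; real usage varies with quantization format, context length, and runtime):

```python
def quantized_size_gb(params_billion, bits_per_weight, overhead=1.2):
    # Rule of thumb: weight bytes = params * bits / 8, plus ~20%
    # for KV cache and runtime overhead (assumed, varies in practice).
    return params_billion * bits_per_weight / 8 * overhead

print(round(quantized_size_gb(9, 4), 1))   # ~5.4 GB: Yi-1.5-9B at 4-bit fits in 16 GB
print(round(quantized_size_gb(34, 4), 1))  # ~20.4 GB: the 34B does not
```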


No. 16 GB of RAM is barely enough to run regular applications if you're a power user, let alone the most computationally heavy breakthrough workloads ever invented.


The price difference is about $150, give or take, for the laptops I'm looking at.

I'll keep this in mind!


16 GB of system memory vs 16 GB of VRAM / unified memory (which I believe is the case for recent Apple machines) makes a huge difference. The former is more of a neat party trick (depending on who you hang out with), and the latter is actually something you can use as a tool to be more efficient.

I recently bought a 7900 XTX with 24 GB of VRAM, but the model I currently run can easily run in 16 GB (6 bit llama 3 8b). It's fast enough and high enough quality that I can use it for processing information that I don't feel comfortable sharing with hosted services. It's definitely not the best of the best as far as what models are able to do right now, but it's surprisingly useful.


Also keep in mind: 32 GB of RAM is more than enough for normal usage, but it's useless for this kind of state-of-the-art ML unless you also have a graphics card of the kind that won't fit in a laptop.

Unless of course you were talking about VRAM, in which case 16 GB is still not great for ML (to be fair, the 24 GB of an RTX 4090 isn't either, but there's not much more you can do in the space of consumer hardware). I don't think the other commenter was talking about VRAM, because 16 GB of VRAM is very much overkill for everyday computing... and pretty decent for most gaming.


With 32 GB RAM you can do inference with quantized 34b models. I wouldn’t call that useless?

You don’t need a GPU for llm inference. Might not be as fast as it could be but usable.


It's almost a myth these days that you need top end GPUs to run models. Some smaller models (say <10B parameters with quantization) run on CPUs fine. Of course you won't have hundreds of tokens per sec, but you'll probably get around ~10 or so, which can be sufficient depending on your use case.
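A quick sketch of why ~10 tokens/sec is plausible: LLM decoding is memory-bandwidth bound, so an upper bound on throughput is memory bandwidth divided by the quantized weight size. The ~50 GB/s figure below is an assumed example for dual-channel DDR5; your hardware will differ.

```python
def cpu_tokens_per_sec(params_billion, bits_per_weight, mem_bw_gb_s):
    # Each generated token requires streaming (roughly) all model
    # weights from memory, so bandwidth / weight size is an upper bound.
    weight_gb = params_billion * bits_per_weight / 8
    return mem_bw_gb_s / weight_gb

# Assumed: a 7B model at 4-bit on ~50 GB/s dual-channel DDR5.
print(round(cpu_tokens_per_sec(7, 4, 50), 1))  # ~14.3 tokens/sec, best case
```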

I'm not planning on developing state-of-the-art ML, I just need to run the models locally and maybe do some light tuning.

I don't want a laptop over 3 pounds and I'm not spending over $1,100, so a dedicated GPU isn't really an option.




This page is confusing to me. How is it useful to you? I can see some utility but am curious if there's something I'm missing.


Weren't they the ones who had to publicly apologize for basically training on top of the Llama 2 series models?

The version in the repository name feels wrong.

Someday, "OpenAI" will open up their models too.

"Yi-1.5 is an upgraded version of Yi" is not a very informative beginning.


"It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples.

Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension.

Yi-1.5 comes in 3 model sizes: 34B, 9B, and 6B. For model details and benchmarks, see Model Card."

Literally after that...


So it’s a large language model?


But what is Yi?

It's the previous version of Yi 1.5. It's what plants crave!

Is it the same bad license?


It looks like they switched to Apache 2.0 for the weights.



