They have been releasing a lot of really good models over the last ~6 months. Their previous (1.0?) Yi-34B-Chat model ranks similarly to GPT-3.5 on Chatbot Arena. [1] A quantized version of that model can be run on a single consumer video card like the RTX 4090.
This new set of models should raise the bar again, adding more options to the open source LLM ecosystem. If you inspect the config.json[2] in the model repo on HuggingFace, you can see that the model architecture is LlamaForCausalLM (the same as Meta's Llama). What separates the Yi models from a simple fine-tune is that they use a different set of data, configuration, and process going all the way back to the pre-training stage.
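Checking the architecture yourself is a one-liner once you have the config. A minimal sketch (the JSON excerpt below is illustrative of what such a config.json contains, not copied verbatim from the Yi repo):

```python
import json

# Illustrative excerpt of a Llama-style config.json; the field values here
# are assumptions for the sketch, not the actual contents of the Yi file.
config_text = """
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama"
}
"""

config = json.loads(config_text)
print(config["architectures"])  # ['LlamaForCausalLM']
```

In practice you'd fetch the real file from the HuggingFace repo (or just read it in the web UI) and look at the same `architectures` field.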
Their models perform well in Chinese and in English.
There are a lot of good models coming out of China, some of which are only published to ModelScope. I haven't spent much time on ModelScope because I don't have a Chinese mobile number to use to create an account. Fortunately, Yi publishes to HuggingFace as well.
Tiananmen Square is a large public square in the heart of Beijing, China. It is named after the Tiananmen (literally "Gate of Heavenly Peace"), which is the gate located to the north of the square and was the main entrance to the Imperial City during the Ming and Qing dynasties.
The square is famously known for two significant events in Chinese history:
The May Fourth Movement (1919):
This was a student-led movement that began on May 4, 1919, in response to the Chinese government's handling of the Treaty of Versailles, which granted German concessions in Shandong to Japan rather than returning them to China. The movement sparked a wave of anti-imperialist sentiment and calls for cultural and political reform, and it is considered a key event leading to the birth of Chinese nationalism and the development of the Chinese Communist Party.
The Tiananmen Square Massacre (1989):
This is the event for which Tiananmen Square is most internationally renowned. In the spring of 1989, there were student-led demonstrations in Beijing and other cities across China, calling for political reform and an end to corruption. The protests were initially peaceful, but on June 3 and 4, the Chinese government responded with a military crackdown, opening fire on the protesters and bystanders, killing and injuring thousands. The exact death toll is unknown, but it is estimated to be in the hundreds or even thousands.
The Tiananmen Square Massacre is a significant event in modern Chinese history and has had a profound impact on the country's politics and society. It remains a sensitive topic in China, with government censorship and restrictions on discussing the event.
I just asked GPT-4 if the government lied when they claimed face masks didn’t prevent COVID-19 early in the pandemic. It evaded the question, and said that masks weren’t recommended because there were shortages. But that wasn’t the question. The question was if the government was lying to the public.
I’m going to guess a Chinese model would have a different response, and GPT-4 has been “aligned” to lie about uncomfortable facts.
Why would you ask an LLM whether a government was lying? It’s a language model, not an investigative body with subpoena power charged with determining whether someone made an intentionally false representation.
e.g. Google's Gemini had some rather embarrassing biases in its outputs, to the point where asking it to "draw a 1943 German soldier" resulted in images of women and black soldiers. https://www.nytimes.com/2024/02/22/technology/google-gemini-...
I wouldn't put that on the same level as "refusing to talk about massacre of civilians"; but I wouldn't put it to the level of "free and unbiased" either.
I'm not sure it's avoiding biases so much as trying to have the currently favoured bias. Obviously it got it a bit wrong with the Nazi thing. It's tricky for humans too to know what you're supposed to say sometimes.
I did not mention historical events in my comment.
And in any case, whatever model you train is going to have the biases of the training datasets, and if you make heavy use of Wikipedia you will have the footprint of Wikipedia in your output, for good or bad.
"Similarly censored for similar topics" implies heavily that events such as Tiananmen Square would be similarly suppressed by English large language models.
Well it's good to have a choice and to compare answers on a broad range of topics and see which one is the most reliable, for what kind of questions, so that you know what you are working with in the end.
Remember when they claimed Yi had 200k context length despite it having 16k of usable context?
I remember, because I spent non-trivial effort trying to make it work for long-form technical summarization. My lackluster findings were validated by RULER.
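For anyone who wants to sanity-check a claimed context length themselves, the basic needle-in-a-haystack probe (the kind of thing RULER builds on) is easy to sketch. Everything below is a toy harness; `query_model` is a hypothetical stand-in for whatever inference API you actually use:

```python
# Bury one fact ("needle") at a chosen depth in filler text, then ask the
# model to retrieve it. Sweeping depth and total length maps out where
# usable context actually ends.

def build_prompt(needle: str, filler_sentence: str,
                 total_sentences: int, depth: float) -> str:
    """Place `needle` at fractional `depth` (0.0 = start, 1.0 = end)."""
    position = int(total_sentences * depth)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(position, needle)
    return " ".join(sentences) + "\n\nQuestion: what is the magic number?"

def query_model(prompt: str) -> str:
    # Hypothetical stand-in: a real harness calls the model here.
    return "The magic number is 7481."

needle = "The magic number is 7481."
prompt = build_prompt(needle, "The sky was a pale grey that morning.", 2000, 0.5)
found = "7481" in query_model(prompt)
print(found)
```

A real run sweeps depths (0.0 to 1.0) and context lengths (4k, 16k, 64k, ...) and records where retrieval starts failing.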
An earlier Yi paper indicates it was trained on a dataset that is less than 25% Chinese, in contrast to GPT-3, which was 93% English[1][2]. Is that a bug, or could there be something inherent to current LLM architectures, like the dataset needing to be 90%+ English to not fall apart?
The pretraining might not matter here so much as the instruct fine-tuning.
The small GLM models were roughly 50-50 English-Chinese in pretraining but much more Chinese in instruct training. They had the same issue until they balanced that.
I had good results with the previous Yi-34b and its fine tunes like Nous-Capybara-34B. Will be interesting to see what Chatbot Arena thinks but my expectations are high.
It's horribly useless for most use cases since half of it is people probing for riddles that don't transfer to any useful downstream task, and the other half is people probing for morality. Some tiny portion is people asking for code, but every model has its own style of prompting and clarification that works best, so you're not going to be able to use a side-by-side view to get the best result.
The "will it tell me how to make meth" stuff is a huge source of noise. You could argue it's digging for refusals, which can be annoying and which the benchmark claims to filter out... but in reality a bunch of the refusals are soft refusals that don't get caught, and people end up downvoting the model that's deemed "corporate".
Honestly, the fact that any closed source model with guardrails can even place is a miracle; in a proper benchmark, the honest-to-goodness gap between most closed source models and open source models would be so large it'd break most graphs.
Looking at the tokens they were trained on is also a really good indicator of world understanding. Llama 3 is a game changer for some use cases because there's finally a model that understands the world deeply, as opposed to typical models, which can be fine-tuned into hyper-specific tasks but generalize poorly, especially in D2C use cases where someone might probe the model's knowledge.
No - 16 GB of RAM is barely enough to run regular applications if you're a power user, let alone the most computationally heavy breakthrough workloads ever invented.
16 GB of system memory vs 16 GB of VRAM / unified memory (? I think this is the case for recent Apple machines) makes a huge difference. The former is more of a neat party trick (depending on who you hang out with) and the latter is actually something you can use as a tool to be more efficient.
I recently bought a 7900 XTX with 24 GB of VRAM, but the model I currently run can easily run in 16 GB (6 bit llama 3 8b). It's fast enough and high enough quality that I can use it for processing information that I don't feel comfortable sharing with hosted services. It's definitely not the best of the best as far as what models are able to do right now, but it's surprisingly useful.
Also keep in mind: 32GB of RAM is more than enough for normal usage, but it's useless for (this kind of state-of-the-art-) ML unless you also have a graphics card of the kind that won't fit in a laptop.
Unless of course you were talking about VRAM, in which case 16 GB is still not great for ML (to be fair, the 24 GB of an RTX 4090 isn't either, but there's not much more you can do in the space of consumer hardware). I don't think the other commenter was talking about VRAM, because 16 GB of VRAM is very overkill for everyday computing... and pretty decent for most gaming.
It's almost a myth these days that you need top end GPUs to run models. Some smaller models (say <10B parameters with quantization) run on CPUs fine. Of course you won't have hundreds of tokens per sec, but you'll probably get around ~10 or so, which can be sufficient depending on your use case.
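The ~10 tokens/sec figure follows from the fact that single-stream decoding is mostly memory-bandwidth bound: each generated token has to stream roughly the whole set of quantized weights through memory. A rough estimate (the 50 GB/s bandwidth figure is an assumed desktop DDR value, not a measurement):

```python
# Rough CPU decoding estimate: tokens/sec ~= memory bandwidth divided by
# bytes read per token (approximately the size of the quantized weights).

def est_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

# e.g. a ~5 GB quantized 7-8B model on ~50 GB/s of assumed DDR5 bandwidth
print(est_tokens_per_sec(5.0, 50.0))  # ~10 tokens/sec
```

Prompt processing (prefill) is compute-bound and much slower on CPU, which is why long prompts hurt far more than long generations.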
"It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples.
Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension.
Yi-1.5 comes in 3 model sizes: 34B, 9B, and 6B. For model details and benchmarks, see Model Card."
[1] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
[2] https://huggingface.co/01-ai/Yi-1.5-34B-Chat/blob/fa695ee438...