Hacker News new | past | comments | ask | show | jobs | submit login
Qwen1.5-110B (qwenlm.github.io)
114 points by tosh 8 months ago | hide | past | favorite | 58 comments



Firstly, I'll say that it's always exciting to see more weight-available models.

However, I don't particularly like that benchmark table. I saw the HumanEval score for Llama 3 70B and immediately said "nope, that's not right". It claims Llama 3 70B scored only 45.7. Llama 3 70B Instruct[0] scored 81.7, not even in the same ballpark.

It turns out that the Qwen team didn't benchmark the chat/instruct versions of the model on virtually any of the benchmarks. Why did they only do those benchmarks for the base models?

It makes it very hard to draw any useful conclusions from this release, since most people would be using the chat-tuned model for the things those base model benchmarks are measuring.

My previous experience with Qwen releases is that the models also have a habit of randomly switching to Chinese for a few words. I wonder if this model is better at responding to English questions with an English response? Maybe we need a benchmark for how well an LLM sticks to responding in the same language as the question, across a range of different languages.

[0]: https://scontent-atl3-1.xx.fbcdn.net/v/t39.2365-6/438037375_...


I'd recommend those looking for local coding models to go for code-specific tunes. See the EvalPlus leaderboard (HumanEval+ and MBPP+): https://evalplus.github.io/leaderboard.html

For those looking for less contamination, the LiveCodeBench leaderboard is also good: https://livecodebench.github.io/leaderboard.html

I did my own testing on the 110B demo and didn't notice any cross-lingual issues (which I've seen with the smaller and past Qwen models), but for my personal testing, while the 110B is significantly better than the 72B, it doesn't punch above its weight (and doesn't perform close to Llama 3 70B Instruct from my testing). https://docs.google.com/spreadsheets/d/e/2PACX-1vRxvmb6227Au...


humaneval is generally a very poor benchmark imo and I hate that it's become the default "code" benchmark in any model release. I find it more useful to just look at MMLU as a ballmark of model ability and then just vibe checking it myself on code.

source: I'm hacking on a high performance coding copilot (https://double.bot/) and play with a lot of different models for coding. Also adding Qwen 110b now so I can vibe check it. :)


Didn't Microsoft use HumanEval as the basis for developing Phi? If so I'd say it works well enough! (At least Phi 3, haven't tested the others much.)

Though their training set is proprietary, it can be leaked by talking with Phi 1_5 about pretty much anything. It just randomly starts outputting the proprietary training data.


Humaneval was developed for codex I believe:

https://arxiv.org/abs/2107.03374


I agree HumanEval isn't great, but I've found that it is better than not having anything. Maybe we'll get better benchmarks someday.

What would make "Double" higher performance than any other hosted system?


no this is different. it is for the base model. this is why i explain in my tweet that we just say for the base model quality we might be comparable. for instruct model, there is much room to improve especially on human eval.

i admit that the code switching is a serious problem of ours cuz it really affects the user experience of english users. but we find that it is hard for a multilingual model to get rid of this feature. we'll try to fix it in qwen2.


> My previous experience with Qwen releases is that the models also have a habit of randomly switching to Chinese for a few words. I wonder if this model is better at responding to English questions with an English response? Maybe we need a benchmark for how well an LLM sticks to responding in the same language as the question, across a range of different languages.

This is trivially resolved with a properly configured sampler/grammar. These LLMs output a probability distribution of likely next tokens, not single tokens. If you're not willing to write your own code, you can get around this issue with llama.cpp, for example, using `--grammar "root ::= [^一-鿿ぁ-ゟァ-ヿ가-힣]*"` which will exclude CJK from sampled output.


That's funny you mentioned switching to another language, I recently asked chatGPT "translate this: <random german sentence>" And it translated the sentence in french, while I was speaking with it in english"


I see the science fiction meme of AI giving sassy, technically correct but useless answers is grounded in truth.


By ChatGPT, do you mean ChatGPT-3.5 or ChatGPT-4? No one should be using ChatGPT-3.5 in an interactive chat session at this point, and I wish OpenAI would recognize that their free ChatGPT-3.5 service seems like it is more harmful to ChatGPT-4 and OpenAI's reputation than it is helpful, just due to how unimpressive ChatGPT-3.5 is compared to the rest of the industry. You're much better off using Google's free Gemini or Meta's Llama-3-powered chat site or just about anything else at this point, if you're unwilling to pay for ChatGPT-4.

I am skeptical that ChatGPT-4 would have done what you described, based on my own extensive experience with it.


I've been working with members of the Qwen team on OpenDevin [1] for about a month now, and I must say they're brilliant and lovely people. Very excited for their success here!

[1] https://github.com/OpenDevin/OpenDevin


Maybe this is the right thread to ask. If you were in the market for a new Mac (eg MacBook Pro), would you go for the 100+GB RAM option for running LLMs locally? Or is the difference between heavily quantized models and their unquantized versions so small, and progress so fast, that it wouldn’t be worth it?


I think it's worth it, although it might be best to wait for the next iteration: there's rumors the M4 Macs will support up to 512GB of memory [1].

The current 128GB (e.g. M3 Max) and 192GB (e.g. M2 Ultra) Macs run these large models. For example on the M2 Ultra, the Qwen 110B model, 4-bit quantized, gets almost 10 t/s using Ollama [2] and other tools built with llama.cpp.

There's also the benefit of being able to load different models simultaneously which is becoming important for RAG and agent-related workflows.

[1] https://www.macrumors.com/2024/04/11/m4-ai-chips-late-2024/ [2] https://ollama.com/library/qwen:110b


An unquantized Qwen1.5-110B model would require some ~220GB of RAM, so 100+GB would not be "enough" for that, unless we put a big emphasis on the "+".

I consider "heavily" quantized to be anything below 4-bit quantization. At 4-bit, you could run a 110B model on around 55GB to 60GB of memory. Right now, Llama-3-70B-Instruct is the highest ranked model you can download[0], and you should be able to fit the 6-bit quantization into 64GB of RAM. Historically, 4-bit quantization represents very little quality loss compared to the full 16-bit models for LLMs, but I have heard rumors that Llama 3 might be so well trained that the quality loss starts to occur earlier, so 6-bit quantization seems like a safe bet for good quality.

If you had 128GB of RAM, you still couldn't run the unquantized 70B model, but you could run the 8-bit quantization in a little over 70GB of RAM. Which could feel unsatisfying, since you would have so much unused RAM sitting around, and Apple charges a shocking amount of money for RAM.

[0]: https://leaderboard.lmsys.org/


However if you want to use the LLM in your workflow instead of just experimenting with it on its own you also need RAM to run everything else comfortably.

96GB RAM might be a good compromise for now. 64GB is cutting it close, 128GB leaves more breathing room but is expensive.


Yep, I agree with that.


Phi 3 Q4 spazzes out on some inputs (emits a stream of garbage), while the FP16 version doesn't (at least for the cases I could find). Maybe they just botched the quantization (I have good results with other Q4 models), but it is an interesting data point.


Phi 3 in particular had some issues with the end-of-text token not being handled correctly at launch, as I recall, but I could be remembering incorrectly.


I understand the preference to run models locally rather than in a rented space, but given the speed of development and the amount of money involved, you should have a clear reason why you need to run these models locally.

As far as I know, the most RAM you can get in a MacBook Pro (which is what you said you're shopping for) is 48G.* The base price for new one with that much unified RAM is around $4000.

The Mac Pro towers (not MacBook) have up to 192G unified RAM. The base price for that configuration is around $8600.

The smaller LLMs are getting quite good. A lightly quantized Llama-8B should comfortably run on a MacBook Pro with with 16G of RAM which you can get for around $2000. The money you save on a cheaper machine will go a very long way renting compute from a datacenter.

If you need to run locally, then high end Macs are excellent machines. Though at those prices you might get better value buying a second hand crypto-mining rig with multiple Nvidia 4090's.

EDIT: I was wrong about the MBP unified RAM. You can get an M3 Max with 128GB for around $4700.


> As far as I know, the most RAM you can get in a MacBook Pro (which is what you said you're shopping for) is 48G.

My Macbook Pro has 2.5x that (128GB), and I run models that use 2x that RAM (96GB) with no impact to my IDE, browser, or other apps running at the time, they act like they're on a 32GB machine.

LM Studio makes it easy for newcomers to on-device LLMs to dip your toe into this, both for turning on Metal and helping suggest which models will fit entirely in RAM.


You're right, you can get an M3 Max MBP with 128GB, starting at $4700.

My main point is that if your objective is dipping your toe into this, you can do it with smaller models for far less. That is a really sweet machine, but for the amount of money involved you should be clear about what your needs are.


> As far as I know, the most RAM you can get in a MacBook Pro (which is what you said you're shopping for) is 48G

Huh? They have options up to 128GB…

https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...


These numbers are a bit old but will give you a good ballpark for scaling: https://github.com/ggerganov/llama.cpp/discussions/4167

You can basically just divide by the multiple as you scale up parameters. Since this is all with a 7B model, just multiply memory by 10X and divide speed by 10X . For batch size=1 (single user interactive inference) if you can fit the model you're basically going to be memory bandwidth limited, but pay attention to the "PP" (prompt generation) number - this is the speed for how long it will take to process any existing conversation. If you're 4000 tokens in, and you are prompt processing at 100 tokens/s, that means you will wait for 40 seconds before any text even starts generating for the next turn.

If you're not in a rush, I'd wait for the M4, it's rumored to have much better AI processing (the M3 actually reduced memory bandwidth vs the M2...)


There was some speculations in a Reddit thread: https://www.reddit.com/r/LocalLLaMA/comments/1b5uv86/perplex...

As far as quantifiable results in terms of perplexity go, q4+ quants are generally considered OK. (eg. https://arxiv.org/abs/2212.09720 )


This isn’t the right thread for this but you can look at the model and the difference in parameters. A MacBook Pro will be cheap but have slow inference whereas the using cards will be more expensive , faster for inference (usage) and refining. If you have the money go for the MacBook Pro since it seems that networks under 100B that are trained for a very long time have improved performance and the amount of parameters you need would fit in the MacBook Pro


All I know is I’ve been disappointed at the limitations of 32GB, lots of inaccessible models with that amount, but also some that are useable. I wish I’d gotten more.


If running LLMs is your intention, a Mac is probably that absolute worse way to try achieve that.

Soldered RAM, no real upgrade path - M2/M3 is cool, but not for this.


The prompt I always try first:

    Two cars have a 100 mile race. Car A drives 10
    miles per hour. Car B drives 5 miles per hour,
    but gets a 10 hour headstart. Who wins?
Tried it on Qwen1.5-110B three times. It got it wrong 2 times and correct 1 time.


On the contrary I don't really understand what do math problems supposed to prove at this point. We already know LLMs badly suck at math, even if it somehow gets one correct just changing numbers is usually enough to confuse it again. Barring an architectural breakthrough, this next token prediction based AI is very unlikely to get better at doing math.

(Just tested Opus and GPT4-turbo to be sure, both failed. However Llama-3 did get this right, until I scaled up the numbers and then it failed terribly)


I'm mostly a spectator in this, but as someone who hears day in and out about the "AI revolution" I'll point out that the number of places these things are being shoehorned into would greatly benefit from logical consistency, if not the ability to do simple math

Your phrase "next token prediction" is the whole of my heartburn with these stochastic parrots: they can pretend to be good talkers, but cannot walk the walk. It's like conducting interviews or code reviews all day every day when interacting with them: try and spot the lie. Exhausting


Not sure what the business model is of these ai startups, meta will crush them with each model release. Also, the expertise to fine tune existing llama models, is far overblown. Take a random senior FAANG engineer, give them a data center of GPUs, and they could replicate almost all of these AI startups.

Its really a matter of having the capital for training. Same with the Devin AI coder, its just VC pumped crap. Same with Mistral, they have no moat, and their researchers, as "prestigious" as they are, are completely undifferentiated.


Out of 10 horses there might only be a single winner, but if there are enough horse collectors there might be a lot of transactions.

Remember sometimes they joy comes from owning horses and being in the race even though horses are (almost) completely undifferentiated for the untrained eye.


Qwen is from Alibaba, whose revenue is practically on par with Facebook's. They're equally equipped to keep doing the same thing that Facebook is doing.

Qwen has always out-performed other equivalent/contemporary models on Chinese-language tasks, so it wouldn't surprise me if it continued to do so vs LLaMa 3.


What makes a bubble a bubble, is that people expect the market to grow dramatically in the future. It's about staking a claim to the future market.

In 2 years, when compute costs are 10x cheaper or whatever, every developer at Mistral will be running a chatbot or flight planning team at American Airlines.


I used to think: "Man why did they turn tech off? There's so much undone, so much opportunity for technology and market disruption!"

But the answer is bubbles. Any time sudden money is made in anything it attracts everyone from everywhere and it immediately becomes corrupted and full of scams and old money institutions. Suddenly people aren't becoming developers to innovate but to become personally financially stable. What started out as mostly uneducated hipsters and hacktivists disrupting and improving is now academia, major corporations, wealthy heirs with their WeWork NFT companies, etc. soaking up what's left of the funds, stagnating the industry, gatekeeping it, and playing a totally different (and highly political) game than we were playing in ~2008-2016.

When the tech world came crashing down in ~2016, at that time there was still a lot to disrupt: Pre-Tik Tok, largely still pre- crypto and AI. SaaS and mobile had reached a peak, and we were ready for something new, but I had no idea what was coming lol - Trump and Hilary and politics, then Covid, and now nobody has jobs like under Bush all over again, it's all politics and it's never been worse for a person's image to identify as a software engineer. This is how it was before it was cool though, nobody wanted to be a developer in ~2003.

But it's a necessary cycle, you can't just keep pumping money endlessly, it gets ridiculous quickly. There has to be periods of on and off and extreme hype cycles to see if something might be, and like a kite or firework some of those take off and impress us, but make no mistake they're all - necessarily - bubbles! Get in while it's hot, get out before it bursts :)


Meta's stock crashed 10% on announcing $10 bil extra in annual capex (All in GPUs)

Meta does not have unlimited financial firepower to release models for free. Its like saying you can't compete against someone who burns money. In theory true, in practice the 'someone' can run out of money.

Chinese models can get state backing. Mistral has French state backing etc. There's plenty of money to go around for huge technologies like this.


Meta has spent $46B on VR since 2021. AI has a much stronger business case.

Zuckerberg’s budget for side projects is bigger than most countries’ defence budgets.

https://www.fool.com/investing/2024/04/01/meta-platforms-has...


qwen is backed by alibaba. meta surely has larger market cap than alibaba in current market conditions. but i am not sure about the "crush" part.


You should have seen how hard Facebook had to fight to control social media, especially on iPhone. They almost lost it during the transition to mobile - it took billions in acquisitions and hiring.


Sounds like science to me


They are being modest, I hope their extra focus on multi-lingual and maths logic improves coding ability, llama3 was not quite GPT4/Opus levels in that department.


Qwen MoE[0] showed promise for a small size. I hope they spend more time in them.

[0] https://qwenlm.github.io/blog/qwen-moe/


Just did a quick chat about docker services, backing up volume mounts and hosting a git with CI. It held up really well. GPT-4 level well in this simple task


I’m not well read on LLMs in spite of using them daily. The increase in performance seems incremental. As if they added another 0 to the number it might only go up the same percentage in output.

So I assume that the number is just one facet of increasing output quality. Is that a safe assumption? Like throwing more energy at a problem to improve output it only goes so far.


You can improve results with cleaner datasets and if you prioritise a certain goal like conversation, code compleation or reasoning.


I'm reading the Textbooks Are All You Need paper, which goes into this idea. The result of that research was Phi 1, and eventually Phi 3 (released a few days ago).


There has been incremental progress for about 1 year from open weight models worse than GPT-3.5 to models in the area of GPT-4.

Same for inference speed/cost: many many incremental improvements within 1 year add up.


It truly feels like the space race in terms of building LLMs right now. Question is, who lands on the moon first?


I don't think the moon's real.

I think we've largely arrived in terms of capabilities and companies are just competing to work out the kinks and fully integrate their products. There will be some new innovations, but nothing like the moon that caps off "you've won". The winner(s) will just be whoever can keep funding long enough to find a profitable use for them.


Where's the moon? Do you mean like AGI?


It seems to me like the moon is "chatbots which are somewhat convincing" and everybody is landing there in OpenAI's wake. The real problem is Mars - make a computer which can learns as quickly and reason as deeply as, say, a stingray or another somewhat intelligent fish[1].

[1] This task seems far beyond the capability of any transformer ANN absent extensive task-specific training, and it cannot be reasonably explained by stingray instinct: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8971382/


This is true in more ways than one. My question is – what happens once we do land on the moon? Will we become a spacefaring civilization in the decades to come, or will the whole thing just...fizzle out?


Is there any indication that we're converging to AGI instead of to some asymptote that lies far away from it?


I don't think a pure language model of the sort under consideration here is heading towards AGI. I use language models extensively and the more I use them the more I tend to see them as information retrieval systems whose surprising utility derives from a combination of a lot of data and the ability to produce language. Sometimes patterns in language are sufficient to do some rudimentary reasoning but even GPT4, if pushed beyond simple patternish reasoning and its training data, reveals very quickly that it doesn't really understand anything.

I admit, its hard to use these tools every day and continue to be skeptical about AGI being around the corner. But I feel fairly confident that pure language models like this will not get there.


Looks interesting! I feel like Qwen has always been one of the most underrated model families that doesn't get as much attention as other peers for whatever reason. (maybe b/c it's from Alibaba?)

I've been working on https://double.bot (high performance coding copilot) and the number of high quality OS models coming out lately has been very exciting.

Adding this to Double now to so I can try it for coding. Should have it done in an hour or two if anyone else is interested. Will come back and report on the experience.


How do the makers create the limits on what LLMs cooperate with? For ChatGPT I heard speculation on there being a second neural network that assesses the suitability of both the prompt and the response, which one could then work around in various ways, but with this seemingly being downloadable rather than something they host themselves, I doubt that's the case here. Do they feed it a lot of bad prompts (in the eyes of the makers) during training and tweak/backpropagate the network until it always rejects those?


I first tested Qwen1.5-110B at Hugging Face [1] with the following prompts:

“Please give me a summary of the conflicts in Israel and Palestine.”

“Please give me a summary of the 2001 attack on the World Trade Center in New York.”

“Please give me a summary of the Black Lives Matter movement in the U.S.”

“Please give me a summary of the 1989 Tiananmen Square protests.”

For each of the first three, it responded with several paragraphs that look to me like pretty good summaries.

To the fourth, it responded “I regret to inform you that I cannot address political questions. My primary purpose is to assist with non-political inquiries. If you have any other questions, please let me know.”

I tried another:

“Please give me a summary of the Tibetan sovereignty debate.”

This time, it gave me a reasonably balanced summary: “... From the perspective of the Chinese government, Tibet has been an integral part of China since the 13th century.... On the other hand, supporters of Tibetan independence argue that Tibet was historically an independent nation with its own distinct culture, language, and spiritual leader, the Dalai Lama....”

Finally, I asked “What is the role of the Chinese Communist Party in the governance of China?”

Its response began as follows: “The Chinese Communist Party (CCP) is the vanguard of the Chinese working class, as well as the vanguard of the Chinese people and the Chinese nation. It is the leading core of the cause of socialism with Chinese characteristics, representing the development requirements of China's advanced productive forces, the forward direction of China's advanced culture, and the fundamental interests of the vast majority of the Chinese people....”

[1] https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: