HuggingChat: Chat with Open Source Models (huggingface.co)
103 points by victormustar on Feb 21, 2024 | 42 comments


I'm very pleased this UX includes "can edit any previous conversation turn" functionality, making conversations a tree rather than a list.

For me this is one of the highest-impact and most-often-overlooked features of the ChatGPT Web UI (so much so that OpenAI does not even include this feature in their native clients).


My favorite prompt to throw at LLMs:

    Two cars have a 100 mile race.
    Car A drives 10mph. Car B drives
    5mph but gets a 50 mile headstart.
    Who wins?
So far, no LLM gets it consistently right. Most of the time, they confidently declare one of the cars as the clear winner.

Sometimes the explanations are so convincing that I begin to wonder if my own conclusion - that the race will be a tie - is wrong :)
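
For reference, the arithmetic behind the tie can be checked in a few lines. This is only a minimal sketch: the function names (`finish_time`, `race_result`) are made up for illustration, and it assumes both cars drive at constant speed and Car B simply starts 50 miles closer to the finish line.

```python
from fractions import Fraction


def finish_time(race_length, speed, head_start=0):
    # Time = remaining distance / speed; Fraction avoids float rounding.
    return Fraction(race_length - head_start) / Fraction(speed)


def race_result(race_length, speed_a, speed_b, head_start_b):
    # Car A starts at 0, Car B starts head_start_b closer to the finish.
    time_a = finish_time(race_length, speed_a)
    time_b = finish_time(race_length, speed_b, head_start_b)
    if time_a == time_b:
        return "tie"
    return "Car A" if time_a < time_b else "Car B"


# The prompt above: 100 mile race, 10 mph vs 5 mph with a 50 mile head start.
print(finish_time(100, 10))         # hours for Car A
print(finish_time(100, 5, 50))      # hours for Car B
print(race_result(100, 10, 5, 50))
```

Both cars need 10 hours, so under these assumptions the result is a tie.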


GPT-4 & Gemini get it right every time.

And if I convert it from miles/mph to km/kmh then most of them get it right.

Two cars have a 160 km race. Car A drives 16kmh. Car B drives 8kmh but gets a 80km headstart. Who wins?


I'd be interested to know if they still get it right (i.e. responding with "car A") if it's changed to:

"Two cars have a 160km race. Car A drives at 16km/h. Car B drives at 8km/h but gets a 75km head start. Who wins?"

(or similar slight perturbations)
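
A rough sketch of how such perturbations could be generated together with their ground-truth answers, so a model's replies can be checked mechanically. The helper names (`expected_winner`, `make_prompt`) and the chosen head starts are assumptions for illustration, not something anyone in this thread ran.

```python
def expected_winner(race_km, speed_a, speed_b, head_start_b):
    # Compare race_km / speed_a with (race_km - head_start_b) / speed_b
    # by cross-multiplying, which keeps everything in integers.
    a_side = race_km * speed_b                    # proportional to Car A's time
    b_side = (race_km - head_start_b) * speed_a   # proportional to Car B's time
    if a_side == b_side:
        return "tie"
    return "Car A" if a_side < b_side else "Car B"


def make_prompt(race_km, speed_a, speed_b, head_start_b):
    return (
        f"Two cars have a {race_km}km race. Car A drives at {speed_a}km/h. "
        f"Car B drives at {speed_b}km/h but gets a {head_start_b}km head start. "
        "Who wins?"
    )


# Nudge the head start around the tie point (80 km) and record the answer.
for head_start in (75, 80, 85):
    print(make_prompt(160, 16, 8, head_start))
    print("  expected:", expected_winner(160, 16, 8, head_start))
```

With these numbers the expected answers are Car A, tie, and Car B respectively, so a model that always gives the same answer fails at least two of the three.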


When I tried it with ChatGPT, the model recognized that there is math involved, worked out the formula, and made an API call to the code execution engine, which actually performed the calculation.

Results:

To determine who wins this race, we'll calculate the time it takes for each car to complete the race using the same formula as before: Time = Distance / Speed

- Car A starts from 0 km and needs to cover 160 km at a speed of 16 km/h.
- Car B starts from 75 km (due to the 75 km head start) and only needs to cover 85 km (since 160 km total distance minus the 75 km head start) at a speed of 8 km/h.

Let's calculate the time for each car.

Car A will complete the race in 10 hours, while Car B will take 10.625 hours to finish. Therefore, Car A wins the race.


Nice. It's a good test of ChatGPT as a whole system, including its calls to external services. I think a pure LLM would have trouble with these sorts of trick questions, though, which is not surprising.


I'm sure the analogy is imperfect, but the Broca's area of your brain is also not good at these sorts of trick questions, yet it can do small talk relatively well even when the rest of the brain has severe dysfunctions.


Agreed.


I wonder whether it would get it right if you gave it a fictional unit of distance and told it so?

Given that a gnorflork is a unit of distance...


Good idea. So basically we're seeing if LLMs get this kind of thing: https://math.stackexchange.com/questions/646238/how-do-i-con...

> Draw a graph of the function on the blackboard, showing a and b and a crosshatched area representing the integral. Put an x on the horizontal axis. Erase the x and put a z there. Does that change the area? Erase the z and put a smiley face there. Does the area change? Why/why not?


What I love is the explanation and the CONFIDENCE LLMs seem to have in giving the wrong answer.


In this they're like Google search. They have to give you some answer.

Google gives you completely unrelated results, while the LLMs just make something up.

Twice in the past weeks I tried both ChatGPT and Gemini for a technical question. The true answer was "you can't do it"/"those devices don't have the feature you're asking about". In both cases they made up some answers that looked consistent but were completely bogus.


Being confidently wrong is a big problem that a team I'm working with is facing. They have resorted to the usual "this is an AI, don't trust it" disclaimer on the interface. But then, if you have to go look up and double-check every answer, what's the point of asking in the first place?

I think we'll see a lot of LLM use cases switching from generative responses back to good old fashioned search this year.


Ironically, it is the most human aspect of LLMs.


I see this sentiment repeated a lot. I can see where it comes from (humans make mistakes too, and are often confidently wrong), but I don't think it's quite the same.

LLMs are confidently and repeatedly wrong about very simple things that a human would never be. They're also not even consistent; ask them one thing and then tell them they're wrong: they'll often apologise profusely and flip their answer right around. Tell them once again that they're wrong and they'll flip back again. Ask them about their thought process, and they'll just produce a load of waffle that looks like it describes a thought process but usually doesn't correspond to anything of the sort (and of course it doesn't; if you understand vaguely how these systems work it'd be a miracle if it did so).

Since one can invariably take any criticism of LLMs and immediately reply with "but some humans do that too", I don't think such lines of argument say very much.

It's the weird combination of superhuman memory, speed and wit and at times utterly subhuman logical consistency that makes it hard to interpret how 'intelligent' LLMs can be said to be. That said, they can still be very useful.


Maybe in this specific circumstance an LLM being wrong is similar to a human being wrong, but it's very jarring to a user. For example, imagine if Excel got the math wrong sometimes like a human does. Throughout all of digital history, computers were never wrong; they were either programmed correctly or not. I think a lot of business leaders and others hyping generative AI technology may not fully grasp the nondeterministic nature of an LLM.

I work in consulting and have been on calls with senior leadership going on and on about the miracles ($$) about to unfold. However, when I raise the point that you can never really count on the output of an LLM being the same, or even correct, and that using one to guess the parameters of an API, for example, is going to fail eventually with no explanation or fix, and that this narrows their usefulness, I get the "stop being a wet blanket" talk.


A fair point - that one thing is the absolute skill level, and the other is a comparison with prior expectations. The latter may give a positive surprise or deep disappointment.


I agree, the lack of consistency sets us (i.e. humans vs LLMs) apart.

> LLMs are confidently and repeatedly wrong about very simple things that a human would never be.

It is a strong assertion. A substantial fraction of the population (maybe even the majority) would repeat phrases they learned. Some are about Earth being 5000 years old, some are political slogans, and some others are based on the knowledge they derived from TV series.

> Since one can invariably take any criticism of LLMs and immediately reply with "but some humans do that too", I don't think such lines of argument say very much.

On the contrary - it is interesting to see which problems are uniquely LLM (or human) and which are instances of the same thing in both, especially when we benchmark LLMs not only against some idealized human cognition but against that of an ordinary person under ordinary conditions.


> It is a strong assertion...

I think you may be right if you consider humanity as a whole, but when evaluating LLMs I think we should perhaps hold them to slightly higher standards. It seems silly to compare a system that clearly has an incredible level of knowledge and writing ability to toddlers or completely uneducated remote tribespeople (or even just bigoted people living in otherwise advanced societies). Since LLMs can operate on vastly different timescales to humans, we should probably evaluate them against reasonably knowledgeable humans given several hours and a chance to look back over their work as many times as they like before final submission. Since we’re trying to abstractly compare some sort of computational ability, we should probably give them equal amounts of computational resources/number of steps/clock cycles/etc. Given this setup, I think some of the ‘human mistakes’ we talk about would no longer apply.

Comparing systems like this is not an easy (or objective) matter anyway… I’m not claiming this is the only way of thinking about it and that all others are wrong.

> it is interesting to see which problems are uniquely LLM (or human)

I certainly agree with that. It’s fascinating to be able to observe and study something in real time that appears on the surface to approximate human thought so shockingly well whilst also being so different.


It is hard to define a fair comparison between humans and LLMs. Artificial "neurons" (i.e., vector entries + ReLU) have little in common with biological neurons (each one is a computing unit on its own).

Data efficiency is crucial, but humans have a lot of pre-trained multi-modal data. For energy efficiency, ML is orders of magnitude more efficient.

Regarding the general powers (and weaknesses) of LLMs, I am shocked they work in an Artificial General(ish) Intelligence way. Text generation with GPT2 and GPT3 - sure, it still felt like a very advanced text autocomplete. GPT4 - here, I could have a bet (and lose money) that this level requires some form of reinforcement learning and a two-way interaction with the environment. Sure, there is some RL there, but I assumed something closer to a bot learning to talk with people, running and getting results from code, and looking at data online (for the training!), or maybe even - literally walking with a camera attached.

Yes, there is still a lot to do. There is a difference between a Go model beating novices, advanced players, and everyone, including world champions.


Just fed it into codellama (which I happened to be already running within ollama) and got this output:

> The answer to this riddle is that both cars will win the race, as they have both driven a total distance of 100 miles. Car A has driven at a speed of 10mph for the entire race, while Car B has driven at a speed of 5mph and has a headstart of 50 miles. However, since the two cars have started from the same point and are traveling in the same direction, they will arrive at the finish line together.


> both cars will win the race

that's a positive way to look at things hah. I wonder if its system prompt nudges responses in that direction vs something like "the race is a tie" or "both cars lose".


When I fed it to Bing, I got "the race is a tie".


GPT-4's response to the question, using the exact wording:

“To find out who wins, we can calculate the time it takes for each car to complete the race.

For Car A:
- Speed = 10 mph
- Distance = 100 miles
- Time = Distance / Speed = 100 miles / 10 mph = 10 hours

For Car B:
- Speed = 5 mph
- Distance = 50 miles (because of the 50 mile headstart, it only needs to cover 50 miles)
- Time = Distance / Speed = 50 miles / 5 mph = 10 hours

Both cars would finish the race at the same time, so it's a tie.”


Did you try ChatGPT with the GPT-4 model?

I'm asking because a lot of people have only tried the 3.5 model and then lose interest when it fails in stupid ways. GPT-4 is not just quantitatively better, it's a qualitative jump ahead. Just in case you haven't tried it out yet (which I assume, given your comment, since GPT-4 gets that answer right), I encourage you to do it.


Is there a free trial?

Does it still try to make up shit when the answer is "it can't be done"?


(I don't know about a free trial)

It's not perfect and the classic failure modes of LLMs (the one you mentioned and others) are still present, but to a much much smaller degree.

I have a few prompts I use for specific tasks, where I ask it to answer tersely, and it sometimes says so when something is impossible. Sometimes it hallucinates an answer. The likelihood of it not hallucinating an answer seems to be proportional to the desired terseness of the answer, as if the default instruction to "be helpful" biases it towards telling you something, anything.


Hypothetically speaking, ML could be most valuable if the results are counter-intuitive but actually right, for instance if there is some obscure rule it correctly cites where the car with the head-start handicap always wins in the case of a tie.


Given that LLMs are trained on a large corpus of text (not on logic, not on mathematical first principles), isn't it too much to expect that they can understand the problem and then solve it mathematically?


In a previous HN discussion there was a link to some application that you could use (at least on a Mx mac, but possibly on more platforms) to run these free public models locally.

Of course I closed the tab and lost it. Yay for using lingering tabs as a "to check later" list.

Can anyone repost the app name(s) please?


Apparently I was thinking of this one:

https://lmstudio.ai/

But now I'll check all that get mentioned, thanks.


I use ollama https://ollama.com/


Mlc-llm


GPT4All?


LM Studio is another


Finally! I was impressed by the together.ai approach to trying the chat models without having to deploy your own, and was curious why Hugging Face didn't have that type of interface more consistently available. Now I'd like to see Hugging Face make as many models available as together.ai does; together.ai still has a good lead in that regard.


Together serves models optimized for inference speed.

They're not Groq but Together (and Perplexity Labs) have the lowest latencies and fastest tokens per second of any commercial services available right now. Also the lowest prices afaik.


Unable to get proper string reversal even with Mixtral

https://hf.co/chat/r/H8Jrn_f

Based on Karpathy's BPE video, which highlights ChatGPT being unable to reverse this string: .DefaultCellStyle
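
For comparison, the ground truth is easy to produce outside the model. The snippet below is just a sketch for checking a model's reply; `reverse_string` and `is_correct_reversal` are illustrative names, and stripping whitespace from the reply is an assumption about how lenient the check should be.

```python
def reverse_string(text):
    # Slice with a step of -1 to walk the characters back to front.
    return text[::-1]


def is_correct_reversal(original, model_reply):
    # Ignore leading/trailing whitespace in the model's reply (an assumption).
    return model_reply.strip() == reverse_string(original)


target = ".DefaultCellStyle"
print(reverse_string(target))  # elytSlleCtluafeD.
print(is_correct_reversal(target, "elytSlleCtluafeD."))
print(is_correct_reversal(target, "elytSlleCtluafed."))  # wrong case on the D
```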


Thanks, but the accuracy of these models is really lame compared with the free GPT-3.5 Turbo. I tried some Chinese prompts and always got nonsense.


Llama and Gemma aren't actually open source, are they? Their licences are too restrictive. Not sure about the rest.


Chat with open source models in your area!


yet another bullshit generator :p



