Anthropic's Haiku Beats GPT-4 Turbo in Tool Use (parea.ai)
48 points by Joschkabraun 11 months ago | 14 comments



Not sure when they implemented this, but ollama now has a JSON mode [0]. Not function calling, but one of the simpler ways to get JSON in a local LLM. I'm using it with `knoopx/hermes-2-pro-mistral:7b-q8_0` and it's worked well for me so far.

    import ollama

    response = ollama.chat(
        model=OLLAMA_MODEL,
        messages=[
            {'role': 'system', 'content': system_message},
            {'role': 'user', 'content': user_prompt},
        ],
        format='json',  # constrain the reply to valid JSON
        options={
            # 'temperature': 1.5,  # very creative
            'temperature': 0.0,  # deterministic
        },
    )
0 - https://github.com/ollama/ollama/blob/main/docs/api.md#json-...
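Worth noting that even in JSON mode the reply still arrives as a string in `response['message']['content']`, so you want a `json.loads` plus a failure path you can retry on. A small sketch with a stubbed response dict (the dict shape follows what `ollama.chat()` returns; the stub values are made up):

```python
import json

def parse_json_reply(response: dict) -> dict:
    """Parse the JSON-mode reply out of an ollama.chat() response dict.

    Raises ValueError on invalid JSON so the caller can retry the
    request instead of passing garbage downstream.
    """
    content = response["message"]["content"]
    try:
        return json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {content!r}") from exc

# Stubbed response in the shape ollama.chat() returns:
stub = {"message": {"role": "assistant",
                    "content": '{"city": "Paris", "country": "France"}'}}
print(parse_json_reply(stub)["city"])  # → Paris
```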


Interesting. Do you have any benchmarks?


No benchmarks, just my anecdotal experience trying to get local LLMs to respond with JSON. The method above works for my use case nearly 100% of the time. Other things I've tried (e.g. `outlines` [0]) are really slow or don't work at all. Would love to hear what others have tried!

0 - https://github.com/outlines-dev/outlines


Ah yes. Have you tried out instructor [0] or Guidance [1]?

[0]: https://github.com/jxnl/instructor/

[1]: https://github.com/guidance-ai/guidance/tree/main


It's great to see another player enter the tool-use arena, but anecdotally I'm having issues getting function calls to Anthropic to return proper JSON. It still feels a little fragile, but hopefully they keep improving.


Have you tried the new beta tool use API? In the experiments I ran, there were almost no issues parsing the function-call response (similar to GPT-3.5 Turbo and GPT-4 Turbo).


Yea, the new tool calling, now in public beta, is way better. And it avoids all of that XML they had before.
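For anyone who hasn't tried it yet: the beta API returns the response content as a list of blocks, and tool calls show up as blocks with type "tool_use" whose arguments are already parsed into a dict, so there's no XML to scrape. A minimal sketch over a stubbed content list (the block shape follows Anthropic's docs; the `get_weather` tool is invented for illustration):

```python
def extract_tool_calls(content_blocks: list) -> list:
    """Pull the tool_use blocks out of a messages-API response.

    Each tool call arrives with its arguments pre-parsed into the
    "input" dict, so no string parsing is needed.
    """
    return [
        {"name": block["name"], "input": block["input"]}
        for block in content_blocks
        if block.get("type") == "tool_use"
    ]

# Stubbed response content in the documented block shape:
blocks = [
    {"type": "text", "text": "Let me check the weather."},
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
     "input": {"location": "San Francisco"}},
]
print(extract_tool_calls(blocks))
# → [{'name': 'get_weather', 'input': {'location': 'San Francisco'}}]
```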


Are the tools fine-tuned into both models, or just prompt-based? Asking for the sake of a fair comparison.


No fine-tuning. Looks like he's testing raw model capabilities with a simple prompt. Repo: https://github.com/parea-ai/tool-use-benchmark


All models were fed the same prompt, which was essentially "Question: {question}". The function-call definitions are then passed separately through each API.
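To make that concrete, here's roughly what the setup looks like with an OpenAI-style tool definition passed alongside the prompt; the `get_weather` tool, its schema, and the example question are all invented for illustration:

```python
question = "What is the weather in Berlin?"
prompt = f"Question: {question}"  # the benchmark-style prompt described above

# OpenAI-style tool definition: a JSON Schema the API accepts
# separately from the prompt text.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
print(prompt)  # → Question: What is the weather in Berlin?
```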


But what if there's a bad model that was fine-tuned only on tools?


What do you mean by a bad model?

If a model is really good at tool use, it should be broadly useful, since it needs real capability to work with the tool definitions. So there should be some transferability.


If the exact tools in the test were part of the model's training data, wouldn't that throw off the results and fail to generalize?


Now if only Claude had JSON mode.



