The problem here is where these actions come from. A generic LLM cannot generate correct actions in many (if not most) real-life cases. So it will have to learn, and LLMs aren't good at learning. For example: "I'm tired, play my favorite". The action depends on _who_ is speaking and on what's going on right now. There may be someone sleeping, or watching TV. I'm afraid that an acceptable solution is much more complicated.
I have investigated the use of agents for real support-agent-type work, and the failure rate made it unacceptable for my use case. This is even after giving it very explicit and finely tuned context.
I suspect that if the engineering of LLM solutions made more use of unseen test data, it would become apparent that they really don't have sufficiently reliable "cognitive" ability to do any practical agent-type work.
Do we have to expect _that_ level of understanding from the agent, though? If my wife said that to me, I might have a good chance of queuing up the song she has in mind, but anyone else? No chance. I don't expect tools like this to be able to understand cryptic requests and always come to the right answer. I'm happy if I can request a song or an action, or anything else, in the same way I might ask another human who doesn't know me intimately.
Natural language understanding. Siri doesn't get context at all. You can twist unstructured data or requests however you like and the LLM will deal with it just fine.
"Play my favorite" is just a knowledge problem. If GPT fails there, it's because it doesn't know your favorite, not because it can't parse the request or understand what you need it to do.
You have to speak certain ways to Siri to get it to do things.
Unless specifically hard-coded, Siri will never receive "damn I'm finding it hard to read" as input and decide to turn on the lights. GPT will.
"Siri doesn't get context at all." and yet immediately "GPT fails there, it's because it doesn't know your favorite"
"Knowing your favorite" is the context.
> Unless specifically hard-coded, Siri will never receive "damn I'm finding it hard to read" as input and decide to turn on the lights. GPT will.
Of course it won't. You have to very specifically fine tune it to understand what light conditions are, where you are in the house, and what it is you need to turn on.
> You have to very specifically fine tune it to understand what light conditions are, where you are in the house, and what it is you need to turn on.
Where you are in the house and what needs to turn on, at least, is an API query job, not a fine-tuning job.
As far as whether it can understand the relevance of lighting to the situation, I just asked ChatGPT 3.5 the question 'Acting as an AI home assistant, if you hear me say "I'm finding it hard to read", what actions would you take?' and 'Adjust the lighting' was the second option it gave back (after 'ask for clarification'). I think we're there, honestly, we just don't have the different parts connected yet.
> Where you are in the house and what needs to turn on, at least, is an API query job, not a fine-tuning job.
And that API magically comes from where?
> I just asked ChatGPT 3.5 the question 'Acting as an AI home assistant, if you hear me say "I'm finding it hard to read", what actions would you take?'
So, basically:
- you had to pre-program Chat GPT to act as a home assistant
- you had to provide it with specific context and specific phrasing for it
- it still failed, asked for clarification, and only then responded
And now you have to do this song and dance every time you want to coax GPT into doing what you need (and that's what RestGPT does).
What you're describing as pre-programming is a little misleading if the "pre-programming" doesn't need to change for each specific request: a real product would provide that "pre-programming" for the user.
Prompting for task performance is fine as long as you're not expecting the end user to have to replicate your prompting. Your goal is to change model activations for a given input; the end user is similarly affected regardless of whether you used a prompt or fine-tuned.
-
This task doesn't require fine-tuning though, zero-shot performance is enough:
I generated a mock schema from Home Assistant's API (https://data.home-assistant.io/docs/states/) and explicitly gave the model the option to ask for clarification, but it has no problem translating non-obvious commands into actions without asking for details:
Note those objects mirror Home Assistant's; you could hook that up today without any song and dance. Combine that with RAG and you'd have something that's a lot more useful than Siri and capable of improving performance over time.
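For the curious, a minimal sketch of what that zero-shot setup could look like, assuming the v0.x-style openai Python package. The mock entity IDs, instruction wording, and JSON action format below are invented for illustration; they are not Home Assistant's actual schema:

```python
# Minimal sketch of the zero-shot setup described above (v0.x-style openai package).
# The mock entity IDs, instruction wording, and action format are invented examples.
import json
import openai

MOCK_STATES = [
    {"entity_id": "light.living_room", "state": "off", "attributes": {"brightness": 0}},
    {"entity_id": "light.kitchen", "state": "on", "attributes": {"brightness": 120}},
]

SYSTEM_PROMPT = (
    "You are a home assistant. You receive the current entity states as JSON plus a "
    "user request. Reply with a single JSON object: either "
    '{"action": "call_service", "service": "...", "entity_id": "...", "data": {...}} '
    'or {"action": "ask_clarification", "question": "..."} if the request is too ambiguous.'
)

def handle_request(user_text: str) -> dict:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps({"states": MOCK_STATES, "request": user_text})},
        ],
        temperature=0,  # keep the action output as deterministic as possible
    )
    # Assumes the model follows the JSON-only instruction; a real product would validate this.
    return json.loads(response["choices"][0]["message"]["content"])

# handle_request("I'm finding it hard to read") might return something like
# {"action": "call_service", "service": "light.turn_on",
#  "entity_id": "light.living_room", "data": {"brightness": 255}}
```

The only hand-written part is the system prompt; the states get generated programmatically, which is what makes it "no song and dance" for the end user.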
You had to provide two pages of text and do manual mapping between human-readable names and some weird identifiers to provide the simplest functionality.
Funnily, this functionality is also completely unpredictable.
I ran your prompt and first request, and got "Identify the area with the lowest observed request volume and increase the brightness of the light in that area to improve the lighting." ChatGPT then proceeded to increase brightness in the garage.
---
It's also funny how in the discussion about context the context of the app is forgotten.
I think you fundamentally don't understand the topic if you're talking about two pages of text?
The end user would never type in a word of that: they'd say "[Wake word] play me some music"
A piece of software running on a device would transcribe what it heard, and fire off a request to the LLM with all of that text wrapped around their statement.
For ease of sharing I used the web interface to provide the instruction, but you'd use the API with a prompt, which also dramatically increases determinism.
No one is writing out the state of each light bulb: you trivially query that information programmatically and bundle it with the request.
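As a rough sketch of that glue (Home Assistant really does expose entity states over GET /api/states; the URL and long-lived access token below are placeholders):

```python
# Sketch of the device-side glue: nobody types out light states by hand.
# Home Assistant exposes entity states over its REST API (GET /api/states);
# the URL and access token here are placeholders.
import requests

HASS_URL = "http://homeassistant.local:8123"
HASS_TOKEN = "..."  # long-lived access token

def build_llm_payload(transcribed_text: str) -> dict:
    states = requests.get(
        f"{HASS_URL}/api/states",
        headers={"Authorization": f"Bearer {HASS_TOKEN}"},
        timeout=5,
    ).json()
    # Keep only the lights so the bundled context stays small.
    lights = [s for s in states if s["entity_id"].startswith("light.")]
    return {"request": transcribed_text, "states": lights}
```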
—
In a real product there'd be explicit handling for detecting where the request came from; that's a problem that's already been worked on. But I wanted to demonstrate the main difference vs Siri: zero-shot learning.
The LLM wasn't told what those volumes mean, but it was flexible enough to infer the intent was to provide a form of location, rather than ask.
It's a forced example, so if you want to get caught up on the practicality of using audio to locate people, be my guest, but the point is to show that LLMs are great at "lateral applications" of capability:
You give them a few discrete blocks of functionality and limited information, and unlike Siri they can come up with novel arrangements of those blocks to complete a task they haven't yet seen.
—
Honestly the fact you keep going back to "look at all the text" feels a bit like if I showed you the source code for an email messaging app, and you told me: "No one will ever use email! Who would write all that instead of just writing a letter and mailing it?!"
Indeed, the context is "people using natural language to make requests". No soul on earth would consider/use your phrasing.
I (a human) have no clue what your request is for: "lowest observed request volume"...??? Try "raise the lights where we usually aren't asking you for much" and you might get the same result. As far as I can tell, with the brightness increase in the garage (where, I'd guess, you've made the fewest requests), the AI apparently understood better than you or I what you meant.
That JSON isn't something you'd type, it's something that you can programmatically generate if you have a Home Assistant setup.
With super primitive wake word detection and transcription, the most you get is:
- What the user said
- How loudly each microphone in the house heard it.
If you take a look at the mock object in that transcript, that's what it maps to...
```json
{
  "request": "I'm finding it hard to read",
  "observedRequestVolume": {
    "3eQEg": 30,
    "iA0TN": 60,
    "h1T3y": 59,
    "5Qg1M": 10
  }
}
```
The only part that would be human provided is: "I'm finding it hard to read"
The invented challenge was to see whether, given a suboptimal set of inputs (we didn't tell it where we are), it could figure out what action to take.
It's zero-shot capability that makes LLMs suitable for assistants: traditional assistants can barely handle being told to do something they're capable of in the wrong word order, while this can go from a hastily invented representation of a house and ambiguous commands to rational actions with no prior training on that specific task.
HomeAssistant, or any number of other providers. Do you think this part is somehow difficult?
> you had to pre-program Chat GPT to act as a home assistant
That is what we call "a prompt". It is a well-known technique. I am surprised that this should look strange to you.
> you had to provide it with specific context and specific phrasing for it
That is what we call "a prompt". It is a well-known technique. I am surprised that this should look strange to you.
> it still failed, asked for clarification, and only then responded
You have misunderstood. In its list of actions to take, the first and only response it gave, the first thing it said it would do in context is ask for clarification as to why I was finding it hard to read. That seems entirely reasonable to me. Does it not to you?
> And now you have to do this song and dance every time you want to coax GPT into doing what you need (and that's what RestGPT does).
So what? It's not something the person sat in the dark ever has to care about.
I hate Siri as much as anyone, but ChatGPT has no "common sense" context either.
The sibling comment literally says "I had to provide a long-ish sentence as a context/programming instructions before it could do anything". https://news.ycombinator.com/item?id=37464563
There was insufficient context.
Imagine I tell you "turn on that light, where I'm pointing". You'd do no better. No one here is under the conviction magical prescience is involved. This tooling provides the mechanism for an initial API call to be tied to the event described, in natural language, as "look where I'm pointing".
The first response (to ask for clarification) is precisely what a human agent would do to get context to clarify the coarse-grained request. The second guess, assuming you disabled the (explicit) allowance for clarifying questions, is also a magnificent recognition of implicit, common-sense context. Seems it's even more effective than you at following the true context for this tool's appropriate placement.
Why would we want this at all if it doesn't know you that well? Current voice assistants without AI can already handle songs and actions like that. Seems like it's largely solved.
Exactly, you'd use a section of the context as a permanent scratch pad. There are a few existing techniques (e.g. tree of thoughts) that look a bit like this.
You need some kind of sidekick model to watch the output of the LLM in order to spot sentences where the LLM e.g. "doesn't know about your favourite song". This model would trigger a custom prompt to fill the scratch pad: You are Bob, your favourite song is ___.
The model would act as middleware and inject scratch pad information into each prompt you address to the LLM.
And that model knows about things like "favorite songs"... how exactly? And about 1 million other things required to build the notes in the original comment?
I think the model should only know about spotting some kind of token the LLM has been instructed to emit when it doesn't know something related to your personal taste.
It is technically possible (I insist on technically) for the LLM to be moderated in such a way that it refrains from hallucinating your personal tastes, in the same way it is actually possible for GPT models to give answers such as "As a trained language model, I cannot blablabla" when certain conditions are met.
From here, you have to imagine the scratch pad as a sort of manual file you have the responsibility to feed in order for the LLM to have the necessary context to answer your request. The sidekick model is there only to trigger the condition "something must be added to the scratch pad, due to the appearance of a certain token in the LLM response".
I can imagine building something like this in emacs, where a buffer containing the scratch pad is opened at the sidekick's demand:
you have the responsibility to enter "Bob's favourite song is X", or maybe the sidekick is able to extract elements of the LLM answer to propose "The LLM failed to answer when you, as user Y, asked: What is my favourite song?".
The sidekick can be very simple, as I said; this depends on how well your LLM can be moderated.
GPT-4's answer to "What is my favourite song?":
I don't have access to personal data about someone unless it has been shared with me in the course of our conversation. I am designed to respect user privacy. So, I don't know what your favorite song is unless you tell me. What is it?
I can add instructions to every prompt I perform so that GPT-4 says "Please input X" when it lacks information X about me needed to answer.
Hello, what's my favourite song?
Answer:
Please input your favourite song:
The middleware recognizes the form "Please input X", opens the scratchpad buffer, and adds:
Favourite song:
And you type your favourite song here.
The scratchpad may have some kind of sentence like "Here is a list of things about me you should remember." That helps contextualize content for the LLM.
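A minimal sketch of that middleware loop, assuming a generic `ask_llm` callable standing in for whatever chat API you use; the "Please input X" convention is the token the LLM was instructed to emit, and the scratchpad file name is arbitrary:

```python
# Minimal sketch of the scratch-pad middleware described above. `ask_llm` stands in
# for whatever chat API you use; "Please input X" is the agreed-upon token the LLM
# emits when it lacks a personal fact, and the file name is arbitrary.
import re

SCRATCHPAD_FILE = "scratchpad.txt"
INSTRUCTIONS = (
    "Here is a list of things about me you should remember:\n{scratchpad}\n"
    'If you lack a personal fact needed to answer, reply exactly "Please input X", '
    "where X names the missing fact.\n\n"
)

def ask_with_scratchpad(user_prompt: str, ask_llm) -> str:
    with open(SCRATCHPAD_FILE, "a+") as f:
        f.seek(0)
        scratchpad = f.read()
    answer = ask_llm(INSTRUCTIONS.format(scratchpad=scratchpad) + user_prompt)
    match = re.match(r"Please input (.+)", answer.strip())
    if match:
        # The "sidekick" step: hand control back to the user to fill the scratch pad.
        value = input(f"{match.group(1)}: ")
        with open(SCRATCHPAD_FILE, "a") as f:
            f.write(f"{match.group(1)}: {value}\n")
        return ask_with_scratchpad(user_prompt, ask_llm)
    return answer
```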
I add this just for demonstration purposes; the end of this discussion extract is wild:
- Please input your favourite meal.
- Favourite meal : Sushi, please remember that, along with my favourite song
- Noted. Your favourite meal is sushi and your favourite song is the "Fuga" from BWV 1001 by J.S. Bach. How can I assist you further?
- Emulate a linux terminal from now on :
- Alright. You're now "interacting" with a simulated Linux terminal. Keep in mind that this is only a text-based simulation and does not have the full capabilities of an actual Linux terminal.
```
user@chatgpt:~$ _
```
Please input your commands, and I'll respond in kind.
- cd Documents
```
user@chatgpt:~$ cd Documents
user@chatgpt:~/Documents$ _
```
- ls
```
user@chatgpt:~/Documents$ ls
notes.txt project/ resume.pdf favorite_song_info.txt sushi_recipe.md
```
I understand this, and I guess this might work for some simple scenarios, but this quickly becomes tedious for the user. My favorite songs, foods, TV shows etc. change often. Or may depend on the time of day, mood etc.
You'd need to define multiple tokens, and multiple side models, to figure that out and annotate. If you ask the user for it every time... that won't work :)
In a way, e.g. Spotify already tracks your behavior and almost "knows" your favourite song/album of the month.
It is technically possible for you to track yourself in an automated manner, no?
If Gorilla or RestGPT can contact several APIs to "put on the last Bowie album", then they could also feed a file of your habits and from there infer what your favourite album is.
Previously when you told it your favourite song (directly or indirectly) it would have outputted something like "NB: favourite song is X" and then that would be included in the prompt forever afterwards.
Anthropic has models that can take ~75k words of input, so I don't think this is far-fetched as a way of adding memory to LLMs.
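A toy sketch of that note-taking memory, with `ask_llm` a placeholder for the actual chat call and the "NB:" prefix just the agreed-upon note marker:

```python
# Toy sketch of the "NB:" note-taking memory described above: any "NB: ..." line the
# model emits is kept and prepended to every later prompt. `ask_llm` is a placeholder.
notes: list[str] = []

def chat_with_memory(user_text: str, ask_llm) -> str:
    prompt = "Known facts:\n" + "\n".join(notes) + "\n\nUser: " + user_text
    reply = ask_llm(prompt)
    notes.extend(line for line in reply.splitlines() if line.startswith("NB:"))
    return reply
```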
> Previously when you told it your favourite song (directly or indirectly)
How did I tell it what my favorite song is "indirectly"?
> it would have outputted something like "NB: favourite song is X"
Again, how? Let's say the model is trained to add this to notes when asked directly. What exactly is "indirectly", and how will it recognize that it should add "Alice will be back from the office at 5PM" and "My favorite food is sushi" but not "we order Chinese takeout on the weekends"?
I'm genuinely not seeing a problem there that the Planner part of the paper couldn't cover. "Who said that" and "what's going on right now" are just API calls. Besides which, if one person says "play my favourite" while another person is watching TV, that's not the LLM's job to unpack.
The point is that the ability to call APIs gives them the ability to learn so that the actions that are eventually taken are correct in context. It's like a more generic version of https://code-as-policies.github.io/.
Hopefully it can be solved with the target API: the target API knows who is calling it, and the service has user information. Or this will be translated into "Play the most played playlist", and that action will be enough.
I agree with you in general, though: the more useful AI is, the more data it will need to see. I strongly believe companies like Microsoft, Google or Apple will bring the best experience because they own operating systems. It is going to be very hard for a third party to build a general AI assistant.
> So, it will have to learn, and LLMs aren't good at learning
LLMs are bad at human-like learning, but their zero-shot performance + semantic search more than make up for it.
If you give an LLM access to your Spotify account via an API, it has access to your playlists and to details about each song like `BPM`, `vocality`, even `energy`:
An LLM with no prior explanation of either endpoint can figure out that it should look at your favorites playlist and find which songs in your favorite list are most suitable for a tired person.
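Roughly, a sketch of that flow against the Spotify Web API (its actual feature field names are `tempo`, `energy`, `valence` rather than `BPM`/`vocality`; the token and `ask_llm` helper are placeholders):

```python
# Rough sketch: pull audio features for the user's Liked Songs and let the LLM pick.
# The endpoints are the Spotify Web API's; the token and `ask_llm` are placeholders.
import json
import requests

TOKEN = "..."  # Spotify OAuth token with the user-library-read scope
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def pick_song_for_mood(mood: str, ask_llm) -> str:
    tracks = requests.get(
        "https://api.spotify.com/v1/me/tracks?limit=50", headers=HEADERS
    ).json()["items"]
    ids = ",".join(t["track"]["id"] for t in tracks)
    features = requests.get(
        f"https://api.spotify.com/v1/audio-features?ids={ids}", headers=HEADERS
    ).json()["audio_features"]
    catalog = [
        {"name": t["track"]["name"], "energy": f["energy"],
         "tempo": f["tempo"], "valence": f["valence"]}
        for t, f in zip(tracks, features)
    ]
    prompt = (
        f"The user says: '{mood}'. Given these songs and their audio features, "
        f"pick the single best one and answer with its name only:\n{json.dumps(catalog)}"
    )
    return ask_llm(prompt)
```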
-
But it can go even further and identify its own sorting criteria for different situations with chain of thought:
Rather than blindly selecting the most relaxing songs it understands nuance like:
> Room State: "lights on" and "garage door open" can imply either returning home from work or engaging in some evening activity. The environment is probably not yet set for relaxation completely.
And it genuinely comes up with an intelligently adapted strategy based on the situation.
-
And say it gets your favorite wrong, and you correct it: an LLM with no specialized training can classify your follow up as a correction vs an unrelated command. It can even use chain-of-thought to posit why it may have been wrong.
You can then store all messages it classified as corrections and fetch those using semantic similarity.
That addresses both the customization and determinism issues: You don't need to rely on the zero-shot performance getting it right every time, the model can use the same chain of thought to translate past corrections into future guidance without further training.
For example, if your last correction was from classical music to hard metal when you got back from work, it's able to understand that you prefer higher-energy songs, but still able to understand that doesn't mean it should play hard metal every time in perpetuity.
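A sketch of that correction memory, assuming OpenAI's embedding endpoint for the semantic similarity part and a placeholder `ask_llm` chat call; the classification prompt wording is invented, not any product's implementation:

```python
# Sketch of the correction memory described above: classify a follow-up as a correction,
# store it with an embedding, and retrieve similar past corrections to prepend to future
# requests. Uses the v0.x-style OpenAI embedding API; `ask_llm` is a placeholder.
import numpy as np
import openai

corrections: list[tuple[str, np.ndarray]] = []

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def record_if_correction(message: str, context: str, ask_llm) -> None:
    label = ask_llm(
        f"Context: {context}\nUser follow-up: {message}\n"
        "Is this a CORRECTION of the previous action or an UNRELATED command? Answer in one word."
    )
    if label.strip().upper().startswith("CORRECTION"):
        corrections.append((f"[{context}] {message}", embed(message)))

def relevant_corrections(new_request: str, k: int = 3) -> list[str]:
    # Cosine similarity against stored corrections; the top-k get added to the next prompt.
    q = embed(new_request)
    scored = sorted(
        corrections,
        key=lambda c: float(np.dot(q, c[1]) / (np.linalg.norm(q) * np.linalg.norm(c[1]))),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]
```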
I experimented heavily with things like this when GPT came out; part of me wants to go back to it since I've seen shockingly few projects do what I assumed everyone would do.
LLMs + well thought out memory access can do some incredible things as general assistants right now, but that seemed so obvious I moved on from the idea almost immediately.
In retrospect, there's an interesting irony at play: LLMs make simple products very attractive. But if you embed them in more thoroughly engineered solutions, you can do some incredible things that are far above what they otherwise seem capable of.
Yet a large number of the people most experienced in creating thoroughly engineered solutions view LLMs very cynically because of the simple (and shallow) solutions that are being churned out.
Eventually LLMs may just advance far enough that they bridge the gap in implementation, but I think there's a lot of opportunity left on the table because of that catch-22.
> Yet a large number of the people most experienced in creating thoroughly engineered solutions view LLMs very cynically because of the simple (and shallow) solutions that are being churned out.
Maybe, just maybe, because even simple solutions are invariably an incomplete, brittle, complicated, unpredictable mess that you can't use to build anything complex with?
As eloquently demonstrated by your "simple" solutions
Your reply is not indicative of someone capable of a good faith conversation on the topic, but I'll bite.
I think you don't understand what the hard and easy problems are that underly the solutions I'm talking about.
For example: you repeatedly reply to people talking about the length of the prompts, but end users don't need to write prompts. It's trivial to append instructions around what a user says.
On the other hand, you keep replying to people with "how is that not just Siri" when people describe the LLM demonstrating zero-shot classification for example, but you don't seem to understand how difficult of a problem that has been for ML. Those contrived chat logs you see are demonstrating multiple discrete classifications that would have each cost untold hundreds of thousands of dollars in development of recommender systems to replicate just a few years ago.
—
Most people couldn't even dream of building a Spotify song recommender from first principles that could capture nuance like that chat demonstrated with an army of engineers. The fact is today, right now, that's something someone could hack into a real usable personal recommender in a weekend.
At the end of the day LLMs don't make all problems easier, and they make some problems harder: but the problems they make easier are extremely hard problems. I think if you're not familiar with how hard some of the things they're doing are, then the things they're doing poorly glare out much brighter.
If half that weekend is spent fighting the LLM to output JSON the right way, it sure sounds like LLMs are just dumb hype machines... but that doesn't reflect the sheer impossibility of getting the value they're providing within that same system any other way.
> Your reply is not indicative of someone capable of a good faith conversation on the topic, but I'll bite.
You think so because replies to me have willfully ignored and misunderstood the point of my replies. And have willfully ignored the context (which, as I already said, is funny and ironic).
The whole discussion started with
- "LLMs can't generate actions in real life situations"
- "We can't expect LLMs to do that"
- "How is it more useful than Siri"
- and here's the most important one: "Siri doesn't have context ... GPT fails there, it's because it doesn't <know context>"
So, Siri is bad because it doesn't have context. But somehow, even though GPTs are the same, they are good because... someone somewhere can come up with an imprecise, unpredictable prompt for a rather specific situation that may or may not work for some people... and that's why they are better than Siri and have context.
"Where is this context/input coming from?" - "end users don't need to write prompts. It's trivial to append instructions around what a user says."
This is literally magical thinking. "Someone somewhere will maybe somehow create a proper prompt that maybe will definitely work, and users won't have to do anything". This... is literally Siri. It even asks for clarifications when it can't understand something.
You keep harping on about "zero-shot classification". And you completely ignored what I wrote: I ran your amazing zero-shot classification, and it immediately failed. It raised the brightness in the garage. I guess someone (not the end user) should write another model to correct the first one. And when that one inevitably, and immediately, fails, someone (not the end user) should trivially write corrections for that. It's all turtles all the way down, isn't it? (On a second try it did say that the user is likely in the kitchen or in the bathroom, and increased brightness in the bathroom.)
Thing is: I don't subscribe to this magical thinking. I see innumerable failure modes and "edge cases" (which are not edge cases, but actual every day scenarios) where none of this works.
This is also the reason why we haven't seen any complex product (apart from specialised fine-tuned ones) built with LLMs: they fail very much like Siri does in even the simplest scenarios. No one knows how to provide an actual proper context of a person's life so that it works reliably more than half of the time (and when it seemingly works, a simple MRU would probably work better).
> Most people couldn't even dream of building a Spotify song recommender from first principles that could capture nuance like that chat demonstrated with an army of engineers. The fact is today, right now, that's something someone could hack into a real usable personal recommender in a weekend.
As an engineer who works at Spotify (not in recommendations, but I know the details at least superficially), thank you for a hearty laugh this sentence brought me.
You had a chance to prove my assumption wrong by writing this same exact comment without all the snark.
At the end of the day, if you're just unmoved by the implications of an ML model going from a bag of tokens to a structured, explained chain of thought and a final response on an unknown task, with rewards defined in natural English (!) and intentional ambiguity most humans wouldn't even try to confront... there's not much conversation to be had.
I think the rest of us (including your colleagues) will continue to build on these models, and like most advancements there'll be a vocal crowd insisting the car isn't useful because it can't be fed with grass.
> not in recommendations
You didn't have to say that after complaining that ChatGPT's web interface didn't give both of us the same reply (most people in ML understand how temperature relates to LLM output).
_
By the way, if making your own personal music recommender seems like "magical thinking", maybe you're a little lost on which parts of Spotify's recommender systems are complex due to scale: if Spotify only needed to make song selection work for one person at a time, they'd have a lot more leeway in architecture.
> Gorilla enables LLMs to use tools by invoking APIs. Given a natural language query, Gorilla comes up with the semantically- and syntactically- correct API to invoke. With Gorilla, we are the first to demonstrate how to use LLMs to invoke 1,600+ (and growing) API calls accurately while reducing hallucination. We also release APIBench, the largest collection of APIs, curated and easy to be trained on! Join us, as we try to expand the largest API store and teach LLMs how to write them!
It seems, after 1-2 years, that the true power of LLMs is in DevOps. I got pretty excited when I tried GPT-3 (a completion model), but as time went by and OpenAI shifted to chat models, we lost control over the LLM part and found new meaning in taking whatever model OpenAI made available as a black box and "chaining" it to other tools we already had, like databases, APIs, function calls/tools, etc. I'd say DevOps is exactly where open source is seriously behind; there are decent open-source models, but it costs so much to self-host them, despite the full power and control we have over them (via text-generation-webui and the like).
OpenAI is playing the DevOps game (starting maybe with the introduction of ChatML). The open-source community plays the LLM and benchmarks game. Ironically, the two are converging, meaning that OpenAI's models are getting dumber (not the API) thanks to censorship and RLHF, to the point that open-source models are even better than some OpenAI models in some aspects. On the other hand, open-source models are getting better tooling and DevOps thanks to oobabooga, llama.cpp, etc.
I'm seriously waiting for competitors to challenge NVIDIA's monopoly in this space. Maybe Apple?
I think the M2 Max is currently the best bang for the buck for running inference on open-source models. But the use case is so niche that Apple probably won't actively start supporting open-source models. In the long run I hope some smaller company gets its shit together and starts competing with NVIDIA.
The GPU support in ML frameworks, however, is really not impressive. I have a MacBook with an M1 Max and 64 GB of RAM; I can load a 7B model for fine-tuning (Hugging Face Trainer, PyTorch, MPS), but the speed is just too slow: it only reaches about 50% of the speed of an i5-12500 CPU in my tests.
> I'm seriously waiting for competitors to change nVidia's monopoly in this space. Maybe Apple?
I would have thought AMD is the obvious contender. They are #2 in GPUs, they have formidable programming talent (based on their advances with Ryzen vs Intel), and they have targeted AI as their goal.
AMD have repeatedly dropped the ball when it comes to software support for compute and AI. Their hardware is quite capable, but very few people can actually make it work, which means most of the existing models have poor AMD support.
This is getting better with ROCm and such, but that's Linux-only and only works for a subset of tasks.
Both Intel and Apple have better "out of the box" support for ML and the ability to invest more into making these things work (e.g. Apple have implemented Stable Diffusion against Core ML themselves).
ChatGPT + Noteable is already powerful to get some work done via API calls (after installing and importing the libraries, writing Python code, managing secrets for authentication etc)
There is surely scope to streamline this much further
Separating the language model from the knowledge model (and then splitting that into domain specific knowledge models) is the next low(ish) hanging progression step in the field.
This is interesting; I do something similar with unfetch.com. I have some examples on unfetch.com/directory. There are a lot of potential use cases for LLMs & APIs.
It's not open source, even though that's something I would like to be able to do soon. We offer enterprise plans, which use Llama 2 instead of OpenAI and can run on your own cloud (or even locally if you have enough RAM). If you are interested, send me an email at c@unfetch.com
papers:
1. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs https://arxiv.org/abs/2307.16789
2. Gorilla: Large Language Model Connected with Massive APIs https://arxiv.org/abs/2305.15334