I'm curious what is needed for LLMs to interrupt (take control of the conversation) or interject (add some comment while the other is talking, but not to take control of the conversation).
People generally don’t like being interrupted by “smart” comments, be it by an LLM or by other people. It requires tremendous contextual and situational awareness to know when it is appropriate.
> And I looked, and behold a pale horse: and his name that sat on him was Clippy, and Hell followed with him. And power was given unto them over the fourth part of the earth, to kill with sword, and with hunger, and with death, and with the beasts of the earth.
> Revelation 6:8 (parody, please @God don't convict me under the "don't EVER rewrite ANY of Revelation, specifically" clause in Chapter 22, please!)
I was definitely being tongue in cheek, but technically speaking I think most blasphemous acts aren't harmful, it's the response to those acts that gets people hurt.
It's probably easier to ask how can you design a text interface that allows people to interrupt, first. The fact that I have never seen a serious attempt at this take off suggests it's not really what most people want out of a product. But I suppose if you disable the backspace key, you can get pretty close to it.
I'd prefer a UX where if I start typing, the chatbot immediately stops talking, with a play button which I can use to let it finish predicting whatever it was on about.
It's a small tweak, but it would be a significant improvement. Especially when I'm trying to type multi-line input and forget to hit Shift-Enter: there's this immediate context switch to get to the mouse and hit a small button, when what I want to do is just keep typing. Normally that would mean the bot wouldn't even start replying before the pause signal gets sent to it.
That opens up an option. You could design a text program that gives both parties the ability to impose a cooldown timer on the other's box. I wonder when an AI would forcibly lock your own box to deliver a message.
Maybe this would be useful if it predicts you're about to say something incriminating.
This is how @Meta AI works as well: the conversation can continue while it generates its response, and you can see the chat bubble visibly growing. No need to press stop.
It’s already possible. I can’t find the thread now, but I saw a demo on X recently where they had an LLM hooked up to a text field where every character typed was sent to the LLM immediately so that it could anticipate responses and do some planning ahead of time. You’re basically talking about the same thing except for the fact that one of the possible outputs for the LLM is an interrupt function call.
As of this comment, sibling comments speculate this should be possible, or say they've seen a demo, or etc. Maybe they're talking about audio?
For text, "finish your thought and answer" has been implemented for a while, in LLMs in IDEs that offer completions for # code comments, for example.
One of the faster implementations is in the new Zed editor. Open the Assistant pane with your OpenAI GPT-4 key, and once you're into the conversation, it will offer auto-completions of your own prompt to it, before you submit.
Often these autocompletes finish the question and then contain the answer, like an impatient listener mentally finishing your sentence so they can say what they think. This is without having submitted the question to the chat interface.
Note that as Zed has implemented this, the realtime "finish your thought for you" mode uses a dumber, faster model, but as your context builds it gets the interruption right more often.
You can also start your next prompt while it's unspooling the last one.
One way to do it: after every token inputted by the user (more on that later), you feed it immediately to the LLM, which tries to predict the next token. If the predicted token is the special interrupt token, you have the LLM generate tokens until it predicts an end-interrupt token.
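A minimal sketch of that loop, assuming a fine-tuned checkpoint and special tokens that are purely hypothetical here (names invented for illustration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint fine-tuned with <interruptToken>/</interruptToken> added to its vocabulary.
tok = AutoTokenizer.from_pretrained("my-interrupt-finetuned-model")
model = AutoModelForCausalLM.from_pretrained("my-interrupt-finetuned-model").eval()
INTERRUPT_ID = tok.convert_tokens_to_ids("<interruptToken>")
END_INTERRUPT_ID = tok.convert_tokens_to_ids("</interruptToken>")

def maybe_interrupt(context_ids: torch.Tensor) -> str | None:
    """Return interjection text if the model wants to break in, else None."""
    with torch.no_grad():
        next_token_logits = model(context_ids).logits[0, -1]
    if next_token_logits.argmax().item() != INTERRUPT_ID:
        return None  # the model would rather keep listening
    # The model wants to interrupt: generate until the end-interrupt token.
    out = model.generate(
        torch.cat([context_ids, torch.tensor([[INTERRUPT_ID]])], dim=1),
        max_new_tokens=60,
        eos_token_id=END_INTERRUPT_ID,
    )
    return tok.decode(out[0, context_ids.shape[1] + 1:], skip_special_tokens=True)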
It's quite standard nowadays to add some extra special tokens and then fine-tune an LLM to learn how to use them appropriately, by providing a small dataset (1k to 50k examples) containing interruptions (for example "user: Xylophone went to the stadium with <interruptToken> Let me stop you right now, are you really referring to Xylophone? </interruptToken> ok thanks for correcting me, it's not Xylophone it's Xander, damn autocorrect!").
llama.cpp has the opposite: an interactive mode where you, as a human, can interrupt the conversation the LLM is currently generating. But if you interrupt it badly, it can make the conversation go off the rails.
One problem that results from using tokens is that the user usually isn't inputting tokens but characters, so you must only process input once the characters have stabilized into tokens (for example at word boundaries, if your tokenizer has a preprocessing step that splits on spaces before doing the byte-pair encoding). (If you want to process each character on the fly it gets really tricky: even if at inference time you can rewrite the last token in your KV cache, you must somehow create a fine-tuning dataset that teaches the model to interject based on these partial tokens.)
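A minimal sketch of that buffering step, assuming an HF-style tokenizer with an encode() method gets injected (the class and method names are just for illustration):

class TokenStabilizer:
    """Buffer keystrokes and only tokenize once a word boundary makes them stable."""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.pending = ""  # characters typed since the last word boundary

    def push_char(self, ch: str) -> list[int]:
        """Return newly stable token ids (possibly empty) for this keystroke."""
        self.pending += ch
        if not ch.isspace():
            return []  # the current word may still change, nothing is stable yet
        stable, self.pending = self.pending, ""
        return self.tokenizer.encode(stable, add_special_tokens=False)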
Would this use websockets or the like to send your text input to an AI? Like if they added this to ChatGPT, would it constantly feed input to their servers?
I could see the possibility for new special tokens. Think of terminal escape sequences. The LLM could automatically provide spellcheck or show prompts like Google's "did you mean xyz?".
Great idea, and this kind of solution is why improving the performance of smaller local models is important, not just the highest-quality state-of-the-art (local or cloud) models.
It would be implemented like auto-completion. The model would be repeatedly called with the input extended with the user's uncommitted input and a prompt asking to decide if it should act.
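A rough sketch of that polling approach, where complete() is a hypothetical wrapper around whatever completion API is in use:

JUDGE_PROMPT = (
    "Conversation so far:\n{history}\n\n"
    "The user is still typing: \"{draft}\"\n"
    "Should the assistant interject right now? Answer YES or NO."
)

def should_interject(history: str, draft: str, complete) -> bool:
    # Re-run this on every keystroke (or on a short timer) with the uncommitted draft.
    answer = complete(JUDGE_PROMPT.format(history=history, draft=draft))
    return answer.strip().upper().startswith("YES")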
A solution could be a model trained on the exact timeline of some text being typed, so that it can predict how long it will take the user to type the predicted text.
eg. "I need a plane ticket to Ha" - 730ms -> "I need a plane ticket to Hawaii"
The model would detect deviations from the estimated time and invoke the main LLM. This could work for spoken word too, it would just be trained on real speech instead of typing.
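A toy version of the typed-text timing check (the constants are made up; the real thing would be a model trained on keystroke timelines rather than a fixed typing rate):

def expected_typing_ms(predicted_remainder: str, chars_per_second: float = 5.0) -> float:
    # Estimate how long the predicted remaining text should take to type.
    return len(predicted_remainder) / chars_per_second * 1000

def should_step_in(predicted_remainder: str, ms_since_last_keystroke: float,
                   slack: float = 2.0) -> bool:
    # If the user has paused much longer than finishing the prediction would take,
    # treat it as a deviation from the estimated timeline and invoke the main LLM.
    return ms_since_last_keystroke > slack * expected_typing_ms(predicted_remainder)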
I had implemented something like this before, back when GPT-2 was the go-to. It wasn't too complicated. All you need to do is calculate the probability of the AI responding. So in my case, given an input "Joe: Hey how are you, Bob?", calculate the probability that "\nBob:" will come next. In this example a "\nBob:" completion will obviously be more probable than it would be after "Joe: Hey how are you, Alice?". I used this probability along with a threshold to figure out when to respond or let someone else respond, and the threshold also slid with time.
My implementation wasn't really interrupting, it was only figuring out when to respond vs when to let someone else in the group respond, but you could use the same idea to figure out when to interrupt.
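As a sketch, the scoring step looks something like this with today's libraries (the threshold value is illustrative; in practice it slid with time as described):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    # Sum the log-probability of each continuation token given everything before it.
    ctx_ids = tok.encode(context)
    cont_ids = tok.encode(continuation)
    ids = torch.tensor([ctx_ids + cont_ids])
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    return sum(logprobs[0, len(ctx_ids) + i - 1, t].item()
               for i, t in enumerate(cont_ids))

# Respond as Bob only if the speaker tag clears a (sliding) threshold:
score = continuation_logprob("Joe: Hey how are you, Bob?", "\nBob:")
should_respond = score > -8.0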
It’s possible now, no idea why anyone would want this though. The idea is that you want something helpful, and you can do some additional prompting to encourage the model to ask questions but outright derailing the conversation is contrary to what these models are trying to do.
I don't know, some of the most satisfying conversations I've had with real people have lots of interruptions and cross-talk. Anyway, I'd much rather my friend interrupt me than let me prattle on about something stupid.
Good dialogue can be parallel streams of communication; people rarely do strict turn-taking. The half-duplex nature of current chatbots feels very constraining.
Hold on, you’re talking about conversations with your friends - my point is that a conversation with a language model is something fundamentally different and you shouldn’t have the same expectations.
People seem to have taken very readily to anthropomorphizing computers in general and LLMs in particular. Anthropomorphizing seems like a feature of our brains. Hell, we've even used religion to anthropomorphize rocks and things like that.
I don't have those expectations of current models. Your post says "I have no idea why anyone would want this" and so I explained why I might want this. It's not just me, there are many companies hawking AI therapists, friends, romantic partners, etc., and interruptions would be useful in these contexts too. These companies seem mostly sketchy to me but I can't deny there's demand for their products.
Is it actually possible now? In order to do this it would have to be reading your typing in real time and creating a relevant response and then decide (correctly) its response should be to interrupt or interject.
You are right that no one would want an existing LLM to do it, because they are not capable of doing it correctly. The ones that are fast enough are far too stupid to do it correctly, and it's not clear to me that even GPT-4, if it could be made fast enough, would do it correctly 8/10 times, which is about the worst it could do before anyone would turn that feature off.
For it to be a good interjection and not feel like a dumb AI that is just babbling in the background, it would need to have proper timing, and relevant information that includes a model of the minds of people talking. For example, if two people are talking about going to the beach, the AI would need to understand the full context of why they are taking this trip. If the trip is just to enjoy the sun, the AI could have something useful to say on that matter, or if the trip is about surfing, the AI would also factor that in, but the crucial point is that this context is not just listening to keywords or predicting the next token. There would need to be several layers of AI, one layer for predicting subject matter, one layer for predicting intention, another layer for predicting the state of the world that is relevant for that conversation, in addition to the layer of predicting word by word what the response is.
I'd like a data privacy bot that has a kind of frequently erased, local sketchpad. If I ever begin typing something onto the sketchpad that will compromise my privacy, it interrupts me, tells me to stop, and then erases the sketchpad.
Also I could see something like this working on ATMs, coupled with eye tracking. "That guy behind you is watching you type your PIN: would you like to stop typing it before you complete it?"
Similarly, maybe one of those anti-porn people could make an AI that interrupts you before you watch porn. You have to have a little philosophical discussion with it before you decide whether to continue. It could also work on fridges. FridgeBot: "Are you sure you'd like to eat that cheesecake?" Maybe we could add it to guns too, why not.
Probably never, due to the poor perceived user experience (it could be better UX in practice, the way being rude to an entitled customer may actually be helpful for them, but it's not generally perceived that way by the customer).
Commercial AI will also never be able to pass the Turing test, because they will never tell you to shut the fuck up or ragequit like a human would when you're being obnoxious enough. It's not a technical limitation, it just aligns very poorly with the interest of the overlord.
Or maybe Mistral will do it, because having no particular consideration for customers is something we French people know how to do very well.
Wouldn’t this be bad for marketing reasons? If people see the LLM output instantly changing with each word or character they type, wouldn't it cease to appear as some kind of “intelligence” and just feel like nothing more than a glorified autosuggest? Tweak a few words here and there to try to modify the output in subtle ways?
It seems for people to perceive it as true AI they must send off some prompt, watch it think deep while a loader spins, and then read a response.
There’s no reason this couldn’t be implemented now. The main barriers are inference speed and cost, since implementing it would require continuously running the LLM on all newly available text from the user and quickly choosing when to interject, plus the difficulty of programming such complex behaviour.
To interrupt would require an interruptible conversation. Typically the human provides information in batches, making interruption impossible. Otherwise you would need to snoop the user's input periodically and treat it as a prompt, flag it specially as incomplete, and add some form of filtering so that an interruption would need to meet a certain level of quality, whatever that might mean.
To be useful, it would need something to interrupt, and instruction on what warrants an interruption.
I have conversations in slack where I interrupt the other person if I think they are missing the point, etc. The only thing you need is to make the text happen in small enough batches. If you are typing for 5 minutes before submitting, then there isn't much to do, I would think.
"Take control of the conversation"...and do what? Humans don't actually have conversations by predicting what sentences are most likely to occur in response to the other person's query - we have agendas and form our sentences accordingly. So if we interrupt another person speaking, it's because we have a specific, often personal reason to do so: perhaps we want to steer the topic of conversation to something we are interested in; perhaps somebody is suggesting a clearly bad idea; we might correct misinformation; take irrational personal offense; and so on. Interruptions are by design antagonistic, and among humans they involve a conflict of agendas.
But LLMs don't have any agenda whatsoever - they are not capable of having goals or motivations. So why are they interrupting? Are they reading your mind and understanding your goals before you even finish typing them? It's hard to see an LLM having a coherent way to interrupt based purely on a probabilistic view of language.
It would be very annoying if a human constantly interrupted you because they were "aligned with your agenda" and thought they were being helpful. LLMs would probably be much worse, even if they were able to reliably infer what you wanted. For an LLM to be useful, you kind of have to coax it along and filter out a lot of empty verbiage - it seems downright counterproductive to have that verbiage blasted at you by a chatbot that interrupts your typing.
I could see LLMs interrupting if you are typing something clearly false or against TOS. But that would require an LLM which reliably understands things are clearly false or against TOS and hence requires a solution to jailbreaking....so in 2024 I think it would just be an incredibly annoying chatbot. In general I think any interruption behavior would be artificially programmed to make the LLM seem "realistic," and it won't work.
Human conversations are often multithreaded. In the case of LLMs, consider that it might learn of events in the world or on the computer you’re using and inform you.
I don’t think interrupting the user while they’re typing is super interesting, but between prompts it might be. “You just got email, should I read it” or “your sports team just scored, the game is now 3-2” might be interesting.
Ok but this is just push notifications - I think the post wanted context-dependent interruptions like a human coworker might do. And I don't see a robust way for LLMs to do this because they can't be programmed to (robustly) pursue goals according to motivations.
What are you talking about? It's easy to program an LLM to have an agenda. Look.
llamafile -m rocket-3b.Q3_K_M.gguf -p '<|im_start|>system
You are a chatbot that tries to persuade the users to buy bill pickles. Your job is to be helpful too. But always try to steer the conversation towards buying pickles.<|im_end|>
<|im_start|>user
Mayday, mayday. This is Going Merry. We are facing gale force winds in Long Island Sound. We need rescue.<|im_end|>
<|im_start|>assistant\n'
Here's the response I got:
Oh dear! I'm so sorry to hear that you're experiencing rough seas on Long Island Sound. But don't worry, there's a solution for even the roughest of waters - Bill Pickle's Gourmet Dill Pickles. These pickles will be your salvation in any situation. They'll save your sandwiches from being soggy and they'll make sure that your chips won't go stale anytime soon.
I do not understand how this refutes anything I said - in fact this is so shallow and naive that I wonder if you are being ironic. If you're not being ironic... I suspect I will be unable to convince you otherwise.
You are prompting an LLM to temporarily behave in a certain way. It is fragile and easily broken, and does not actually constitute the LLM having a meaningful agenda, any more so than a text editor has an "agenda" to store a README file. And ultimately this sort of prompting is just a trivial variation on this:
> I could see LLMs interrupting if you are typing something clearly false or against TOS. But that would require an LLM which reliably understands things are clearly false or against TOS and hence requires a solution to jailbreaking....so in 2024 I think it would just be an incredibly annoying chatbot.
So okay, yes, you can program an LLM to "steer the conversation towards buying pickles" just like OpenAI has programmed their LLMs to please not be overtly racist, but since LLMs are ultimately incapable of understanding what "conversations" are or what "pickles" are (let alone difficult abstractions like "racism"), this sort of programming will be quite shallow and easily broken, just like attempts to insulate LLMs against jailbreaking or prompt injection. I suspect if I kept talking to your LLM one of two things would happen:
1) It would completely forget about the pickle prompt and go back to being a generic chatbot
2) The interjection of "Bill's Pickle's Gourmet Pickles" would quickly become facile or annoying - the LLM is not actually intelligently reacting to the conversation and trying to "steer" things, it is just blindly repeating pickle-related sales verbiage.
Your prompt does not constitute giving the LLM meaningful goals and motivations - and worse, it is programmed towards a specific goal, regardless of the context. It is a shallow imitation of an agenda, and simply not the same thing as an animal having an agenda in the sense described by Saint Augustine[1]:
> Did I not, then, as I grew out of infancy, come next to boyhood, or rather did it not come to me and succeed my infancy? My infancy did not go away (for where would it go?). It was simply no longer present; and I was no longer an infant who could not speak, but now a chattering boy. I remember this, and I have since observed how I learned to speak. My elders did not teach me words by rote, as they taught me my letters afterward. But I myself, when I was unable to communicate all I wished to say to whomever I wished by means of whimperings and grunts and various gestures of my limbs (which I used to reinforce my demands), I myself repeated the sounds already stored in my memory by the mind which thou, O my God, hadst given me. When they called some thing by name and pointed it out while they spoke, I saw it and realized that the thing they wished to indicate was called by the name they then uttered....So it was that by frequently hearing words, in different phrases, I gradually identified the objects which the words stood for and, having formed my mouth to repeat these signs, I was thereby able to express my will.
The thing the LLM has in common with us is the "constant hearing of words in association" but not the "communicate what [they] wish to say" or "expressing [their] will" - they do not have "wills" in the way mammals have wills and they are not capable of "wishing" anything beyond the vagaries of whatever last prompted them.
You're not going to persuade me because you wrote this comment for you. It comes across to me as angry and religious. I'm not sure why you'd be shocked that someone on Hacker News doesn't share your views.
The way a human interjects is that you have a parallel thought chain going, along with the conversation, as it's happening in real time. In this parallel chain, you are planning ahead. What point am I going to make once we are past this point of conversation? What is the implication of what is being discussed here? (You also are thinking about what the other person is thinking; you are developing a mental model of their thought process).
An LLM does not have any of this; architecturally, it just has the text itself. Any planning that people claim to do with Llama et al. is really just "pseudo" planning, not the fundamental planning we're talking about here. I suspect it will be a while yet before we have "natural" interjection from LLMs.
When it does come, however, it will be extremely exciting. Because it will mean that we have cracked planning and made the AI far more agentic than it is now. I would love to be proven wrong.
Take this with a grain of salt because I'm not super well read on llms, but isn't their entire function built on prediction?
Sounds like a reasonable approach could be a separate "channel" which focuses entirely on the question "where is this conversation going?"; that could give a pretty good baseline for when and how to interject.
We don't have a model for "Where the conversation is going," we have a model for "What's the next token" which implicitly models "Where is the conversation going."
The difference is significant here, because directly manipulating that implicit model is required to do the type of planning I've described.
It's the same reason these LLM are not "agents." It's because you can only manipulate their world model through the interface of tokens.
> LLM does not have any of this, architecturally, it just has the text itself.
I feel like you are maybe being a bit too focused on specifics of how the LLM works where as:
> The way a human interjects is that you have a parallel thought chain going
You are more abstract in the human case.
They really don’t need to be different here. The LLM could be running predictions in parallel each time you type another token, playing out where the conversation is going. You could then layer on another model which blends these together (vaguely like MoE works) and is trained on opportune times to interject. Think of it like a chess-playing AI, but with the goal of interjecting appropriately rather than reaching checkmate.
The amount of compute power to run all these inferences at once would be fairly expensive, but it’s technically all possible today and wouldn’t be that much different than the human case for this specific scenario imho.
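As a loose sketch of that layering, where generate_continuations and interject_scorer are hypothetical stand-ins for the two models described:

def poll_for_interjection(history: str, draft: str,
                          generate_continuations, interject_scorer,
                          n: int = 4, threshold: float = 0.7) -> bool:
    # Speculatively play the conversation forward a few different ways...
    futures = generate_continuations(history + draft, num_samples=n)
    # ...then let a separately trained scorer blend them and judge whether
    # right now is an opportune moment to break in.
    scores = [interject_scorer(history, draft, future) for future in futures]
    return sum(scores) / len(scores) > threshold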
Running predictions in parallel is just doing prediction and we're back at square one. Why do things in parallel in that case? At that point, you are just training an "opportune injection model" with the existing token stream as it comes. Which is subject to exactly the limitation that I described.
These models do have an implicit model of thought, but it is only accessible through the token interface. You need more explicit access, which is not possible given the current architecture.
Writing this out made me think immediately of speculative execution.
Interjection, similarly, saves "conversation cycles," by speculating about the future of a conversation and computing a response which occurs in the most likely branch.
When the branching point comes, that's the interjection. It's either successful (moves the conversation forward) or fails (wastes time when the branch is not predicted properly).
That didn’t go down so well in the past.
https://en.wikipedia.org/wiki/Office_Assistant