ChatGPT in an iOS Shortcut: Smart HomeKit Voice Assistant (matemarschalko.medium.com)
339 points by punnerud on Jan 22, 2023 | hide | past | favorite | 111 comments


One of the big eye openers here is not the use of OpenAI, but how dumb and limited Siri seems by comparison (especially its inability to grasp context). Apple, Amazon and (to a lesser extent) Google’s voice assistants never evolved past simple intent and keyword matching, and this hack puts their product teams to shame…


ChatGPT is impressive but it makes a lot of mistakes. Siri can't afford that rate of errors for PR and legal reasons, so they need to use a technology that's less flexible but more reliable / safer. This is similar to self-driving cars: it's relatively easy to come up with a proof-of-concept but making it into a safe mainstream product is a different story.


Siri makes a lot of mistakes.

It keeps mishearing "a lamp" as "alarm", and goes into alarm setting mode even if the rest of the query makes no sense.

When told to turn off specific lights, it sometimes ignores the room qualifier and turns off all the lights in the whole house.

Many other queries end up "I can't find <what you said> in your Music library".

Siri is three regexes in a trench coat.


> It keeps mishearing "a lamp" as "alarm",

Doesn't this Shortcut use the same voice recognition? Doesn't seem like a problem GPT solves.


You've missed the "even if the rest of the query makes no sense." part.


ChatGPT will have the same problem, since it works on the text iOS gave it. It's not clear to me it would perform any better at reformulating your query.


1. I expect it to know it's not possible to increase brightness of an alarm and reject incongruent requests.

2. This particular implementation is limited to operating on an English text in its finalized form, but a different first-class LLM implementation of an AI assistant could work directly on some form of phonetic input. ChatGPT is pretty good at dealing with such ambiguities, e.g. it already understands questions written in IPA notation. It also understands Japanese written in romaji, which is a non-native phonetic spelling of a language with tons of homonyms.


> Siri can't afford that rate of errors for PR and legal reasons

Have we been using the same Siri? The only thing I trust it with is starting a timer. Everything else is literally a coin flip if it’ll actually understand me or mangle my request into something ludicrous.


But one timer and one timer only. Siri cannot start a second timer on your phone in 2023. It's ridiculous.


That’s not Siri, but the Clock app. Siri’s “start a timer for …” just calls out to the Clock app.


I don’t trust it for setting a timer fully. When I say “set a timer for 50 minutes” (laundry) it ends up setting a 15 minute timer >50% of the time


My experience as well - I now just disable it on new phones. I think the only thing worse - or a close second - is the Apple TV interface.


IME Siri is much more limited than Alexa and Google Assistant. Have there been any lawsuits regarding those assistants? Or is Apple just being more conservative for other reasons?


Yeah IIRC they had an emphasis on disaster response and getting fixes pushed out across all 20+ languages.


You're conflating technology with functionality. Surely Siri's core tech can be improved.


Did I say that Siri's core tech cannot be improved?


Yeah, I use a Siri shortcut that lets me ask questions of Google Assistant. "Hey Siri, OK Google". The main downside is that this requires me to unlock my phone before it will proceed beyond this point. I usually am asking via an AirPod, when my phone is in my pocket.

But Apple is getting better, with Siri pulling up a relevant website and reading me the first bit, and offering to go further. I used this yesterday when I was driving and talking to my kid about biology; we looked up "cytoplasm" and various other technical terms, and it gave accurate definitions with sources noted. The only thing it failed to do was to tell us how many kingdoms of life there are (but from looking at Wikipedia it appears this question is not quite cut-and-dried).


I really want to agree, but those are different tasks. Apple and Amazon attempt to focus on monetizable parts of NLP flows, meaning optimization for narrow use cases (play music, order me a book, etc.), while OpenAI can share an impressive chatbot just to see how that plays out and generate extra PR.

Pretty sure Apple and Amazon are capable of improving their voice-assistants, the question is whether they decide to invest into it. OpenAI is relatively young and is "quicker" than large orgs, as it's a startup with top-notch talent.


Oh come on. I haven't tested it lately, but for something like 5 years, if you asked Google to "play Mozart next" it would say it can't find "Mozart next".

Even with the most common use cases it doesn't even try. One coder, one evening, simple pattern matching + a few rules and it could be improved so much. I simply don't understand how they can keep it so bad for so long. There were more examples like this, I don't really have a list since I gave up on it.


> One coder, one evening, simple pattern matching + a few rules and it could be improved so much. I simply don't understand how they can keep it so bad for so long.

Exactly what I've been thinking in the last 10 years. It was ludicrously bad, and not improving even on obvious things. They just sat on it.


I don't know what you're talking about. "play Mozart next" opens Spotify and selects Mozart for me. if anything, it's too eager to play music. Sometimes it thinks turning off the lights is a request to play a song.


Now try “play two Mozart songs, a Bruce Springsteen song, and then four flaming lips songs” or ask it to play Dark Side of the Moon in alphabetical order.


If these were actual requests humans would make, I’m sure it would not be a difficult task to implement such functionality.

Your example requests are at best extreme outliers, and not good tests of smart home assistants.


I think the point is that you shouldn’t have to explicitly implement any of this stuff. It should “understand” basic commands that include sequences and counts.

It seems like ChatGPT could be a giant leap ahead of the current crop of home assistants.


I would be happy to be able to schedule lighting and media properly. Just the basics actually working would be great.


I think that's rather a case of hindsight as we now realise and agree that Alexa, Siri etc won't evolve past this crude monetization scheme.

They began as very much what you describe though, to generate PR and make their underlying ecosystems seem more attractive and advanced.


The question is what are the respective companies trying to get it to do.

Apple appears to be trying to make Siri a well-defined interface to specific apps that offer specific services.

Looking at https://developer.apple.com/documentation/sirikit you can see specific intents and workflows that Siri can hook into. This makes it rather limited, but the things it can do, it does. To that end, Apple isn't trying to monetize Siri - it's trying to be a hands-free interface to specific tasks that apps can do.

Amazon was trying to make Alexa a general tool (and they've reduced those capabilities over the years) running on AWS with additional goals of providing additional monetization routes. Things like "A book by an author you follow was released, would you like to add it to your cart?" Personally, I never found the chat more than a curiosity and even less so now that there is no knowledge engine backing the general knowledge questions.


Not really. Amazon has already started working on "generalizable intelligence" [0] for Alexa inspired by GPT-3 (as they themselves say) and released to production at least one model based on that effort (viz. the Alexa Teacher Model): https://archive.is/gItZq / https://www.amazon.science/blog/scaling-multilingual-virtual...

[0] https://archive.is/UlCpM / https://www.amazon.science/blog/alexas-head-scientist-on-con...


Worth mentioning that the article is from last June, but in December Amazon laid off 10,000 people, most of whom were in the Alexa division.

https://www.forbes.com/sites/qai/2022/12/06/amazon-stock-new...


> in December Amazon laid off 10,000 people, most of which were in the Alexa division

Maybe the new tech worked _really well_.


The article mentions that “API will cost around $0.014 per request.”


I keep trying Siri every now and then, and about the only things work reliably for me are (a) asking through my airpods for it to phone someone, (b) setting a timer and (c) asking what the piece of music I'm hearing is. But the answer to (c) is only on the screen for a short time, which isn't ideal if it's while I'm driving and can't write it down, and if I say "what was the music you just identified" or similar, it has no understanding whatsoever. Every time I try it I come up against something like that which feels really obvious. This ChatGPT version sounds pretty amazing in comparison.


If you download Shazam (owned by Apple) you can see all the past music that Siri has identified.


You can also add the Music Recognition/Shazam tile to Control Centre and long-press on it for the history, with no additional apps, but it's very hidden. I wonder why Apple isn't shipping Shazam or some sort of UI for it by default.


Thank you! That was really bugging me.


Gosh. Can somebody help me understand how an LLM has achieved this capability?

I had thought that an LLM was essentially only doing completions under the hood using the statistical likelihood of next words, wrapped in some lexical sugar/clever prompt mods. But evidently far more is going on here since, in OP's example, some of its output (e.g. future timestamps) will not be present in its training data. Even with several billion parameters that seems impossible to me. (Clearly not!)

Could somebody join the dots for me on the nature of whatever framework the LLM is embedded inside that allows it to achieve these behaviours that emulate logic and allow generation of unique content.


I mean, that's really the mystery of it. One of the most notable advances of recent LLMs has been emergent, one-shot, and highly-contextually-aware behaviors that seem to only manifest in models with extremely large numbers of parameters.

Clearly the latent space of the model is able to encode some sort of reasoning around how time flows, typical information about houses, understanding how to transform and format information into a JSON snippet, etc. That's the "magic" of it all; amazing and powerful emergent behaviors from billions of weights.

Of course, that's also the limitation. They're opaque and incredibly difficult (impossible?) to inspect and reason about.


“seem to only manifest in models with extremely large numbers of parameters”

You can do the same with models trained on your laptop in a few seconds. The trick here is attention: letting the model attend to what it has learned, and the same mechanism also works on images and other types of data.

The benefit of a lot of parameters is more about training in parallel, training faster, and "remembering" more from the data.


Thanks for the pointer.

Any idea of how clever the wrapper around these things is? For example, would OP's use case simply get forwarded to the neural network as one single input (string of words), or is there some clever preprocessing going on?


The only significant bit of preprocessing by the language model is tokenization. You can see how it works here: https://beta.openai.com/tokenizer
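For a rough sense of what that tokenization step does, here is a minimal sketch. Using the open-source tiktoken package is my own assumption; it implements the same byte-pair encoding the linked web tokenizer demonstrates interactively.

    # Minimal tokenization sketch; tiktoken is an assumption on my part,
    # but it uses the same GPT-2/GPT-3 style byte-pair encoding.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode("Turn off the bedroom lamp")
    print(ids)                              # a short list of integer token ids
    print([enc.decode([i]) for i in ids])   # the text fragment behind each id

Roughly one token per short common word; longer or rarer words get split into several tokens.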


Compression is intelligence, see https://en.m.wikipedia.org/wiki/Hutter_Prize

EDIT: I personally would put it that compression is understanding (statistical distribution + correlation), whereas I view intelligence as raw processing power and intellect as understanding + intelligence. Wisdom, I think, is understanding replayed over past events, which allows one to infer causation of things that couldn't be computed in (near) real time.


I think understanding has to be related to generalisation, because we need to clearly separate it from simple memorisation. In math, for example, being able to solve equations not in the training set would show generalisation. But understanding is mostly a receptive process.

On the other hand, intelligence is about acting, and is an emissive process. It will select the next action trying to achieve its goals. Intelligence also needs to generalise from few examples and work in new contexts, otherwise it is not really intelligent, just a custom solution for a custom problem.

Compression alone is not equal to intelligence because compression is looking at past data, so it needs to learn only past patterns, while intelligence is oriented forward and needs to reserve capacity for continual learning.


What if the universe is deterministic and you can compute the entire Hutter Prize text using just a few physics rules? Then compression!=intelligence, I suppose.


Does it depend on language?


> Can somebody help me to understand how a LLM has achieved this capability.

It's worth clarifying what is being accomplished here. iOS is handling speech recognition, and Shortcuts is handling task execution with an unspecified and presumably long user script. What GPT does here is convert text instructions into JSON formatted slot filling[1] responses.

It's somewhat amazing that GPT is emitting valid JSON, but I guess it's seen enough JSON in the training set to understand the grammar, and we shouldn't be too surprised it can learn regular grammars if it can learn multiple human languages. Slot filling is a well studied topic, and with the very limited vocabulary of slots, it doesn't have as many options to go wrong as commercial voice assistants. I would be way more amazed if this were able to generate Shortcuts code directly, but I don't think that's allowed.
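As a rough illustration of that slot-filling idea (this is not the article's actual prompt; the slot names, the prompt wording, and the pre-1.0 openai client call below are my own assumptions), the trick is asking the model to answer only in a fixed JSON shape and parsing that downstream:

    # Hedged slot-filling sketch with GPT-3; slot names and prompt wording
    # are illustrative, not the article's exact setup.
    import json
    import openai  # pre-1.0 openai Python client

    PROMPT = (
        'You control a smart home. Reply ONLY with JSON of the form\n'
        '{"action": ..., "device": ..., "room": ..., "value": ...}\n\n'
        'Request: '
    )

    def to_slots(request: str) -> dict:
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt=PROMPT + request + "\nJSON:",
            max_tokens=100,
            temperature=0,
        )
        return json.loads(resp["choices"][0]["text"])

    # to_slots("dim the bedroom lamp to 30%") might come back as
    # {"action": "set_brightness", "device": "lamp", "room": "bedroom", "value": 30}

The Shortcut (or any handler) then only has to branch on a handful of known action values, which is far easier than parsing free text.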

> some of its output (eg future time stamps) will not be present in its training data. Even with several billion parameters that seems impossible

Maybe this is a feature of attention, which lets each token look back to modify its own prediction, and special token segmentation[2] for dates?

[1]: http://nlpprogress.com/english/intent_detection_slot_filling... [2]: https://github.com/google/sentencepiece


I have also been having ChatGPT respond in JSON and it works incredibly well. Adding information into the middle of existing JSON output is easy as well. I find that ChatGPT allows you to "generate" random (terribly so) and unique IDs for references. It used to work a lot better, and was seemingly altered just before the Dec 15 build was replaced.

With this you can create the structure for replies and have ChatGPT respond in an easy-to-read way. When you walk the Assistant through each method to fill the variables, the result is an ability to have the Assistant then run the same steps on any new topic while accurately re-following the steps.

I have used this method to hold values for things such as image descriptions with "coordinates", behavioural chemistry, and moving "limbs" with limitations baked in using variables. In my image descriptions I would have 20-30 items capable of being interacted with. Each would be assigned a list of viable tasks and each item on the list would go through special instructions.

The interesting part is that the team running the Assistant layer has blocked the bot from doing some of the final steps of my tests (recently). My first test ended at a closed door with an override preventing the bot from using the door. Assistant got through the door anyway, successfully, with 6 different options to choose from, by correctly identifying the word "use" as the problem to focus on.

The largest advantage I can see is the ability to locate invalid information very quickly for targeted retraining.


Have you ever had a seemingly brilliant idea, only to later find out somebody else had it too, and that on actually analyzing the idea it turns out to be a combination of existing ideas?

I would say that a part of our human intelligence is not much more than doing exactly that: learning patterns in different languages. English, emotions, experiences are all interfaces between the world and the self (whatever that is).

When you learn to speak, at first you reproduce simple sounds, you add more sounds to this, more patterns. These patterns and "meta-patterns" are what intelligence is, in my opinion. The creepy part is just that they usually don't appear as patterns and pattern manipulation to us, but rather in a form that is useful to "us" in a way to interact with "the world". "The world", what that means to the individual is also just a useful representation that is accumulated in a similar manner. But what is this "self"? Does it exist at all? Is it the mere accumulating of meta-patterns? With a pair of eyes connected to it and some other useful appendages?


“Language model” just means a probabilistic model of a token sequence. I think what we’re seeing with LLMs is that something that we thought was very hard—approximating the true joint probability function of language in this case—might be possible with much higher compression than we expected. (Sure, 175B parameters is a lot, but the structure of Transformers like GPT is so highly redundant that it still seems simpler than I would have expected.)

We started out with very simple probabilistic models by making very strong limiting assumptions. A naive Bayes model assumes every token is independent; a Markov model assumes that a token only depends on the single token before it. But ultimately these are just models for the underlying “real” joint distribution, and in theory we should expect that the simple models are bad because they’re bad estimates of that true distribution. I think my experience with these poor approximations limits what I expected any language model could do, and the success of LLMs is making me recalibrate my expectations.
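A toy Markov chain makes the "most likely next token" framing concrete (a deliberately crude sketch, nothing like a real transformer):

    # Toy first-order Markov "language model": each word's successor is
    # sampled from the empirical distribution of what followed it in the
    # training text.
    import random
    from collections import defaultdict

    corpus = "turn on the light turn off the light turn on the heating".split()

    successors = defaultdict(list)
    for cur, nxt in zip(corpus, corpus[1:]):
        successors[cur].append(nxt)

    def generate(start: str, n: int = 5) -> str:
        out = [start]
        for _ in range(n):
            options = successors.get(out[-1])
            if not options:
                break
            out.append(random.choice(options))  # sample the next token
        return " ".join(out)

    print(generate("turn"))   # e.g. "turn on the light turn off"

Each word here depends only on the one before it; the surprise with LLMs is how far the same "predict the next token" objective goes once the model can condition on very long contexts.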

But I think it’s important to keep in mind that the model is still just spitting out the most likely next tokens according to its own estimation of the joint distribution. It’s so good because the model of “most likely next token” is very accurate. It’s easy to fall into the trap of looking at how specific the output is and thinking “wow, out of all the possible outputs, how did it know to give this one? It could have said anything and the right thing is a very low probability event, so it must really understand the question in order to answer properly.” But another way to think about it might be “wow, it’s incredible that the model accurately estimates the probability distribution for this kind of token sequence.”

As to your specific example of things like future time stamps, I think I’ve seen folks say in the past that there is some additional input prefix or suffix information in addition to the prompt you send that is passed into the model. So the model is still a static function—it is not learning from experience within an interaction session; it doesn’t experience time or other external contextual factors as part of its execution. The operators update the contextual input to the model as time passes in the world.


Do you think our own brains when they “understand” are also “just” estimating probability well? If so perhaps we’re quite close to discovery of something important.


It’s really hard to say. I very much doubt that brains and LLMs operate on the same principle; but that doesn’t mean we aren’t discovering something important.

A few years ago Google claimed credit for achieving Quantum Supremacy by building a quantum processor that simulated itself…by running itself. If that sounds tautological, then you see the problem.

They fed in a program that described a quantum circuit. When they ran the program, the processor used its physical quantum gates to execute the circuit described by the program and sampled from the resulting distribution of the quantum state. It’s a bit like saying we “simulated” an electrical circuit by physically running the actual circuit. (Their point was that the processor could run many different circuits depending on the input program, so that makes it a “computer”.)

Ignoring the task, that processor did exactly what an LLM does: it sampled from a very complicated probability distribution. Did the processor “understand” the program? Did it “understand” quantum physics or mathematics? Did it “understand” the quantum state of the internal wave function? It definitely produced samples that came from the distribution of the wave function for the program under test. But it’s hard to argue that the processor was “understanding” anything—it was doing exactly the thing that it does.

If we had enough qbits, then in theory we could approximate the distribution of an LLM like GPT. Would we say then that the processor “understands” what it’s saying? I don’t think the Google processor understands the circuit, so at what point does approximating and sampling from a bigger distribution transition into “understanding”?

Like the quantum device, the LLM is just doing exactly the thing that it does: sampling from a probability distribution. In the case of the LLM it’s a distribution where the samples seem to mean something, so it’s tempting to think that the model _intended_ for the output to have that meaning. But in reality the model can’t do anything else.

None of that proves that humans do anything different, but it certainly seems like we (and many other animals) are more complex than that.


The conversation on another thread is currently exploring exactly that question:

https://news.ycombinator.com/item?id=34474043


I found Andrej Karpathy's explanation of transformers to be very insightful:

https://www.youtube.com/watch?v=9uw3F6rndnA


Yes, the training is based on predicting a missing word. At first, such a task allowed GPT2 to generate paragraphs that were surprisingly coherent, at the user's request. ChatGPT added a novel ability of conversation. Regarding timestamps, the training did include many examples of timestamps in JSON format.


I've gone down this road before, setting up a protocol with ChatGPT. You can see some early revisions of it on a gist[0].

It instructed ChatGPT to communicate via a protocol, and for the most part it worked. It also bypassed a lot of the neutering OpenAI has done, as a side effect.

Might be interesting for someone to play with. Since ChatGPT doesn't have API access, this was as far as I decided to take it. Feed the prompts in order, one after the other. If you do something cool with this, let me know!

Just a tip if you want to modify it: Changing the rules or adding commands later on seems not to work very well at all. It starts to break the "fourth wall" and starts issuing responses without using the protocol framing. In my experimentation it worked best when establishing everything at the beginning. Changes to the prompts worked best when starting with new sessions.

Also, don't expect it to remember things well: don't try to tell it to remember the last timer value, for example. It simply can't handle it; I'm not entirely sure why. You'd need to rely on the external "software handler" for storage/retrieval.

[0] https://gist.github.com/Qix-/4f3b3f249192caa140c95ce4c38a232...
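To make the external "software handler" idea concrete, here is a minimal sketch; the JSON command framing below is invented for illustration and is not the protocol from the gist:

    # Tiny "software handler": parse protocol-framed model output and keep
    # state (e.g. the last timer value) outside the model, since the model
    # itself can't be trusted to remember it. The framing is hypothetical.
    import json

    state = {}  # external memory

    def handle(model_output: str) -> str:
        try:
            msg = json.loads(model_output)  # e.g. {"command": "set_timer", "minutes": 50}
        except json.JSONDecodeError:
            return "ERROR: response broke the protocol framing"

        if msg.get("command") == "set_timer":
            state["last_timer_minutes"] = msg["minutes"]
            return "OK: timer set for %s minutes" % msg["minutes"]
        if msg.get("command") == "get_last_timer":
            return "LAST_TIMER: %s" % state.get("last_timer_minutes", "none")
        return "ERROR: unknown command"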


Last week I asked Siri in my car for the name of a song with the lyrics, “hello darkness my old friend.”

I know the song and title. I was just having a brain fart. Siri had zero ability to help me. She just repeated the currently playing song.

ChatGPT would have been perfect, especially since it was a question that I could instantly verify an answer to.

I want my Ship’s Computer in my car.


Using Siri is an exercise in how easily frustrated you can be. Most of the commands are misunderstood, and the rest end up with the reply "Sorry, I can't help you with that while you're in the car", even when you ask the most basic questions.

CarPlay in general feels like such an afterthought, so not that surprising that Siri sucks there too.


>"Sorry, I can't help you with that while you're in the car"

I was on the way to an important meeting that I didn't have enough time to prepare for. I needed to read a very long blog post, so I pulled up the page before I got in the car, and once I was on the highway I asked Siri to "speak screen" and got this asinine response. WTH? This is the primary use case for having Siri read something to me. What idiot thought this was somehow necessary for safety?

That day, Siri was the tipping point from Apple being a company that delighted me to one that infuriated me.


I’ve tried it with the exact prompt from your comment: «what is the song with the lyrics, “hello darkness my old friend.” » and Siri answered correctly with “The Sound of Silence.”


Fair enough! It must have misheard me and then did its best guess with some of the words it did hear.


For me, Siri’s speech recognition is far worse than its NLP


The best eye opener for me was how he programmed / primed ChatGPT using plain English. It's not even pseudocode. It's just human language (structured well).


Why does the title and beginning of the article talk about ChatGPT if he actually used InstructGPT?

As someone working in the space, it's quite strange to observe how ChatGPT blew up, when OpenAI had similar models for quite a while.


The mistitling is likely deliberate SEO to pull in traffic off the back of the popularity of ChatGPT


Where does it say he is using InstructGPT?


ChatGPT plays the role of a generic word here (like Xerox, Kleenex, Google).


Arguably still inappropriate - you wouldn't put Bing in an iOS shortcut and call it Google. We need to be less tolerant of lying to technical people because we fear they are too dumb to understand that software has variants or layers.


Accessibility and presentation


SEO


I actually just did something similar with whisper.cpp, hooked it up to GPT-3's Davinci model (text-davinci-003) via the API, and then piped the answer through Microsoft's text to speech. The mic I'm using is a cheap USB omni-mic designed for conference calls.
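For anyone who wants to try something similar, here is a rough sketch of that kind of pipeline; the whisper.cpp flags, the file paths, and the pyttsx3 stand-in for the Microsoft voices are my assumptions, not the parent's actual setup:

    # Rough pipeline sketch: whisper.cpp for speech-to-text, GPT-3 Davinci
    # for the answer, a local TTS engine to speak it. Paths, flags and the
    # TTS choice are illustrative assumptions.
    import subprocess
    import openai   # pre-1.0 openai client
    import pyttsx3  # stand-in for whatever Microsoft TTS the parent used

    def transcribe(wav_path: str) -> str:
        # whisper.cpp's example binary; flags may differ between versions
        out = subprocess.run(
            ["./main", "-m", "models/ggml-base.en.bin", "-f", wav_path, "--no-timestamps"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

    def answer(question: str) -> str:
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt=question,
            max_tokens=200,
        )
        return resp["choices"][0]["text"].strip()

    def speak(text: str) -> None:
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()

    speak(answer(transcribe("mic_capture.wav")))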


How long has it been running? Have you figured out anything to use it for beyond what other assistants can do yet? I'm very curious as I had the same idea but don't have time to implement...


A few hours... It is constantly listening.

In terms of what other assistants can't do, it can integrate with my desktop computer in a way that other assistants can't easily do. The M$ voices take things to another level. Apple has better quality voices for Siri on the desktop, but they don't let you access them through the 'say' command.



Does the Whisper model support real-time transcription?


whisper.cpp is an open source project that has a feature to turn Whisper into a real-time transcription program. It works surprisingly well by feeding small segments of audio into Whisper. Take a look at their GitHub page!


You're right, and it's also apparently optimised for M1.


Where do you have whisper.cpp running? It would be cool to have that linked to an app or shortcut on your phone.


I have it running on my M1 Mac. I picked up a few Mac minis with frequent flyer points and have started to use them in places where I'd use a Raspberry Pi, and as a gateway for my RAID. I'm tickled by the idea of a modern version of Stephen's personal infrastructure[0].

[0] https://writings.stephenwolfram.com/2019/02/seeking-the-prod...


Read this while sitting on the couch next to my 8 year-old son. He’s gotten into Scratch and Python recently and seemed very impressed when I explained the gist of the article. Can’t help but wonder how different software development will be when he enters the workforce in 10-15 years… and everything else for that matter.


You guys should write an assistant. Use the API to make a gpt request with a prompt he'll enjoy.

I did this yesterday, then I hooked up whisper as well so I could give it voice commands, and then I used gTTS to get a really nice voice for the robot. I'm going to try MozillaTTS as well.

It's all a few hundred lines, fun to write. I had chatGPT try to write most of it.

It's at least as useful as chatGPT. I'm going to work on giving it appendages--ways to execute commands from a limited set.


The only ways to earn money I can conceive of are owning robots and/or real estate.

Going to be a ton of disruption as labor is collectively devalued further by the deflationary forces known as technological advances.


A brutal example of commoditise your complements: you can see an economy as capital, land and labour. Labour employs capital to extract value from land.

The fear here is that capital replaces labour, leading to the value of labour dropping to zero and the value of capital and land increasing alone.

But is labour being replaced, or augmented as it always is by capital? Industrialisation augmented labour more efficiently, but the value of labour has still increased in real terms and has not dropped to zero.

That said, I would invest in capital and land now and not in labour.


If it helps, a blacksmith who made their daily living producing nails could not conceive of ways to earn money post-industrialisation other than owning a mine or a factory either.

But it turns out there are many unimaginable but better ways humans can contribute to our collective success than by banging hot metal into the same shape again and again in a post industrial economy.


It's not looking very bright, but it's still better than getting into something like graphic design, which will be devalued even quicker.


Very cool and impressive.

Yesterday, I created a subreddit to discuss and curate these types of AI use cases; if anyone else is interested please join:

https://www.reddit.com/r/AiAppDev/


I feel like the logical next step is to plug something like this into something a bit more powerful than Siri Shortcuts. I expect you could easily hook this up to the Home Assistant API or similar and have it work properly programmatically.


Someone already tried it with homeassistant.io (https://mikegrant.org.uk/2022/12/22/gpt3-and-homeassistant.h...). I predict that there will be a Home Assistant integration module for OpenAI soon.


Called it two months ago!

https://news.ycombinator.com/item?id=33892842

I can't find the actual query I tried, but I gave it a TypeScript API and told it to generate a script to run in response to the query. That's much more powerful than JSON because it can pass data around from one thing to another, e.g. "send a message to my girlfriend telling her my ETA". You just give it the Maps and Messages APIs and it figures it out.

Edit: Found the query I tried back then!

https://imgur.com/a/yfEJYKf


I had the same idea as well. I run home-assistant for my home, and the dumbness of Google Assistant is infuriating. You have to manually add keywords and phrases to make it understand anything. So I was experimenting with having ChatGPT generate JSON instructions based on a prompt similar to this (though the prompt in the blog is way better than mine).

The real power of these models is not the information that they can fuzzily regurgitate, but how they can be given instructions on how to manipulate unstructured data.


Truly brilliant combination.

Not the author's fault, but this implementation reveals one of Apple Shortcuts' worst traits: after the trigger phrase "Ok, Smarthome," one must wait for Shortcuts to link the function and ask for the next input. There's no way to say "Ok, Smarthome, what's the temperature in the oven?" One must wait for the "Yes?" in this example - even though it's just one body of text, the Shortcut parser can't delineate anything after the trigger command. Frustrating.


A decade of Siri research & development... overshadowed by a single clever GPT-3 prompt. Love it.


Tbh there's a lot of research that went into LLMs


Haha, it's funny and interesting how important it seems to be _when_ you submit something on HN. I submitted the exact same link with the same title a day ago [1], but got no comments and 1 vote :) This post, a couple of hours later, got a lot of comments and traction.

[1] https://news.ycombinator.com/item?id=34460202


I've been wanting something like this ever since GPT-2 was a thing. Language models seem like a pretty natural fit for voice assistants. It's just a question of actually building one, and of being comfortable enough with all the new failure modes that enables to actually release the product despite that.

Hopefully the existence of ChatGPT has popularized the idea enough that we'll start to see companies actually implementing this in short order. Or at the very least, open source solutions taking advantage of LLM APIs outcompeting proprietary assistants from companies too terrified of the idea of Alexa or Siri being caught on camera saying something racist to implement a system where there's even a remote possibility of that happening.


ChatGPT stands out in its ability to manage meandering conversational language and find the interwoven commands. The pearl is that the user doesn't need to accommodate the AI's limitations.

The core issue with current AI assistants is each has an invisible wall of limitations that the user can only discover via experimentation. This ultimately results in the system training the user on how to speak to it, and what commands it can accept. Since the system doesn't have any way of notifying the user of growing functionality, most users possess a dated idea of the capabilities of their AI assistant.

I don't feel it's warranted to use ChatGPT as evidence that AI assistants are bad or that people at FAANG companies should be ashamed. AI assistants operate under far greater constraints, such as the need to reply quickly and run computationally light. ChatGPT can't provide that at the moment, and making the conversational interaction smarter won't make the whole AI system smarter. E.g. if I say "remind me to get bananas", ChatGPT won't pop up and alert me when I pass a greengrocer or supermarket.

Instead one should direct their criticism more fairly: such as why don't AI assistants provide a better means of demonstrating their growing functionality, or demonstrate the flexibility of language that they do understand. I think most people would be pleasantly pleased by what an AI assistant could do if they just knew what they could ask it.

Side note: Siri does understand basic context. If you say "it's dark in here", it will turn on the lights. For those who put a HomePod in each room, Siri is smart enough to turn on a light in just that room. It will also handle longer commands that tie into system functionality: "remind me to get my prescription when I'm near CVS" will build a reminder that's tied to a geo-fence around the pharmacy. What it doesn't do (and I can't see how ChatGPT would help here) is recommend any pharmacy that I walk past; instead it has to be just that CVS, and if there are multiple CVSs nearby, I would have had to pick one.


I’ve been slowly integrating more devices into my Home Assistant setup—most recently a garage door controller—and I thought I was done for now and content with the setup. After reading this I can’t help but plan out the next several weekends trying to replicate this in hass.


This is so cool, and ofc impressive. Also, I learned a lot about how powerful Shortcuts is.

Ever since I played with ChatGPT (just with the chat UI) I am kind of amazed every time I use Siri: it’s crazy how far we have come.


Funny, isn't this the exact current version of the "Egg" in the Black Mirror Christmas Special? The most self aware thing we could build nowadays to manage our homes.

https://en.wikipedia.org/wiki/White_Christmas_(Black_Mirror)


ChatGPT now has a paid plan (and a limited free plan), so I wonder how apps like these that integrate into ChatGPT are going to handle it, I assume they'd have to start being monetized as well.

https://news.ycombinator.com/item?id=34476842


Can someone help me understand: is ChatGPT essentially just using the GPT-3 text-davinci-003 under the hood, with a bunch of prompt-prefixing (similar to what is going on in this guy's Siri Shortcut)? Or is it using a significantly more powerful model?


ChatGPT is GPT-3.5 plus some reinforcement learning from human feedback[1] to steer it away from the things that tanked Tay. Meaning they had a bunch of humans test the thing out, rate responses, and incorporate that into future training.

[1]: https://en.wikipedia.org/wiki/ChatGPT#Training


From what I understand, they refer to it as GPT 3.5. So, that plus the prompt prefixing as you mentioned.


It was either text-davinci-002-render or text-davinci-003-render


AFAIK that's exactly what it is.


If so, doesn’t that make it trivial for anyone to make a ChatGPT-like app using the GPT-3 API? If so, I would’ve expected several ChatGPT-like tools coming out in the days following GPT-3’s release. I don’t understand why it took so many months - and then, why it was OpenAI themselves that finally released the first interactive thing on top of it. It’s puzzling.


OpenAI charges per token. You get an initial credit, but the "setup" prompt is so long that each request will be far more expensive than "normal" requests, since you'll be charged for the entire setup prompt with every single request.


Yep, I have been playing with the $18 API credit and indeed the cost is not trivial for casual use cases.

That said, these new developments give us a new reason to have powerful computers. For the last 10 years, computers were way more powerful than needed for most conventional use cases, but now they are yet again far too weak for the new frontier use cases.

Apparently GPT-3 runs on 350GB of VRAM, so it requires specialised hardware; the best we can do today is sneak a peek into the future of on-device GPT-3-like LLMs.


That input is 610 tokens. With the actual prompt and the response let's say you hit 1000 tokens. That would be 2 cents per question then.

Not sure how often people ask Siri questions, but that seems like it would not be a significant cost for one user. If you were to build a service on top of it, that might be expensive though.
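The arithmetic checks out, assuming Davinci's roughly $0.02 per 1,000 tokens at the time:

    # Back-of-the-envelope cost per request (pricing figure is an assumption).
    PRICE_PER_1K_TOKENS = 0.02       # USD, text-davinci-003
    setup_prompt_tokens = 610        # the fixed smart-home prompt
    request_and_reply_tokens = 390   # rough allowance for the question + JSON reply

    total = setup_prompt_tokens + request_and_reply_tokens
    print("$%.3f per question" % (total / 1000 * PRICE_PER_1K_TOKENS))  # ~$0.020

Which also lines up with the article's "$0.014 per request" figure for a shorter exchange.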


In a home with only responsible adults, it would be fine. With kids it would be a repeat of the olden days before unlimited texting when the parents would get the cellphone bill at the end of the month.

But I wonder how soon we’ll have fine-tuned or otherwise customized specialized LLMs that are cheaper to run? In the mean time this kind of thing is a really neat proof-of-concept of how ChatGPT’s abilities can be harnessed.


GPT-3 is really impressive with the Davinci model, but this seems like overkill for the tasks here. You can save 10x [0] by just switching to the Curie model.

More information on the models: https://subscription.packtpub.com/book/data/9781800563193/2/...

[0]: https://openai.com/api/pricing/


This seems like a moon landing without the fanfare. I’m shocked this is possible.


This is great. How is it going to be affected by the pricing changes OpenAI is making these days?



