Okay the bike example is cute and impressive, but the human interaction seems to be obfuscating the potentially bigger application.
With a few tweaks this is a general purpose solver for robotics planning. There are still a few hard problems between this and a working solution, but this is one of the hard problems solved.
Will we be seeing general purpose robots performing simple labor powered by chatgpt within the next half decade?
That bike example seemed a mix of underwhelming (for being the demo video) and confusing:
1. It's not smart enough to recognize from the initial image that this is a bolt-style seat lock (which a human can).
2. The manual is not shown to the viewer, so I can't infer how the model knows this is a 4 mm bolt (or whether it is just guessing because that's the most likely size).
3. I don't understand how it can know the toolbox contains metric Allen wrenches.
Additionally, is this just the same vision model that exists in Bing Chat?
The bike shown in the first image is a Specialized Sirrus X. You can make out from the image of the manual that it says "spacer/axle/bolt specifications". Searching for this yields the following Specialized bike manual, which is similar: https://www.manualslib.com/manual/1974494/Specialized-Epic-E... -- there are some notable differences, but the Specialized Sirrus X manuals that are online aren't in the same style.
The prior page (8) shows "SEAT COLLAR 4mm HEX" and, based on looking up seat collar in an image search, the part in question matches.
In terms of the toolbox, note that it only identified the location of the Allen wrench set. The advice was just "Within that set, find the 4 mm Allen (Hex) key". Had they replied with "I don't see any sizes in mm", the conversation could have continued with "Your Allen keys might be using SAE sizing. A compatible size would be 5/32; do you see that in your set?"
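For anyone wondering whether 5/32" really is the nearest SAE key to a 4 mm hex, here's a quick back-of-the-envelope check in Python (the candidate list is just the common small SAE sizes, not something pulled from the demo):

    # Sanity check of the metric-to-SAE substitution mentioned above.
    MM_PER_INCH = 25.4
    metric_in = 4.0 / MM_PER_INCH  # 4 mm is about 0.1575 in
    sae_candidates = {"1/8": 1 / 8, "9/64": 9 / 64, "5/32": 5 / 32, "3/16": 3 / 16}
    closest = min(sae_candidates, key=lambda k: abs(sae_candidates[k] - metric_in))
    print(f"4 mm = {metric_in:.4f} in, closest SAE key = {closest} "
          f"({sae_candidates[closest]:.4f} in)")
    # -> 5/32 (0.1563 in), about a thousandth of an inch undersized,
    #    so it works in a pinch but sits slightly loose in the bolt head.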
It bugged me that they made no mention of torque. The manual is really clear on that part with a big warning:
> WARNING! Correct tightening force on fasteners (nuts, bolts, screws) on your bicycle is important for your safety. If too little force is applied, the fastener may not hold securely. If too much force is applied, the fastener can strip threads, stretch, deform or break. Either way, incorrect tightening force can result in component failure, which can cause you to lose control and fall. Where indicated, ensure that each bolt is torqued to specification. The following is a summary of torque specifications in this manual...
The seat collar also probably has the max torque printed on it.
When they asked if they had the right tool, I would have preferred to see an answer along the lines of "Ideally you should be using a torque wrench. You can use the wrench you have currently, but be careful not to overtighten."
Ah, good find. Yeah, I tried Bing and it is able to read a photo of that manual page and understand that the seat collar takes a 4 mm hex wrench (though it hallucinated and told me the torque was 5 Nm rather than the correct 6.2 Nm, suggesting its table reading is imperfect).
Toolbox: I just found it too strong to claim you have the right tool, when it really doesn't know that. :)
In the end it does feel like the image reader is just bolted onto an LLM. Basically, just doing object recognition and dumping features into the LLM prompt.
Like a basic CLIP description: Tools, yellow toolbox, DEWALT, Allen wrenches, instruction manual. And then just using those keywords in the prompt.
Yes, you're right, it does feel like that.
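If that speculation is right, the plumbing would look roughly like the sketch below. This is purely illustrative: the CLIP checkpoint is a real Hugging Face model, but the label list, threshold, and prompt template are invented, and nothing here is confirmed about how OpenAI actually wires it up.

    # Hypothetical "object labels dumped into an LLM prompt" pipeline.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    candidate_labels = ["a yellow DEWALT toolbox", "a set of Allen wrenches",
                        "a bicycle", "an instruction manual", "a torque wrench"]
    image = Image.open("user_photo.jpg")

    inputs = processor(text=candidate_labels, images=image,
                       return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

    # Keep whatever CLIP scores highly and hand the LLM only this text;
    # the image itself is never consulted again.
    detected = [l for l, p in zip(candidate_labels, probs.tolist()) if p > 0.15]
    prompt = (f"The user's photo contains: {', '.join(detected)}. "
              "Answer their question using only this description.")

If the pipeline really is one-shot like this, follow-up questions about details that never made it into the extracted description can only be answered by guessing, which would match the hallucination behaviour people describe with Bing.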
Yep. This example basically convinced me that they were unable to figure out anything actually useful to do with the model's new capabilities. Which makes me wonder how capable the new model in fact is.
Yeah, pretty sure it is the same feature that's been in Bing Chat for two months now. It really feels like there's only one pass of feature extraction from the image, preventing any detailed analysis beyond a coarse "what do you see". (Follow-up questions about things it likely didn't parse are heavily hallucinated.)
This is why it can't extract the seatpost information directly from the bike photo when the user asks. There's no "going back and looking at the image".
Edit: nope, it's a better image analyzer than Bing
The implementation that manifests itself as an extremely creepy, downright concerning level of dubious moral transgressions isn't nearly as publicly glamorous as their tech demos.
I feel they could have used a more convincing example, to be honest. Yeah, it's cool that it recognises so much, but how useful is the demo in reality?
You have someone with a toolbox and a manual (seriously, who has a manual for their bike?), asking the most basic question of how to lower a seatpost. My 5-year-old kid knows how to do that.
Surely there's a better way to demonstrate the ground-breaking impact of AI on humanity than this. I dunno, something like "how do I tie my shoelace".
Even on something the size of a car, ChatGPT won't be running locally; the car and the drone are equally capable of hitting OpenAI's API in a well-connected environment.
What needs to happen with the response is a different matter though.
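For a sense of what the "hit the API from the device" path involves, here is a minimal sketch assuming the official openai Python package (>= 1.0) and a hosted multimodal chat model; the model name, image file, and prompt are placeholders, and turning the reply into safe actuator commands is exactly the part that remains unsolved.

    # Edge device (car, drone, robot) offloading perception/planning to a
    # hosted multimodal model instead of running anything locally.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("camera_frame.jpg", "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder multimodal model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "You are planning for a wheeled robot. List the next "
                         "three actions to reach the doorway, one per line."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
    )

    plan_text = response.choices[0].message.content  # free-form text, not commands
    print(plan_text)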
Humans don't spend 18+ years preparing to lower a seatpost or drive a truck or even do pretty much most jobs. No one is solely training for 18 years to do anything.
Most of those 18 years are spent having a fucking great time (being young is freakin awesome), and living a great life is never a waste or a negative ecological footprint.
Society artificially slows education down so it takes 18 years to finish school because parents need to be off at work, so 18 years of babysitting is preferred. By 18, kids are at the age where they will no longer be told what to do, so it's off to the next waste of time, college, then 30 years of staring at a blinking box... or whatever.
When I was 12, I decided I wanted to drive a car. I'd never driven a car in my life, but I took my parents' car and drove it around wherever I liked with absolutely no issue or prior instruction. I did this for years.
The youth are very capable, we just don't want them to be too capable...