Sharing Screen with GPT 4 vision (loom.com)
160 points by Suneel478 on Nov 9, 2023 | 72 comments



Blender is definitely the app AI was made to assist us in using. I keep coming back to Blender every year and picking up a little more of the art of using it (mainly by watching the pros on YouTube).

Photoshop? KiCad? Final Cut Pro? Blender wins as the app that I have struggled the most to master.


You're not struggling; it's just an unwieldy application. Struggling is not being able to hold a wrench; this is just difficulty picking up an excavator.


It's not an unwieldy application. It's an unwieldy domain. 3d is hard. Drawing apps are hard. Video editing is relatively hard. Throw those together, and you get fundamentally __hard__.

The excavator is a good analogy. The best-designed most graceful excavator will still be _hard_ (at least for more complex tasks).


Why not both? 3D is hard, but Blender has some very peculiar UX choices that require a lot of memorization of iconography and hotkeys to use effectively.

BforArtists exists specifically to provide better surfacing of Blender's interactions: https://youtu.be/0vEtTP0C0Cs?si=comeTyStz98t9-a0

Granted, Blender 2.8 onwards (and even 2.x onwards) is a huge step up from the 1.x days, but it still has one of the most opaque interaction models of any of the common 3D DCCs.


I have to second this. It's rare for an application's UX to piss me off enough to start looking into modding it or creating a fork, but Blender is at the top of that list.

Just having a better way to customize hotkeys would go really far, and I want my mouse and movement controls to be roughly the same as Unreal Engine's. Until those two things happen, I use Blender, but I'm always angry when I do.


Even tasks which are not 3D but are simply managing the application itself are quite hard in Blender. So you have a double hit: first 3D modeling is quite hard, but then the application itself is also unintuitive. If it were only one or the other then it would be more manageable.


It's only unintuitive if you don't know it. Once you know the common keybindings, it feels as if you're modelling by hand.

There are 14-year-olds making great use of it [1].

[1] https://variety.com/2023/film/news/spider-man-across-the-spi...


I think you missed what intuitive means.


Alright, fair enough. I might not have explained myself well, then.

Blender "shortcuts" are most of the time one key, followed by others, alternative variations can be achieved with modifier keys (makes sense).

It takes very little time to see what the shortcuts are. Most of the time you can just hover over a tool icon; other times, menus have them clearly labelled next to each item.

At this point, you know the basics. Your human eyes are very good at perceiving changes in your FoV, and after inputting a key, the status bar presents information on alternative modes or filters available for that tool.

Eventually, you start assuming (correctly) that other tools behave in the same fashion for a variety of things.

For example, you press "Y" (y-axis) after "S" (scale), and you scale on that axis! If you follow that with a number, you scale by that factor! And this can be applied to every other tool. Moreover, this and other combos make sense and are easy to understand. Whatever effect you imagine they have on other tools is most likely the exact outcome.
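To make that concrete, here's roughly the bpy equivalent of pressing S, then Y, then 2 in the 3D viewport, a minimal sketch assuming you run it from Blender's Python console with something selected (the factor of 2 is just an example):

    import bpy

    # Equivalent of the interactive "S, Y, 2" combo:
    # scale the current selection by 2 along the Y axis only.
    bpy.ops.transform.resize(value=(1.0, 2.0, 1.0))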

Blender is very sane: it does exactly what you tell it to do. You do have to make the call, but when that call is only a keypress away, it isn't an issue.

You can learn by just using it. Information is laid out clearly. Modifiers/filters are consistent, making that knowledge easily transferable to different tools.

It cannot get more intuitive than this.


No, they didn't. As the saying goes, the only intuitive interface is the nipple - everything else is learned.

Blender is only "not intuitive" if you already learned to use a different program in the same or adjacent domain (say some CAD tool, or Unreal Editor, or even Paint 3D or Sketchup), as it's not going to be similar enough. Similarity to existing software is a good thing, but not worth it if you can offer much better ergonomics otherwise. Blender could, and did.


Except I've been able to do medium-complexity tasks in various Autodesk products (Inventor, 3ds Max) within minutes of opening the program for the first time, without looking at any documentation, while I struggle to do even simple tasks in Blender, even with many hours of going through tutorials. I try to pick it up again every few years, and inevitably give up several weeks in.


With respect, a lot of the founding principles are not especially complicated. Of course you can keep adding detail and capability; that will make any domain hideously complex after some time. That doesn't make it fundamentally unwieldy.


Try FreeCAD :)

A few days ago, I tried to have it produce an o-ring. It didn't work out so well: https://i.imgur.com/zYDJXpT.png

Having it use Python behaved much better: https://i.imgur.com/8uQSFtZ.png
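For context, an o-ring in FreeCAD scripting is basically just a torus, so the kind of script that works looks roughly like this (a sketch of the approach, not the exact code GPT produced; the dimensions are made up):

    import FreeCAD
    import Part

    doc = FreeCAD.newDocument("ORing")
    # makeTorus(major_radius, minor_radius); 10 mm / 1.5 mm are illustrative values
    ring = Part.makeTorus(10.0, 1.5)
    Part.show(ring)
    doc.recompute()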

I also gave it a more complicated problem, and it didn't do too badly before it forgot all about the instructions. It clearly lacks 3D understanding, but I don't know if I could have done better given what it was given: https://imgur.com/a/8GCjmlo

While doing this, I got a lot of "There was an error generating a response" responses, especially when the screen was mostly full of nothing. I don't know how to avoid this, but it definitely struggled picking out the few pixels of relevant detail from a mostly-irrelevant screen.


The piano of the 3d world


More like an accordion that you found on the side of the road.


3D apps are like that; there are simply more degrees of freedom.


This is definitely what I want... forget privacy, the world is all in on convenience. Let GPT4 look over my shoulder at my screen and also listen to my voice and automagically tell me useful info. Of course, I should be able to copy/paste or otherwise insert the output.


This is exactly why in https://github.com/OpenAdaptAI/OpenAdapt we have put a lot of emphasis on scrubbing screenshots before sharing them with the model, and built tools to visualize the results.


The very stuff we are made of isn't private, so why do we care about the privacy of mental processes when most of the input to those processes is culturally provided? The only reason people care about privacy is that they will be socially and financially punished if they let others inside their head.


I think this is the real future of AI. Our phone and computer UIs will be replaced with telling the chatbot what we want the phone or computer to do.


What is missing right now is for the AI to be able to challenge instructions, and ask for clarification and stuff like that.

It's currently not deterministic enough to entirely replace other kinds of UI.


I wonder how much of this is GPT-4 actually inferring the details of what's on the screen vs GPT-4 recognizing the "shape" of Blender, getting the task (which is simple and most likely has a myriad of tutorials online), and just proceeding step by step. Basically, whether asking GPT-4 the same question in text mode, with "in Blender" added, wouldn't have resulted in the same effect.

Ideally, this could be tested with something that's less straightforward and requires understanding the data presented in some windows on the screen, like "how can I fix this error?"


Am I going insane? That cube never became a sphere?


You are not, and this demo is a total fail. Not only did the cube not become a sphere; the AI took ages to reply, the instructions were wrong, and the result was a total mess.

I assume most people in these comments don't understand 3D modeling, or they are seriously optimistic about THE IDEA of vision-assistant AIs, but this demo is not exciting at all. In fact, it's detrimental to showcasing real utility.


Classic HN response.

This is just an early taste of a potentially powerful use case.

I understand the vision API doesn't have memory, so each screenshot it takes is like an entirely new context. If the script/application were able to send WHAT application it's in, and had some RAG database in the backend to pull knowledge from, this would be incredibly useful.
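A hedged sketch of that flow (every name here is a hypothetical placeholder, not an existing tool): detect the foreground app, pull matching docs from a RAG store, and prepend them to the prompt that goes out with the screenshot.

    from typing import List

    def get_active_app_name() -> str:
        # Placeholder: a real script would query the OS for the focused window.
        return "Blender"

    def retrieve_docs(app: str, question: str, k: int = 3) -> List[str]:
        # Placeholder: a real backend would run a vector-store similarity search
        # over that app's documentation and tutorials.
        return [f"(top-{k} documentation snippets about {app} would go here)"]

    def build_prompt(question: str) -> str:
        app = get_active_app_name()
        context = "\n\n".join(retrieve_docs(app, question))
        return (
            f"The user is currently working in {app}.\n"
            f"Relevant documentation:\n{context}\n\n"
            f"User question: {question}"
        )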

Of course it’s slow now. If you’re legitimately stuck, a couple of seconds for a personalized answer is a perfect trade-off. It will get better.


I think every UI application should start logging the actions the user takes so that AI could learn the mappings from actions to visual output. It would be an amazing form of data.


I could say your comment is a classic 2023 HN comment...? There is no reason to be overly optimistic about other people’s products. Plus, nobody said “oh wow, this will never work”; it’s just currently quite bad.


I couldn’t hear it perfectly, but I’m pretty sure the instructions it provided were to transform the vertices of the cube to make the sphere. It’s like using MS Frontpage. It may look right, but it’s a convoluted mess underneath.


Have mercy on him. Remember, this is the worst version it will ever be.


I thought the post was a sarcastic/joke submission, but after reading the comments I’m confused. I mean, sure, the idea is cool and I’m sure it’ll work better soon, but right now it’s hilariously bad. And we’re talking about a really simple operation.

There seems to be a general trend I’ve noticed on HN of “it’s cool to be optimistic,” and I don’t really like it. I mean, sure, I’m optimistic when it comes to the human beings around me, but when discussing technology and people who are out there to make money (or not), I don’t feel the need to be overly optimistic. Plus, competition is the very foundation of innovation, and harsh, honest feedback is a very important piece of the puzzle.


Created a script to share my screen with GPT-4 and ask it to guide me through Blender. Latency could be better if the OpenAI TTS API supported streaming text input.
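The core of such a script is small. A hedged sketch of the screenshot-to-answer step (not the author's actual code), assuming the openai>=1.0 Python client, the gpt-4-vision-preview model available at the time, Pillow for screen capture, and an OPENAI_API_KEY in the environment:

    import base64
    import io

    from openai import OpenAI
    from PIL import ImageGrab  # screen capture; works on Windows/macOS

    client = OpenAI()

    def ask_about_screen(question: str) -> str:
        # Grab the screen and encode it as a base64 data URL.
        shot = ImageGrab.grab()
        buf = io.BytesIO()
        shot.save(buf, format="PNG")
        image_b64 = base64.b64encode(buf.getvalue()).decode()

        # Send the question plus the screenshot to the vision-capable model.
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
            max_tokens=500,
        )
        return response.choices[0].message.content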


Pretty cool. I wonder if there's enough interest from other devs to hook into something like this or build "apps" on top of this ability. Are you trying to "productionize" this or build a platform for others on top of it, or was it a cool hack that you don't plan on spending much more time on? Did you send screenshots at a certain FPS, or do some other decision-making on when to use the bigger LLM?

My little startup has a system that can stream apps to an ML processing service with both a video of what's on screen and things like the context around it and what you clicked on. We run an LLM on top of this after a bunch of other processing (OCR, delta change detection, speech recognition, etc.) for our knowledge-capture purposes for our single app. It would be really straightforward to make such a platform available to others to build apps like what you're showing, on more than just a browser: pretty much any application you run on a desktop, or working across multiple of them. We haven't, since before the new GPT-4 with vision most people weren't working on anything where that would help.

Anyway, I think we've solved a bunch of the heavy lifting to make this possible. Feel free to email me (this goes for anyone else reading this who likes this space, too) if you're a company or dev that might want that layer so you can build something cool on top of it like this (diamond@augmend.com).


Can you share the script? It’s a great idea


Will clean up scripts and share soon on this thread - https://twitter.com/suneel_matham/status/1722538037551530069


Yes I feel like the delay seen in the video is about the same length as I would expect when making a request without streaming.


Yeah, ElevenLabs supports input text streaming, but in practice it seems to wait for quite a while before it starts streaming. I should look into ways to make TTS much more instant.


Play.ht has a great Turbo model that produces audio chunks within less than 200ms from request. It doesn’t sound quite as good as ElevenLabs, but it’s about 90% as good and is much faster. Might be worth checking out.


I don't understand what I'm looking at here, can someone explain it?


Seems the author found a way to share their screen with ChatGPT.

I'm guessing it's probably just taking a screenshot of the screen right after the voice input finishes (there are easy ways to recognize pauses) and sending that to the multimodal version of GPT-4, the one that is able to work with image data.

ChatGPT properly recognizes the context of the task: that's the typical newly created document in the 3D software Blender. Since it starts out with a box, the user wants to shape it into a sphere. ChatGPT provides a list of operations: change the selection mode to vertices, select them all, and apply a bevel function, which in effect creates a lousy sphere-like object.


It is kind of an odd ask. I'm with someone else here: ChatGPT needs to be better at challenging instructions. It'd be way better to just say "Hey, delete that cube and make a sphere."


Oh, the way ChatGPT took in this assignment is super strange. One, because almost all tutorials start out with "delete the default cube", which is a multi-year-old meme at this point.

Two, because of what you mentioned: there are a couple of other ways (like your primitive, but also a NURBS lathe of a half circle, the box with a lot of smoothing steps, or a subdivided icosahedron with smoothed faces).


The person with the male voice is sharing their screen with GPT4, then uses a voice interface to ask GPT4 what to do next. The computery-sounding female voice is ChatGPT4 answering, via a text-to-speech interface.

The app used is Blender, a 3d modeling application.


Cheap marketing for OpenAI’s chatbot.


Or someone playing around with an idea and, by doing so, showing us another interesting way of interacting with AI.

But hey, your negative and simple comment could also be true, who knows.


To be honest, as an experiment from someone playing around with an idea, it's really cool. As "marketing," it's trash, since it is slow, clunky, and the answer is never actually given.


Erm - delete the cube, then click Add->Mesh->UV Sphere :-)
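Or, scripted: a minimal bpy sketch of those same two clicks, assuming a fresh scene with the default cube still selected.

    import bpy

    bpy.ops.object.delete()                 # delete the selected default cube
    bpy.ops.mesh.primitive_uv_sphere_add()  # Add -> Mesh -> UV Sphere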

In all seriousness though - this is absolutely amazing. Imagine it in conjunction with the Facebook/Rayban glasses with integrated cameras and headphones. Now you can walk around an event and hear "this is John Doe, he's a VP at X Corp..." or you look at a product and hear "you can get this for 30% less at this store"...

I appreciate the concerns around privacy - but tech has steadily been moving in the opposite direction - so at least we're starting to get some value from giving up so much data.


> or you look at a product and hear "you can get this for 30% less at this store"...

That's called an ad (in particular, a coupon) and could be done already with phone cameras and the pseudo-AR that's been trendy in the past few years, but it's cheaper done by simple contextual advertising or even a coupon site.

Not that it's actually useful. That "30% less" is coming from somewhere, and believe me, it's not coming from the sellers if they can help it. Someone's getting shafted, and as the saying goes, if you can't spot the sucker in the room, you're the sucker.


> Imagine it in conjunction with the Facebook/Rayban glasses with integrated cameras and headphones. Now you can walk around an event and hear "this is John Doe, he's a VP at X Corp..." or you look at a product and hear "you can get this for 30% less at this store"...

This is the imagination of my nightmares. Surveillance and consumption. Useful tech might have you look at a product and say "you don't actually need that; you can use the one you have at home, which works great," or maybe list out the amount of global energy it takes every time you and the rest of the world query the model.


>Imagine it in conjunction with the Facebook/Rayban glasses with integrated cameras and headphones. Now you can walk around an event and hear "this is John Doe, he's a VP at X Corp..." or you look at a product and hear "you can get this for 30% less at this store"...

Yeah, people walking around with little cameras recording everything they see and sending it to OpenAI sounds totally awesome and not like a dystopian black mirror episode at all!


It sounds appealing - until we start relying on it more and more, even for the menial stuff, stop bothering to turn it off, and hyper-capitalism/consumerism takes it over. I find this to be quite believable:

https://youtu.be/YJg02ivYzSs

I sure hope uBlock will work on it :)


Is there any VNC version yet? I was going to make one for accessibility (people with heavy RSI, or paralysed but able to speak/see) but couldn't figure out how. Related to the Vimium GPT project posted today: I would like that for VNC, which is impossible without vision, because how are you going to 'point' at things just by explaining?


Would love to hear more about what you're trying to build here. Are you thinking of using VNC to control and stream a desktop?


I think you could get pretty far without vision here. Feed all the existing tutorials and Q&A's about a program into GPT and ask it for guidance and I bet it could do well at giving you steps.

I would be interested in how well AI vision could extract written tutorials out of video tutorials, which could then also be used for Q&A.


I've been asking ChatGPT for Adobe Illustrator help over the last couple of days, and the results are as good as the video here. It gets things wrong, but no more wrong than someone very familiar with Illustrator would.


This was my experience with Unity as well. It had suggestions, but it was hard to keep it on a specific version and keep it relevant. It was helpful with specific small tasks (even more complex ones, like "how can I make this thing move 'forward' while respecting rotation"), but with bigger unknowns, like just asking why something is going wrong, it had no idea, even with screenshots. It made sense; some of the stuff in Unity is just messy and stupid lol


There are some tools appearing in this space that have native prompting support. Pretty impressive honestly.

https://spline.design/ai


Am I correct to say this is a multimodal model using vision and audio?

What model is it? And how is it understanding the image and the question? Can anyone shed some light on this technical process?


GPT-4 is multimodal in the sense that it can take images as input. The person is using a speech-to-text system such as OpenAI's Whisper, serving screenshots and voice transcripts to GPT-4, and GPT-4 returns a text response, which is converted to speech using a text-to-speech system such as OpenAI's TTS API.
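A hedged sketch of the two audio ends of that loop with the OpenAI Python client (the GPT-4 vision call in the middle is omitted, and the answer string is a placeholder):

    from openai import OpenAI

    client = OpenAI()

    # Speech -> text: transcribe the user's spoken question with Whisper.
    with open("question.wav", "rb") as audio:
        question = client.audio.transcriptions.create(model="whisper-1", file=audio).text

    # ... send `question` plus a screenshot to GPT-4 with vision, get back `answer` ...
    answer = "Switch to Edit Mode, select all vertices, then apply a bevel."  # placeholder

    # Text -> speech: read the model's answer back to the user.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    speech.stream_to_file("answer.mp3")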


Ah got it! So basically the prompt to GPT4 is an image + text (converted from audio).


How long before GPT-4 can interact with a program, run tests, and rewrite the program in any framework?


In its current form, no chance. But if we continue on this trajectory (if that's possible, of course), then pretty soon. I personally think a barrier to further advancement will pop up soon. However, today I am picking olives (not a joke) for my new career should it happen (a joke; I would just retire).


You don't think it's already plateaued? In terms of reasoning ability, anyway.


Not even close. You can look at the Llama v2 training curves and it's clear that even 7B models are still under-trained, and they've only been trained on 2T tokens. A 70B, or whatever size/model amalgamation GPT-4 is, is also likely under-trained.

Apart from just plain under-training, there's that whole 'grokking' phenomenon that's been observed in smaller models, where it looks like they're not really improving for a long time, but _eventually_ suddenly undergo a massive improvement. I don't know if anyone has been willing to set enough cash on fire to see if grokking can happen in a large LLM and/or how long it would need to be trained for.

There's also still a lot of juice left in improving the quality of datasets and in training on larger data sets. OpenAI, having built GPT-4, probably want to work on some applications of it given that they have enough of a lead on everyone else. I think there's definitely a few "burn piles of cash to train a better model" buttons at OpenAI right now, but they have no reason to use them when they're clearly in the lead anyway.

There's also the actual limiting factor: there are only so many A100s/H100s in the world, but the amount is growing.


Yeah, even RL tuning can be applied to improve reasoning paths for specialized applications, something like RLAIF by Anthropic but for specialized agents.


Well yes, that's the feeling I have, but there are now so many very smart people working on smarter ways, and billions are being poured in (not only into the 'AI for car searches' type of sites...); I'm willing to give it the benefit of the doubt.


Using Cursor as my editor, it's close as of today. I mostly steer the AI in a direction rather than typing on my own the majority of the time now. Especially complex refactorings or debugging/error hunting work surprisingly well when the full codebase can be used as embedded vectors on demand, automatically.


Yeah, you kind of want to say, "GPT, turn the cube into a sphere," and not have to do the clicks/drags yourself.


Yeah, you'd need a UI overlay assist at least, sort of like hints in games.


Give it 1 year


That's not a sphere.


The page is completely blank without javascript running.


This is like saying "I can't watch the video without a browser that supports video playback", yep?



