Blender is definitely the app AI was made to assist us in using. I keep coming back to Blender every year and picking up a little more of the art of using it (mainly by watching the pros on YouTube).
Photoshop? KiCad? Final Cut Pro? Blender wins as the app that I have struggled the most to master.
You're not struggling; it's just an unwieldy application. Struggling is not being able to hold a wrench; this is just the difficulty of picking up an excavator.
It's not an unwieldy application. It's an unwieldy domain. 3d is hard. Drawing apps are hard. Video editing is relatively hard. Throw those together, and you get fundamentally __hard__.
The excavator is a good analogy. The best-designed most graceful excavator will still be _hard_ (at least for more complex tasks).
Granted, Blender 2.8 onwards (and even 2.x onwards) was a huge step up from the 1.x days, but it still has one of the most opaque interaction models of any of the common 3D DCCs.
Have to second this. It's rare for an application's UX to piss me off enough to start looking into modding it or creating a fork, but Blender is at the top of that list.
Just having a better way to customize hotkeys would go really far, and I want my mouse and movement controls to be roughly the same as Unreal Engine's. Until those two things happen I'll keep using Blender, but I'm always angry when I do.
Even tasks which are not 3D but are simply managing the application itself are quite hard in Blender. So you have a double hit: first 3D modeling is quite hard, but then the application itself is also unintuitive. If it were only one or the other then it would be more manageable.
Alright, fair enough. I might have not explained myself well then.
Blender "shortcuts" are most of the time one key, followed by others, alternative variations can be achieved with modifier keys (makes sense).
It takes very little time to see what the shortcuts are. Most of the time you can just hover over a tool icon; other times, menus have them clearly labelled next to each item.
At this point, you know the basics. Your human eyes are very good at perceiving changes in your FoV, and after inputting a key, the status bar presents information on alternative modes or filters available for that tool.
Eventually, you start assuming (correctly) that other tools behave in the same fashion for a variety of things.
For example, you press "Y" (y-axis) after "S" (scale), and you scale on that axis! If you followed that by a number, you scale by that scale factor! And this can be applied to every other tool. Moreover, this and other combos make sense, are easy to understand. Whatever you may imagine as the effect they have on other tools is most likely the exact outcome.
Blender is very sane: it does exactly what you tell it to do. You do have to make the call, but when that call is only a keystroke away, it isn't an issue.
You can learn by just using it. Information is laid out to you clearly. Modifiers/filters are consistent, making their knowledge easily transferrable to different tools.
No, they didn't. As the saying goes, the only intuitive interface is the nipple - everything else is learned.
Blender is only "not intuitive" if you already learned to use a different program in the same or adjacent domain (say some CAD tool, or Unreal Editor, or even Paint 3D or Sketchup), as it's not going to be similar enough. Similarity to existing software is a good thing, but not worth it if you can offer much better ergonomics otherwise. Blender could, and did.
Except I've been able to do medium-complexity tasks in various Autodesk products (Inventor, 3ds Max) within minutes of opening the program for the first time, without looking at any documentation. While I struggle to do even simple tasks in Blender, even with many hours of going through tutorials. I try to pick it up again every few years, and inevitably give up several weeks in.
With respect, a lot of the founding principles are not especially complicated. Of course you can keep adding detail and ability; that will make any domain hideously complex after some time. That doesn't make it fundamentally unwieldy.
I also gave it a more complicated problem, and it didn't do too badly before it forgot all about the instructions. It clearly lacks 3D understanding, but I don't know if I could do better given what it was given: https://imgur.com/a/8GCjmlo
While doing this, I got a lot of "There was an error generating a response" responses, especially when the screen was mostly full of nothing. I don't know how to avoid this, but it definitely struggled picking out the few pixels of relevant detail from a mostly-irrelevant screen.
This is definitely what I want... forget privacy, the world is all in on convenience.
Let GPT4 look over my shoulder at my screen and also listen to my voice and automagically tell me useful info.
Of course, I should be able to copy/paste or otherwise insert the output.
This is exactly why in https://github.com/OpenAdaptAI/OpenAdapt we have put a lot of emphasis on scrubbing screenshots before sharing them with the model, and built tools to visualize the results.
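As a toy illustration of the idea (not OpenAdapt's actual implementation), you can OCR a screenshot and black out anything that looks like an email address before the image ever leaves the machine:

```python
import re

import pytesseract
from PIL import Image, ImageDraw

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub(path: str, out_path: str) -> None:
    """Black out OCR'd words that look like email addresses."""
    img = Image.open(path)
    draw = ImageDraw.Draw(img)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word and EMAIL.search(word):
            x, y = data["left"][i], data["top"][i]
            w, h = data["width"][i], data["height"][i]
            draw.rectangle([x, y, x + w, y + h], fill="black")
    img.save(out_path)

scrub("screenshot.png", "screenshot_scrubbed.png")
```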
The very stuff we are made of isn't private, so why do we care about the privacy of mental processes when most of the input to those processes is culturally provided? The only reason people care about privacy is because they will be socially and financially punished if they let others inside their head.
I wonder how much of this is GPT-4 actually inferring the details of what's on the screen vs GPT-4 recognizing the "shape" of Blender, getting the task (which is simple and most likely has a myriad of tutorials online), and just proceeding step by step. Basically, whether just asking GPT-4 the same question in text mode, with "in Blender" added, wouldn't have resulted in the same answer.
Ideally, this could be tested with something less straightforward that requires understanding the data presented in some window on the screen, like "how can I fix this error?"
You are not, and this demo is a total fail. Not only did the cube not become a sphere, but the AI took ages to reply, the instructions were wrong, and the result was a total mess.
I assume most people in these comments don't understand 3D modeling, or they are seriously optimistic about THE IDEA of vision assistant AIs, but this demo is not exciting at all. In fact, it's detrimental to showcasing real utility.
This is just an early taste of a potentially powerful use case.
I understand the vision API doesn’t have memory, so each screenshot it takes is like an entire new context. If the script/application is able to send WHAT application it’s in, and has some RAG database in the backend to pull knowledge from, this would be incredibly useful.
Of course it’s slow now. If you’re legitimately stuck, a couple seconds for a personalized answer is a perfect trade off. It will get better.
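A minimal sketch of that, assuming the OpenAI chat endpoint that accepts images ("gpt-4-vision-preview" at the time of writing) and a placeholder retrieve_docs() standing in for whatever RAG store you'd use:

```python
import base64

from openai import OpenAI

client = OpenAI()

def retrieve_docs(app_name: str, question: str) -> str:
    # Placeholder: plug in your vector store / docs search here.
    return ""

def ask_about_screen(question: str, screenshot_path: str, app_name: str) -> str:
    image_b64 = base64.b64encode(open(screenshot_path, "rb").read()).decode()
    docs = retrieve_docs(app_name, question)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {"role": "system",
             "content": f"The user is working in {app_name}. Relevant docs:\n{docs}"},
            {"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ]},
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content
```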
I think every UI application should start logging the actions the user takes so that AI could learn the mappings from actions to visual output. It would be an amazing form of data.
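Even something as crude as this would capture the mapping (a rough sketch using pynput and mss; a real version would need consent, scrubbing, and richer event context):

```python
import json
import os
import time

import mss
from pynput import mouse

os.makedirs("shots", exist_ok=True)
log = open("ui_actions.jsonl", "a")

def on_click(x, y, button, pressed):
    if not pressed:
        return
    shot_path = f"shots/{int(time.time() * 1000)}.png"
    with mss.mss() as sct:
        sct.shot(output=shot_path)   # screenshot of the primary monitor
    log.write(json.dumps({"t": time.time(), "x": x, "y": y,
                          "button": str(button),
                          "screenshot": shot_path}) + "\n")
    log.flush()

# Log every click, paired with a screenshot taken at the moment of the action.
with mouse.Listener(on_click=on_click) as listener:
    listener.join()
```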
I could say your comment is a classic 2023 HN comment..? There is no reason to be overly optimistic about other people's products. Plus, nobody said "oh wow this will never work", it's just currently quite bad.
I couldn’t hear it perfectly, but I’m pretty sure the instructions it provided were to transform the vertices of the cube to make the sphere. It’s like using MS Frontpage. It may look right, but it’s a convoluted mess underneath.
I thought the post was a sarcastic/joke submission, but after reading the comments I’m confused. I mean sure nice the idea is cool and I’m sure it’ll work better soon, but right now it’s hilariously bad. And we’re talking about a really simple operation.
There seems to be a trend on HN in general I've noticed of "it's cool to be optimistic", and I don't really like it. I mean sure, I'm optimistic when it comes to the human beings around me, but when discussing technology and people who are out there to make money (or not), I don't feel the need to be overly optimistic. Plus, competition is the very foundation of innovation, and harsh, honest feedback is a very important piece of the puzzle.
Created a script to share my screen with GPT-4 and ask it to guide me through Blender. Latency could be better if the OpenAI TTS API supported streaming text input.
Pretty cool. I wonder if there's enough interest from other devs to hook into something like this or build "apps" on top of this ability. Are you trying to "productionize" or build a platform for others on this or was it a cool hack that you don't plan on spending much more time on? Did you send screenshots at a certain FPS or do some other decision making on when to use the bigger LLM?
My little startup has a system that can stream apps to an ML processing service with both a video of what's on screen and things like the context around it and what you clicked on. We run an LLM on top of this, after a bunch of other processing (OCR, delta change detection, speech reco, etc.), for our knowledge capture purposes in our single app. It would be really straightforward to make such a platform available to others to build apps like what you're showing on more than just a browser, pretty much any application you run on a desktop, or working across multiple of them. We haven't, because before the new GPT-4 with vision most people weren't working on anything where that would help.
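The delta check itself can be as cheap as something like this (illustrative only, not our actual pipeline): only hand a frame to the expensive model when enough pixels have actually changed.

```python
import numpy as np
from PIL import Image

def changed_enough(prev_path: str, curr_path: str, threshold: float = 0.02) -> bool:
    """True if more than `threshold` of the pixels differ noticeably."""
    a = np.asarray(Image.open(prev_path).convert("L"), dtype=np.float32)
    b = np.asarray(Image.open(curr_path).convert("L"), dtype=np.float32)
    if a.shape != b.shape:
        return True                           # resolution changed, always send
    frac_changed = np.mean(np.abs(a - b) > 10)  # per-pixel grayscale delta
    return frac_changed > threshold             # e.g. more than 2% of pixels moved
```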
Anyway, I think we've solved a bunch of the heavy lifting to make this possible. Feel free to email me, or anyone else reading this who likes this space, if you're a company or dev that might want that layer so you can build something cool on top of it like this (diamond@augmend.com).
Yeah, ElevenLabs supports input text streaming, but in practice it seems to wait for quite a while before it starts streaming. I should look into ways to make the TTS much more instant.
Play.ht has a great Turbo model that produces audio chunks within less than 200ms from request. It doesn’t sound quite as good as ElevenLabs, but it’s about 90% as good and is much faster. Might be worth checking out.
Seems the author found a way to share their screen with ChatGPT.
I'm guessing it's probably just sending a screenshot of the screen right after the voice input finishes (there are easy ways to recognize pauses) to the multimodal version of GPT-4, the one which is able to work with image data.
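One of those easy ways, sketched with webrtcvad (assuming 16 kHz, 16-bit mono audio cut into 30 ms frames):

```python
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness 0 (lenient) to 3 (strict)
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # each frame must be this many bytes

def utterance_ended(frames, silence_frames=20):
    """True once the last ~600 ms of 30 ms frames contain no speech."""
    tail = frames[-silence_frames:]
    return (len(tail) == silence_frames
            and not any(vad.is_speech(f, SAMPLE_RATE) for f in tail))
```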
ChatGPT properly recognizes the context of the task: that's the typical newly created document in the 3D software Blender. Since it starts out with a box, the user wanted to shape it into a sphere. ChatGPT provides him with a list of operations: change the selection mode to vertices, select them all, and apply a bevel function, which, in effect, will produce a lousy sphere-like object.
It is kind of an odd ask. I'm with someone else here: ChatGPT needs to be better at challenging instructions. It'd be way better to just say "Hey, delete that cube and make a sphere."
Oh, the way ChatGPT took in this assignment is super strange. One, because almost all tutorials start out with "delete the default cube", which is a multi-year-old meme at this point.
Two, because of what you mentioned: there are a couple of other ways (like your primitive, but also a NURBS lathe of a half circle, the box with a lot of smoothing steps, or a subdivided icosahedron with smoothed faces).
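The "box with smoothing" route, for instance, is just a couple of modifiers in the Python console (a sketch for 2.8+; property names can shift between versions):

```python
import bpy

cube = bpy.context.active_object                        # the default cube
subsurf = cube.modifiers.new("Subdivision", type='SUBSURF')
subsurf.levels = 3                                      # subdivide the faces
cast = cube.modifiers.new("Cast", type='CAST')          # cast type defaults to sphere
cast.factor = 1.0                                       # push vertices fully onto it
bpy.ops.object.shade_smooth()
```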
The person with the male voice is sharing their screen with GPT-4, then uses a voice interface to ask GPT-4 what to do next. The computery-sounding female voice is GPT-4 answering, via a text-to-speech interface.
The app used is Blender, a 3d modeling application.
To be honest, as an experiment by some person playing with an idea, it's really cool. As "marketing", it's trash, since it is slow, clunky, and the answer is never actually given.
Erm - delete the cube, then click Add->Mesh->UV Sphere :-)
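Or, in the Python console, the same two steps (standard bpy operators, assuming the default cube is still selected):

```python
import bpy

bpy.ops.object.delete()                  # delete the selected default cube
bpy.ops.mesh.primitive_uv_sphere_add()   # Add > Mesh > UV Sphere
```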
In all seriousness though - this is absolutely amazing. Imagine it in conjunction with the Facebook/Rayban glasses with integrated cameras and headphones. Now you can walk around an event and hear "this is John Doe, he's a VP at X Corp..." or you look at a product and hear "you can get this for 30% less at this store"...
I appreciate the concerns around privacy - but tech has steadily been moving in the opposite direction - so at least we're starting to get some value from giving up so much data.
> or you look at a product and hear "you can get this for 30% less at this store"...
That's called an ad (in particular, a coupon) and could be done already with phone cameras and the pseudo-AR that's been trendy in the past few years, but it's cheaper done by simple contextual advertising or even a coupon site.
Not that it's actually useful. That "30% less" is coming from somewhere, and believe me, it's not coming from the sellers if they can help it. Someone's getting shafted, and as the saying goes, if you can't spot the sucker in the room, you're the sucker.
> Imagine it in conjunction with the Facebook/Rayban glasses with integrated cameras and headphones. Now you can walk around an event and hear "this is John Doe, he's a VP at X Corp..." or you look at a product and hear "you can get this for 30% less at this store"...
This is the imagination of my nightmares. Surveillance and consumption. Useful tech might have you look at a product and say "you don't actually need that. you can use the one you have at home that works great" or maybe list out the amount of global energy it takes every time you and the world queries the model.
>Imagine it in conjunction with the Facebook/Rayban glasses with integrated cameras and headphones. Now you can walk around an event and hear "this is John Doe, he's a VP at X Corp..." or you look at a product and hear "you can get this for 30% less at this store"...
Yeah, people walking around with little cameras recording everything they see and sending it to OpenAI sounds totally awesome and not like a dystopian black mirror episode at all!
It sounds appealing - until we start relying on it more and more for even the menial stuff, we stop bothering to turn it off, and hyper-capitalism/consumerism takes it over. I find this to be quite believable:
Is there any VNC version yet? I was going to make one for accessibility (people with heavy RSI, or paralysed but able to speak/see) but couldn't figure out how. Related to the Vimium GPT project today: I would like that for VNC, which is impossible without vision, because how are you going to 'point' at things by just explaining?
I think you could get pretty far without vision here. Feed all the existing tutorials and Q&A's about a program into GPT and ask it for guidance and I bet it could do well at giving you steps.
I would be interested in how well AI vision could extract written tutorials out of video tutorials, which could then also be used for Q&A.
I've been asking ChatGPT for Adobe Illustrator help over the last couple of days, and the results are as good as the video here. It gets things wrong, but no more wrong than someone very familiar with Illustrator would.
This was my experience with Unity as well. It had suggestions, but it was hard to keep it on a specific version and keep it relevant. It was helpful with specific small tasks (even more complex ones, like "how can I make this thing move 'forward' respecting rotation"), but with bigger unknowns, like just asking why something is going wrong, it had no idea even with screenshots. It made sense; some of the stuff in Unity is just messy and stupid lol
GPT-4 is multimodal in the sense that it can take images as input. The person is using a speech-to-text system such as OpenAI's Whisper and serving screenshots and voice transcripts to GPT-4, and GPT-4 is returning a text response which is converted to speech using a text-to-speech system such as OpenAI's TTS API.
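Glued together, that loop might look roughly like this. This is a sketch only, not the author's actual script; it assumes the openai Python client with whisper-1 transcription, the gpt-4-vision-preview chat model, the tts-1 speech endpoint, and mss for screenshots:

```python
import base64

import mss
from openai import OpenAI

client = OpenAI()

def capture_screen(path="screen.png"):
    with mss.mss() as sct:
        sct.shot(output=path)            # screenshot of the primary monitor
    return path

def answer_question(audio_path: str, reply_audio_path: str = "reply.mp3") -> str:
    # 1. Speech to text.
    question = client.audio.transcriptions.create(
        model="whisper-1", file=open(audio_path, "rb")).text
    # 2. Screenshot plus question to the vision model.
    image_b64 = base64.b64encode(open(capture_screen(), "rb").read()).decode()
    reply = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]}],
        max_tokens=400,
    ).choices[0].message.content
    # 3. Text back to speech.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.stream_to_file(reply_audio_path)
    return reply
```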
In its current form, no chance. But if we continue on this trajectory (if that's possible, of course), then pretty soon. I personally think a barrier to further advancement will pop up soon. However, today I am picking olives (not a joke) for my new career if it should happen (a joke; I would just retire).
Not even close. You can look at the Llama v2 training curves and it's clear that even 7B models are still under-trained, and they themselves have only been trained on 2T tokens. A 70B, or whatever size/model amalgamation GPT-4 is, is also likely under-trained.
Apart from just plain under-training, there's that whole 'grokking' phenomenon that's been observed in smaller models, where it looks like they're not really improving for a long time, but _eventually_ suddenly undergo a massive improvement. I don't know if anyone has been willing to set enough cash on fire to see if grokking can happen in a large LLM and/or how long it would need to be trained for.
There's also still a lot of juice left in improving the quality of datasets and in training on larger data sets. OpenAI, having built GPT-4, probably want to work on some applications of it given that they have enough of a lead on everyone else. I think there's definitely a few "burn piles of cash to train a better model" buttons at OpenAI right now, but they have no reason to use them when they're clearly in the lead anyway.
There's also the actual limiting factor: there are only so many A100s/H100s in the world, but the amount is growing.
Yeah, even RL tuning can be applied to improve reasoning paths for specialized applications, something like RLAIF by Anthropic but for specialized agents.
Well yes, that's the feeling I have, but there are now so many very smart people working on smarter ways and billions are poured in (not only on the 'AI for car searches' type of sites...); I'm willing to give it some benefit of the doubt.
Using Cursor as my editor, it's close as of today; I mostly steer the AI in a direction rather than typing on my own the majority of the time now. Especially complex refactorings and debugging/error hunting work surprisingly well when the full codebase can be used as embedded vectors on demand, automatically.