Was following these two projects by someuser on Github which makes similar things possible with Local models. Sending screenshot to openai is expensive , if done every few seconds or minutes.
Nice! Although the productivity increase from being able to resolve blockers more quickly adds up to a lot (at least for me), local models would be more cost effective -- and probably feel less iffy for many people.
I went for OpenAI because I wanted to build something quickly, but you should be able to replace the external API calls with calls to your internal models.
https://github.com/KoljaB/LocalAIVoiceChat
While the below one uses openai - don't see why it can't be replaced with above project and local mode.
https://github.com/KoljaB/Linguflex