I built a realtime visual intelligence app that connects a user's phone camera to a multimodal LLM. I use the open source Pipecat framework, WebRTC, and a few other services to connect it all together.
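Roughly, the Pipecat side looks like this. This is a minimal sketch, not my exact code: the Daily room URL is a placeholder, the transport params are trimmed down, and the actual STT/LLM/TTS services are elided since their constructors vary by provider.

```python
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    # WebRTC transport carrying the phone's mic and camera.
    # (Daily here, but any Pipecat transport would work.)
    transport = DailyTransport(
        "https://example.daily.co/room",  # placeholder room URL
        None,                             # token, if the room needs one
        "vision-bot",
        DailyParams(audio_out_enabled=True),
    )

    pipeline = Pipeline([
        transport.input(),   # audio + video frames in from the phone
        # ... STT, context aggregation, multimodal LLM, TTS go here ...
        transport.output(),  # synthesized speech back to the caller
    ])

    await PipelineRunner().run(PipelineTask(pipeline))


asyncio.run(main())
```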
It's similar to ChatGPT Advanced Voice, and it's grounded with google_search for async internet searches, triggered by transcripts or by video frames, which are sampled at 1 fps and sent to the LLM.
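The 1 fps sampling is basically a timestamp gate in front of the LLM. Here's a sketch of the idea; the FrameThrottler class and the simulated 30 fps camera loop are illustrative, not my production code:

```python
import time


class FrameThrottler:
    """Let through at most `fps` video frames per second; drop the rest."""

    def __init__(self, fps: float = 1.0):
        self.min_interval = 1.0 / fps
        self.last_sent = float("-inf")

    def should_send(self, now: float | None = None) -> bool:
        # Forward a frame only if enough time has passed since the last one.
        now = time.monotonic() if now is None else now
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False


if __name__ == "__main__":
    throttler = FrameThrottler(fps=1.0)
    # Simulate a 30 fps camera: only ~1 in 30 frames passes the gate.
    for i in range(90):
        if throttler.should_send(now=i / 30.0):
            print(f"frame {i} forwarded to the LLM")
```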
Let me know what you think and if you want to work on some fun scaling problems with me on this project.
www.withsen.com