
Right, so there's more to it than I initially thought, but it's still hopelessly data-constrained. They’re hoping you could magically obtain all the necessary data from images and videos recorded by the phone when you remember to use the camera.

From my experience building a meeting-minutes AI tool for myself, I've found that audio carries far more semantic information than video, and we're still missing most of the model capabilities needed to make audio useful, like speaker diarization. For video you need object detection, and not just the 80-odd COCO categories that YOLO or DETR cover. You need to build a hierarchy of objects, in addition to OCR running continuously on every frame.
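
To make the video side concrete, here's the sort of per-frame loop I mean. This is a rough sketch only: it assumes Ultralytics YOLO for detection and pytesseract for OCR, and the file names are placeholders.

    import cv2
    import pytesseract
    from ultralytics import YOLO

    detector = YOLO("yolov8n.pt")  # COCO-trained, so only ~80 fixed classes

    cap = cv2.VideoCapture("day_recording.mp4")  # placeholder video file
    frame_log = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Closed-set object detection on the frame
        result = detector(frame, verbose=False)[0]
        labels = [detector.names[int(c)] for c in result.boxes.cls]
        # OCR on the same frame (convert BGR -> RGB for the OCR engine)
        text = pytesseract.image_to_string(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).strip()
        frame_log.append({"labels": labels, "text": text})
    cap.release()

And that's the easy part; the hierarchy of objects and the audio side (who said what to whom) is where current models fall short.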

Once the raw data collection is done, you somehow need to integrate it into a RAG system that can retrieve all of this in a meaningful way to feed to an LLM, with a context length far beyond anything currently available.
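
As a rough illustration of what the retrieval step looks like, assuming sentence-transformers embeddings and a plain cosine-similarity search (no real vector store, and the LLM call itself is left out); the snippets and query are made up:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # One text snippet per frame/segment, built from the detection + OCR output
    snippets = [
        "frame 120 | labels: person, laptop | ocr: Q3 roadmap",
        "frame 450 | labels: person, whiteboard | ocr: sprint review",
    ]
    snippet_vecs = embedder.encode(snippets, normalize_embeddings=True)

    def retrieve(query, k=5):
        # Cosine similarity works here because the vectors are normalized
        q = embedder.encode([query], normalize_embeddings=True)[0]
        top = np.argsort(-(snippet_vecs @ q))[:k]
        return [snippets[i] for i in top]

    # The retrieved snippets would then be stuffed into the LLM prompt as context
    context = "\n".join(retrieve("when did I discuss the Q3 roadmap?"))

The hard part isn't this toy version, it's doing it over months of continuous footage without blowing past any realistic context window.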

All in all, just for inference you'd need more compute than you'd find in the average supercomputer today. Give it 20 years, plus multiple always-on cameras and microphones attached to you, and this will be as simple as running a local 8B LLM is today.

Nah, you can do all of this with a simple Phi-3.5 Instruct + SAM 2, both of which fit on an Nvidia Jetson Orin 64 GB module.

We do this at scale in factories/warehouses, describing everything that happens within, such as:

- Idle time
- Safety incidents
- Process following across frames
- Breaks
- Misplacement of items
- Counting items placed/picked/assembled across frames


SAM 2 / Grounding DINO + Phi-3.5 Instruct (Vision) give you an essentially unlimited vocabulary.
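
Roughly like this for the open-vocabulary detection step, via the Hugging Face transformers port of Grounding DINO. The prompt text and thresholds are just examples, the post-processing signature varies a bit between transformers versions, and handing the boxes to SAM 2 / the crops to Phi-3.5 Vision is left out:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

    model_id = "IDEA-Research/grounding-dino-base"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    image = Image.open("frame.jpg")  # placeholder frame from the camera feed
    # Prompt is free text: lowercase phrases separated by periods
    text = "a worker. a pallet. a forklift. a safety vest."

    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,
        text_threshold=0.25,
        target_sizes=[image.size[::-1]],
    )[0]

    for label, score, box in zip(results["labels"], results["scores"], results["boxes"]):
        print(label, round(score.item(), 2), box.tolist())
    # Boxes go to SAM 2 for masks, crops/frames go to Phi-3.5 Vision for descriptions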

If you want audio transcription, just add Distil-Whisper to the mix.
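
Something like this, via the transformers ASR pipeline (checkpoint name per the Distil-Whisper release; the audio file is a placeholder):

    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v3")

    result = asr("shift_recording.wav", return_timestamps=True)
    print(result["text"])
    for chunk in result.get("chunks", []):
        print(chunk["timestamp"], chunk["text"])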

You should perhaps use that system to reread my post. Maybe it can explain it to you too.



