Hacker News | nareshshah139's comments

Nah, you can do all of this with a simple Phi-3.5 Instruct + SAM 2, both of which fit on an Nvidia Jetson Orin 64 GB module.

We do this at scale in factories and warehouses, describing everything that happens inside them, like:

- Idle time
- Safety incidents
- Process following across frames
- Breaks
- Misplacement of items
- Counting items placed/picked/assembled across frames
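To make a couple of those concrete, here is a rough sketch of how per-frame model outputs could be rolled up into idle time and placed/picked counts. This is not our production code; the `FrameLabel` record, its field names, and the "idle" label are all made up for illustration, and it assumes the vision models have already produced one label and one item count per frame.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameLabel:
    # Hypothetical per-frame record; field names are assumptions.
    t: float           # frame timestamp in seconds
    activity: str      # free-text label from the vision model
    items_in_bin: int  # item count reported for one tracked region


def idle_seconds(frames: List[FrameLabel], idle_label: str = "idle") -> float:
    """Sum the gaps between consecutive frames whose earlier frame is idle."""
    return sum(
        nxt.t - cur.t
        for cur, nxt in zip(frames, frames[1:])
        if cur.activity == idle_label
    )


def placed_and_picked(frames: List[FrameLabel]) -> Tuple[int, int]:
    """Treat count increases as items placed and decreases as items picked."""
    placed = picked = 0
    for cur, nxt in zip(frames, frames[1:]):
        delta = nxt.items_in_bin - cur.items_in_bin
        if delta > 0:
            placed += delta
        elif delta < 0:
            picked += -delta
    return placed, picked


frames = [
    FrameLabel(0.0, "picking", 2),
    FrameLabel(1.0, "idle", 3),
    FrameLabel(2.0, "idle", 3),
    FrameLabel(3.0, "placing", 5),
    FrameLabel(4.0, "idle", 4),
]
print(idle_seconds(frames), placed_and_picked(frames))
```

In practice you would want to smooth the labels over a short window first, since a single misclassified frame would otherwise split an idle period or create a phantom pick.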


SAM 2/Grounding DINO + Phi-3.5 Instruct (Vision) give you an essentially unlimited vocabulary.
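The reason the vocabulary is open-ended is that both stages take free text: a prompt tells the detector what to look for, and a question tells the vision-language model what to say about each region. A minimal sketch of that shape follows. The `detect` and `ask` callables are hypothetical stand-ins, not the real Grounding DINO, SAM 2, or Phi-3.5 APIs, and the fakes below exist only so the snippet runs.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def describe_frame(
    frame: object,
    prompt: str,
    question: str,
    detect: Callable[[object, str], List[Box]],
    ask: Callable[[object, Box, str], str],
) -> List[Tuple[Box, str]]:
    """Find regions matching a text prompt, then ask a question about each one."""
    boxes = detect(frame, prompt)
    return [(box, ask(frame, box, question)) for box in boxes]


# Fake stand-ins for the real models (assumptions, for illustration only).
def fake_detect(frame: object, prompt: str) -> List[Box]:
    return [(0, 0, 10, 10)] if "pallet" in prompt else []


def fake_ask(frame: object, box: Box, question: str) -> str:
    return f"answer to: {question}"


result = describe_frame(None, "pallet", "Is it damaged?", fake_detect, fake_ask)
print(result)
```

Swapping in a new object class or a new question is just a string change, which is the whole point: nothing gets retrained.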

If you want audio transcription, just add distil-whisper to the mix.

You should perhaps use that system to reread my post. Maybe it can explain it to you too.

Looks like the Open Interpreter project might be what you’re looking for.

