
Right, so there's more to it than I initially thought, but it's still hopelessly data-constrained. They’re hoping you could magically obtain all the necessary data from images and videos recorded by the phone when you remember to use the camera.

From my experience building a meeting-minutes AI tool for myself, I've found that audio carries far more semantic information than video, and we're still missing most of the model capabilities needed to make audio useful, like speaker diarization. For video you need object detection, and not just the 80-odd COCO categories that YOLO or DETR cover. You need to build a hierarchy of objects, in addition to OCR running continuously on every frame.
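
To make the video side concrete, here's the sort of per-frame loop I mean. This is a rough sketch only: it assumes Ultralytics YOLO for detection and pytesseract for OCR, and the file names are placeholders.

    import cv2
    import pytesseract
    from ultralytics import YOLO

    detector = YOLO("yolov8n.pt")  # COCO-trained, so only ~80 fixed classes

    cap = cv2.VideoCapture("day_recording.mp4")  # placeholder video file
    frame_log = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Closed-set object detection on the frame
        result = detector(frame, verbose=False)[0]
        labels = [detector.names[int(c)] for c in result.boxes.cls]
        # OCR on the same frame (convert BGR -> RGB for the OCR engine)
        text = pytesseract.image_to_string(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).strip()
        frame_log.append({"labels": labels, "text": text})
    cap.release()

And that's the easy part; the hierarchy of objects and the audio side (who said what to whom) is where current models fall short.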

Once the raw data collection is done, you somehow need to integrate it into a RAG system that can retrieve all of this in a meaningful way to feed to an LLM, with a context length far beyond anything currently available.
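
As a rough illustration of what the retrieval step looks like, assuming sentence-transformers embeddings and a plain cosine-similarity search (no real vector store, and the LLM call itself is left out); the snippets and query are made up:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # One text snippet per frame/segment, built from the detection + OCR output
    snippets = [
        "frame 120 | labels: person, laptop | ocr: Q3 roadmap",
        "frame 450 | labels: person, whiteboard | ocr: sprint review",
    ]
    snippet_vecs = embedder.encode(snippets, normalize_embeddings=True)

    def retrieve(query, k=5):
        # Cosine similarity works here because the vectors are normalized
        q = embedder.encode([query], normalize_embeddings=True)[0]
        top = np.argsort(-(snippet_vecs @ q))[:k]
        return [snippets[i] for i in top]

    # The retrieved snippets would then be stuffed into the LLM prompt as context
    context = "\n".join(retrieve("when did I discuss the Q3 roadmap?"))

The hard part isn't this toy version, it's doing it over months of continuous footage without blowing past any realistic context window.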

All in all, just for inference you'd need more compute than you'd find in the average supercomputer today. Give it 20 years, plus multiple always-on cameras and microphones attached to you, and this will be as simple as running a local 8B LLM is today.

Nah, you can do all of this with a simple Phi-3.5 Instruct + SAM 2, both of which fit on an Nvidia Jetson Orin 64 GB module.

We do this at scale in factories/warehouses, describing everything that happens within, such as:

- Idle time
- Safety incidents
- Process following across frames
- Breaks
- Misplacement of items
- Counting items placed/picked/assembled across frames


SAM 2 / Grounding DINO + Phi-3.5 Instruct (Vision) give you an essentially unlimited vocabulary.
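
Roughly like this for the open-vocabulary detection step, via the Hugging Face transformers port of Grounding DINO. The prompt text and thresholds are just examples, the post-processing signature varies a bit between transformers versions, and handing the boxes to SAM 2 / the crops to Phi-3.5 Vision is left out:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

    model_id = "IDEA-Research/grounding-dino-base"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    image = Image.open("frame.jpg")  # placeholder frame from the camera feed
    # Prompt is free text: lowercase phrases separated by periods
    text = "a worker. a pallet. a forklift. a safety vest."

    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,
        text_threshold=0.25,
        target_sizes=[image.size[::-1]],
    )[0]

    for label, score, box in zip(results["labels"], results["scores"], results["boxes"]):
        print(label, round(score.item(), 2), box.tolist())
    # Boxes go to SAM 2 for masks, crops/frames go to Phi-3.5 Vision for descriptions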

If you want audio transcription, just add Distil-Whisper to the mix.
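
Something like this, via the transformers ASR pipeline (checkpoint name per the Distil-Whisper release; the audio file is a placeholder):

    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v3")

    result = asr("shift_recording.wav", return_timestamps=True)
    print(result["text"])
    for chunk in result.get("chunks", []):
        print(chunk["timestamp"], chunk["text"])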

You should perhaps use that system to reread my post. Maybe it can explain it to you too.



