I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker-based transcript. I'm not sure that Google's models do the same; maybe they just hallucinate the speakers instead.




Agreed. I don’t see the need for Gemini to be able to do this task, although it should be able to offload it to another model.


