
We're talking about audiobooks here. An actor recording an audiobook does not read the sentences or paragraphs in a random order without context.

Sure, voice acting for games or movies is done piecemeal. But the actor still gets information about the story ahead of time to inform their acting, along with their general cultural knowledge as a human. Most crucially, when acting is done this way, there is a human director in the loop with deep knowledge of the story and a strong vision, coaching the actor as they record each line and selecting takes afterward. When the directing is done poorly, it is pretty easy to tell.

Sure, for a movie or game you could direct a TTS system line by line in the same way and select takes manually, but it would be labor-intensive and not at all automatic. And to take human direction, the model would need more than just the text as input: either a special annotation language (requiring a bunch of engineering and specially annotated training datasets) or, preferably, a general audio-to-audio model that can understand the same kind of direction a human voice actor gets.



