From what I've read, people prime the text with something obviously emotional, punctuation included, ahead of the output they actually want.
They then trim the audio to remove the prime phrase.
So something along the lines of:
"Remove this audio because it makes me super angry! Very angry! Now I say what I want to keep!"
Just having the AI read a script is often not enough on its own; without post-processing and manipulation, the result is often somewhat flat.
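For anyone who wants to try the prime-and-trim trick, here is a rough sketch of the two steps in Python. It is not tied to any particular voice generator: the speech synthesis call itself is left out, `primed_text` and `trim_prime` are names made up for illustration, and it assumes the tool hands you an uncompressed WAV file and that you already know (for example by listening in an audio editor) roughly how many seconds the prime phrase takes.

```python
import wave

# Hypothetical prime phrase, taken from the example above.
PRIME = "Remove this audio because it makes me super angry! Very angry!"


def primed_text(prime, line):
    """Put the emotional prime in front of the line you actually want spoken."""
    return f"{prime} {line}"


def trim_prime(src_path, dst_path, prime_seconds):
    """Copy a WAV file, dropping the first prime_seconds of audio.

    prime_seconds is an assumption you supply yourself; nothing here
    detects where the prime phrase ends. Returns the number of frames kept.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Convert seconds to frames, never skipping past the end of the file.
        skip = min(int(prime_seconds * src.getframerate()), src.getnframes())
        src.readframes(skip)  # read and discard the primed portion
        kept = src.readframes(src.getnframes() - skip)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed on close
        dst.writeframes(kept)
    return len(kept) // (params.sampwidth * params.nchannels)
```

You would send `primed_text(PRIME, "Now I say what I want to keep!")` to whatever generator you use, save the result, then call something like `trim_prime("raw.wav", "clean.wav", 3.2)`, where the file names and the 3.2 seconds are placeholders. A hard cut like this can clip the start of a word, so leaving a short pause between the prime and the real line probably helps.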
I put some Shakespeare into the test on their main page and it was surprisingly good at times. Hearing an AI emote "Out, out, brief candle!" makes me feel strange. This has to replace voice actors in gaming at least. You could change the text as needed and perhaps just add some hints to get the emotion you want.
I feel these kinds of tools really need a way to supply "emotional cues" alongside the text, just like the direction you give to voice actors. I don't think forcing the writer to put all the subtle emotional cues into explicit dialogue text is a good idea.
My impression of these voice generators was that they're only good for YouTube tutorials and such. That was 2 years ago tho, so I wonder how things have changed.