Most of them are extremely mechanical, to the point where they're almost impossible to understand, but others are actually quite convincing.

I think it primarily needs to learn to respect punctuation, and to translate it into breathing pauses that match the target voice ("President giving a speech"-style long pauses vs. "Politician having their ass handed to them by a journalist on TV"-style no-air-needed pauses).
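
A rough sketch of what I mean, in Python. Everything here is made up for illustration: the pause lengths, the two style names, and the `add_pauses` helper are all hypothetical and not taken from the actual system. It only shows the idea of mapping each punctuation mark to a pause and scaling that pause by speaking style.

```python
import re

# Hypothetical base pause lengths in seconds; the numbers are illustrative guesses.
BASE_PAUSE = {",": 0.25, ";": 0.35, ":": 0.35, ".": 0.6, "?": 0.6, "!": 0.6}

# Hypothetical per-style multipliers: long pauses for a speech, short ones for an interview.
STYLE_SCALE = {"speech": 1.8, "interview": 0.4}


def add_pauses(text, style):
    """Return a list of (chunk, pause_seconds) pairs for the given style."""
    scale = STYLE_SCALE[style]
    result = []
    # Each match is a run of non-punctuation text plus an optional trailing mark.
    for chunk, mark in re.findall(r"([^,;:.?!]+)([,;:.?!]?)", text):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Text with no trailing mark gets no pause.
        pause = BASE_PAUSE.get(mark, 0.0) * scale
        result.append((chunk + mark, round(pause, 2)))
    return result


print(add_pauses("Yes, we can.", "speech"))
print(add_pauses("Yes, we can.", "interview"))
```

A real system would presumably learn these durations from recordings of the target voice rather than hard-coding them.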

Absolutely - listening through the multiple samples with different intonation from both Obama and Trump, some of the samples are much more realistic, while others come off as robotic.

Maybe it would be possible to train the system to prefer certain intonations in certain cases by rating the realism of the speech in context. It would be interesting to analyze pauses around words grouped by word2vec! Or to choose a "style" of intonation based on punctuation, parameters like words/minute, etc.

