The problem might be that high frequencies, especially overtones, aren't properly constructed, but I'm certain that can be improved.
You can synthesize someone's voice perfectly, but if it's stressing words incorrectly or not at all, it's not going to fool anyone.
Then again, that's probably easier to work around by having humans annotate the sentences to be read.
Or by starting with a recording of someone else reading the sentence. Then you get the research problem known as "voice conversion", which has been studied a fair amount, but mostly prior to the deep learning era - and mostly without the constraint of limited access to the target person's voice. (On the other hand, research often goes after 'hard' conversions like male-to-female, whereas if your goal is forgery, you can probably find someone with a similar voice to record the input.)
Anyway, here's an interesting thing from 2016, a contest to produce the best voice conversion algorithm, with 17 entrants:
Even then, I don't believe the issue is with stress. I believe the voices sound robotic because they use very few samples ("less than a minute", they claim), a figure they advertise because it makes the results look impressive in some sense. Speech systems are usually trained on triphones. The number of triphones (3-phoneme-grams) needed to cover a language's phonemic inventory is huge (50 phonemes = 50^3 = 125,000 possible triphones, which could mean a few hours of audio, although many will not occur given the phonotactics of the language).
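To make the arithmetic concrete, here is a rough sketch of the count. It assumes a 50-phoneme inventory and treats a triphone as any ordered triple of phonemes with repetition allowed, so it is an upper bound that ignores phonotactics. The function names and the toy transcription are made up for illustration and don't come from any particular speech toolkit.

```python
def possible_triphones(n_phonemes: int) -> int:
    """Upper bound: ordered triples of phonemes, repetition allowed."""
    return n_phonemes ** 3


def observed_triphones(phonemes):
    """Distinct triphones (sliding windows of length 3) in one phoneme sequence."""
    return {tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)}


# Assumed inventory size, taken from the comment above.
upper_bound = possible_triphones(50)

# Hypothetical phoneme transcription of a short utterance (illustrative labels only).
utterance = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
seen = observed_triphones(utterance)

# Fraction of the theoretical triphone space this one utterance touches.
coverage = len(seen) / upper_bound

print(upper_bound, len(seen), coverage)
```

An eight-phoneme utterance contributes at most six triphones, which is why a minute of audio covers only a sliver of the space, even after discarding the triples the language never uses.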