I personally like this sample the most [0]. Note that these samples are not cherry-picked: having worked with closely related algorithms [1], I can say that once the model is trained well, it pretty much "just works".

There is a lot of room for DSP/hacks/tricks to improve audio quality, just as in concatenative systems, but the point of this demo is to show what is possible with raw data + deep learning. Also note that this is (as far as I am aware) learned directly on real data such as YouTube, or recordings + transcripts. That is quite a bit different from approaches which require commercial-grade TTS databases, which generally feature professional speakers with more than 10 hours of speech each and cost a lot of money.

[0] https://soundcloud.com/user-535691776/special-guest-at-iclr

[1] http://josesotelo.com/speechsynthesis/

