For example, in a movie with a busy action scene involving gunfire/helicopters/storms, the production audio is usually useless because there were fans and other loud machinery on the film set creating the visual illusion of powerful winds and so on. In this situation the audio is either not recorded or used only as a guide track to match timing against a high-quality track recorded later in the studio. Actors hate re-recording dialog by lip-syncing in front of a screen, and producers hate paying for it. This solution is already good enough to use for incidental characters in scenes that are going to be noisy anyway. I give it about 5 years before it's good enough to use in ordinary dialog scenes - not for the intimate conversation between the two Famous Actors flirting over a romantic dinner, but fine for replacing the waiter or other background characters.
Though coming soon: Neural networks to determine whether speech is NN-generated? :P
I guess this would be an ideal use case for a generative adversarial network based approach.
That means that recordings of speeches/performances from concert halls are potentially suspicious.
Know of any other instances of reverb covering things up? This is interesting!
History is now doomed. Crackly recordings are obviously fakeable. Children will listen to JFK's "We shall not go to the moon" speech, proof that the moon landings are a liberal conspiracy and all that grainy footage is just CGI with a noise filter.
She stated it as a simple fact and seemed to believe it. There wasn't any indication that she might think it was a fact of what one might call "political convenience". It made the scene that much more chilling to me: https://www.youtube.com/watch?v=MpKUBHz6MB4
Something similar seems to be going on in live vocals. If you lack confidence in your own voice, adding a bit of reverb can make it sound much better. I'm not sure what's going on; maybe the reverb jams the critical-listening faculties in one's brain, or something like that.
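The effect being described is easy to play with in code. Here's a minimal sketch of a single feedback comb filter, the basic building block of classic Schroeder-style reverbs; the delay and decay values are arbitrary placeholders, not anything a real mixing console would use:

```python
def comb_reverb(samples, delay=1500, decay=0.4):
    """Single feedback comb filter: each sample is fed back
    `delay` samples later, attenuated by `decay`."""
    out = list(samples)
    for i in range(delay, len(out)):
        out[i] += out[i - delay] * decay
    return out

# An impulse comes back as a decaying train of echoes:
echoes = comb_reverb([1.0] + [0.0] * 9, delay=3, decay=0.5)
# echoes[3] == 0.5, echoes[6] == 0.25, echoes[9] == 0.125
```

A real reverb runs several of these in parallel with different delays (plus allpass filters), which is exactly what smears over the small pitch and timing flaws in a shaky vocal.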
a.k.a. why your singing sounds way better in the shower.
Now I understand why I liked adding a bit of reverb when listening to old MOD/IT/S3M audio files - it covered up the "digitalness" of the song structure a bit.
Thanks for the live vocals tidbit too, that's definitely something to file away.
I wonder how far you could push that in a presentational context (i.e., when giving speeches), or whether "who left the speakers in 'dramatic cathedral' mode?" would happen before "I dunno what they did to the audio, but it sounds great". Maybe if the presentation area were fairly open/large it could work; the question is whether it would have a constructive effect.
It's similar to the difference between 24fps cinema and 60fps home video. The video/clean signal retains more of the original information and is "more correct", but 24fps/a touch of reverb adds a nuance that keeps things from getting too clinical. As to why we interpret clean signal == clinical == bad... I can't really speculate.
A standard-quality video is just a projection; a high-quality video stream on a 4K set running a full 122Hz is a weird window from which we don't get stereo depth cues. The brain constantly has to remind itself that it's not real as we shift our head and the POV doesn't adjust.
Once LCD density allows for VR with 4K (or, if needed, 8K) per eye... yeah :) we'll firmly be in the virtual reality revolution.
Obviously we'll also need tracking and rendering that can keep up but display density is one of the trickier problems right now.
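A rough back-of-envelope shows why per-eye density is the sticking point. The ~100° horizontal FOV and the ~60 pixels-per-degree figure often quoted for foveal acuity are ballpark assumptions here, not specs from any headset:

```python
def pixels_per_degree(horizontal_pixels, fov_degrees):
    """Angular resolution if the panel's width is spread across the FOV."""
    return horizontal_pixels / fov_degrees

# Assuming ~100 degrees horizontal FOV per eye:
ppd_4k = pixels_per_degree(3840, 100)  # ~38 ppd
ppd_8k = pixels_per_degree(7680, 100)  # ~77 ppd, past the ~60 ppd
                                       # often quoted for foveal acuity
```

On those assumptions, 4K per eye still visibly undersamples the fovea, which is why 8K keeps coming up.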
> If you can listen to it and hear reverb (unless you're going for that effect) you're doing it wrong.
I was thinking precisely that; I figured it'd need to be subtle and just-above-subliminal to have the most effect.
Completely agree about the 24fps-vs-60fps thing. I think this is a combination of both the fact that the lower framerate is less visual stimulation, and that I'm used to both the decreased visual stress and the overall more jittery aesthetic of 24fps.
Regarding >24fps, I think how it's used is critical.
I remember noticing https://imgur.com/gallery/2j98Y4e/comment/994755017/1 (yes, a random imgur gif - discovering that imgur doesn't have an FPS limit was nice though). I think this particular example pushes the aesthetics ever so slightly, but still looks pretty good.
I don't know where I found it but I remember watching a 48fps example clip of The Hobbit some time back. That looked really nice; I completely agree 48fps is a great target that still retains the almost-imperceptible jitter associated with 24fps playback.
To me the "nope"/sad end of the spectrum is motion smoothing. I happened to notice a TV running some animated movie or other with motion smoothing on while in an electronics store a few months ago... eughhh. It made an already artificial-enough video (I think it was Monsters Inc University) look eye-numbingly fake (particularly because the algorithm couldn't make up its mind about how much to smooth the video as it played, so some of it was jittery and some of it was butter-smooth). I honestly hope the idea doesn't catch on; it'll ruin kids and doom us to having to put up with utterly unrealistic games.
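At its crudest, motion smoothing is just synthesizing in-between frames by blending neighbors. A toy per-pixel sketch (real TV interpolators do motion-compensated estimation per block, and fall back to blends like this when estimation fails, which is exactly where the inconsistent jittery/smooth look comes from):

```python
def blend_frames(frame_a, frame_b, t):
    """Naive interpolated frame: per-pixel linear blend at time t in [0, 1].
    frame_a/frame_b are flat lists of pixel intensities."""
    return [a * (1 - t) + b * t for a, b in zip(frame_a, frame_b)]

midpoint = blend_frames([0, 10], [10, 20], 0.5)  # [5.0, 15.0]
```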
But I can see that's the direction we're headed in: 144Hz LCD panels are already a thing, and VR has a ton of backing behind it, so it makes a lot of sense that VR will go >200Hz over (if not within) the next 5 or so years.
The utterly annoying thing is that raising framerates this high almost completely removes the render-latency margins devs can currently play with. Rock-steady 60fps (with few drops below ~40fps) is hard enough but manageable on reasonable settings in most games nowadays (I think?). But when everyone seriously starts pining for 144fps+ at 4K, it's going to get a lot harder to keep the framerate consistent: now that we've hit ~4GHz, Moore's law won't provide the breathing room for architectural overhead that it has for the past decade or so, and with current system designs (looking holistically at the CPU, memory, GPU, system bus, and game engine) we're already pushing everything pretty hard to get what we have.
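The margin squeeze is easy to put numbers on: the per-frame render budget is just the reciprocal of the target framerate.

```python
def frame_budget_ms(fps):
    """Milliseconds available to render one frame at the given rate."""
    return 1000.0 / fps

# 60fps leaves ~16.7ms per frame; 144fps cuts that to ~6.9ms,
# and a hypothetical 200Hz VR target to 5ms flat.
budgets = {fps: round(frame_budget_ms(fps), 1) for fps in (60, 144, 200)}
```

At 144fps a frame that takes one millisecond too long is a 14% overshoot, versus a 6% overshoot at 60fps, so every pipeline stall hurts proportionally more.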
So that problem will need to be solved before 144fps+ becomes a reality. A friend who has a 144Hz LCD says that going back to 60Hz just for desktop usage is really hard because the mouse is more responsive and everything just "feels" faster and more fluid. I'm not quite sure whether the games he plays keep up with 144fps though :P
On a separate note, I've never been able to make the current crop of 3D games "work" for my brain - everyone's pushing for more realism, more fluidity, etc etc, and it just drives things further and further into the uncanny valley for me, because realtime-rendered graphics still look terribly fake. Give me something glitchy and unrealistic in some way any day.
The cadence was a bit off/unnatural, but I'm sure that is not too hard to fix. Phone-in TV/Radio/web shows are about to get very interesting.
I think it primarily needs to learn to respect punctuation, and to translate it into breathing pauses that match the target voice ("President giving a speech"-style long pauses vs. "politician having their ass handed to them by a journalist on TV" no-air-needed pauses).
Maybe it would be possible to train the system to prefer certain intonations in certain cases by rating the realism of the speech in context. It would be interesting to analyze pauses around words grouped by word2vec! Or to choose a "style" of intonation based on punctuation, parameters like words/minute, etc.
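A crude sketch of the punctuation-to-pause idea. The pause durations and the style scale are made-up placeholder values, not anything from a real TTS frontend:

```python
# Hypothetical per-punctuation breathing pauses, in milliseconds.
PAUSE_MS = {",": 150, ";": 250, ".": 400, "?": 450, "!": 450}

def pause_plan(text, style_scale=1.0):
    """Return (word, pause_ms) pairs. style_scale stretches pauses
    ("president giving a speech") or compresses them
    ("no-air-needed interview")."""
    plan = []
    for token in text.split():
        word = token.rstrip(",.;?!")
        trailing = token[len(word):]
        pause = int(PAUSE_MS.get(trailing[-1:], 0) * style_scale)
        plan.append((word, pause))
    return plan

pause_plan("We choose to go to the moon.", style_scale=2.0)
# last pair is ("moon", 800)
```

A real system would presumably learn these durations from data rather than hard-code them, but even a lookup table like this would fix the "ignores commas" failure mode.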
There is a lot of room for DSP/hacks/tricks to improve the audio quality - just as in concatenative systems - but the point of this demo is to show what is possible with raw data + deep learning. Also note that this is (as far as I am aware) learned directly on real data such as YouTube, or recordings + transcripts. That is quite different from approaches that require commercial-grade TTS databases, which generally feature professional speakers with more than 10 hours of speech each, and cost a lot of money.
This is so good I'm wondering whether this is actually a massive (massive) sound bank. I think it might be a sound bank.
A random radio receiver site I found that doesn't require Flash, tuned to NOAA for Akron OH: http://tunein.com/radio/NOAA-Weather-Radio-1624-s88289/
The list I got the above link from (^F "noaa"): http://tunein.com/radio/Weather-c100001531/
I suspect the warbling I'm hearing is not due to TTS imperfections but 64kbps artifacting.
BTW, I like the Tom voice more than the newer Paul. Paul is more realistic, but is also more soft-spoken and monotonic. Tom has more inflection and sounds more...forceful. I know it's just my imagination, but sometimes Tom sounds annoyed at bad weather :)
I see. I can't seem to find an online receiver for that frequency, although I did find that WZ2500 uses or seems to have used that frequency (for Wytheville VA).
I had a look at SDR.hu (a site I may or may not have just dug out of Google for the first time), but unfortunately the RTL-SDR receivers I can find seem to focus entirely on 0-30MHz. There are a couple ~400MHz receivers but nothing for ~160MHz.
(I may have fired up the receiver I found in NH and fiddled with it, puzzled, for 10 minutes before realizing the scale is in kHz, not MHz... yay)
> Note that the stations have several different voices they use for different reports.
> Now that I think about it, I'm not sure which one I heard the mistakes on - it might have been one of the older ones.
That's entirely possible. (But hopefully not. I kind of want to hear. :P)
> BTW, I like the Tom voice more than the newer Paul. Paul is more realistic, but is also more soft-spoken and monotonic. Tom has more inflection and sounds more...forceful. I know it's just my imagination, but sometimes Tom sounds annoyed at bad weather :)
I have to admit I just learned about this service (I'm in Australia). It sounds really nice to be able to have a computer continuously read out the weather conditions to you as they change. And I can completely relate to the idea of preferring the voice that sounds unimpressed when the weather's bad :D
I'm not surprised that they re-use the frequencies. These are local weather stations, only intended to serve a radius of a hundred miles or so (at least here on the east coast). In addition to my local station I can receive the one in Boston, about 50 miles south of me.
Also, here are some audio clips http://www.nws.noaa.gov/nwr/info/newvoice.html
The problem might be that high frequencies, especially overtones, aren't properly constructed, but I'm certain that can be improved.
You can synthesize someone's voice perfectly, but if it's stressing words incorrectly or not at all, it's not going to fool anyone.
Then again, that's probably easier to work around by having humans annotate the sentences to be read.
Or by starting with a recording of someone else reading the sentence. Then you get the research problem known as "voice conversion", which has been studied a fair amount, but mostly prior to the deep learning era - and mostly without the constraint of limited access to the target person's voice. (On the other hand, research often goes after 'hard' conversions like male-to-female, whereas if your goal is forgery, you can probably find someone with a similar voice to record the input.)
Anyway, here's an interesting thing from 2016, a contest to produce the best voice conversion algorithm, with 17 entrants:
Even then, I don't believe the issue is with stress. I believe the voices sound robotic because they are using very few samples ("less than a minute", they claim), which they admit because it makes their results impressive in some sense. Speech systems are usually trained on triphones. The number of triphones (3-phoneme-grams) needed to cover a language's phonemic inventory is huge (50 phonemes = 50³ = 125,000 possible triphones, which could mean a few hours of audio, although many will not occur within the language given its phonotactics).
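For the coverage math: with n phonemes there are n³ ordered triphones before any phonotactic pruning.

```python
def triphone_count(n_phonemes):
    """Number of ordered 3-phoneme sequences over an inventory of n phonemes."""
    return n_phonemes ** 3

triphone_count(50)  # 125000, most of which a given language never uses
```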
And FWIW, I think the intonations are actually impressively learned and not random. Trump's odd yet distinct intonations are captured quite well in their samples.
I can see something like this being used for propaganda, if it isn't already.
Tin Foil hat is on