A Neural Parametric Singing Synthesizer (upf.edu)
74 points by agumonkey on Apr 29, 2017 | 20 comments



When https://lyrebird.ai/demo showed up on Hacker News last week, the first thing I thought was: I can't wait to buy voice sample packs of my favorite artists to add to my music :)


Thinking about Hatsune Miku [1] here... are human vocalists on their way out? Think about how much cheaper it'd be to have robot entertainers, who don't get drunk, fight their contracts, storm out of concerts, or get sick...

Hatsune Miku has "performed at her concerts onstage as an animated projection".

1: https://en.wikipedia.org/wiki/Hatsune_Miku


I went to a Hatsune Miku holographic performance at a festival concert here in Tokyo a few years back. The crowd was into it, and it was a lot of fun.

Visually, it was totally convincing from the center of the floor at mid-distances. There were other characters who performed, there were costume changes, special effects, and so on. It's a 2D effect, but at concert distances you can't really tell, and the lighting of the character convinces your eyes.

There was one point where she put her foot up on a monitor speaker, and my mind was blown until I realized that the monitor was part of the hologram too. It had been sitting there on the floor for the whole song, waiting for her to put her foot on it.


At this stage, it's still trained on the voices of individuals, making ownership of the songs a fairly straightforward thing to negotiate and adjudicate in court if necessary.

But what if it was trained on a dozen top pop stars? A hundred? A thousand? At what point does the resulting voice no longer belong to a human in legal terms?

My guess is we'll find out relatively quickly as pop music is willing to stretch IP pretty far in pursuit of a hit song.


I was expecting some creepy, unrealistic, unpromising samples, but this actually sounds pretty good. There's no information whatsoever on the page about how these samples came to be, though. There are some papers from this author here: http://dblp.uni-trier.de/pers/hd/b/Blaauw:Merlijn


I was all set to "meh" but... nope. Dead wrong.

This is comfortably on its way out of the uncanny valley. Very impressive.

Edit: the only thing that sticks out as being a little off is the pitch/intonation envelope. Some of the pitches are off the mark, and some of the glides between notes aren't quite what a human would do. The vocal tone is near perfect.

Pitch should be the easiest thing to fix. I wonder if that's an artefact of the training set.
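
For anyone wondering what the pitch/intonation envelope looks like concretely: the "glides" are just the shape of the F0 (fundamental frequency) curve between notes. Here's a toy Python sketch of a raised-cosine glide plus vibrato; all the constants and curve shapes are my own guesses, not anything from the paper:

    import numpy as np

    def f0_contour(notes, fs=200, glide=0.08, vibrato_hz=5.5, vibrato_cents=30):
        """Render an F0 contour from (midi_pitch, duration_s) notes.

        Transitions between notes use a raised-cosine glide, with a small
        sinusoidal vibrato superimposed. Illustrative values only.
        """
        hz = lambda m: 440.0 * 2 ** ((m - 69) / 12)  # MIDI note -> Hz
        n = int(sum(d for _, d in notes) * fs)
        f0 = np.zeros(n)
        t0, prev = 0.0, notes[0][0]
        for midi, dur in notes:
            i0, i1 = int(t0 * fs), int((t0 + dur) * fs)
            f0[i0:i1] = hz(midi)
            # Raised-cosine glide from the previous pitch into this note.
            g = min(int(glide * fs), i1 - i0)
            if g > 0 and prev != midi:
                w = 0.5 - 0.5 * np.cos(np.linspace(0, np.pi, g))
                f0[i0:i0 + g] = hz(prev) + (hz(midi) - hz(prev)) * w
            prev, t0 = midi, t0 + dur
        # Vibrato: pitch modulation with depth given in cents.
        t = np.arange(n) / fs
        f0 *= 2 ** (vibrato_cents / 1200 * np.sin(2 * np.pi * vibrato_hz * t))
        return f0

    contour = f0_contour([(60, 0.5), (64, 0.5), (67, 1.0)])  # C4, E4, G4

Getting the glide widths and overshoot to sound human is exactly the part the commenters above say is still a little off.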


I'm a little confused. Why is this special? Is the difference that this is generated from training instead of from samples?

Vocaloid has been around for over 13 years:

https://en.wikipedia.org/wiki/Vocaloid

https://www.vocaloid.com/en/

https://www.youtube.com/watch?v=0HIjTFVINHE


A paper with the same title and authors: https://arxiv.org/abs/1704.03809


The mixed version sounds really good. A cappella it sounds robotic, but the a cappella track is the exact same voice track that's used in the mix, right? Amazing.


How does one input the phonemes and parameters for a synthesizer like this? For those of us who are naturally good singers, the easiest way would be to just sing, have a program convert the vocal into phonemes and parameters, and then change the key, gender, or whatever.
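
The pitch-and-key half of that is already doable with off-the-shelf tools. Here's a rough sketch with librosa (the filename and parameter choices are made up, and the phoneme segmentation step would need a separate forced aligner such as the Montreal Forced Aligner, so this is only the pitch part, not whatever front end the authors actually use):

    import librosa

    # Load a (hypothetical) recording of yourself singing.
    y, sr = librosa.load("my_singing.wav", sr=22050)

    # Estimate the frame-wise fundamental frequency with the pYIN algorithm.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
    )

    # "Change the key": transpose the audio up a major third (4 semitones).
    y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)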


The dataset, as described in the paper linked below:

"In the initial evaluation of our system, we use three voices; one English male and female (M1, F1), and one Spanish female (F2). The recordings consist of short sentences sung at a single pitch and an approximately constant cadence. The sentences were selected to favor high diphone coverage. For the Spanish dataset there are 123 sentences, for the English datasets 524 sentences (approx. 16 and 35 minutes respectively, including silences). Note that these datasets are very small compared to the datasets typically used to train TTS systems, but this is a realistic constraint given the difficulty and cost of recording a professional singer"


"but this is a realistic constraint given the difficulty and cost of recording a professional singer"

Just because a singer is professional doesn't mean they're any good. My wife copes with adversity by singing and she can sing "fuck fuck fuck shit shit shit" in soprano, on key, from the kitchen. The only thing keeping her from singing in public is her stage fright.

There are a /lot/ of people like her who would answer an ad in the newspaper (or on Craigslist) and /volunteer/ to contribute to a geeky project, as long as they got credit in the paper.

At that point, the largest non-tech cost winds up being the studio rental fee, if you need one.


You could also go to karaoke bars and observe who the good singers are.


Intonation is gratingly bad, and that should be the easiest part. Speech quality, however, is top-notch. If you played it for me without letting me know it was a software synth, I might not notice.


The Spanish girl is amazing.

Here is their paper: https://arxiv.org/pdf/1704.03809.pdf


It's easy to spot the English voices as synthetic if you listen to the a cappella tracks... but try the "Powerful VQ"!


Sounds very convincing, especially the Spanish girl! I can't wait to play with a piano roll interface for laying down vocal tracks of this quality.


Holy mother of God this is amazing!


Is this like a five-year fast-forward for this technology? Did any experts see this coming in 2017? Wow.


I am floored.



