
A Neural Parametric Singing Synthesizer - agumonkey
http://www.dtic.upf.edu/~mblaauw/IS2017_NPSS/
======
stevehiehn
When [https://lyrebird.ai/demo](https://lyrebird.ai/demo) showed up on Hacker
News last week, the first thing I thought was: I can't wait to buy voice
sample packs of my favorite artists to add to my music :)

------
chris_st
Thinking about Hatsune Miku[1] here... are human vocalists headed out? Think
about how much cheaper it'd be to have robot entertainers, who don't get
drunk, fight their contracts, storm out of concerts, get sick...

Hatsune Miku has "performed at her concerts onstage as an animated
projection".

1:
[https://en.wikipedia.org/wiki/Hatsune_Miku](https://en.wikipedia.org/wiki/Hatsune_Miku)

~~~
po
I went to a Hatsune Miku holographic performance at a festival concert here in
Tokyo a few years back. The crowd was into it and it was a lot of fun.

Visually, it was totally convincing from the center of the floor at
mid-distances. There were other characters who performed; they had costume
changes, special effects, etc. It's a 2D effect, but at concert distances
you can't really tell, and the lighting of the character convinces your eyes.

There was one point where she put her foot up on a monitor speaker and my mind
was blown until I realized that the monitor was part of the hologram too. It
had been sitting there by the floor for the whole song waiting to have her put
her foot on it.

------
Gargoyle
At this stage, it's still trained on the voices of individuals, making
ownership of the songs a fairly straightforward thing to negotiate and
adjudicate in court if necessary.

But what if it was trained on a dozen top pop stars? A hundred? A thousand? At
what point does the resulting voice no longer belong to a human in legal
terms?

My guess is we'll find out relatively quickly as pop music is willing to
stretch IP pretty far in pursuit of a hit song.

------
janwillemb
I was expecting some creepy, unrealistic, unpromising samples, but this
actually sounds pretty good. There is no information whatsoever on the page
about how these samples came to be, though. There are some papers from this
author here: [http://dblp.uni-trier.de/pers/hd/b/Blaauw:Merlijn](http://dblp.uni-trier.de/pers/hd/b/Blaauw:Merlijn)

~~~
TheOtherHobbes
I was all set to "meh" but... nope. Dead wrong.

This is comfortably on its way out of uncanny valley. _Very_ impressive.

Edit: the only thing that sticks out as being a little off is the
pitch/intonation envelope. Some of the pitches are off the mark, and some of
the glides between notes aren't quite what a human would do. The vocal tone is
near perfect.

Pitch should be the easiest thing to fix. I wonder if that's an artefact of
the training set.

------
pasta
I'm a little confused. Why is this special? Is the difference that this is
generated from a trained model instead of samples?

Vocaloid has been around for over 13 years:

[https://en.wikipedia.org/wiki/Vocaloid](https://en.wikipedia.org/wiki/Vocaloid)

[https://www.vocaloid.com/en/](https://www.vocaloid.com/en/)

[https://www.youtube.com/watch?v=0HIjTFVINHE](https://www.youtube.com/watch?v=0HIjTFVINHE)

------
brudgers
A paper with the same title and authors:
[https://arxiv.org/abs/1704.03809](https://arxiv.org/abs/1704.03809)

------
eriknstr
The mixed version sounds really good. Acapella it sounds robotic, but the
acapella is the exact same voice track that is used in the mix, right?
Amazing.

------
mwcampbell
How does one input the phonemes and parameters for a synthesizer like this?
For those of us who are naturally good singers, the easiest way would be to
just sing, have a program convert the vocal into phonemes and parameters, and
then change the key, gender, or whatever.
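
A minimal sketch of that idea in plain numpy (emphatically not the paper's
method, just an illustration of the "extract pitch, then transpose" step):
estimate the f0 of the sung input with a crude autocorrelation, then shift
the key with a naive resampling pitch shift.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=800.0):
    """Crude autocorrelation pitch estimate for a single frame."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(corr[lo:hi])  # strongest periodicity in range
    return sr / lag

def shift_pitch(y, semitones):
    """Naive pitch shift by resampling (also changes duration)."""
    factor = 2 ** (semitones / 12)
    idx = np.arange(0, len(y), factor)
    return np.interp(idx, np.arange(len(y)), y)

sr = 16000
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 220.0 * t)   # stand-in for a sung 220 Hz note
f0 = estimate_f0(voice[:1024], sr)       # ~220 Hz
shifted = shift_pitch(voice, 12)         # transpose up one octave
f0_up = estimate_f0(shifted[:1024], sr)  # ~440 Hz
```

Note that resampling conflates pitch and duration (the shifted clip is half
as long); separating the two is exactly the kind of parametric control a
system like this has to provide, on top of phoneme timing.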

~~~
Gargoyle
The dataset as described in the paper linked below:

"In the initial evaluation of our system, we use three voices; one English
male and female (M1, F1), and one Spanish female (F2). The recordings consist
of short sentences sung at a single pitch and an approximately constant
cadence. The sentences were selected to favor high diphone coverage. For the
Spanish dataset there are 123 sentences, for the English datasets 524
sentences (approx. 16 and 35 minutes respectively, including silences). Note
that these datasets are very small compared to the datasets typically used to
train TTS systems, but this is a realistic constraint given the difficulty and
cost of recording a professional singer."

~~~
bubo_bubo
"but this is a realistic constraint given the difficulty and cost of recording
a professional singer"

Just because a singer is professional doesn't mean they're any good. My wife
copes with adversity by singing and she can sing "fuck fuck fuck shit shit
shit" in soprano, on key, from the kitchen. The only thing keeping her from
singing in public is her stage fright.

There are a /lot/ of people like her who would answer an ad in the newspaper
(or on Craigslist) and /volunteer/ to contribute to a geeky project as long
as they got credit in the paper.

At that point, the largest non-tech cost winds up being the studio rental
fee, if there even is one.

~~~
mwcampbell
You could also go to karaoke bars and observe who the good singers are.

------
tomcam
Intonation is gratingly bad, and that should be the easiest part. Speech
quality, however, is top notch. If you played it for me w/out letting me know
it was a software synth, I might not notice.

------
return0
The Spanish girl is amazing.

Here is their paper:
[https://arxiv.org/pdf/1704.03809.pdf](https://arxiv.org/pdf/1704.03809.pdf)

~~~
eeZah7Ux
It's easy to spot the English voices as synthetic if you listen to the
Acapella tracks... but try the "Powerful VQ"!

------
cousin_it
Sounds very convincing, especially the Spanish girl! I can't wait to play with
a piano roll interface for laying down vocal tracks of this quality.

------
stevehiehn
Holy mother of God this is amazing!

~~~
tedd4u
Is this like a five-year fast-forward in this technology? Did any experts see
this coming in 2017? Wow.

