
Utau – a Japanese singing synthesizer application - bane
https://en.wikipedia.org/wiki/Utau
======
ANTSANTS
A Vocaloid thread on Hacker News? Today is a weird day.

Here's the most impressive Vocaloid song/video I've seen:

[http://www.youtube.com/watch?v=zkLJoFp2UAE](http://www.youtube.com/watch?v=zkLJoFp2UAE)

EDIT: From the artist's bio:

"Mitchie M is a relatively new producer who became popular due to his very
realistic tuning of Hatsune Miku. His name is derived from The Jimi Hendrix
Experience's drummer, Mitch Mitchell, and originally was "Mitchiell Mitchie"
before shortening it. He started composing songs in high school in a band
which made covers. However, he wanted to write his own songs. His biggest
influence is Tom Jobim. He uses Logic Pro 9 to compose his songs. Mitchie M's
popularity is steady with most of his videos quickly gaining at least 200,000+
views."

This hints at part of the magic of Vocaloid: It gives the inexperienced,
unknown, or unconnected a pitch-perfect (if a bit robotic-sounding) vocalist
to experiment with, significantly expanding the range of genres a lone
musician can cover. The take a relative "outsider" can have on pop music is
very interesting, something of a middle ground between "mass-
produced"/"commercialized" pop and the more experimental, "underground"
electronic music you usually hear from solo acts.

------
jmillikin
As noted in the article, Utau is known mostly as being a free clone of the
much more popular and mature Vocaloid software. Both are capable of rendering
Japanese and English (among other languages), and they're mostly used for pop
music.

Vocaloid English:
[http://www.youtube.com/watch?v=S_YNQzcRhmE](http://www.youtube.com/watch?v=S_YNQzcRhmE)

Utau English:
[http://www.youtube.com/watch?v=Cm7o5ptHwb0](http://www.youtube.com/watch?v=Cm7o5ptHwb0)

For a head-to-head comparison, we'll need to switch languages. Listen to
Hachi's song Matroshka rendered by:

Vocaloid Japanese:
[http://www.youtube.com/watch?v=_JGaQ3g8WU4](http://www.youtube.com/watch?v=_JGaQ3g8WU4)

Utau Japanese:
[http://www.youtube.com/watch?v=nCqy27QrqZA](http://www.youtube.com/watch?v=nCqy27QrqZA)

------
bane
It's not great art, but there are a bunch of videos on YouTube demonstrating
it. Apparently there are English packs as well.

Here's something somebody spent way too much time working on.

[https://www.youtube.com/watch?v=t8GKzBhAgQo](https://www.youtube.com/watch?v=t8GKzBhAgQo)

and another

[https://www.youtube.com/watch?v=yd62XA4Tp_8](https://www.youtube.com/watch?v=yd62XA4Tp_8)

~~~
jbverschoor
How much work is it to get "proper" pronunciation like this song?

~~~
btown
Lots of work, especially for English (as opposed to Japanese which has many
fewer possible morae/sounds [1]). I've made a number of (as yet unreleased
[2]) song covers with the Kumi Hitsuboku English voice bank [3], which boasts
the most expressive English reclist ("recording list" of individual recorded
syllable fragments) I've seen. To get believable pronunciation, you can't just
one-to-one map between a note and a pitch-shifted sound. Many syllables need
to be broken into one note containing the starting consonant and the held
vowel, sometimes another note containing a vowel transition (for example, the
long "i" needs to be broken into a short "a" and a long "e"), and another note
containing the end of the word (a vowel blended into a consonant). And in
English, sometimes you need to work the previous word into the first note: for
instance, "come on" becomes "ka," "ma," "a n" in Kumi's aliased reclist.

UTAU doesn't do anything algorithmic to help with these word-splitting tasks;
even if it could, there's a lot of nuance and judgement that depends on
everything from the tempo to the specific vocalist's pronunciation. For
instance, the word "slam" is actually best rendered as "sle, e m" when working
with Kumi because the "a" sound is more hollow, and the point at which you
switch from the "sle" to the "e m" depends a lot on context. A 2-minute song
with chorus and verses can take hours to get perfect.
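
To make the splitting concrete, here is a rough Python sketch of what you end up doing by hand. It is not something UTAU provides. The `SPLITS` table holds only the two examples above, and the `split_note` helper, its `tail` parameter, and the equal division of the leading fragments are all assumptions made up for illustration; in practice you place each boundary by ear.

```python
# Hand-made lookup from a lyric to voice-bank aliases. The two entries are
# the examples from this comment; the table itself is hypothetical.
SPLITS = {
    "come on": ["ka", "ma", "a n"],
    "slam": ["sle", "e m"],
}


def split_note(lyric, start, length, tail=0.25):
    """Split one sung note into (alias, start, length) fragments.

    The last fragment (the vowel blended into the closing consonant) gets
    `tail` of the total length; the earlier fragments share the rest
    equally. Both choices are assumptions, not rules of the voice bank.
    """
    aliases = SPLITS.get(lyric.lower(), [lyric])
    if len(aliases) == 1:
        return [(aliases[0], start, length)]

    tail_len = length * tail
    head_len = (length - tail_len) / (len(aliases) - 1)

    notes = []
    t = start
    for alias in aliases[:-1]:
        notes.append((alias, t, head_len))
        t += head_len
    notes.append((aliases[-1], t, tail_len))
    return notes


if __name__ == "__main__":
    for note in split_note("come on", start=0.0, length=2.0):
        print(note)
```

Times here are in beats. A lyric missing from the table passes through as a single note, which is roughly what you get if you type a word in and do no manual splitting.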

And of course, pronunciation is only the beginning. Mitchie M (one of the most
highly regarded Vocaloid artists, mentioned elsewhere in this thread) stated
in an interview that he can spend up to 2 weeks programming intonations for
rap segments in his songs. [4] It's very interesting to think about how spoken
word would be coded as pitches, and how we are able to detect deviations from
common pitch-patterns as "unnatural" when listening to spoken passages.

[1] See
[http://en.wikipedia.org/wiki/Katakana](http://en.wikipedia.org/wiki/Katakana)

[2] Why unreleased? Still working on making lyric videos to accompany the
recordings; for that, I need to shut down my VMs to launch After Effects, and
I can't justify procrastinating that much!

[3] Links at
[http://utau.wikia.com/wiki/Kumi_Hitsuboku](http://utau.wikia.com/wiki/Kumi_Hitsuboku)

[4]
[https://www.facebook.com/vocaloidism/posts/568055823290010](https://www.facebook.com/vocaloidism/posts/568055823290010)

~~~
e12e
Thanks for your very informative comment.

The note on katakana isn't very helpful -- but Japanese does indeed have quite
few phonemes, 21 according to various random searches (sounds about right).
Spanish has 24, French 34 and English 44 according to [1] (No source for
Japanese).

Apparently Cantonese has 609-630 or so (due to being a tone language).

Actually that number for English sounds a bit high, I believe 40-41 is more
reasonable. Perhaps it includes "all" dialects of English.

[1]
[http://perso.limsi.fr/mareuil/publi/1209.pdf](http://perso.limsi.fr/mareuil/publi/1209.pdf)

~~~
bane
> Apparently Cantonese has 609-630 or so (due to being a tone language).

My completely non-Chinese intuition (and having asked the question a couple of
times to native speakers, including a couple who speak Cantonese) is that the
number of phonemes in tonal languages tends to drop sharply when singing,
since you can't hold a tune and produce tones at the same time.

------
nostromo
For anyone wanting to use this in a project, it seems to be a bit behind
Vocaloid.

Vocaloid's Hatsune Miku is pretty impressive. "She" has even done "live" shows
in Japan and has opened for Lady GaGa in the US.

[https://www.youtube.com/watch?v=FoTd918zhZc](https://www.youtube.com/watch?v=FoTd918zhZc)

~~~
ekianjo
Well the phonetics of Japanese are, by definition, pretty simple, compared to
most other languages out there. There are actually very few sounds and they
are clearly detached when you speak. It's one of the easiest languages to
"vocalize".

~~~
e12e
But speech and singing are quite different? I don't know, but I'm a little
surprised if the vocaloid engines(?) wouldn't be easier to "tune" to Spanish
than to English. Also, while "Chinese" (Cantonese or Mandarin) are quite
complicated (especially considering the fact that they are(?) old languages)
-- I'm not sure if they're particularly complicated from the point of view of
emulating singing?

~~~
ekianjo
Singing is probably harder to mimic than speech, because words need to be
pronounced according to a melody and not just their standard pronunciation.

> Also, while "Chinese" (Cantonese or Mandarin) are quite complicated

Not sure what you mean by how complicated the language is.
Phonetics? Grammar? Vocabulary?

> I'm not sure if they're particularly complicated from the point of view of
> emulating singing?

The pronunciation in Chinese is not as detached/split as in Japanese, that's
for sure. That said, it's certainly possible if you put sufficient effort
into it.

~~~
e12e
>> Also, while "Chinese" (Cantonese or Mandarin) are quite complicated

> Not sure what you mean by how complicated the language is.
> Phonetics? Grammar? Vocabulary?

I meant that Chinese has many phonemes, but part of that comes from being a
tone language -- and if you're going to emulate singing, you already have to
deal with tone change.

However, when I think more about it, I really have no idea how phonemes are
"tuned" when singing a melody. For a language where tone doesn't carry
meaning[1] (or hardly ever does) I would _think_ we simply change tone
according to the tune we sing. But what happens if changing the tone of the
phoneme changes its meaning? I suppose it is a little like how we must fit the
words to the rhythm...

[1] By "carry meaning" I mean in the phoneme sense: how in British English you
can have "red" and "read" both be pronounced /red/ -- and there would be no
way to answer the question: "Is /red/ a verb or a colour?" (without context).
Well you could of course answer "both"...

------
userbinator
My favorite part is this:

> Written in VB6

That says you don't necessarily need a great programming language to do
something interesting, which is a nice change from the norm.

~~~
w1ntermute
It also illustrates how far behind the Japanese are in terms of software.

~~~
e12e
Yeah, they even have their own hopelessly slow Smalltalk-clone scripting
language that hardly anyone uses, except for some amateur hypertext hipsters.
I mean, who here has heard of this language, Ruby? /s

~~~
w1ntermute
Predictably, Ruby is the first thing anyone brings up on HN when someone makes
a point about how backwards the Japanese software scene is. But if you'd spent
any time in the Japanese tech scene, you'd know that Ruby is not very popular
in its country of birth, and if you'd spent any time learning about its
creator (a Mormon with 4 children), you'd know that he's far from the norm
when it comes to Japanese people.

~~~
e12e
Aren't most creators of popular languages outliers? As you've guessed I'm not
very familiar with the day-to-day of programming in Japan -- but my impression
was that in the average day-to-day of enterprise programming, the whole world
tends to be "behind"?

I must admit, off the top of my head the projects that come to mind when
thinking about Japan and programming are Ruby, Nilfs2, TOMOYO MACs for Linux
and game/system development (Sony PS3/Nintendo etc) -- and more recently
Shougo's[1] vim plugins.

I have a "host-family" cousin who was studying at a vocational college in the
late 90s, learning game development, and I recall he complained that learning
C++ (as a first language) was hard. But that seemed kind of standard for game
dev anywhere at the time.

I admit, I have a knee-jerk negative reaction when someone generalizes across
an entire nation without much explanation.

[1] [https://github.com/Shougo](https://github.com/Shougo)

------
waynecolvin
Question: Can anybody explain how synthetic singing voice is made? Especially
things like pitch, a voice actor doesn't have to sing every syllable at every
pitch do they?

~~~
mutagen
There are several approaches. I'm not sure what this software uses; machine
translation of the site suggests samples.

Pitch shifting samples is one way to do it. A singer is recorded singing a
syllable and that is shifted up or down by software. Artifacts creep in
relatively quickly, especially with something as nuanced as the human voice. A
variety of pitches and syllables can be sampled and the pitch shifting
manually tuned to minimize audible artifacts.
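
As an illustration of the simplest form of this, here is a sketch in plain Python that shifts pitch by resampling with linear interpolation. It is a swapped-in textbook technique and not a claim about what UTAU or Vocaloid do internally; the `pitch_shift` name is made up here. Reading through the sample faster raises the pitch but also shortens the sound and moves the formants with it, which is one reason artifacts appear after only a few semitones.

```python
import math


def sine(freq, sample_rate, n):
    """Generate n samples of a sine wave, standing in for a recorded vowel."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


def pitch_shift(samples, semitones):
    """Naive pitch shift: resample with linear interpolation.

    A positive shift steps through the input faster, so the output is
    higher in pitch and shorter in duration. Nothing here preserves
    duration or formants; this is an illustrative assumption, not a
    production resampler.
    """
    if not samples:
        return []
    ratio = 2 ** (semitones / 12)  # equal-tempered frequency ratio
    out_len = int((len(samples) - 1) / ratio) + 1
    last = len(samples) - 1

    out = []
    for i in range(out_len):
        pos = i * ratio
        j = min(int(pos), last)
        frac = pos - j
        nxt = samples[min(j + 1, last)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out


if __name__ == "__main__":
    vowel = sine(220.0, 44100, 44100)  # one second at 220 Hz
    up_octave = pitch_shift(vowel, 12)
    print(len(vowel), len(up_octave))
```

Shifting up an octave leaves about half as many samples, so a real engine has to stretch or loop the vowel afterwards to fill the note.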

Modeling could also be used, from simplistic models not far removed from the
ADSR envelopes of basic synthesis to advanced physically based models. Samples
and modeling could be combined to expand the palette of syllables.
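
For reference, an ADSR (attack, decay, sustain, release) envelope is just a gain curve multiplied onto a sound. The sketch below uses straight-line segments and lengths given in samples; the `adsr` function and its parameter names are chosen here for illustration and are not taken from any of the products discussed.

```python
def adsr(n, attack, decay, sustain_level, release):
    """Return n gain values between 0 and 1 forming a linear ADSR envelope.

    attack, decay and release are segment lengths in samples; whatever is
    left of n is held at sustain_level.
    """
    sustain_len = n - attack - decay - release
    if sustain_len < 0:
        raise ValueError("attack + decay + release exceeds total length")

    env = []
    # Attack: ramp from 0 up towards full level.
    env += [i / attack for i in range(attack)]
    # Decay: fall from full level down to the sustain level.
    env += [1 - (1 - sustain_level) * i / decay for i in range(decay)]
    # Sustain: hold.
    env += [sustain_level] * sustain_len
    # Release: fade from the sustain level towards silence.
    env += [sustain_level * (1 - i / release) for i in range(release)]
    return env


def apply_envelope(samples, env):
    """Multiply a sound by an envelope of the same length."""
    return [s * g for s, g in zip(samples, env)]


if __name__ == "__main__":
    env = adsr(100, attack=10, decay=10, sustain_level=0.5, release=20)
    print(env[0], env[10], env[50], env[99])
```

A voice model needs far more than one amplitude curve, but this is the kind of building block the simplistic end of that range starts from.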

Our ears and neural processing of speech and singing are finely tuned to
process subtle shades of difference, so any technique often sounds artificial.
Fortunately this can be exploited musically and great music can be made with
these 'artificial' sources.

~~~
bdonlan
UTAU and Vocaloid do indeed use pitch-shifted samples (UTAU even lets you
build your own sample libraries). A more recent product, CeVIO, uses modeling
IIRC.

------
leoc
Somewhat related: the Voder
[https://www.youtube.com/watch?v=5hyI_dM5cGo](https://www.youtube.com/watch?v=5hyI_dM5cGo)
.

~~~
NIL8
Nice link! Thanks.

------
hrktb
The choice to pick up Utau instead of Vocaloid is interesting. Usually the
open source version has a halo around it, but in this case the commercial
version is moving forward fast enough, there are efforts to bring the
technology to just about every platform where they feel they can break even,
and overall the software has very permissive licensing concerning the end
product (the songs), so there isn't much legal friction surrounding the
product's licensing right now.

What's also interesting: the image characters of the Vocaloid packages are
under a CC license and can be used freely for non-commercial work, and quite
easily for commercial work as well, it seems. The availability of the
characters and the community surrounding them is often cited by song makers as
a major draw for showing their work to a larger public without having to spend
too much time on the image, "marketing" side of it.

~~~
bdonlan
UTAU isn't open source - it's freeware. Even redistributing the editor
binaries requires permission (as dictated at
[http://utau2008.web.fc2.com/redistribution.html](http://utau2008.web.fc2.com/redistribution.html)).
As far as I know there aren't any open source singing voice synthesis engines
that have become particularly popular thus far.

------
benkt
Elders react:
[https://www.youtube.com/watch?v=wHhluDhVtjU](https://www.youtube.com/watch?v=wHhluDhVtjU)

------
bitL
Perfect! Thanks for the tip! I will add UTAUloids to the collection of
vocaloids I bought in Kyoto! Composing music with Hatsune Miku or Megpoid
vocaloids was fun, can't wait to try UTAU as well ;-)

~~~
nanofortnight
The engine isn't as polished as Vocaloid's is.

~~~
ninjin
It should also be mentioned that phonetically, English is far more complex
than Japanese.

------
danbmil99
Perhaps of interest -- a musical produced with a hacked version of Festival
(text-to-speech) singing all the parts:

[http://www.morashon.com](http://www.morashon.com)

~~~
bitL
Sounds like something from Alan Parsons Project :-)

~~~
danbmil99
That will be taken as a compliment.

------
tomphoolery
this is awesome!

