
Artificial Intelligence Generates Christmas Song from Holiday Image - kebinappies
https://news.developer.nvidia.com/artificial-intelligence-generates-christmas-song-from-holiday-image/
======
fredleblanc
The melody is all over the place, and the rhythm is hard to tap a foot to, but
one thing is certain: it's absolutely convinced that Christmas trees get
decorated with flowers. Lots and lots and lots of them.

~~~
klenwell
I could have mistaken it for a track off The Shaggs'[0] lost Christmas album.
I like to believe Frank Zappa would have liked this, too.

[0]
[https://en.wikipedia.org/wiki/The_Shaggs](https://en.wikipedia.org/wiki/The_Shaggs)

~~~
fredleblanc
Oh man, The Shaggs are something else.

And it just so happens that I live two towns from Fremont, NH, where they're
from.

------
BinaryBullet
One of the cofounders of the Echonest (acquired by Spotify) created this back
in 2004:

>"A Singular Christmas" was composed and rendered in 2004. It is the automatic statistical distillation of hundreds of Christmas songs; the 16-song answer to the question asked of a bank of computers: "What is Christmas Music, really?"

[https://soundcloud.com/bwhitman/sets/a-singular-christmas](https://soundcloud.com/bwhitman/sets/a-singular-christmas)

~~~
hahaker
Very interesting reference! Deep learning is statistical, so this is sorta one
of the spiritual predecessors.

~~~
x2398dh1
By my understanding, yes, in the sense that certain parts of each algorithm
are looking to minimize something. The former model used principal component
analysis, which is linear in the sense that you apply transforms that pick
out the least correlated pieces of a huge chunk of data, whereas neural
networks use a combination of linear and non-linear layers, picked by the
user, to minimize "errors."

What's interesting is that the former model sounds so much "better." I wonder
if anyone could chime in about how our ears, auditory nerves, or perhaps
auditory cognition work, and whether they are somehow more "principal-
component-analysis-y" than "error-minimization-y," or something relating to
the actual math, which may explain why this new neural network Christmas song
sounds like absolute crap to us, whereas the older version sounds pretty
amazing. Also, whether my understanding of the underlying math is correct.
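For what it's worth, the linear part of that comparison can be made concrete: PCA's top components are also the linear projection that minimizes reconstruction error, so the two approaches optimize related objectives. A minimal sketch with synthetic data (NumPy assumed; my illustration, not code from either project):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)              # PCA assumes centered data

# PCA via SVD: rows of Vt are the principal directions (max variance).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:2]                 # top-2 principal directions
X_proj = X @ components.T           # low-dimensional representation
X_rec = X_proj @ components         # best rank-2 linear reconstruction

# A neural network instead learns (possibly non-linear) layers by
# gradient descent on a similar reconstruction/prediction error.
err = np.mean((X - X_rec) ** 2)
```

A linear autoencoder trained to minimize `err` recovers the same subspace, which is one sense in which PCA is the linear special case of the neural approach.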

------
mojuba
Is it just me, or does it really sound like a randomly generated chord
progression that barely makes any musical sense?

~~~
nkozyra
"Musical sense" is a pretty subjective thing, which I suspect is one of the
primary issues with artificial creativity.

That said, assuming training data comes from music with progressions that
could be broadly classified as 'popular music,' you would expect to find some
regression to the mean (to abuse a phrase) with more production and deeper
training data.

One other issue that I think will come up is how insular and unevolving
artificial creativity will be if it's based on present music for training
data. What has historically moved creative trends is disruption; sometimes
it's a slow burn and sometimes it's a few catalysts. Experimentation in
artificial creativity will be hard to come by early on, yet it will quickly
be needed if it's to supplant human creativity.

~~~
TheOtherHobbes
The statistical approach is painfully naive and doesn't work - as is obvious
from the example.

It's like feeding a net with the complete works of Shakespeare and expecting
it to produce a genius-level original play. It's simply not going to happen.

~~~
nkozyra
The issue is not with the statistical but with the parameters around the
output and the organization of training data.

~~~
zeveb
I think that your assumption is that the genius of Shakespeare's plays can be
statistically reproduced through sufficiently-clever organisation. That is not
obviously true to me.

Some things are just art, capable of being truly understood only by a creature
with a head and heart, arms & legs, love & hate, emotions, experiences; in
short, a man.

~~~
nkozyra
Art is about perception not creation. If autonomously created art produces the
same perception and evocative response, it sufficiently passes that test.

An autonomous agent doesn't need to understand the underlying emotion, it just
needs to mimic it.

An autonomous car doesn't know why it shouldn't hit a child that jumps into
its path, or why it is making any decisions at all, despite those being some
of the most important and fundamental to humans on earth. It just needs to
reproduce the actions of a human that _does_ understand those things.

Yes, I believe that artificial creativity will produce art that is
indistinguishable from that made by humans.

------
d33
It kind of feels like a five-year-old trying to make a song. Which seems good
- now they only need to improve the mechanisms and who knows, maybe we'll get
to the point where it's a ten-year-old?

~~~
logicallee
Instead of five, don't you mean "two to three"? And under that comparison,
isn't it scary how much it does sound that way? Like a kid who hears words
together but doesn't know what they mean yet (and doesn't really get the
cultural things around them)? To me it sounds like a quite musical 2-3 year
old stringing words together. Doesn't it strike anyone else that way?

These things are going to grow up very, very soon! We know that. It's scary.

You're watching a two-year old. It's interesting to think about what it will
be like when it's ten, sure.

But what will really blow your mind is what it'll be like when it's 23. This
is happening right now, before our eyes.

I cannot overemphasize that any large server farm at Google or Amazon is doing
more, and much faster, processing than a human brain's neural net. The human
brain has 86 billion neurons with an average of 7,000 synaptic connections
each. That is a huge number. But they are firing at 15-300 hertz (because
they're at biological speed, instead of light speed like our CPUs) - which is
7 orders of magnitude slower than our silicon. Our brain is about 3 pounds
(1,300-1,400 g) and uses some 20 watts.

It's not a question of "if" a server farm will have as powerful neural nets.
It's a question of "when".

(Also although we won't be using it, the entire source code for the human
brain has to be strictly less than 700 MB, because the fully sequenced human
genome which obviously encodes the full human mind is less than 700 MB
uncompressed.)

Guys, we are at an incredible pivot point in human history. We are coming up
with computerized brains whose architecture is in some ways comparable to
humans', and they are doing human activities.

Today, in 2016, there are thousands, perhaps tens of thousands, of server
rooms all over the world that have more than enough computational power to do
in real time what a human adult brain does - but we lack the software.

When we see advances like this in artificial intelligence, it is scary.

We're all but looking at the intellectual output of a two year-old in the
field of music.

Every single day AI results are astounding. This is it.

~~~
d33
> the entire source code for the human brain has to be strictly less than 700
> MB, because the fully sequenced human genome which obviously encodes the
> full human mind is less than 700 MB uncompressed

Actually those 700MB are compressed, in a way so sophisticated that we don't
really know how to uncompress it yet - or whether it's even possible without
the external resources our planet provides. And keep in mind that those 700MB
only describe how to prepare the basic concept of the brain, whose memory is
then packed with information we get from the culture.

~~~
logicallee
What I mean by uncompressed is that you're reading the cytosine (C), guanine
(G), adenine (A), or thymine (T) as being two bits per base pair (1 of 4
possibilities). There are 3 billion base pairs, which is 6 billion bits,
divided by 8 to get bytes you get 750 MB.
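That calculation is easy to check (using the round figure of 3 billion base pairs):

```python
# Back-of-the-envelope check: 2 bits per base (A, C, G, T = 1 of 4).
BASE_PAIRS = 3_000_000_000          # rough human genome length
BITS_PER_BASE = 2

total_bits = BASE_PAIRS * BITS_PER_BASE   # 6 billion bits
total_bytes = total_bits // 8             # 750,000,000 bytes
print(total_bytes / 1_000_000)            # 750.0 (decimal MB)
```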

The human genome is somewhat redundant and can be further compressed. That
is, the string of "ones and zeros" (ACGT) could be run through whatever
compression algorithm you wanted.

But don't take my word for it:

>"When the 4 bases are packed into one byte ( .2bit format) the size is 770M
(hg18.2bit) , but you'll need an extra tool to decypher the data." [1]

2.

You raise an important point:

>And keep in mind that those 700MB only describe how to prepare the basic
concept of the brain, whose memory is then packed with information we get from
the culture.

Yes, absolutely. I simply called it an upper bound on how complex a brain's
architecture could be. DNA obviously encodes the brain's architecture, since
humans all have human brains. Beyond that, there is a very large variation in
people's mental capabilities and brains, and the largest variation of all
comes from culture.

But culture could be given to a virtualized brain (called training).

Bear in mind that when human brains receive culture, it takes them years of
all-day training before they're even able to speak. So full 1x human brains
take a long time to train.

When you see results out of neural nets that are similar to what very young
toddlers can do, you should be awed. We have the computational power in server
farms to do what full brains do -- if not now, then soon.

This isn't some sci-fi pipe dream. Go ahead and look at the facts.

[1] [https://www.biostars.org/p/5514/](https://www.biostars.org/p/5514/)

~~~
wvyar
I think that your "700MB uncompressed" fails to take into account that the
construction, development, and maturation of the brain relies heavily on
cellular and molecular mechanisms. I think it is a little disingenuous to hide
the enormous wealth of information necessary to create a brain, much less
understand and utilize one, inside of your compiler.

~~~
logicallee
I don't think you are correct from an information-computation point of view.
When looking at the computation done by neurons in the brain, it is
sufficient to abstract away the lower-level substrate in which it occurs.

You and other posters are all correct regarding the huge volume of information
on which human minds are trained. It's hardly unsupervised learning either :)

---------

EDIT: In response to your comment, I've given it further reflection. DNA as
source code may be misleading as an "upper bound". After all, suppose we knew
for a fact (assume there were a mathematical proof, or anyway just assume
axiomatically) that a one-hundred-megabyte source code file completely
described (contained every physical law, etc.) a deterministic Universe, and
that if you ran it on enough computation to fully describe a hundred billion
galaxies with a hundred billion stars each, one of those stars would have a
planet, that planet would contain humans, and the humans at some point in the
simulation would deduce the same one-hundred-megabyte source code file for
their Universe. (This is a bit of a stretch, as it's not possible to deduce
the laws of the universe rigorously.)

Anyway, under that assumption, in a way you could argue that the "upper
bound" on the amount of entropy that it takes to produce human intelligence
is "just" a hundred megabytes, since that source code can deterministically
model the entire Universe. But practically that is useless, and the humans in
that simulation would have to do something quite different from modeling
their universe if they wanted to come up with AI to help them with whatever
computational tasks they wanted to do.

In the same way, perhaps DNA is a red herring, as there are a vast number of
cells in the human body (as in, tens of trillions) doing an incredible amount
of work. So starting out with DNA is the "wrong level" of emulation, just as
starting out with a 100 MB source code file for the universe would be the
"wrong level" of emulation, even if we posit axiomatically that it fully
describes our entire Universe from the big bang through intelligent humans.

So I will concede that it is misleading.

All that said, I think that emulating or considering the computation on the
level of neurons is sufficient - so it is sufficient to look at how many
neurons are in the human brain and the kind of connections they have.

As for the efficacy of this approach - that's the very thing being shown in
the story we're replying to and in many places elsewhere. It works. We're
getting extremely strong results that in some cases beat humans.

I believe that emulating or comparing to humans at the neural level should
probably be sufficient for extremely strong results. We do not need to emulate
every atom or anything like that. I consider it out of the question that we
would discover that human minds form a biochemical pathway into another
ethereal soul-plane and connect with our souls in a way that you can't emulate
by emulating neurons and the like, and that the souls are where intellect
happens and brains are just like "radio antennas" for them. Instead, I think
that the approaches we're seeing will achieve in many ways similar results to
what human brains produce computationally - a much higher level of
abstraction is sufficient for the results that are sought.

~~~
wvyar
I will confess to not being an expert, but I disagree: I don't think it's
sufficient to abstract away the lower-level substrate when the OP was
referring to DNA as source code, which absolutely depends on that level of
detail to both construct the system (the brain) and to enable the continued
development and maturation of that system (a physical, real brain).

I was not referring to the huge volume of information necessary, as I
acknowledge that as being "outside the system" for purposes of this
discussion, so my apologies for any confusion I might have caused.

It may be possible that (and it is my belief that) there is a higher-level
abstraction for the computations taking place in the brain, even if it is on
the neuron-level, but at that point I don't think you can claim that the
source code for that is going to fit under 700MB by using DNA as a baseline.

~~~
logicallee
You are right, I went too far. See my other comment:

[https://news.ycombinator.com/item?id=13081551](https://news.ycombinator.com/item?id=13081551)

However, that just concerns the DNA argument. We know roughly how many
neurons there are in humans and how connected they are.

------
keypulsations
I've seen more and more little A.I.-generated ditties like this recently and
their reception tends to be the same: that they're interesting and funny but
don't sound that great.

The output would probably be more compelling if A.I. were adopted by
individual artists/composers as an instrument - learning their particular
styles to automate some of their more tedious tasks - rather than as a
magical music box that churns out top hits.

------
kingkawn
"The best Christmas present in the world is a blessing."

This algorithm is throwing down some wisdom

~~~
Florin_Andrei
I think generating fortune cookies is a really low hanging fruit for current
AI. Someone could put it together in a week-end.

~~~
kmill
I once developed a hierarchical Markov chain, and I decided to use it to
generate fortune cookie wisdom because I thought people would be more willing
to overlook grammatical mistakes or be willing to interpret it as an
expression of deep truth.
[http://www.kylem.net/stuff/fortunes.html](http://www.kylem.net/stuff/fortunes.html)
(You might need to increase the "sense" parameter.)
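For the curious, the non-hierarchical version of this idea fits in a few lines. A word-level Markov chain sketch (my own illustration, not kmill's code):

```python
import random
from collections import defaultdict

def build_chain(corpus, order=1):
    """Map each `order`-word state to the words that follow it."""
    words = corpus.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, order=1, length=8, seed=None):
    """Random-walk the chain, stopping at `length` words or a dead end."""
    rng = random.Random(seed)
    out = list(rng.choice(list(chain.keys())))
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

A hierarchical version layers a second model over larger units (phrases, sentence templates) rather than raw words, which is what makes the output read more like "wisdom" and less like word salad.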

~~~
DrPhish
"The supreme happiness in life is simply to serve as a warning to others."

Oh my god, I'm in stitches. What a fun project, thanks for sharing!

------
binarnosp
When I think that this month I'm going to hear "Last Christmas" by Wham! in
every store I set foot into, this new masterpiece doesn't sound so bad.

------
ChuckMcM
I am pretty amazed at the effort that nVidia is putting into its corporate
rebranding effort. I wonder if, in the not too distant future, they will be
the AI company that also makes Graphics cards sometimes.

The other thing I find really amazing about it, coming from IBM, is that IBM
has invested a ton of money in IBM Watson but they sold off their foundry
business (could have made massively parallel AI machines) and their systems
business is a fraction of what it was.

Looking at what can be done when you're leading versus when you are following
is really sobering to me.

------
ravenstine
I'm not sure what I should be impressed by. Maybe there's some real technical
feat happening here, but I feel like a basic mad-libs style algorithm could
produce something better.

~~~
hahaker
I'm not very familiar with mad-libs, so correct me if I'm wrong, but I think
generating a lyrics passage (with zero hard-coded rules on content or grammar
or anything) from an image would not be something you can do with mad-libs.

------
divanvisagie
One step closer to GLaDOS.

~~~
demolish
this was a triumph! im making a note here: huge success

~~~
divanvisagie
It's hard to overstate my satisfaction.

------
duke_z
It is scary; it sounds like GLaDOS singing "Still Alive"!

~~~
bitwize
It sounds more like a corrupt core.

"Cave here. It's Christmas time, and you know what that means: Christmas
bonuses have been suspended until further notice. We've gotta pay the
judgement on that pesky class-action with _something_. But don't let that get
you out of the Christmas spirit. The lab boys have come up with a way to stay
festive by hooking up the Christmas Core to the lab's PA system. So enjoy
free, continuous, computer-generated Christmas music from now until January
5!"

"Cave again. Apparently the Christmas music has been causing some employees
severe emotional and psychological distress. We've had reports of people
sticking their heads into active particle accelerators and drinking Repulsion
Gel to get away from the sound. So until a full investigation has been
conducted and the Christmas Core thoroughly debugged, we are discontinuing the
Christmas music. We do _not_ need another class-action on our hands, folks."

------
dweekly
Music for people who hate music.

------
diydsp
Auto-synthesis of music has been a topic of academic interest since the 1950s,
when the first mainframe scribbled out code on paper tape to be translated
into sheet music and performed. UToronto's work here is the latest expression
of this desire.

The huge gap between our culture's actual music and these synthetic projects
can to an extent be described through "receptivity" or the phenomenology of
music - in other words, _how_ it's experienced. The following fun, short talk
does a great job of introducing the concept through its analysis of
"vaporwave."

[https://www.youtube.com/watch?v=QdVEez20X_s](https://www.youtube.com/watch?v=QdVEez20X_s)

~~~
vasaulys
That video is fantastic; I watched it a week or so ago.

Your explanation also explains why computer-performed music is so off. It
still has that uncanny valley effect. So when Sony had a computer generate a
"Beatles-esque pop song," they still had a human perform and produce it. But
at that point there's so much creativity and human-added value on top of it
that I don't think it's fair to call it computer generated, imho.

~~~
diydsp
yes. I can tell you a little more about that, too, since I used to research
this stuff and think about it a lot still.

One of my models of music is an external model of a regulated system that
parallels and trains our own habits and responses. E.g. a song demonstrates
tension and release similar to our own lives. The level of tension in a song
before release occurs can inform us how much tension we should accept before
performing some release activity.

Music's rhythms also inform the pace of our work. E.g. verse-chorus-verse
represents switching between two different activities. Even the pitch of a
single note acts as a reference for the amount of intensity of a sensation we
should use in our own lives. E.g. thrash metal listeners enjoy sudden shifts
into massive intensity and holding it there. Dubstep listeners are training
themselves for unusual, but rather intense, aesthetics leading up to
disproportionate release. Classical music tends to be for "long-chain
thinkers" tumbling ideas over from various perspectives, e.g. writers and
politicians, doctors - not factory workers.

With that as a background, consider that a live instrument is also a physical
system with a human controlling it interactively. The live system is a bit
different every time. Here's the critical part: _the human must listen and
provide instantaneous feedback to a varying system in order to present the
piece of music as a proper response model of a regulated system_. If the
player fails to do this, the model communicated by the performance is
different.

In open-loop systems, such as a sequencer, there is no (or limited)
interaction between the player and the sound, so an incidental model emerges.
That incidental model represents an _unintended and therefore most likely
irrelevant model of how to interact with reality._ e.g. it relieves tension
where no relief was needed. It lingers too long on an idea, long after a human
novelty-seeking circuit has starved.

Some people, e.g. in discussions of unstable filters like the TB-303, chalk
the variations up to the instrument being random, making every performance
different... However, they're missing the closed-loop portion of the
performance, in which _the performer reacts to the unpredictability of the
instrument in order to maintain the model._ In other words, the score and
notes are not the music; the performer's response to the environment the
score sets up is the music.

To revivify your uncanny valley observation, the "unstable filter creates
variations" crowd has a parallel in Perlin noise used to subtly animate human
models to make them not look so dead. However, it's incomplete because they
don't use (short-term) feedback to determine when the movement suffices to be
convincing. That feedback is the essence of performance.

In theory, computer scientists could implement these feedback models in
performance to make the sounds more realistic. They could be used in
synthesis, but the playback would still require observation of the listener!
Which is possible. Personally, I just prefer playing electronic instruments
live over using sequencers. It's only the sounds of electronic music I like,
the zaps, peowms, zizzes, pews, and poonshes, etc. I don't care for
electronics/computers to perform for me.

If you like this hypothesis, you can find more references on my wiki at:
[http://www.diydsp.com/index.php?title=Computer_Music_Isolation](http://www.diydsp.com/index.php?title=Computer_Music_Isolation)

------
_ix
Did I just hear a new classic?

~~~
BoringCode
No.

------
givinguflac
This is definitely now my favorite Christmas song. While obviously not a
masterpiece, it's incredible how far this tech has come. It's almost got a
Dadaist feel to it. Can't wait to see where this ends up in ten years! I can
foresee music labels buying a few of these AIs, getting some pretty people
with decent voices, and sending them on tour.

~~~
midgetjones
Is pop music not cheap and disposable enough for you already?

~~~
TheOtherHobbes
Pop is still recognisably human. This sounds like logic gates roasting by an
open fire.

~~~
midgetjones
That was a brilliantly seasonal analogy.

------
blauditore
While this is hilarious, it doesn't seem like a huge achievement to me.

The only thing (kind of) working well is the feature/topic detection in the
image (tree, Christmas, etc.), but that isn't really cutting edge.

The core part, learning and creating music, produced only a melody and lyrics
that seem not much different from an accumulation of random sentence and
chord fragments.

------
midgetjones
I'm not sure that Christmas songs generally use the blues scale; I wonder
what made them choose that for the melody?

~~~
dasboth
I suspect it's an easy way to get something that won't sound completely
dissonant, especially because you can use the same blues scale over multiple
chords.

~~~
midgetjones
I can see the logic behind that, but without any sort of tension/release
between the melody and chords, it still sounds just as dissonant to my ears.

~~~
dasboth
Granted, it will never win any awards. Part of the choice behind it might have
been the desire to "ship" it before Christmas.

------
brudgers
It brings this recent story to mind:
[https://news.ycombinator.com/item?id=13033299](https://news.ycombinator.com/item?id=13033299)

------
andrewclunn
Better than The Christmas Shoes.

------
eva1984
This is not even a proper song.

------
debt
We have a very long way to go it seems. That was almost nonsensical.

------
antisthenes
This was a triumph!

I'm making a note here:

Huge success!

------
pmyjavec
Chilling...

------
miguelrochefort
Show me the same algorithm generating songs in a different genre from
different images and I'll be impressed.

~~~
hahaker
I'm from the project team. This is a very interesting point.

While it is easy to crawl many songs from the internet, it is a little harder
to gather the same amount with proper genre/style/etc. labels, although it is
not impossible.

For now there's only one genre, which we call "the genre of whatever is on
the internet". So whatever music files were on there, many of them quite
"crappy", were used to train the model. There are also many other problems
around how to better structure and flavor the composition.

This is just a very early-stage attempt, as a CS student's fun side project.
We are working with people with real musical talent now and hoping to make
better songs in the next version.

~~~
miguelrochefort
I mean, where does the Christmas element come from? The image alone, the
music it was trained with, or is it somehow hardcoded in the algorithm?

~~~
hahaker
The Christmas element comes from 1. the image, and 2. a 4800-dimensional RNN
sentence encoding bias generated from ~30 Christmas songs.

Not sure how to hardcode this.

~~~
midgetjones
I'm interested that you used some Christmas songs as training (which wasn't
obvious from what I read of the paper). Were they pop songs, traditional, or a
mix?

Further to my comment up there[0] - and I don't wish to sound like a grinch,
because this is a really cool project - would I be right in thinking you
spent more time on the image description than the music?
I saw that you specify a scale for the melody. Would it be possible either to
generate the accompaniment around a mode, so that the melody can move
diatonically without risking too many clashes, or to allow the melody to
follow the chord sequence somehow?

Again, sorry if I sound too critical. It's a really awesome thing you've done,
and I'm just a guy that listens to the music instead of the lyrics.

[0]
[https://news.ycombinator.com/item?id=13079355](https://news.ycombinator.com/item?id=13079355)

~~~
hahaker
Thanks for the comments! Are you asking about the lyrics or the music generation?

For lyrics, we actually didn't train on Christmas songs. The training data
was a large collection of romance novels (see neural-storyteller by Jamie
Kiros). The "Christmas trick" we did was to apply a "style shift" after image
captioning and before lyrics generation, where the shifting vector was
obtained from ~30 Christmas songs.
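For readers wondering what a style shift on sentence encodings looks like mechanically, here is a minimal sketch. The function names and the use of plain mean differences are my illustration of the general technique, not the project's actual code; in the real system the encodings would be 4800-dimensional sentence vectors:

```python
import numpy as np

def style_shift(sentence_vec, style_vecs, neutral_vecs, alpha=1.0):
    """Shift a sentence encoding toward a target style.

    The shifting vector is the mean encoding of style exemplars
    (e.g. ~30 Christmas songs) minus the mean encoding of the
    'neutral' training domain (e.g. romance-novel passages).
    """
    shift = np.mean(style_vecs, axis=0) - np.mean(neutral_vecs, axis=0)
    return sentence_vec + alpha * shift
```

Decoding the shifted vector back to text is what produces "Christmas-flavored" lyrics from a model trained on non-Christmas prose.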

For the music generation: although we are aware of some basic music
performance rules, such as the melody following the chords, we actually
didn't add those kinds of rules.

For the blues scale, here's the thing. I didn't really know much about music,
so I spent several hours reading things like basicmusictheory.com. It
happened to introduce the blues scale, so we just used it. But you're right
about the relevance of blues to pop: after we ran the scale-checking code,
only a very small percentage of our pop music collection turned out to be
blues.
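A scale-checking routine of that sort can be sketched as follows (a hypothetical reconstruction; the project's actual code isn't shown in the post):

```python
# Minor blues scale intervals, in semitones above the root.
BLUES_INTERVALS = {0, 3, 5, 6, 7, 10}

def fits_blues_scale(midi_notes):
    """True if every pitch class in the melody fits some blues scale."""
    pitch_classes = {n % 12 for n in midi_notes}
    return any(
        pitch_classes <= {(root + i) % 12 for i in BLUES_INTERVALS}
        for root in range(12)
    )
```

Since the blues scale has only 6 of the 12 pitch classes, any melody using 7 or more pitch classes fails for every root, which makes this a fairly strict filter on pop material.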

~~~
midgetjones
Thanks for the reply! I was concentrating on the music specifically. I thought
the lyrics generation was really enjoyable.

I was asking more if you'd used any traditional carols, as they can have a
more definitively "christmassy" sound than a pop song with sleighbells laid
over the top.

Overall I meant that I think the music would be more convincing either with
the melody following the chords, or with both melody and accompaniment
sticking to a single mode.

