
An AI system for editing music in videos - benryon
http://news.mit.edu/2018/ai-editing-music-videos-pixelplayer-csail-0705
======
GistNoesis
Link to the official page with paper and demo: [http://sound-of-pixels.csail.mit.edu/](http://sound-of-pixels.csail.mit.edu/)

------
kr4
Does anyone know of anything similar that can extract a human voice from a
video that has other noises, including fans, people coughing, an electric
generator, etc.?

~~~
Jarwain
I'd imagine one could apply this network to a video of someone speaking.

~~~
augbog
Yup, I believe they announced this at the Google I/O keynote this year actually --
they mentioned that while audio alone might not be enough, looking at mouth
movements can give the AI enough information to tell who is saying what.

[https://www.youtube.com/watch?v=ogfYd705cRs&t=7m0s](https://www.youtube.com/watch?v=ogfYd705cRs&t=7m0s)
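For a rough sense of how the visual side can condition the audio side, here is a toy PyTorch sketch of the general idea: fuse a per-frame face embedding with the mixture spectrogram and predict a soft mask for the target speaker. This is not Google's actual architecture nor the PixelPlayer model; the layer sizes, the face-embedding input, and the name `AudioVisualMaskNet` are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualMaskNet(nn.Module):
    """Toy audio-visual separation sketch: combine a per-frame face embedding
    with the mixture spectrogram and predict a [0, 1] mask for that speaker."""

    def __init__(self, n_freq=257, face_dim=512, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hidden)
        self.face_proj = nn.Linear(face_dim, hidden)
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, face_emb):
        # mix_spec: (batch, time, n_freq) magnitude spectrogram of the mixture
        # face_emb: (batch, time, face_dim) per-frame embedding of the target face
        a = self.audio_proj(mix_spec)
        v = self.face_proj(face_emb)
        x, _ = self.rnn(torch.cat([a, v], dim=-1))
        mask = self.mask_head(x)        # soft mask, values in [0, 1]
        return mask * mix_spec          # estimated target-speaker spectrogram


# Smoke test with random tensors: batch of 2, 100 frames.
net = AudioVisualMaskNet()
est = net(torch.rand(2, 100, 257), torch.rand(2, 100, 512))
print(est.shape)  # torch.Size([2, 100, 257])
```

The point of the fusion is simply that the mask prediction sees both streams at every time step, so the network can learn to associate lip movement with the corresponding energy in the spectrogram.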

~~~
pbhjpbhj
Reflecting on your last phrase: I was watching esports the other day, and the
player was talking with a very loose mouth; I wondered if he was avoiding
being lip-read.

I imagine a lip-reading AI would see wide use. Managers will be wearing face
masks to hide their lips. (It's probably doable now to listen with a spy mic,
but that's obvious in a way that using a normal video camera isn't.)

~~~
laszlokorte
Julia Probst is a German deaf blogger [1] who is famous [2] for lip-reading
the tactical commands soccer coaches give their teams during a match and
posting them on Twitter. She is even hired by TV sports channels to provide
the commentators with inside information.

[1] [https://twitter.com/einaugenschmaus](https://twitter.com/einaugenschmaus)
[2] [http://www.sueddeutsche.de/sport/lippenlesen-im-fussball-die-geheimnisverraeter-1.2215827](http://www.sueddeutsche.de/sport/lippenlesen-im-fussball-die-geheimnisverraeter-1.2215827)

------
gtani
You can google "source separation" to get background on this; Stanford has
what is supposed to be a nice library.

[https://www.reddit.com/r/MachineLearning/comments/4oewdq/rnn...](https://www.reddit.com/r/MachineLearning/comments/4oewdq/rnn_on_audio_for_instrumentalvocal_isolation/)

[https://www.reddit.com/r/MachineLearning/comments/4r92iq/wou...](https://www.reddit.com/r/MachineLearning/comments/4r92iq/would_it_be_possible_to_train_a_nn_to_remove_echo/)

[https://www.reddit.com/r/MachineLearning/comments/66j2i4/p_i...](https://www.reddit.com/r/MachineLearning/comments/66j2i4/p_isolating_vocals_from_music_with_a_convnet/)
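As a small, non-neural baseline for what "source separation" means in practice, here is a sketch using librosa's harmonic/percussive separation (classical median-filtering HPSS, not the Stanford library or the neural approaches in the links above). The filename `mixture.wav` is a placeholder.

```python
import librosa
import soundfile as sf

# Load a mixture; sr=None keeps the file's native sample rate.
y, sr = librosa.load("mixture.wav", sr=None, mono=True)

# Harmonic/percussive source separation (HPSS):
# harmonic ~ sustained tonal content (voice, strings),
# percussive ~ transients (clicks, key strokes, drum hits).
y_harmonic, y_percussive = librosa.effects.hpss(y)

sf.write("harmonic.wav", y_harmonic, sr)
sf.write("percussive.wav", y_percussive, sr)
```

HPSS only splits by tonal vs. transient character, so it won't isolate one specific instrument or voice the way the learned approaches do, but it's a useful first experiment before reaching for a network.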

------
tgp1
This might be unrelated, but I want to ask: when I'm on a phone call, the
sounds of peripheral objects (traffic, horns, fans, keyboard clicks) often
seem more prominent than the voice of the person speaking. Do you notice the
same thing? Is there a scientific explanation for this?

~~~
hammock
Many phone systems use some form of automatic ducking: the system tries to
identify when someone is speaking and raises their volume while lowering
everyone else's. The objective is to increase overall intelligibility, but
it's not perfect. A sudden change in volume on one line, e.g. caused by a car
horn or typing, can trick it into thinking someone else has started speaking.
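As a rough illustration of the ducking idea (not how any real phone system implements it), here is a toy sketch: per short frame, treat the loudest line as the active speaker, boost it, and attenuate the others. The `boost`, `cut`, and frame-length values are made-up parameters, and the demo signals are synthetic.

```python
import numpy as np

def duck_channels(channels, sample_rate, boost=1.5, cut=0.3, frame_ms=20):
    """Crude automatic ducking across call participants.

    channels: list of equal-length 1-D float arrays, one per participant.
    Per frame, the channel with the highest RMS level is boosted and the
    rest are attenuated. A loud burst (car horn, key clicks) on an idle
    line will briefly win and get boosted by mistake, which is the
    failure mode described above.
    """
    frame = int(sample_rate * frame_ms / 1000)
    out = [np.copy(c) for c in channels]
    n = min(len(c) for c in channels)
    for start in range(0, n - frame + 1, frame):
        sl = slice(start, start + frame)
        levels = [np.sqrt(np.mean(c[sl] ** 2)) for c in channels]  # short-term RMS
        speaker = int(np.argmax(levels))
        for i, o in enumerate(out):
            o[sl] *= boost if i == speaker else cut
    return out


# Demo: a quiet talker vs. an idle line with one loud burst ("car horn").
sr = 8000
talker = 0.1 * np.sin(2 * np.pi * 200 * np.arange(sr) / sr)
noisy = np.zeros(sr)
noisy[4000:4160] = 0.8
ducked = duck_channels([talker, noisy], sr)
```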

------
peterlk
I am so excited to see this. I've been waiting for the day we get a tool that
can extract the individual instrumental parts from music. This isn't there
yet, but it's a step in the right direction. If we do get there, music
copyright will have another good fight on its hands.

