Looping Music Seamlessly (nolannicholson.com)
223 points by crummy on Oct 28, 2019 | 61 comments



Awesome project! As a professional audio developer, I was really blown away that this was the author's first project working with audio.

For anyone interested, I'd recommend checking out "The Infinite Jukebox", which has a similar goal, but perhaps a more robust approach: http://infinitejukebox.playlistmachinery.com

If I had to guess at why your approach didn't work well on recorded music, it's probably because most of the time, there is more than a single event happening at a time, so picking out just the highest FFT bin is probably not a very robust "fingerprint" of that part of the music. The infinite jukebox uses timbral features as the fingerprint, rather than just a single note.


Thanks for the pointer to The Infinite Jukebox! It's impressive how well it does. Its biggest point of failure seems to be jumping to different parts of a verse or chorus and mixing up bits of lyrics. It makes sense that this would happen, but it would be really interesting to see what could be done to avoid it.


It's a shame you can no longer upload music to The Infinite Jukebox. Apparently it relied on a Spotify API that was shut down October 1st. The best hope going forward is an open-source library that can offer the same track analysis that used to be provided by that API.


The API was an Echo Nest API. Spotify acquired Echo Nest and shut down all their APIs in 2016.

I'd love to see the app live on too! Spotify's API has similar functionality[1] to the old Echo Nest API, for now at least. (But I don't know if it returns all the same data.) Or, if you don't want to rely on Spotify, I bet Essentia[2][3] could do the job just as well. Essentia is the open-source brains behind AcousticBrainz[4]. So you could either use Essentia directly, or grab the data from the AcousticBrainz API.

[1] https://developer.spotify.com/documentation/web-api/referenc...

[2] https://github.com/MTG/essentia

[3] https://essentia.upf.edu/documentation/essentia_python_examp...

[4] https://acousticbrainz.org/


If you’re going to do autocorrelation, why not just do it on the audio signal? Why go through all the trouble of doing a FFT on the input signal and extracting notes first?

Just to give more details: you can do autocorrelation using the FFT. Autocorrelation is the convolution of a signal with a time-reversed copy of itself, and convolution in the time domain is multiplication in the frequency domain. So you take the FFT, multiply it pointwise by its complex conjugate, and take the inverse FFT. The loop points will show up as spikes in the result.

There are some additional factors you will likely want to consider, like windowing, but this is a much more straightforward way to do things.
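
To make that concrete, here's a minimal NumPy sketch of the FFT-based autocorrelation (my own illustration, not the article's code; the names are made up and it skips windowing):

    import numpy as np

    def autocorrelate(signal):
        n = len(signal)
        # Zero-pad to 2n so the circular convolution doesn't wrap around.
        padded = np.zeros(2 * n)
        padded[:n] = signal - np.mean(signal)  # remove DC so lag 0 doesn't swamp everything
        spectrum = np.fft.rfft(padded)
        # Pointwise multiply by the complex conjugate, then inverse FFT.
        acf = np.fft.irfft(spectrum * np.conj(spectrum))[:n]
        return acf / acf[0]                    # normalize so lag 0 == 1.0

    # Candidate loop lengths show up as peaks at nonzero lags, e.g.
    # lag = np.argmax(acf[min_lag:]) + min_lag, where min_lag skips the
    # trivial peak around zero.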

(As an aside, I would probably put the G in the third measure on the left hand.)


> If you’re going to do autocorrelation, why not just do it on the audio signal?

Because the audio signal is very sparse. On a decent-quality file you have 44.1K or 48K samples per second, and you need a window of at least half a second to pick a looping point.


Audio signals are not sparse at all, unless you are using a definition of “sparse” that I’m unfamiliar with. I also don’t understand your complaint about the looping point, or window size.

If you don’t understand how autocorrelation works, I can provide some pseudocode to demonstrate the technique. This is not an esoteric or unusual way of finding a looping point.


I might be misusing "sparse". What I meant was that in music with a sufficiently high sample rate, there are a lot of samples between notes, and a lot of the samples are noise and garbage.

I do understand how autocorrelation works, to a reasonable degree. And that is the reason for the DFT. We are talking music, so we care about the frequencies more than about the immediate sample value.

And the window is important because autocorrelating over individual data points is useless, sound signals are noisy and you get a lot of randomness.

Also, I tried the code provided on a few pieces I thought would be appropriate and it didn't do a very good job. I might run a few experiments.


> We are talking music, so we care about the frequencies more than about the immediate sample value.

The frequencies and the sample values are just two different ways of looking at the same data.

> And the window is important because autocorrelating over individual data points is useless, sound signals are noisy and you get a lot of randomness.

I’m not sure what you mean by “individual data points”. The autocorrelation is computed for an entire signal, not for any individual data point. Talking about an individual data point in the input doesn’t make much sense.

Autocorrelation is not especially susceptible to noise.

> Also, I tried the code provided on a few pieces I thought would be appropriate and it didn't do a very good job. I might run a few experiments.

Try doing just a simple autocorrelation instead. This will work for the use cases described (looped samples of music): you will see a large spike in the autocorrelation for the loop point.

For music which repeats but is not looped, you can do some preprocessing, e.g. finding envelopes, and do the autocorrelation on that.
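
A rough sketch of the envelope idea (my own, untested; the names are made up): compute a coarse amplitude envelope and autocorrelate that instead of the raw samples.

    import numpy as np

    def envelope(signal, hop=512):
        # Crude amplitude envelope: RMS over non-overlapping hops of `hop` samples.
        n = len(signal) // hop
        return np.array([np.sqrt(np.mean(signal[i*hop:(i+1)*hop] ** 2))
                         for i in range(n)])

    def best_lag(env, min_lag):
        env = env - env.mean()
        acf = np.correlate(env, env, mode="full")[len(env) - 1:]  # lags 0..N-1
        return np.argmax(acf[min_lag:]) + min_lag                 # in hops; multiply by hop for samples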


> you will see a large spike in the autocorrelation for the loop point

Why? All I can hope for is to pick the max autocorrelation value, but why would I see a spike?


That is the exact intended purpose of the autocorrelation



Cool and nicely illustrated article. It is also funny to me because in secondary school (high school) I used to edit certain songs manually in order to create x-hour versions of them. I'd manually look for repeat-points and just copy paste pieces of the song in Audacity (an audio editor).

At some point, people at school came to me asking if I could make multiple-hour versions of songs for them.


Reminds me of when I ripped the Ace Combat 4 and Ace Combat 5 soundtracks off the game DVDs. I forget the details, but basically it was possible to stick the DVD in a computer, play part of it as ADPCM data, and record it. The result was one big recording consisting of all the music tracks separated by short bursts of very loud static. The static was presumably the track metadata, which I didn't know how to decode, but it sure made it easy enough to find the track boundaries. In AC4, each mission's music was already looped out to the length of the mission (e.g. if the mission timer was 13 minutes, then the song was looped as many times as needed to go over 13 minutes), so I just had to find approximately where the first or second loop ended and fade it out after that to make an appropriate-length "album" version of the track.

For AC5, there were a lot more missions, so I guess they didn't have the disc space to do this and had to get creative. Instead, each track cut off exactly at the loop point, so in order to make it loop and fade, I had to find the point near the beginning that it was meant to loop back to, copy the track, and manually line up that point with the end of the track and do a short cross-fade to hide the seam. Then I could fade it out a few seconds into the second loop to complete the "album" version.

I believe at some point they released official OST albums for both games, but a few special tracks were missing from both, so my versions are slightly more complete. (Specifically, the AC4 OST was missing the version of the hangar music for the final mission that included someone giving a speech over the loudspeaker, while AC5 was missing the track of everyone singing together over the radio in the penultimate mission.)


Did you have a YouTube channel? I used to do this and post them!


That is cool! Back when I did this, YouTube either didn't exist or wasn't a thing yet, so I never posted them. I probably would have worried about copyright issues as well.


Minor correction: the graph of the PCM data is displaying the data as unsigned instead of signed, so it has lots of discontinuities between ~0 and ~64k.

Though, for the sake of showing the data as "raw", I guess it doesn't matter too much.
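
(For anyone plotting this themselves, the fix with NumPy is just to reinterpret the same bytes as signed; a toy example, names my own:)

    import numpy as np

    raw = np.array([0, 1, 65535, 32768], dtype=np.uint16)  # toy "unsigned" 16-bit samples
    signed = raw.view(np.int16)  # same bytes reinterpreted as signed: [0, 1, -1, -32768]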


> This happened a few months ago to BrawlBRSTMs, one of the first accounts to do it for a lot of music tracks, and people were devastated.

You can still access their website https://www.smashcustommusic.com/ through the Internet Archive and can still download the BRSTM files that they used to generate those videos. BRSTM is an audio file format that has loop points encoded in it[1]. Of course, you'll need a special program to play and loop it, like BrawlBox[2] or a web-based one that I created a few months ago[3].

[1] https://wiibrew.org/wiki/BRSTM_file

[2] https://github.com/libertyernie/brawltools

[3] https://github.com/kenrick95/nikku


There’s a great app for iOS called snesmusic which lets you download videogame music from a bunch of consoles including Nintendo 64 and play it straight through or looped endlessly using the “real” loop points. You can download music from within the app, and it’s all ripped directly from the ROMs. The playback sounds pretty accurate to my ears.


Does it work with GBA?



Yes.

NES, SNES, N64, Game Boy, Mega Drive, Master System, PC Engine, PS1, NDS, GBA


Semi-related and possibly not adding much other than an additional anecdote to the concept, but this is a full continuous album that loops from the end back to the beginning: https://en.wikipedia.org/wiki/Nonagon_Infinity

I happen to like the music as well.


Related, but I was looking for an audio format that could embed these loop points inside of it so the file would be small but play infinitely. Is there anything "standard" that does this? I looked into WAV and AIFF, which seem to have "cue" points, but they don't seem to quite do what I want…


The "standard" is unfortunately (as far as I can determine) not open source, but it is an "open" standard. The REX2 file format from Propellerhead is documented somewhere at https://www.reasonstudios.com/developers but you will need to sign up for a developer account. I'm pretty sure it's free to use, and it's supported by most serious DAWs.

>"REX2 is a proprietary type of audio sample loop file format developed by Propellerhead, a Swedish music software company.

It is one of the most popular and widely supported loop file formats for sequencer and digital audio workstation software. It is supported by PreSonus Studio One, Propellerhead Reason, Steinberg Cubase, Steinberg Nuendo, Cockos REAPER, Apple Logic, Digidesign Pro Tools, Ableton Live, Cakewalk Project5, Cakewalk Sonar, Image-Line FL Studio, MOTU Digital Performer, MOTU Mach 5 (software sampler), and Synapse Audio Orion Platinum, among others."

https://en.wikipedia.org/wiki/REX2


Did something change about their licensing? At least up until recently, I remember that Propellerheads applied proprietary licensing terms and required every developer to be associated with a company for any of their third-party stuff, including REX and ReWire.


I think you're probably correct. They had a press release in 2004 talking about "opening" the format. It's a bit of a letdown, to tell the truth.

"In the past, major Developers such as Steinberg and Emagic applied for and obtained a license to support REX playback in their applications. Now as an open format, any manufacturer, large or small, can support REX playback in their applications.

Third-party manufacturers are encouraged to download REX2 developer documentation. Implementing the Propellerheads REX2 file format in other applications or hardware is free of charge. Further information about the REX2 file format is available at http://www.propellerheads.se/developer"

https://www.reasonstudios.com/press/21-propellerhead-softwar...

A bit more info at https://www.reasonstudios.com/developer/index.cfm?fuseaction...

EDIT: Speeling


Ideally I'd be able to play this on an iPod :(


That's up to the media player (with standard WAV / AIFF etc). If it buffers the file and respects cue points, you should be able to have seamless playback, but very few media players get this 100% correct. See https://en.wikipedia.org/wiki/Gapless_playback for more details.


What are we meant to look for on that Wikipedia article page, and how does it relate to looping? The article says nothing about cue points as far as I can see.


Looping (with normal WAV files etc) is gapless playback. You can get almost perfect looping if you use buffering.

The Wikipedia article is basically just illustrating that, to do looping on an iPod (as requested), you would just need sample-accurate WAV files and a media player that supports gapless playback.


My apologies, but what I was wondering is rather this: even though that article tells us which players support gapless playback, it doesn't say which players support loop points.

In order to know which player to use, we would need to know which players support both gapless playback and loop points. Preferably the list of such software would also state whether the player can do gapless playback at loop points, since conceivably a player might handle gapless transitions between songs but not do it properly at loop points.


Just because you have loop points defined doesn't automatically mean that every player will be able to process them in a seamless way.


I responded to a sibling comment here with some more details about what I was wondering about: https://news.ycombinator.com/item?id=21384082

Basically, what I was wondering is that even though that article tells us which players support gapless playback, it doesn't say which players support loop points.


WAV also has the info you need, in the "smpl" (sampler information) chunk. You can also embed this into FLAC files as foreign metadata of type "riff".

Reference: http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/Do...
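
In case it's useful, here's a rough sketch of pulling the loop points out of that chunk (my own code, a naive byte search rather than a proper RIFF parser; field layout per the reference above):

    import struct

    def read_smpl_loops(path):
        data = open(path, "rb").read()
        pos = data.find(b"smpl")        # naive: assumes "smpl" doesn't occur in the audio data
        if pos < 0:
            return []
        body = pos + 8                  # skip chunk ID + chunk size
        num_loops = struct.unpack_from("<I", data, body + 28)[0]  # cSampleLoops field
        loops = []
        for i in range(num_loops):
            off = body + 36 + i * 24    # loop records follow the 36-byte chunk header
            _id, _type, start, end, _frac, _count = struct.unpack_from("<6I", data, off)
            loops.append((start, end))  # loop start/end as sample offsets
        return loops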


Could you pick a frame of audio and then cross correlate it with the entire song? The peaks of the cross correlation should indicate when the segment of audio repeats, i.e. potential loop points.


That's called the autocorrelation and is indeed used for finding periodicity in signals.


Well, it's slightly different in that you are cross-correlating only a section of the signal with the full signal.


This takes me back to my days creating Amiga modules, desperately trying to find places in the samples that I could set the loop points so they wouldn't click annoyingly.

I didn't have access to Fourier transformations back then, I would just keep setting the loop points in likely looking places and hope no-one would notice the pop.


Sheesh even the Akai S1000 at the time had a crossfade option to blend the edges.


There may have been better options but I was 15 and didn't know anything. Unlike now, where I am 44 and don't know anything.


Oh man, this takes me back! I used the Echo Nest library and made a GUI interface for it in Max/MSP for a class back in undergrad... good times.

https://www.youtube.com/watch?v=DL8vJO05DCs


Beware - the power of music isn't innocent: http://www.earthlyfireflies.org/government-use-of-music-to-i.... And here we had an interesting discussion on LinkedIn of its influences: http://www.earthlyfireflies.org/linkedin-dialogue-on-music/.


On a related note: I've noticed that ffmpeg supports the BCSTM format, which has built-in loop points (unsurprisingly, since it's used for video game music). ffmpeg decodes and stores that information. mpv however doesn't take advantage of that. That's mildly unfortunate.


What is a 'frame' in terms of PCM audio? If the raw data is just measures of amplitude at a set frequency, don't you have to pick an arbitrary time slice to examine (and so risk losing lower-frequency sounds)?


> What is a 'frame' in terms of PCM audio?

Sometimes people in the audio world use "frame" to mean "one set of PCM samples, one per channel". So a typical stereo audio signal at 44100hz would be 44100 "frames" per second, even though it's really 88200 samples per second. The author seems to be using it that way, but it's confusing because he's also talking about sliding an FFT window over by frames.

> If the raw data is just measures of amplitude at a set frequency

LPCM is literally amplitude samples at a rate. Thinking of the sample rate as a 'set frequency' will lead to confusion (even though it obviously is a frequency). When you're thinking of samples as sequential amplitudes, you're thinking in the "time domain". When you're thinking in terms of the oscillations of the signal, that's called the "frequency domain". The Fourier transform is how you convert from the time domain to the frequency domain.

> don't you have to pick an arbitrary time slice to examine (and so risk losing lower-frequency sounds)?

You need at least 2 samples to make an audible frequency. If you only had 1 sample, you wouldn't hear anything, because nothing would be moving. So at 44100hz of sampling frequency, you can capture 0hz to 22050hz of audio frequency. That's called the Nyquist frequency, and it's always half of the sample rate.
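
A toy example of that time-domain/frequency-domain distinction (mine, not from the article):

    import numpy as np

    sr = 44100
    t = np.arange(sr) / sr                     # one second of mono samples ("frames")
    x = np.sin(2 * np.pi * 440 * t)            # time domain: a 440 Hz sine
    spectrum = np.abs(np.fft.rfft(x))          # frequency domain: bins from 0 Hz to 22050 Hz
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    print(freqs[np.argmax(spectrum)])          # ~440.0: the tone shows up as a single peak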


> You need at least 2 samples to make an audible frequency.

That's not strictly true. An audio file with a single non-zero sample (usually set to full amplitude) is often used for testing -- usually called a Dirac impulse or similar.

That impulse will be (necessarily) band-passed by the playback hardware and put out filtered "white" noise.

That impulse can be recovered by a mic to show e.g., pre-ringing caused by (FIR) filters. An FFT of that impulse will show the playback hardware's response in the frequency domain vs full bandwidth.

> Thinking of the sample rate as a 'set frequency'

For a WAV file, e.g., that's a fixed number of samples per second (a frame being 1 sample x n channels). That is a set frequency, and deviating from it will alter the pitch of the music.

There really is no case where sample rate varies, unless we're talking about minute variations between the clock signals of different hardware, which requires the use of sample rate conversion to match.

A related concept is the bits-per-second of lossy formats (e.g., AAC) which may vary from frame to frame (and that frame will mean something different from a WAV frame).

> You need at least 2 samples to make an audible frequency.

I think you're confusing this with Nyquist being 1/2 of the sampling frequency. You can very much capture an audible signal with a single sample, but that signal will be limited (by hardware, by Nyquist, etc).

[Edit]

I should say that this single sample has to be non-zero and the playback system has to have a DC-offset that isn't equal to that sample's amplitude.


> That's not strictly true. An audio file with a single non-zero sample (usually set to full amplitude) is often used for testing -- usually called a Dirac impulse or similar.

Is this a PCM type sample or a frequency-domain sample? If the former, how frequently does this impulse get repeated in order to turn into white noise after going through the playback hardware? It sounds like if it's not repeated it should just make a nasty 'pop'.

> I should say that this single sample has to be non-zero and the playback system has to have a DC-offset that isn't equal to that sample's amplitude.

As I understand it, if you try to play a PCM audio file with a uniform value, you're effectively putting DC through the speakers, driving them to a particular offset where they'll stay until the end of the track. Is that not the case?


This is more correct than how I think about it :)

> You can very much capture an audible signal with a single sample, but that signal will be limited (by hardware, by Nyquist, etc).

This makes no sense to me, but it's obviously true because this exists: https://en.wikipedia.org/wiki/1-bit_DAC


I don't think your comment is in conflict with the parent comment.

Yes, a single sample file can produce a sound — a tiny impulse spike "tick" — but that sound doesn't have any audible frequency or pitch because there's no oscillation or tone.


I think my confusion is around the article's 'frames' and your use of an FFT 'window' - again, what size should be used for the frame/window?

If you have just two samples of audio at 40khz, I understand that the max frequency you can capture is 20khz. But, given just two samples, I can't see how you can separate out the multiple lower frequencies that could be present. To do so, you would need more samples (i.e. for a longer period of time, not a higher frequency of samples). So my question is, how many samples do you pick to do the FFT on?


Oh, I see what you're asking now. This might get a bit mathy, so I won't be offended if I get something wrong and someone corrects me :)

I think where you're lost is that the FFT sample size determines the number (and therefore size) of the bins, while Nyquist determines the maximum frequency that can be binned.

If you have a window of 8192 samples, for example, that gets you 4096 bins. If you're sampling at 44100hz, those 4096 bins get you bin sizes of around ~5.38Hz per bin. So the lowest frequency range that you can identify would be 0-5.38hz, then 5.38hz-10.76hz, etc., up until 22044hz-22050hz. So if you're sliding a window of 8192 samples with a sample rate of 44100, that's about 186ms per FFT window. That means you can draw a spectrum with a resolution of 4096 bins per 186ms of audio. If you wanted to plot a spectrum with a finer time resolution, you'd have to give up some frequency resolution by using a smaller window with fewer, wider bins.

The actual results vary by FFT implementation, and it's common to have windowing inside the FFT that I don't really understand, but it has something to do with the accuracy of those bins, preventing leakage of one bin into another. "Hamming window" is probably the most common and usually happens by default, but that's a different window from the one you're sliding through the time domain to take FFTs on and plot.

As a user of FFT, at least for audio stuff (RF might be different), you mostly just think in terms of bin size.

edit: I did it wrong, fixed the numbers (I think)
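
(If anyone wants to sanity-check those numbers, a quick NumPy sketch:)

    import numpy as np

    sr, n = 44100, 8192
    freqs = np.fft.rfftfreq(n, d=1 / sr)  # bin centers from 0 Hz to 22050 Hz
    print(freqs[1])                       # ~5.38 Hz spacing between bins
    print(n / sr * 1000)                  # ~185.8 ms of audio per 8192-sample window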


> If you're sampling at 44100hz, those 4096 bins gets you bin sizes of around ~10.76Hz per bin. So your lowest frequency range that you can identify would be 0-10.76hz, then 10.76hz-21.52hz, etc, up until 44062hz-44073hz.

I was going to say that the bins are distributed logarithmically, so they aren't uniformly-sized like this. But I did some research and I guess I was wrong and they are uniformly sized, so I learned something.

> it's common to have windowing inside the FFT that I don't really understand

The FFT turns a signal from the time domain to the frequency domain. To do that, the math assumes that the signal is unchanged for all time. In other words, it treats that chunk of samples you give it as looping infinitely backwards and forwards in time.

But the set of samples you gave it is a segment from a signal that does change over time. So when you loop it, you'll get discontinuities.

For example, let's say your signal is a single sine wave whose period is twelve samples:

    -.        .--.        .--
      \      /    \      /
       \    /      \    /
        '--'        '--'
    |...........|...........|
For your FFT, you take a chunk of sixteen samples:

    -.        .--.  
      \      /    \ 
       \    /      \
        '--'        
    |...........|...
From the Fourier transform's perspective, the sound looks like a loop of that chunk:

    -.        .--.  -.        .--.  -.        .--.  
      \      /    \ | \      /    \ | \      /    \ 
       \    /      \|  \    /      \|  \    /      \
        '--'            '--'            '--'        
    |...........|...
But that loop introduces a sharp discontinuity. The FFT doesn't realize the discontinuity is not part of the original signal, so it will go ahead and analyze it. To produce a jump like that, you need a lot of high-frequency components, so the analysis will give you all of these extra high-frequency results that aren't part of the original signal but are merely artifacts of chopping the signal into pieces.

Windowing basically fades out the edges of each segment to reduce those discontinuities so that you don't get the artifacts in the results. There are a bunch of different ways to do it because they're all sort of hacks that balance suppressing bogus artifacts against not masking actual signal that happens to occur near the edge of the segment.
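
A small sketch of what that looks like in practice (my own illustration): multiply the chunk by a Hann window before the FFT so the looped edges fade to zero.

    import numpy as np

    sr = 44100
    t = np.arange(8192) / sr
    chunk = np.sin(2 * np.pi * 1000.3 * t)  # a tone that doesn't land exactly on a bin
    raw = np.abs(np.fft.rfft(chunk))
    windowed = np.abs(np.fft.rfft(chunk * np.hanning(len(chunk))))
    # 'windowed' leaks far less energy into distant bins than 'raw',
    # at the cost of a slightly wider peak around 1000 Hz.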


> I was going to say that the bins are distributed logarithmically, so they aren't uniformly-sized like this. But I did some research and I guess I was wrong and they are uniformly sized, so I learned something.

I thought that too until I looked it up. The reason is because we're used to seeing it drawn in logarithmic scale on spectrum analyzers.

Btw, those are some neat ascii graphs. What tool do you use for that?

Also, good explanation :)


> The reason is because we're used to seeing it drawn in logarithmic scale on spectrum analyzers.

Yes, that's exactly why I was confused. :)

> What tool do you use for that?

Sublime Text and patience.


Thanks to both of you for the detailed explanation!


A frame is a function of the encoding frequency and bit depth; these both determine how many raw PCM samples constitute a "frame".


Oops, and number of channels too!


I use the mpv command-line tool to play music in a loop, seamlessly. You open an audio file, press "l" (lowercase L) at the starting point, then use the arrow keys to go to the end point and press "l" again. That's it. The only problem is pressing the keys precisely at a music bar boundary, but with enough practice you can get pretty close, so the seam is barely noticeable or not noticeable at all.


Maybe iOS 13 should pay attention (single-track loops often stutter at the end).


[insert xkcd411 here]



