For anyone interested, I'd recommend checking out "The Infinite Jukebox", which has a similar goal, but perhaps a more robust approach: http://infinitejukebox.playlistmachinery.com
If I had to guess at why your approach didn't work well on recorded music, it's probably because most of the time, there is more than a single event happening at a time, so picking out just the highest FFT bin is probably not a very robust "fingerprint" of that part of the music. The infinite jukebox uses timbral features as the fingerprint, rather than just a single note.
I'd love to see the app live on too! Spotify's API has similar functionality to the old Echo Nest API, for now at least. (But I don't know if it returns all the same data.) Or, if you don't want to rely on Spotify, I bet Essentia could do the job just as well. Essentia is the open-source brains behind AcousticBrainz. So you could either use Essentia directly, or grab the data from the AcousticBrainz API.
Just to give more details: you can compute the autocorrelation using the FFT. Autocorrelation is the convolution of a signal with a time-reversed copy of itself, and convolution in the time domain is pointwise multiplication in the frequency domain. So you take the FFT, multiply it pointwise by its complex conjugate (time-reversing a real signal becomes conjugation in the frequency domain), and then take the inverse FFT. The loop points will show up as spikes in the result.
There are some additional factors you will likely want to consider, like windowing, but this is a much more straightforward way to do things.
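As a concrete sketch of that recipe (a toy example of my own, with a made-up repeating-noise signal, not code from the thread):

```python
import numpy as np

def autocorrelate_fft(x):
    """Autocorrelation via FFT: r = IFFT(FFT(x) * conj(FFT(x))).

    Zero-padding to twice the length makes the FFT's circular
    correlation behave like a linear one (no wrap-around).
    """
    n = len(x)
    spectrum = np.fft.rfft(x, n=2 * n)
    power = spectrum * np.conj(spectrum)  # pointwise multiply by the conjugate
    r = np.fft.irfft(power)[:n]           # keep non-negative lags
    return r / r[0]                       # normalize so lag 0 == 1.0

# Toy signal that loops every 500 samples.
rng = np.random.default_rng(0)
period = 500
x = np.tile(rng.standard_normal(period), 8)

r = autocorrelate_fft(x)
lag = np.argmax(r[period // 2:]) + period // 2  # skip the lag-0 spike
print(lag)  # 500: the spike lands on the loop length
```

The zero-padding and the conjugate are the two details that trip people up: without them you get circular wrap-around, and a convolution with the un-reversed signal, respectively.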
(As an aside, I would probably put the G in the third measure on the left hand.)
Because an audio signal is very sparse. On a decent-quality file you have 44.1K or 48K samples per second, and you need a window of at least half a second to pick a looping point.
If you don’t understand how autocorrelation works, I can provide some pseudocode to demonstrate the technique. This is not an esoteric or unusual way of finding a looping point.
I do understand how autocorrelation works, to a reasonable degree. And that is the reason for the DFT. We are talking music, so we care about the frequencies more than about the immediate sample value.
And the window is important because autocorrelating over individual data points is useless, sound signals are noisy and you get a lot of randomness.
Also, I tried the code provided on a few pieces I thought would be appropriate and it didn't do a very good job. I might run a few experiments.
The frequencies and the sample values are just two different ways at looking at the same data.
> And the window is important because autocorrelating over individual data points is useless, sound signals are noisy and you get a lot of randomness.
I’m not sure what you mean by “individual data points”. The autocorrelation is computed for an entire signal, not for any individual data point. Talking about an individual data point in the input doesn’t make much sense.
Autocorrelation is not especially susceptible to noise.
> Also, I tried the code provided on a few pieces I thought would be appropriate and it didn't do a very good job. I might run a few experiments.
Try doing just a simple autocorrelation instead. It will work for the use cases described (looped samples of music): you will see a large spike in the autocorrelation at the loop point.
For music which repeats but is not looped, you can do some preprocessing e.g. to find envelopes and do autocorrelation on that.
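To sketch the envelope idea (again a toy of my own; real music would need more careful envelope extraction): rectify the signal, smooth it into an amplitude envelope, and autocorrelate that instead of the raw samples.

```python
import numpy as np

def envelope(x, win=1024):
    """Crude amplitude envelope: rectify, then moving-average smooth."""
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def autocorr(x):
    """FFT-based autocorrelation, mean-removed and zero-padded."""
    x = x - x.mean()
    n = len(x)
    s = np.fft.rfft(x, n=2 * n)
    return np.fft.irfft(s * np.conj(s))[:n]

# Toy signal: the same one-second "phrase" repeated at different loudness,
# so the raw samples don't match exactly but the envelope still repeats.
sr = 8000
t = np.arange(sr) / sr
phrase = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)
x = np.concatenate([phrase * g for g in (1.0, 0.8, 1.1, 0.9)])

r = autocorr(envelope(x))
lag = np.argmax(r[sr // 2:]) + sr // 2
print(lag / sr)  # repeat period in seconds, close to 1.0
```

The point of the preprocessing is that repeats which aren't sample-exact still produce a clear peak once you correlate loudness contours instead of waveforms.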
Why? All I can hope for is to pick the max autocorrelation value, but why would I see a spike?
At some point, people at school came to me asking if I could make multiple-hour versions of songs for them.
For AC5, there were a lot more missions, so I guess they didn't have the disc space to do this and had to get creative. Instead, each track cut off exactly at the loop point, so in order to make it loop and fade, I had to find the point near the beginning that it was meant to loop back to, copy the track, and manually line up that point with the end of the track and do a short cross-fade to hide the seam. Then I could fade it out a few seconds into the second loop to complete the "album" version.
I believe at some point they released official OST albums for both games, but a few special tracks were missing from both, so my versions are slightly more complete. (Specifically, the AC4 OST was missing the version of the hangar music for the final mission that included someone giving a speech over the loudspeaker, while AC5 was missing the track of everyone singing together over the radio in the penultimate mission.)
Though, for the sake of showing the data as "raw", I guess it doesn't matter too much.
You can still access their website https://www.smashcustommusic.com/ through the Internet Archive and can still download the BRSTM files that they use to generate those videos. BRSTM is an audio file format that has loop points encoded in it. Of course, you'll need a special program to play and loop it, like BrawlBox or a web-based one that I created a few months ago.
I happen to like the music as well.
>"REX2 is a proprietary type of audio sample loop file format developed by Propellerhead, a Swedish music software company.
It is one of the most popular and widely supported loop file formats for sequencer and digital audio workstation software. It is supported by PreSonus Studio One, Propellerhead Reason, Steinberg Cubase, Steinberg Nuendo, Cockos REAPER, Apple Logic, Digidesign Pro Tools, Ableton Live, Cakewalk Project5, Cakewalk Sonar, Image-Line FL Studio, MOTU Digital Performer, MOTU Mach 5 (software sampler), and Synapse Audio Orion Platinum, among others."
"In the past, major Developers such as Steinberg and Emagic applied for and obtained a license to support REX playback in their applications. Now as an open format, any manufacturer, large or small, can support REX playback in their applications.
Third-party manufacturers are encouraged to download REX2 developer documentation. Implementing the Propellerheads REX2 file format in other applications or hardware is free of charge. Further information about the REX2 file format is available at http://www.propellerheads.se/developer"
A bit more info at https://www.reasonstudios.com/developer/index.cfm?fuseaction...
The Wikipedia article is basically just illustrating that, to do looping on an iPod (as requested), you would just need sample-accurate WAV files and a media player that supported gapless playback.
In order to know which player to use, we would need to know which players support both gapless playback and loop points. Preferably the list of such software would also state whether the player can do gapless playback across loop points, since conceivably a player might manage gapless playback when transitioning between songs but not handle loop points properly.
Basically, what I was wondering about was rather, even though that article tells us which players support gapless playback, it doesn't say which players support loop points.
I didn't have access to Fourier transforms back then; I would just keep setting the loop points in likely-looking places and hope no one would notice the pop.
Sometimes people in the audio world use "frame" to mean "set of PCM samples per channel". So a typical stereo audio signal at 44100hz would be 44100 "frames" per second, even though it's really 88200 samples per second. The author seems to be using it that way, but it's confusing because he's also talking about sliding an FFT window over by frames.
> If the raw data is just measures of amplitude at a set frequency
LPCM is literally amplitude samples at a rate. Thinking of the sample rate as a 'set frequency' will lead to confusion (even though it obviously is a frequency). When you're thinking of samples as sequential amplitudes, you're thinking in the "time domain". When you're thinking in terms of the signal's oscillations, that's called the "frequency domain". The Fourier transform is how you convert from the time domain to the frequency domain.
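A quick numpy illustration of the two views (the 440hz test tone is an arbitrary choice of mine):

```python
import numpy as np

sr = 44100                       # sample rate
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)  # time domain: one second of amplitudes

spectrum = np.fft.rfft(x)                  # convert to the frequency domain
freqs = np.fft.rfftfreq(len(x), d=1 / sr)  # the frequency each bin represents
peak = freqs[np.argmax(np.abs(spectrum))]
print(peak)  # 440.0
```

Same data both times: `x` is a list of amplitudes over time, `spectrum` says how much of each oscillation frequency is present.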
> don't you have to pick an arbitrary time slice to examine (and so risk losing lower-frequency sounds)?
You need at least 2 samples to make an audible frequency. If you only had 1 sample, you wouldn't hear anything, because nothing would be moving. So at 44100hz of sampling frequency, you can capture 0hz to 22050hz of audio frequency. That's called the Nyquist frequency, and it's always half of the sample rate.
That's not strictly true. An audio file with a single non-zero sample (usually set to full amplitude) is often used for testing -- usually called a Dirac impulse or similar.
That impulse will be (necessarily) band-passed by the playback hardware and put out filtered "white" noise.
That impulse can be recovered by a mic to show e.g., pre-ringing caused by (FIR) filters. An FFT of that impulse will show the playback hardware's response in the frequency domain vs full bandwidth.
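That full-bandwidth property is easy to check numerically (numpy sketch; an ideal impulse has equal energy in every frequency bin, which is exactly why it's useful for measuring a system's response):

```python
import numpy as np

n = 1024
impulse = np.zeros(n)
impulse[0] = 1.0  # a single non-zero sample: a discrete Dirac impulse

mag = np.abs(np.fft.rfft(impulse))
print(np.allclose(mag, 1.0))  # True: the spectrum is flat across all bins
```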
> Thinking of the sample rate as a 'set frequency'
For e.g., a WAV file, that's a fixed number of samples per second (a frame being 1 sample x n channels). That is a set frequency, and deviating from it will alter the pitch of the music.
There really is no case where sample rate varies, unless we're talking about minute variations between the clock signals of different hardware, which requires the use of sample rate conversion to match.
A related concept is the bits-per-second of lossy formats (e.g., AAC) which may vary from frame to frame (and that frame will mean something different from a WAV frame).
> You need at least 2 samples to make an audible frequency.
I think you're confusing this with Nyquist being 1/2 of the sampling frequency. You can very much capture an audible signal with a single sample, but that signal will be limited (by hardware, by Nyquist, etc).
I should say that this single sample has to be non-zero and the playback system has to have a DC-offset that isn't equal to that sample's amplitude.
Is this a PCM type sample or a frequency-domain sample? If the former, how frequently does this impulse get repeated in order to turn into white noise after going through the playback hardware? It sounds like if it's not repeated it should just make a nasty 'pop'.
> I should say that this single sample has to be non-zero and the playback system has to have a DC-offset that isn't equal to that sample's amplitude.
As I understand it, if you try to play a PCM audio file with a uniform value, you're effectively putting DC through the speakers, driving them to a particular offset where they'll stay until the end of the track. Is that not the case?
> You can very much capture an audible signal with a single sample, but that signal will be limited (by hardware, by Nyquist, etc).
This makes no sense to me, but it's obviously true because this exists: https://en.wikipedia.org/wiki/1-bit_DAC
Yes, a single sample file can produce a sound — a tiny impulse spike "tick" — but that sound doesn't have any audible frequency or pitch because there's no oscillation or tone.
If you have just two samples of audio at 40khz, I understand that the max frequency you can capture is 20khz. But, given just two samples, I can't see how you can separate out the multiple lower frequencies that could be present. To do so, you would need more samples (i.e. for a longer period of time, not a higher frequency of samples). So my question is, how many samples do you pick to do the FFT on?
I think where you're lost is that the FFT window size determines the number (and therefore size) of the bins, while Nyquist determines the maximum frequency that can be binned.
If you have a window of 8192 samples, for example, that gets you 4096 bins. If you're sampling at 44100hz, those 4096 bins get you bin sizes of ~5.38hz per bin. So the lowest frequency range you can identify would be 0-5.38hz, then 5.38hz-10.76hz, etc, up until 22044hz-22050hz. And a window of 8192 samples at 44100hz works out to about 186ms per FFT window, so you can draw a spectrum with a resolution of 4096 bins per 186ms of audio. If you want a finer time resolution, you have to give up some frequency resolution by using a smaller window, which leaves you with fewer, wider bins.
The actual results vary by FFT implementation, and it's common to have windowing inside the FFT that I don't really understand, but it has something to do with the accuracy of those bins: it prevents leakage from one bin into another. "Hamming window" is probably the most common and usually happens by default, but that's a different window from the window you're sliding through the time domain to take FFTs on and plot.
As a user of FFT, at least for audio stuff (RF might be different), you mostly just think in terms of bin size.
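The arithmetic above, as a quick sanity check (plain Python, numbers from the 8192-sample example):

```python
sr = 44100          # sample rate in hz
n = 8192            # FFT window length in samples

bins = n // 2                # usable bins up to Nyquist
bin_width = sr / n           # hz per bin
window_ms = n / sr * 1000    # duration of one FFT window

print(bins)                 # 4096
print(round(bin_width, 2))  # 5.38
print(round(window_ms))     # 186
```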
edit: I did it wrong, fixed the numbers (I think)
I was going to say that the bins are distributed logarithmically, so they aren't uniformly-sized like this. But I did some research and I guess I was wrong and they are uniformly sized, so I learned something.
> it's common to have windowing inside the FFT that I don't really understand
The FFT turns a signal from the time domain to the frequency domain. To do that, the math assumes that the signal is unchanged for all time. In other words, it treats that chunk of samples you give it as looping infinitely backwards and forwards in time.
But the set of samples you gave it are a segment from a signal that does change over time. So when you loop it, you'll get discontinuities.
For example, let's say your signal is a single sine wave whose period is twelve samples:
    --.        .--.        .--.        .--
       \      /    \      /    \      /
        \    /      \    /      \    /
         '--'        '--'        '--'
But say the chunk of samples you hand the FFT cuts off partway through a cycle. When the FFT treats that chunk as repeating forever, the signal it "sees" has a jump at every seam:
    --.       |--.       |--.       |
       \      |   \      |   \      |
        \    /|    \    /|    \    /|
         '--' |     '--' |     '--' |
Windowing basically fades out the edges of each segment to reduce those discontinuities so that you don't get artifacts in the results. There are a bunch of different ways to do it because they're all sort of hacks, each balancing the removal of bogus artifacts against masking actual signal that happens to occur near the edge of the segment.
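Here's a small numpy demonstration of the effect (my own toy; 100.5hz is chosen deliberately so the 256-sample segment doesn't contain a whole number of cycles):

```python
import numpy as np

sr = 1000
n = 256
t = np.arange(n) / sr
x = np.sin(2 * np.pi * 100.5 * t)  # doesn't loop cleanly at the segment edges

raw = np.abs(np.fft.rfft(x))                     # rectangular (no) window
windowed = np.abs(np.fft.rfft(x * np.hanning(n)))  # Hann-windowed

def leakage(mag):
    """Fraction of energy landing more than 5 bins away from the peak."""
    peak = np.argmax(mag)
    far = np.concatenate([mag[:max(peak - 5, 0)], mag[peak + 6:]])
    return far.sum() / mag.sum()

print(leakage(raw) > leakage(windowed))  # True: windowing cuts the leakage
```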
I thought that too until I looked it up. The reason is that we're used to seeing it drawn on a logarithmic scale on spectrum analyzers.
Btw, those are some neat ascii graphs. What tool do you use for that?
Also, good explanation :)
Yes, that's exactly why I was confused. :)
> What tool do you use for that?
Sublime Text and patience.