
Granular Audio Synthesis - unsatchmo
https://blog.demofox.org/2018/03/05/granular-audio-synthesis/
======
jedimastert
I've seen a lot of people commenting about the artifacts you hear when the
samples are stretched. These happen because of phasing issues, where
frequencies in each of the grains are interfering with one another.

I'm surprised I don't see it mentioned here, but there's a rather interesting
extension to this technique made by Paul Nasca[0], which mitigates these
artifacts by (1) carefully choosing the size and placement of grains and
(2) randomly changing the phase of each grain before recombining. You can see
the algorithm here[1].

The results are absolutely incredible. You can end up slowing a sample down by
800% or more with no artifacts. For example, here[2] is the Windows 95 startup
sound extended to be a little over 6 minutes long. The reverb you hear isn't
added; that's just what it sounds like.

Also, if you didn't notice from the page, it's one of the default plug-ins in
Audacity.

[0]: [http://www.paulnasca.com/](http://www.paulnasca.com/)
[1]: [http://www.paulnasca.com/algorithms-created-by-me#TOC-PaulSt...](http://www.paulnasca.com/algorithms-created-by-me#TOC-PaulStretch-extreme-sound-stretching-algorithm)
[2]: [https://www.youtube.com/watch?v=FsJdplLB1Bs](https://www.youtube.com/watch?v=FsJdplLB1Bs)
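For anyone curious what "randomly changing the phase" looks like concretely, here's a rough sketch of that one step (my own simplified C++, not Nasca's code; it uses a naive DFT for clarity where a real implementation would use an FFT):

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of the core idea: transform a grain to the frequency domain,
// keep every bin's magnitude but give it a random phase, transform back.
// The magnitude spectrum -- what we hear -- survives, while the phase
// relationships that cause inter-grain interference get scrambled.
std::vector<float> RandomizeGrainPhase(const std::vector<float>& grain)
{
    const std::size_t n = grain.size();
    const float pi = 3.14159265358979f;
    std::mt19937 rng{1234};
    std::uniform_real_distribution<float> randomPhase(0.0f, 2.0f * pi);

    // Naive O(n^2) DFT for illustration on small grains.
    std::vector<std::complex<float>> spectrum(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            spectrum[k] += grain[t] * std::polar(1.0f, -2.0f * pi * k * t / n);

    // Randomize phases of the positive-frequency bins and mirror the
    // conjugate into the negative bins so the inverse stays real-valued.
    for (std::size_t k = 1; k < (n + 1) / 2; ++k)
    {
        spectrum[k] = std::polar(std::abs(spectrum[k]), randomPhase(rng));
        spectrum[n - k] = std::conj(spectrum[k]);
    }

    // Inverse DFT back to the time domain.
    std::vector<float> out(n, 0.0f);
    for (std::size_t t = 0; t < n; ++t)
    {
        std::complex<float> acc{0.0f, 0.0f};
        for (std::size_t k = 0; k < n; ++k)
            acc += spectrum[k] * std::polar(1.0f, 2.0f * pi * k * t / n);
        out[t] = acc.real() / float(n);
    }
    return out;
}
```

Since only phases change, the grain's energy and magnitude spectrum come out untouched; the waveform shape is what gets scrambled.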

~~~
mushishi
You made my day, thank you! I've been using Ableton Live's warping mechanism
with Complex Pro settings, and this seems like a really promising alternative.

~~~
MrScruff
There's a recently released VST/AU version of Paul's Stretch under active
development.

[https://xenakios.wordpress.com/paulxstretch-plugin/](https://xenakios.wordpress.com/paulxstretch-plugin/)

~~~
mushishi
Nice, first time I used a VST. Thank you!

------
mgeorgoulo
Very good results and embarrassingly easy to implement!

The very stretched waveform did contain some audible artifacts, but I think
other methods like FFT would introduce some as well.

This kind of trick works because our hearing is frequency-based. So the
crucial thing is to preserve the frequencies; do that, and it will sound
essentially the same.

Spatial mapping of frequencies in the human ear here:
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2394499/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2394499/)
(see fig 5.)

Trying this with an image for example wouldn't work, because our vision is
sample-based. Imagine splitting an image in tiny fragments and
repeating/interpolating them on top of one another.

------
jancsika
> What this does is make it so you can put any grain next to any other grain,
> and they should fit together pretty decently. This gives you C0 continuity
> by the way, but higher order discontinuities still affect the quality of the
> result. So, while this method is fast, it isn’t the highest quality. I
> didn’t try it personally, so am unsure how it affects the quality in
> practice.

It's not just about continuity. It also removes an entire set of concerns from
the process.

For example-- suppose someone analyzes an audio recording, splits it into
grains, then does some fancy re-organization based on the timbral content of
the recording/grains.

Now suppose they are subjectively unhappy with the result. Perhaps it sounds
"wimpy," "fluttery," or some other such vague complaint. Is that sound due to
a) their process of re-organizing the grains, b) the quality of the original
recording, c) the envelopes they used, or d) something else entirely?

If instead one uses grains which begin and end at zero, the answer can't be
(c) because the envelope doesn't exist. I can say that the quality sounds fine
in the few examples I've heard that use this technique.

I'd imagine the reason the latter isn't used as often is because it's simply
more difficult to program if each grain can be an arbitrary size (or at least
not quantized).
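To make the zero-boundary idea concrete, here's a rough sketch of slicing grains at upward zero crossings (my own illustrative C++, not from any of the examples I've heard):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Slice a signal into grains that each begin and end at an upward zero
// crossing. Any two such grains can be butted together without a C0
// discontinuity, so no envelope is needed just to hide clicks.
std::vector<std::vector<float>> SliceAtZeroCrossings(const std::vector<float>& input)
{
    // Collect indices where the signal crosses zero going upward.
    std::vector<std::size_t> crossings;
    for (std::size_t i = 1; i < input.size(); ++i)
        if (input[i - 1] < 0.0f && input[i] >= 0.0f)
            crossings.push_back(i);

    // Each grain spans one crossing to the next. Grain lengths vary,
    // which is exactly what makes this harder to program than
    // fixed-size grains.
    std::vector<std::vector<float>> grains;
    for (std::size_t c = 0; c + 1 < crossings.size(); ++c)
        grains.emplace_back(input.begin() + crossings[c],
                            input.begin() + crossings[c + 1]);
    return grains;
}
```

For a periodic input every grain is roughly one pitch period, which is why the variable sizes are unavoidable rather than a design choice.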

------
aidenn0
I think the speed-up sounds _much_ better than the slow-down. With the slow-
down there are very noticeable artifacts; I'm not sure if that's because of
the envelope they chose or just because repeating a grain adds harmonics.

------
vladimirralev
As far as I see this is basically naive TDHS (Time Domain Harmonic Scaling).
It's a great starter project as an intro to audio-effect coding, since you can
visually observe where you go wrong and where the noise comes from at the
edges. Just great for learning how audio works for beginners. It's very rare
to have an audio effects algorithm so cool and so easy to observe without
special analysis tools.

Some more famous algorithms that work this way and are similarly easy to
implement are TDHS and PSOLA. They all work in the time domain but find
different ways to smooth out the discontinuities and to make more extreme
shifts sound better.
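For anyone who wants to see how little code the time-domain approach takes, here's a toy overlap-add stretch in the same spirit (my own sketch, not TDHS or PSOLA proper):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy time-domain stretch: read Hann-windowed grains from the input at
// one hop size, lay them down in the output at another, and overlap-add.
// stretch > 1 slows the sound down without lowering its pitch.
std::vector<float> NaiveTimeStretch(const std::vector<float>& input,
                                    float stretch, std::size_t grainSize)
{
    const std::size_t synthesisHop = grainSize / 2;    // 50% overlap in the output
    const float analysisHop = synthesisHop / stretch;  // how fast we walk the input
    const float pi = 3.14159265358979f;

    std::vector<float> output(std::size_t(input.size() * stretch) + grainSize, 0.0f);

    for (std::size_t g = 0;; ++g)
    {
        const std::size_t inPos = std::size_t(g * analysisHop);
        const std::size_t outPos = g * synthesisHop;
        if (inPos + grainSize > input.size())
            break;

        // Hann window each grain; at a half-grain hop the overlapped
        // windows sum to exactly 1, so the output level stays even.
        for (std::size_t i = 0; i < grainSize; ++i)
        {
            const float w = 0.5f - 0.5f * std::cos(2.0f * pi * i / grainSize);
            output[outPos + i] += input[inPos + i] * w;
        }
    }
    output.resize(std::size_t(input.size() * stretch));
    return output;
}
```

The discontinuity noise lives entirely in how neighboring grains line up, which is the part you can literally see in a waveform view -- and the part TDHS and PSOLA each refine in their own way.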

------
amelius
Perhaps a better way of looking at it is this. Basically, a sound triggers
hair cells in the ear. A single harmonic tone triggers a single group of hair
cells. Through modeling, you can compute which hair cells are triggered at
what moment for a given signal. Your task is then to compute a new signal for
which the same hair cells are triggered, but faster.

~~~
yoklov
I think their description is much more actionable than yours, to be honest.

~~~
amelius
Yes, it would require some math, but I suspect you'd get superior results. For
example, you can replace the "modeling" with an FFT applied to small
time-slices (i.e., this determines which hair cells get triggered at a given
time). Now you have to stitch these slices together without introducing
spurious frequencies, which is the difficult part.

~~~
Atrix256
Are you trolling us? The ear drum is how you hear...

~~~
khedoros1
I'm not sure what about that post sounded like a troll to you. It sounds like
a reasonable approach, to me.

~~~
Atrix256
"hair cells in the ear" isn't how hearing works. That's how balance works :/

------
jeffreyrogers
This is pretty neat. One frustrating thing I found while doing some audio
programming recently is how hard it was working with different audio formats.
Most of the libraries I found for doing so were GPL or required a commercial
license.

~~~
Atrix256
Does this source code help you much with that? It only deals with wave files
but can read / write 8, 16, 24 or 32 bit wave files, at whatever sample rate,
with however many channels.

I really wish someone would make a header-only C++ audio library, that would
be soooo nice.

~~~
jeffreyrogers
Wave files are pretty easy to deal with because the format is simple and the
data isn't usually compressed. It's all the other formats that make this hard.
Actually, it's probably not that hard, but parsing file formats isn't really a
fun programming task (for me at least).
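For reference, the canonical 16-bit PCM wave header really is tiny: 44 bytes. A sketch of building one (my own code; assumes a little-endian host, which matches the on-disk byte order anyway):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Build the canonical 44-byte header for an uncompressed 16-bit PCM
// wave file. The sample bytes follow immediately after these 44 bytes.
std::vector<uint8_t> MakeWavHeader(uint32_t sampleRate, uint16_t numChannels,
                                   uint32_t numDataBytes)
{
    const uint16_t bitsPerSample = 16;
    const uint16_t blockAlign = numChannels * bitsPerSample / 8;  // bytes per frame
    const uint32_t byteRate = sampleRate * blockAlign;

    std::vector<uint8_t> h(44);
    auto put32 = [&](std::size_t off, uint32_t v) { std::memcpy(&h[off], &v, 4); };
    auto put16 = [&](std::size_t off, uint16_t v) { std::memcpy(&h[off], &v, 2); };

    std::memcpy(&h[0], "RIFF", 4);
    put32(4, 36 + numDataBytes);     // size of everything after this field
    std::memcpy(&h[8], "WAVE", 4);
    std::memcpy(&h[12], "fmt ", 4);  // note the trailing space
    put32(16, 16);                   // fmt chunk payload size
    put16(20, 1);                    // format tag 1 = uncompressed PCM
    put16(22, numChannels);
    put32(24, sampleRate);
    put32(28, byteRate);
    put16(32, blockAlign);
    put16(34, bitsPerSample);
    std::memcpy(&h[36], "data", 4);
    put32(40, numDataBytes);
    return h;
}
```

Compressed formats (mp3, ogg, flac) are where the library dependencies and licensing pain really start.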

------
recentdarkness
Title should be “Granular Audio Synthesis”

Couldn’t find anything about C++ in that article on a quick scan - feel free
to correct me

~~~
FraKtus
It's here
[https://github.com/Atrix256/GranularSynth](https://github.com/Atrix256/GranularSynth).
But unfortunately it does not go very deep into detail, and there is no real
motivation for why C++ would be cool for that processing...

~~~
Atrix256
I'm the author and didn't make this post on YC, but yes, the implementation is
in 680 lines of standalone C++.

The point of the article isn't about c++ or why it's a good language for doing
this sort of thing, but I'm a real time graphics and game engine programmer,
so it's my language of choice.

~~~
jcelerier
> typedef uint16_t uint16;

Does adding that little "_t" hurt you that much? Think that now every
programmer reading your code will wonder "hmm... what is that uint16 type? Is
it equivalent to uint16_t? Is it some weird macro designed to accommodate
1990s compilers?" etc. etc.

> const char *fileName

for the love of all things holy, use std::string_view for non-performance-
critical stuff like this.

More generally, the only remotely C++-like thing in your code is the use of
std::vector. The rest is honestly more C than C++. This shows for instance in
SampleChannelFractional with that ugly #if 0. Proper C++ design would instead
have SampleChannelFractional be a function object that you could pass as a
template argument: this way, the user can choose which implementation they
want without requiring a recompilation, and without indirection cost.

In addition, if you change some parts of your code, the C++ compiler will be
able to check both code paths directly.

That is:

    
    
        struct LinearSampleChannelFractional
        {
          float operator()(const std::vector<float>& input
                         , float sampleFloat
                         , uint16 channel
                         , uint16 numChannels)  
          {
            // your linear implementation here
          }
        };
        
        struct CubicSampleChannelFractional
        {
        
          float operator()(const std::vector<float>& input
                         , float sampleFloat
                         , uint16 channel
                         , uint16 numChannels)  
          {
            // your cubic implementation here
          }
         
          private:
             float CubicHermite (float A, float B, float C, float D, float t)
             { 
                // encapsulate it here: 
                // the rest of your code does not care about this function.
             }
        };
        
        // First modification: pass Fractional as template argument
        template<typename Fractional>
        void TimeAdjust (...)
        {
          // replace SampleChannelFractional: 
          output[...] = Fractional{}(input, srcSampleFloat, channel, numChannels);
        }
    
        template<typename Fractional>
        void SplatGrainToOutput (...)
        {
          // same
          output[...] = Fractional{}(...);
        }
        
        // Second modification: refactor this in a Granulator class of some sorts...
        // and use the standard C++ naming convention    
        template<typename Fractional>
        class granulator
        {
          public:        
            void time_adjust (...)
            { 
              // uses the class template argument
              output[...] = Fractional{}(input, srcSampleFloat, channel, numChannels);
            }
            
            void splat_grain_to_output (...)
            {
              // same
              output[...] = Fractional{}(...);
            }
            
            void granular_time_pitch_adjust (...)
            {
               // ... splat_grain_to_output(...)
            }
        };

        int main()
        {
            // now both kinds can be used at the same time in your code, and both will be 
            // just as efficient as with the #if 1 ; for instance the choice 
            // of the granulator to use can then be part of a configuration option 
            // in a GUI software
            granulator<cubic_sample_channel_fractional> gran1;
            gran1.time_adjust(source, out, numChannels, 0.7f);
            
            granulator<linear_sample_channel_fractional> gran2;
            gran2.time_adjust(source, out, numChannels, 0.7f);
        }

~~~
bstamour
One other minor nitpick: know the difference between vector.resize vs
vector.reserve. If all you're doing is copying new data into the vector after
sizing it, use reserve. This avoids default-constructing all of the values
inside of it, only to overwrite them with the new values you're copying in. In
the case of primitive types it's probably not a big deal, but it's still doing
a second pass over the data just to set it to zero before copying the contents
of the file.
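To spell out the difference (an illustrative sketch with hypothetical function names):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// resize() value-initializes n zeroed floats which we then overwrite by
// index -- every element gets written twice.
std::vector<float> LoadWithResize(const float* src, std::size_t n)
{
    std::vector<float> v;
    v.resize(n);                 // allocates AND zeroes n floats
    for (std::size_t i = 0; i < n; ++i)
        v[i] = src[i];           // second write per element
    return v;
}

// reserve() only allocates capacity; the vector stays empty and we
// append, so each element is written exactly once.
std::vector<float> LoadWithReserve(const float* src, std::size_t n)
{
    std::vector<float> v;
    v.reserve(n);                // one allocation, no zeroing
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(src[i]);     // single write per element
    return v;
}
```

Both produce identical vectors; only the number of writes per element differs.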

~~~
jcelerier
> This avoids default-constructing all of the values inside of it, only to
> overwrite them with the new values you're copying in.

I think I saw a few benchmarks once showing that for primitive types such as
int, it was actually more efficient to resize(); only past 32- or 64-byte
structs did reserve() become more interesting. In any case, when nitpicking on
this it's even better to use boost::container::vector, which allows an
uninitialized resize:
[http://www.boost.org/doc/libs/1_66_0/doc/html/container/exte...](http://www.boost.org/doc/libs/1_66_0/doc/html/container/extended_functionality.html#container.extended_functionality.default_initialialization).

~~~
bstamour
That's interesting, and a bit counter-intuitive IMO. But I guess that's why we
measure things. The reserve operation is essentially just a single
reallocation, whereas the resize is a reallocation plus a bunch of zeroing
out. Strange.

------
luk32
How does granular analysis differ from PCM representation, Fournier
transformation, and sampling? Or is it a different name for the same thing? I
think it's natural to whoever has worked with sound on a PC.

It's probably debatable, but I don't agree with the statement that shortening
the "sound" changes pitch. It depends on your representation of the sound. If
you represent it as a function of amplitude vs. time, then scaling the time
axis does change pitch.

This strikes a sensational tone about a fallacy. No instrument plays a sound
faster or slower to make it shorter or longer... it just stops playing it, or
doesn't. If one thinks about the phenomenon this way, it becomes natural why
you cannot compress time to play shorter sounds.

~~~
teilo
You don't seem to have a very good grasp of this subject and don't appear to
have read the article very carefully. The only viable alternative to PCM is
DSD, which failed to gain any traction for good reasons. So for all practical
purposes, sampling and PCM are the same thing. You also throw in Fourier (not
Fournier) transformation for good measure, which is relevant to additive
synthesis, but not to granular synthesis, which is the topic of this article.

> I don't agree with the statement that shortening the "sound" changes pitch.
> It depends on your representation of the sound. If you represent it as a
> function of amplitude vs time then scaling the time axis does change pitch.

The only relevant "representation" is digital audio, which by definition is
encoded as amplitude over time regardless of encoding technique. To lengthen
time without changing pitch or pitch without changing time requires
manipulation of the audio data. That manipulation is either done by granular
synthesis, or by utilizing a Fast Fourier Transform to decompose the audio
into its component waveforms, changing the frequencies or shortening the wave
components, and recomposing them back to a composite waveform. This article is
about granular synthesis, which requires far less computation than FFT.

> No instrument plays sound faster or slower to make it shorter or longer....

Irrelevant. We aren't dealing with physical instruments, but with digital
audio.

There is nothing in the least fallacious or sensational about this article.

~~~
tabtab
Fourier-based techniques (FBT) and sample slicing (SC) may be similar if doing
"raw" transformations, but FBT can potentially be cleaner, or at least easier
to clean up. If you use raw "bit-maps" for FBT, yes it will be choppy like SC,
but one can use regression or regression-like curve-fitting to give FBT smooth
time/frequency curves to synthesize against, sounding more natural. There are
down-sides to using regression, but for typical voice and music, those won't
matter much.

One rough area for curve-fitting is white-noise-esque sounds (WNES) like the
letter "s" or "h" and tambourines. The processor can perhaps detect if WNES
exceed a threshold, and use other techniques such as SC instead.

It's roughly comparable to JPEG versus GIF images. JPEG is better (more
faithful) at gradual shades, while GIF is better at edges. A better
compression algorithm would perhaps use each where it does best for a given
image, at the cost of algorithm complexity and compression/decompression
processing time.

