Granular Audio Synthesis (demofox.org)
153 points by unsatchmo 11 months ago | 33 comments

I've seen a lot of people commenting about the artifacts you hear when the samples are stretched. These happen because of phasing issues, where frequencies in each of the grains are interfering with one another.

I'm surprised I don't see it mentioned here, but there's a rather interesting extension to this technique made by Paul Nasca[0], which mitigates these artifacts by (1) carefully choosing the size and placement of grains and (2) randomly changing the phase of each grain before recombining. You can see the algorithm here[1].

The results are absolutely incredible. You can end up slowing a sample down by 800% or more with no artifacts. For example, here[2] is the Windows 95 startup sound extended to be a little over 6 minutes long. The reverb you hear isn't added; that's just what it sounds like.

Also, if you didn't notice from the page, it's one of the default plug-ins in Audacity.

[0]: http://www.paulnasca.com/ [1]: http://www.paulnasca.com/algorithms-created-by-me#TOC-PaulSt... [2]: https://www.youtube.com/watch?v=FsJdplLB1Bs
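For intuition, the phase-randomization step can be sketched like this (my own illustration, not Paul's actual code; names are mine, and a real implementation would use an FFT rather than this naive O(N²) DFT):

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <random>
#include <vector>

static const double kPi = 3.14159265358979323846;

// Naive O(N^2) DFT -- fine for illustrating the idea; use a real FFT in practice.
std::vector<std::complex<double>> Dft(const std::vector<double>& x) {
    const size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * kPi * double(k) * double(n) / double(N));
    return X;
}

// Keep every bin's magnitude but give it a random phase, preserving the
// conjugate symmetry that keeps the output real.  DC and Nyquist bins are
// left alone (they must stay real for a real signal).
std::vector<double> RandomizePhase(const std::vector<double>& grain, unsigned seed) {
    const size_t N = grain.size();
    std::vector<std::complex<double>> X = Dft(grain);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> phase(0.0, 2.0 * kPi);
    for (size_t k = 1; k < N - k; ++k) {
        X[k] = std::polar(std::abs(X[k]), phase(rng));
        X[N - k] = std::conj(X[k]);  // mirror bin
    }
    // Inverse DFT; the imaginary part cancels by symmetry.
    std::vector<double> out(N, 0.0);
    for (size_t n = 0; n < N; ++n) {
        std::complex<double> acc(0.0, 0.0);
        for (size_t k = 0; k < N; ++k)
            acc += X[k] * std::polar(1.0, 2.0 * kPi * double(k) * double(n) / double(N));
        out[n] = acc.real() / double(N);
    }
    return out;
}
```

The grain sounds like the original (same magnitude spectrum) but no longer lines up in phase with its neighbors, so overlapping copies stop interfering coherently.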

You made my day, thank you! I've been using Ableton Live's warping mechanism with Complex Pro settings, and this seems like a really promising alternative.

There's a recently released VST/AU version of Paul's Stretch under active development.


Nice, first time I used a VST. Thank you!

Very good results and embarrassingly easy to implement!

The very stretched waveform did contain some audible artifacts, but I think other methods like FFT would introduce some as well.

This kind of trick works because our hearing is frequency-based. So the crucial thing is to preserve the frequencies: if those are intact, it will sound essentially the same.

Spatial mapping of frequencies in the human ear here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2394499/ (see fig 5.)

Trying this with an image for example wouldn't work, because our vision is sample-based. Imagine splitting an image in tiny fragments and repeating/interpolating them on top of one another.

> What this does is make it so you can put any grain next to any other grain, and they should fit together pretty decently. This gives you C0 continuity by the way, but higher order discontinuities still affect the quality of the result. So, while this method is fast, it isn’t the highest quality. I didn’t try it personally, so am unsure how it affects the quality in practice.

It's not just about continuity. It also removes an entire set of concerns from the process.

For example-- suppose someone analyzes an audio recording, splits it into grains, then does some fancy re-organization based on the timbral content of the recording/grains.

Now suppose they are subjectively unhappy with the result. Perhaps it sounds "wimpy," "fluttery," or some other such vague complaint. Is that sound due to a) their process of re-organizing the grains, b) the quality of the original recording, c) the envelopes they used, or d) something else entirely?

If instead one uses grains which begin and end at zero, the answer can't be (c), because there are no envelopes. I can say that the quality sounds fine in the few examples I've heard that use this technique.

I'd imagine the reason the latter isn't used as often is because it's simply more difficult to program if each grain can be an arbitrary size (or at least not quantized).
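To make the "grains that begin and end at zero" idea concrete, here's a small sketch (function names are mine; variable-sized grains between zero crossings, no envelope needed):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices where the signal crosses zero going upward (negative -> non-negative).
// Cutting grains at these points gives C0 continuity for free, with no envelope.
std::vector<size_t> FindZeroCrossings(const std::vector<float>& samples) {
    std::vector<size_t> crossings;
    for (size_t i = 1; i < samples.size(); ++i)
        if (samples[i - 1] < 0.0f && samples[i] >= 0.0f)
            crossings.push_back(i);
    return crossings;
}

// Slice the signal into variable-sized grains between consecutive crossings.
// Every grain starts at (near) zero amplitude and ends just before one.
std::vector<std::vector<float>> SliceAtZeroCrossings(const std::vector<float>& samples) {
    std::vector<std::vector<float>> grains;
    std::vector<size_t> c = FindZeroCrossings(samples);
    for (size_t i = 0; i + 1 < c.size(); ++i)
        grains.emplace_back(samples.begin() + c[i], samples.begin() + c[i + 1]);
    return grains;
}
```

The bookkeeping cost shows up immediately: grain sizes are now data-dependent, which is exactly the extra complexity mentioned above.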

I think the speed-up sounds much better than the slow-down. With the slow-down there are very noticeable artifacts; I'm not sure if it's because of the envelope they chose or just because repeating a grain adds harmonics.

As far as I can see this is basically naive TDHS (Time Domain Harmonic Scaling). It's a great starter project as an intro to audio-effect coding, since you can visually observe where you went wrong and where the noise at the grain edges comes from. Just great for learning how audio works for beginners. It's very rare to have an audio effects algorithm so cool and so easy to observe without special analysis tools.

Some more famous algorithms that work this way and are similarly easy to implement are TDHS and PSOLA. They all work in the time domain but find different ways to smooth out the discontinuities and to make more extreme shifts sound better.
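The common core of these time-domain methods can be sketched in a few lines. This is an illustrative overlap-add stretch, not any particular paper's algorithm; names and parameters are mine:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal overlap-add time stretch: read grains from the input at a rate
// scaled by 1/stretch, window them, and overlap-add them into the output
// at a fixed hop.  Hann-windowed grains at 50% overlap sum to a constant,
// so the overall envelope stays flat.
std::vector<float> OlaStretch(const std::vector<float>& input,
                              float stretch,          // 2.0 = twice as long
                              size_t grainSize = 512) {
    const size_t hop = grainSize / 2;                 // 50% overlap
    const size_t outLen = size_t(input.size() * stretch);
    std::vector<float> output(outLen + grainSize, 0.0f);

    const double pi = 3.14159265358979323846;
    for (size_t outPos = 0; outPos < outLen; outPos += hop) {
        // The grain read position moves through the input more slowly
        // (or faster) than the write position moves through the output.
        size_t inPos = size_t(outPos / stretch);
        for (size_t i = 0; i < grainSize && inPos + i < input.size(); ++i) {
            float w = 0.5f - 0.5f * float(std::cos(2.0 * pi * double(i) / double(grainSize)));
            output[outPos + i] += w * input[inPos + i];
        }
    }
    output.resize(outLen);
    return output;
}
```

TDHS and PSOLA refine exactly this loop: they pick grain boundaries and hops pitch-synchronously so the discontinuities land where they do the least damage.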

Perhaps a better way of looking at it is this. Basically, a sound triggers hair cells in the ear. A single harmonic tone triggers a single group of hair cells. Through modeling, you can compute which hair cells are triggered at what moment for a given signal. Your task is then to compute a new signal for which the same hair cells are triggered but faster.

I think their description is a much more actionable description of it than yours, to be honest.

Yes, it would require some math, but I suspect you'd get superior results. For example, you can replace the "modeling" by the FFT transform applied to small time-slices (i.e., this determines which hair-cells get triggered at a given time). Now you have to stitch these slices together without introducing spurious frequencies, which is the difficult part.

Are you trolling us? The ear drum is how you hear...

I'm not sure what about that post sounded like a troll to you. It sounds like a reasonable approach, to me.

"hair cells in the ear" isn't how hearing works. That's how balance works :/

This is pretty neat. One frustrating thing I found while doing some audio programming recently is how hard it was working with different audio formats. Most of the libraries I found for doing so were GPL or required a commercial license.

Does this source code help you much with that? It only deals with wave files but can read / write 8, 16, 24 or 32 bit wave files, at whatever sample rate, with however many channels.

I really wish someone would make a header only C++ audio library, that would be soooo nice.

Wave files are pretty easy to deal with because the format is simple and the data isn't usually compressed. It's all the other formats that make this hard. Actually, it's probably not that hard, but parsing file formats isn't really a fun programming task (for me at least).
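As an illustration of why wave files are so approachable, here's a hedged sketch of a minimal 16-bit PCM writer (my own code, not the article's; canonical 44-byte header, little-endian host assumed, no error handling):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Minimal 16-bit PCM WAV writer: a canonical 44-byte header followed by
// little-endian samples.  Interleave channels in `samples` yourself.
bool WriteWav16(const std::string& fileName,
                const std::vector<int16_t>& samples,
                uint32_t sampleRate, uint16_t numChannels) {
    std::ofstream f(fileName, std::ios::binary);
    if (!f) return false;

    const uint32_t dataBytes  = uint32_t(samples.size() * sizeof(int16_t));
    const uint16_t blockAlign = numChannels * 2;
    const uint32_t byteRate   = sampleRate * blockAlign;
    const uint32_t riffSize   = 36 + dataBytes;   // file size minus 8 bytes
    const uint16_t pcm = 1, bits = 16;
    const uint32_t fmtSize = 16;

    f.write("RIFF", 4);
    f.write(reinterpret_cast<const char*>(&riffSize), 4);
    f.write("WAVE", 4);
    f.write("fmt ", 4);
    f.write(reinterpret_cast<const char*>(&fmtSize), 4);
    f.write(reinterpret_cast<const char*>(&pcm), 2);
    f.write(reinterpret_cast<const char*>(&numChannels), 2);
    f.write(reinterpret_cast<const char*>(&sampleRate), 4);
    f.write(reinterpret_cast<const char*>(&byteRate), 4);
    f.write(reinterpret_cast<const char*>(&blockAlign), 2);
    f.write(reinterpret_cast<const char*>(&bits), 2);
    f.write("data", 4);
    f.write(reinterpret_cast<const char*>(&dataBytes), 4);
    f.write(reinterpret_cast<const char*>(samples.data()), dataBytes);
    return bool(f);
}
```

Reading is the same layout in reverse, plus skipping any extra chunks some encoders insert between "fmt " and "data".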

libsndfile is LGPL

Thanks. I actually came across libsndfile, but for some reason thought it was GPL instead of LGPL. Ideally there'd be a BSD licensed library, but LGPL is usable.

Title should be “Granular Audio Synthesis”

Couldn’t find anything about C++ in that article on a quick scan - feel free to correct me

It's here: https://github.com/Atrix256/GranularSynth. But unfortunately it does not go very deep into detail, and there is no real motivation for why C++ would be a good fit for this processing...

I'm the author and didn't make this post on yc, but yes, the implementation is in 680 lines of standalone c++

The point of the article isn't about c++ or why it's a good language for doing this sort of thing, but I'm a real time graphics and game engine programmer, so it's my language of choice.

I really enjoyed this article, and especially how it dealt with audio directly rather than getting lost in library abstractions.

> typedef uint16_t uint16;

does adding that little "_t" hurt you that much? Consider that now every programmer reading your code will wonder "hmm... what is that uint16 type? is it equivalent to uint16_t? is it some weird macro designed to accommodate 1990s compilers?" etc etc.

> const char *fileName

for the love of all things holy, use std::string_view for non-performance-critical stuff like this.

More generally, the only remotely C++-like thing in your code is the use of std::vector. The rest is honestly more C than C++. This shows for instance in SampleChannelFractional with that ugly #if 0. Proper C++ design would have SampleChannelFractional be a function object that you could pass as a template argument: this way, the user can choose which implementation they want without requiring a recompilation, and without indirection cost.

In addition, if you change some parts of your code, the C++ compiler will be able to check both code paths directly.

That is:

    struct LinearSampleChannelFractional {
        float operator()(const std::vector<float>& input,
                         float sampleFloat,
                         uint16 channel,
                         uint16 numChannels) const {
            // your linear implementation here
        }
    };

    struct CubicSampleChannelFractional {
        float operator()(const std::vector<float>& input,
                         float sampleFloat,
                         uint16 channel,
                         uint16 numChannels) const {
            // your cubic implementation here
        }
        static float CubicHermite(float A, float B, float C, float D, float t) {
            // encapsulate it here:
            // the rest of your code does not care about this function.
        }
    };

    // First modification: pass Fractional as a template argument
    template<typename Fractional>
    void TimeAdjust(...) {
        // replaces SampleChannelFractional:
        output[...] = Fractional{}(input, srcSampleFloat, channel, numChannels);
    }

    template<typename Fractional>
    void SplatGrainToOutput(...) {
        // same
        output[...] = Fractional{}(...);
    }

    // Second modification: refactor this into a Granulator class of some sort,
    // and use the standard C++ naming convention
    template<typename Fractional>
    class granulator {
        void time_adjust(...) {
            // uses the class template argument
            output[...] = Fractional{}(input, srcSampleFloat, channel, numChannels);
        }
        void splat_grain_to_output(...) {
            // same
            output[...] = Fractional{}(...);
        }
        void granular_time_pitch_adjust(...) {
            // ... splat_grain_to_output(...)
        }
    };

    int main() {
        // now both kinds can be used at the same time in your code, and both
        // will be just as efficient as with the #if 1; for instance, the choice
        // of granulator can then be a configuration option in a GUI
        granulator<cubic_sample_channel_fractional> gran1;
        gran1.time_adjust(source, out, numChannels, 0.7f);
        granulator<linear_sample_channel_fractional> gran2;
        gran2.time_adjust(source, out, numChannels, 0.7f);
    }

One other minor nitpick: know the difference between vector.resize vs vector.reserve. If all you're doing is copying new data into the vector after sizing it, use reserve. This avoids default-constructing all of the values inside of it, only to overwrite them with the new values you're copying in. In the case of primitive types it's probably not a big deal, but it's still doing a second pass over the data just to set it to zero before copying the contents of the file.

> This avoids default-constructing all of the values inside of it, only to overwrite them with the new values you're copying in.

I think I saw a few benchmarks once showing that for primitive types such as int and the like, it was actually more efficient to resize(); only past 32- or 64-byte structs did reserve() become more interesting. In any case, when nitpicking on this, it's even better to use boost's vector, which allows an uninitialized resize: http://www.boost.org/doc/libs/1_66_0/doc/html/container/exte....

That's interesting, and a bit counter-intuitive IMO. But I guess that's why we measure things. The reserve operation is essentially just a single reallocation, whereas the resize is a reallocation plus a bunch of zeroing out. Strange.
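To make the resize/reserve distinction above concrete, here's a small sketch (function names are mine; both produce identical vectors, but via different element-construction patterns):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// resize() value-initializes n elements (zero-filling them for primitives),
// and then each one gets overwritten with the real data: two passes.
std::vector<float> LoadWithResize(const float* src, size_t n) {
    std::vector<float> v;
    v.resize(n);                      // allocates AND zero-fills
    for (size_t i = 0; i < n; ++i)
        v[i] = src[i];                // second pass over the same memory
    return v;
}

// reserve() only allocates capacity; insert/push_back then construct each
// element exactly once, directly from the source: one pass.
std::vector<float> LoadWithReserve(const float* src, size_t n) {
    std::vector<float> v;
    v.reserve(n);                     // one allocation, no zeroing pass
    v.insert(v.end(), src, src + n);  // elements constructed once, from src
    return v;
}
```

Whether the saved zeroing pass is measurable depends on element size and cache behavior, which matches the benchmark results mentioned above.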

Because this kind of processing needs to run fast, almost all resources and tutorials are in C/C++, and security is not a concern.

We've reverted the submitted title of "Granular Synthesis in C++" to that of the article.

How does granular analysis differ from PCM representation, Fournier transformation and sampling? Or is it a different name for the same thing? I think it's natural to whoever has worked with sound on a PC.

It's probably debatable, but I don't agree with the statement that shortening the "sound" changes pitch. It depends on your representation of the sound. If you represent it as a function of amplitude vs time, then scaling the time axis does change pitch.

This strikes a sensational tone over a fallacy. No instrument plays a sound faster or slower to make it shorter or longer... It just stops playing it or doesn't. If one thinks about the phenomenon this way, it becomes natural why you cannot compress time to play shorter sounds.

You don't seem to have a very good grasp of this subject and don't appear to have read the article very carefully. The only viable alternative to PCM is DSD, which failed to gain any traction for good reasons. So for all practical purposes, sampling and PCM are the same thing. You also throw in Fourier (not Fournier) transformation for good measure, which is relevant to additive synthesis, but not to granular synthesis, which is the topic of this article.

> I don't agree with the statement that shortening the "sound" changes pitch. It depends on your representation of the sound. If you represent it as a function of amplitude vs time then scaling the time axis does change pitch.

The only relevant "representation" is digital audio, which by definition is encoded as amplitude over time regardless of encoding technique. To lengthen time without changing pitch or pitch without changing time requires manipulation of the audio data. That manipulation is either done by granular synthesis, or by utilizing a Fast Fourier Transform to decompose the audio into its component waveforms, changing the frequencies or shortening the wave components, and recomposing them back to a composite waveform. This article is about granular synthesis, which requires far less computation than FFT.

> No instrument plays sound faster or slower to make it shorter or longer....

Irrelevant. We aren't dealing with physical instruments, but with digital audio.

There is nothing in the least fallacious or sensational about this article.

Fourier-based techniques (FBT) and sample slicing (SC) may be similar if doing "raw" transformations, but FBT can potentially be cleaner, or at least easier to clean up. If you use raw "bit-maps" for FBT, yes it will be choppy like SC, but one can use regression or regression-like curve-fitting to give FBT smooth time/frequency curves to synthesize against, sounding more natural. There are down-sides to using regression, but for typical voice and music, those won't matter much.

One rough area for curve-fitting is white-noise-esque sounds (WNES) like the letter "s" or "h" and tambourines. The processor can perhaps detect if WNES exceed a threshold, and use other techniques such as SC instead.

It's roughly comparable to JPEG versus GIF images. JPEG is better (more faithful) at gradual shades while GIF is better at edges. A better compression algorithm perhaps would use each where it does best per given image. However, at the cost of algorithm complexity and compression/decompression processing time.

By playing a sound faster I mean changing its sample rate, without doing anything else.
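For reference, here's a hedged sketch of what "just changing the sample rate" amounts to (my own illustration; a linear-interpolation resampler). Playing at rate 2.0 halves the duration and raises every frequency an octave, which is exactly the time/pitch coupling granular synthesis exists to break:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive "play faster": read through the input at `rate` times normal speed,
// linearly interpolating between neighboring samples.  Duration shrinks by
// `rate`, and every frequency in the signal scales up by `rate` with it.
std::vector<float> Resample(const std::vector<float>& input, float rate) {
    std::vector<float> output;
    if (input.size() < 2 || rate <= 0.0f)
        return output;
    for (double pos = 0.0; pos < double(input.size() - 1); pos += rate) {
        size_t i = size_t(pos);
        float t = float(pos - double(i));   // fractional position in [0, 1)
        output.push_back(input[i] * (1.0f - t) + input[i + 1] * t);
    }
    return output;
}
```

A tape machine or a turntable running fast does the analog equivalent of this loop, which is why sped-up recordings sound chipmunk-like.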
