DDSP: Differentiable Digital Signal Processing (github.com/magenta)
264 points by matt_d on Jan 15, 2020 | 31 comments



Wow what a coincidence! I got excited because I thought someone had posted my project which is also called Ddsp! Mine stands for D Digital Signal Processing since it is written in D.

Just going to shamelessly post it here. https://github.com/ctrecordings/ddsp

It's always great to see more DSP projects though!


Sorry about that :), definitely an interesting coincidence. Funny enough, we were actually thinking of naming it dDSP (as in the derivative of DSP) but had already submitted the ICLR paper, so we just stuck with all caps (also because the python package name is all lowercase).


It would be very interesting if someone starts a project on 3DSP, Differentiable D Digital Signal Processing.

The Swift community is currently working on its own differentiable programming initiative [1]:

https://github.com/apple/swift/blob/master/docs/Differentiab...


arXiv link for more context: https://arxiv.org/abs/2001.04643

and blog posting with audio examples: https://magenta.tensorflow.org/ddsp


Are you folks planning on extending this to speech? I've always been disappointed by how speech vocoder networks aren't built with any great inductive biases for waveform generation (besides very long receptive fields), and have desperately wanted something like this tuned for speech. It'd be great if a DSP-based architecture could be shown to outperform WaveNet / Parallel WaveNet / WaveRNN / WaveFlow / etc., and I'd love to use that in our own work. (There have been some attempts based on source-filter models like the "neural source filter (NSF) network", but nothing's caught on as best I can tell.)


I have a few questions for the authors:

- (w.r.t time varying FIRs) How did your results compare to traditional NLMS/adaptive approaches? Were you able to achieve similar results with fewer CPU cycles/lower filter order?

- (also w.r.t FIRs) Have you looked at your approach as a more general/nonlinear model of adaptive filtering?

- How do you deal with highly correlated parameters in your models?

- (w.r.t dereverberation) How does your approach compare in fidelity and performance to homomorphic filtering approaches for deconvolution?


Hi, I'm Jesse, one of the authors. Thanks for the interesting questions!

- In terms of the FIRs, I think you can view this as a form of more general/nonlinear filter modeling. The difference is that you can have a filter as one of several components and adapt them all jointly to achieve some task, which itself can be defined more flexibly (different losses, adversarial training, etc.). The filter itself is still just an LTV-FIR, but it's being controlled nonlinearly. We've only examined synthesis so far, but other signal processing problems like denoising are definitely good directions. The "effects" processors are designed for this.

- It's true that neural networks often learn correlated parameters, but it's usually of less significance because they operate in an overparameterized "interpolative" regime; there's a lot of interesting ongoing research trying to understand that.

- We didn't do a quantitative comparison, but in general the tradeoffs will be different. Dereverberation by a modular generative model will only sound as good as the generative model itself, so artifacts will be from not modeling the source properly. However, if you learn a good model, the dereverberation should be essentially perfect (you can losslessly apply different reverb), although that's a big if.
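
To make the dereverberation point concrete, the structure is roughly the following (a toy numpy sketch with made-up names, not our training code; the real model is a neural synthesizer plus a learned reverb, but the structural idea is the same):

    import numpy as np

    def synth(controls):
        # Stand-in for the learned synthesizer (harmonic + noise in the paper).
        return np.sin(2 * np.pi * np.cumsum(controls))

    learned_ir = np.random.randn(2048) * 0.01    # jointly learned room impulse response

    def model(controls):
        # What gets fit to the (reverberant) recording during training.
        return np.convolve(synth(controls), learned_ir)

    def dereverbed(controls):
        # Same trained model with the reverb module simply removed.
        return synth(controls)

Because the reverb is an explicit module, removing it (or swapping in a different impulse response) is exact, so the quality is limited only by how well the synthesizer itself models the dry source.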


Thanks for the reply! This work is fascinating and while I'm not a python guy I'm going to play with your library a bunch.

I do think you should investigate comparisons to adaptive FIRs much more. This field is critical to the design of low power medical devices like hearing aids, which need feedback reduction, echo cancellation, and the like with minimal filter orders.

My question on correlated parameters was a bit more abstract. Often in the design of classical audio signal processors for creative applications you find that the user-space parameters are correlated, and they map down to design-space parameters that are more correlated still, and then to implementation-level parameters that are even more correlated. For example, in a filter designed by frequency sampling, the adjacent bins of an FFT are highly correlated in their effect on the I/O, and I was curious whether you optimized a bit by reparameterizing with a DCT or a similar transform, like you'd find in calculating MFCCs. It's really tough to design ML approaches for creative signal processing that beat traditional methods because of this: humans learn and adapt to correlations very quickly, machines not so much when dealing with oscillation and ripple. Many local extrema in the parameter space and all that.
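
Concretely, the kind of reparameterization I mean is something like this (a numpy/scipy sketch of the idea, nothing from the paper): have the model predict a handful of low-order DCT coefficients instead of raw per-bin magnitudes, the same cepstral trick used in MFCCs, so adjacent bins vary smoothly instead of fighting each other.

    import numpy as np
    from scipy.fft import idct

    n_bins = 65      # half-spectrum bins for a frequency-sampled FIR
    n_coeffs = 16    # low-order DCT coefficients the model would predict

    dct_coeffs = np.random.randn(n_coeffs)                # stand-in for model output
    log_mag = idct(dct_coeffs, n=n_bins, norm='ortho')    # smooth log-magnitude response
    magnitudes = np.exp(log_mag)                          # per-bin filter magnitudes

    # Frequency-sampling design: back to an FIR kernel.
    kernel = np.fft.irfft(magnitudes)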


Adaptive IIR would be more interesting, as automatically controlling and designing those filters in a stable way is rather hard, and they're both differentiable and power efficient. Especially anything that isn't a biquad series; and because these filters have recursion-related computational noise, the ANN should be able to optimize some of it out.


If the FIR filters are time variant, what determines the values of the kernel at any given moment - is it the prior values of the signal being generated or something else?

For IIR filters, I imagine the network would have to be recurrent. If you train the weights with labeled data, does that amount to a different mechanism for designing a filter, sort of like exchanging bilinear transformation for a data-driven approach? The former seems like forward engineering from first principles, whereas the latter seems like reverse engineering from data that exhibits transformations that you want the filter to encode. Fascinating stuff.


I imagine the authors keep filter order as a design parameter, since dynamically changing order is a rather tricky proposition.

For IIRs the approach (imo) would be to use transformed analog biquads (such as an SVF topology, via TPT/Zavalishin's method), which are designed to handle the time-variance issues with state evolution. From there your model wouldn't synthesize filter coefficients directly, but either the design parameters of the filter or the output-mixing gains of multi-mode filters built from cascaded biquad sections. A good example of this in practice is the Oberheim "pole mixing" filter (analog), or the Shruthi's variation on it, which is a 4th-order filter that achieves a wide array of frequency responses by mixing the outputs of each stage. Those gains in the mixer can be varied for very cool results.
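
For concreteness, the per-sample update of a TPT SVF is only a few lines; a rough Python sketch (second order rather than the 4-pole mixer, with per-sample cutoff and made-up mixing-gain arguments):

    import numpy as np

    def tpt_svf(x, cutoff, res, sr=48000, mix_lp=1.0, mix_bp=0.0, mix_hp=0.0):
        # Zavalishin-style TPT state variable filter with per-sample cutoff,
        # so a model can modulate it freely without the usual time-variance blowups.
        k = 2.0 - 2.0 * res            # damping, res in [0, 1)
        ic1, ic2 = 0.0, 0.0            # integrator states
        y = np.zeros_like(x)
        for n in range(len(x)):
            g = np.tan(np.pi * cutoff[n] / sr)
            a1 = 1.0 / (1.0 + g * (g + k))
            v1 = a1 * (ic1 + g * (x[n] - ic2))   # bandpass
            v2 = ic2 + g * v1                    # lowpass
            ic1 = 2.0 * v1 - ic1
            ic2 = 2.0 * v2 - ic2
            hp = x[n] - k * v1 - v2              # highpass
            y[n] = mix_lp * v2 + mix_bp * v1 + mix_hp * hp
        return y

A model could predict cutoff, resonance, and the mix gains per frame instead of raw difference-equation coefficients, which keeps the filter stable by construction.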


Hi, yup that's true, we keep the filter order fixed. For the experiments in the paper, the time-varying coefficients are generated by a neural network that is trained end-to-end to generate audio like the training set (conditioned on high-level controls such as pitch and loudness).
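
Schematically, the control flow looks something like this (a toy numpy sketch of the structure with made-up names; the actual library does this in TensorFlow so gradients flow through the whole pipeline):

    import numpy as np

    def control_net(features, n_taps=64):
        # Stand-in for the decoder network: maps per-frame features
        # (e.g. pitch, loudness) to per-frame FIR coefficients.
        # In the real model these weights are what gets trained.
        w = np.random.randn(features.shape[-1], n_taps) * 0.01
        return features @ w

    def ltv_fir(audio, features, frame_size=512):
        # Time-varying FIR: filter each frame with its own kernel, then
        # overlap-add. The filter stays linear (LTV-FIR); only the mapping
        # from features to taps is nonlinear.
        taps = control_net(features)                 # [n_frames, n_taps]
        n_frames, n_taps = taps.shape
        out = np.zeros(n_frames * frame_size + n_taps)
        for i in range(n_frames):
            frame = audio[i * frame_size:(i + 1) * frame_size]
            filtered = np.convolve(frame, taps[i])
            out[i * frame_size:i * frame_size + len(filtered)] += filtered
        return out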

I agree that IIRs are a great avenue for future study, also with time-varying coefficients. I've played around a bit with them, but they are harder to train efficiently with current autodiff software and GPUs/TPUs. I think they may require writing a custom CUDA kernel, but I'm hopeful about things like JAX's scan operation.
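
For example, a time-varying one-pole is easy to express with scan; a toy JAX sketch (nothing to do with our codebase), where the per-sample coefficients could come from a network:

    import jax
    import jax.numpy as jnp

    def tv_one_pole(x, a):
        # y[n] = (1 - a[n]) * x[n] + a[n] * y[n-1], with per-sample
        # smoothing coefficients a[n].
        def step(y_prev, inputs):
            x_n, a_n = inputs
            y_n = (1.0 - a_n) * x_n + a_n * y_prev
            return y_n, y_n
        _, y = jax.lax.scan(step, jnp.zeros(()), (x, a))
        return y

    x = jnp.ones(16000)
    a = jnp.full(16000, 0.99)
    y = tv_one_pole(x, a)
    # The whole recursion is differentiable, e.g. w.r.t. the coefficients:
    grads = jax.grad(lambda a: jnp.sum(tv_one_pole(x, a) ** 2))(a)

It works, it's just slow relative to the big parallel FFT/convolution ops, which is the efficiency problem mentioned above.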


Fantastic project!

For someone looking to learn more about signals and audio processing (DFTs, FFTs, etc.), is there any good material people can recommend? I've completely forgotten my uni CompEng signals course and want to revisit it.


I don't have links, but I have some advice that may be contrary to what a CS major might say. I recommend against a CS-first approach. Focus on theoretical fundamentals. Take "DFT" and "FFT" out of your lexicon -- those are not important right now. Forget about the D and the P in DSP. Focus on the S: signals. Signals are just waves. Understand the mathematical basics of waves -- bonus points if you study differential equations whose solutions are waves.

Once you're pretty confident with your understanding of wave math, next focus on linear algebra. Linear algebra sounds like a strange requirement for understanding signals, but it actually is fundamental. Fourier series should then just "click" for you -- a Fourier transform is basically a change of basis in a vector space.
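
To make the change-of-basis point concrete: the DFT is just multiplication by a matrix whose rows are the complex-exponential basis vectors, so "taking the FFT" is literally expressing the same vector in a different basis. A quick numpy check:

    import numpy as np

    N = 8
    n = np.arange(N)
    F = np.exp(-2j * np.pi * np.outer(n, n) / N)   # DFT matrix: rows are the basis vectors

    x = np.random.randn(N)
    print(np.allclose(F @ x, np.fft.fft(x)))       # True: same transform, viewed as a basis change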

Bonus points if you learn about infinite-dimensional vector spaces, what Dirac delta functions are and what they mean. For example, why is it that a delta function has infinite spectral energy? Either you have no idea, or you find the answer obvious. There is no in between. Good DSP folks can answer theoretical questions like this without thinking. Once you understand how to think about waves in the time basis and the frequency basis, you will be well equipped to understand pretty much everything in DSP.
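
The discrete version of the delta-function answer is a one-liner: an impulse projects equally onto every basis vector, so its spectrum is flat, and its total spectral energy grows without bound as you take more bins (or, in the continuous case, integrate a flat spectrum over all frequencies).

    import numpy as np

    delta = np.zeros(1024)
    delta[0] = 1.0
    spectrum = np.fft.fft(delta)
    print(np.allclose(np.abs(spectrum), 1.0))   # True: unit magnitude in every bin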


Coincidentally, that's how I came around to it, from a physics undergrad, to music software, to cs.


What’s wave math?


Fourier transforms and more generally convolutions.


So remove DFT but use Fourier still?

No, real wave maths is continuous wavelets and those are not as useful in processing. (DFT is a kind of wavelet transform, just limited.)

And then you have the advanced tensor wave math which is used in physics but rarely in sound processing. I bet evaluating Schrödinger's equation and symmetric solutions wasn't what you had in mind.

You'll get much more mileage out of statistics and discrete mathematics, plus control theory and optimization theory.


There are many different types of Fourier transforms (i.e. generalisations and specialisations) useful for signal analysis, not just the (classical full-spectrum) FT. E.g. the fractional FT (useful for signal separation), or the Laplace transform (used all over the place in control for linear(ised) systems and analysis).

> Schrödinger

Well momentum is the FT of position so...

Also note the OP's question is "What's wave math?", not "why not use the (D|F)FT?", which I'm still surprised was suggested: it's a tool, use it when appropriate, use something else when it's not.


What would be a good practical project to learn about DSP? Preferably in Python, and preferably relevant to music somehow.


Well, I did most of my signals analysis with an oscilloscope, circuitry, and a function (read: waveform) generator. You could very easily get computer versions of those and swap the function generator for some music. But I'd consult the elders of the internet for better ideas.
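
The computer version of that setup is only a few lines of Python: synthesize a known "function generator" signal, then look at it with a software "scope" (numpy + matplotlib sketch):

    import numpy as np
    import matplotlib.pyplot as plt

    sr = 44100
    t = np.arange(sr) / sr
    signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)

    fig, (ax_time, ax_freq) = plt.subplots(2)
    ax_time.plot(t[:500], signal[:500])                     # time-domain "oscilloscope" view
    ax_freq.magnitude_spectrum(signal, Fs=sr, scale='dB')   # frequency-domain view
    plt.show()

From there you can load actual music with something like scipy.io.wavfile and poke at it the same way.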


Julius Smith @ Stanford has a couple of great free books online:

Introduction to Digital Filters with Audio Applications

https://ccrma.stanford.edu/~jos/filters/

Mathematics of the Discrete Fourier Transform

https://ccrma.stanford.edu/~jos/mdft/


http://dspguide.com/ is a practical, not too mathematical guide. Really, I recommend it to anyone who knows math and computer science. Understanding linearity, impulse responses, and convolution is really useful, and intuitive.



"Understanding Digital Signal Processing" by Richard Lyons is a wonderful book. Very readable, heavier on intuitive explanations/diagrams and lighter on formulas.


If you like audio, Will Pirkle's books on the subject are great although he gets some flak in professional circles for the code quality.


Do you think DDSP could be used for feature engineering/discovery? My setting is a time series that I want to do regression on, but it's not clear a priori what features of the series have good predictive power. I imagine making the DDSP the first layer of an FNN, with the gradients helping me identify the right filters for extracting important features from my data.


DDSP modules are helpful in situations where you want to impose some level of interpretability and modularity. Most also don't have parameters themselves, but must have them provided by another network or variable. So you could imagine for instance feeding your data through a NN that then predicts filter coefficients, then running the same data through a filter with those coefficients (if you wanted to enforce time-varying linearity for interpretability let's say).
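
As a rough sketch of what that could look like for a regression setting (plain TensorFlow with hypothetical names, not the actual ddsp API): a small network predicts per-example filter magnitudes, the filter is applied to the series, and the filtered result feeds whatever regressor you like.

    import tensorflow as tf

    class FilterFrontEnd(tf.keras.layers.Layer):
        # Predicts per-example frequency-domain filter magnitudes from crude
        # summary features, applies the filter, and passes the result on.
        # Assumes the series has length 128, so rfft gives 65 bins.
        def __init__(self, n_bins=65):
            super().__init__()
            self.net = tf.keras.Sequential([
                tf.keras.layers.Dense(64, activation='relu'),
                tf.keras.layers.Dense(n_bins, activation='softplus'),
            ])

        def call(self, series):
            feats = tf.reduce_mean(series, axis=-1, keepdims=True)   # placeholder features
            mags = self.net(feats)                                   # [batch, n_bins]
            spec = tf.signal.rfft(series)                            # [batch, n_bins]
            filtered = tf.signal.irfft(spec * tf.cast(mags, tf.complex64))
            return filtered                                          # feed this to your regressor

After training end-to-end on the regression loss, the learned magnitude curves are themselves interpretable: they tell you which bands the model decided were worth keeping.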


I don't know much about DSP, neural nets, and audio, but I am really intrigued by this project. If you have a second, could you give a simple example of how this approach could be applied to problems outside of audio?


So I'm using the gradients "grad" of the second network (the DDSP) to make a loss function for the first network, such as |grad|^2?


I don't really understand why I'd use gin. From the example ipynbs it looks like pretty much the same amount of code but in a gin file and then it spookily fills in parameters for you in python. Why is this useful?
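
For context, my reading of the pattern in the notebooks is roughly this (paraphrased with made-up parameter names, not copied from the repo):

    import gin

    @gin.configurable
    def train(learning_rate=1e-3, n_harmonics=100):
        print(learning_rate, n_harmonics)

    # experiment.gin (a plain text file, e.g. saved next to your results):
    #   train.learning_rate = 1e-4
    #   train.n_harmonics = 60

    gin.parse_config_file('experiment.gin')
    train()   # runs with the values bound in the file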



