Manipulate audio with a simple Python library (pydub.com)
164 points by coppolaemilio on Oct 12, 2014 | 43 comments

Thanks for making this.

Relevant plug: I have a Vagrant box [1] and GitHub repo [2] containing IPython notebooks that we use in a workshop on music information retrieval. (Caution: the IPython notebooks are under heavy development, i.e. incomplete. If you have any feedback, please create a GitHub issue.) I just added pydub to the latest version of the Vagrant box [3].

The nice thing about this setup is that, regardless of your host OS, everybody has the same development environment so you don't have to go through the pains of installing numpy, scipy, scikit-learn, essentia, and more.

[1] https://vagrantcloud.com/stevetjoa/boxes/stanford-mir

[2] https://github.com/stevetjoa/stanford-mir

[3] https://vagrantcloud.com/stevetjoa/boxes/stanford-mir/versio...

You're welcome :)

Looks like a cool project!

A couple of things that people who like this might also be interested in:

Coursera.org is currently running a course called "Audio Signal Processing for Music Applications", which I believe uses Python. It's in its second week, so you have time to catch up. https://www.coursera.org/course/audio

http://aubio.org/ is a library that does note onset detection, pitch detection, beat/tempo tracking and various other things. It has Python bindings.

This looks really cool.

I'm a Python developer but have little experience with audio/music processing. Is there some other software or DSL that can manipulate audio files with a high-level syntax like this?

I don't know if this fits the bill exactly, but I've been dying to have time to play around with Overtone[1] and Emacs Live[2]!

[1]: http://overtone.github.io

[2]: http://overtone.github.io/emacs-live/

http://supercollider.github.io/ is one of the most widely used.

There's ChucK http://chuck.cs.princeton.edu/ and also for Python there's Pyo http://ajaxsoundstudio.com/software/pyo/

It reminds me of VirtualDub's scripting language, which is meant for videos but could do audio, if I recall correctly. It's been about half a decade since I used it, though...

What do professional music producers and mixers use? I imagine GUI-type mixing software is still the most popular, but I feel like there must be some experimental musicians out there producing music by fuzzing different parameters in some kind of scripting or configuration language.

If there actually aren't, I kind of feel like there should be.

There are many DSLs for music! Here are a few: https://en.wikipedia.org/wiki/Audio_programming_language

Many of these are inspired by, or even directly based on, Max Mathews's work: https://en.wikipedia.org/wiki/MUSIC-N

Thanks, these are really cool. These were the kinds of things I had in mind. I'm tempted to experiment with one myself, but my music skills and knowledge are non-existent. And I'm pretty sure those are much more essential than programming skills when using these languages.

That doesn't sound like a reason to avoid experimenting. If you're tempted to experiment, jump in! Be bold!

I've seen some really cool stuff being done with Overtone: http://overtone.github.io/

This is awesome, thanks.


Just found this example song apparently coded live in Overtone: https://soundcloud.com/meta-ex/spiked-with-recursive-dreams

Quite amazing.

I have a simple Python library[0] for direct audio synthesis, with oscillators, filters, various other effects, MIDI input, plus some basic building blocks for algorithmic music construction. It pretty much needs PyPy to provide samples fast enough for real-time audio, but even PyPy's too slow sometimes.

Not too much documentation right now, and there's barely any example usage. Think it's about time to remedy that! I have plans for an open-source album created with it, but it's just an idea at this point.

[0] https://github.com/zwegner/undulance

Very cool to see so many python music projects here! I suppose I'll add mine, which is also sorely lacking in documentation and is very much in a pre-release state: https://github.com/hecanjog/pippi

I've been dogfooding it for my own music since I started working on it, here's a recent-ish album made with it: http://music.hecanjog.com/album/solos-for-unattended-compute...

I've been similarly frustrated with DAWs too. In the last few weeks I had a go at writing a DSL that shells out to SoX for audio manipulation. This way I don't have to manipulate audio samples myself.

It's way less polished than pydub but here it is if anyone is interested:




At first it was a pretext to play with free monads, a way of building EDSLs. But right now I'm not sure it's not just a complication. Though, having an intermediate representation before executing the SoX commands makes it possible to write an optimizer (for example, collapsing two audio shifts).
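To make the optimizer idea concrete, here is a minimal Python sketch. It is not the project's code: the tuple representation and the op names (`shift`, `gain`) are made up for illustration, and it assumes two consecutive shifts can be merged by adding their amounts.

```python
def collapse_shifts(ops):
    """Merge runs of adjacent ("shift", amount) ops into a single shift.

    `ops` is a hypothetical intermediate representation: a list of
    (name, amount) tuples built up before any command is executed.
    """
    optimized = []
    for name, amount in ops:
        previous_is_shift = bool(optimized) and optimized[-1][0] == "shift"
        if name == "shift" and previous_is_shift:
            # Two shifts in a row: keep one op with the summed amount.
            optimized[-1] = ("shift", optimized[-1][1] + amount)
        else:
            optimized.append((name, amount))
    return optimized


program = [("shift", 200), ("shift", 300), ("gain", -3), ("shift", 100)]
optimized_program = collapse_shifts(program)
print(optimized_program)
```

Only after this pass would each remaining op be turned into an external command, so the merged program needs one invocation fewer than the original.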

I was thinking of using SoX as well. It turns out a lot of the audioop module in the standard library comes directly from SoX :)

relevant comment in the audioop source: https://github.com/python-git/python/blob/master/Modules/aud...

I know it's not available on Linux and not free, but have you tried Reaper? It has lots of scripting/coding capabilities.

Interesting, I didn't know about these capabilities, thanks!

I'd be crazy for a library (any language) to distill a sound file into discrete MIDI notes (or any notation), with configurable threshold levels. Clustering those into different tracks by some unsupervised learning would be a dream.

If what you're talking about is starting from an audio mix and estimating the complete set of notes that produced it, you could try Silvet (http://code.soundsoftware.ac.uk/projects/silvet), a (C++) implementation of a polyphonic note estimator from audio.

It's realised as a Vamp plugin which you can run in a host like Sonic Visualiser to review the results, play them back, and export as MIDI. (I'm involved with both projects.)

The general shape of this method, and of many related methods, is:

* convert audio to a time-frequency representation using some variation on the short-time Fourier transform

* match each time step of the time-frequency grid against a set of templates extracted from frequency profiles of various instruments, using some statistical approximation technique

* take the resulting pitch probability distributions and estimate what note objects they might correspond to, using simple thresholding (as in Silvet) or a Markov model for note transitions etc
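The three steps above can be sketched in a few lines of Python. This is a toy under loud assumptions, not Silvet's algorithm: it uses a naive DFT instead of a real STFT library, a single made-up harmonic profile (`HARMONIC_WEIGHTS`) instead of templates learned from instruments, cosine similarity as the "statistical approximation", and it only handles one note at a time.

```python
import cmath
import math

SAMPLE_RATE = 8000
FRAME_SIZE = 512
HARMONIC_WEIGHTS = [1.0, 0.5, 0.25]  # assumed, made-up instrument profile


def midi_to_hz(note):
    return 440.0 * 2 ** ((note - 69) / 12)


def magnitude_spectrum(frame):
    """Step 1: one column of the time-frequency grid (Hann window + naive DFT)."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(windowed)))
            for k in range(n // 2)]


def template_score(spectrum, note):
    """Step 2: cosine similarity between the spectrum and a harmonic template."""
    template = [0.0] * len(spectrum)
    for harmonic, weight in enumerate(HARMONIC_WEIGHTS, start=1):
        bin_index = round(harmonic * midi_to_hz(note) * FRAME_SIZE / SAMPLE_RATE)
        if bin_index < len(template):
            template[bin_index] = weight
    dot = sum(s * t for s, t in zip(spectrum, template))
    norms = (math.sqrt(sum(s * s for s in spectrum))
             * math.sqrt(sum(t * t for t in template)))
    return dot / norms if norms else 0.0


def transcribe(frames, threshold=0.5, candidates=range(48, 85)):
    """Step 3: simple thresholding; None means no note was detected."""
    notes = []
    for frame in frames:
        spectrum = magnitude_spectrum(frame)
        best = max(candidates, key=lambda note: template_score(spectrum, note))
        is_confident = template_score(spectrum, best) >= threshold
        notes.append(best if is_confident else None)
    return notes


# Test input: one frame of A4 (440 Hz) with harmonics, then one frame of silence.
tone = [sum(w * math.sin(2 * math.pi * h * 440.0 * i / SAMPLE_RATE)
            for h, w in enumerate(HARMONIC_WEIGHTS, start=1))
        for i in range(FRAME_SIZE)]
silence = [0.0] * FRAME_SIZE
detected = transcribe([tone, silence])
print(detected)
```

Even this toy hints at why full mixes are hard: a note and the note an octave below share harmonics, so their templates overlap, and several simultaneous instruments make the overlaps much worse.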

Silvet is a useful and interesting implementation, but if you try it, you'll also learn the limitations of current methods when used against complete musical mixes. (Some of this is intrinsic to the problem -- the information might not be there, and humans can't always transcribe it either.)

I've heard Melodyne solves this problem very successfully, and the demos look impressive. Any idea what it's doing? Is it patented / secret / witchcraft? Or just has more templates?

I don't have any worthwhile insight, I'm afraid. I expect it's partly high-quality methods, partly a lot of refinement for common inputs and use cases.

Academic methods tend to be trying to work towards a very general problem such as "transcribing a music recording". A tool intended for specific real users can approach the problem from a perhaps more realistic perspective.

This comment is mostly why I have never trodden in the feared world of multimedia. I simply don't understand what it means. I mean, I can guess at trying to pull the original "tracks" "laid down" out from a recording, but is that possible, common but not FOSS?

I simply don't know where to start and have not had the incentive to discover. It's probably as laughable to others as, say, someone turning up and saying "yeah SQL - I sort of understand it's to do with tables, right?" But audio and video seem like closed worlds of programming. There seems to be no gateway from here to there.

So something like this is v exciting - it might be the gateway drug.

Audio and video really aren't that closed; they just require a commitment to learning things that are pretty different than what a typical programmer does; and they get math heavy. But they're really, really fun to play with, when you knuckle down, and there's a fairly gentle learning curve from understanding formats and containers to samples to statistical signal processing and psychoacoustic and visual models.

And being good at it means writing your own ticket, if you're a careerist.

Is there a recognised / common syllabus for that gentle learning curve - it might make a worthwhile hobby ... But it's definitely a "one day" for the moment.

I don't know of a defined syllabus but here is what I did:

1. Write some basic synthesis code (naive oscillator and volume envelope) and some way of sequencing it. Write little toy trackers and some procedural audio with this technique.

2. Try and fail a few times to write a dataflow engine for audio like Pure Data. (This is kind of a big project and, it turns out, you really don't need it to experiment.)

3. Write a Standard MIDI File playback system for a one-voice PC beeper emulation. This turned out to be the gateway drug that got me learning in more depth, because all you have to do is add "just one more" feature and every MIDI file you play sounds a little better.

4. Expand MIDI playback and synthesizer in tandem. End up with polyphonic .WAV sampler, then Soundfont playback. Learn everything about DSP badly, and gradually correct misconceptions. (DSP material can be tricky since the math concepts map to illegible code, and a lot of the resources are for general engineering applications instead of audio.)

5. Rewrite things and work on a custom subtractive synthesizer, after getting everything wrong a few times over in the first synthesis engine. Still do not know how to write nice-sounding IIR filters; steal a good free implementation instead.

And that is where I stand today. I know enough to engineer complex signal chains, how some of the common formats are structured, and some tricks for improving sound quality and optimizing CPU time; what I lack is the background for writing original signal processing algorithms, which gets really specialized (there are some people who devote themselves entirely to reverbs, for example). These algorithms and their sound characteristics are effectively trade secrets, so the opaqueness of the field is not just a matter of the problems being hard - DSP just hasn't become as commodified as other software.
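For anyone wondering what step 1 looks like in practice, here is a minimal sketch using only the Python standard library. The function names, the linear attack/release envelope, the three-note melody and the output file name are all choices made for this example, not anything prescribed above.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 44100


def sine_oscillator(frequency_hz, duration_s):
    """Naive oscillator: one float sample in [-1, 1] per time step."""
    count = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE)
            for i in range(count)]


def apply_envelope(samples, attack_s=0.01, release_s=0.2):
    """Volume envelope: linear fade-in and fade-out to avoid clicks."""
    attack = max(1, int(SAMPLE_RATE * attack_s))
    release = max(1, int(SAMPLE_RATE * release_s))
    total = len(samples)
    return [s * min(1.0, i / attack, (total - 1 - i) / release)
            for i, s in enumerate(samples)]


def write_wav(path, samples):
    """Store the samples as 16-bit mono PCM."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(
            b"".join(struct.pack("<h", int(s * 32767)) for s in samples))


# A very small "sequencer": play three notes one after another.
melody_hz = [440.0, 554.37, 659.25]
samples = []
for frequency in melody_hz:
    samples.extend(apply_envelope(sine_oscillator(frequency, 0.5)))

output_path = os.path.join(tempfile.gettempdir(), "melody.wav")
write_wav(output_path, samples)
```

Swapping `math.sin` for a square or sawtooth function, or replacing the fixed melody with a list of (note, duration) pairs, is roughly where the "toy tracker" part starts.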

With due respect to yourself (and jfb) the first paragraph is what I am talking about. I can just about guess you mean produce a sinewave, chop it into time slices and output those in some audio format.

I think I will give it a go shortly - close your ears :-)

Chipsy's peer comment is excellent. For video I'd:

1. write an MPEG-4 parser -- this is much simpler than you probably think;

2. decode the H.264 metadata;

3. decode the H.264 picture data and write it to files, one per frame -- do not be ashamed to use an existing decoder!

4. put these frames back into a MJPEG, for instance;

5. try your hand at developing a DEAD SIMPLE I-frame only video codec, using e.g. zip for the frames.

This could teach you if you are interested in video without too much conceptual overhead. I had a friend and coworker who did #5 in Ruby, so don't think you need to get into GPU vectorization and signals theory right away.
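As a rough idea of how small step 5 can be, here is a Python sketch that compresses every frame independently with zlib (so every frame is an I-frame). The container layout, the `IFRM` tag and the 8-bit grayscale frame format are invented for this example and do not correspond to any real codec.

```python
import struct
import zlib

MAGIC = b"IFRM"  # made-up container tag, not a real format


def encode(frames, width, height):
    """Header (magic, width, height, frame count), then length-prefixed frames."""
    chunks = [MAGIC, struct.pack("<HHI", width, height, len(frames))]
    for frame in frames:
        if len(frame) != width * height:
            raise ValueError("each frame must be width * height grayscale bytes")
        packed = zlib.compress(frame)  # no reference to any other frame
        chunks.append(struct.pack("<I", len(packed)))
        chunks.append(packed)
    return b"".join(chunks)


def decode(blob):
    """Inverse of encode(): returns (width, height, list of raw frames)."""
    if blob[:4] != MAGIC:
        raise ValueError("not an IFRM stream")
    width, height, count = struct.unpack_from("<HHI", blob, 4)
    offset = 4 + struct.calcsize("<HHI")
    frames = []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        frames.append(zlib.decompress(blob[offset:offset + size]))
        offset += size
    return width, height, frames


# Three synthetic 16x8 frames: a gradient that moves by one step per frame.
width, height = 16, 8
frames = [bytes((x + y + t) % 256 for y in range(height) for x in range(width))
          for t in range(3)]
blob = encode(frames, width, height)
print(len(blob), "bytes for", len(frames), "frames")
```

Because no frame depends on another, seeking is trivial and a corrupt frame does not damage its neighbours; the price is that nothing exploits the similarity between consecutive frames, which is what P- and B-frames in real codecs are for.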

Ableton Live's most recent version has this feature. It's called harmony to MIDI, and it's pretty cool (Ableton is a fantastic program in general).

I'm not aware of any open libs for this task though. I'm not really sure how you would go about this either. Something with wavelets would be my first guess? There is a wavelet lib for Python [0]. You'd have to determine the correspondence between wavelet scale and MIDI note frequency.

This assumes audio tracks are separated. Separating mixed tracks seems like an even bigger can of worms.

[0]: http://www.pybytes.com/pywavelets/
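The frequency-to-MIDI half of that correspondence is the standard equal-temperament formula, with A4 = 440 Hz as MIDI note 69. A small Python sketch follows; the function names are made up, and mapping a wavelet scale to a frequency in the first place depends on the wavelet used and is not covered here.

```python
import math


def frequency_to_midi(frequency_hz):
    """Fractional MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return 69 + 12 * math.log2(frequency_hz / 440.0)


def midi_to_frequency(note):
    """Frequency in Hz of a (possibly fractional) MIDI note number."""
    return 440.0 * 2 ** ((note - 69) / 12)


def nearest_midi_note(frequency_hz):
    """Snap an estimated frequency to the closest whole MIDI note."""
    return round(frequency_to_midi(frequency_hz))


print(nearest_midi_note(450.0), round(midi_to_frequency(60), 2))
```

Each semitone is a factor of 2^(1/12) in frequency, so a detector needs finer frequency resolution at low pitches, where neighbouring notes are only a few Hz apart.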

Cool project.

The Echo Nest Remix API has similar capabilities, and can quantize into granularities of measures, beats, and tatums.


> last_5_seconds = song[5000:]

I think it should be song[-5000:]

EDIT: it was already reported: https://github.com/jiaaro/pydub/issues/65

oh damn, updated the readme but not the dot-com.

Fixed. I really need to automate that :/

I started writing a sample-level audio suite in Java, for musical purposes:


which I'll probably discard because Java is not the right tool for the job. Inspecting pydub/pydub/pyaudioop.py, I see there are methods for working on the sample level. I'll make a mental note to come back to this project when I get back on the computer music bandwagon again.

Nice! I love little focused Python libraries like this. Will have to keep it in mind if I ever do audio stuff in the future.

What is the best way to break apart an MP3 file (similar to ID3Lib) so you can see how it's encoded and edit the artist name, song name, add an image, change the encoding format, etc?

Is there a good Python library for re-encoding an MP3?

This might do what you're looking for:


pydub does it, but it shells out to ffmpeg; not sure if that's acceptable.

Cool idea! I like the overloaded operators.

I would if they were consistent, but having some to change volume and some to change repeats is somewhat jarring.

I almost used a decibel type but in practice, when you write

    my_sound + 6

it feels pretty natural to think about that as "add 6dB" and

    my_sound * 3

as "my_sound 3 times"

Also, since those are the only odd overloads, it is quite easy to learn (at least I think so, but I'm the author =P )
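For readers curious how such overloads can be wired up, here is a toy Python sketch. `ToySound` is a made-up class holding a plain list of float samples; it is not pydub's implementation and only illustrates the two operators discussed here, using the usual amplitude rule that a gain of d dB multiplies samples by 10^(d/20).

```python
class ToySound:
    """Hypothetical stand-in for an audio segment: just a list of samples."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __add__(self, gain_db):
        # sound + 6 reads as "add 6 dB": scale every sample by 10^(dB/20).
        if not isinstance(gain_db, (int, float)):
            return NotImplemented
        factor = 10 ** (gain_db / 20)
        return ToySound(s * factor for s in self.samples)

    def __sub__(self, gain_db):
        # sound - 6 is the matching "turn it down by 6 dB".
        if not isinstance(gain_db, (int, float)):
            return NotImplemented
        return self + (-gain_db)

    def __mul__(self, times):
        # sound * 3 reads as "this sound 3 times in a row".
        if not isinstance(times, int):
            return NotImplemented
        return ToySound(self.samples * times)


my_sound = ToySound([0.1, -0.1])
louder = my_sound + 6
repeated = my_sound * 3
print(louder.samples, len(repeated.samples))
```

Returning `NotImplemented` for unexpected operand types lets Python raise a normal `TypeError`, which keeps the odd-looking arithmetic from silently accepting nonsense.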

Suggestions on some cool academic projects that could use this library?
