Thanks everyone for the suggestions and kind words.
Some details:
The source code for this project can be found on github [0].
I am using an AudioWorklet node with custom DSP using Rust/WebAssembly. Graphics are just done with the Canvas API. The voice leading is done algorithmically using a state machine with some heuristics.
The underlying DSP algorithm is a physical model of the human voice, similar to the model you'd find in Pink Trombone [1], but with some added improvements. The DSP code for that is a small crate [2] I've been working on just for singing synthesizers based on previous work I've done.
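For a rough picture of how the pieces fit together, here is a minimal sketch in Rust (assuming wasm-bindgen for the WASM glue) of the kind of DSP object an AudioWorklet could call into once per 128-sample render quantum. The VoiceProcessor name, its methods, and the sawtooth placeholder are purely illustrative; they are not the actual API of the crate used in this project.

    use wasm_bindgen::prelude::*;

    // Hypothetical wrapper around a voice model. The struct, its methods, and
    // the sawtooth placeholder are illustrative only.
    #[wasm_bindgen]
    pub struct VoiceProcessor {
        phase: f32,
        freq: f32,
        sample_rate: f32,
    }

    #[wasm_bindgen]
    impl VoiceProcessor {
        #[wasm_bindgen(constructor)]
        pub fn new(sample_rate: f32) -> VoiceProcessor {
            VoiceProcessor { phase: 0.0, freq: 220.0, sample_rate }
        }

        pub fn set_pitch(&mut self, freq: f32) {
            self.freq = freq;
        }

        /// Fill one render quantum (128 samples in an AudioWorklet).
        pub fn process(&mut self, out: &mut [f32]) {
            for s in out.iter_mut() {
                // Naive sawtooth standing in for the glottal source + tract
                // filter of the actual physical model.
                *s = 2.0 * self.phase - 1.0;
                self.phase += self.freq / self.sample_rate;
                if self.phase >= 1.0 {
                    self.phase -= 1.0;
                }
            }
        }
    }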
Apologies for the late comment, but I had a query I wanted to share.
Would it be possible for you to create a tool that allows users to mimic human emotional sounds directly in the browser? I'm thinking of sounds like realistic coughs, sighs, gasps, and other vocal expressions like shouting or crying. It would be amazing if the tool could optionally incorporate TTS, but even without it, the functionality would be very valuable for content creators or people who need custom sound effects.
The idea is to let users customize these sounds by adjusting parameters such as intensity, pitch, and duration. It could also include variations for emotional contexts, like a sad sigh, a relieved sigh, a startled gasp, or a soft cough. An intuitive interface with sliders and buttons to tweak and preview sounds in real time would make it super user-friendly, with options to save or export the generated audio, much like Pink Trombone.
I’m quite new to this field and only have basic experience with HTML, CSS, and JavaScript. However, I am very much interested in this area and I was wondering if this is something that could be achieved using tools like CursorAI or similar AI-based solutions? Or better yet, is it possible for you to create something like this for people like me who aren’t very tech-savvy?
What a beautiful idea. Sadly, I do not think I currently have the skills required to build such a tool.
The underlying algorithms and vocal models I'm using here are just good enough to get some singing vowels working. You'd need a far more complex model to simulate the turbulent airflow required for a cough.
If you suspend disbelief and allow for more abstract sounds, I believe you can craft sounds that have similar emotional impact. A few years ago, I made some non-verbal goblin sounds [0] from very simple synthesizer components and some well-placed control curves. Even though they don't sound realistic, character definitely comes through.
Dear Zebproj, thank you for the response. I see. Do you believe that tools like Cursor AI or ChatGPT can help? Like you, I too do not have the skills to make such a tool, and while I am trying to get there, it will be quite some time before I can learn those skills and implement them. I really wish someone could make my wish come true. I will still have a look at what you shared, however. Cheers, Alex
Holding down a note and waiting will cause a second, then a third note to appear. When you move your held note to another pitch, the other pitches will follow, but with a bit of delay. This produces what is known as staggered voice leading, and creates interesting "in-between" chords.
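For the curious, here is a rough sketch of that staggered-following idea in Rust. The follower delays and intervals below are made-up placeholders, not the actual heuristics used in the project.

    /// A follower voice that chases the held (lead) pitch after its own delay.
    struct Follower {
        pitch: Option<i32>, // MIDI note currently sounding, if the voice has entered
        delay_ticks: u32,   // how far this voice lags behind the lead
        countdown: u32,     // ticks left before it enters or moves
        interval: i32,      // offset from the lead note once it sounds
    }

    struct VoiceLeader {
        lead: Option<i32>,
        followers: Vec<Follower>,
    }

    impl VoiceLeader {
        fn new() -> Self {
            VoiceLeader {
                lead: None,
                followers: vec![
                    // Made-up intervals: a fourth below, then a sixth below.
                    Follower { pitch: None, delay_ticks: 30, countdown: 30, interval: -5 },
                    Follower { pitch: None, delay_ticks: 60, countdown: 60, interval: -9 },
                ],
            }
        }

        /// Call when the held note appears or changes pitch.
        fn set_lead(&mut self, note: i32) {
            self.lead = Some(note);
            for f in &mut self.followers {
                // Followers keep their current pitch and only re-target after
                // their delay, which is what creates the "in-between" chords.
                f.countdown = f.delay_ticks;
            }
        }

        /// Advance one control tick; returns the pitches currently sounding.
        fn tick(&mut self) -> Vec<i32> {
            let mut out = Vec::new();
            if let Some(lead) = self.lead {
                out.push(lead);
                for f in &mut self.followers {
                    if f.countdown > 0 {
                        f.countdown -= 1;
                        if f.countdown == 0 {
                            f.pitch = Some(lead + f.interval);
                        }
                    }
                    if let Some(p) = f.pitch {
                        out.push(p);
                    }
                }
            }
            out
        }
    }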
Sporth is a stack-based language I wrote a few years ago. Stack-based languages are a great way to build up sound structures. I highly recommend trying it.
Chorth may need some fixes before it can run again. I haven't looked at it in a while, but I had a lot of fun using it when I was in SLOrk.
If you compare codebases, SuperCollider is definitely the more "modern" of the 2. SC is written in a reasonably modern version of C++, and over the years has gone through significant refactoring. Csound is mostly implemented in C, with some of the newer bits written in C++. Many parts of Csound have been virtually untouched since the 90s.
Syntax-wise, Csound very closely resembles the MUSIC-N languages used by early computer musicians in the 60s. "Trapped in Convert" by Richard Boulanger was originally written in 1979 (in MUSIC 11, and later converted to Csound, hence the title), and to this day it runs on the latest version of Csound.
Csound and SC are both very capable DSP engines with a good core set of DSP algorithms. You can get a "good" sound out of either if you know what you are doing.
I find people who are more CS-inclined tend to prefer SuperCollider over Csound because it's actually a programming language you can be expressive in. While there have been significant syntax improvements in Csound 6, I'd still call Csound a "text-based synthesizer" rather than a "programming language".
That being said, I also think Csound lends itself to those who have more of a formal background in music. Making an instrument in an Orchestra is just like making a synthesizer patch, and creating events in a Csound score is just like composing notes for an instrument to play.
FWIW, I've never managed to get SuperCollider to stick for me. The orchestra/score paradigm of Csound just seems to fit better with how I think about music. It's also easier to offline render WAV files in Csound, which was quite helpful for me.
I have programming experience, but that's actually why I prefer Csound. Since Csound's engine is oriented around building up instruments in a modular way, it can be wrapped inside a more general-purpose programming language, giving you an expressive language with the power of a modular synth engine.
You might enjoy my project called sndkit [0]. It's a collection of DSP algorithms implemented in C, written in a literate programming style, and presented inside of a static wiki. There's also a tiny TCL-like scripting language included that allows one to build up patches. This track [1] was made entirely using sndkit.
See my other comments here for more info about the underlying technology.
It is pretty incredible that sophisticated digital physical models of the human vocal tract were being built in the early 60s. This was possible largely due to the deep pockets of Bell Labs, which poured a lot of R+D into the voice and voice transmission.
The singing synthesizer used a surprisingly sophisticated physical model of the human voice [1].
The music was most likely created using some variant of MUSIC-N [2], the first computer music language. The syntax and design of Csound [3] were based on MUSIC-N, and I believe the older Csound opcodes are either ported from or based on those found in MUSIC-N.
Apparently the sources for MUSIC-V (the last major iteration of the MUSIC language) can be found on github [4], though I haven't tried to run it yet.
Sort of. Both use articulatory synthesis, which attempts to model speech by breaking it up into components and using some coordinated multi-dimensional continuous control to perform phonemes (the articulation aspect). The voder uses analog electronics, while Daisy does it digitally (and without a human performer).
The underlying signal processing used for both is different, but both use a source-filter mechanism.
The synthetic voder output sounds more or less exactly like the output of a vocoder where the input is a human voice and the carrier is a sawtooth. Not surprising, given that the voder was made by the same people.
But I'm still unsure why those two things sound so similar to each other, and formant/LPC chips sound so similar to each other, but the two groups of things sound so dissimilar (at least, IMO).
I have a background in electronic music, so I'm pretty familiar with additive, subtractive, and other types of synthesis.
I'm especially surprised about the physical modelling sounding more like a formant chip, because a guitar "talk box" gives a sound exactly like a vocoder, and that should be almost the same thing, just with a real human mouth instead of a model.
The vo(co)der uses banks of fixed filters to apply the broad shape of a spectrum to an input signal. It's basically an automated graphic EQ. The level of each fixed band in the modulator is copied to the equivalent band in the carrier.
The bandpass filters have a steeper cutoff than usual and are flatter at the top of the passband than usual. And the centre frequencies aren't linearly spaced. But otherwise - it's just a fancy graphic EQ.
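To make that concrete, here is a rough sketch of one vocoder band in Rust; a full channel vocoder would sum a couple dozen of these with staggered center frequencies. The filter design (an RBJ-style bandpass) and the smoothing constant are simplified choices, not taken from any particular vocoder.

    use std::f32::consts::PI;

    /// RBJ-style bandpass biquad (0 dB peak gain).
    struct Bandpass {
        b0: f32, b2: f32, a1: f32, a2: f32,
        x1: f32, x2: f32, y1: f32, y2: f32,
    }

    impl Bandpass {
        fn new(sr: f32, freq: f32, q: f32) -> Self {
            let w0 = 2.0 * PI * freq / sr;
            let alpha = w0.sin() / (2.0 * q);
            let a0 = 1.0 + alpha;
            Bandpass {
                b0: alpha / a0,
                b2: -alpha / a0,
                a1: -2.0 * w0.cos() / a0,
                a2: (1.0 - alpha) / a0,
                x1: 0.0, x2: 0.0, y1: 0.0, y2: 0.0,
            }
        }

        fn process(&mut self, x: f32) -> f32 {
            // b1 is zero for this bandpass, so it is omitted.
            let y = self.b0 * x + self.b2 * self.x2 - self.a1 * self.y1 - self.a2 * self.y2;
            self.x2 = self.x1;
            self.x1 = x;
            self.y2 = self.y1;
            self.y1 = y;
            y
        }
    }

    /// One vocoder band: the modulator band's smoothed level scales the
    /// matching carrier band. A full vocoder sums many of these.
    struct VocoderBand {
        mod_filter: Bandpass,
        car_filter: Bandpass,
        env: f32,    // smoothed level of the modulator band
        smooth: f32, // one-pole envelope-follower coefficient
    }

    impl VocoderBand {
        fn new(sr: f32, freq: f32, q: f32) -> Self {
            VocoderBand {
                mod_filter: Bandpass::new(sr, freq, q),
                car_filter: Bandpass::new(sr, freq, q),
                env: 0.0,
                smooth: 0.995,
            }
        }

        fn process(&mut self, modulator: f32, carrier: f32) -> f32 {
            let level = self.mod_filter.process(modulator).abs();
            self.env = self.smooth * self.env + (1.0 - self.smooth) * level;
            self.car_filter.process(carrier) * self.env
        }
    }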
The formant approach uses dynamic filters. It's more like an automated parametric EQ. Each formant is modelled with a variable BPF with its own time-varying level, frequency, and possibly Q. You apply that to a simple buzzy waveform and get speech-like sounds out. If you vary the pitch of the buzz you can make the output "sing."
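Here is a minimal sketch of that idea in Rust: a naive sawtooth "buzz" run through a few parallel two-pole resonators, one per formant. The formant frequencies and bandwidths are ballpark values for an "ah"-like vowel, and the level normalization is crude; a real formant synthesizer would interpolate these values (and their levels) from phoneme to phoneme.

    use std::f32::consts::PI;

    /// Two-pole resonator: y[n] = g*x[n] + a1*y[n-1] + a2*y[n-2]
    struct Resonator {
        a1: f32, a2: f32, gain: f32,
        y1: f32, y2: f32,
    }

    impl Resonator {
        fn new(sr: f32, freq: f32, bw: f32) -> Self {
            let r = (-PI * bw / sr).exp();
            Resonator {
                a1: 2.0 * r * (2.0 * PI * freq / sr).cos(),
                a2: -r * r,
                gain: (1.0 - r) * (1.0 - r), // crude level compensation
                y1: 0.0,
                y2: 0.0,
            }
        }

        fn process(&mut self, x: f32) -> f32 {
            let y = self.gain * x + self.a1 * self.y1 + self.a2 * self.y2;
            self.y2 = self.y1;
            self.y1 = y;
            y
        }
    }

    fn main() {
        let sr = 44100.0;
        let f0 = 110.0; // pitch of the buzz
        // Ballpark formants for an open "ah" vowel.
        let mut formants = vec![
            Resonator::new(sr, 700.0, 80.0),
            Resonator::new(sr, 1220.0, 90.0),
            Resonator::new(sr, 2600.0, 120.0),
        ];
        let mut phase = 0.0f32;
        for _ in 0..(sr as usize) {
            // Naive sawtooth standing in for a glottal source.
            let buzz = 2.0 * phase - 1.0;
            phase = (phase + f0 / sr) % 1.0;
            // Parallel formant filters, summed.
            let sample: f32 = formants.iter_mut().map(|f| f.process(buzz)).sum();
            let _ = sample; // write to an output buffer in a real program
        }
    }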
LPC uses a similar model but it applies data compression to estimate future changes for each formant band. So instead of having to control all the parameters at or near audio rate, you can drop the control rate right down and still get something that can be understood.
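The synthesis side of LPC is essentially just an all-pole filter whose coefficients get swapped in at a low frame rate. A hedged sketch in Rust; the coefficients would come from an analysis stage (e.g. autocorrelation plus Levinson-Durbin), which is omitted here.

    /// All-pole LPC synthesis filter:
    /// y[n] = x[n] - a_1*y[n-1] - ... - a_p*y[n-p]
    /// The a_k coefficients come from an analysis frame and are only updated
    /// every few milliseconds, which is where the data reduction comes from.
    struct AllPole {
        coeffs: Vec<f32>,  // a_1..a_p for the current frame
        history: Vec<f32>, // previous p output samples, most recent first
    }

    impl AllPole {
        fn new(order: usize) -> Self {
            AllPole {
                coeffs: vec![0.0; order],
                history: vec![0.0; order],
            }
        }

        /// Swap in a new frame of coefficients (must have length `order`).
        fn set_frame(&mut self, frame: &[f32]) {
            self.coeffs.copy_from_slice(frame);
        }

        /// Run one sample of excitation (buzz or noise) through the filter.
        fn process(&mut self, x: f32) -> f32 {
            let mut y = x;
            for (a, past) in self.coeffs.iter().zip(self.history.iter()) {
                y -= a * past;
            }
            self.history.rotate_right(1);
            self.history[0] = y;
            y
        }
    }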
There are more modern systems. FOF and FOG use granular synthesis to create formant sounds directly. Controlling the frequency and envelope of the grains is equivalent to filtering a raw sound, but is more efficient.
FOF and FOG evolved into PSOLA which is basically real-time granulated formant synthesis and pitch shifting.
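Here is a very stripped-down sketch of a single FOF-style grain stream in Rust: an exponentially decaying sine burst at the formant frequency, retriggered once per fundamental period. Real FOF grains also have a smooth attack and are allowed to overlap rather than being hard-retriggered, so treat this as the basic idea only.

    use std::f32::consts::PI;

    /// One FOF-style grain stream. Summing a few of these (one per formant)
    /// gives a vowel-like tone with no explicit filter.
    struct Fof {
        sr: f32,
        formant_freq: f32,
        decay: f32,  // grain decay rate; wider formant bandwidth -> faster decay
        period: f32, // samples per fundamental period
        t: f32,      // samples since the grain was (re)triggered
    }

    impl Fof {
        fn new(sr: f32, f0: f32, formant_freq: f32, bw: f32) -> Self {
            Fof {
                sr,
                formant_freq,
                decay: PI * bw / sr,
                period: sr / f0,
                t: 0.0,
            }
        }

        fn process(&mut self) -> f32 {
            let env = (-self.decay * self.t).exp();
            let s = env * (2.0 * PI * self.formant_freq * self.t / self.sr).sin();
            self.t += 1.0;
            if self.t >= self.period {
                self.t = 0.0; // real FOF grains overlap instead of hard-retriggering
            }
            s
        }
    }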
Many of the simpler vocal tract physical models are very similar to the ladder/lattice (all-pole) filter topologies found in LPC speech synthesizers.
In general, tract physical models have never sounded all that realistic. The one big thing they have going for them is control. Compared to other speech synthesis techniques, they can be quite malleable. Pink Trombone [1] uses a physical model under the hood. While it's not realistic sounding, the interface is quite compelling.
Thank you! Seems like that project was incredibly far ahead of its time.
The physical-modelling aspect is super interesting. Does that mean that the similarity in sound to formant-based speech synthesis is because they're both using a sawtooth wave, noise, or other relatively simple sound as the raw input? I always imagined that a physical-modelling speech synthesizer fed by a sawtooth wave would sound more like a vocoder than Votrax or TI LPC output does, but I guess not.
> Does that mean that the similarity in sound to formant-based speech synthesis is because they're both using a sawtooth wave, noise, or other relatively simple sound as the raw input?
Essentially, yes. Both are known as "source-filter" models. A sawtooth, narrow pulse, or impulse wave is a good approximation of glottal excitation for the source signal, though many articulatory speech models use a more specialized source model that's analytically derived from real waveforms produced by the glottis. The Liljencrants-Fant derivative glottal waveform model is the most common, but a few others exist.
In formant synthesis, the formant frequencies are known ahead of time and are explicitly added to the spectrum using some kind of peak filter. With waveguides, those formants are implicitly created based on the shape of the vocal tract (the vocal tract here is approximated as a series of cylindrical tubes with varying diameters).
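As a rough illustration of that tube approximation, here is a heavily simplified Kelly-Lochbaum-style ladder in Rust. Reflection coefficients come from neighboring cross-sectional areas; the glottis/lip reflection constants, the lack of losses, and the per-sample clones are all simplifications for readability, not how a practical implementation would do it.

    /// Heavily simplified Kelly-Lochbaum-style tract: a chain of cylindrical
    /// sections where each change in cross-sectional area produces a scattering
    /// junction with (pressure-wave) reflection coefficient
    ///     k_i = (A_i - A_{i+1}) / (A_i + A_{i+1}).
    struct Tract {
        refl: Vec<f32>,     // reflection coefficient at each junction
        forward: Vec<f32>,  // right-going wave per section
        backward: Vec<f32>, // left-going wave per section
    }

    impl Tract {
        fn new(areas: &[f32]) -> Self {
            let refl: Vec<f32> = areas
                .windows(2)
                .map(|w| (w[0] - w[1]) / (w[0] + w[1]))
                .collect();
            Tract {
                refl,
                forward: vec![0.0; areas.len()],
                backward: vec![0.0; areas.len()],
            }
        }

        /// Push one sample of glottal excitation in, get the lip output back.
        fn process(&mut self, glottal: f32) -> f32 {
            let n = self.forward.len();
            // Inject the source at the glottis end with a crude glottal reflection.
            self.forward[0] = glottal + 0.75 * self.backward[0];

            // Scatter at each junction (clones kept for clarity, not efficiency).
            let mut fwd = self.forward.clone();
            let mut bwd = self.backward.clone();
            for (i, &k) in self.refl.iter().enumerate() {
                let f = self.forward[i];
                let b = self.backward[i + 1];
                fwd[i + 1] = (1.0 + k) * f - k * b; // transmitted right + reflected
                bwd[i] = k * f + (1.0 - k) * b;     // reflected left + transmitted
            }
            self.forward = fwd;
            self.backward = bwd;

            // Crude open-end (lip) reflection; the remainder radiates as output.
            let lip = self.forward[n - 1];
            self.backward[n - 1] = -0.85 * lip;
            lip * 0.15
        }
    }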
Human speech production/perception works like this: articulation changes the shape, and hence the resonant frequencies (formants), of the vocal tract, and our ear/auditory cortex then picks up these changing formants. We're especially attuned to changes in the formants since those correspond to changes in articulation. The specific resonant frequency values of the formants vary from individual to individual and aren't so important.
Similarly, the sound source (aka the voice) for human speech can vary a lot from individual to individual, so it serves more to communicate age/sex, emotion, identity, etc., not the actual speech content (formant changes).
The reason articulatory synthesis (whether based on a physical model of the vocal tract, or a software simulation of one) and formant synthesis sound so similar is because both are designed to emphasize the formants (resonant frequencies) in a somewhat overly-precise way, and neither typically do a good job of accurately modelling the voice source, and other factors that would make it sound more natural. The ultimate form of formant synthesis just uses sine waves (not a source + filter model) to model the changing formant frequencies, and is still quite intelligible.
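That sine-wave extreme ("sinewave speech") is easy to sketch: one oscillator per formant, each tracking a time-varying frequency. In a real system the trajectories come from analyzing recorded speech; the glide below is just a made-up placeholder.

    use std::f32::consts::PI;

    /// One sine oscillator tracking a formant trajectory.
    struct SineTrack {
        phase: f32,
    }

    impl SineTrack {
        fn process(&mut self, sr: f32, freq: f32, amp: f32) -> f32 {
            let s = amp * (2.0 * PI * self.phase).sin();
            self.phase = (self.phase + freq / sr) % 1.0;
            s
        }
    }

    fn main() {
        let sr = 44100.0;
        let mut tracks = [
            SineTrack { phase: 0.0 },
            SineTrack { phase: 0.0 },
            SineTrack { phase: 0.0 },
        ];
        // Made-up formant glide from an "ah"-like to an "ee"-like configuration
        // over one second; real trajectories come from analyzed speech.
        let start = [700.0, 1220.0, 2600.0];
        let end = [270.0, 2300.0, 3000.0];
        for n in 0..(sr as usize) {
            let t = n as f32 / sr;
            let mut sample = 0.0;
            for (i, osc) in tracks.iter_mut().enumerate() {
                let freq = start[i] + (end[i] - start[i]) * t;
                sample += osc.process(sr, freq, 0.3);
            }
            let _ = sample; // write to an output buffer in a real program
        }
    }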
The "Daisy" song somehow became a staple for computer speech, and can be heard here in the 1984 DECtalk formant-synthesizer version. You can still pick up DECtalks on eBay - an impressive large VCR-sized box with a 3" 68000 processor inside.
The neat thing about this particular singing synthesizer is that it used a surprisingly sophisticated (especially for the 60s) physical model of the human vocal tract [1], and was perhaps the first use of physical modeling sound synthesis. Vowel shapes were obtained through physical measurements of an actual vocal tract via x-rays. In this case, they were Russian vowels, but were close enough for English.
While this particular kind of speech synthesis [2] isn't really used anymore, it's still fun to play around with. Pink Trombone [3] is a good example of a fun toy that uses a waveguide physical model, similar to the Kelly-Lochbaum model above. I've adapted some of the DSP in Pink Trombone a few times [4][5][6], and used it in some music [7] and projects [8] of mine.
For more in-depth information about specifically doing singing synthesis (as opposed to general speech synthesis) using waveguide physical models, Perry Cook's Dissertation [9] is still considered to be a seminal work. In the early 2000s, there were a handful of follow-ups to physically-based singing synthesis being done at CCRMA. Hui-Ling Lu's dissertation [10] on glottal source modelling for singing purposes comes to mind.
Another excellent, but quite dense, resource I've found helpful for implementing my own waveguide models is Physical Audio Signal Processing, a book available as a hard copy and online [1]. There is also an absolute ton of research on these topics that has never been summarized anywhere or cited outside a small circle of researchers, so a lot of institutional knowledge about physical modeling remains locked up in academic papers that aren't very accessible.
I've been fascinated by the simplicity of this since I ran into SAM (Software Automatic Mouth) on the C64, but never really taken the time to delve into it. Your links are an amazing resource...
0: https://github.com/paulBatchelor/trio
1: https://dood.al/pinktrombone/
2: https://github.com/PaulBatchelor/voxbox