
I guess they built upon the Voder [1] (Homer Dudley, Bell Labs, 1939) as well. But that one was played manually by a human operator. An amazing ‘instrument’!

1: https://youtu.be/5hyI_dM5cGo




Sort of. Both use articulatory synthesis, which models speech by breaking it into components and using coordinated, multi-dimensional continuous control to perform phonemes (the articulation aspect). The Voder uses analog electronics, while Daisy does it digitally (and without a human performer).

The underlying signal processing used for both is different, but both use a source-filter mechanism.


The synthetic Voder output sounds almost exactly like the output of a vocoder whose input is a human voice and whose carrier is a sawtooth. Not surprising, given that the Voder was made by the same people.

But I'm still unsure why those two things sound so similar to each other, and formant/LPC chips sound so similar to each other, but the two groups of things sound so dissimilar (at least, IMO).

I have a background in electronic music, so I'm pretty familiar with additive, subtractive, and other types of synthesis.

I'm especially surprised about the physical modelling sounding more like a formant chip, because a guitar "talk box" gives a sound exactly like a vocoder, and that should be almost the same thing, just with a real human mouth instead of a model.


The vo(co)der uses banks of fixed filters to apply the broad shape of a spectrum to an input signal. It's basically an automated graphic EQ. The level of each fixed band in the modulator is copied to the equivalent band in the carrier.

The bandpass filters have a steeper cutoff than usual and are flatter at the top of the passband than usual. And the centre frequencies aren't linearly spaced. But otherwise - it's just a fancy graphic EQ.
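The band-copying idea is small enough to sketch. A minimal channel vocoder in pure Python — the biquad bandpass coefficients follow the standard RBJ cookbook formulas, and the band centres, Q, and envelope time constant are placeholder choices, not anything from the Voder:

```python
import math

def bandpass(signal, fc, q, sr):
    """RBJ-cookbook biquad BPF (constant 0 dB peak gain)."""
    w0 = 2 * math.pi * fc / sr
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    y = [0.0] * len(signal)
    x1 = x2 = y1 = y2 = 0.0
    for n, x in enumerate(signal):
        yn = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        y[n] = yn
        x2, x1 = x1, x
        y2, y1 = y1, yn
    return y

def envelope(signal, sr, tc=0.01):
    """One-pole rectified envelope follower (~10 ms time constant)."""
    coeff = math.exp(-1.0 / (tc * sr))
    env = [0.0] * len(signal)
    e = 0.0
    for n, x in enumerate(signal):
        e = coeff * e + (1 - coeff) * abs(x)
        env[n] = e
    return env

def vocode(modulator, carrier, sr, band_centres, q=4.0):
    """Copy each modulator band's level onto the same band of the carrier."""
    out = [0.0] * len(carrier)
    for fc in band_centres:
        mod_band = bandpass(modulator, fc, q, sr)
        car_band = bandpass(carrier, fc, q, sr)
        env = envelope(mod_band, sr)
        for n in range(len(out)):
            out[n] += car_band[n] * env[n]
    return out
```

A real unit would add many more bands (non-linearly spaced, per the comment above) and a voiced/unvoiced detector, but the "automated graphic EQ" structure is all there.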

The formant approach uses dynamic filters. It's more like an automated parametric EQ. Each formant is modelled with a variable BPF with its own time-varying level, frequency, and possibly Q. You apply that to a simple, buzzy waveform and get speech-like sounds out. If you vary the pitch of the buzz you can make the output "sing."
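A rough sketch of that idea in pure Python — a sawtooth stands in for the glottal buzz, and the formant frequencies/bandwidths are approximate textbook values for an /a/ vowel, chosen just for illustration:

```python
import math

def buzz(freq, sr, dur):
    """Naive sawtooth as a stand-in for the glottal buzz source."""
    n = int(sr * dur)
    return [2.0 * ((i * freq / sr) % 1.0) - 1.0 for i in range(n)]

def vowel(source, formants, sr):
    """Parallel bank of two-pole resonators, one per formant.

    formants: list of (centre_hz, bandwidth_hz, amplitude) tuples.
    """
    out = [0.0] * len(source)
    for fc, bw, amp in formants:
        r = math.exp(-math.pi * bw / sr)       # pole radius from bandwidth
        a1 = 2 * r * math.cos(2 * math.pi * fc / sr)
        a2 = -r * r
        g = 1 - r                              # rough gain normalisation
        y1 = y2 = 0.0
        for n, x in enumerate(source):
            y = g * x + a1 * y1 + a2 * y2
            out[n] += amp * y
            y2, y1 = y1, y
    return out

# ~/a/-ish formants: F1≈800 Hz, F2≈1200 Hz, F3≈2500 Hz
AH = [(800.0, 80.0, 1.0), (1200.0, 90.0, 0.7), (2500.0, 120.0, 0.3)]
```

Sweeping those centre frequencies over time (and the buzz pitch for melody) is exactly the "automated parametric EQ" being described.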

LPC uses a similar source-filter model, but the filter is derived automatically: each sample is predicted as a weighted combination of the previous few samples, and those prediction coefficients only need updating once per frame. So instead of having to control all the parameters at or near audio rate, you can drop the control rate right down and still get something that can be understood.
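The "predict the next sample from the last few" core is compact. A minimal sketch of the analysis side (autocorrelation method plus the Levinson-Durbin recursion, pure Python) — a real codec would window each frame and add pitch/voicing detection on top:

```python
import math

def lpc(frame, order):
    """Return (a, err): prediction filter A(z) = 1 + a[1] z^-1 + ... and
    residual energy, via Levinson-Durbin on the frame's autocorrelation."""
    n = len(frame)
    # autocorrelation lags 0..order
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err                     # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1 - k * k)
    return a, err
```

Synthesis is just the inverse: drive 1/A(z) with a buzz (voiced) or noise (unvoiced), updating `a` once per frame.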

There are more modern systems. FOF and FOG use granular synthesis to create formant sounds directly. Controlling the frequency and envelope of the grains is equivalent to filtering a raw sound, but is more efficient.

FOF and FOG evolved into PSOLA which is basically real-time granulated formant synthesis and pitch shifting.
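The grain idea can be sketched directly: each FOF grain is a short sinusoid at a formant frequency under a quick attack and exponential decay (the decay rate setting the formant bandwidth), retriggered once per fundamental period. All the numbers below are illustrative:

```python
import math

def fof_grain(fc, bw, sr, dur=0.03, attack=0.003):
    """One FOF grain: sinusoid at fc, rising-cosine attack,
    exponential decay whose rate is set by the bandwidth bw."""
    n = int(sr * dur)
    atk = max(1, int(sr * attack))
    grain = []
    for i in range(n):
        env = math.exp(-math.pi * bw * i / sr)
        if i < atk:
            env *= 0.5 * (1 - math.cos(math.pi * i / atk))
        grain.append(env * math.sin(2 * math.pi * fc * i / sr))
    return grain

def fof_voice(f0, formants, sr, dur):
    """Overlap-add one grain per formant at every fundamental period."""
    out = [0.0] * int(sr * dur)
    period = int(sr / f0)
    for start in range(0, len(out), period):
        for fc, bw, amp in formants:
            for i, s in enumerate(fof_grain(fc, bw, sr)):
                if start + i < len(out):
                    out[start + i] += amp * s
    return out
```

Because the grain rate is the pitch and the grain's internal frequency is the formant, pitch and timbre are decoupled — which is what makes the PSOLA-style pitch shifting mentioned above fall out of the same machinery.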


Many of the simpler vocal tract physical models are very similar to the cascaded allpass filter topologies found in LPC speech synthesizers.

In general, tract physical models have never sounded all that realistic. The one big thing they have going for them is control. Compared to other speech synthesis techniques, they can be quite malleable. Pink Trombone [1] uses a physical model under the hood. While it's not realistic sounding, the interface is quite compelling.

1: https://dood.al/pinktrombone/
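For anyone curious what that kind of model looks like: a common simple tract model is a Kelly-Lochbaum digital waveguide — the tract as a chain of tube sections, with a scattering junction between each pair. One sample of that, as a pure-Python sketch; the reflection values are illustrative choices, not Pink Trombone's actual numbers:

```python
def tract_step(fwd, bwd, ks, glottal_in, lip_refl=-0.85, glot_refl=0.75):
    """One sample of a Kelly-Lochbaum waveguide tract.

    fwd/bwd: right- and left-travelling waves, one value per tube section.
    ks: reflection coefficient at each junction between adjacent sections
        (derived from the area ratio of the two sections in a real model).
    Returns (new_fwd, new_bwd, output_sample).
    """
    n = len(fwd)
    new_fwd = [0.0] * n
    new_bwd = [0.0] * n
    # glottis end: inject the source plus a partial reflection
    new_fwd[0] = glottal_in + glot_refl * bwd[0]
    # classic Kelly-Lochbaum scattering at each junction
    for i in range(n - 1):
        k = ks[i]
        new_fwd[i + 1] = (1 + k) * fwd[i] + k * bwd[i + 1]
        new_bwd[i] = -k * fwd[i] + (1 - k) * bwd[i + 1]
    # lip end: partial reflection; the rest radiates as output
    new_bwd[n - 1] = lip_refl * fwd[n - 1]
    out = (1 + lip_refl) * fwd[n - 1]
    return new_fwd, new_bwd, out
```

The "malleable control" point is visible here: changing the section areas (hence the `ks`) mid-note is just updating a list, which is why interfaces like Pink Trombone's can map a mouth drawing straight onto the model.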



