Please pay more attention to point 3 in my original post. To reiterate: their encoding is hilariously bad, and is easily outcompeted by a modem from the 60s.
you're missing the forest for the trees. the library this demo uses for audio encoding (ggwave) was not made by the creators of the demo. speed (or lack thereof) aside, having a direct audio<->text encoding is much more computationally efficient than speech<->text generation.
on the subject of encoding efficiency, the ggwave repo mentions the use of reed-solomon error correction to make transmission more reliable. i'm struggling to find any info on error correction used by bell 103 or other modems, but if they aren't as robust, that could partially explain the discrepancy you're describing.
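for a rough sense of how much error correction alone could account for, here's a minimal back-of-the-envelope sketch in python. the 300 bit/s figure is bell 103's nominal rate; the reed-solomon parameters (223 data symbols plus 32 parity symbols) and the function name `effective_rate` are illustrative assumptions, not values taken from ggwave.

```python
def effective_rate(raw_bits_per_s: float, data_symbols: int, parity_symbols: int) -> float:
    """Payload bits per second left after block-code parity overhead.

    A block code that sends `parity_symbols` extra symbols for every
    `data_symbols` payload symbols has code rate k / (k + parity).
    """
    if data_symbols <= 0 or parity_symbols < 0:
        raise ValueError("need data_symbols > 0 and parity_symbols >= 0")
    total_symbols = data_symbols + parity_symbols
    return raw_bits_per_s * data_symbols / total_symbols


BELL_103_BPS = 300.0   # nominal Bell 103 line rate

# ASSUMPTION: illustrative Reed-Solomon block shape, not ggwave's actual config.
DATA_SYMBOLS = 223
PARITY_SYMBOLS = 32

# What a 300 bit/s link would still deliver if it carried this much parity.
with_fec = effective_rate(BELL_103_BPS, DATA_SYMBOLS, PARITY_SYMBOLS)
overhead = 1.0 - with_fec / BELL_103_BPS

print(f"payload rate with parity: {with_fec:.1f} bit/s")
print(f"fraction spent on parity: {overhead:.1%}")
```

under these assumed parameters the parity overhead costs roughly 13% of the raw rate, so error correction on its own would only account for part of a large speed gap, which fits the "partially" above.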
If you want to address a phone-with-internet-backchannel, that's valid too - but it assumes a different problem statement and constraints.