So, at what point during a phoneme does it become distinct?
Once any one phoneme has been sent down the wire, you can truncate the audio and send a phoneme identifier instead; the other end just replays the phoneme from a cache.
It's like doing speech-to-text, but with phonemes: you send the phoneme hash string instead of the actual audio.
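
To make the idea concrete, here is a rough sketch of the cache scheme. It assumes that something upstream has already split the audio into phonemes and labelled them, which is the hard part and is not shown. It also reads the idea as "send the full audio the first time a phoneme occurs, and only its identifier after that". The names `phoneme_id`, `Sender`, and `Receiver`, the message tuple layout, and the choice of a truncated SHA-1 of the label as the "hash string" are all made up for illustration.

```python
import hashlib


def phoneme_id(label: str) -> str:
    """Short, stable identifier for a phoneme label (hypothetical scheme)."""
    return hashlib.sha1(label.encode("utf-8")).hexdigest()[:8]


class Sender:
    def __init__(self):
        # Identifiers already sent in full; the receiver is assumed to cache them.
        self.sent = set()

    def encode(self, label: str, audio: bytes):
        pid = phoneme_id(label)
        if pid in self.sent:
            # Seen before: drop the audio and send only the identifier.
            return ("ref", pid, b"")
        self.sent.add(pid)
        # First occurrence: send the full audio so the receiver can cache it.
        return ("full", pid, audio)


class Receiver:
    def __init__(self):
        self.cache = {}

    def decode(self, message):
        kind, pid, audio = message
        if kind == "full":
            self.cache[pid] = audio
        # For a "ref" message this replays the cached audio.
        return self.cache[pid]


if __name__ == "__main__":
    sender, receiver = Sender(), Receiver()
    # Labels and byte strings are placeholder data, not real audio.
    stream = [("HH", b"\x01\x02"), ("AH", b"\x03\x04"), ("HH", b"\x09\x09")]
    for label, audio in stream:
        message = sender.encode(label, audio)
        print(message[0], message[1], receiver.decode(message))
```

Note that the second "HH" plays back the first speaker's cached audio rather than what was actually said, so this trades fidelity for bandwidth. It also assumes that no messages are lost, otherwise a "ref" could arrive for a phoneme the receiver never cached.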