Yes, the whole point was how you'd have to deal with the emojis.
I feel it is implied that the latency is low enough (a few hundred ms) not to impede the conversation, and that the parties have talked before and would notice if the tone of the conversation were completely different. Or maybe I'm misunderstanding.
Yeah, that seems right. I guess an attacker could tamper with the call early on, say "hey, what are your emojis?" as soon as the call starts. Deepfaked voices and two henchmen should be sufficient here; no need for real-time audio rewriting. Then, once the emojis have been verified, switch to a pure-eavesdrop MITM.
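For anyone wondering why the attacker has to fake the emoji exchange at all, here is a rough sketch. It assumes the emojis are derived from a hash of the call's shared key, which is the usual idea, but the emoji table, the key strings and the `emojis_from_key` helper are all made up for illustration and are not any real app's scheme.

```python
import hashlib

# Hypothetical 16-entry emoji table; real apps use their own, larger tables.
EMOJI = ["🐶", "🐱", "🦊", "🐼", "🐸", "🐙", "🦋", "🌵",
         "🍎", "🍋", "🚀", "⚓", "🎲", "🔑", "🎧", "📌"]

def emojis_from_key(shared_key: bytes, count: int = 4) -> list[str]:
    """Map the first `count` bytes of SHA-256(shared_key) to emojis."""
    digest = hashlib.sha256(shared_key).digest()
    return [EMOJI[b % len(EMOJI)] for b in digest[:count]]

# No MITM: both ends hold the same key, so they read out the same emojis.
honest = emojis_from_key(b"alice-bob-key")

# MITM: the attacker negotiated a separate key with each party (placeholder
# keys here), so the two sides will almost certainly see different emojis.
alice_side = emojis_from_key(b"alice-mallory-key")
bob_side = emojis_from_key(b"mallory-bob-key")
```

So with a MITM in place the two screens disagree, and the only way through the check is to make sure neither party ever hears the other's real read-out, hence the deepfakes.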
To defend, you could ask to verify emojis at a random point in the middle of the call to make the attacker's life more difficult. Especially right before discussing sensitive information ;-)
Or drip verify over the course of the call, e.g. "what's your 3rd emoji?", and listen for signs of an attacker cutting in and out.
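A toy sketch of what "drip verify" could look like if you wanted to plan it ahead: pick a random moment and a random emoji position for each check. The `drip_schedule` function and its parameters are hypothetical, purely to illustrate the idea.

```python
import random

def drip_schedule(call_seconds: int, emoji_count: int,
                  rng: random.Random) -> list[tuple[int, int]]:
    """Return (time_offset_s, emoji_position) pairs, one per emoji, in time order."""
    # Ask for each position (1-based) exactly once, in a random order.
    positions = list(range(1, emoji_count + 1))
    rng.shuffle(positions)
    # Pick distinct random moments within the call, then sort them.
    times = sorted(rng.sample(range(call_seconds), emoji_count))
    return list(zip(times, positions))

# Example: a 10-minute call with 4 emojis, using OS randomness.
schedule = drip_schedule(600, 4, random.SystemRandom())
```

Each entry reads as "at about this many seconds in, ask for the emoji at this position". In practice you'd just do this in your head; the point is only that the attacker shouldn't be able to predict when the next question comes.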
That doesn't help in a hypothetical where the attacker is doing a passive audio-forwarding ("pure eavesdrop") MITM for everything except emoji verification.