The man in the middle would need to spoof the voice (with the current microphone, environment, etc.) in roughly real time, probably for both parties as well. And they'd need the awareness to detect which words to replace in the middle of a normal conversation.
With the demos we've seen, that feels absolutely doable, but for now it requires quite some effort.
You're talking about an attacker who wants to tamper with communication. Eavesdropping is far easier. [I agree you'll want deepfakes if you want your MITM eavesdropping to handle emoji codes, for mass surveillance at least.]
But even tampering seems pretty easy if the attacker has a more modest objective: having you and your buddy each talk to one of the attacker's henchmen, who use voice changers. The emoji verification won't help here -- each henchman just reads out the emoji for their respective conversation.
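To make the henchman scenario concrete, here is a minimal sketch of how emoji codes could be derived from a session key. Everything in it is an assumption for illustration: the 16-entry `EMOJI` table, the `emoji_code` helper, and the use of SHA-256 over the key are hypothetical and not the scheme of any particular app.

```python
import hashlib

# Hypothetical 16-entry emoji table; real apps use their own, larger tables.
EMOJI = ["🐶", "🐱", "🦁", "🐴", "🦄", "🐷", "🐘", "🐰",
         "🐼", "🐓", "🐧", "🐢", "🐟", "🐙", "🦋", "🌷"]


def emoji_code(shared_key: bytes, count: int = 4) -> list[str]:
    """Map the first `count` nibbles of SHA-256(shared_key) to emoji."""
    digest = hashlib.sha256(shared_key).digest()
    nibbles = []
    for byte in digest:
        nibbles.append(byte >> 4)      # high 4 bits
        nibbles.append(byte & 0x0F)    # low 4 bits
    return [EMOJI[n] for n in nibbles[:count]]


# With a MITM there are two separate sessions, each with its own key
# (placeholder byte strings here, standing in for negotiated keys).
key_alice_leg = b"alice-mallory-session-key"
key_bob_leg = b"mallory-bob-session-key"

alice_sees = emoji_code(key_alice_leg)
bob_sees = emoji_code(key_bob_leg)

# Each henchman sits on one leg and holds that leg's key, so they can
# read back exactly the emoji their victim sees.
henchman_on_alice_leg = emoji_code(key_alice_leg)
henchman_on_bob_leg = emoji_code(key_bob_leg)

print("Alice sees:", alice_sees, "| henchman reads:", henchman_on_alice_leg)
print("Bob sees:  ", bob_sees, "| henchman reads:", henchman_on_bob_leg)
```

The check only catches the attacker if Alice's emoji are compared against Bob's; comparing against whoever is on the other end of your own leg always matches.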
Yes, the whole point was how you'd have to deal with the emojis.
I feel it is implied that the latency is low enough (a few hundred ms) to not impede the conversation, and that the parties have talked before and would notice if the tone of the conversation was completely different. Or maybe I'm misunderstanding.
Yeah that seems right. I guess an attacker could tamper with the call early on, say "hey what are your emojis?" as soon as the call starts. Deepfaked voices and 2 henchmen should be sufficient here, no need for real-time audio rewriting. Then once the emojis have been verified, switch to a pure-eavesdrop MITM.
To defend, you could ask to verify emojis at a random point in the middle of the call to make the attacker's life more difficult. Especially right before discussing sensitive information ;-)
Or drip verify over the course of the call, e.g. "what's your 3rd emoji?", and listen for signs of an attacker cutting in and out.
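A rough sketch of what the drip check could look like on your side, assuming you already have your own emoji list on screen. The helper names `pick_challenge` and `check_answer` are made up for illustration.

```python
import secrets


def pick_challenge(num_emoji: int) -> int:
    """Return a 1-based emoji position chosen unpredictably."""
    return secrets.randbelow(num_emoji) + 1


def check_answer(my_emoji: list[str], position: int, spoken: str) -> bool:
    """Compare what the other party said against my own emoji at that position."""
    return my_emoji[position - 1] == spoken


my_emoji = ["🐶", "🌷", "🐙", "🦋"]  # placeholder for the emoji shown on my screen
position = pick_challenge(len(my_emoji))
print(f"Ask: what's your emoji number {position}?")
```

Using `secrets` rather than `random` just keeps the position unpredictable; it does nothing about who is actually answering the question.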
That doesn't help in a hypothetical where the attacker is doing a passive audio-forwarding ("pure eavesdrop") MITM for everything except emoji verification.