And as you said, format isn't enough. You need semantics.
If two applications know enough about the other side to know how to formulate their voice queries, they know at least enough to exchange those same queries as text, and skip the stupidly wasteful text->speech->text process.
(And if world wouldn't be so full of adversarial practices driving engineering stupidity, the developers would agree on an efficient binary format beforehand.)
And as you said, format isn't enough. You need semantics.
If two applications know enough about the other side to know how to formulate their voice queries, they know at least enough to exchange those same queries as text, and skip the stupidly wasteful text->speech->text process.
(And if world wouldn't be so full of adversarial practices driving engineering stupidity, the developers would agree on an efficient binary format beforehand.)