Most audio translation works by first transcribing the audio to a text representation, I think, and then translating that text. If you assume a perfect speech recognizer (for the sake of argument), the end translation should come out the same.
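To make the cascade concrete, here's a minimal sketch of that transcribe-then-translate pipeline. The recognize_speech and translate_text functions are hypothetical placeholders for whatever ASR and MT systems you'd actually plug in, not any particular library's API; the point is just that the translation can only be as good as the transcript it gets.

    # Minimal sketch of a cascaded speech-translation pipeline.
    # recognize_speech() and translate_text() are hypothetical stand-ins
    # for real ASR and MT systems -- not any specific library.

    def recognize_speech(audio: bytes) -> str:
        """Hypothetical ASR step: audio -> source-language text."""
        raise NotImplementedError("plug in a speech recognizer here")

    def translate_text(text: str, src: str = "en", tgt: str = "fr") -> str:
        """Hypothetical MT step: source-language text -> target-language text."""
        raise NotImplementedError("plug in a text translator here")

    def translate_audio(audio: bytes, src: str = "en", tgt: str = "fr") -> str:
        # The cascade: any recognition error propagates into the translation.
        # With a (hypothetically) perfect recognizer, the output is identical
        # to translating a human transcript of the same audio.
        transcript = recognize_speech(audio)
        return translate_text(transcript, src=src, tgt=tgt)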
It's just that you're talking about specifics, while I was talking in generalities. (IMO) unsupervised learning of semantic relationships is a 'big deal'. It seems like a very general method that would extend to any pair of data streams, but that hasn't been proven out yet.
Or am I misunderstanding your point?