That's a shame. I would imagine that a productively similar audio task could be devised around humans helping to transcribe hard-to-auto-transcribe audio recordings, and that it could use a similar strategy to the visual one: a known-good clip and an ambiguous one, requiring the user to transcribe both.
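To make that concrete, here's a rough Python sketch of how the server side of such a two-clip check could work. Everything here is made up for illustration (the transcripts_match helper, the vote threshold, the in-memory vote store); it just borrows the gating idea reCAPTCHA uses for its known/unknown word pairs, it's not anything Google actually ships for audio:

    from collections import defaultdict

    VOTE_THRESHOLD = 3  # made-up number of agreeing answers before we trust a transcript

    # unknown_clip_id -> {candidate transcript -> vote count}
    votes = defaultdict(lambda: defaultdict(int))

    def normalize(text):
        # Crude normalization so capitalisation and extra spaces don't split votes.
        return " ".join(text.lower().split())

    def transcripts_match(answer, reference):
        # Hypothetical matcher; a real system would tolerate small spelling variants.
        return normalize(answer) == normalize(reference)

    def submit_answer(known_clip, known_answer, unknown_clip_id, unknown_answer):
        # The known clip gates pass/fail; the unknown clip's answer is only
        # recorded as a vote and is never used to reject the user.
        if not transcripts_match(known_answer, known_clip["transcript"]):
            return False  # failed the clip we can actually verify
        candidate = normalize(unknown_answer)
        votes[unknown_clip_id][candidate] += 1
        if votes[unknown_clip_id][candidate] >= VOTE_THRESHOLD:
            # Enough independent agreement: treat the transcript as trusted.
            print(f"clip {unknown_clip_id} accepted as: {candidate!r}")
        return True

The useful property is the same one the visual version relies on: the user can't tell which clip is the trusted one, so they have to make an honest attempt at both.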
Well, I'm not a linguist or anything like that, but I'm not sure it would work so well, if it would be useful at all.
1/ Homophony. Knowing how to write a word you hear often requires context. Giving the user two whole sentences to listen to is too long and takes too much time.
'Right, but there's no need for a whole sentence, a few words suffice.' Sure, but if the computer knows where to cut those sentences into a few usable words, it can probably transcribe them itself.
2/ You have to assume good spelling from the user.
3/ Is it useful? I mean, about 100M reCAPTCHAs are solved every day; what fraction of those are audio reCAPTCHAs? 0.0005%? Less? Transcribing a couple of sentences a week makes no sense. Keep in mind that homophony and bad spelling are two factors that hugely increase the number of times the same 'unknown clip' has to go through 'human validation' before we can assume, with a reasonable level of confidence, that it has been transcribed correctly.
Funny 4/: take a look at the [cc] button on some YouTube videos: on-the-fly transcription. Thanks, Google. :)
Oh, and also on-the-fly translation of that on-the-fly transcription, btw.
Google even said they were working on voice-to-voice translation for Google Voice: an English speaker calls a Chinese speaker, the English speech gets transcribed, then translated, then synthesized, and the same happens the other way around. :)
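Which, if you squint, is just three stages glued together. A toy sketch of that pipeline, with every stage stubbed out (the function names and the placeholder returns are mine, not Google's actual system):

    def speech_to_text(audio, lang):
        # Placeholder: a real system would run speech recognition here.
        return f"[transcript of {lang} audio]"

    def translate(text, source_lang, target_lang):
        # Placeholder: a real system would run machine translation here.
        return f"[{text} rendered in {target_lang}]"

    def text_to_speech(text, lang):
        # Placeholder: a real system would synthesize audio here.
        return f"[{lang} audio saying: {text}]"

    def relay(audio, speaker_lang, listener_lang):
        # One direction of the call; the other direction is the same with
        # the two languages swapped.
        transcript = speech_to_text(audio, speaker_lang)
        translated = translate(transcript, speaker_lang, listener_lang)
        return text_to_speech(translated, listener_lang)

    print(relay(b"...", "en", "zh"))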