Consider hand-producing a sample via manual audio editing, to demonstrate the limits of what ought to be possible. Find some audio you think could be listened to at that rate, see how fast you can listen to it using standard sound-stretching techniques (e.g., SoundTouch and similar libraries), and then demonstrate how much better you can do by hand-editing. Worry about how to automate that only after you have demonstrated the possibility and made it compelling.
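As a baseline for "standard sound-stretching techniques," here is a minimal sketch of overlap-add (OLA) time stretching, the simplest member of the family that libraries like SoundTouch refine (they use WSOLA, which additionally searches for well-aligned overlap positions). The function name and parameters are illustrative, not from any particular library.

```python
import numpy as np

def ola_time_stretch(x, rate, frame=1024, hop=256):
    """Naive overlap-add time stretch.

    rate > 1 shortens the audio (faster playback at the same pitch);
    rate < 1 lengthens it. Frames are read at a hop of rate * hop
    and written back at a hop of `hop`, overlapped with a Hann window.
    """
    ana_hop = int(round(hop * rate))   # analysis hop (read spacing)
    syn_hop = hop                      # synthesis hop (write spacing)
    window = np.hanning(frame)
    n_frames = max(1, (len(x) - frame) // ana_hop + 1)
    out_len = (n_frames - 1) * syn_hop + frame
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        a = i * ana_hop                # where we read in the input
        s = i * syn_hop                # where we write in the output
        out[s:s + frame] += window * x[a:a + frame]
        norm[s:s + frame] += window
    norm[norm < 1e-8] = 1.0            # avoid divide-by-zero at the edges
    return out / norm
```

Plain OLA like this produces audible artifacts at high rates because overlapping frames are not phase-aligned; that gap between naive stretching and careful hand-editing is exactly what the proposed experiment would measure.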
On the other hand, suppose we somehow got good training data: a large set of audio samples, all at the same number of words per minute, graded by human listeners as easy or hard to understand. Then in principle something like a neural net might figure out which audio features are responsible for intelligibility and adjust hard-to-understand audio in that direction (à la using convolutional neural nets to render a picture in the style of a famous painter without changing its content). This would happen automatically, without any human actually understanding the solution.
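The two-stage idea above (learn a scorer from graded samples, then push audio uphill on the learned score) can be sketched in miniature. Everything here is hypothetical: the two "acoustic features" are stand-ins for whatever a real model would extract, the scorer is plain logistic regression rather than a neural net, and the uphill step moves feature vectors rather than waveforms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: each clip is summarized by two acoustic features,
# labeled 1 if human listeners graded it "easy to understand", else 0.
n = 400
easy = rng.normal([1.0, 1.0], 0.3, size=(n // 2, 2))
hard = rng.normal([-1.0, -1.0], 0.3, size=(n // 2, 2))
X = np.vstack([easy, hard])
y = np.concatenate([np.ones(n // 2), np.zeros(n // 2)])

# Stage 1: fit a logistic-regression "intelligibility scorer".
w = np.zeros(2)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 1.0 * (X.T @ (p - y) / n)     # gradient descent on log-loss
    b -= 1.0 * np.mean(p - y)

def score(f):
    """Predicted probability that a clip with features f is intelligible."""
    return 1.0 / (1.0 + np.exp(-(f @ w + b)))

# Stage 2: take a "hard" clip and nudge its features in the direction
# that increases the learned score -- the style-transfer-like step.
f = np.array([-1.0, -1.0])
before = score(f)
for _ in range(50):
    f += 0.2 * w                       # move along the scorer's gradient direction
after = score(f)
```

The point of the sketch is the shape of the procedure, not the toy numbers: a real version would backpropagate through a spectrogram-domain model so the "nudge" stays a valid audio signal, which is where all the difficulty lives.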