We haven't yet found a provider that does this properly for our use case, but SpeakerText, Koemei and VoiceBase are examples of companies that offer this functionality.
Unfortunately, SpeakerText doesn't offer pricing for transcripts without post-processing, Koemei has integrated it into their own product, and VoiceBase didn't offer post-processing on request, which we would need for integration into our product.
Those formats don't accommodate per-word timestamps, though, which machine transcription could provide and which I would pay a premium for.
Not yet, but could you send us an example?