Personality is inescapable in a video presentation where either body or voice is present, unless the body and/or voice are computer generated. We are hard-wired to read body language or interpret inflection, even though we may be completely wrong.
If you're relying on the user having headphones, that's a mistaken assumption. There are any number of reasons that a person may not have headphones at the moment or ever. I have my own office, so I don't need headphones, and in general I don't use them outside of my own home because I need to be aware of what's happening around me.
The UI for online video is pretty uniform across platforms. Yes, it sucks. To get a better UI with an actual, functioning scrub wheel, I would have to load it in Premiere or After Effects. However, the problem of chipmunk voice/missing phonemes applies just as much there.
Well-written print documents will generally include a summary near the top that can be skimmed for informational cues in seconds. Furthermore, a standard page can be scanned for the same informational cues to localize the information. You cannot say the same about video.
I can't speak specifically to your Udemy class as I haven't taken it. However, printed course material can be re-read, taking in only the chunks that need to be re-read without having to remember a precise timecode and without having to spend much time in the preceding or following material.
I'm not a big fan of learning from video, but in the online classes I took this year, both on edX and on Coursera, I found audio speedup perfectly functional.