FWIW, it seems to do it just fine in desktop applications using Qt. It leverages all the standard Qt GUI testing stuff (and if you have the money, you can just integrate Squish, which has LLM support now).
That's my experience too. I've had better luck encouraging the LLM to structure the code in "functional core, imperative shell" style, and telling it stupid things like "make sure you can test the code you're writing".
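
For anyone who hasn't run into "functional core, imperative shell" before, here's a rough sketch of the kind of split I mean. Everything in it (`Cart`, `total_cents`, `run`, the `SAVE10` coupon) is made up for illustration and not from any real project; the point is only that the logic sits in a pure function and the I/O sits in a thin wrapper around it.

```python
from dataclasses import dataclass
from typing import Optional, TextIO, Tuple


@dataclass(frozen=True)
class Cart:
    """Hypothetical immutable input to the core."""
    prices_cents: Tuple[int, ...]
    coupon: Optional[str] = None


def total_cents(cart: Cart) -> int:
    """Functional core: pure, no I/O, so it can be unit-tested directly."""
    subtotal = sum(cart.prices_cents)
    if cart.coupon == "SAVE10":  # hypothetical 10% coupon
        subtotal -= subtotal // 10
    return subtotal


def run(inp: TextIO, out: TextIO) -> None:
    """Imperative shell: parses input, calls the core, writes output."""
    prices = tuple(int(tok) for tok in inp.readline().split())
    coupon = inp.readline().strip() or None
    out.write(f"Total: {total_cents(Cart(prices, coupon))} cents\n")


# In a real program the shell would be wired up as:
#   import sys; run(sys.stdin, sys.stdout)
```

The core can be tested with plain asserts and no mocking, and the shell can be exercised by passing in in-memory streams instead of the real stdin/stdout, which is roughly what "make sure you can test the code you're writing" ends up nudging the LLM toward.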