It's more in scenarios where I enter the room and ask the patient whether this is their wife, husband, etc. It's not like I'm going into the room and saying "hello patient, you appear to be a human female". The model has difficulty figuring out who the actors are if there are multiple different people talking. Not a big issue if all you're doing is rewriting information. But if multi-modal context is required, it's not the best.