Could some of the "wrong" answers be the LLM attempting to give an explanation rather than just the answer? E.g., instead of answering 'X', the LLM answers 'The letter is partially hidden by the oval, so cannot be certain, but it appears to be the English letter X'.
The scoring criteria would score this answer as 'T' (presumably by taking the first letter of the response), which is wrong.
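
To illustrate the concern, here is a rough Python sketch. I don't know how the actual scoring works, so both functions are hypothetical: `naive_score` assumes the scorer simply takes the first letter of the response, and `lenient_score` is one possible alternative that looks for a standalone single letter instead.

```python
import re


def naive_score(response: str) -> str:
    """Hypothetical scorer: treat the first alphabetic character as the answer."""
    for ch in response:
        if ch.isalpha():
            return ch.upper()
    return ""


def lenient_score(response: str) -> str:
    """Hypothetical alternative: extract a standalone single letter."""
    stripped = response.strip().strip("'\"")
    # A bare one-letter reply is taken as-is.
    if len(stripped) == 1 and stripped.isalpha():
        return stripped.upper()
    # Otherwise take the last word-bounded single letter in the text.
    matches = re.findall(r"\b([A-Za-z])\b", response)
    return matches[-1].upper() if matches else ""


response = (
    "The letter is partially hidden by the oval, so cannot be certain, "
    "but it appears to be the English letter X"
)

print(naive_score(response))    # first letter of "The"
print(lenient_score(response))  # the standalone letter at the end
```

The "last standalone letter" rule is only a heuristic: a response containing words like "a" or "I" after the real answer would fool it, so a real scorer would likely need something stricter.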