Show HN: Grading Notes for LLM-as-Judge (github.com/shabie)
2 points by shabie 23 days ago | 3 comments
I've created a very simple package implementing the idea introduced in the Databricks blog post (https://www.databricks.com/blog/enhancing-llm-as-a-judge-wit...).

It has proved quite useful for the use cases I've worked on: with grading notes you can leave small hints about the domain concepts that LLMs tend to get wrong, rather than writing out a full reference answer, which takes far more labeling time.
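
To make the idea concrete, here's a rough sketch (this is not the package's actual API; the names and the example note are purely illustrative):

    # Illustrative sketch only -- not the package's real API.
    # A grading note is a short hint about domain details the judge should check,
    # instead of a full reference answer.
    note = ("In our domain, 'active user' means logged in within the last 30 days; "
            "answers that treat it as 'has an account' should fail.")

    def build_judge_prompt(question, candidate_answer, grading_note):
        # Fold the note into an LLM-as-judge prompt.
        return (
            "You are grading an answer to a question.\n"
            f"Question: {question}\n"
            f"Answer: {candidate_answer}\n"
            f"Grading note (domain hints to check against): {grading_note}\n"
            "Respond with PASS or FAIL and a one-line justification."
        )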

I'd like to learn whether this approach, or something similar, has been useful for others too.




Hi! Yes, we use something similar at Revision.ai to analyze students' short, long, and recall-based answers. The context of their study material is taken into account when giving feedback, in more ways than just the "correctness" of their answer.

When you say "leave small details around concepts that LLMs make mistakes on", do you see a way to do this across a variety of source material? Say you have the course lectures and several questions with terse answers. Can grading notes be useful then?


If I understood you correctly: yes, I believe the notes should cover things that LLMs don't understand well, rather than things we know they typically get right. For us these are internal concepts and how people talk about them, and less so programming syntax.

Also, I've been thinking about adding structure to the grading notes so that the variance in quality you get when asking people to write notes becomes smaller. Then again, structure increases the burden... things like "what should the answer have", "what is definitely wrong to mention", etc.
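
Roughly something like this, if it helps to picture it (the field names are just a sketch, not a fixed schema):

    # Sketch of a structured grading note; field names are illustrative guesses.
    from dataclasses import dataclass, field

    @dataclass
    class GradingNote:
        should_have: list[str] = field(default_factory=list)       # "what should the answer have"
        wrong_to_mention: list[str] = field(default_factory=list)  # "what is definitely wrong to mention"
        free_text: str = ""                                         # anything that doesn't fit the lists

    note = GradingNote(
        should_have=["mentions that X is an internal concept, not the public term"],
        wrong_to_mention=["treating X and Y as interchangeable"],
    )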


That makes a lot of sense! I do see value there. I guess it's the context of the class, but biased toward the content that LLMs at test time are bad at understanding/marking (e.g. the items/subjects in MMLU that LLMs still fail at).

As for the structured approach, which interests me: how would you find and note down the information you reference in your second paragraph?

I agree it is valuable. In times past, just identifying wrong answers (with multiple-choice questions / distractors) could give you great insight into misconceptions in a class (e.g. that could fill in the "things definitely wrong to mention").

But how do you, without a human expert, work out "what should the answer have?" in a way that doesn't knock out answers that are left-field but genuine/synthesizing?

My experience of marking exam papers in STEM (only a few courses, mind, maybe 300 papers) is that markers are wildly different. Some consistently grade every pupil about 5% higher than a more critical neighbour; some are "absolutist" about the course content (either as they learned it, or as it is currently taught to the student); and more than would ever admit will freely move scores up 10-20% if a student's answer simply interests them with its novelty, so long as it's not outright provably wrong.

The available tools to remark and to get 1:1 marking feedback (as offered in Germany after exam season) allow rectifying and smoothing these differences, but in some ways they are genuine. I just don't know how you could codify it. Even among human markers there is so much disagreement. I doubt that making three different models mark with the same criteria and averaging the results, or synthesizing a "majority judgement" summary, would work either. It's too hard to make the models care about, or identify, novelty relative to the other answers you've seen that day for the question, if that makes sense.
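
For concreteness, the kind of setup I'm doubting is roughly this (the judge functions are placeholders for whatever each model's API returns):

    # Sketch of "three models mark with the same criteria, take the majority".
    from collections import Counter

    def majority_judgement(question, answer, criteria, judges):
        # judges: callables that each return a verdict such as "PASS" or "FAIL"
        votes = [judge(question, answer, criteria) for judge in judges]
        return Counter(votes).most_common(1)[0][0]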



