I've created a very simple package for the idea introduced in the Databricks blog post (
https://www.databricks.com/blog/enhancing-llm-as-a-judge-wit...).
It has proved quite useful for the use cases I've worked on since: with grading notes you can jot down small details about the domain concepts that LLMs make mistakes on, rather than writing a full reference answer, which takes far more labeling time.
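To make the idea concrete, here's a minimal sketch of what a grading-notes judge prompt could look like. This is purely illustrative and not the package's actual API; all names (`build_judge_prompt`, the PASS/FAIL protocol) are assumptions.

```python
# Sketch of the grading-notes idea: instead of a full reference answer,
# the judge prompt carries terse notes about the domain details an
# answer must get right. Names and prompt wording are illustrative.

def build_judge_prompt(question: str, answer: str, grading_notes: str) -> str:
    """Assemble an LLM-as-a-judge prompt around short grading notes."""
    return (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Answer to grade: {answer}\n"
        "Grading notes (key details the answer must get right):\n"
        f"{grading_notes}\n"
        "Reply PASS if the answer satisfies the notes, otherwise FAIL "
        "with a short reason."
    )

# Hypothetical example: the notes flag details a model often fumbles.
prompt = build_judge_prompt(
    question="When does our SLA credit apply?",
    answer="Whenever uptime drops below 99%.",
    grading_notes=(
        "- Credit applies below 99.9% monthly uptime, not 99%.\n"
        "- Must mention the 30-day claim window."
    ),
)
print(prompt)
```

The labeling win is that each note is one line about a known failure mode, instead of a fully worked gold answer per question.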
I'd like to hear whether this approach, or something similar, has been useful for others too.
When you say "leave small details around concepts that LLMs make mistakes on", do you envision a way to do this across a variety of source material? Say you have the course lectures plus several questions with terse answers. Can grading notes still be useful then?