> Would be interesting to know how much a jj-specific SKILL.md would raise the score.

That is definitely something we're interested in; we will try running this evaluation with skills soon.

> This might not fit the evaluation framework, but I'd still be interested in your experience/setup with terminal-based coding agents like Claude Code.

We have adopted Harbor as our evaluation framework, so evaluating Claude Code is straightforward: https://harborframework.com/docs/agents#installed-agents