> Would be interesting to know how much a jj-specific SKILL.md would raise the score.
That is definitely something we're interested in; we will try running this evaluation with skills soon.
> This might not fit the evaluation framework, but I'd still be interested in your experience/setup with terminal-based coding agents like Claude Code.
We have adopted Harbor as our evaluation framework, so evaluating Claude Code is straightforward: https://harborframework.com/docs/agents#installed-agents