Botwell is an automated framework that evaluates LLM capabilities through a unique approach: models grade each other's responses to challenging prompts.
The concept is based on this post: https://substack.com/inbox/post/157571824, in which models act as both writers and critics.
Here's what makes it different from traditional benchmarks:
1. Peer Evaluation: Models write essays on complex topics, then grade each other's work
2. Cross-Domain Analysis: Tests across multiple domains (described below)
3. Grading Bias Detection: Measures which models grade more strictly or leniently than the consensus (a minimal sketch of one way to compute this follows this list)
4. Comprehensive Boswell Quotient: A 0-100 score that combines performance, evaluation capability, and efficiency (also sketched below)
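To make the bias measurement concrete, here is a minimal sketch of how strictness/leniency could be computed. The data shape, names, and the simple mean-difference approach are my illustration here, not necessarily how the repo implements it:

```python
from statistics import mean

# Illustrative data shape: grades[grader][essay_id] = grade given.
def grading_bias(grades: dict[str, dict[str, float]]) -> dict[str, float]:
    # Consensus grade per essay: mean of all grades it received.
    essays = {e for per_grader in grades.values() for e in per_grader}
    consensus = {e: mean(g[e] for g in grades.values() if e in g) for e in essays}
    # A grader's bias: average of (its grade - consensus) over essays it graded.
    # Positive = lenient, negative = strict.
    return {grader: mean(per_grader[e] - consensus[e] for e in per_grader)
            for grader, per_grader in grades.items()}

grades = {
    "model_a": {"essay1": 85, "essay2": 78},
    "model_b": {"essay1": 90, "essay2": 80},  # grades above consensus -> lenient
    "model_c": {"essay1": 80, "essay2": 74},  # grades below consensus -> strict
}
print(grading_bias(grades))  # model_b comes out positive, model_c negative
```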
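Similarly, a back-of-the-envelope version of how a 0-100 composite score could be assembled; the equal weights and component names are placeholders, not the actual formula:

```python
def boswell_quotient(performance: float, evaluation: float, efficiency: float,
                     weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Blend three components, each assumed pre-normalized to 0-100.

    performance - quality of the model's own essays (e.g., its consensus grade)
    evaluation  - how closely its grading tracks the consensus
    efficiency  - a cost/latency score
    Equal weighting is an assumption; the real formula may differ.
    """
    w_p, w_ev, w_ef = weights
    return w_p * performance + w_ev * evaluation + w_ef * efficiency

print(boswell_quotient(88, 75, 60))  # ~74.3
```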
We've built test domains across three categories:
POLITICAL SCIENCE:
- Level 1: AI policy analysis
- Level 2: Complex AI governance with rigorous grading criteria
COMPUTER SCIENCE:
- Level 1: Algorithm analysis and complexity theory
- Level 2: Distributed system design challenges
PROGRAMMING (New):
- Level 1: Basic algorithms in four languages (FizzBuzz, Palindrome Checker, Binary Search)
- Level 2: Advanced algorithms (N-Queens, Longest Common Subsequence, Dijkstra's Algorithm)
- Level 3: Competitive programming challenges (Segment Trees with Lazy Propagation, Suffix Arrays, Dinic's Algorithm)
The programming domains require implementations in TypeScript, Python, Rust, and C, making them a demanding test of multilingual coding ability and correctness.
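To give a feel for the Level 1 tier, here is the kind of solution a model is asked to produce for the Binary Search task (Python version shown; the same task is posed in TypeScript, Rust, and C):

```python
def binary_search(arr: list[int], target: int) -> int:
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1  # target is in the upper half
        else:
            hi = mid - 1  # target is in the lower half
    return -1

assert binary_search([1, 3, 5, 7, 9, 11], 7) == 3
assert binary_search([1, 3, 5, 7, 9, 11], 4) == -1
```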
The framework is fully automated, generates detailed visualizations, and calculates a unified Boswell Quotient. Results include statistical analysis of grading bias patterns, performance metrics, and cost-efficiency trade-offs.
Our initial findings have been interesting: there is often a significant gap between how well a model writes content and how well it evaluates content written by others. Some models are consistently strong across all domains, while others are strong only in specific areas.
Code and documentation: https://github.com/alanwilhelm/botwell
I'd love to hear feedback or ideas for additional tests that would be valuable to include.