azzarcher's comments

azzarcher · 2026-01-20T13:00:14

I am doing similar experiments for in-browser user testing, where the user is basically Claude. Incredible results with this simple pipeline:

1. Claude tests a feature
2. It notes down all friction and pain points
3. Convert those into a prioritized todo list
4. Use Claude Code or similar to action the todo list

There is some friction between steps 3 and 4, but nothing that can't be solved by running `claude --chrome` via CLI instead of the Chrome extension.
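Step 3 of the pipeline above could be sketched roughly like this. It is only an illustration: the `Friction` type, the 1 to 3 severity scale, and the sample notes are all made up for the example, not something the actual setup is known to use.

```python
from dataclasses import dataclass


@dataclass
class Friction:
    note: str
    severity: int  # hypothetical scale: 1 (minor) to 3 (blocking)


def to_todo(frictions):
    """Turn friction notes into a numbered todo list, most severe first."""
    ordered = sorted(frictions, key=lambda f: f.severity, reverse=True)
    return [f"{i}. [sev {f.severity}] {f.note}" for i, f in enumerate(ordered, 1)]


# Made-up notes standing in for what step 2 would produce.
notes = [
    Friction("Tooltip text is truncated", 1),
    Friction("Submit button does nothing on first click", 3),
    Friction("Form loses input after validation error", 2),
]

for line in to_todo(notes):
    print(line)
```

The resulting list is plain text, so it can be pasted into whatever tool handles step 4.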

simedw · 2026-01-20T13:27:38

It would be neat if it had a headless mode.

azzarcher · on Aug 17, 2023

Don't forget https://benchllm.com/

swyx · on Aug 17, 2023

appreciate it - part of why i put this list up is so that people can add to it lol

azzarcher · on Aug 17, 2023

How is this standing out from https://benchllm.com/?

Eddygandr · on Aug 17, 2023

I really dislike benchllm's use of yamls for test cases - I'd rather it be in code.

"""
input: What's 1+1? Be very terse, only numeric output
expected:
  - 2
  - 2.0
"""
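For comparison, here is roughly what the same test case might look like in plain Python. This is a sketch, not BenchLLM's API: `Case`, `matches`, and `fake_model` are hypothetical names, and the string comparison is just one simple way to check the output.

```python
from dataclasses import dataclass


@dataclass
class Case:
    input: str
    expected: list


def matches(case, output):
    """True if the output equals any expected value, compared as stripped strings."""
    return output.strip() in {str(e) for e in case.expected}


case = Case(
    input="What's 1+1? Be very terse, only numeric output",
    expected=[2, 2.0],
)


def fake_model(prompt):
    # Hypothetical stand-in for a real LLM call.
    return "2"


assert matches(case, fake_model(case.input))
```

Keeping the case in code means it can sit next to ordinary unit tests and reuse the same fixtures and helpers.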

jacky2wong · on Aug 17, 2023

Agreed. No one should ever have to touch YAML for writing unit tests for LLMs. Ever. Most people writing agents and LLM applications are Python developers/data scientists/ML engineers.