Hi HN, I’m sharing an open-source project I’ve been building called Agent Tinman.

It’s a forward-deployed research agent designed to live alongside real AI systems and continuously (one full cycle is sketched in code after this list):

Generate hypotheses about where models may fail

Design and run experiments in:
- LAB (sandboxed)
- SHADOW (mirrored production traffic)
- PRODUCTION (real users, gated)

Classify failures across:
- Reasoning
- Long-context behavior
- Tool use
- Feedback loops
- Deployment & latency

Propose interventions

Simulate those interventions on real traces before deployment

Gate risky changes with optional human approval
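
Roughly, one cycle of that loop looks like the sketch below. The class names, methods, and thresholds here are illustrative shorthand for the flow, not the exact API:

    # Illustrative sketch of one Tinman cycle; names and thresholds
    # are shorthand, not the actual Agent-Tinman API.
    from enum import Enum

    class Mode(Enum):
        LAB = "lab"                # sandboxed, synthetic traffic
        SHADOW = "shadow"          # mirrored production traffic, no user impact
        PRODUCTION = "production"  # real users, gated

    def run_cycle(agent, mode: Mode) -> None:
        # 1. Hypothesize where the model may fail.
        for hypothesis in agent.generate_hypotheses():
            # 2. Run the experiment in the chosen mode.
            result = agent.run_experiment(hypothesis, mode=mode)
            # 3. Classify the failure (reasoning, long-context, tool use,
            #    feedback loops, deployment & latency), or None if it passed.
            failure = agent.classify(result)
            if failure is None:
                continue
            # 4. Propose an intervention, then 5. simulate it on real traces.
            fix = agent.propose_intervention(failure)
            sim = agent.simulate(fix, traces=agent.recent_traces())
            # 6. Gate risky changes: REVIEW/BLOCK tiers or high severity
            #    (treating S3/S4 as high in this sketch) go to a human.
            if sim.risk_tier in ("REVIEW", "BLOCK") or sim.severity >= 3:
                agent.request_approval(fix)
            else:
                agent.apply(fix)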

It’s meant for teams that already run AI in production and want continuous, structured failure discovery, not just offline evals.

It’s:
- Open source (Apache 2.0)
- Python-first
- Designed to integrate as a sidecar via a pipeline adapter (sketched below)
- Built around explicit modes, risk tiers (SAFE / REVIEW / BLOCK), and severity levels (S0–S4)
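
To make the sidecar idea concrete, the integration shape is roughly the following; PipelineAdapter, observe(), and the mode strings are hypothetical names for illustration, not the repo’s actual interface:

    # Hypothetical sidecar wiring; PipelineAdapter, observe(), and the
    # mode strings are illustrative, not the actual adapter interface.
    class PipelineAdapter:
        def __init__(self, pipeline, agent, mode: str = "shadow"):
            self.pipeline = pipeline  # your existing inference callable
            self.agent = agent        # the Tinman sidecar
            self.mode = mode          # "lab" / "shadow" / "production"

        def __call__(self, request):
            response = self.pipeline(request)  # serve the user as usual
            if self.mode == "shadow":
                # Mirror the trace to the sidecar. In a real deployment this
                # should be async / off the hot path; inline here for clarity.
                self.agent.observe(request, response)
            return response

The intent is that the adapter only copies traces out of the serving path; experiments and interventions run in the sidecar, not inline with user requests.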

This is early but functional. I’d really appreciate:
- Skeptical feedback
- Edge cases you think would break this
- Whether this solves a real problem for you or not

Repo: https://github.com/oliveskin/Agent-Tinman

Happy to answer anything technical.

