I've gone through several iterations to improve accuracy for release stability, and I've open sourced the project so that you can contribute and make the dashboard more useful: https://github.com/davideuler/agent-watch
THANK YOU to everyone who gave feedback on this tiny project.
GPT analyzes each issue to decide whether it is negative, and also whether it is related to core features. I iterated on it several times, and the dashboard now seems more reasonable than the initial version. I will open source the project soon so that others can contribute to building a better stability dashboard for the agents we use daily.
Which platform have you found is most hackable? I have Garmin atm and like it but there’s no easy way to pipe my data into my agent or server for offline analysis.
I’ve only really had trouble integrating Withings.
Working with Apple was also challenging because I had to purchase an Apple Watch or iPhone (the data is stored locally only, with no server or API to call, which is great from a privacy perspective) and then deploy specific code on the device.
I’m not sure if this helps your use case, but I was planning to make the API public and create a CLI (similar to Sentry or Grafana’s gcx) to access it. If you want a local-first option, though, it's not the best solution.
Ross, who kindly wrote the first review, was a reviewer of the book before it was published. He is a real person with over 30 years of experience in software development.
A better permissions layer for coding agents. The tool works like auto-mode for Claude Code, so you can stay in the flow and only get prompted to allow or deny tool calls when it truly matters, but it is fully deterministic. My benchmarks surfaced that most Bash calls don’t need an LLM to be classified as safe, ambiguous, or dangerous. A deterministic classifier can auto-allow or block 95% of Bash tool calls as safe or dangerous, with only the remaining 5% being truly ambiguous or unknown.
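To illustrate the idea, here is a minimal sketch of a deterministic Bash classifier. It is not the tool's actual implementation: the rule tables, the `classify_bash` name, and the choice to treat any shell metacharacter as ambiguous are all assumptions made for the example.

```python
import shlex

# Hypothetical rule tables for illustration; a real policy would be much larger.
SAFE_PROGRAMS = {"ls", "pwd", "echo", "whoami", "date"}
DANGEROUS_PROGRAMS = {"mkfs", "dd", "shutdown", "reboot"}
# Pipes, chaining, redirection, and substitution make a command hard to judge
# with simple rules, so this sketch refuses to guess when it sees them.
SHELL_METACHARACTERS = set("|&;<>$`(){}\n")


def has_recursive_flag(args):
    """True if any argument looks like -r, -R, -rf, or --recursive."""
    for arg in args:
        if arg == "--recursive":
            return True
        if arg.startswith("-") and not arg.startswith("--"):
            if "r" in arg or "R" in arg:
                return True
    return False


def classify_bash(command):
    """Return 'safe', 'dangerous', or 'ambiguous' using fixed rules only."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return "ambiguous"
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "ambiguous"  # e.g. unbalanced quotes
    if not tokens:
        return "ambiguous"
    program, args = tokens[0], tokens[1:]
    if program in DANGEROUS_PROGRAMS:
        return "dangerous"
    if program == "rm" and has_recursive_flag(args):
        return "dangerous"
    if program in SAFE_PROGRAMS:
        return "safe"
    return "ambiguous"


if __name__ == "__main__":
    for cmd in ["ls -la", "rm -rf build", "ls; rm -rf /", "make deploy"]:
        print(cmd, "->", classify_bash(cmd))
```

Only the "ambiguous" bucket would need a human prompt or an LLM review; everything else is decided by table lookup, so the same command always gets the same answer.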
My conclusion is that permission reviews with LLMs, like Claude’s auto mode or Codex auto review, are like using a data center to flip a light switch - overkill.
The main benefit is that your agent’s autonomy can be governed deterministically through policies that can be stored at the user and repo level. The bonus is that you save tokens vs using auto modes.
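As a rough sketch of what layered policies could look like, the snippet below merges a user-level and a repo-level policy. The policy format, the action type names, the `resolve` helper, and the "strictest decision wins" rule are all assumptions for illustration, not the tool's documented behavior.

```python
# Hypothetical ordering of decisions, from most to least permissive.
STRICTNESS = {"allow": 0, "ask": 1, "block": 2}


def resolve(action_type, user_policy, repo_policy, default="ask"):
    """Pick the strictest decision any layer gives for this action type.

    Falls back to `default` when neither layer mentions the action type.
    """
    decisions = [
        policy[action_type]
        for policy in (user_policy, repo_policy)
        if action_type in policy
    ]
    if not decisions:
        return default
    return max(decisions, key=lambda decision: STRICTNESS[decision])


if __name__ == "__main__":
    # Hypothetical policies, e.g. loaded from a user config and a repo config.
    user_policy = {"filesystem_read": "allow", "filesystem_delete": "ask"}
    repo_policy = {"filesystem_delete": "block", "network_write": "ask"}
    for action in ["filesystem_read", "filesystem_delete", "db_write"]:
        print(action, "->", resolve(action, user_policy, repo_policy))
```

Because the merge is a pure function of the two policies, the outcome for a given action type is reproducible and can be reviewed in version control like any other config.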
With most OSS releases being MoEs, and modern GPUs optimized for MoEs, can somebody with knowledge of the topic explain or speculate why Mistral might have opted for a dense model?
The advantage of a dense model like this Mistral one is that it is as smart as a much larger MoE model, so it can fit on fewer GPUs. The tradeoff is that it is much slower, since it has to read 100% of its weights for every token, whereas MoE models typically read only about a tenth (though sparsity levels vary).
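A back-of-the-envelope version of that tradeoff, assuming single-stream decoding is limited purely by how fast weights can be read from memory. All numbers here (model sizes, 1 byte per parameter, 2000 GB/s bandwidth, 10% active weights) are made up for illustration and are not the specs of any real model or GPU.

```python
def weights_gb(total_params_billion, bytes_per_param):
    """Memory needed to hold the weights, in GB (1e9 params * bytes = GB)."""
    return total_params_billion * bytes_per_param


def decode_tokens_per_second(total_params_billion, active_fraction,
                             bytes_per_param, bandwidth_gb_per_s):
    """Upper bound on decode speed if weight reads are the only cost."""
    gb_read_per_token = total_params_billion * active_fraction * bytes_per_param
    return bandwidth_gb_per_s / gb_read_per_token


if __name__ == "__main__":
    BANDWIDTH = 2000.0  # GB/s, hypothetical
    BYTES = 1.0         # 8-bit weights, hypothetical

    # Hypothetical dense model: every weight is read for every token.
    dense_mem = weights_gb(100, BYTES)
    dense_tps = decode_tokens_per_second(100, 1.0, BYTES, BANDWIDTH)

    # Hypothetical larger MoE of similar quality: about a tenth read per token.
    moe_mem = weights_gb(400, BYTES)
    moe_tps = decode_tokens_per_second(400, 0.1, BYTES, BANDWIDTH)

    print(f"dense: {dense_mem:.0f} GB of weights, ~{dense_tps:.0f} tok/s bound")
    print(f"MoE:   {moe_mem:.0f} GB of weights, ~{moe_tps:.0f} tok/s bound")
```

Under these assumptions the dense model needs a quarter of the memory but reads 2.5x more bytes per token, which is the smaller-footprint-versus-speed tradeoff described above.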
Agent permission layers are broken. We need a better permissions layer that doesn’t get in the way but stops destructive commands. Devs get pushed into running yolo mode because classifying allow / deny by command is not enough. A sandbox would not have prevented this either.
“nah” is a context-aware permission layer that classifies commands based on what they actually do.
nah exposes a type taxonomy: filesystem_delete, network_write, db_write, etc.
nah inspects Write and Edit content before it hits disk, so destructive patterns like os.unlink, rm -rf, and shell injection get flagged. Executing the result (./evil) classifies as unknown, which resolves to ask, so the LLM can choose to block it or ask you to approve.
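For a feel of how content inspection like this might work, here is a minimal sketch. It is not nah's code: the patterns, the `shell_injection` label, the function names, and the rule that any finding resolves to "ask" are assumptions for the example.

```python
import re

# Hypothetical patterns, keyed by a label for the kind of risk they indicate.
DESTRUCTIVE_PATTERNS = {
    "filesystem_delete": [
        r"\bos\.unlink\s*\(",
        r"\bshutil\.rmtree\s*\(",
        r"\brm\s+-\w*[rR]\w*",  # rm -r, rm -rf, rm -fr, ...
    ],
    "shell_injection": [
        r"\bos\.system\s*\(",
        r"\bshell\s*=\s*True\b",
    ],
}


def inspect_content(content):
    """Return the labels whose patterns appear in the content to be written."""
    findings = []
    for label, patterns in DESTRUCTIVE_PATTERNS.items():
        if any(re.search(pattern, content) for pattern in patterns):
            findings.append(label)
    return findings


def decide_write(content):
    """Assumed policy: flag anything suspicious for review, allow the rest."""
    return "ask" if inspect_content(content) else "allow"


if __name__ == "__main__":
    script = "#!/bin/sh\nrm -rf /tmp/cache\n"
    print(inspect_content(script), decide_write(script))
```

Regex matching like this is cheap and deterministic, but it only catches known patterns, which is consistent with the point that it targets mistakes rather than a hostile agent.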
But yeah, a truly adversarial agent needs a sandbox. It's a different threat model - nah is meant to catch the trusted but mistake-prone coding CLI, not a hostile agent.
Great callout - tool calls can have side effects outside your box. So unless you run a sandbox with no internet access, you aren't ever 100% safe.
nah does guard some of this - reading .env or ~/.aws/credentials gets flagged, and Write/Edit content is inspected for secrets before it leaves the tool.
Docker + filtered mounts + something like nah on top is a solid layered approach that is still practical.