Attaching to the app is impractical for catching regressions in production. LLMs are probabilistic, which means you can have a regression without changing the code or making a new deployment.
A metric to alert on could be task-completion rate, scored by an LLM-as-judge, or synthetic tests run on a schedule. The other metrics you mentioned are then useful for debugging once an alert fires.
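Roughly what I mean, as a minimal sketch combining both ideas (scheduled synthetic tests scored by an LLM judge). `run_task` and `send_alert` are hypothetical hooks into your own app and paging setup, and the judge model/prompt and threshold are just examples:

```python
# Sketch: run synthetic tasks on a schedule, score them with an LLM judge,
# alert when task-completion rate drops below a threshold.
from openai import OpenAI

client = OpenAI()

SYNTHETIC_TASKS = [
    {"prompt": "Book a table for two at 7pm tomorrow.", "expected": "a confirmed reservation"},
    {"prompt": "Cancel my most recent order.", "expected": "the order is cancelled"},
]

def run_task(prompt: str) -> str:
    # Hypothetical hook: call your deployed app/agent and return its response.
    raise NotImplementedError

def send_alert(message: str) -> None:
    # Hypothetical hook: replace with PagerDuty/Slack/whatever you use.
    print("ALERT:", message)

def judge(task: dict, response: str) -> bool:
    # LLM-as-judge: did the response complete the task?
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task['prompt']}\n"
                f"Expected outcome: {task['expected']}\n"
                f"Response: {response}\n"
                "Answer PASS or FAIL only."
            ),
        }],
    )
    return "PASS" in verdict.choices[0].message.content.upper()

def scheduled_check(threshold: float = 0.9) -> None:
    # Run from cron / your scheduler of choice.
    results = [judge(t, run_task(t["prompt"])) for t in SYNTHETIC_TASKS]
    rate = sum(results) / len(results)
    if rate < threshold:
        send_alert(f"Task-completion rate dropped to {rate:.0%}")
```

The point is the alert fires on the outcome metric (completion rate), and only then do you dig into latency, token counts, tool-call errors, etc. to find the cause.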