I would say it's useful for all production-grade applications.

Trainer.diagnose gives you a final eval score across the different dataset splits (train, val, test) and logs all errors, including format errors, so that you can diagnose them manually and decide whether the score is low enough that you need further text-grad optimization.
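
To make the idea concrete, here is a rough plain-Python sketch of what a diagnose pass does conceptually. It is not the library's implementation or API: the `diagnose` function, the `predict` callable, and the report layout are all hypothetical names made up for illustration, and I assume a format error shows up as a `ValueError` raised by the predictor.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input, expected answer)


def diagnose(
    predict: Callable[[str], str],
    splits: Dict[str, List[Example]],
) -> Dict[str, dict]:
    """Score each split and log every failure (hypothetical helper, not a library call)."""
    report: Dict[str, dict] = {}
    for name, examples in splits.items():
        correct = 0
        errors: List[dict] = []
        for question, expected in examples:
            try:
                answer = predict(question)
            except ValueError as exc:
                # Assumption: unparseable model output surfaces as ValueError.
                errors.append({"input": question, "kind": "format", "detail": str(exc)})
                continue
            if answer == expected:
                correct += 1
            else:
                errors.append({
                    "input": question,
                    "kind": "wrong",
                    "detail": f"got {answer!r}, expected {expected!r}",
                })
        # An empty split scores 0.0 rather than dividing by zero.
        score = correct / len(examples) if examples else 0.0
        report[name] = {"score": score, "errors": errors}
    return report
```

You would then read `report["val"]["score"]` and skim `report["val"]["errors"]` by hand to decide whether more optimization is worth it.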

If there is still a big gap between your optimized prompt and the performance of a more advanced model with the same prompt (say gpt4o), you can use our "Learn-to-reason few-shot" to create demonstrations from the advanced model and further close the gap. We have use cases where optimization took performance all the way from 60% to 94% on gpt3.5, while gpt4o reaches 98%.
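
The general idea of building demonstrations from a stronger model can be sketched like this. Again, this is only an illustration and not how "Learn-to-reason few-shot" is actually implemented: `collect_demos`, `render_prompt`, and the `teacher` callable (assumed to return a reasoning string plus an answer) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input, expected answer)


def collect_demos(
    teacher: Callable[[str], Tuple[str, str]],
    train: List[Example],
    max_demos: int = 4,
) -> List[Dict[str, str]]:
    """Keep only teacher outputs whose final answer matches the label."""
    demos: List[Dict[str, str]] = []
    for question, expected in train:
        if len(demos) >= max_demos:
            break
        reasoning, answer = teacher(question)  # assumed (reasoning, answer) pair
        if answer == expected:
            demos.append({"input": question, "reasoning": reasoning, "output": answer})
    return demos


def render_prompt(instruction: str, demos: List[Dict[str, str]], question: str) -> str:
    """Prepend the collected demonstrations to the prompt for the weaker model."""
    parts = [instruction]
    for demo in demos:
        parts.append(
            f"Input: {demo['input']}\nReasoning: {demo['reasoning']}\nOutput: {demo['output']}"
        )
    parts.append(f"Input: {question}\nReasoning:")
    return "\n\n".join(parts)
```

Filtering on correct answers is a design choice in this sketch: it keeps the weaker model from imitating reasoning that led the teacher to a wrong result.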

We will give users some general guidelines.

We are the only library that provides "diagnose" and "debug" features and a clear optimization goal.