For the most part, I don’t do chatbots except for a couple of RAG-based chatbots. It’s more behind-the-scenes stuff like image understanding, categorization, nuanced sentiment analysis, semantic alignment, etc.
I’ve created a framework that lets me run automated quality tests across prompt changes and models, and then compare cost/speed/quality.
The only thing out of all those that still requires humans to judge quality is the RAG results.
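Roughly, the core loop is just: run each (model, prompt) combo over a labeled test set, time the calls, tally token costs, and score the outputs. A minimal sketch in Python, where the model client, prices, and exact-match grading are all stand-ins for illustration:

```python
import time

# Hypothetical per-1M-token prices (input, output); real pricing varies by model.
PRICES = {"model-a": (0.06, 0.24), "model-b": (0.80, 3.20)}

def run_eval(call_model, model_id, prompt_template, cases):
    """Run one (model, prompt) combo over labeled cases; report quality/speed/cost."""
    correct, total_cost, total_secs = 0, 0.0, 0.0
    for case in cases:
        start = time.perf_counter()
        # call_model is whatever client you use (Bedrock, OpenAI, ...);
        # assumed here to return (text, input_tokens, output_tokens).
        text, tokens_in, tokens_out = call_model(
            model_id, prompt_template.format(**case["inputs"])
        )
        total_secs += time.perf_counter() - start
        in_price, out_price = PRICES[model_id]
        total_cost += (tokens_in * in_price + tokens_out * out_price) / 1e6
        correct += int(text.strip() == case["expected"])  # exact-match grading
    n = len(cases)
    return {"accuracy": correct / n, "avg_secs": total_secs / n, "cost": total_cost}
```

Exact-match grading covers the categorization/sentiment-style tasks; RAG answers don’t reduce to a string compare, which is why those still go to a human.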
It depends. Amazon’s Nova Lite gave me the best speed-to-performance tradeoff when I needed really quick real-time inference for categorizing a user’s input (think call centers).
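For reference, a call like that on Bedrock’s Converse API can be as small as the sketch below; the model ID, category list, and prompt are illustrative, not my actual setup (and some regions want the inference-profile ID instead):

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

CATEGORIES = ["billing", "technical_support", "cancellation", "other"]  # example labels

def categorize(utterance: str) -> str:
    """Single-turn, low-latency call asking the model for exactly one label."""
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # may need "us.amazon.nova-lite-v1:0" in your region
        messages=[{
            "role": "user",
            "content": [{"text": (
                f"Classify this call-center utterance into exactly one of "
                f"{CATEGORIES}. Reply with the label only.\n\n{utterance}"
            )}],
        }],
        inferenceConfig={"maxTokens": 10, "temperature": 0},  # short, deterministic reply
    )
    return response["output"]["message"]["content"][0]["text"].strip()
```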
One of Anthropic’s models did the best with image understanding, with Amazon’s Nova Pro slightly behind.
For my tests, I used a customer’s specific set of test data.
For RAG I forget which did best, but it’s much more subjective. I just gave the customer the ability to configure the model and modify the prompt so they could choose.
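In practice that was just a couple of per-customer settings plus a prompt template they could edit; a hypothetical shape (field names and defaults are made up):

```python
from dataclasses import dataclass

@dataclass
class RagSettings:
    """Per-customer RAG knobs; names and defaults here are illustrative."""
    model_id: str = "amazon.nova-pro-v1:0"  # whichever model the customer picks
    prompt_template: str = (
        "Answer using only the context below.\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    top_k: int = 5  # how many retrieved chunks get stuffed into the prompt
```

Since RAG quality is subjective, letting the customer turn the knobs themselves sidesteps arguing over which model “won.”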
Your experience matches mine then... I haven't noticed any clear, consistent differences. I'm always looking for second opinions on this (bc I've gotten fairly cynical). Appreciate it