A friend of mine is a solo developer who is building a large intelligent-actors platform using LLMs. I think his platform is overly abstract and makes a lot of LLM calls. How can one measure the increase in intelligent behavior of this platform versus vanilla GPT-4? I am thinking of some use case that would let him show the strength of his idea without incurring a huge cost.
Edit: while googling I found this paper (https://openreview.net/pdf?id=zAdUB0aCTQ), but I don't know how much it would cost to test the platform with it.