That's true. Even small API or model version updates can shift evaluation behavior. G-Eval helps reduce that variance, but it doesn’t eliminate it completely. I think long-term stability will probably require some combination of fixed reference models and calibration datasets.
I haven’t come across any research showing that a specific LLM consistently outperforms others for this. G-Eval generally works best with strong reasoning models that produce consistent outputs.
To explain briefly: SessionStack collects events from the browser (user interactions, DOM changes, etc.) along with technical information such as crashes, network requests, and debug messages. All of this data is batched in memory and sent to our servers every few seconds. That's the full extent of the client-side work, much like any analytics tool you already use.
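The batching pattern described above can be sketched roughly as follows. This is a hypothetical illustration of the general technique, not SessionStack's actual client code; the names (`EventBatcher`, `TrackedEvent`, the flush interval) are all assumptions.

```typescript
// Illustrative sketch of in-memory event batching on the client.
// All names here are hypothetical; this is not SessionStack's API.

type TrackedEvent = {
  type: string;        // e.g. "click", "dom_mutation", "network", "crash"
  timestamp: number;   // when the event occurred
  payload?: unknown;   // event-specific details
};

class EventBatcher {
  private buffer: TrackedEvent[] = [];

  constructor(
    // In a real client this would POST to a collection endpoint;
    // injecting it keeps the sketch testable.
    private send: (batch: TrackedEvent[]) => void,
    private intervalMs: number = 3000, // "every few seconds"
  ) {}

  // Called by event listeners / observers; just buffers in memory.
  record(event: TrackedEvent): void {
    this.buffer.push(event);
  }

  // Ship the current buffer as one batch and reset it.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch);
  }

  // Periodically flush buffered events to the server.
  start(): ReturnType<typeof setInterval> {
    return setInterval(() => this.flush(), this.intervalMs);
  }
}
```

In a real browser client, `record` would typically be wired to event listeners and a `MutationObserver`, and the send function might use `fetch` or `navigator.sendBeacon`; the key point is simply that events accumulate in memory and go out in periodic batches rather than one request per event.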
The "magic" happens on our end: we reconstruct all of this data, pull in the required static resources, etc., in order to recreate the exact series of events as your users experienced them and present it to you visually. And this happens in real time.
Co-founder of SessionStack here :)
I'd be happy to give you a personal demo, explain how SessionStack works, and show why it won't impact the performance of your product. Can you ping me at alex@sessionstack.com?
Will do, and I was planning on trying out the service at some point (we use Sentry and discovered you through their integrations page), but I figured I'd take this chance to see what kind of experience others are having.